Machine Learning — Gender Identifier with NLTK in less than 15 lines of code

Natural Language Processing (NLP) is a fascinating field for anyone curious about how computers understand human languages, across almost every language spoken today. Using a language in everyday speech is quite different from learning it: learning follows a standard process, yet many people still find it hard to apply what they have learned when speaking a new language day to day. Almost every language has many exceptions and variations, which makes NLP both interesting and challenging. I must confess that NLTK (the Natural Language Toolkit), a module written in Python, was a great place to start, and I would like to thank the founders of and contributors to NLTK for putting it together for newbies like me.

The very first machine learning model I built by following the NLTK documentation is a gender identifier in fewer than 15 lines of code. If you would like to build the same thing, please watch my YouTube video, which has the same title as this post.

So how can you build a gender identifier with NLTK? Here is a summary of the steps I followed in the video mentioned above; you can do the same by following along with the video:

  1. Understand the patterns in your whole data set so that you can build the feature set on which the model will be trained.
  2. According to the NLTK documentation, female names are more likely to end in ‘a’, ‘e’, or ‘i’, whereas male names are more likely to end in ‘k’, ‘o’, ‘r’, ‘s’, or ‘t’. You are free to analyze the data and find your own patterns on which to build the feature set. We now know how to build the feature set, but we have not yet collected the data to extract it from. The NLTK team has done a very good job of providing plenty of data to learn from and build with. One of its best features is its corpora; for gender identification we will use the names corpus, which has two categories: male names and female names. Create a function in your Python program, as in the code below, that extracts the last character of a name. It gives us a feature set based on the assumption that the last character can be a useful attribute for identifying the gender of a name.

def gender_features(word):
    # Use the last letter of the name as the only feature
    return {'lastletter': word[-1]}
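To see what the feature extractor produces, the sketch below calls it on one name and then tallies last letters by gender. The labelled names in `sample` are a small hand-picked, hypothetical list used only for illustration; they are not the NLTK names corpus.

```python
from collections import Counter, defaultdict

def gender_features(word):
    # Same extractor as above: the only feature is the final character.
    return {'lastletter': word[-1]}

# Hypothetical hand-labelled sample, for illustration only.
sample = [('John', 'male'), ('Abhishek', 'male'), ('Peter', 'male'),
          ('Alexa', 'female'), ('Sunita', 'female'), ('Marie', 'female')]

# Count how often each last letter occurs for each gender.
letter_counts = defaultdict(Counter)
for name, gender in sample:
    letter_counts[gender][gender_features(name)['lastletter']] += 1

print(gender_features('Shrek'))
print(dict(letter_counts['female']))
print(dict(letter_counts['male']))
```

Running the same tally over the full names corpus is one way to check the last-letter pattern for yourself before committing to a feature.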

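3. Import the modules that provide the collections of female and male names. In NLTK, a collection of data is called a corpus. If the names corpus is not installed yet, download it once by running nltk.download('names') after import nltk.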

from nltk.corpus import names
from nltk import NaiveBayesClassifier as NBC
from nltk import classify
import random

4. Create a list of (name, gender) tuples for the male names and another for the female names

maleNames = [ (name, 'male') for name in names.words('male.txt') ]
femaleNames = [ (name, 'female') for name in names.words('female.txt') ]

5. Combine the male and female lists of tuples, then shuffle the result

allNames = maleNames + femaleNames
random.shuffle(allNames)

6. Create the train and test data sets so that the machine learning model can be trained and then checked for accuracy against the test data set

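# Convert each (name, gender) pair into a (features, gender) pair
featureData = [ (gender_features(name), gender) for (name, gender) in allNames ]
# Hold out the first 700 shuffled names for testing; train on the rest
test_data = featureData[:700]
train_data = featureData[700:]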
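Because random.shuffle produces a different order on every run, the 700 held-out names (and any accuracy measured on them later) will change from run to run. A minimal sketch of a reproducible split follows. The helper name `split_data`, the seed value, and the toy data are assumptions made for illustration and are not part of the original code.

```python
import random

def split_data(items, test_size, seed=42):
    """Shuffle a copy of items with a fixed seed, then split off the first test_size items."""
    shuffled = list(items)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, so the global random state is unaffected
    return shuffled[:test_size], shuffled[test_size:]

# Toy stand-in for featureData (hypothetical values).
toy = [({'lastletter': c}, 'female' if c in 'aei' else 'male') for c in 'aeikorst']

test_part, train_part = split_data(toy, test_size=2)
print(len(test_part), len(train_part))
```

With the real data, the equivalent call would be split_data(featureData, 700), which returns the test portion first and the training portion second, matching the slicing above.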

7. Train the model

classifier = NBC.train(train_data)
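For intuition about what training does here: with a single feature, a Naive Bayes classifier scores each gender as P(gender) × P(last letter | gender) and picks the gender with the higher score. The sketch below is a simplified pure-Python version of that idea and is not NLTK's implementation. The add-one smoothing, the function names, and the toy training data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_last_letter_nb(labelled_features):
    """Count labels, and last letters per label, from (features, label) pairs."""
    label_counts = Counter()
    letter_counts = defaultdict(Counter)
    for features, label in labelled_features:
        label_counts[label] += 1
        letter_counts[label][features['lastletter']] += 1
    return label_counts, letter_counts

def classify_last_letter(model, features, alphabet_size=26):
    """Return the label maximising P(label) * P(letter | label), with add-one smoothing."""
    label_counts, letter_counts = model
    total = sum(label_counts.values())
    letter = features['lastletter']
    scores = {}
    for label, count in label_counts.items():
        prior = count / total
        # Add-one smoothing (an assumption) keeps unseen letters from scoring zero.
        likelihood = (letter_counts[label][letter] + 1) / (count + alphabet_size)
        scores[label] = prior * likelihood
    return max(scores, key=scores.get)

# Toy training data (hypothetical), in the same shape as train_data.
toy_train = [({'lastletter': 'a'}, 'female'), ({'lastletter': 'a'}, 'female'),
             ({'lastletter': 'e'}, 'female'), ({'lastletter': 'k'}, 'male'),
             ({'lastletter': 'r'}, 'male'), ({'lastletter': 'n'}, 'male')]

model = train_last_letter_nb(toy_train)
print(classify_last_letter(model, {'lastletter': 'a'}))
print(classify_last_letter(model, {'lastletter': 'k'}))
```

In the actual program, NBC.train does the counting and smoothing for you; this sketch only shows why a last letter seen mostly in one category pulls the prediction toward that category.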

8. Predict the gender for any given name

print(classifier.classify(gender_features('John')))
print(classifier.classify(gender_features('Alexa')))
print(classifier.classify(gender_features('Abhishek')))
print(classifier.classify(gender_features('Renu')))
print(classifier.classify(gender_features('Sunita')))

You now have all the code you need to start predicting the gender of a name. One last step remains: measuring the accuracy of the model. We can use test_data to measure accuracy, and then improve it with a concrete methodology or with intuition based on our knowledge of the data. The next part will cover the following topics to arrive at a final gender identification model with improved accuracy:

  1. Accuracy of the model on the test data set
  2. Overfitting and underfitting concepts
  3. Error analysis to enrich the feature set
  4. How to improve accuracy by using error analysis to find patterns
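As a brief preview of the first item: accuracy is the fraction of test examples whose predicted label matches the true label. With the classifier trained above, NLTK computes this via classify.accuracy(classifier, test_data). The sketch below shows the same calculation in plain Python. The function name `accuracy`, the rule-based stand-in classifier, and the toy test data are hypothetical and serve only to make the example self-contained.

```python
def accuracy(classify_fn, labelled_features):
    """Fraction of (features, label) pairs for which classify_fn(features) == label."""
    correct = sum(1 for features, label in labelled_features
                  if classify_fn(features) == label)
    return correct / len(labelled_features)

def rule_based(features):
    # Stand-in classifier (hypothetical): names ending in a, e, or i are guessed female.
    return 'female' if features['lastletter'] in 'aei' else 'male'

# Toy test data (hypothetical), in the same shape as test_data.
toy_test = [({'lastletter': 'a'}, 'female'), ({'lastletter': 'n'}, 'male'),
            ({'lastletter': 'u'}, 'female'), ({'lastletter': 'k'}, 'male')]

print(accuracy(rule_based, toy_test))   # 0.75
```

The one miss in the toy data is the female name ending in ‘u’, the kind of case that error analysis is meant to surface.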

Stay tuned for the next part, and stay healthy while learning new concepts in NLP.

BlockChain Evangelist & Enthusiast with 13 years of experience as Software Test Automation Architect - https://www.linkedin.com/in/abhishek-jain-31a72133/
