Machine Learning Gender Identifier: Improved Accuracy with Error Analysis (Please don’t forget to use proper indentation for any line of code copied from this document)

NLTK — Error Analysis Approach for identifying additional features

In my previous blog, "Machine Learning — Gender Identifier with NLTK in less than 15 lines of code", we saw how to build a machine learning model that identifies the gender for any given name. I ended that blog by mentioning that the following points would be covered in the next part:

Accuracy of Model with Test Dataset: To make any model efficient and accurate, we first need to know how accurate it currently is. To measure that, we split our feature set into two sets, train_data and test_data. We trained our model and predicted the gender of a few random names, and those predictions worked fine. But a handful of spot checks cannot tell us how well the model generalizes, because the set of possible names is far too large to judge from a few examples. This is where test_data comes in: it lets us measure the accuracy of the model on names it was not trained on. From there we can work out how to improve the accuracy if needed. Only the lines below need to be added to the code already written; they call the accuracy method of the classify module from nltk:

# Assumes classify was imported earlier, e.g. "from nltk import classify"
modelAccuracy = classify.accuracy(classifier, test_data)
print(f'Accuracy of the gender detector model is {modelAccuracy}')
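To make the number concrete, here is a minimal sketch of what an accuracy score measures: the fraction of labeled examples for which the predicted label matches the true one. The accuracy helper, the last_letter_rule stand-in classifier and the toy data below are all hypothetical and written only for illustration. This is not NLTK's implementation, just the same idea in plain Python.

```python
def accuracy(predict, labeled_data):
    """Return the fraction of (features, label) pairs that predict() gets right."""
    if not labeled_data:
        return 0.0
    correct = sum(1 for features, label in labeled_data if predict(features) == label)
    return correct / len(labeled_data)


def last_letter_rule(features):
    # Hypothetical stand-in for a trained classifier: guess "female" when the
    # name ends in a vowel, otherwise "male".
    return "female" if features["last_letter"] in "aeiou" else "male"


# Tiny hand-made test set, for illustration only.
toy_test_data = [
    ({"last_letter": "a"}, "female"),
    ({"last_letter": "n"}, "male"),
    ({"last_letter": "e"}, "male"),  # the rule gets this one wrong
    ({"last_letter": "k"}, "male"),
]

toy_accuracy = accuracy(last_letter_rule, toy_test_data)
print(f"Accuracy on the toy test set is {toy_accuracy}")
```

Three of the four toy examples are predicted correctly, so the score is 0.75. The same reading applies to the figure reported for the real model: it is the share of test names whose gender was predicted correctly.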

Overfitting and Underfitting concepts

Our model was giving 77.4% accuracy, which is good, but we can easily improve it, and there are a lot of ways to do so. I will stick to a very basic one: keep iterating using your intuition or knowledge of the domain. In our case, what if I use the first and last 2 characters of the name to build our feature set? When I tried that, the accuracy improved to 80.3%, but adding ever more specific features like this can lead to Overfitting (low bias but high variance): the model stops generalizing and instead memorizes the training data by capturing trends for almost every data point. Overfitting also increases the complexity of our model. At the other extreme, providing too few features and having a very simple model leads to Underfitting (low variance but high bias), meaning the model cannot capture the patterns it needs to generalize to unseen data. These two concepts lead to two problems. First, the model may work fine on the data set it was tuned against but fail when asked to predict completely unknown or new data. Second, accuracy can vary significantly from one data set to another. So we need to make sure we are neither overfitting nor underfitting our model when we choose the features in our feature set. For gender identification, you can now try selecting the first and last 2 characters with the lines of code below in the gender_features method. Your gender_features method would look like this:

def gender_features(word):
    # Keys keep their original names; each value now holds two characters.
    return {'first_letter': word[:2], 'last_letter': word[-2:]}

Now run your code and measure how much accuracy is achieved. You can play with various combinations depending on your research into the domain for which you are trying to build a machine learning model. For gender identification, please read through the existing research to learn how to build your feature set.
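One common way to check for overfitting is to compare accuracy on the training data with accuracy on data the model has never seen. The sketch below is plain Python, not NLTK, and uses made-up names. The train_lookup helper is hypothetical: it builds a simple majority-vote table for whatever feature the key function extracts. Using the whole name as the feature gives a model that only memorizes, while using the last letter gives a more general one.

```python
from collections import Counter


def accuracy(predict, labeled_names):
    """Fraction of (name, gender) pairs that predict() labels correctly."""
    hits = sum(1 for name, gender in labeled_names if predict(name) == gender)
    return hits / len(labeled_names)


def train_lookup(labeled_names, key):
    """Build a predictor that maps key(name) to the most common gender seen
    for that key in training, falling back to the overall majority gender."""
    majority = Counter(g for _, g in labeled_names).most_common(1)[0][0]
    votes = {}
    for name, gender in labeled_names:
        votes.setdefault(key(name), Counter())[gender] += 1
    table = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    return lambda name: table.get(key(name), majority)


# Made-up names, for illustration only.
train = [("Anna", "female"), ("Maria", "female"), ("Julia", "female"),
         ("Emma", "female"), ("Olivia", "female"), ("John", "male"),
         ("Peter", "male"), ("Mark", "male"), ("David", "male")]
held_out = [("Sophia", "female"), ("Laura", "female"),
            ("Kevin", "male"), ("Robert", "male")]

# Whole name as the feature: the model can only memorize what it has seen.
memorizer = train_lookup(train, key=lambda n: n)
# Last letter as the feature: a single, more general signal.
last_letter = train_lookup(train, key=lambda n: n[-1].lower())

for label, model in [("memorizer", memorizer), ("last letter", last_letter)]:
    print(label, accuracy(model, train), accuracy(model, held_out))
```

On this toy data both models score 1.0 on the training names, but the memorizer drops to 0.5 on the held-out names while the last-letter model reaches 0.75. A large gap between training and held-out accuracy is the usual warning sign that a feature set is memorizing rather than generalizing.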

Error-Analysis to enrich FeatureSet:

Let's say we want to follow a standard process to improve accuracy without relying on our domain knowledge or intuition. The easiest approach a developer can think of is to analyze the errors to find their root cause. We can do the same here by analyzing the wrongly predicted names. While analyzing, we ask: is there any common pattern in those wrongly predicted names that can be used to enrich the feature set?

Analyzing wrong predictions in order to enrich the feature set used to train the model is known as the Error-Analysis approach. With NLTK it is easy to collect all the wrongly predicted names in one list and capture them to a file so we can analyze the trends or patterns within them. To do Error Analysis, we need to tweak our data set so that it keeps the names, which makes the results easy to analyze. Firstly, we need to replace the following lines of code (please change the train_data and test_data creation based on the lines below):

train_names = allNames[700:]
test_names = allNames[:700]

train_data = [(gender_features(n), gender) for (n, gender) in train_names]
test_data = [(gender_features(n), gender) for (n, gender) in test_names]

Secondly, we need to store all the wrongly identified names in a list, so the lines of code below can be appended at the end:

erroneousNames = []
for (name, tag) in test_names:
    predictedGender = classifier.classify(gender_features(name))
    if predictedGender != tag:
        erroneousNames.append((tag, predictedGender, name))

for erroneousName in sorted(erroneousNames):
    print(erroneousName)

This will list all the wrongly identified names as tuples of the form (Correct Gender, Predicted Gender, Name). The only task still pending is to go through those wrong data points and collect the patterns to enrich the FeatureSet…
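As one possible way to start looking for patterns, the sketch below groups misclassified names by their last two letters and by their correct gender, so that frequent suffixes stand out. The suffix_error_counts helper is hypothetical, and sample_errors is made-up data in the same (Correct Gender, Predicted Gender, Name) shape as the list built above. It is not output from the real model.

```python
from collections import Counter

# Made-up (correct_gender, predicted_gender, name) tuples, for illustration only.
sample_errors = [
    ("female", "male", "Carmen"),
    ("female", "male", "Karen"),
    ("female", "male", "Ellen"),
    ("male", "female", "Joshua"),
    ("male", "female", "Luca"),
    ("male", "female", "Jesse"),
]


def suffix_error_counts(errors, suffix_length=2):
    """Count misclassified names by (suffix, correct gender)."""
    counts = Counter()
    for correct, _predicted, name in errors:
        counts[(name[-suffix_length:].lower(), correct)] += 1
    return counts


suffix_counts = suffix_error_counts(sample_errors)
for (suffix, correct), count in suffix_counts.most_common():
    print(f"names ending in '{suffix}' that are really {correct}: {count} error(s)")
```

In this toy sample the suffix 'en' accounts for three of the six errors, which would hint that a two-letter suffix feature is worth trying. With the real erroneousNames list the same grouping can be run with other suffix lengths to see which patterns recur.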

I was hoping to finish all the points in this blog, but I think I will need one more blog to cover them in proper detail and conclude with final code whose improved accuracy is acceptable…

In next blog we will cover —

Until next blog, Stay Healthy & Keep Learning :-)



Abhishek Jain

BlockChain Evangelist & Enthusiast with 13 years of experience as Software Test Automation Architect -