SKLearn — How to design a Machine Learning based solution (a quick summary of important steps for beginners)
The Python library SKLearn (short for scikit-learn) has made it very easy for beginners to implement Machine Learning algorithms. As per my experience, implementing machine learning is not that tricky, but there are a few tricky steps that make Machine Learning more interesting & challenging. I am still learning how to become comfortable in those areas and have not fully gotten through them yet. However, over the last one year, I have spent a lot of time understanding the essential steps involved in developing any machine learning based solution. Through this article, I am trying to summarize how it should be done, & I will be more than happy to be criticized for any mistakes I make. Here we go -
Step 1 — Identify a business problem for which a Machine Learning based solution can be built :
The most important task is to identify a business problem that can be solved with Machine Learning algorithms. Machine Learning solves a variety of problems from daily life, e.g. forecasting, pattern recognition, data mining, robotics, expert systems, natural language processing, vision processing, etc. Most of our day-to-day problems, where we take decisions based on our experience, can be handled with Machine Learning. A very good example is predicting stock prices from historical data: we can feed the historical data for any particular stock to a regression algorithm to learn the trends & then predict future prices. Another example is feeding in a lot of X-ray images, covering the period from when a human being was well to when they got a disease, so the algorithm learns how the X-ray images change over time. Once the model has learned, the same model can be used to predict the probability of the same disease occurring by looking at a series of X-rays of another human being. There are numerous use cases where machine learning can be applied.
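To make the stock price idea concrete, here is a minimal sketch, using synthetic prices in place of real historical data and scikit-learn's LinearRegression (the window size and data are my assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic daily closing prices standing in for real historical data
prices = np.cumsum(np.random.randn(200)) + 100

# Use the previous 5 days' prices as features to predict the next day
window = 5
X = np.array([prices[i:i + window] for i in range(len(prices) - window)])
y = prices[window:]

model = LinearRegression()
model.fit(X, y)

# Predict tomorrow's price from the latest 5 days
next_price = model.predict(prices[-window:].reshape(1, -1))
print(next_price)
```

Real stock data is far noisier than this, so such a model only illustrates the workflow, not a trading strategy.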
Step 2 — Data Collection related to the business problem :
There is no single concrete way of collecting data for the identified business problem. There are several ways to collect data, & my preferred ones are as follows -
- Contact an online vendor who can provide relevant data,
- Social media can be utilized to collect publicly available data,
- Web scraping can be used with the help of technical experts (a minimal sketch follows this list),
- Proprietary data can be bought from individuals,
- Individuals can be hired to collect data for their specific region.
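As an illustration of the web scraping option, here is a minimal sketch using the requests and BeautifulSoup libraries; the URL and table structure are hypothetical, and any real scraping should respect the site's terms of use:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical page holding a data table; replace with a real, scrape-permitted URL
url = "https://example.com/data-table"
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Collect the text of every table cell, row by row
rows = []
for tr in soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:
        rows.append(cells)

print(rows[:5])  # first few scraped rows
```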
Depending upon the domain of the business problem, a suitable data collection approach can be chosen. Once you collect data from various sources, you need to normalize it into a common format so the data can be refined further. Also, understanding which data points are needed to solve the identified problem really helps in utilizing already built tools & techniques to refine those data points.
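For example, here is a minimal sketch of bringing two differently formatted sources onto one common schema with Pandas; the file names and column mappings are assumptions for illustration:

```python
import pandas as pd

# Two hypothetical sources with differently named columns
vendor_df = pd.read_csv("vendor_data.csv")    # e.g. columns: Date, ClosePrice
scraped_df = pd.read_csv("scraped_data.csv")  # e.g. columns: day, price

# Rename columns so both sources share one schema
vendor_df = vendor_df.rename(columns={"Date": "date", "ClosePrice": "price"})
scraped_df = scraped_df.rename(columns={"day": "date"})

# Stack them into a single normalized dataset
combined = pd.concat([vendor_df, scraped_df], ignore_index=True)
combined["date"] = pd.to_datetime(combined["date"])
print(combined.head())
```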
Step 3 — Data Analysis & Preprocessing (matplotlib.pyplot) :
To analyze the data, one has to understand not only the domain but also how each data point can impact the accuracy of prediction. Matplotlib & Seaborn are the two most commonly used data visualization libraries for analyzing data trends. These two libraries provide a variety of visualization schemes that help a data scientist understand how data points trend in conjunction with one another. Once it is known which data points can help the algorithm learn more effectively with higher accuracy, the next step is to convert those data points into a form that can be understood by the algorithms.
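As a minimal sketch of this step, the snippet below plots a correlation heatmap with Seaborn on scikit-learn's built-in Iris dataset (the dataset choice is mine for illustration):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris

# Load the built-in Iris dataset as a DataFrame
iris = load_iris(as_frame=True)
df = iris.frame

# Heatmap of pairwise feature correlations: strongly correlated
# features may carry redundant information for the model
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Feature correlations (Iris)")
plt.tight_layout()
plt.show()
```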
Step 4 — Data Preparation for Machine Learning algorithms (sklearn.preprocessing, NumPy & Pandas) :
After finishing data analysis, it is known which data points (in the machine learning world, mostly termed features) are to be used. Then we can start checking the data for missing values, empty fields, outliers, un-encoded categories, corrupt records, etc. Since most algorithms are not intelligent enough to handle such data on their own, those values have to be manipulated in a way that does not change the meaning of the collected & analyzed data. The SKLearn library provides many modules for this, like preprocessing and SimpleImputer; then we have Pandas, which is very heavily used, & how can I forget NumPy. I have started publishing articles on handling various kinds of outliers within analyzed data so that algorithms can be trained on more refined data. One option is to drop rows that contain outliers, but that also throws away the values of other features in those rows, which may carry vital information for the algorithm. I do not prefer dropping data, but there can be situations where we need to drop it, if we are sure that dropping will not have any adverse effect on training the models.
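A minimal sketch of this step, filling missing values with SimpleImputer and scaling features with StandardScaler (the tiny feature matrix is made up for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrix with missing values (np.nan)
X = np.array([[1.0, 200.0],
              [2.0, np.nan],
              [np.nan, 250.0],
              [4.0, 230.0]])

# Replace each missing value with its column mean
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)

# Rescale each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)
print(X_scaled)
```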
Step 5 — Machine Learning algorithm evaluation to observe accuracy and optimization to improve accuracy :
Once we have the data ready, we need to decide which algorithm to choose to achieve the highest accuracy. Machine learning techniques broadly include supervised learning (regression & classification), unsupervised, semi-supervised & reinforcement learning, and SKLearn covers most of these. A few of the most commonly used algorithms are Linear Regression, Logistic Regression, Naïve Bayes, K-Means, K-Nearest Neighbors, Random Forest, Support Vector Machine & Hierarchical Clustering. We can experiment with all these algorithms to see which one proves to work with more accuracy. SKLearn also has modules like metrics, model_selection & feature_selection that provide a lot of helper functions to figure out the accuracy & pointers to increase it. However, all these functions need to be complemented with human intelligence to achieve higher accuracy.
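A minimal sketch of such an experiment, comparing two classifiers by cross-validated accuracy on the built-in Iris dataset (the model choices and settings are mine for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# 5-fold cross-validation gives a more reliable accuracy estimate
# than a single train/test split
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```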
Step 6 — Creation of a Model for production use :
Once our model is optimized and the highest accuracy has been achieved on test data, we need to make the model deployable for production use. To do so, we need to serialize the model written in Python. There can be many more ways to serialize, but I have used pickle & joblib (earlier shipped as part of the sklearn.externals module, now a standalone package) for serialization & deserialization. Once the model is serialized, it can be saved in a file, a database or any medium developers are comfortable with. Whenever it is required for any real-time use, the same serialized object can be deserialized & used.
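A minimal sketch of serializing a trained model to disk with joblib and loading it back (the file name is an assumption):

```python
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a model so there is something worth persisting
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialize the fitted model to disk (hypothetical file name)
joblib.dump(model, "iris_model.joblib")

# Later, in the production process, deserialize and reuse it
restored = joblib.load("iris_model.joblib")
print(restored.predict(X[:3]))
```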
Step 7 — Implementation of the Machine Learning model :
Once the model object is serialized, it can be exposed to end users through a web application, a web service (API) or any user interface in a desktop application. It can also be embedded inside a bigger application if it is required to serve some larger functionality of another application. Depending upon the use of the model, it can be implemented accordingly.
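As one possible illustration of the web service option, here is a minimal sketch using Flask (a framework choice of mine, not prescribed above) that loads the serialized model from Step 6 and serves predictions:

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the model serialized in Step 6 (hypothetical file name)
model = joblib.load("iris_model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect JSON like {"features": [5.1, 3.5, 1.4, 0.2]}
    features = request.get_json()["features"]
    prediction = model.predict([features])
    return jsonify({"prediction": int(prediction[0])})

if __name__ == "__main__":
    app.run(port=5000)
```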
I will soon be publishing working code for all the above steps, based on either a real-world problem or one of the built-in datasets within the SKLearn library of Python.
Stay tuned for the next one with working code… Until then, keep learning :-)