Improving your machine learning pipeline through correlation analysis
Sometimes, you may think there is nothing left to do to improve the results of your machine learning (ML) classifier. Most of the time, however, there is still another step you can take to achieve better results.
We will work with the Adult Census Income dataset, where the goal is to predict whether someone makes more or less than $50,000 a year. We will use a Decision Tree classifier in this example.
In this article, we show how performing a correlation analysis of the dataset features can improve the classification results.
Adult Income Census Dataset
This dataset has 15 features, of which 6 are numerical and 9 are categorical. There are 32,561 samples, and our objective is to predict the high_income feature. The output below (from income.info()) summarizes all the columns in this dataset.
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age 32561 non-null int64
workclass 32561 non-null object
fnlwgt 32561 non-null int64
education 32561 non-null object
education_num 32561 non-null int64
marital_status 32561 non-null object
occupation 32561 non-null object
relationship 32561 non-null object
race 32561 non-null object
sex 32561 non-null object
capital_gain 32561 non-null int64
capital_loss 32561 non-null int64
hours_per_week 32561 non-null int64
native_country 32561 non-null object
high_income 32561 non-null object
dtypes: int64(6), object(9)
Building a basic machine learning classifier involves three steps:
- Get data
- Clean, prepare and manipulate data
- Modeling (train and test) & tuning your model
Now, we will present a basic solution for each of these steps using scikit-learn.
1. Get Data
The first step in any machine learning pipeline is acquiring the data. Usually, it comes as a .csv file containing as much information as possible. To read this file and load it into your program, we will use the pandas library. Assuming your file is named income.csv, the code is:
import pandas as pd

income = pd.read_csv("income.csv")
After running this code, the dataset is stored in a variable called income.
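Before moving on, it is worth checking that the file loaded as expected. A minimal sketch of a few sanity checks (income.info() is the call that produced the column summary shown earlier):
# Quick sanity checks on the loaded DataFrame
print(income.head())   # first five samples
print(income.shape)    # expected: (32561, 15)
income.info()          # column names, dtypes and non-null counts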
2. Clean, prepare and manipulate data
Usually, raw data is messy: there may be duplicate samples, missing or incorrect values, and outliers. These problems can hurt the effectiveness of your machine learning classifier. Cleaning, preparing and manipulating the data is therefore an indispensable step that removes all undesirable content.
Luckily, our .csv file was already cleaned and has no missing or null values, as shown by the output below of the income.isnull().sum() and income.isna().sum() commands.
age 0
workclass 0
fnlwgt 0
education 0
education_num 0
marital_status 0
occupation 0
relationship 0
race 0
sex 0
capital_gain 0
capital_loss 0
hours_per_week 0
native_country 0
high_income 0
dtype: int64
Therefore, there is no need for an additional cleaning step. If that is not the case for your data, there are several ways to deal with the problem, such as eliminating samples that have at least one missing value, replacing a missing value with the mean of the remaining values, or dropping a feature that has too many null or missing values.
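None of these is needed here, but for reference, here is a hedged sketch of the three strategies in pandas (the dropped column name is purely illustrative):
# Strategy 1: drop every sample with at least one missing value
no_missing_rows = income.dropna()

# Strategy 2: replace missing values of a numeric column with its mean
filled = income.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())

# Strategy 3: drop a feature that has too many missing values
# ("sparse_feature" is a hypothetical column name)
reduced = income.drop(columns=["sparse_feature"], errors="ignore")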
As pointed out previously, the dataset has 9 categorical features. Unfortunately, an ML classifier only understands and processes numerical values. So, what are we supposed to do when we find a categorical feature?
It's simple! If the ML program only works with numbers, we have to convert these features into a numerical representation. In this initial basic pipeline, we will map each category of a given categorical feature to a number. For example, the race feature has five possible categories: White, Black, Asian-Pac-Islander, Amer-Indian-Eskimo and Other. Therefore, this feature will take five possible values: 0 (White), 1 (Black), 2 (Asian-Pac-Islander), 3 (Amer-Indian-Eskimo) and 4 (Other).
This is performed using the following lines of code:
# Convert every categorical (object) column into integer category codes
for name in income.select_dtypes("object").columns.to_list():
    col = pd.Categorical(income[name])
    income[name] = col.codes
Now, we are finally ready to model our ML classifier.
PS: this method is not the most effective one; it is only a simple solution adopted to make the ML classifier work. Better solutions involve the OneHotEncoder and pd.get_dummies strategies; consider studying them when building your ML workflow.
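For reference, a minimal sketch of the one-hot alternative, applied to the raw data before the category-code conversion above (pd.get_dummies is the quickest route for a pandas DataFrame, while OneHotEncoder fits naturally inside a scikit-learn Pipeline):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Re-read the raw file so the object columns still hold category strings
income_raw = pd.read_csv("income.csv")

# pandas route: one binary column per category in every object feature
features_onehot = pd.get_dummies(income_raw.drop(columns=["high_income"]))

# scikit-learn route, convenient inside a Pipeline
encoder = OneHotEncoder(handle_unknown="ignore")
race_onehot = encoder.fit_transform(income_raw[["race"]])  # sparse matrix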
3. Modeling (train and test) & tuning your model
Finally, after applying all these pre-processing steps, we are ready to model our ML classifier. First of all, we have to split the data into two subsets: train and test. We train the model using only the train subset and use the test subset to verify how the model performs on unseen samples. This strategy helps avoid bias in the evaluation of our ML model. In scikit-learn, this is done with the train_test_split function.
from sklearn.model_selection import train_test_split

seed = 42  # any fixed integer makes the split reproducible

X_train, X_test, y_train, y_test = train_test_split(
    income.drop(labels="high_income", axis=1),
    income["high_income"],
    test_size=0.20,
    random_state=seed,
    shuffle=True,
    stratify=income["high_income"])
After that, we create our pipeline, specify the classifier, and define the hyperparameters we want to test. For this example, we used the Decision Tree classifier and searched over the gini and entropy splitting criteria.
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

pipe = Pipeline([("classifier", DecisionTreeClassifier())])
search_space = [{"classifier": [DecisionTreeClassifier()],
                 "classifier__criterion": ["gini", "entropy"]}]
Another important step that helps develop a good ML classifier is cross validation (CV). This strategy splits the training set into k smaller sets (hence the name k-fold CV). Then, for each fold, a model is trained using the other k-1 folds as training data, and the resulting model is validated on the remaining fold (i.e., the held-out fold is used as a test set to compute a performance measure such as accuracy).
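To illustrate the idea in isolation (separate from the tuning pipeline below), cross_val_score runs exactly this loop for a single model; a minimal sketch:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 10-fold CV: train on 9 folds, validate on the held-out fold, 10 times
cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(DecisionTreeClassifier(), X_train, y_train,
                         cv=cv, scoring="accuracy")
print(scores.mean(), scores.std())  # average validation accuracy across folds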
We also perform a grid search (GS) for model tuning. This method searches for the hyperparameter combination that yields the most accurate predictions.
We used the KFold() and GridSearchCV() functions to perform cross validation and grid search. Finally, calling grid.fit(X_train, y_train) fits the model. The code below shows how both functions were used in this project.
from sklearn.model_selection import KFold, GridSearchCV

scoring = "accuracy"  # metric used to rank the hyperparameter combinations

# shuffle=True is required for random_state to take effect in KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
grid = GridSearchCV(estimator=pipe,
                    param_grid=search_space,
                    cv=kfold,
                    scoring=scoring,
                    n_jobs=-1)

best_model = grid.fit(X_train, y_train)
This configuration resulted in an accuracy of 0.8146 on the training data and 0.8206 on the test data.
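For completeness, a sketch of how such numbers can be read off the fitted search object (best_score_ and best_params_ are standard GridSearchCV attributes):
from sklearn.metrics import accuracy_score

# Mean cross-validated accuracy of the best hyperparameter combination
print(best_model.best_score_)
print(best_model.best_params_)

# Accuracy on the held-out test set
y_pred = best_model.predict(X_test)
print(accuracy_score(y_test, y_pred))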
Correlation Analysis
Data correlation describes how one set of data corresponds to another. In ML, think of how your features correspond to your target output.
If a feature is not correlated with the target variable, changes in its values do not correspond to changes in the target. Instead of helping the modeling process, such a feature can add noise and harm the classification results.
To perform the correlation analysis, we use Pearson's correlation coefficient. It captures the relationship between two quantities and measures the strength of the association between two variables. Its value ranges from -1 to +1: +1 means a perfect positive correlation, 0 means no correlation, and -1 means a perfect negative correlation (think of it as an inverse proportion). Using the .corr() function, we can compute Pearson's correlation coefficient between high_income and each of the remaining features, as shown below.
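A one-liner along these lines produces the ranking below (all columns are numeric at this point, thanks to the category-code conversion):
# Pearson correlation of every feature with the target, sorted
print(income.corr()["high_income"].sort_values(ascending=False))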
high_income 1.000000
education_num 0.335154
age 0.234037
hours_per_week 0.229689
capital_gain 0.223329
sex 0.215980
capital_loss 0.150526
education 0.079317
occupation 0.075468
race 0.071846
workclass 0.051604
native_country 0.015840
fnlwgt -0.009463
marital_status -0.199307
relationship -0.250918
Name: high_income, dtype: float64
It is important to note that we want to keep features whose Pearson coefficient has a high absolute value. Therefore, we will drop the fnlwgt, native_country, workclass, race, occupation and education features.
list_drop = ["fnlwgt", "native_country", "workclass", "race", "occupation", "education"]
income = income.drop(list_drop, axis=1)
Now, considering only the remaining features, we apply the same modeling and tuning procedure defined in the previous subsection. This process results in an accuracy of 0.8253 on the training data and 0.8284 on the test data, both better than the results obtained without the correlation analysis.
Conclusion
This article presented the basic pipeline needed to create a machine learning classifier. Moreover, we showed how a simple correlation analysis can improve your classification results.