Handling Missing Data Using Python in Machine Learning

This article describes methods for handling missing values in data and shows how to implement them in Python.

Handling missing values is an important step in data preprocessing. Missing values reduce the accuracy and reliability of models, and most machine learning algorithms cannot work with them directly. In any dataset, missing values can occur for various reasons, such as data entry errors, equipment malfunctions, or the inherent nature of the data collection process. Addressing these gaps is essential to preserve the integrity of the dataset and to make it usable for further analysis.

Techniques to handle missing values

There are several techniques to handle the missing values in data, including

  • deleting the feature or rows containing the null values,
  • filling the missing values with the mean or median of the feature,
  • and advanced techniques like machine learning-based imputation.

Each method has its advantages and trade-offs, and the choice of technique often depends on the context of the data and the specific requirements of the analysis.

In this tutorial, we use the diabetes dataset, which contains null values in most of its features. First, we explore the data.

Exploring the data

#importing pandas library
import pandas as pd

df = pd.read_csv('diabetes_data1.csv')
print(df.shape)
df.head()
Output:

Glucose	Diastolic_BP	Skin_Fold	Serum_Insulin	BMI	Diabetes_Pedigree	Age	Class
0	148.0	72.0	35.0	NaN	33.6	0.627	50	1
1	85.0	66.0	29.0	NaN	26.6	0.351	31	0
2	183.0	64.0	NaN	NaN	23.3	0.672	32	1
3	89.0	66.0	23.0	94.0	28.1	0.167	21	0
4	137.0	40.0	35.0	168.0	43.1	2.288	33	1

To get summary information about the data, df.info() is used.

df.info()
Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Glucose            763 non-null    float64
 1   Diastolic_BP       733 non-null    float64
 2   Skin_Fold          541 non-null    float64
 3   Serum_Insulin      394 non-null    float64
 4   BMI                757 non-null    float64
 5   Diabetes_Pedigree  768 non-null    float64
 6   Age                768 non-null    int64  
 7   Class              768 non-null    int64  
dtypes: float64(6), int64(2)
memory usage: 48.1 KB

The above output shows that all the features are of float or int data type; the dataset does not contain any categorical features.

Check missing values

To get the count of missing values in each column, df.isnull().sum() is used.

df.isnull().sum()
Output:

Glucose                5
Diastolic_BP          35
Skin_Fold            227
Serum_Insulin        374
BMI                   11
Diabetes_Pedigree      0
Age                    0
Class                  0
dtype: int64

The above output shows that 5 out of 8 columns contain null values. Different techniques are implemented to handle these values.
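Beyond raw counts, the fraction of missing values per column is often what drives the choice between dropping and imputing. A minimal sketch, using a hypothetical toy frame in place of the diabetes data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the diabetes data
df = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0, 89.0],
    'Serum_Insulin': [np.nan, np.nan, np.nan, 94.0],
})

# isnull().mean() gives the fraction of nulls per column;
# multiply by 100 for a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

On the real dataset, Serum_Insulin (374 of 768 missing, about 49%) and Skin_Fold (about 30%) stand out as candidates for dropping or careful imputation.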

1. Deleting the features with missing values

If most of the values in a column are null, it may be better to delete the whole column. However, removing features with null values is not always good practice, as it discards information that could affect further analysis.

df1 = df.copy()
df1.drop(['Skin_Fold','Serum_Insulin'], axis=1, inplace=True)
df1.head()

Dropping columns is not a viable strategy when almost every column contains missing values.
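Rather than hard-coding the columns to drop, the same idea can be expressed with a completeness threshold via dropna(thresh=...). A minimal sketch, where the 60% cut-off and the toy frame are illustrative choices, not fixed rules:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the diabetes data
df = pd.DataFrame({
    'Glucose': [148.0, 85.0, 183.0, 89.0],
    'Serum_Insulin': [np.nan, np.nan, np.nan, 94.0],
})

# Keep only columns that are non-null in at least 60% of rows
threshold = int(0.6 * len(df))
df_thresh = df.dropna(axis=1, thresh=threshold)
print(df_thresh.columns.tolist())
```

Here Serum_Insulin is only 25% complete, so it falls below the threshold and is dropped, while Glucose is kept.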

2. Deleting the rows with missing values

Removing entire features is not always the best option; instead, we can drop only the rows with null values. From the output of df.isnull().sum(), it is clear that the 'Glucose' and 'BMI' columns contain only a small percentage of missing values, so we can drop those rows.

df2 = df.copy()
df2.dropna(subset=['Glucose','BMI'], axis=0, inplace=True)
df2.shape
Output:

(752, 8)
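Note that dropna with subset only checks the listed columns; rows may still contain nulls elsewhere. A minimal sketch on a hypothetical toy frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the diabetes data
df = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0],
    'BMI': [33.6, 26.6, np.nan],
    'Age': [50, 31, 32],
})

# Drop rows where Glucose or BMI is missing; columns outside the
# subset are not checked
df2 = df.dropna(subset=['Glucose', 'BMI'])
print(df2.shape)  # only the first row survives
```

This is why, in the real dataset, 768 - 752 = 16 rows are removed: the rows with missing Glucose or BMI values.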

3. Filling missing values with mean and median

Filling the missing values with the mean and median of features is better than dropping the entire feature.

  • The median is suitable when the data contains outliers, since it is robust to extreme values.
# Fill missing values with the median of each column
df_filled_median = df.copy()
null_valued_columns = ['Glucose','BMI']

for column in null_valued_columns:
    df_filled_median[column] = df_filled_median[column].fillna(df_filled_median[column].median())
  • The mean is suitable when the data is normally distributed and does not contain outliers.
# Fill missing values with the mean of each column
df_filled_mean = df.copy()
null_valued_columns = ['Glucose','BMI']

for column in null_valued_columns:
    df_filled_mean[column] = df_filled_mean[column].fillna(df_filled_mean[column].mean())

4. Imputing the missing values using the KNN (Machine Learning approach)

In this technique, the K-nearest neighbors (KNN) algorithm is used, and each missing value is imputed using the values of the sample's nearest neighbors. The code is given below:

from sklearn.impute import KNNImputer

# separating the target and other features
y = df['Class']
X = df.drop(['Class'], axis=1)

# KNN imputer instance
model = KNNImputer(n_neighbors=5)

# Performing imputation
imputed_data = model.fit_transform(X)
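One detail worth noting: fit_transform returns a NumPy array, so the column names are lost. A minimal sketch, on a hypothetical toy frame, of rebuilding a DataFrame and verifying that no nulls remain:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy features standing in for the diabetes data
X = pd.DataFrame({
    'Glucose': [148.0, 85.0, np.nan, 89.0, 137.0],
    'BMI': [33.6, 26.6, 23.3, np.nan, 43.1],
})

imputer = KNNImputer(n_neighbors=2)
imputed = imputer.fit_transform(X)  # NumPy array, column names lost

# Rebuild a DataFrame with the original column names
X_imputed = pd.DataFrame(imputed, columns=X.columns)
print(X_imputed.isnull().sum().sum())  # 0 — all nulls imputed
```

Each missing entry is replaced by the average of that feature over the 2 nearest rows, where distance is computed on the available (non-missing) features.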
