This article describes methods for handling missing values in data and their implementation in Python.
Handling missing values is an important step in data preprocessing. Missing values reduce the accuracy and reliability of models, and most machine learning algorithms cannot work with them directly. In any dataset, missing values can occur for various reasons, such as data entry errors, equipment malfunctions, or the inherent nature of the data collection process. Addressing these gaps is essential to preserve the integrity of the dataset and make it usable for further analysis.
Techniques to handle missing values
There are several techniques to handle the missing values in data, including
- deleting the feature or rows containing the null values,
- filling the missing values with the mean or median of the feature,
- and advanced techniques like machine learning-based imputation.
Each method has its advantages and trade-offs, and the choice of technique often depends on the context of the data and the specific requirements of the analysis.
In this tutorial, we use the diabetes dataset, which contains null values in most of its features. First, we explore the data.
Exploring the data
# importing the pandas library
import pandas as pd
df = pd.read_csv('diabetes_data1.csv')
print(df.shape)
df.head()
Output:
Glucose Diastolic_BP Skin_Fold Serum_Insulin BMI Diabetes_Pedigree Age Class
0 148.0 72.0 35.0 NaN 33.6 0.627 50 1
1 85.0 66.0 29.0 NaN 26.6 0.351 31 0
2 183.0 64.0 NaN NaN 23.3 0.672 32 1
3 89.0 66.0 23.0 94.0 28.1 0.167 21 0
4 137.0 40.0 35.0 168.0 43.1 2.288 33 1
To get a summary of the data, df.info() is used.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Glucose 763 non-null float64
1 Diastolic_BP 733 non-null float64
2 Skin_Fold 541 non-null float64
3 Serum_Insulin 394 non-null float64
4 BMI 757 non-null float64
5 Diabetes_Pedigree 768 non-null float64
6 Age 768 non-null int64
7 Class 768 non-null int64
dtypes: float64(6), int64(2)
memory usage: 48.1 KB
The above output shows that all the features are of float or int data type; the dataset does not contain any categorical features.
Check missing values
To count the missing values in each column, df.isnull().sum() is used.
df.isnull().sum()
Output:
Glucose 5
Diastolic_BP 35
Skin_Fold 227
Serum_Insulin 374
BMI 11
Diabetes_Pedigree 0
Age 0
Class 0
dtype: int64
The above output shows that 5 of the 8 columns contain null values. The following techniques can be used to handle these values.
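Absolute counts are easier to judge as percentages of the total rows. As a hedged sketch (using a small made-up frame rather than the diabetes data), the share of missing values per column can be computed like this:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data, not the actual diabetes dataset
df = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0, 89.0],
    'BMI': [33.6, 26.6, np.nan, np.nan],
})

# isnull() gives a boolean frame; the column-wise mean is the missing fraction
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```

On the real dataset, the same two lines applied to the loaded DataFrame show, for example, that Serum_Insulin is almost half missing, which motivates the deletion and imputation choices below.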
1. Deleting the features with missing values
If most values in a column are NULL, it may be better to delete the column entirely. However, removing features with null values is not always good practice, as information is lost, and that loss can affect further analysis.
df1 = df.copy()
df1.drop(['Skin_Fold','Serum_Insulin'], axis=1, inplace=True)
df1.head()
Dropping columns is not recommended when almost every column contains missing values, since too much information would be lost.
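Instead of naming columns to drop by hand, pandas can drop columns by a minimum count of non-missing values via the thresh parameter of dropna. A minimal sketch, using made-up data and an assumed 50% cutoff:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; column 'B' is mostly missing
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0],
    'B': [1.0, np.nan, np.nan, np.nan],
    'C': [1.0, 2.0, np.nan, 4.0],
})

# Keep only columns with at least 50% non-missing values
min_non_null = int(len(df) * 0.5)
df_thresh = df.dropna(axis=1, thresh=min_non_null)
print(df_thresh.columns.tolist())  # ['A', 'C']
```

The cutoff is a judgment call; a column that is 40% missing may still carry a useful signal, so the threshold should be chosen with the analysis in mind.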
2. Deleting the rows with missing values
Removing entire features is not always the best option; instead, we can drop the rows with null values. The output of df.isnull().sum() shows that the 'Glucose' and 'BMI' columns contain only a small percentage of missing values, so we can drop the rows where these are missing.
df2 = df.copy()
df2.dropna(subset=['Glucose','BMI'], axis=0, inplace=True)
df2.shape
Output:
(752, 8)
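Besides listing columns explicitly with subset, rows can also be dropped by a minimum count of non-missing values using the thresh parameter. A hedged sketch on made-up data (not the diabetes dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data
df = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0, 89.0],
    'Skin_Fold': [35.0, np.nan, np.nan, 23.0],
    'BMI': [33.6, 26.6, 23.3, np.nan],
})

# Keep rows that have at least 2 non-missing values
df_rows = df.dropna(axis=0, thresh=2)
print(df_rows.shape)  # (3, 3)
```

This keeps rows that are mostly complete while discarding rows that would contribute little information.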
3. Filling missing values with mean and median
Filling the missing values with the mean or median of a feature preserves more data than dropping the entire feature.
- The median is suitable when there are outliers in the data.
# Fill missing values with the median of each column
df_filled_median = df.copy()
null_valued_columns = ['Glucose', 'BMI']
for column in null_valued_columns:
    df_filled_median[column] = df_filled_median[column].fillna(df_filled_median[column].median())
- Mean is suitable when the data is normally distributed and does not contain outliers.
# Fill missing values with the mean of each column
df_filled_mean = df.copy()
null_valued_columns = ['Glucose', 'BMI']
for column in null_valued_columns:
    df_filled_mean[column] = df_filled_mean[column].fillna(df_filled_mean[column].mean())
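The same per-column fills can be done with scikit-learn's SimpleImputer, which is convenient when imputation needs to be part of a model pipeline. A minimal sketch on a small made-up frame (not the diabetes data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical example values
df = pd.DataFrame({
    'Glucose': [148.0, np.nan, 183.0, 89.0],
    'BMI': [33.6, 26.6, np.nan, 28.1],
})

# strategy can be 'mean', 'median', or 'most_frequent'
imputer = SimpleImputer(strategy='median')
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)
```

An advantage of the imputer object is that the medians learned with fit on the training split can be reapplied to test data with transform, avoiding leakage.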
4. Imputing the missing values using the KNN (Machine Learning approach)
In this technique, the k-nearest neighbors algorithm is used: each missing value is imputed from the values of that sample's nearest neighbors. The code is given below:
from sklearn.impute import KNNImputer
# separating the target and other features
y = df['Class']
X = df.drop(['Class'], axis=1)
# KNN imputer instance
model = KNNImputer(n_neighbors=5)
# Performing imputation
imputed_data = model.fit_transform(X)
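Note that fit_transform returns a NumPy array, so the column names are lost. A hedged sketch (on small made-up data) of wrapping the result back into a DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical stand-in features, not the actual diabetes dataset
X = pd.DataFrame({
    'Glucose': [148.0, 85.0, np.nan, 89.0, 137.0, 116.0],
    'BMI': [33.6, 26.6, 23.3, np.nan, 43.1, 25.6],
})

model = KNNImputer(n_neighbors=2)
imputed = model.fit_transform(X)  # NumPy array, column names lost

# Rebuild a DataFrame so column names and index are preserved
X_imputed = pd.DataFrame(imputed, columns=X.columns, index=X.index)
print(X_imputed.isnull().sum().sum())  # 0
```

Because KNN distances are scale-sensitive, it is usually worth standardizing the features before imputation so that large-valued columns do not dominate the neighbor search.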