HANDLING MISSING VALUES
How can we deal with missing values?
In practice, datasets often contain missing values. There are several ways to handle them. In this article, we use two: removing the rows or columns that contain missing values, and filling the missing entries with a value. The data we’re going to use can be retrieved here.
1st Method — Remove rows/columns
- Load the dataset
import pandas as pd
import numpy as np
df = pd.read_csv('car_sales.csv')
df.shape
The dataset used is the car sales dataset. It consists of 157 rows and 16 columns. The columns are manufacturer, model, sales in thousands, year resale value, vehicle type, price in thousands, engine size, horsepower, wheelbase, width, length, curb weight, fuel capacity, fuel efficiency, latest launch and power performance factor.
df.isnull().sum()
Based on the output in Picture 2, there are 36 missing values in the __year_resale_value column, 2 missing values in the Price_in_thousands column, and so on. Next, we remove every row that contains a missing value.
df_dropna_row = df.dropna(axis=0)
df_dropna_row.info()
Now no row contains a missing value in any column. The dataset has been reduced to 117 rows and 16 columns. Alternatively, we can remove the columns that contain missing values instead of the rows.
df_dropna_col = df.dropna(axis=1)
df_dropna_col.info()
As we can see in Picture 4, the columns with missing values have been removed. After removing them, the dataset has 157 rows and 5 columns.
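To make the difference between the two axes concrete, here is a minimal sketch on a small made-up DataFrame. The `toy` data and its column names are assumptions for illustration only, not part of the car sales dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the car sales dataset).
toy = pd.DataFrame({
    "model": ["A", "B", "C", "D"],
    "price": [21.5, np.nan, 30.0, 18.2],
    "horsepower": [140.0, 225.0, np.nan, 150.0],
})

# Count the missing values in each column.
missing_per_column = toy.isnull().sum()

# axis=0 drops every row that contains at least one missing value.
rows_dropped = toy.dropna(axis=0)

# axis=1 drops every column that contains at least one missing value.
cols_dropped = toy.dropna(axis=1)

print(missing_per_column)
print(rows_dropped.shape)  # (2, 3)
print(cols_dropped.shape)  # (4, 1)
```

Dropping rows keeps all columns but loses the two incomplete records, while dropping columns keeps all records but leaves only the fully populated `model` column. Which trade-off is acceptable depends on how much data each option throws away.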
2nd Method — Filling the missing data with value
In the second method, we do not remove the missing values; instead, we fill them with the mean or the median of each column.
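# Fill numeric columns with their column mean; numeric_only=True skips text columns.
fillna_mean = df.fillna(df.mean(numeric_only=True))
fillna_mean.info()
# Fill numeric columns with their column median instead.
fillna_median = df.fillna(df.median(numeric_only=True))
fillna_median.info()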
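As a small worked example of how the two fill values can differ, the sketch below uses a made-up DataFrame (the `toy` data is an assumption, not taken from the car sales dataset). It passes `numeric_only=True` so that the text column is ignored when the statistics are computed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing price.
toy = pd.DataFrame({
    "model": ["A", "B", "C", "D"],
    "price": [10.0, np.nan, 20.0, 60.0],
})

# The mean and median are computed from the non-missing values only.
mean_filled = toy.fillna(toy.mean(numeric_only=True))
median_filled = toy.fillna(toy.median(numeric_only=True))

print(mean_filled.loc[1, "price"])    # 30.0
print(median_filled.loc[1, "price"])  # 20.0
```

Because the value 60.0 pulls the mean upward, the mean fill (30.0) is larger than the median fill (20.0). For skewed columns, the median is often the more robust choice.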