Pandas Methods to Handle Missing Values in Datasets

Pandas provides numerous functions and methods to clean and preprocess a dataset and make it production-ready.

In this article, we’ll see the methods provided by pandas to handle missing values in a dataset.

df.fillna()

DataFrame.fillna() fills missing values with a value of your choice. Let’s see how we can use it.

import numpy as np

import pandas as pd

data = {"A": [2, np.nan, 19, 34, np.nan],

        "B": [np.nan, 23, 12, 34, np.nan]}

df = pd.DataFrame(data)

print(df)

--------------------

A B

0 2.0 NaN

1 NaN 23.0

2 19.0 12.0

3 34.0 34.0

4 NaN NaN

Filling Arbitrary Value

filled_df = df.fillna(0)

print(filled_df)

--------------------

A B

0 2.0 0.0

1 0.0 23.0

2 19.0 12.0

3 34.0 34.0

4 0.0 0.0

We passed an arbitrary value (0) to fill those NaN values in the dataset df.
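
A quick way to confirm the fill worked is to count the missing values per column before and after. The sketch below rebuilds the same df and uses isna().sum(); the names missing_before and missing_after are ours and do not come from the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, np.nan, 19, 34, np.nan],
                   "B": [np.nan, 23, 12, 34, np.nan]})

# Count NaN values per column before filling
missing_before = df.isna().sum()

# fillna() returns a new DataFrame; df itself is left unchanged
filled_df = df.fillna(0)
missing_after = filled_df.isna().sum()

print(missing_before.to_dict())  # {'A': 2, 'B': 2}
print(missing_after.to_dict())   # {'A': 0, 'B': 0}
```

Note that df still contains its NaN values afterwards, since the filled result lives in filled_df.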

Fill Using a Dataset

We can also use another DataFrame to fill in the missing values.

df2 = pd.DataFrame({"A": [1,2,3,4,5], "B": [6,7,8,9,10]})

fill_using_df = df.fillna(df2)

print(fill_using_df)

--------------------

A B

0 2.0 6.0

1 2.0 23.0

2 19.0 12.0

3 34.0 34.0

4 5.0 10.0

When df2 is passed to fillna(), each NaN in df is replaced by the corresponding value in df2. pandas matches cells by row index and column label, not by physical position; here both DataFrames share the same index (0 to 4) and the same columns (A and B), so the matching cell is also the one in the same position.
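
To see what happens when the filler is laid out differently, here is a minimal sketch. The filler DataFrame is hypothetical: it lists its rows in reverse order and has no row labelled 4.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, np.nan, 19, 34, np.nan],
                   "B": [np.nan, 23, 12, 34, np.nan]})

# Hypothetical filler: rows listed in reverse order, and no row labelled 4
filler = pd.DataFrame({"A": [4, 3, 2, 1], "B": [9, 8, 7, 6]},
                      index=[3, 2, 1, 0])

# Cells are matched by (row label, column label), not by position
aligned = df.fillna(filler)
print(aligned)
```

Row 0 of column B receives 6.0 (the value filler holds at label 0, even though it is the last physical row), and row 4 stays NaN in both columns because filler has nothing at that label.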

Filling Different Values in Each Column

If we want to fill in different values in each column, we can use the following approach.

values = {"A": 100, "B": 200}

diff_val = df.fillna(value=values)

print(diff_val)

--------------------

A B

0 2.0 200.0

1 100.0 23.0

2 19.0 12.0

3 34.0 34.0

4 100.0 200.0

The values dictionary maps each column to its fill value. With df.fillna(value=values), the NaN values in column A are filled with 100 and the NaN values in column B are filled with 200.
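
The dictionary does not have to cover every column. As a small sketch (the name only_a is ours), passing a value for column A alone leaves column B untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, np.nan, 19, 34, np.nan],
                   "B": [np.nan, 23, 12, 34, np.nan]})

# Only column A has an entry, so only column A is filled
only_a = df.fillna(value={"A": 100})

print(only_a["A"].isna().sum())  # 0
print(only_a["B"].isna().sum())  # 2
```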

df.interpolate()

The DataFrame.interpolate() method provides various interpolation techniques to fill in the missing values.

Instead of filling in hard-coded values, we can interpolate: each missing value is estimated from its neighboring values, which usually gives a more realistic result than a constant.

Filling Computed Value

df2 = df.interpolate() # default: linear method and axis=0

print(df2)

--------------------

A B

0 2.0 NaN

1 10.5 23.0

2 19.0 12.0

3 34.0 34.0

4 34.0 34.0

When we call df.interpolate() with no arguments, pandas uses the default linear method, which treats the values as equally spaced and ignores the index.

For example, in column A, the NaN is replaced with 10.5, the midpoint between 2.0 and 19.0 (8.5 away from each).

The fifth row (index 4) in both columns is filled with 34.0, the same value as the row above: there is no later value to interpolate toward, so the last valid value is carried forward. The first row of column B stays NaN because, by default, interpolation only fills in the forward direction and there is no earlier value to start from.
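
How the edges are treated can be adjusted with the limit_direction and limit_area parameters of interpolate(). The following is a minimal sketch on the same data; the names both and inside_only are ours.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, np.nan, 19, 34, np.nan],
                   "B": [np.nan, 23, 12, 34, np.nan]})

# Fill in both directions: the leading NaN in B takes the first valid value
both = df.interpolate(limit_direction="both")

# Fill only NaN values surrounded by valid values; edge NaN values are kept
inside_only = df.interpolate(limit_area="inside")

print(both)
print(inside_only)
```

With limit_direction="both", the first row of column B becomes 23.0. With limit_area="inside", only the 10.5 in column A is filled and the NaN values at the edges remain.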

Filling Nearest Values

data = {"A": [3, np.nan, 2, np.nan, 4],

"B": [1, 4, np.nan, 2, 5]}

df = pd.DataFrame(data)

df3 = df.interpolate(method='nearest') # this method requires SciPy to be installed

print(df3)

--------------------

A B

0 3.0 1.0

1 3.0 4.0

2 2.0 4.0

3 2.0 2.0

4 4.0 5.0

When we use method='nearest', the NaN values are filled with the nearest valid values.

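In this case, the second row of column A (index 1) is filled with 3.0. Why so? The nearest value is chosen by index distance from the NaN. Here index 1 is equally far from index 0 (3.0) and index 2 (2.0), and the tie is resolved toward the lower index, so 3.0 is used. The other NaN values in this example also sit midway between two valid neighbors, so the same rule applies to them.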

Filling Values Considering Index Values

data = {"A": [3, np.nan, 9, np.nan, 4],

"B": [1, 10, np.nan, 20, 5]}

df = pd.DataFrame(data)

df4 = df.interpolate(method='values') # or method='index'

print(df4)

--------------------

A B

0 3.0 1.0

1 6.0 10.0

2 9.0 15.0

3 6.5 20.0

4 4.0 5.0

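With method='values' (or 'index'), the NaN values are interpolated linearly using the actual index values as the x-coordinates. Because this DataFrame has the default index 0 to 4, which is evenly spaced, the result is the same as with the default linear method; the two differ only when the index is unevenly spaced.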
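
The difference becomes visible with an unevenly spaced index. Here is a small sketch using a hypothetical Series whose index labels are 0, 1, and 10:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with an unevenly spaced numeric index
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# Linear: ignores the index, so the gap gets the midpoint
linear = s.interpolate()

# Index: label 1 lies one tenth of the way from 0 to 10
by_index = s.interpolate(method="index")

print(linear.loc[1])    # 5.0
print(by_index.loc[1])  # 1.0
```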

df.ffill() and df.bfill()

DataFrame.ffill() fills each missing value with the last valid value before it, whereas DataFrame.bfill() fills it with the next valid value after it.

Forward Filling With ffill()

data = {"A": [2, np.nan, 19, 34, np.nan],

"B": [np.nan, 23, 12, 34, np.nan]}

df = pd.DataFrame(data)

forward_fill = df.ffill()

print(forward_fill)

--------------------

A B

0 2.0 NaN

1 2.0 23.0

2 19.0 12.0

3 34.0 34.0

4 34.0 34.0

We can see that the NaN values are filled with the preceding valid values. For instance, the second row of column A is filled with 2.0, the value directly above it. The first row of column B stays NaN because there is no earlier value to carry forward.
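
When a column has a long run of missing values, carrying one value across the whole run may not be what we want. The limit parameter caps how many consecutive NaN values are filled. This is a minimal sketch on a hypothetical column named gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical column with a run of three consecutive gaps
gaps = pd.DataFrame({"A": [1.0, np.nan, np.nan, np.nan, 5.0]})

# Carry each valid value forward by at most one row
limited = gaps.ffill(limit=1)
print(limited)
```

Only the first NaN after 1.0 is filled; the next two remain NaN.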

Backward Filling With bfill()

data = {"A": [2, np.nan, 19, 34, np.nan],

"B": [np.nan, 23, 12, 34, np.nan]}

df = pd.DataFrame(data)

backward_fill = df.bfill()

print(backward_fill)

--------------------

A B

0 2.0 23.0

1 19.0 23.0

2 19.0 12.0

3 34.0 34.0

4 NaN NaN

In this case, each NaN is filled with the next valid value. For instance, the first row of column B is filled with 23.0, which is the next value in the column.

We can also see that the fifth row of columns A and B remains unfilled (NaN) due to the absence of the next valid value in the dataset.
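
Since ffill() leaves leading NaN values and bfill() leaves trailing ones, the two can be chained to cover both ends. Here is a short sketch on the same data; the name filled_both is ours.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [2, np.nan, 19, 34, np.nan],
                   "B": [np.nan, 23, 12, 34, np.nan]})

# Forward fill first, then backward fill whatever is still missing at the top
filled_both = df.ffill().bfill()
print(filled_both)
```

Column A becomes 2.0, 2.0, 19.0, 34.0, 34.0 and column B becomes 23.0, 23.0, 12.0, 34.0, 34.0, with no NaN left.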

That’s all for now

Keep Coding✌✌