Pandas df.ffill() and df.bfill() - Handling Missing Values in Dataset

DataFrame.ffill() (forward fill) fills missing (NaN) values by propagating the last valid value forward along a column or row, while DataFrame.bfill() (backward fill) fills them with the next valid value.

Let’s see how and when to use them.
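As a quick orientation before the DataFrame examples, here is a minimal sketch on a three-element Series. The values are made up for illustration; the point is only that the same gap is filled from either side.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one gap between two valid readings
s = pd.Series([1.0, np.nan, 3.0])

forward = s.ffill()   # the gap takes the previous valid value (1.0)
backward = s.bfill()  # the gap takes the next valid value (3.0)

print(forward.tolist())   # [1.0, 1.0, 3.0]
print(backward.tolist())  # [1.0, 3.0, 3.0]
```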

DataFrame.ffill()

The DataFrame.ffill() method fills the missing or NaN values using the previous valid value in a column or row.

Suppose we have the following dataset that contains missing values in each column.

import pandas as pd
import numpy as np

# Temperature readings with gaps (NaN) in every column
df = pd.DataFrame(
    {
        "Max Temperature": [22.7, 24.8, np.nan, np.nan, 29.9, np.nan, 45],
        "Min Temperature": [np.nan, 20.1, 18.4, np.nan, np.nan, np.nan, 15.5],
        "Avg Temperature": [21.4, 23.5, np.nan, 20.3, np.nan, 19.8, 20.4]
    }
)
print(df)

--------------------

   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5              NaN              NaN             19.8
6             45.0             15.5             20.4

Now, we can use the DataFrame.ffill() method to fill in the missing values.

ffill_df = df.ffill()
print(ffill_df)

--------------------

   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2             24.8             18.4             23.5
3             24.8             18.4             20.3
4             29.9             18.4             20.3
5             29.9             18.4             19.8
6             45.0             15.5             20.4

We can see that the missing values are filled with the preceding valid value: 24.8 in the third and fourth rows (index 2 and 3) of the Max Temperature column, and 29.9 in the sixth row (index 5).

In the same manner, all the missing values are filled for the other two (Min Temperature and Avg Temperature) columns.

Notice that the first row of Min Temperature remains NaN, because there is no preceding value to fill it with.
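If a leading NaN has to be filled as well, one common option is to chain a backward fill after the forward fill. This is an addition for illustration, not something the example above does. A minimal sketch using the Min Temperature values from the dataset:

```python
import numpy as np
import pandas as pd

# Min Temperature values from the dataset above
s = pd.Series([np.nan, 20.1, 18.4, np.nan, np.nan, np.nan, 15.5])

filled = s.ffill()
print(filled.isna().sum())  # 1 (only the leading NaN is left)

# Back-fill whatever the forward fill could not reach
fully_filled = s.ffill().bfill()
print(fully_filled.tolist())
# [20.1, 20.1, 18.4, 18.4, 18.4, 18.4, 15.5]
```

Whether borrowing a later value for the first row is acceptable depends on the data, so treat this as one possible choice rather than a default.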

Setting the Limit

We can also cap how many consecutive NaNs are forward filled by specifying the limit parameter.

limit_ffill = df.ffill(limit=1)
print(limit_ffill)

--------------------

   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2             24.8             18.4             23.5
3              NaN             18.4             20.3
4             29.9              NaN             20.3
5             29.9              NaN             19.8
6             45.0             15.5             20.4

With limit=1, only the first NaN in each run of consecutive NaNs is filled, as the Max Temperature and Min Temperature columns show.
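To make the effect of different limits easier to compare, here is a small sketch on a made-up Series with a run of three NaNs. The data is hypothetical and chosen only so that the run is longer than the limit.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three consecutive NaNs between two valid values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

one = s.ffill(limit=1)  # fills only the first NaN of the run
two = s.ffill(limit=2)  # fills the first two NaNs of the run

print(one.isna().sum())  # 2
print(two.isna().sum())  # 1
```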

Limit Area

pandas v2.2.0 added a new parameter called limit_area, which defaults to None. It can also be set to 'inside' or 'outside'.

It works together with the limit parameter and accepts the following values:

None: The default behavior with no restrictions. NaNs are filled with the last valid value, subject to any limit specified.
inside: Fills only those NaNs that are surrounded by valid values.
outside: Fills only those NaNs that lie outside the valid values, that is, before the first or after the last valid value.

Using ffill() with limit=1 and limit_area='inside'

print("Original Dataset")
print(df)
print("*"*55)
in_ffill = df.ffill(limit=1, limit_area='inside')
print(in_ffill)

--------------------

Original Dataset
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5              NaN              NaN             19.8
6             45.0             15.5             20.4
*******************************************************
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2             24.8             18.4             23.5
3              NaN             18.4             20.3
4             29.9              NaN             20.3
5             29.9              NaN             19.8
6             45.0             15.5             20.4

We can see that NaN values surrounded by valid values are filled, and because the limit was set to 1, just one NaN was filled from consecutive NaNs.

Using ffill() with limit=1 and limit_area='outside'

print("Original Dataset")
print(df)
print("*"*55)
out_ffill = df.ffill(limit=1, limit_area='outside')
print(out_ffill)

--------------------

Original Dataset
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5              NaN              NaN             19.8
6             45.0             15.5             20.4
*******************************************************
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5              NaN              NaN             19.8
6             45.0             15.5             20.4

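Here, we can see that the dataset isn’t affected at all. Almost all of the NaN values are surrounded by valid values; the only exception is the leading NaN in the first row of Min Temperature, which a forward fill cannot reach because nothing precedes it. Let’s tweak the dataset and see what changes.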

# Same data, but Max Temperature and Avg Temperature now end with a NaN
df = pd.DataFrame(
    {
        "Max Temperature": [22.7, 24.8, np.nan, np.nan, 29.9, 45, np.nan],
        "Min Temperature": [np.nan, 20.1, 18.4, np.nan, np.nan, np.nan, 15.5],
        "Avg Temperature": [21.4, 23.5, np.nan, 20.3, np.nan, 19.8, np.nan]
    }
)
print("Original Dataset")
print(df)
print("*"*55)
out_ffill = df.ffill(limit=1, limit_area='outside')
print(out_ffill)

If we run this, we’ll get this output.

Original Dataset
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5             45.0              NaN             19.8
6              NaN             15.5              NaN
*******************************************************
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5             45.0              NaN             19.8
6             45.0             15.5             19.8

Notice that in the tweaked dataset (df), we placed NaN values in the seventh row of Max Temperature and Avg Temperature.

When we used df.ffill(limit=1,limit_area='outside'), those trailing NaN values were filled because they come after the last valid value in their columns, so they are not surrounded by valid values.
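To check by hand which NaNs count as inside or outside, their positions can be compared against the first and last valid index of a column. The sketch below does this for the Min Temperature values; the variable names are illustrative, and this is only a way to reason about the categories, not how pandas implements limit_area internally.

```python
import numpy as np
import pandas as pd

# Min Temperature values (the same in the original and tweaked dataset)
s = pd.Series([np.nan, 20.1, 18.4, np.nan, np.nan, np.nan, 15.5])

first = s.first_valid_index()  # 1
last = s.last_valid_index()    # 6

is_nan = s.isna()
# "inside" NaNs sit strictly between the first and last valid values
inside = is_nan & (s.index > first) & (s.index < last)
# every other NaN is leading or trailing, i.e. "outside"
outside = is_nan & ~inside

print(s[inside].index.tolist())   # [3, 4, 5]
print(s[outside].index.tolist())  # [0]
```

The only outside NaN in this column is the leading one at index 0, which a forward fill cannot fill because no value precedes it.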

Filling Missing Values Across the Axis

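By specifying the axis parameter, we can control the direction in which the missing data is filled. The outputs below are for the original dataset from the start of the article, not the tweaked one.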

If the axis is set to 0 or 'index', the missing values will be filled down each column, moving vertically (from above) along the rows. This means the last valid value from above (in the same column) will be used to fill in the NaN values below it.

row_ffill = df.ffill(axis=0)
print(row_ffill)

--------------------

   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2             24.8             18.4             23.5
3             24.8             18.4             20.3
4             29.9             18.4             20.3
5             29.9             18.4             19.8
6             45.0             15.5             20.4

Each column is filled vertically, down the rows, wherever a preceding value exists. This is the same result as calling df.ffill() without arguments, because axis=0 is the default.

If the axis is set to 1 or 'columns', the missing values will be filled across each row, moving horizontally (from left to right) along the columns. This means the last valid value from the left (in the same row) will be used to fill in the NaN values to the right of it.

col_ffill = df.ffill(axis=1)
print(col_ffill)

--------------------

   Max Temperature  Min Temperature  Avg Temperature
0             22.7             22.7             21.4
1             24.8             20.1             23.5
2              NaN             18.4             18.4
3              NaN              NaN             20.3
4             29.9             29.9             29.9
5              NaN              NaN             19.8
6             45.0             15.5             20.4

We can see that the third row of Avg Temperature is filled with 18.4, the value present left of it, and in the same manner, the fifth row of Min Temperature and Avg Temperature is filled with the value 29.9.
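One way to think about axis=1 is as "transpose, fill down, transpose back". The sketch below checks that equivalence on a tiny frame with made-up column names (a, b, c); it assumes all columns share the same float dtype, which is the case for the temperature data as well.

```python
import numpy as np
import pandas as pd

# Hypothetical all-float frame with gaps in every row
small = pd.DataFrame(
    {
        "a": [1.0, np.nan],
        "b": [np.nan, 2.0],
        "c": [np.nan, np.nan],
    }
)

by_axis = small.ffill(axis=1)       # fill left to right within each row
by_transpose = small.T.ffill().T    # fill down the transposed frame

print(by_axis)
# Raises an AssertionError if the two results differ
pd.testing.assert_frame_equal(by_axis, by_transpose)
```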

DataFrame.bfill()

The DataFrame.bfill() method fills the missing or NaN values using the next valid value in a column or row.

import pandas as pd
import numpy as np

# Back to the original dataset
df = pd.DataFrame(
    {
        "Max Temperature": [22.7, 24.8, np.nan, np.nan, 29.9, np.nan, 45],
        "Min Temperature": [np.nan, 20.1, 18.4, np.nan, np.nan, np.nan, 15.5],
        "Avg Temperature": [21.4, 23.5, np.nan, 20.3, np.nan, 19.8, 20.4]
    }
)
print("Original Dataset")
print(df)
print("*"*55)
bfill_df = df.bfill()
print(bfill_df)

Here we rebuild the original dataset (df) and fill its NaNs with bfill() (backward fill). Running the code gives this output.

Original Dataset
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5              NaN              NaN             19.8
6             45.0             15.5             20.4
*******************************************************
   Max Temperature  Min Temperature  Avg Temperature
0             22.7             20.1             21.4
1             24.8             20.1             23.5
2             29.9             18.4             20.3
3             29.9             15.5             20.3
4             29.9             15.5             19.8
5             45.0             15.5             19.8
6             45.0             15.5             20.4

We can see that the NaN values are filled with the next values, such as 45.0 and 29.9 in the Max Temperature column. In the same manner, all the NaN values are filled.

Notice that the NaN in the first row of Min Temperature gets filled this time, because a valid value (20.1) follows it.
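The mirror image of the leading-NaN case for ffill() is a trailing NaN for bfill(): with no later value to borrow, it stays missing. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one leading and one trailing NaN
s = pd.Series([np.nan, 20.1, np.nan])

filled = s.bfill()

# The leading NaN is filled from below; the trailing NaN has nothing after it
print(filled.isna().tolist())  # [False, False, True]
```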

Setting the Limit

The DataFrame.bfill() method also has the limit parameter to cap how many consecutive NaN values are filled backward.

bfill_df = df.bfill(limit=1)
print(bfill_df)

--------------------

   Max Temperature  Min Temperature  Avg Temperature
0             22.7             20.1             21.4
1             24.8             20.1             23.5
2              NaN             18.4             20.3
3             29.9              NaN             20.3
4             29.9              NaN             19.8
5             45.0             15.5             19.8
6             45.0             15.5             20.4

In the above code, the limit parameter is set to 1, so in each run of consecutive NaNs only the one closest to the next valid value was filled.
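Because the two methods work in opposite directions, the same limit fills opposite ends of a run of NaNs. The following sketch, on a made-up Series, puts the two side by side.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three consecutive NaNs between two valid values
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

fwd = s.ffill(limit=1)  # fills the NaN right after 1.0
bwd = s.bfill(limit=1)  # fills the NaN right before 5.0

print(fwd.isna().tolist())  # [False, False, True, True, False]
print(bwd.isna().tolist())  # [False, True, True, False, False]
```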

Limit Area

The limit_area parameter of DataFrame.bfill() works the same way as the limit_area parameter of DataFrame.ffill().

It works together with the limit parameter and accepts the following values:

None: The default behavior with no restrictions. NaNs are filled with the next valid value, subject to any limit specified.
inside: Fills only those NaNs that are surrounded by valid values.
outside: Fills only those NaNs that lie outside the valid values, that is, before the first or after the last valid value.

Using bfill() with limit=1 and limit_area='inside'

print("Original Dataset")
print(df)
print("*"*55)
bfill_df = df.bfill(limit=1, limit_area='inside')
print(bfill_df)

--------------------

Original Dataset
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5              NaN              NaN             19.8
6             45.0             15.5             20.4
*******************************************************
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4             20.3
3             29.9              NaN             20.3
4             29.9              NaN             19.8
5             45.0             15.5             19.8
6             45.0             15.5             20.4

We can see that the NaN surrounded by the valid values gets filled and since the limit was set to 1, just one NaN gets filled when consecutive NaNs were present.

Using bfill() with limit=1 and limit_area='outside'

print("Original Dataset")
print(df)
print("*"*55)
bfill_df = df.bfill(limit=1, limit_area='outside')
print(bfill_df)

--------------------

Original Dataset
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5              NaN              NaN             19.8
6             45.0             15.5             20.4
*******************************************************
   Max Temperature  Min Temperature  Avg Temperature
0             22.7             20.1             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5              NaN              NaN             19.8
6             45.0             15.5             20.4

We can see that the NaN in the first row of the Min Temperature column was filled, because it comes before the first valid value and so isn’t surrounded by valid values.

Filling Missing Values Across the Axis

We can use the axis parameter to fill in missing values along the columns or the rows.

If the axis is set to 0 or 'index', the missing values will be filled up each column, moving vertically (from below) along the rows. This means the next valid value from below (in the same column) will be used to fill in the NaN values above it.

print("Original Dataset")
print(df)
print("*"*55)
bfill_df = df.bfill(axis=0)
print(bfill_df)

--------------------

Original Dataset
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5              NaN              NaN             19.8
6             45.0             15.5             20.4
*******************************************************
   Max Temperature  Min Temperature  Avg Temperature
0             22.7             20.1             21.4
1             24.8             20.1             23.5
2             29.9             18.4             20.3
3             29.9             15.5             20.3
4             29.9             15.5             19.8
5             45.0             15.5             19.8
6             45.0             15.5             20.4

It’s like using the bfill() method without any parameters. The NaN values were filled vertically from below along the rows in each column.

If the axis is set to 1 or 'columns', the missing values will be filled across each row, moving horizontally (from right to left) along the columns. This means the next valid value to the right (in the same row) will be used to fill in the NaN values to the left of it.

print("Original Dataset")
print(df)
print("*"*55)
bfill_df = df.bfill(axis=1)
print(bfill_df)

--------------------

Original Dataset
   Max Temperature  Min Temperature  Avg Temperature
0             22.7              NaN             21.4
1             24.8             20.1             23.5
2              NaN             18.4              NaN
3              NaN              NaN             20.3
4             29.9              NaN              NaN
5              NaN              NaN             19.8
6             45.0             15.5             20.4
*******************************************************
   Max Temperature  Min Temperature  Avg Temperature
0             22.7             21.4             21.4
1             24.8             20.1             23.5
2             18.4             18.4              NaN
3             20.3             20.3             20.3
4             29.9              NaN              NaN
5             19.8             19.8             19.8
6             45.0             15.5             20.4

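We can see that the third row of the Max Temperature column gets filled by the value to its right (18.4), and the fourth and sixth rows of Max Temperature and Min Temperature get filled by the values to their right (20.3 and 19.8). NaNs with no valid value to their right, such as the last two columns of the fifth row, stay unfilled.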

That’s all for now

Keep Coding✌✌