The DataFrame.ffill() method (forward fill) fills missing (NaN) values with the previous valid value in a column or row, while DataFrame.bfill() (backward fill) fills them with the next valid value.
Let’s see how and when to use them.
DataFrame.ffill()
The DataFrame.ffill() method fills missing (NaN) values using the previous valid value in a column or row.
Suppose we have the following dataset that contains missing values in each column.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(
        {
            "Max Temperature": [22.7, 24.8, np.nan, np.nan, 29.9, np.nan, 45],
            "Min Temperature": [np.nan, 20.1, 18.4, np.nan, np.nan, np.nan, 15.5],
            "Avg Temperature": [21.4, 23.5, np.nan, 20.3, np.nan, 19.8, 20.4]
        }
    )
    print(df)
    --------------------
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
Now, we can use the DataFrame.ffill() method to fill in the missing values.
    ffill_df = df.ffill()
    print(ffill_df)
    --------------------
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2             24.8             18.4             23.5
    3             24.8             18.4             20.3
    4             29.9             18.4             20.3
    5             29.9             18.4             19.8
    6             45.0             15.5             20.4
We can see that missing values are filled with the preceding valid value: 24.8 in the third and fourth rows of the Max Temperature column, and 29.9 in the sixth row of the same column.
In the same manner, the missing values in the other two columns (Min Temperature and Avg Temperature) are filled.
Notice that the first row of Min Temperature remains NaN, because there is no preceding value to fill it with.
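If a leading NaN like this one also needs a value, one common option is to follow the forward fill with a backward fill. The snippet below is a minimal sketch of that idea on the Min Temperature values only; the variable names are illustrative and not part of the original example.

```python
import numpy as np
import pandas as pd

# Same values as the "Min Temperature" column in the dataset above.
min_temp = pd.Series([np.nan, 20.1, 18.4, np.nan, np.nan, np.nan, 15.5])

# Forward fill alone leaves the leading NaN untouched,
# because nothing valid comes before it.
forward_only = min_temp.ffill()

# A follow-up backward fill closes the remaining leading gap
# with the first valid value (20.1).
forward_then_backward = min_temp.ffill().bfill()

print(forward_only.isna().sum())
print(forward_then_backward.tolist())
```

Whether borrowing a later value for the first row is acceptable depends on the data, so treat this as one possible approach rather than a rule.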
Setting the Limit
We can also limit how many consecutive NaNs are forward filled by specifying the limit parameter.
    limit_ffill = df.ffill(limit=1)
    print(limit_ffill)
    --------------------
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2             24.8             18.4             23.5
    3              NaN             18.4             20.3
    4             29.9              NaN             20.3
    5             29.9              NaN             19.8
    6             45.0             15.5             20.4
With limit=1, only the first NaN in each run of consecutive NaNs is filled, as the Max Temperature and Min Temperature columns show.
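The limit applies to each run of consecutive NaNs separately, not to the column as a whole. The following sketch uses a small made-up Series (not the temperature dataset) to show a three-NaN gap and a one-NaN gap filled with limit=2.

```python
import numpy as np
import pandas as pd

# Hypothetical values: a gap of three NaNs followed by a gap of one NaN.
s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan, 7.0])

# At most two consecutive NaNs are filled per gap.
filled = s.ffill(limit=2)

# The third NaN of the long gap stays missing; the short gap is filled fully.
print(filled)
```

The long gap keeps one NaN, while the short gap is filled completely because it never reaches the limit.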
Limit Area
pandas v2.2.0 added a new parameter called limit_area. It defaults to None and can also be set to 'inside' or 'outside'.
It can be combined with the limit parameter and behaves as follows:
None: The default behavior with no restrictions. NaNs are filled with the last valid value, subject to any limit specified.
'inside': Fills only those NaNs that are surrounded by valid values (a valid value exists both before and after them).
'outside': Fills only those NaNs that are not surrounded by valid values.
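To make "inside" and "outside" concrete, the sketch below builds the two regions by hand with boolean masks on a small made-up Series. It is only an illustration of the idea and not how pandas implements limit_area; the names seen_before, seen_after, inside, and outside are assumptions introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one leading NaN, two inner NaNs, one trailing NaN.
s = pd.Series([np.nan, 20.1, np.nan, np.nan, 15.5, np.nan])

valid = s.notna()
# True from the first valid value onward.
seen_before = valid.cumsum() > 0
# True up to and including the last valid value.
seen_after = valid[::-1].cumsum()[::-1] > 0

# "Inside" NaNs have a valid value on both sides; the rest are "outside".
inside = s.isna() & seen_before & seen_after
outside = s.isna() & ~(seen_before & seen_after)

ffilled = s.ffill()
# Take the forward-filled value only where the mask is True.
inside_only = s.where(~inside, ffilled)
outside_only = s.where(~outside, ffilled)

print(inside_only.tolist())
print(outside_only.tolist())
```

On pandas 2.2 or later, s.ffill(limit_area='inside') and s.ffill(limit_area='outside') should give the same two results. Note that the leading NaN stays missing in both cases, since a forward fill has nothing to copy from.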
Using ffill() with limit=1 and limit_area='inside'
    print("Original Dataset")
    print(df)
    print("*"*55)
    in_ffill = df.ffill(limit=1,limit_area='inside')
    print(in_ffill)
    --------------------
    Original Dataset
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
    *******************************************************
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2             24.8             18.4             23.5
    3              NaN             18.4             20.3
    4             29.9              NaN             20.3
    5             29.9              NaN             19.8
    6             45.0             15.5             20.4
We can see that NaN values surrounded by valid values are filled, and because the limit was set to 1, only one NaN was filled in each run of consecutive NaNs.
Using ffill() with limit=1 and limit_area='outside'
    print("Original Dataset")
    print(df)
    print("*"*55)
    out_ffill = df.ffill(limit=1,limit_area='outside')
    print(out_ffill)
    --------------------
    Original Dataset
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
    *******************************************************
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
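Here, we can see that the dataset isn’t affected at all. Every NaN except the one in the first row of Min Temperature is surrounded by valid values, and that leading NaN has no preceding value to forward fill from. Let’s tweak the dataset and see how that changes the result.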
    df = pd.DataFrame(
        {
            "Max Temperature": [22.7, 24.8, np.nan, np.nan, 29.9, 45, np.nan],
            "Min Temperature": [np.nan, 20.1, 18.4, np.nan, np.nan, np.nan, 15.5],
            "Avg Temperature": [21.4, 23.5, np.nan, 20.3, np.nan, 19.8, np.nan]
        }
    )
    print("Original Dataset")
    print(df)
    print("*"*55)
    out_ffill = df.ffill(limit=1,limit_area='outside')
    print(out_ffill)
If we run this, we’ll get this output.
    Original Dataset
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5             45.0              NaN             19.8
    6              NaN             15.5              NaN
    *******************************************************
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5             45.0              NaN             19.8
    6             45.0             15.5             19.8
Notice that in the tweaked dataset (df), we placed NaN values in the seventh row of Max Temperature and Avg Temperature.
When we used df.ffill(limit=1,limit_area='outside'), these trailing NaN values were filled because they were not surrounded by valid values.
Filling Missing Values Across the Axis
By specifying the axis parameter, we can control the direction in which missing data is filled.
If the axis is set to 0 or 'index', the missing values are filled down each column, moving vertically (from above) along the rows. This means the last valid value from above (in the same column) is used to fill in the NaN values below it.
    # df here is the original dataset from the start of the article,
    # not the tweaked one used in the limit_area='outside' example.
    row_ffill = df.ffill(axis=0)
    print(row_ffill)
    --------------------
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2             24.8             18.4             23.5
    3             24.8             18.4             20.3
    4             29.9             18.4             20.3
    5             29.9             18.4             19.8
    6             45.0             15.5             20.4
Every NaN value that has a valid value above it is filled, column by column. This is the default behavior, so the result matches df.ffill() with no arguments.
If the axis is set to 1 or 'columns', the missing values are filled across each row, moving horizontally (from left to right) along the columns. This means the last valid value from the left (in the same row) is used to fill in the NaN values to the right of it.
    col_ffill = df.ffill(axis=1)
    print(col_ffill)
    --------------------
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7             22.7             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4             18.4
    3              NaN              NaN             20.3
    4             29.9             29.9             29.9
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
We can see that the third row of Avg Temperature is filled with 18.4, the value to its left, and in the same manner, the fifth row of Min Temperature and Avg Temperature are filled with the value 29.9.
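One way to think about axis=1 is as an ordinary column-wise fill applied to the transposed table. The sketch below checks that equivalence on the original dataset; it is an illustration for an all-float DataFrame like this one, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# The original dataset from the start of the article.
df = pd.DataFrame(
    {
        "Max Temperature": [22.7, 24.8, np.nan, np.nan, 29.9, np.nan, 45],
        "Min Temperature": [np.nan, 20.1, 18.4, np.nan, np.nan, np.nan, 15.5],
        "Avg Temperature": [21.4, 23.5, np.nan, 20.3, np.nan, 19.8, 20.4],
    }
)

# Fill across each row, left to right.
across_rows = df.ffill(axis=1)

# Transpose, fill down each column, then transpose back.
via_transpose = df.T.ffill().T

# Raises an AssertionError if the two results differ.
pd.testing.assert_frame_equal(across_rows, via_transpose)
print(across_rows)
```

Filling across columns only makes sense when neighboring columns hold comparable quantities, as the three temperature columns do here.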
DataFrame.bfill()
The DataFrame.bfill() method fills missing (NaN) values using the next valid value in a column or row.
    import pandas as pd
    import numpy as np

    df = pd.DataFrame(
        {
            "Max Temperature": [22.7, 24.8, np.nan, np.nan, 29.9, np.nan, 45],
            "Min Temperature": [np.nan, 20.1, 18.4, np.nan, np.nan, np.nan, 15.5],
            "Avg Temperature": [21.4, 23.5, np.nan, 20.3, np.nan, 19.8, 20.4]
        }
    )
    print("Original Dataset")
    print(df)
    print("*"*55)
    bfill_df = df.bfill()
    print(bfill_df)
We have a dataset (df) and we fill its NaNs using bfill() (backward fill).
    Original Dataset
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
    *******************************************************
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7             20.1             21.4
    1             24.8             20.1             23.5
    2             29.9             18.4             20.3
    3             29.9             15.5             20.3
    4             29.9             15.5             19.8
    5             45.0             15.5             19.8
    6             45.0             15.5             20.4
We can see that the NaN values are filled with the next valid value, such as 29.9 and 45.0 in the Max Temperature column. In the same manner, all the NaN values are filled.
Notice that the NaN in the first row of Min Temperature gets filled as well, because a valid value (20.1) follows it.
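The mirror image of the forward-fill case also holds: a backward fill cannot fill a trailing NaN, because no valid value comes after it. Here is a small sketch using a column that ends with a NaN (the values match the tweaked Avg Temperature column used earlier).

```python
import numpy as np
import pandas as pd

# A column whose last value is missing.
avg_temp = pd.Series([21.4, 23.5, np.nan, 20.3, np.nan, 19.8, np.nan])

# Inner gaps take the next valid value; the trailing NaN has none to take.
backward = avg_temp.bfill()

print(backward)
```

The last row remains NaN after the backward fill, while every other gap is closed.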
Setting the Limit
The DataFrame.bfill() method also has the limit parameter, which limits how many consecutive NaN values are filled backward.
    bfill_df = df.bfill(limit=1)
    print(bfill_df)
    --------------------
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7             20.1             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4             20.3
    3             29.9              NaN             20.3
    4             29.9              NaN             19.8
    5             45.0             15.5             19.8
    6             45.0             15.5             20.4
In the above code, the limit parameter is set to 1, so only one NaN value (the one closest to the next valid value) is filled in each run of consecutive NaNs.
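With a limit, the direction of the fill decides which end of a gap gets filled. The sketch below compares ffill(limit=1) and bfill(limit=1) on the same made-up gap of three NaNs; the Series is illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical gap of three NaNs between two valid values.
gap = pd.Series([24.8, np.nan, np.nan, np.nan, 29.9])

# Forward fill starts from the value before the gap.
forward = gap.ffill(limit=1)
# Backward fill starts from the value after the gap.
backward = gap.bfill(limit=1)

print(pd.DataFrame({"original": gap, "ffill": forward, "bfill": backward}))
```

The forward fill touches only the first NaN of the gap, and the backward fill touches only the last one.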
Limit Area
The limit_area parameter of the DataFrame.bfill() method works the same way as the limit_area parameter of DataFrame.ffill().
It can be combined with the limit parameter and behaves as follows:
None: The default behavior with no restrictions. NaNs are filled with the next valid value, subject to any limit specified.
'inside': Fills only those NaNs that are surrounded by valid values (a valid value exists both before and after them).
'outside': Fills only those NaNs that are not surrounded by valid values.
Using bfill() with limit=1 and limit_area='inside'
    print("Original Dataset")
    print(df)
    print("*"*55)
    bfill_df = df.bfill(limit=1, limit_area='inside')
    print(bfill_df)
    --------------------
    Original Dataset
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
    *******************************************************
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4             20.3
    3             29.9              NaN             20.3
    4             29.9              NaN             19.8
    5             45.0             15.5             19.8
    6             45.0             15.5             20.4
We can see that NaNs surrounded by valid values get filled, and since the limit was set to 1, only one NaN gets filled in each run of consecutive NaNs.
Using bfill() with limit=1 and limit_area='outside'
    print("Original Dataset")
    print(df)
    print("*"*55)
    bfill_df = df.bfill(limit=1, limit_area='outside')
    print(bfill_df)
    --------------------
    Original Dataset
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
    *******************************************************
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7             20.1             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
We can see that the NaN in the first row of the Min Temperature column was filled, because that NaN wasn’t surrounded by valid values (nothing valid precedes it).
Filling Missing Values Across the Axis
We can use the axis parameter to fill in missing values along the columns or rows.
If the axis is set to 0 or 'index', the missing values are filled up each column, moving vertically (from below) along the rows. This means the next valid value from below (in the same column) is used to fill in the NaN values above it.
    print("Original Dataset")
    print(df)
    print("*"*55)
    bfill_df = df.bfill(axis=0)
    print(bfill_df)
    --------------------
    Original Dataset
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
    *******************************************************
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7             20.1             21.4
    1             24.8             20.1             23.5
    2             29.9             18.4             20.3
    3             29.9             15.5             20.3
    4             29.9             15.5             19.8
    5             45.0             15.5             19.8
    6             45.0             15.5             20.4
It’s like using the bfill() method without any parameters. The NaN values were filled vertically from below along the rows in each column.
If the axis is set to 1 or 'columns', the missing values are filled across each row, moving horizontally (from right to left) along the columns. This means the next valid value from the right (in the same row) is used to fill in the NaN values to the left of it.
    print("Original Dataset")
    print(df)
    print("*"*55)
    bfill_df = df.bfill(axis=1)
    print(bfill_df)
    --------------------
    Original Dataset
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7              NaN             21.4
    1             24.8             20.1             23.5
    2              NaN             18.4              NaN
    3              NaN              NaN             20.3
    4             29.9              NaN              NaN
    5              NaN              NaN             19.8
    6             45.0             15.5             20.4
    *******************************************************
       Max Temperature  Min Temperature  Avg Temperature
    0             22.7             21.4             21.4
    1             24.8             20.1             23.5
    2             18.4             18.4              NaN
    3             20.3             20.3             20.3
    4             29.9              NaN              NaN
    5             19.8             19.8             19.8
    6             45.0             15.5             20.4
We can see that the third row of the Max Temperature column gets filled by the value to its right (18.4), and the fourth and sixth rows of Max Temperature and Min Temperature get filled by the values to their right (20.3 and 19.8).
That’s all for now
Keep Coding!