Pandas supports Copy-on-Write, an optimization technique that helps improve memory use, particularly when working with large datasets.
Copy-on-Write (CoW) was introduced in pandas 2.0. It is not yet fully implemented, but most of the optimizations that CoW makes possible are already supported.
Aim of Copy-on-Write
As the name suggests, the data is copied only when it is modified. What does that mean?
When a new DataFrame or Series is derived from another, it initially shares the same memory rather than receiving a copy of the data. A fresh copy is created only when the data of either object is modified, and only for the object being modified.
This saves memory and improves performance when working with large datasets.
Enabling CoW in Pandas
It is not enabled by default, so we need to turn it on using the copy_on_write configuration option in pandas.
```python
import pandas as pd

# Option 1
pd.options.mode.copy_on_write = True

# Option 2
pd.set_option("mode.copy_on_write", True)
```
You can use either option to turn on CoW globally in your environment.
Note: CoW will be enabled by default in Pandas 3.0, so get used to it early on.
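Because the configuration option is only relevant on pandas 2.x (CoW is simply the default behavior from 3.0 on), a small version guard keeps scripts portable. This is a hypothetical helper, not part of pandas:

```python
import pandas as pd

def enable_cow() -> None:
    """Enable Copy-on-Write on pandas 2.x; a no-op on 3.x, where it is the default."""
    if pd.__version__.startswith("2"):
        pd.set_option("mode.copy_on_write", True)

enable_cow()

# Quick check: with CoW active, writing to a derived Series leaves df intact
df = pd.DataFrame({"A": [1, 2, 3]})
s = df["A"]
s.iloc[0] = 99
assert df.loc[0, "A"] == 1
```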
Impact of CoW in Pandas
CoW disallows updating multiple pandas objects through a single shared buffer. Here's how that plays out.
```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
subset = df["A"]
subset.iloc[0] = 10
df
```
With CoW, the above snippet does not modify df; it modifies only the data of subset.
```
# df
   A  B
0  1  4
1  2  5
2  3  6

# subset
0    10
1     2
2     3
Name: A, dtype: int64
```
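We can watch the copy happen with np.shares_memory. This is a sketch, assuming CoW is active (enabled explicitly on pandas 2.x, the default on 3.x):

```python
import numpy as np
import pandas as pd

if pd.__version__.startswith("2"):
    pd.set_option("mode.copy_on_write", True)  # default from pandas 3.0

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
subset = df["A"]

# Before any write, subset is a lazy view of df's column
assert np.shares_memory(np.asarray(subset), np.asarray(df["A"]))

subset.iloc[0] = 10  # the write triggers the copy

# Now subset owns its own buffer and df is untouched
assert not np.shares_memory(np.asarray(subset), np.asarray(df["A"]))
assert df.loc[0, "A"] == 1
```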
inplace Operations on Selections will Not Work
Similarly, with CoW enabled, an inplace operation on a selected column no longer modifies the original df directly.
```python
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df["A"].replace(1, 5, inplace=True)
df
--------------------
   A  B
0  1  4
1  2  5
2  3  6
```
We can see that df has remained unchanged; additionally, pandas emits a ChainedAssignmentError warning.
The above operation can be performed in two different ways: one is to avoid inplace altogether, and the other is to use inplace on the original df at the DataFrame level.
```python
# Avoid inplace
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df["A"] = df["A"].replace(1, 5)
df
--------------------
   A  B
0  5  4
1  2  5
2  3  6
```
```python
# Using inplace at the DataFrame level
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.replace({"A": {2: 34}}, inplace=True)
df
--------------------
    A  B
0   1  4
1  34  5
2   3  6
```
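A third option, alongside the two above, is to skip replace entirely and assign through boolean indexing with .loc, which always operates on df directly. A minimal sketch, assuming CoW is active:

```python
import pandas as pd

if pd.__version__.startswith("2"):
    pd.set_option("mode.copy_on_write", True)  # default from pandas 3.0

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

# Assign through .loc on the DataFrame itself: no chained access, no warning
df.loc[df["A"] == 1, "A"] = 5

print(df["A"].tolist())  # [5, 2, 3]
```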
Chained Assignment will Never Work
Modifying a DataFrame or Series through multiple chained indexing operations in a single statement is known as chained assignment.
```python
# CoW disabled
with pd.option_context("mode.copy_on_write", False):
    df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
    df["B"][df["A"] > 2] = 10
df
```
The above code snippet tries to change column B of the original df wherever column A is greater than 2, meaning the values at index 2 and 3 of column B will be modified.
Since CoW is disabled, this operation is allowed and the original df is modified.
```
   A   B
0  1   5
1  2   6
2  3  10
3  4  10
```
But, this will never work with CoW enabled in pandas.
```python
# CoW enabled
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
df["B"][df["A"] > 2] = 10
df
--------------------
   A  B
0  1  5
1  2  6
2  3  7
3  4  8
```
Instead, with copy-on-write, we can use .loc to modify df using multiple indexing conditions.
```python
# CoW enabled
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
df.loc[(df["A"] == 1) | (df["A"] > 3), "B"] = 100
df
```
This modifies column B where column A is either 1 or greater than 3. The original df will look like the following.
```
   A    B
0  1  100
1  2    6
2  3    7
3  4  100
```
Read-only Arrays
When a Series or DataFrame is exposed as a NumPy array, that array is read-only if it shares its data with the original DataFrame or Series.
```python
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["5", "6", "7", "8"]})
arr = df.to_numpy()
arr
--------------------
array([[1, '5'],
       [2, '6'],
       [3, '7'],
       [4, '8']], dtype=object)
```
In the above code snippet, arr is a copy because df contains two columns of different dtypes (int and str), which must be consolidated into a single object array. Since nothing is shared, we are free to modify arr.
```python
arr[1, 0] = 10
arr
--------------------
array([[1, '5'],
       [10, '6'],
       [3, '7'],
       [4, '8']], dtype=object)
```
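Since the mixed-dtype array is a genuine copy, writing into it leaves the DataFrame alone. A quick sanity check, assuming CoW is active:

```python
import pandas as pd

if pd.__version__.startswith("2"):
    pd.set_option("mode.copy_on_write", True)  # default from pandas 3.0

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": ["5", "6", "7", "8"]})
arr = df.to_numpy()  # mixed dtypes -> consolidated into a fresh object array

arr[1, 0] = 10             # writable: we own this buffer
assert df.iloc[1, 0] == 2  # the DataFrame never sees the change
```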
Take a look at this case.
```python
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})
arr = df.to_numpy()
arr
```
The DataFrame df is backed by a single NumPy array (all columns have the same dtype), so arr shares its data with df. This means arr is read-only and cannot be modified in place.
```python
print(arr.flags.writeable)
arr[0, 0] = 10
arr
--------------------
False
ValueError: assignment destination is read-only
```
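When you do need a writable array from a homogeneous DataFrame, you can request an explicit copy via the copy parameter of DataFrame.to_numpy():

```python
import pandas as pd

if pd.__version__.startswith("2"):
    pd.set_option("mode.copy_on_write", True)  # default from pandas 3.0

df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]})

arr = df.to_numpy(copy=True)  # force a fresh, writable buffer
assert arr.flags.writeable

arr[0, 0] = 10
assert df.iloc[0, 0] == 1  # df keeps its own data
```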
Lazy Copy Mechanism
When two or more DataFrames share the same data, the copies will not be created immediately.
```python
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = df.reset_index(drop=True)
```
Both df and df2 point to the same data in memory. The copy is triggered only when one of the DataFrames is modified.
```python
df2.iloc[0, 0] = 10
print(df2)
print(df)
--------------------
    A  B
0  10  4
1   2  5
2   3  6
   A  B
0  1  4
1  2  5
2  3  6
```
But this copy is not always necessary. If we don't need the initial df, we can simply reassign the result to the same variable (df); this creates a new reference, the old object is released, and the copy-on-write step is avoided entirely.
```python
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print("Initial reference: ", id(df))
df = df.reset_index(drop=True)
print("New reference: ", id(df))
df.iloc[0, 0] = 10
print(df)
--------------------
Initial reference:  138400246865760
New reference:  138400246860336
    A  B
0  10  4
1   2  5
2   3  6
```
The same optimization (the lazy copy mechanism) is applied to methods that don't require a copy of the original data.
DataFrame.rename()
```python
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df.rename(columns={"A": "X", "B": "Y"})
--------------------
   X  Y
0  1  4
1  2  5
2  3  6
```
When CoW is enabled, this method returns a new DataFrame that keeps referencing the original data rather than copying it up front, unlike the regular execution.
DataFrame.drop() for axis=1
Similarly, the same mechanism is implemented for DataFrame.drop() with axis=1 (axis='columns').
```python
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
df.drop(["A"], axis=1)
--------------------
   B  C
0  4  7
1  5  8
2  6  9
```
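We can observe the lazy copy in both methods with np.shares_memory. This is a sketch, assuming CoW is active (enabled explicitly on pandas 2.x, the default on 3.x):

```python
import numpy as np
import pandas as pd

if pd.__version__.startswith("2"):
    pd.set_option("mode.copy_on_write", True)  # default from pandas 3.0

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

renamed = df.rename(columns={"A": "X"})
dropped = df.drop(["A"], axis=1)

# Both results still view df's buffers; no data has been copied yet
assert np.shares_memory(np.asarray(renamed["X"]), np.asarray(df["A"]))
assert np.shares_memory(np.asarray(dropped["B"]), np.asarray(df["B"]))

renamed.iloc[0, 0] = 100  # the first write triggers the copy
assert df.iloc[0, 0] == 1
```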
Conclusion
Pandas will enable Copy-on-Write (CoW) by default in version 3.0. The CoW-compliant optimizations described here lead to more efficient memory and resource management when working with large datasets, reduce unpredictable or inconsistent behavior, and improve performance.
That’s all for now
Keep Coding ✌️