You are currently viewing Join, Merge, and Combine Multiple Datasets Using pandas

Join, Merge, and Combine Multiple Datasets Using pandas

Data processing becomes critical when training a robust machine learning model. We occasionally need to restructure and add new data to the datasets to increase the efficiency of the data.

We’ll look at how to combine multiple datasets and merge multiple datasets with the same and different column names in this article. We’ll use the pandas library’s following functions to carry out these operations.

  • pandas.concat()
  • pandas.merge()
  • pandas.DataFrame.join()

Preparing Sample Data

We’ll create sample datasets using pandas.DataFrame() function and then perform concatenating operations on them.

Sample data creation

The code in the image will generate two datasets from data and data1 using pd.DataFrame(data) and pd.DataFrame(data1) and store them in the variables df1 and df2.

Then, using the .to_csv() function, df1 and df2 will be saved in the CSV format as 'employee.csv' and 'employee1.csv' respectively.

Here, the data that we created looks as shown in the following image.

Data preview

Combining Data Using concat()

We can use the pandas library to analyze, modify, and do other things with our CSV (comma-separated value) data. The library includes the concat() function which we will use to perform the concatenation of multiple datasets.

There are two axes on which the datasets can be concatenated: the row axis and the column axis.

Visual representation of concatenation on different axis

Combine Data Along the Row Axis

We previously created two datasets named 'employee.csv' and 'employee1.csv'. We’ll concatenate them horizontally, which means the data will be spliced across the rows.

The above code demonstrates the basic use of the concat() function. We passed a list of datasets(objects) that will be combined along the row axis by default.

Concatenated dataset

The concat() function accepts some parameters that affect the concatenation of the data.

The indices of the data are taken from their corresponding data, as seen in the above output. How do we create a new data index?

The ignore_index Parameter

When ignore_index=True is set, a new index from 0 to n-1 is created. The default value is False, which is why the indices were repeated in the above example.

Created new index

As shown in the image above, the dataset contains a new index ranging from 0 to 7.

The join Parameter

In the above image, we can see that the first four data points for the Salary and No_of_awards columns are missing.

This is due to the join parameter, which by default is set to "outer" which joins the data exactly as it is. If it is set to "inner", data that does not match another dataset is removed.

Inner join

The keys Parameter

The keys parameter creates an index from the keys which is used to differentiate and identify the original data in the concatenated objects.

Key index

The datasets were concatenated, and a multi-level index was created, with the first level representing the outermost index (Dataset1 and Dataset2 from the keys) and the second level representing the original index.

Combine Data Along the Column Axis

The datasets were concatenated along the row axis or horizontally in the previous section, but in this approach, we will stitch them vertically or along the column axis using the axis parameter.

The axis parameter is set to 0 or "index" by default, which concatenates the datasets along the row axis, but if we change its value to 1 or "columns", it concatenates the datasets along the column axis.

Datasets concatenated vertically

Merging Data Using merge()

The pandas.merge() function merges data from one or more datasets based on common columns or indices.

We’ll operate on a different dataset that we created and contains the information shown in the following image.

Sample data preview

The merge() function takes left and right parameters which are datasets to be merged.

The how Parameter

We can now specify the type of merge we want to perform on these datasets by providing the how parameter. The how parameter allows for five different types of merges:

  • inner: Default. It only includes the values that match from both datasets.
  • outer: It includes all of the values from both datasets but fills the missing values with NaN (Not a Number).
  • left: It includes all of the values from the left dataset and replaces any missing values in the right dataset with NaN.
  • right: It includes all of the values from the right dataset and replaces any missing values in the left dataset with NaN.
  • cross: It creates the Cartesian product which means that the number of rows created will be equal to the product of the row counts of both datasets. If both datasets have four rows, then four times four (4 * 4) equals sixteen (16) rows.

Examples

Performing inner merge

Inner merged dataset

We can see that only values with the same Id from both datasets have been included.

Performing outer merge

Outer merged dataset

In the case of an outer merge, all of the values from both datasets were included, and the missing fields were filled in with NaN.

Performing left merge

Left merged dataset

The matching values of the right dataset (dt2) were merged in the left dataset (dt1) and the values of the last four columns (Project_id_finalAgeSalary, and No_of_awards) were not found for A4, so they were filled in with NaN.

Performing right merge

Right merged dataset

The matching values of the left dataset (dt1) were merged in the right dataset (dt2) and the values of the first five columns (Project_id_initialNameRoleExperience, and Qualification) were not found for A6, so they were filled in with NaN.

Cross Merging the Datasets

The how parameter has five different types of merge, one of which is a cross merge.

As previously stated, it generates the Cartesian product, with the number of rows formed equal to the product of row counts from both datasets. Take a look at the illustration below to get a better understanding.

Visual representation of the cross join
Cross merged dataset

Both datasets have four rows each, and each row from dt1 is repeated four times (row count of dt2), resulting in a data set of sixteen rows.

The on, left_on & right_on Parameters

The on parameter accepts the name of a column or index(row) to join on. It could be a single name or a list of names.

The left_on and right_on parameter takes a column or index(row) name from the left and right dataset to join on. They are used when both datasets have different column names to join on.

Merging Datasets on the Same Column

To merge the datasets based on the same column, we can use the on parameter and pass the common column name that both datasets must have.

We are merging datasets dt1 and dt2 based on the 'Id' column that they both share.

Data merged on same column

The matching Id column values from both datasets were merged, and the non-matching values were removed.

Merging Datasets on Different Columns

To merge different columns in the left and right datasets, use the left_on and right_on parameters.

The joining column is the "Project_id_initial" column from the left dataset (dt1) and the "Project_id_final" column from the right dataset (dt2). The values shared by both columns will be used to merge them.

Dataset merged on different columns

As we can see, the dataset includes both columns, as well as matching rows based on the common values in both the "Project_id_initial" and "Project_id_final" columns.

Changing the Suffix of the Column

If you notice that the merged dataset has two Id columns labeled Id_x and Id_y, this is due to the suffixes parameter, which has default values _x and _y, and when overlapping column names are found in the left and right datasets, they are suffixed with default values.

This will append the suffixes "_1" and "_2" to the overlapping columns. Because both datasets have the same column name Id, the Id column will appear to be Id_1 in the left dataset and Id_2 in the right dataset.

Suffix changed

Joining Datasets Using join()

The join() method works on the DataFrame object and joins the columns based on the index values. Let’s perform a basic join operation on the dataset.

The columns of the dt2 dataset will be joined with the dt1 dataset based on the index values.

Basic join operation output

Since the index values of both datasets are the same which is 012, and 3, that’s why we got all the rows.

The join() method’s parameters can be used to manipulate the dataset. The join() method, like the merge() function, includes how and on parameters.

  • how: Default value is left join. It is the same as the how parameter of the merge() function, but the difference is that it performs index-based joins.
  • on: A column or index name is required to join on the index in the specified dataset.
  • lsuffix and rsuffix: Used to append the suffix to the left and right datasets’ overlapping columns.

Examples

Left join on an index

In the above code, we use set_index('Id') to set the Id column of the dt2 dataset as the index and perform a left join (how="left") on the Id column (on="Id") between dt1 and dt2.

This will join matching values in the Id column of the dt2 dataset with the Id column of the dt1 dataset. If any values are missing, they will be filled in by NaN.

Left joined dataset

It’s the same as when we used the merge() function, but this time we’re joining based on the index.

Right join on an index

Right joined dataset

We are joining the dt1 dataset with the index of the dt2 dataset based on the Id column. We got NaN in the first five columns for A6 because there were no values specified in the dt1 dataset.

Inner join on an index

Inner joined dataset

The datasets were joined based on matching index values, i.e., both datasets dt1 and dt2 share A1A2, and A3, so the values corresponding to these indices were joined.

Outer join on an index

Outer joined dataset

We performed the outer join, which included all of the rows from both datasets based on the Id. The corresponding values have been filled in, and missing values have been filled in with NaN.

Cross Join

We didn’t pass the on parameter, instead, we defined how the data should join (how="cross"). The resulting dataset will be the product of both datasets’ row counts.

Cross joined dataset

Conclusion

We’ve learned how to use pandas.concat()pandas.merge(), and pandas.DataFrame.join() to combine, merge, and join DataFrames.

The concat() function in pandas is a go-to option for combining the DataFrames due to its simplicity. However, if we want more control over how the data is joined and on which column in the DataFrame, the merge() function is a good choice. If we want to join data based on the index, we should use the join() method.


πŸ†Other articles you might be interested in if you liked this one

βœ…How to use assert statements for debugging in Python?

βœ…How to write unit tests using the unittest module in Python?

βœ…What are the uses of asterisk(*) in Python?

βœ…What are the init and new methods in Python?

βœ…How to build a custom deep learning model using Python?

βœ…How to generate temporary files and directories using tempfile in Python?

βœ…How to run the Flask app from the terminal?


That’s all for now

Keep coding✌✌