Merge Dataframes
Tags: #pandas #python #merging #merge #dataframes #consolidate
This notebook will help you understand how to use the pandas merge function. It explains how to merge two datasets together and consolidate multiple dataset into one.

Input

Import Library

1
import pandas as pd
2
import numpy as np
Copied!

Create dataframes to be merged

Dataframe 1
1
# Creating values to be used as datasets
2
dict1 = {
3
"student_id": [1,2,3,4,5,6,7,8,9,10],
4
"student_name": ["Peter","Dolly","Maggie","David","Isabelle","Harry","Akin","Abbey","Victoria","Sam"],
5
"student_course": np.random.choice(["Biology","Physics","Chemistry"], size=10)
6
}
Copied!
1
# Create dataframe
2
df_1 = pd.DataFrame(dict1)
3
df_1
Copied!
Dataframe 2
1
# Creating values to be used as datasets
2
dict2 = {
3
"student_id": np.random.choice([1,2,3,4,5,6,7,8,9,10], size=100),
4
"student_grade": np.random.choice(["A","B","C","D","E","F"], size=100),
5
"professors": np.random.choice(["Mark Levinson","Angela Marge","Bonnie James","Klaus Michealson"], size=100),
6
}
Copied!
1
# Create dataframe
2
df_2 = pd.DataFrame(dict2) # OR Data2=pd.read_csv(filepath)
3
df_2
Copied!

Model

pd.merge: acts like an SQL inner join and joins based on similar columns or index unless specified to join differently

Merging dataframes with same values with same column names

Using pd.merge(left, right) acts like sql inner join and only joins on the common column they have. It tries finding everything from the right and append to left 'student_id' is common to both so it has been merged into one and included all the other df_2 columns to df_1 table.
1
df = pd.merge(df_1, df_2)
Copied!

Output

Display result

1
df
Copied!

Other options

Specifiying the comon column using parameters "on"

1
df = pd.merge(df_1, df_2, on="student_id")
2
df
Copied!

Specifying what kind of Joins you want since merging does inner joins by default

  • "inner" > Inner Join: INCLUDING ROWS OF FIRST AND SECOND ONLY IF THE VALUE IS THE SAME IN BOTH DATAFRAMES
  • "outer" > Outer Join: IT JOINS ALL THE ROWS OF FIRST AND SECOND DATAFRAMES TOGETHER AND CREATE NaN VALUE IF A ROW DOESN'T HAVE A VALUE AFTER JOINING
  • "left" > Left Join: INCLUDES ALL THE ROWS IN THE FIRST DATAFRAME AND ADDS THE COLUMNS OF SECOND DATAFRAME BUT IT WON'T INCLUDE THE ROWS OF THE SECOND DATAFRAME IF IT'S NOT THE SAME WITH THE FIRST
  • "right" > Right Join: INCLUDES ALL THE ROWS OF SECOND DATAFRAME AND THE COLUMNS OF THE FIRST DATAFRAME BUT WON'T INCLUDE THE ROWS OF THE FIRST DATAFRAME IF IT'S NOT SIMILAR TO THE SECOND DATAFRAME
1
df = pd.merge(df_1, df_2, on="student_id", how='left')
2
df
Copied!

Merging dataframes with same values but different column names

We add two more parameters :
  • Left_on means merge using this column name
  • Right_on means merge using this column name i.e merge both id and student_id together since they don't have same name, they will create different columns on the new table
1
df_1 = df_1.rename(columns={"student_id": "id"}) # Renamed student_id to id so as to give this example
2
df_1
Copied!
1
df = pd.merge(df_1, df_2, left_on="id", right_on="student_id")
2
df
Copied!

Merging with the index of the first dataframe

1
df_1.set_index("id") # this will make id the new index for df_1
Copied!
1
df = pd.merge(df_1, df_2, left_index=True, right_on="student_id")#the new index will be from index of df_2 where they joined
2
df
Copied!

Merging both table on their index i.e two indexes

1
df_2.set_index("student_id") # making student_id the index of Data2
Copied!
1
df = pd.merge(df_1, df_2, left_index=True, right_index=True) # new index will be from the left index unlike when joining only one index
2
df
Copied!
Copy link
Edit on GitHub