Pandas DataFrame.merge()
Pandas merge() is defined as the process of bringing the two datasets together into one and aligning the rows based on the common attributes or columns. It is an entry point for all standard database join operations between DataFrame objects:
Syntax:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True)
Parameters:
- right: DataFrame or named Series
It is an object which merges with the DataFrame.
- how: {'left', 'right', 'outer', 'inner'}, default 'inner'
Type of merge to be performed.
- left: It use only keys from the left frame, similar to a SQL left outer join; preserve key order.
- right: It use only keys from the right frame, similar to a SQL right outer join; preserve key order.
- outer: It used the union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.
- inner: It use the intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
- on: label or list
It is a column or index level names to join on. It must be found in both the left and right DataFrames. If on is None and not merging on indexes, then this defaults to the intersection of the columns in both DataFrames.
left_on: label or list, or array-like
It is a column or index level names from the left DataFrame to use as a key. It can be an array with length equal to the length of the DataFrame.
- right_on: label or list, or array-like
It is a column or index level names from the right DataFrame to use as keys. It can be an array with length equal to the length of the DataFrame.
- left_index : bool, default False
It uses the index from the left DataFrame as the join key(s), If true. In the case of MultiIndex (hierarchical), many keys in the other DataFrame (either the index or some columns) should match the number of levels.
- right_index : bool, default False
It uses the index from the right DataFrame as the join key. It has the same usage as the left_index.
- sort: bool, default False
If True, it sorts the join keys in lexicographical order in the result DataFrame. Otherwise, the order of the join keys depends on the join type (how keyword).
- suffixes: tuple of the (str, str), default ('_x', '_y')
It suffixes to apply to overlap the column names in the left and right DataFrame, respectively. The columns use (False, False) values to raise an exception on overlapping.
- copy: bool, default True
If True, it returns a copy of the DataFrame.
Otherwise, It can avoid the copy.
- indicator: bool or str, default False
If True, It adds a column to output DataFrame "_merge" with information on the source of each row. If it is a string, a column with information on the source of each row will be added to output DataFrame, and the column will be named value of a string. The information column is defined as a categorical-type and it takes value of:
- "left_only" for the observations whose merge key appears only in 'left' of the DataFrame, whereas,
- "right_only" is defined for observations in which merge key appears only in 'right' of the DataFrame,
- "both" if the observation's merge key is found in both of them.
- validate: str, optional
If it is specified, it checks the merge type that is given below:
- "one_to_one" or "1:1": It checks if merge keys are unique in both the left and right datasets.
- "one_to_many" or "1:m": It checks if merge keys are unique in only the left dataset.
- "many_to_one" or "m:1": It checks if merge keys are unique in only the right dataset.
- "many_to_many" or "m:m": It is allowed, but does not result in checks.
Example1: Merge two DataFrames on a key
# import the pandas library
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4],
'Name': ['John', 'Parker', 'Smith', 'Parker'],
'subject_id':['sub1','sub2','sub4','sub6']})
right = pd.DataFrame({
'id':[1,2,3,4],
'Name': ['William', 'Albert', 'Tony', 'Allen'],
'subject_id':['sub2','sub4','sub3','sub6']})
print (left)
print (right)
Output
id Name subject_id
0 1 John sub1
1 2 Parker sub2
2 3 Smith sub4
3 4 Parker sub6
id Name subject_id
0 1 William sub2
1 2 Albert sub4
2 3 Tony sub3
3 4 Allen sub6
Example2: Merge two DataFrames on multiple keys:
import pandas as pd
left = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id':['sub1','sub2','sub4','sub6','sub5']})
right = pd.DataFrame({
'id':[1,2,3,4,5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id':['sub2','sub4','sub3','sub6','sub5']})
print pd.merge(left,right,on='id')
Output
id Name_x subject_id_x Name_y subject_id_y
0 1 John sub1 William sub2
1 2 Parker sub2 Albert sub4
2 3 Smith sub4 Tony sub3
3 4 Parker sub6 Allen sub6
|