- Thi
- Libraries
- Coupling dfs with
merge()
- Concatenate dfs with
concat()
- Combine 2 dataframes with missing values
Thi
Sometimes, we wanna couple multiple dataframes together. In this note, I use df
as DataFrame
, s
as Series
.
Libraries
import pandas as pd
import numpy as np
Coupling dfs with merge()
There are 4 types of merging, like in SQL.
- Inner: only includes elements that appear in both dataframes with a common key.
- Outer: includes all data from both dataframes.
- Left: includes all of the rows from the “left” dataframe along with any rows from the “right” dataframe with a common key; the result retains all columns from both of the original dataframes.
- Right: includes all of the rows from the “right” dataframe along with any rows from the “left” dataframe with a common key; the result retains all columns from both of the original dataframes.
On the same column name,
# left
df_left = pd.merge(left=df1, right=df2, how='left', on='Col_1', suffixes=('_df1', '_df2'))
# right
df_right = pd.merge(left=df1, right=df2, how='right', on='Col_1', suffixes=('_df1', '_df2'))
display_side_by_side(df1, df2, df_left, df_right)
::: col-2-flex
Col_1 | Col_2 | |
---|---|---|
0 | A | 1 |
1 | E | 3 |
2 | C | NaN |
3 | D | NaN |
4 | B | 2 |
Col_1 | Col_2 | |
---|---|---|
0 | A | 1 |
1 | B | 2 |
2 | C | -3 |
3 | F | -4 |
4 | E | NaN |
:::
::: col-2-flex flex-start
Col_1 | Col_2_df1 | Col_2_df2 | |
---|---|---|---|
0 | A | 1 | 1 |
1 | E | 3 | NaN |
2 | C | NaN | -3 |
3 | D | NaN | NaN |
4 | B | 2 | 2 |
Col_1 | Col_2_df1 | Col_2_df2 | |
---|---|---|---|
0 | A | 1 | 1 |
1 | E | 3 | NaN |
2 | C | NaN | -3 |
3 | B | 2 | 2 |
4 | F | NaN | -4 |
:::
# inner (defaut)
df_inner = pd.merge(left=df1, right=df2, on='Col_1', suffixes=('_df1', '_df2'))
# outer
df_outer = pd.merge(left=df1, right=df2, how='outer', on='Col_1', suffixes=('_df1', '_df2'))
display_side_by_side(df1, df2, df_inner, df_outer)
::: col-2-flex flex-start
Col_1 | Col_2 | |
---|---|---|
0 | A | 1 |
1 | E | 3 |
2 | C | NaN |
3 | D | NaN |
4 | B | 2 |
Col_1 | Col_2 | |
---|---|---|
0 | A | 1 |
1 | B | 2 |
2 | C | -3 |
3 | F | -4 |
4 | E | NaN |
:::
::: col-2-flex flex-start
Col_1 | Col_2_df1 | Col_2_df2 | |
---|---|---|---|
0 | A | 1 | 1 |
1 | E | 3 | NaN |
2 | C | NaN | -3 |
3 | B | 2 | 2 |
Col_1 | Col_2_df1 | Col_2_df2 | |
---|---|---|---|
0 | A | 1 | 1 |
1 | E | 3 | NaN |
2 | C | NaN | -3 |
3 | D | NaN | NaN |
4 | B | 2 | 2 |
5 | F | NaN | -4 |
:::
On the different column names,
# left
df_left = pd.merge(left=df1, right=df2, how='left', left_on='Col_1', right_on='Col_X', suffixes=('_df1', '_df2'))
display_side_by_side(df1, df2, df_left)
::: col-2-flex flex-start
Col_1 | Col_2 | |
---|---|---|
0 | A | 1 |
1 | E | 3 |
2 | C | NaN |
3 | D | NaN |
4 | B | 2 |
Col_X | Col_2 | |
---|---|---|
0 | A | 1 |
1 | B | 2 |
2 | C | -3 |
3 | F | -4 |
4 | E | NaN |
:::
::: col-2-flex flex-start
Col_1 | Col_2_df1 | Col_X | Col_2_df2 | |
---|---|---|---|---|
0 | A | 1 | A | 1 |
1 | E | 3 | E | NaN |
2 | C | NaN | C | -3 |
3 | D | NaN | NaN | NaN |
4 | B | 2 | B | 2 |
:::
The result keeps both Col_1
and Col_X
while in the case of the same column name, there is only 1 column. Other words, in this case, we only want to keep Col_1
and don’t need Col_X
. How to do that?
df_left = df1.set_index('Col_1').join(df2.set_index('Col_X'), how="left", lsuffix="_df1", rsuffix="_df2").reset_index()
display_side_by_side(df1, df2, df_left)
::: col-2-flex flex-start
Col_1 | Col_2 | |
---|---|---|
0 | A | 1.0 |
1 | E | 3.0 |
2 | C | NaN |
3 | D | NaN |
4 | B | 2.0 |
Col_X | Col_2 | |
---|---|---|
0 | A | 1.0 |
1 | B | 2.0 |
2 | C | -3.0 |
3 | F | -4.0 |
4 | E | NaN |
:::
Col_1 | Col_2_df1 | Col_2_df2 | |
---|---|---|---|
0 | A | 1.0 | 1.0 |
1 | E | 3.0 | NaN |
2 | C | NaN | -3.0 |
3 | D | NaN | NaN |
4 | B | 2.0 | 2.0 |
Concatenate dfs with concat()
# axis=0 (default)
df_concat_0 = pd.concat([df1, df2]) # the same columns
df_concat_1 = pd.concat([df1, df2], axis=1) # the same rows
df_concat_0_idx = pd.concat([df1, df2], ignore_index=True)
# ignore_index=True prevent duplicating indexes
display_side_by_side(df1, df2)
display_side_by_side(df_concat_0, df_concat_1, df_concat_0_idx)
::: col-2-flex flex-start
Col_1 | Col_2 | |
---|---|---|
0 | A | 1.0 |
1 | E | 3.0 |
2 | C | NaN |
3 | D | NaN |
4 | B | 2.0 |
Col_1 | Col_2 | |
---|---|---|
0 | A | 1.0 |
1 | B | 2.0 |
2 | C | -3.0 |
3 | F | -4.0 |
4 | E | NaN |
:::
::: col-2-flex flex-start
Col_1 | Col_2 | |
---|---|---|
0 | A | 1.0 |
1 | E | 3.0 |
2 | C | NaN |
3 | D | NaN |
4 | B | 2.0 |
0 | A | 1.0 |
1 | B | 2.0 |
2 | C | -3.0 |
3 | F | -4.0 |
4 | E | NaN |
Col_1 | Col_2 | Col_1 | Col_2 | |
---|---|---|---|---|
0 | A | 1.0 | A | 1.0 |
1 | E | 3.0 | B | 2.0 |
2 | C | NaN | C | -3.0 |
3 | D | NaN | F | -4.0 |
4 | B | 2.0 | E | NaN |
Col_1 | Col_2 | |
---|---|---|
0 | A | 1.0 |
1 | E | 3.0 |
2 | C | NaN |
3 | D | NaN |
4 | B | 2.0 |
5 | A | 1.0 |
6 | B | 2.0 |
7 | C | -3.0 |
8 | F | -4.0 |
9 | E | NaN |
:::
Combine 2 dataframes with missing values
We consider a situation in that we need to combine 2 dfs containing missing values in each. The missing values will be filled by taking from the others. For example, the value of C
in the left df can be fulfilled by the value of C
in the right df.
df_comb = df1.copy() # we don't want to change df1
df_new = df_comb.fillna(df2)
display_side_by_side(df1, df2, df_comb, df_new)
::: col-2-flex flex-start
Col_1 | Col_2 | |
---|---|---|
0 | A | 1.0 |
1 | E | 3.0 |
2 | C | NaN |
3 | D | NaN |
4 | B | 2.0 |
Col_1 | Col_2 | |
---|---|---|
0 | A | 1.0 |
1 | B | 2.0 |
2 | C | -3.0 |
3 | F | -4.0 |
4 | E | NaN |
Col_1 | Col_2 | |
---|---|---|
0 | A | 1.0 |
1 | E | 3.0 |
2 | C | NaN |
3 | D | NaN |
4 | B | 2.0 |
Col_1 | Col_2 | |
---|---|---|
0 | A | 1.0 |
1 | E | 3.0 |
2 | C | -3.0 |
3 | D | -4.0 |
4 | B | 2.0 |
:::
The following wiki, pages and posts are tagged with
Title | Type | Excerpt |
---|---|---|
basic setup using mac's new gpu | post | Wed, Oct 20, 21, initial setup on mac machine |
Resources for DS & ML & DL | post | Mon, Oct 25, 21, bugs and tuts lists books services & api frameworks |
Data combining using pandas | post | Tue, Oct 26, 21, Coupling multiple dataframes together uisng dataFrame and series |
Dataset Collection for dl ml sources | post | Tue, Oct 26, 21, datasets for Scikt-learn, public google and nlp projects with awesome-public-datasets, Open Images V6 |
Practical Machine Learning Tools and Techniques | post | Tue, Dec 28, 21, owerpoint slides for Chapters 1-12. This is a very comprehensive teaching resource, with many PPT slides covering each chapter of the book |
meet-puppeteer.md | post | javascript로 브라우저 자동화 |
Machine learning, deep learning, AI | page | DL/ML concept google search model 𝗔𝗿𝘁𝗶𝗳𝗶𝗰𝗶𝗮𝗹 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝗰𝗲 𝗣𝗿𝗼𝗷𝗲𝗰𝘁 𝗟𝗶𝘀𝘁 |
webscraping | page | webscraping lessons, rapa, blackyak, 100 famous mountains, github actions and python install |