第 94 章 Pandas vs. dplyr谁是数据分析中最好用的宏包

本章,我们用一些例子来比较数据科学领域Pandas vs. dplyr 两个宏包的用法

首先,需要加载这个宏包

## Warning: package 'reticulate' was built under R version 4.2.3
use_python("E:/Anaconda3/python.exe") 
#use_condaenv("Anaconda3", required = TRUE)
#py_config()

94.1 加载数据

这是用 Pandas 方法,

import pandas as pd

gapminder = pd.read_csv("./demo_data/gapminder.csv")

这是用 dplyr 方法

## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
gapminder <- readr::read_csv("./demo_data/gapminder.csv")
## Rows: 1704 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): country, continent
## dbl (4): year, lifeExp, pop, gdpPercap
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

这个过程,两者没什么区别。再往下看

94.2 过滤

94.2.1 问题1 – 找出2007年的所用记录.

这是用 Pandas 方法,

gapminder[gapminder['year'] == 2007]
##                  country continent  year  lifeExp       pop     gdpPercap
## 11           Afghanistan      Asia  2007   43.828  31889923    974.580338
## 23               Albania    Europe  2007   76.423   3600523   5937.029526
## 35               Algeria    Africa  2007   72.301  33333216   6223.367465
## 47                Angola    Africa  2007   42.731  12420476   4797.231267
## 59             Argentina  Americas  2007   75.320  40301927  12779.379640
## ...                  ...       ...   ...      ...       ...           ...
## 1655             Vietnam      Asia  2007   74.249  85262356   2441.576404
## 1667  West Bank and Gaza      Asia  2007   73.422   4018332   3025.349798
## 1679         Yemen, Rep.      Asia  2007   62.698  22211743   2280.769906
## 1691              Zambia    Africa  2007   42.384  11746035   1271.211593
## 1703            Zimbabwe    Africa  2007   43.487  12311143    469.709298
## 
## [142 rows x 6 columns]

这是用 dplyr 方法,

gapminder %>%
  filter(year == 2007)
## # A tibble: 142 × 6
##    country     continent  year lifeExp       pop gdpPercap
##    <chr>       <chr>     <dbl>   <dbl>     <dbl>     <dbl>
##  1 Afghanistan Asia       2007    43.8  31889923      975.
##  2 Albania     Europe     2007    76.4   3600523     5937.
##  3 Algeria     Africa     2007    72.3  33333216     6223.
##  4 Angola      Africa     2007    42.7  12420476     4797.
##  5 Argentina   Americas   2007    75.3  40301927    12779.
##  6 Australia   Oceania    2007    81.2  20434176    34435.
##  7 Austria     Europe     2007    79.8   8199783    36126.
##  8 Bahrain     Asia       2007    75.6    708573    29796.
##  9 Bangladesh  Asia       2007    64.1 150448339     1391.
## 10 Belgium     Europe     2007    79.4  10392226    33693.
## # ℹ 132 more rows

94.2.2 问题2 – 找出2007年中 continent为 Americas 的记录

这是用 Pandas 方法,

gapminder[(gapminder['year'] == 2007) & (gapminder['continent'] == 'Americas')]
##                   country continent  year  lifeExp        pop     gdpPercap
## 59              Argentina  Americas  2007   75.320   40301927  12779.379640
## 143               Bolivia  Americas  2007   65.554    9119152   3822.137084
## 179                Brazil  Americas  2007   72.390  190010647   9065.800825
## 251                Canada  Americas  2007   80.653   33390141  36319.235010
## 287                 Chile  Americas  2007   78.553   16284741  13171.638850
## 311              Colombia  Americas  2007   72.889   44227550   7006.580419
## 359            Costa Rica  Americas  2007   78.782    4133884   9645.061420
## 395                  Cuba  Americas  2007   78.273   11416987   8948.102923
## 443    Dominican Republic  Americas  2007   72.235    9319622   6025.374752
## 455               Ecuador  Americas  2007   74.994   13755680   6873.262326
## 479           El Salvador  Americas  2007   71.878    6939688   5728.353514
## 611             Guatemala  Americas  2007   70.259   12572928   5186.050003
## 647                 Haiti  Americas  2007   60.916    8502814   1201.637154
## 659              Honduras  Americas  2007   70.198    7483763   3548.330846
## 791               Jamaica  Americas  2007   72.567    2780132   7320.880262
## 995                Mexico  Americas  2007   76.195  108700891  11977.574960
## 1115            Nicaragua  Americas  2007   72.899    5675356   2749.320965
## 1187               Panama  Americas  2007   75.537    3242173   9809.185636
## 1199             Paraguay  Americas  2007   71.752    6667147   4172.838464
## 1211                 Peru  Americas  2007   71.421   28674757   7408.905561
## 1259          Puerto Rico  Americas  2007   78.746    3942491  19328.709010
## 1559  Trinidad and Tobago  Americas  2007   69.819    1056608  18008.509240
## 1619        United States  Americas  2007   78.242  301139947  42951.653090
## 1631              Uruguay  Americas  2007   76.384    3447496  10611.462990
## 1643            Venezuela  Americas  2007   73.747   26084662  11415.805690

这是用 dplyr 方法

gapminder %>%
  filter(
    year == 2007,
    continent == "Americas"
  )
## # A tibble: 25 × 6
##    country            continent  year lifeExp       pop gdpPercap
##    <chr>              <chr>     <dbl>   <dbl>     <dbl>     <dbl>
##  1 Argentina          Americas   2007    75.3  40301927    12779.
##  2 Bolivia            Americas   2007    65.6   9119152     3822.
##  3 Brazil             Americas   2007    72.4 190010647     9066.
##  4 Canada             Americas   2007    80.7  33390141    36319.
##  5 Chile              Americas   2007    78.6  16284741    13172.
##  6 Colombia           Americas   2007    72.9  44227550     7007.
##  7 Costa Rica         Americas   2007    78.8   4133884     9645.
##  8 Cuba               Americas   2007    78.3  11416987     8948.
##  9 Dominican Republic Americas   2007    72.2   9319622     6025.
## 10 Ecuador            Americas   2007    75.0  13755680     6873.
## # ℹ 15 more rows

94.2.3 问题3 – 找出2007年中 continent为 Americas,且只包含美国 的记录

这是用 Pandas 方法,

gapminder[(gapminder['year'] == 2007) & 
          (gapminder['continent'] == 'Americas') &
          (gapminder['country'] == 'United States')]
##             country continent  year  lifeExp        pop    gdpPercap
## 1619  United States  Americas  2007   78.242  301139947  42951.65309

这是用 dplyr 方法

gapminder %>%
  filter(
    year == 2007,
    continent == "Americas",
    country == "United States"
  )
## # A tibble: 1 × 6
##   country       continent  year lifeExp       pop gdpPercap
##   <chr>         <chr>     <dbl>   <dbl>     <dbl>     <dbl>
## 1 United States Americas   2007    78.2 301139947    42952.

94.3 统计

我们再看看Pandas 和 dplyr 在常用的统计分析方面的表现。

94.3.1 问题1 – 计算2007年全球寿命均值

这是用 Pandas 方法,

gapminder[gapminder['year'] == 2007]['lifeExp'].mean()
## 67.00742253521126

这是用 dplyr 方法

gapminder %>%
  filter(year == 2007) %>%
  summarize(mean(lifeExp))
## # A tibble: 1 × 1
##   `mean(lifeExp)`
##             <dbl>
## 1            67.0

94.3.2 问题2 – 计算2007年每一个洲的寿命均值

这是用 Pandas 方法,

gapminder[gapminder['year'] == 2007].groupby(by='continent').mean()['lifeExp']
## continent
## Africa      54.806038
## Americas    73.608120
## Asia        70.728485
## Europe      77.648600
## Oceania     80.719500
## Name: lifeExp, dtype: float64

这是用 dplyr 方法

gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(mean(lifeExp))
## # A tibble: 5 × 2
##   continent `mean(lifeExp)`
##   <chr>               <dbl>
## 1 Africa               54.8
## 2 Americas             73.6
## 3 Asia                 70.7
## 4 Europe               77.6
## 5 Oceania              80.7

94.3.3 问题3 – 计算2007年每一个洲的人口总数然后降序输出。

这是用 Pandas 方法,

gapminder[gapminder['year'] == 2007].groupby(by='continent').sum()['pop'].sort_values(ascending=False)
## continent
## Asia        3811953827
## Africa       929539692
## Americas     898871184
## Europe       586098529
## Oceania       24549947
## Name: pop, dtype: int64

这是用 dplyr 方法

gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(total_pop = sum(pop)) %>%
  arrange(desc(total_pop))
## # A tibble: 5 × 2
##   continent  total_pop
##   <chr>          <dbl>
## 1 Asia      3811953827
## 2 Africa     929539692
## 3 Americas   898871184
## 4 Europe     586098529
## 5 Oceania     24549947

94.4 创建新变量

94.4.1 问题1 – 创建国内生产总值(GDP)变量,即人口数量乘以人均GDP

这是用 Pandas 方法,

gapminder['GDP'] = gapminder['pop'] * gapminder['gdpPercap']
gapminder.head()
##        country continent  year  lifeExp       pop   gdpPercap           GDP
## 0  Afghanistan      Asia  1952   28.801   8425333  779.445314  6.567086e+09
## 1  Afghanistan      Asia  1957   30.332   9240934  820.853030  7.585449e+09
## 2  Afghanistan      Asia  1962   31.997  10267083  853.100710  8.758856e+09
## 3  Afghanistan      Asia  1967   34.020  11537966  836.197138  9.648014e+09
## 4  Afghanistan      Asia  1972   36.088  13079460  739.981106  9.678553e+09

这是用 dplyr 方法

gapminder %>%
  mutate(GDP = pop * gdpPercap)
## # A tibble: 1,704 × 7
##    country     continent  year lifeExp      pop gdpPercap          GDP
##    <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # ℹ 1,694 more rows

94.4.2 问题2 – 人均GDP排名前90%的前十个国家

这是用 Pandas 方法,

gapminder_2007 = gapminder[gapminder['year'] == 2007]
gapminder_2007['percentile'] = gapminder_2007['gdpPercap'].rank(pct=True)
## <string>:1: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
gapminder_2007.sort_values(by='percentile', ascending=False)[:10]
##                country continent  year  ...    gdpPercap           GDP  percentile
## 1151            Norway    Europe  2007  ...  49357.19017  2.284214e+11    1.000000
## 863             Kuwait      Asia  2007  ...  47306.98978  1.185305e+11    0.992958
## 1367         Singapore      Asia  2007  ...  47143.17964  2.146433e+11    0.985915
## 1619     United States  Americas  2007  ...  42951.65309  1.293446e+13    0.978873
## 755            Ireland    Europe  2007  ...  40675.99635  1.671412e+11    0.971831
## 671   Hong Kong, China      Asia  2007  ...  39724.97867  2.772967e+11    0.964789
## 1487       Switzerland    Europe  2007  ...  37506.41907  2.833483e+11    0.957746
## 1091       Netherlands    Europe  2007  ...  36797.93332  6.097643e+11    0.950704
## 251             Canada  Americas  2007  ...  36319.23501  1.212704e+12    0.943662
## 695            Iceland    Europe  2007  ...  36180.78919  1.092410e+10    0.936620
## 
## [10 rows x 8 columns]

这是用 dplyr 方法

gapminder %>%
  filter(year == 2007) %>%
  mutate(percentile = ntile(gdpPercap, 100)) %>%
  arrange(desc(percentile)) %>%
  top_n(10, wt = percentile)
## # A tibble: 10 × 7
##    country          continent  year lifeExp       pop gdpPercap percentile
##    <chr>            <chr>     <dbl>   <dbl>     <dbl>     <dbl>      <int>
##  1 Norway           Europe     2007    80.2   4627926    49357.        100
##  2 Kuwait           Asia       2007    77.6   2505559    47307.         99
##  3 Singapore        Asia       2007    80.0   4553009    47143.         98
##  4 United States    Americas   2007    78.2 301139947    42952.         97
##  5 Ireland          Europe     2007    78.9   4109086    40676.         96
##  6 Hong Kong, China Asia       2007    82.2   6980412    39725.         95
##  7 Switzerland      Europe     2007    81.7   7554661    37506.         94
##  8 Netherlands      Europe     2007    79.8  16570613    36798.         93
##  9 Canada           Americas   2007    80.7  33390141    36319.         92
## 10 Iceland          Europe     2007    81.8    301931    36181.         91

94.5 小节

根据以上有限的对比,感觉dplyr语法要简练和清晰些。但是,这并不意味我们一定在Pandas 和 dplyr做取舍,因为解决现实问题,往往取决于我们对工具的熟练程度。以上只是个人观点(哈哈哈)。