Chapter 41 The beauty of across() in the tidyverse, part 2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

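41.1 The old pain point

dplyr 1.0.0 introduced across(), yet another reminder of how powerful and user-friendly dplyr is. across() pairs very naturally with summarise() and mutate() (see Chapter 40), but it works less well with filter(). Suppose, for example, that we want to keep the rows of a data frame that contain missing values: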

## Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
## ℹ Please use `if_any()` or `if_all()` instead.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 0 × 8
## # ℹ 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## #   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## #   year <int>

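The code runs, but the result is clearly wrong: it returns zero rows. The reason is that filter() combines the columns returned by across() with &, so a row is kept only when every column is missing. After a long search, I found that the only way to get the intended result was filter_all(), a function that predates dplyr 1.0.0: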

penguins %>% 
  filter_all( any_vars(is.na(.)) )
## # A tibble: 11 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           NA            NA                  NA          NA
##  2 Adelie  Torgersen           34.1          18.1               193        3475
##  3 Adelie  Torgersen           42            20.2               190        4250
##  4 Adelie  Torgersen           37.8          17.1               186        3300
##  5 Adelie  Torgersen           37.8          17.3               180        3700
##  6 Adelie  Dream               37.5          18.9               179        2975
##  7 Gentoo  Biscoe              44.5          14.3               216        4100
##  8 Gentoo  Biscoe              46.2          14.4               214        4650
##  9 Gentoo  Biscoe              47.3          13.8               216        4725
## 10 Gentoo  Biscoe              44.5          15.7               217        4875
## 11 Gentoo  Biscoe              NA            NA                  NA          NA
## # ℹ 2 more variables: sex <fct>, year <int>

It left the feeling that, on the road to simplicity, something was still missing.

41.2 dplyr 1.0.4: if_any() and if_all()

dplyr 1.0.4 added two functions, if_any() and if_all(), that fill exactly this gap:

penguins %>% 
  filter(if_any(everything(), is.na))
## # A tibble: 11 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           NA            NA                  NA          NA
##  2 Adelie  Torgersen           34.1          18.1               193        3475
##  3 Adelie  Torgersen           42            20.2               190        4250
##  4 Adelie  Torgersen           37.8          17.1               186        3300
##  5 Adelie  Torgersen           37.8          17.3               180        3700
##  6 Adelie  Dream               37.5          18.9               179        2975
##  7 Gentoo  Biscoe              44.5          14.3               216        4100
##  8 Gentoo  Biscoe              46.2          14.4               214        4650
##  9 Gentoo  Biscoe              47.3          13.8               216        4725
## 10 Gentoo  Biscoe              44.5          15.7               217        4875
## 11 Gentoo  Biscoe              NA            NA                  NA          NA
## # ℹ 2 more variables: sex <fct>, year <int>

Judging by their signatures, if_any() and if_all() are the counterparts of across():

across(.cols = everything(), .fns = NULL, ..., .names = NULL)

if_any(.cols, .fns = NULL, ..., .names = NULL)

if_all(.cols, .fns = NULL, ..., .names = NULL)

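In other words, across() applies a function column by column, while if_any() and if_all() combine the per-column results within each row. With both in hand we can move freely in either direction; who could stand against us?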
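To make the contrast concrete, here is a minimal sketch on a small made-up tibble (the `toy` data and the result names are assumptions for illustration, not part of the penguins data used in this chapter). It applies the same `is.na` check once per column with across(), and once per row with if_any() and if_all().

```r
library(dplyr)

# Hypothetical toy data: row 1 has one NA, row 2 is all NA, row 3 is complete
toy <- tibble(
  a = c(1, NA, 3),
  b = c(NA, NA, 6)
)

# across(): one result per selected column (here, the number of NAs)
n_missing <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# if_any(): keep a row when the predicate is TRUE for at least one column
any_missing <- toy %>%
  filter(if_any(everything(), is.na))

# if_all(): keep a row only when the predicate is TRUE for every column
all_missing <- toy %>%
  filter(if_all(everything(), is.na))

n_missing
any_missing
all_missing
```

With this toy data, if_any() keeps rows 1 and 2, while if_all() keeps only row 2.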

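41.3 Worked examples

The examples below show the two new functions in action; some of them come from the official website.

  • Keep rows that contain a missing value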
penguins %>% 
  filter(if_any(everything(), is.na))
## # A tibble: 11 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           NA            NA                  NA          NA
##  2 Adelie  Torgersen           34.1          18.1               193        3475
##  3 Adelie  Torgersen           42            20.2               190        4250
##  4 Adelie  Torgersen           37.8          17.1               186        3300
##  5 Adelie  Torgersen           37.8          17.3               180        3700
##  6 Adelie  Dream               37.5          18.9               179        2975
##  7 Gentoo  Biscoe              44.5          14.3               216        4100
##  8 Gentoo  Biscoe              46.2          14.4               214        4650
##  9 Gentoo  Biscoe              47.3          13.8               216        4725
## 10 Gentoo  Biscoe              44.5          15.7               217        4875
## 11 Gentoo  Biscoe              NA            NA                  NA          NA
## # ℹ 2 more variables: sex <fct>, year <int>

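Or, more briefly (as the warning below shows, omitting .cols is deprecated as of dplyr 1.1.0, so the explicit form above is preferable):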

penguins %>% 
  filter(if_any(.fns = is.na))
## Warning: Using `if_any()` without supplying `.cols` was deprecated in dplyr 1.1.0.
## ℹ Please supply `.cols` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## # A tibble: 11 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           NA            NA                  NA          NA
##  2 Adelie  Torgersen           34.1          18.1               193        3475
##  3 Adelie  Torgersen           42            20.2               190        4250
##  4 Adelie  Torgersen           37.8          17.1               186        3300
##  5 Adelie  Torgersen           37.8          17.3               180        3700
##  6 Adelie  Dream               37.5          18.9               179        2975
##  7 Gentoo  Biscoe              44.5          14.3               216        4100
##  8 Gentoo  Biscoe              46.2          14.4               214        4650
##  9 Gentoo  Biscoe              47.3          13.8               216        4725
## 10 Gentoo  Biscoe              44.5          15.7               217        4875
## 11 Gentoo  Biscoe              NA            NA                  NA          NA
## # ℹ 2 more variables: sex <fct>, year <int>
  • Keep rows in which every value is missing
penguins %>% 
  filter(if_all(everything(), is.na))
## # A tibble: 0 × 8
## # ℹ 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## #   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## #   year <int>
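The opposite question, keeping only the rows with no missing values, can be written in two equivalent ways, since "all columns are non-missing" is the same as "not any column is missing". The sketch below uses a made-up tibble (`toy`, `complete_1`, and `complete_2` are illustrative names, not from the original text).

```r
library(dplyr)

# Hypothetical toy data: only row 3 is complete
toy <- tibble(
  a = c(1, NA, 3),
  b = c(NA, NA, 6)
)

# Every column is non-missing
complete_1 <- toy %>%
  filter(if_all(everything(), ~ !is.na(.x)))

# No column is missing (negating if_any())
complete_2 <- toy %>%
  filter(!if_any(everything(), is.na))

complete_1
complete_2
```

Both calls keep the same single row here; which spelling reads better is a matter of taste.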
  • Keep rows where both bill measurements (length and depth) exceed 21 mm
penguins %>% 
  filter(if_all(contains("bill"), ~ . > 21))
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           38.6          21.2               191        3800
## 2 Adelie  Torgersen           34.6          21.1               198        4400
## 3 Adelie  Torgersen           46            21.5               194        4200
## 4 Adelie  Dream               39.2          21.1               196        4150
## 5 Adelie  Dream               42.3          21.2               191        4150
## 6 Adelie  Biscoe              41.3          21.1               195        4400
## # ℹ 2 more variables: sex <fct>, year <int>

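Or, to be a little flashier, pass the operator itself along with its extra argument (the dplyr documentation marks passing extra arguments through ... as deprecated since 1.1.0, so the lambda form above is the safer choice):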

penguins %>% 
  filter(if_all(contains("bill"), `>`, 21))
## # A tibble: 6 × 8
##   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
## 1 Adelie  Torgersen           38.6          21.2               191        3800
## 2 Adelie  Torgersen           34.6          21.1               198        4400
## 3 Adelie  Torgersen           46            21.5               194        4200
## 4 Adelie  Dream               39.2          21.1               196        4150
## 5 Adelie  Dream               42.3          21.2               191        4150
## 6 Adelie  Biscoe              41.3          21.1               195        4400
## # ℹ 2 more variables: sex <fct>, year <int>
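The predicate can be spelled in more than one way. As a small sketch on made-up values (the `toy` tibble and the result names are assumptions for illustration), a purrr-style formula and an ordinary anonymous function select the same rows:

```r
library(dplyr)

# Hypothetical toy data with the same column names as penguins
toy <- tibble(
  bill_length_mm = c(39.1, 46.0, 20.5),
  bill_depth_mm  = c(18.7, 21.5, 22.0)
)

# purrr-style formula; .x stands for the current column
by_formula <- toy %>%
  filter(if_all(contains("bill"), ~ .x > 21))

# ordinary anonymous function
by_function <- toy %>%
  filter(if_all(contains("bill"), function(x) x > 21))

by_formula
by_function
```

Only the second row has both measurements above 21, so both calls return that row.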
  • Keep rows where either bill measurement (length or depth) exceeds 21 mm
penguins %>% 
  filter(if_any(contains("bill"), ~ . > 21))
## # A tibble: 342 × 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           34.1          18.1               193        3475
##  9 Adelie  Torgersen           42            20.2               190        4250
## 10 Adelie  Torgersen           37.8          17.1               186        3300
## # ℹ 332 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
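  • In the selected columns (bill length and bill depth), check the values in each row and keep the row if every value exceeds the mean of its own column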
bigger_than_mean <- function(x) {
  x > mean(x, na.rm = TRUE)
}

penguins %>% 
  filter(if_all(contains("bill"), bigger_than_mean))
## # A tibble: 61 × 8
##    species   island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>     <fct>             <dbl>         <dbl>             <int>       <int>
##  1 Adelie    Torgers…           46            21.5               194        4200
##  2 Adelie    Dream              44.1          19.7               196        4400
##  3 Adelie    Torgers…           45.8          18.9               197        4150
##  4 Adelie    Biscoe             45.6          20.3               191        4600
##  5 Adelie    Torgers…           44.1          18                 210        4000
##  6 Gentoo    Biscoe             44.4          17.3               219        5250
##  7 Gentoo    Biscoe             50.8          17.3               228        5600
##  8 Chinstrap Dream              46.5          17.9               192        3500
##  9 Chinstrap Dream              50            19.5               196        3900
## 10 Chinstrap Dream              51.3          19.2               193        3650
## # ℹ 51 more rows
## # ℹ 2 more variables: sex <fct>, year <int>
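It is worth seeing what happens to rows with missing measurements. In the sketch below (made-up data; `toy`, `flags`, and `kept` are illustrative names), comparing NA with the column mean yields NA, and filter() drops rows whose condition is NA.

```r
library(dplyr)

bigger_than_mean <- function(x) {
  x > mean(x, na.rm = TRUE)
}

# Hypothetical toy data: column means are 40 and 18; row 4 is missing
toy <- tibble(
  bill_length_mm = c(30, 40, 50, NA),
  bill_depth_mm  = c(15, 17, 22, NA)
)

# Inside mutate(), if_all() returns one logical value per row
flags <- toy %>%
  mutate(both_big = if_all(contains("bill"), bigger_than_mean))

# filter() keeps only the rows where the condition is TRUE
kept <- toy %>%
  filter(if_all(contains("bill"), bigger_than_mean))

flags
kept
```

The row with missing values gets `both_big = NA` rather than FALSE, which matters as soon as the result feeds into something other than filter().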
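  • In the selected columns (bill length and bill depth), check the values in each row: if all of them exceed their column means, label the row "both big"; if at least one of them does, label it "one big" (note that if_all() must come before if_any() inside case_when(), because case_when() uses the first condition that matches)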
penguins %>% 
  filter(!is.na(bill_length_mm)) %>% 
  mutate(
    category = case_when(
      if_all(contains("bill"), bigger_than_mean) ~ "both big", 
      if_any(contains("bill"), bigger_than_mean) ~ "one big", 
      TRUE                          ~ "small"
    ))
## # A tibble: 342 × 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           36.7          19.3               193        3450
##  5 Adelie  Torgersen           39.3          20.6               190        3650
##  6 Adelie  Torgersen           38.9          17.8               181        3625
##  7 Adelie  Torgersen           39.2          19.6               195        4675
##  8 Adelie  Torgersen           34.1          18.1               193        3475
##  9 Adelie  Torgersen           42            20.2               190        4250
## 10 Adelie  Torgersen           37.8          17.1               186        3300
## # ℹ 332 more rows
## # ℹ 3 more variables: sex <fct>, year <int>, category <chr>
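To see why the order matters, the following sketch runs both orderings on a small made-up tibble (the `toy` data and the names `right_order` and `wrong_order` are assumptions for illustration). Any row that satisfies if_all() also satisfies if_any(), so putting if_any() first makes the "both big" branch unreachable.

```r
library(dplyr)

bigger_than_mean <- function(x) {
  x > mean(x, na.rm = TRUE)
}

# Hypothetical toy data: column means are 43 and 18
toy <- tibble(
  bill_length_mm = c(30, 40, 50, 52),
  bill_depth_mm  = c(15, 20, 22, 15)
)

# if_all() first: the stricter condition gets a chance to match
right_order <- toy %>%
  mutate(category = case_when(
    if_all(contains("bill"), bigger_than_mean) ~ "both big",
    if_any(contains("bill"), bigger_than_mean) ~ "one big",
    TRUE                                       ~ "small"
  ))

# if_any() first: it also matches every "both big" row, so that label never appears
wrong_order <- toy %>%
  mutate(category = case_when(
    if_any(contains("bill"), bigger_than_mean) ~ "one big",
    if_all(contains("bill"), bigger_than_mean) ~ "both big",
    TRUE                                       ~ "small"
  ))

right_order$category
wrong_order$category
```

In this toy data the third row is above both means; it is labelled "both big" only in the first version.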