第 76 章 探索性数据分析-诺奖获得者

探索性数据分析(exporatory data analysis)是各种知识的综合运用。本章通过一个案例,讲解探索性数据分析的基本思路,也算是对前面几章内容的一次总结复习。

76.1 探索性

  • 数据准备(对数据要做到心中有数)

    • 描述变量
    • 数据结构
    • 缺失值及其处理
  • 数据探索(围绕探索的目标)

    • 数据规整
    • 可视化
    • 建模

76.2 数据集

这是一个诺贝尔奖获得者的数据集,

76.3 导入数据

df <- read_csv("./demo_data/nobel_winners.csv")
df
## # A tibble: 969 × 18
##    prize_year category   prize  motivation prize_share laureate_id laureate_type
##         <dbl> <chr>      <chr>  <chr>      <chr>             <dbl> <chr>        
##  1       1901 Chemistry  The N… "\"in rec… 1/1                 160 Individual   
##  2       1901 Literature The N… "\"in spe… 1/1                 569 Individual   
##  3       1901 Medicine   The N… "\"for hi… 1/1                 293 Individual   
##  4       1901 Peace      The N…  <NA>      1/2                 462 Individual   
##  5       1901 Peace      The N…  <NA>      1/2                 463 Individual   
##  6       1901 Physics    The N… "\"in rec… 1/1                   1 Individual   
##  7       1902 Chemistry  The N… "\"in rec… 1/1                 161 Individual   
##  8       1902 Literature The N… "\"the gr… 1/1                 571 Individual   
##  9       1902 Medicine   The N… "\"for hi… 1/1                 294 Individual   
## 10       1902 Peace      The N…  <NA>      1/2                 464 Individual   
## # ℹ 959 more rows
## # ℹ 11 more variables: full_name <chr>, birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>, organization_name <chr>,
## #   organization_city <chr>, organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>

如果是xlsx格式

readxl::read_excel("myfile.xlsx")

如果是csv格式

readr::read_csv("myfile.csv")

这里有个小小的提示:

  • 路径(包括文件名), 不要用中文和空格
  • 数据框中变量,也不要有中文和空格(可用下划线代替空格)

76.4 数据结构

一行就是一个诺奖获得者的记录? 确定?

缺失值及其处理

df %>% map_df(~ sum(is.na(.)))
## # A tibble: 1 × 18
##   prize_year category prize motivation prize_share laureate_id laureate_type
##        <int>    <int> <int>      <int>       <int>       <int>         <int>
## 1          0        0     0         88           0           0             0
## # ℹ 11 more variables: full_name <int>, birth_date <int>, birth_city <int>,
## #   birth_country <int>, gender <int>, organization_name <int>,
## #   organization_city <int>, organization_country <int>, death_date <int>,
## #   death_city <int>, death_country <int>

性别缺失怎么造成的?

df %>% count(laureate_type)
## # A tibble: 2 × 2
##   laureate_type     n
##   <chr>         <int>
## 1 Individual      939
## 2 Organization     30

76.5 我们想探索哪些问题?

你想关心哪些问题,可能是

  • 每个学科颁过多少次奖?
  • 这些大神都是哪个年代的人?
  • 性别比例
  • 平均年龄和获奖数量
  • 最年轻的诺奖获得者是谁?
  • 中国诺奖获得者有哪些?
  • 得奖的时候多大年龄?
  • 获奖者所在国家的经济情况?
  • 有大神多次获得诺贝尔奖,而且在不同科学领域获奖?
  • 出生地分布?工作地分布?迁移模式?
  • GDP经济与诺奖模型?
  • 诺奖分享情况?

76.6 每个学科颁过多少次奖

df %>% count(category)
## # A tibble: 6 × 2
##   category       n
##   <chr>      <int>
## 1 Chemistry    194
## 2 Economics     83
## 3 Literature   113
## 4 Medicine     227
## 5 Peace        130
## 6 Physics      222
df %>%
  count(category) %>%
  ggplot(aes(x = category, y = n, fill = category)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.25) +
  theme(legend.position = "none")
df %>%
  count(category) %>%
  ggplot(aes(x = fct_reorder(category, n), y = n, fill = category)) +
  geom_col() +
  geom_text(aes(label = n), vjust = -0.25) +
  labs(title = "Number of Nobel prizes in different disciplines") +
  theme(legend.position = "none")

也可以使用别人定义好的配色方案

library(ggthemr) # install.packages("devtools")
# devtools::install_github('cttobin/ggthemr')
ggthemr("dust")

df %>%
  count(category) %>%
  ggplot(aes(x = fct_reorder(category, n), y = n, fill = category)) +
  geom_col() +
  labs(title = "Number of Nobel prizes in different disciplines") +
  theme(legend.position = "none")

这个配色方案感觉挺好看的呢,比较适合我这种又挑剔又懒惰的人。

当然,也可以自己DIY,或者使用配色网站的主题方案(https://learnui.design/tools/data-color-picker.html#palette)

df %>%
  count(category) %>%
  ggplot(aes(x = fct_reorder(category, n), y = n)) +
  geom_col(fill = c("#003f5c", "#444e86", "#955196", "#dd5182", "#ff6e54", "#ffa600")) +
  labs(title = "Number of Nobel prizes in different disciplines") +
  theme(legend.position = "none")

让图骚动起来吧

library(gganimate) # install.packages("gganimate", dependencies = T)

df %>%
  count(category) %>%
  mutate(category = fct_reorder(category, n)) %>%
  ggplot(aes(x = category, y = n)) +
  geom_text(aes(label = n), vjust = -0.25) +
  geom_col(fill = c("#003f5c", "#444e86", "#955196", "#dd5182", "#ff6e54", "#ffa600")) +
  labs(title = "Number of Nobel prizes in different disciplines") +
  theme(legend.position = "none") +
  transition_states(category) +
  shadow_mark(past = TRUE)

和ggplot2的分面一样,动态图可以增加数据展示的维度。

76.7 看看我们伟大的祖国

df %>%
  dplyr::filter(birth_country == "China") %>%
  dplyr::select(full_name, prize_year, category)
## # A tibble: 12 × 3
##    full_name              prize_year category  
##    <chr>                       <dbl> <chr>     
##  1 Walter Houser Brattain       1956 Physics   
##  2 Chen Ning Yang               1957 Physics   
##  3 Tsung-Dao (T.D.) Lee         1957 Physics   
##  4 Edmond H. Fischer            1992 Medicine  
##  5 Daniel C. Tsui               1998 Physics   
##  6 Gao Xingjian                 2000 Literature
##  7 Charles Kuen Kao             2009 Physics   
##  8 Charles Kuen Kao             2009 Physics   
##  9 Ei-ichi Negishi              2010 Chemistry 
## 10 Liu Xiaobo                   2010 Peace     
## 11 Mo Yan                       2012 Literature
## 12 Youyou Tu                    2015 Medicine

我们发现获奖者有多个地址,就会有重复的情况,比如 Charles Kuen Kao在2009年Physics有两次,为什么重复计数了呢?

下面我们去重吧, 去重可以用distinct()函数

dt <- tibble::tribble(
  ~x, ~y, ~z,
  1, 1, "a",
  1, 1, "b",
  1, 2, "c",
  1, 2, "d"
)

dt
## # A tibble: 4 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     1     1 a    
## 2     1     1 b    
## 3     1     2 c    
## 4     1     2 d
dt %>% distinct_at(vars(x), .keep_all = T)
## # A tibble: 1 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     1     1 a
dt %>% distinct_at(vars(x, y), .keep_all = T)
## # A tibble: 2 × 3
##       x     y z    
##   <dbl> <dbl> <chr>
## 1     1     1 a    
## 2     1     2 c
nobel_winners <- df %>%
  mutate_if(is.character, tolower) %>%
  distinct_at(vars(full_name, prize_year, category), .keep_all = TRUE) %>%
  mutate(
    decade = 10 * (prize_year %/% 10),
    prize_age = prize_year - year(birth_date)
  )

nobel_winners
## # A tibble: 911 × 20
##    prize_year category   prize  motivation prize_share laureate_id laureate_type
##         <dbl> <chr>      <chr>  <chr>      <chr>             <dbl> <chr>        
##  1       1901 chemistry  the n… "\"in rec… 1/1                 160 individual   
##  2       1901 literature the n… "\"in spe… 1/1                 569 individual   
##  3       1901 medicine   the n… "\"for hi… 1/1                 293 individual   
##  4       1901 peace      the n…  <NA>      1/2                 462 individual   
##  5       1901 peace      the n…  <NA>      1/2                 463 individual   
##  6       1901 physics    the n… "\"in rec… 1/1                   1 individual   
##  7       1902 chemistry  the n… "\"in rec… 1/1                 161 individual   
##  8       1902 literature the n… "\"the gr… 1/1                 571 individual   
##  9       1902 medicine   the n… "\"for hi… 1/1                 294 individual   
## 10       1902 peace      the n…  <NA>      1/2                 464 individual   
## # ℹ 901 more rows
## # ℹ 13 more variables: full_name <chr>, birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>, organization_name <chr>,
## #   organization_city <chr>, organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>, decade <dbl>, prize_age <dbl>

这是时候,我们才对数据有了一个初步的了解

再来看看我的祖国

nobel_winners %>%
  dplyr::filter(birth_country == "china") %>%
  dplyr::select(full_name, prize_year, category)
## # A tibble: 11 × 3
##    full_name              prize_year category  
##    <chr>                       <dbl> <chr>     
##  1 walter houser brattain       1956 physics   
##  2 chen ning yang               1957 physics   
##  3 tsung-dao (t.d.) lee         1957 physics   
##  4 edmond h. fischer            1992 medicine  
##  5 daniel c. tsui               1998 physics   
##  6 gao xingjian                 2000 literature
##  7 charles kuen kao             2009 physics   
##  8 ei-ichi negishi              2010 chemistry 
##  9 liu xiaobo                   2010 peace     
## 10 mo yan                       2012 literature
## 11 youyou tu                    2015 medicine

76.8 哪些大神多次获得诺贝尔奖

nobel_winners %>% count(full_name, sort = T)
## # A tibble: 904 × 2
##    full_name                                                                   n
##    <chr>                                                                   <int>
##  1 comité international de la croix rouge (international committee of the…     3
##  2 frederick sanger                                                            2
##  3 john bardeen                                                                2
##  4 linus carl pauling                                                          2
##  5 marie curie, née sklodowska                                                 2
##  6 office of the united nations high commissioner for refugees (unhcr)         2
##  7 a. michael spence                                                           1
##  8 aage niels bohr                                                             1
##  9 aaron ciechanover                                                           1
## 10 aaron klug                                                                  1
## # ℹ 894 more rows
nobel_winners %>%
  group_by(full_name) %>%
  mutate(
    number_prize = n(),
    number_cateory = n_distinct(category)
  ) %>%
  arrange(desc(number_prize), full_name) %>%
  dplyr::filter(number_cateory == 2)
## # A tibble: 4 × 22
## # Groups:   full_name [2]
##   prize_year category  prize    motivation prize_share laureate_id laureate_type
##        <dbl> <chr>     <chr>    <chr>      <chr>             <dbl> <chr>        
## 1       1954 chemistry the nob… "\"for hi… 1/1                 217 individual   
## 2       1962 peace     the nob…  <NA>      1/1                 217 individual   
## 3       1903 physics   the nob… "\"in rec… 1/4                   6 individual   
## 4       1911 chemistry the nob… "\"in rec… 1/1                   6 individual   
## # ℹ 15 more variables: full_name <chr>, birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>, organization_name <chr>,
## #   organization_city <chr>, organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>, decade <dbl>, prize_age <dbl>,
## #   number_prize <int>, number_cateory <int>

76.9 大神在得奖的时候是多大年龄?

nobel_winners %>%
  count(prize_age) %>%
  ggplot(aes(x = prize_age, y = n)) +
  geom_col()
## Warning: Removed 1 rows containing missing values (`position_stack()`).
nobel_winners %>%
  group_by(category) %>%
  summarise(mean_prize_age = mean(prize_age, na.rm = T))
## # A tibble: 6 × 2
##   category   mean_prize_age
##   <chr>               <dbl>
## 1 chemistry            58.0
## 2 economics            67.2
## 3 literature           64.7
## 4 medicine             58.0
## 5 peace                61.4
## 6 physics              55.4
nobel_winners %>%
  mutate(category = fct_reorder(category, prize_age, median, na.rm = TRUE)) %>%
  ggplot(aes(category, prize_age)) +
  geom_point() +
  geom_boxplot() +
  coord_flip()
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `category = fct_reorder(category, prize_age, median, na.rm =
##   TRUE)`.
## Caused by warning:
## ! `fct_reorder()` removing 30 missing values.
## ℹ Use `.na_rm = TRUE` to silence this message.
## ℹ Use `.na_rm = FALSE` to preserve NAs.
## Warning: Removed 30 rows containing non-finite values (`stat_boxplot()`).
## Warning: Removed 30 rows containing missing values (`geom_point()`).
nobel_winners %>%
  dplyr::filter(!is.na(prize_age)) %>%
  group_by(decade, category) %>%
  summarize(
    average_age = mean(prize_age),
    median_age = median(prize_age)
  ) %>%
  ggplot(aes(decade, average_age, color = category)) +
  geom_line()
## `summarise()` has grouped output by 'decade'. You can override using the
## `.groups` argument.
## Warning: package 'ggridges' was built under R version 4.2.2
nobel_winners %>%
  ggplot(aes(
    x = prize_age,
    y = category,
    fill = category
  )) +
  geom_density_ridges()
## Picking joint bandwidth of 3.78
## Warning: Removed 30 rows containing non-finite values
## (`stat_density_ridges()`).

他们60多少岁才得诺奖,大家才23或24岁,还年轻,不用焦虑喔。

nobel_winners %>%

  ggplot(aes(x = prize_age, fill = category, color = category)) +
  geom_density() +
  facet_wrap(vars(category)) +
  theme(legend.position = "none")
## Warning: Removed 30 rows containing non-finite values (`stat_density()`).

有同学说要一个个的画,至于group_split()函数,下次课在讲

nobel_winners %>%
  group_split(category) %>%
  map(
    ~ ggplot(data = .x, aes(x = prize_age)) +
      geom_density() +
      ggtitle(.x$category)
  )
## [[1]]
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).
## 
## [[2]]
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).
## 
## [[3]]
## 
## [[4]]
## 
## [[5]]
## Warning: Removed 27 rows containing non-finite values (`stat_density()`).
## 
## [[6]]
## Warning: Removed 1 rows containing non-finite values (`stat_density()`).

也可以用强大的group_by() + group_map()组合,我们会在第 37 章讲到

nobel_winners %>%
  group_by(category) %>%
  group_map(
    ~ ggplot(data = .x, aes(x = prize_age)) +
      geom_density() +
      ggtitle(.y)
  )

76.10 性别比例

nobel_winners %>%
  dplyr::filter(laureate_type == "individual") %>%
  count(category, gender) %>%
  group_by(category) %>%
  mutate(prop = n / sum(n))
## # A tibble: 12 × 4
## # Groups:   category [6]
##    category   gender     n    prop
##    <chr>      <chr>  <int>   <dbl>
##  1 chemistry  female     4 0.0229 
##  2 chemistry  male     171 0.977  
##  3 economics  female     1 0.0128 
##  4 economics  male      77 0.987  
##  5 literature female    14 0.124  
##  6 literature male      99 0.876  
##  7 medicine   female    12 0.0569 
##  8 medicine   male     199 0.943  
##  9 peace      female    14 0.14   
## 10 peace      male      86 0.86   
## 11 physics    female     2 0.00980
## 12 physics    male     202 0.990

各年代性别比例

nobel_winners %>%
  dplyr::filter(laureate_type == "individual") %>%
  # mutate(decade = glue::glue("{round(prize_year - 1, -1)}s")) %>%
  count(decade, category, gender) %>%
  group_by(decade, category) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot(aes(decade, category, fill = prop)) +
  geom_tile(size = 0.7) +
  # geom_text(aes(label = scales::percent(prop, accuracy = .01))) +
  geom_text(aes(label = scales::number(prop, accuracy = .01))) +
  facet_grid(vars(gender)) +
  scale_fill_gradient(low = "#FDF4E9", high = "#834C0D")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
library(ggbeeswarm) # install.packages("ggbeeswarm")
## Warning: package 'ggbeeswarm' was built under R version 4.2.3
nobel_winners %>%
  ggplot(aes(
    x = category,
    y = prize_age,
    colour = gender,
    alpha = gender
  )) +
  ggbeeswarm::geom_beeswarm() +
  coord_flip() +
  scale_color_manual(values = c("#BB1288", "#5867A6")) +
  scale_alpha_manual(values = c(1, .4)) +
  theme_minimal() +
  theme(legend.position = "top") +
  labs(
    title = "Gender imbalance of Nobel laureates",
    subtitle = "data frome 1901-2016",
    colour = "Gender",
    alpha = "Gender",
    y = "age in prize"
  )
## Warning: Removed 30 rows containing missing values (`geom_point()`).
nobel_winners %>%
  count(decade,
    category,
    gender = coalesce(gender, laureate_type)
  ) %>%
  group_by(decade, category) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(decade, n, fill = gender)) +
  geom_col() +
  facet_wrap(~category) +
  labs(
    x = "Decade",
    y = "# of nobel prize winners",
    fill = "Gender",
    title = "Nobel Prize gender distribution over time"
  )

76.11 这些大神都是哪个年代出生的人?

nobel_winners %>%
  select(category, birth_date) %>%
  mutate(year = floor(year(birth_date) / 10) * 10) %>%
  count(category, year) %>%
  dplyr::filter(!is.na(year)) %>%
  ggplot(aes(x = year, y = n)) +
  geom_col() +
  scale_x_continuous(breaks = seq(1810, 1990, 20)) +
  geom_text(aes(label = n), vjust = -0.25) +
  facet_wrap(vars(category))

课堂练习,哪位同学能把图弄得好看些?

76.12 最年轻的诺奖获得者?

nobel_winners %>%
  dplyr::filter(prize_age == min(prize_age, na.rm = T))
## # A tibble: 1 × 20
##   prize_year category prize     motivation prize_share laureate_id laureate_type
##        <dbl> <chr>    <chr>     <chr>      <chr>             <dbl> <chr>        
## 1       2014 peace    the nobe… "\"for th… 1/2                 914 individual   
## # ℹ 13 more variables: full_name <chr>, birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>, organization_name <chr>,
## #   organization_city <chr>, organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>, decade <dbl>, prize_age <dbl>
nobel_winners %>%
  dplyr::filter(
    rank(prize_year - year(birth_date)) == 1
  )
## # A tibble: 1 × 20
##   prize_year category prize     motivation prize_share laureate_id laureate_type
##        <dbl> <chr>    <chr>     <chr>      <chr>             <dbl> <chr>        
## 1       2014 peace    the nobe… "\"for th… 1/2                 914 individual   
## # ℹ 13 more variables: full_name <chr>, birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>, organization_name <chr>,
## #   organization_city <chr>, organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>, decade <dbl>, prize_age <dbl>
nobel_winners %>%
  arrange(
    prize_year - year(birth_date)
  )
## # A tibble: 911 × 20
##    prize_year category prize    motivation prize_share laureate_id laureate_type
##         <dbl> <chr>    <chr>    <chr>      <chr>             <dbl> <chr>        
##  1       2014 peace    the nob… "\"for th… 1/2                 914 individual   
##  2       1915 physics  the nob… "\"for th… 1/2                  21 individual   
##  3       1932 physics  the nob… "\"for th… 1/1                  38 individual   
##  4       1933 physics  the nob… "\"for th… 1/2                  40 individual   
##  5       1936 physics  the nob… "\"for hi… 1/2                  43 individual   
##  6       1957 physics  the nob… "\"for th… 1/2                  69 individual   
##  7       1923 medicine the nob… "\"for th… 1/2                 313 individual   
##  8       1961 physics  the nob… "\"for hi… 1/2                  76 individual   
##  9       1976 peace    the nob…  <NA>      1/2                 536 individual   
## 10       2011 peace    the nob… "\"for th… 1/3                 871 individual   
## # ℹ 901 more rows
## # ℹ 13 more variables: full_name <chr>, birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>, organization_name <chr>,
## #   organization_city <chr>, organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>, decade <dbl>, prize_age <dbl>
nobel_winners %>%
  top_n(1, year(birth_date) - prize_year)
## # A tibble: 1 × 20
##   prize_year category prize     motivation prize_share laureate_id laureate_type
##        <dbl> <chr>    <chr>     <chr>      <chr>             <dbl> <chr>        
## 1       2014 peace    the nobe… "\"for th… 1/2                 914 individual   
## # ℹ 13 more variables: full_name <chr>, birth_date <date>, birth_city <chr>,
## #   birth_country <chr>, gender <chr>, organization_name <chr>,
## #   organization_city <chr>, organization_country <chr>, death_date <date>,
## #   death_city <chr>, death_country <chr>, decade <dbl>, prize_age <dbl>

76.13 平均年龄和获奖数量

df1 <- nobel_winners %>%
  group_by(category) %>%
  summarise(
    mean_prise_age = mean(prize_age, na.rm = T),
    total_num = n()
  )
df1
## # A tibble: 6 × 3
##   category   mean_prise_age total_num
##   <chr>               <dbl>     <int>
## 1 chemistry            58.0       175
## 2 economics            67.2        78
## 3 literature           64.7       113
## 4 medicine             58.0       211
## 5 peace                61.4       130
## 6 physics              55.4       204
df1 %>%
  ggplot(aes(mean_prise_age, total_num)) +
  geom_point(aes(color = category)) +
  geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

76.14 出生地与工作地分布

nobel_winners_clean <- nobel_winners %>%
  mutate_at(
    vars(birth_country, death_country),
    ~ ifelse(str_detect(., "\\("), str_extract(., "(?<=\\().*?(?=\\))"), .)
  ) %>%
  mutate_at(
    vars(birth_country, death_country),
    ~ case_when(
      . == "scotland" ~ "united kingdom",
      . == "northern ireland" ~ "united kingdom",
      str_detect(., "czech") ~ "czechia",
      str_detect(., "germany") ~ "germany",
      TRUE ~ .
    )
  ) %>%
  select(full_name, prize_year, category, birth_date, birth_country, gender, organization_name, organization_country, death_country)
nobel_winners_clean %>% count(death_country, sort = TRUE)
## # A tibble: 45 × 2
##    death_country                n
##    <chr>                    <int>
##  1 <NA>                       329
##  2 united states of america   203
##  3 united kingdom              79
##  4 germany                     56
##  5 france                      51
##  6 sweden                      28
##  7 switzerland                 26
##  8 italy                       14
##  9 russia                      11
## 10 spain                       10
## # ℹ 35 more rows

76.15 迁移模式

nobel_winners_clean %>%
  mutate(
    colour = case_when(
      death_country == "united states of america" ~ "#FF2B4F",
      death_country == "germany" ~ "#fcab27",
      death_country == "united kingdom" ~ "#3686d3",
      death_country == "france" ~ "#88398a",
      death_country == "switzerland" ~ "#20d4bc",
      TRUE ~ "gray60"
    )
  ) %>%
  ggplot(aes(
    x = 0,
    y = fct_rev(factor(birth_country)),
    xend = death_country,
    yend = 1,
    colour = colour,
    alpha = (colour != "gray60")
  )) +
  geom_curve(
    curvature = -0.5,
    arrow = arrow(length = unit(0.01, "npc"))
  ) +
  scale_x_discrete() +
  scale_y_discrete() +
  scale_color_identity() +
  scale_alpha_manual(values = c(0.1, 0.2), guide = F) +
  scale_size_manual(values = c(0.1, 0.4), guide = F) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    plot.background = element_rect(fill = "#F0EFF1", colour = "#F0EFF1"),
    legend.position = "none",
    axis.text.x = element_text(angle = 40, hjust = 1)
  )

76.16 地图

## here() starts at E:/R_for_Data_Science
## Warning: package 'sf' was built under R version 4.2.3
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
## Warning: package 'countrycode' was built under R version 4.2.3
# countrycode('Albania', 'country.name', 'iso3c')

nobel_winners_birth_country <- nobel_winners_clean %>%
  count(birth_country) %>%
  filter(!is.na(birth_country)) %>%
  mutate(ISO3 = countrycode(birth_country,
    origin = "country.name", destination = "iso3c"
  ))


global <-
  sf::st_read("./demo_data/worldmap/TM_WORLD_BORDERS_SIMPL-0.3.shp") %>%
  st_transform(4326)
## Reading layer `TM_WORLD_BORDERS_SIMPL-0.3' from data source 
##   `E:\R_for_Data_Science\demo_data\worldmap\TM_WORLD_BORDERS_SIMPL-0.3.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 246 features and 11 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -180 ymin: -90 xmax: 180 ymax: 83.57027
## Geodetic CRS:  WGS 84
global %>%
  full_join(nobel_winners_birth_country, by = "ISO3") %>%
  ggplot() +
  geom_sf(aes(fill = n),
    color = "white",
    size = 0.1
  ) +
  labs(
    x = NULL, y = NULL,
    title = "Nobel Winners by country",
    subtitle = "color of map indicates number of Nobel lauretes",
    fill = "num of Nobel lauretes",
    caption = "Made: wang_minjie"
  ) +
  scale_fill_gradientn(colors = c("royalblue1", "magenta", "orange", "gold"), na.value = "white") +
  # scale_fill_gradient(low = "wheat1", high = "red") +
  theme_void() +
  theme(
    legend.position = c(0.1, 0.3),
    plot.background = element_rect(fill = "gray")
  )
# Determine to 10 Countries
topCountries <- nobel_winners_clean %>%
  count(birth_country, sort = TRUE) %>%
  na.omit() %>%
  top_n(8)
## Selecting by n
topCountries
## # A tibble: 8 × 2
##   birth_country                n
##   <chr>                    <int>
## 1 united states of america   259
## 2 united kingdom              99
## 3 germany                     80
## 4 france                      54
## 5 sweden                      29
## 6 poland                      26
## 7 russia                      26
## 8 japan                       24
df4 <- nobel_winners_clean %>%
  filter(birth_country %in% topCountries$birth_country) %>%
  group_by(birth_country, category, prize_year) %>%
  summarise(prizes = n()) %>%
  mutate(cumPrizes = cumsum(prizes))
## `summarise()` has grouped output by 'birth_country', 'category'. You can
## override using the `.groups` argument.
df4
## # A tibble: 489 × 5
## # Groups:   birth_country, category [47]
##    birth_country category  prize_year prizes cumPrizes
##    <chr>         <chr>          <dbl>  <int>     <int>
##  1 france        chemistry       1906      1         1
##  2 france        chemistry       1912      2         3
##  3 france        chemistry       1913      1         4
##  4 france        chemistry       1935      2         6
##  5 france        chemistry       1970      1         7
##  6 france        chemistry       1987      1         8
##  7 france        chemistry       2016      1         9
##  8 france        economics       1983      1         1
##  9 france        economics       1988      1         2
## 10 france        economics       2014      1         3
## # ℹ 479 more rows
## Warning: package 'gganimate' was built under R version 4.2.2
df4 %>%
  mutate(prize_year = as.integer(prize_year)) %>%
  ggplot(aes(x = birth_country, y = category, color = birth_country)) +
  geom_point(aes(size = cumPrizes), alpha = 0.6) +
  # geom_text(aes(label = cumPrizes)) +
  scale_size_continuous(range = c(2, 30)) +
  transition_reveal(prize_year) +
  labs(
    title = "Top 10 countries with Nobel Prize winners",
    subtitle = "Year: {frame_along}",
    y = "Category"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 22),
    axis.title = element_blank()
  ) +
  scale_color_brewer(palette = "RdYlBu") +
  theme(legend.position = "none") +
  theme(plot.margin = margin(5.5, 5.5, 5.5, 5.5))

76.17 出生地和工作地不一样的占比

nobel_winners_clean %>%
  select(category, birth_country, death_country) %>%
  mutate(immigration = if_else(birth_country == death_country, 0, 1))
## # A tibble: 911 × 4
##    category   birth_country death_country  immigration
##    <chr>      <chr>         <chr>                <dbl>
##  1 chemistry  netherlands   germany                  1
##  2 literature france        france                   0
##  3 medicine   poland        germany                  1
##  4 peace      switzerland   switzerland              0
##  5 peace      france        france                   0
##  6 physics    germany       germany                  0
##  7 chemistry  germany       germany                  0
##  8 literature germany       germany                  0
##  9 medicine   india         united kingdom           1
## 10 peace      switzerland   switzerland              0
## # ℹ 901 more rows

76.18 诺奖分享者

nobel_winners %>%
  separate(prize_share, into = c("num", "deno"), sep = "/", remove = FALSE)
nobel_winners %>%
  filter(category == "medicine") %>%
  mutate(
    num_a = as.numeric(str_sub(prize_share, 1, 1)),
    num_b = as.numeric(str_sub(prize_share, -1)),
    share = num_a / num_b,
    year = prize_year %% 10,
    decade = 10 * (prize_year %/% 10)
  ) %>%
  group_by(prize_year) %>%
  mutate(n = row_number()) %>%
  ggplot() +
  geom_col(aes(x = "", y = share, fill = as.factor(n)),
    show.legend = FALSE
  ) +
  coord_polar("y") +
  facet_grid(decade ~ year, switch = "both") +
  labs(title = "Annual Nobel Prize sharing") +
  theme_void() +
  theme(
    plot.title = element_text(face = "bold", vjust = 8),
    strip.text.x = element_text(
      size = 7,
      margin = margin(t = 5)
    ),
    strip.text.y = element_text(
      size = 7,
      angle = 180, hjust = 1, margin = margin(r = 10)
    )
  )

76.19 其它

没有回答的问题,大家自己花时间探索下。

76.20 延伸阅读

  • 有些图可以再美化下
## Warning in rm(df, df1, df4, dt, global, nobel_winners,
## nobel_winners_birth_country, : object 'scale_color_continuous' not found
## Warning in rm(df, df1, df4, dt, global, nobel_winners,
## nobel_winners_birth_country, : object 'scale_color_discrete' not found
## Warning in rm(df, df1, df4, dt, global, nobel_winners,
## nobel_winners_birth_country, : object 'scale_color_gradient' not found