Chapter 3 Clustering

3.1 Number of Clusters

Clustering Methods

  • C1: average silhouette width
  • C2: gap statistic
  • C3: hierarchical (spatial) cluster analysis, cutoff = 50 m
  • C4: hierarchical (spatial) cluster analysis, cutoff = 10 km
  • C5: hierarchical (spatial) cluster analysis, cutoff = 200 km

## # A tibble: 6 x 4
##   patientID     n  uniq    c3
##       <dbl> <int> <int> <int>
## 1         4   127   102     9
## 2         9    86    81    22
## 3        10    77    21     9
## 4        11   186   144     8
## 5        14   135   123     5
## 6        17    56    36    22
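
Methods C3 to C5 cut a hierarchical clustering of each participant's locations at a fixed distance, and c3 in the table above is the resulting cluster count at 50 m. The source does not state which linkage was used, so the sketch below assumes single linkage. Under that assumption, cutting the tree at height h is equivalent to linking every pair of points at most h apart and counting the connected components. The function name and the distance matrix are hypothetical, and the sketch is written in Python for illustration only.

```python
def count_clusters(dist_m, cutoff_m):
    """Count single-linkage clusters after cutting the tree at cutoff_m.

    dist_m is a symmetric matrix of pairwise distances in metres.
    Single linkage is an assumption; the report does not name the linkage.
    """
    n = len(dist_m)
    parent = list(range(n))  # union-find forest, one node per location

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist_m[i][j] <= cutoff_m:
                parent[find(i)] = find(j)  # merge the two clusters
    return len({find(i) for i in range(n)})


# Hypothetical pairwise distances (metres) between five recorded locations.
dist_m = [
    [0, 15, 1900, 55000, 730000],
    [15, 0, 1890, 55010, 730000],
    [1900, 1890, 0, 55600, 731000],
    [55000, 55010, 55600, 0, 690000],
    [730000, 730000, 731000, 690000, 0],
]

for label, cutoff in [("C3", 50), ("C4", 10_000), ("C5", 200_000)]:
    print(label, count_clusters(dist_m, cutoff))
```

With these toy distances the cluster count falls from 4 to 3 to 2 as the cutoff grows from 50 m to 10 km to 200 km. Complete or average linkage would generally give different counts at the same cutoff.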

3.1.1 Distribution of the number of locations (within 50 m) per participant

## t$c3 : 
##         Frequency Percent Cum. percent
## 1              19     9.5          9.5
## 2              21    10.5         20.0
## 3              22    11.0         31.0
## 4              19     9.5         40.5
## 5               9     4.5         45.0
## 6              10     5.0         50.0
## 7               8     4.0         54.0
## 8               5     2.5         56.5
## 9              12     6.0         62.5
## 10              8     4.0         66.5
## 11              3     1.5         68.0
## 12              4     2.0         70.0
## 13              4     2.0         72.0
## 14              5     2.5         74.5
## 15              4     2.0         76.5
## 16              4     2.0         78.5
## 17              3     1.5         80.0
## 18              3     1.5         81.5
## 19              4     2.0         83.5
## 20              1     0.5         84.0
## 21              2     1.0         85.0
## 22              5     2.5         87.5
## 23              1     0.5         88.0
## 24              3     1.5         89.5
## 25              1     0.5         90.0
## 26              1     0.5         90.5
## 28              2     1.0         91.5
## 29              1     0.5         92.0
## 31              2     1.0         93.0
## 32              3     1.5         94.5
## 33              1     0.5         95.0
## 36              5     2.5         97.5
## 41              3     1.5         99.0
## 43              1     0.5         99.5
## 99              1     0.5        100.0
##   Total       200   100.0        100.0

Only 9.5% of participants (19 of 200) recorded videos in a single location. Half recorded in six or fewer locations, and one participant recorded in 99.
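
The frequency, percent and cumulative percent columns above can be reproduced from a plain list of per-participant cluster counts. The sketch below is a minimal illustration under that assumption; `frequency_table` is a hypothetical helper, and the input is only the six c3 values printed at the start of this chapter, not the full sample of 200.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((
            value,
            counts[value],
            round(100 * counts[value] / total, 1),  # share of participants
            round(100 * running / total, 1),        # running share
        ))
    return rows


# Toy input: the c3 values of the six participants shown earlier.
rows = frequency_table([9, 22, 9, 8, 5, 22])
for row in rows:
    print(row)
```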

3.2 Distance from base (most frequent location)

3.2.1 Defining the most frequent location (mfl)

## # A tibble: 9 x 5
##   clust    x1    x2   loc   mfl
##   <int> <dbl> <dbl> <int> <dbl>
## 1     1 -117.  32.9    87     1
## 2     2 -117.  32.9     3     0
## 3     3 -117.  32.9     2     0
## 4     4 -117.  32.9     5     0
## 5     5 -122.  37.8    23     0
## 6     6 -117.  32.9     1     0
## 7     7 -122.  37.0     1     0
## 8     8 -117.  32.9     1     0
## 9     9 -117.  33.0     4     0
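
In the table above, loc counts the records falling in each cluster (the counts sum to 127, the n reported for participant 4), and mfl flags the cluster with the most records. A minimal sketch of that rule follows. The function name is hypothetical, and breaking ties in favour of the first cluster is an assumption, since the source does not say how ties are handled.

```python
# (clust, loc) pairs copied from the table above: cluster id and record count.
clusters = [(1, 87), (2, 3), (3, 2), (4, 5), (5, 23),
            (6, 1), (7, 1), (8, 1), (9, 4)]


def flag_mfl(clusters):
    """Return (clust, loc, mfl) rows with mfl = 1 for the busiest cluster."""
    # max() keeps the first maximal row, so the earliest cluster wins ties
    # (an assumed tie-break rule).
    busiest = max(clusters, key=lambda row: row[1])[0]
    return [(clust, loc, 1 if clust == busiest else 0)
            for clust, loc in clusters]


flagged = flag_mfl(clusters)
for row in flagged:
    print(row)
```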

3.2.2 Geodesic distance from mfl

##   to_cluster distance_m dist_cat     GC_dist  GC_cat loc
## 1          1      0.000     base      0.0000    base  87
## 2          2   1851.759    local   1851.2128   local   3
## 3          3   1767.785    local   1767.1582   local   2
## 4          4   1876.144    local   1875.2589   local   5
## 5          5 710341.286  distant 709687.8539 distant  23
## 6          6    330.870    local    330.1236   local   1
## 7          7 632069.323  distant 631383.2844 distant   1
## 8          8   1777.499    local   1776.6424   local   1
## 9          9  18830.318    local  18809.6115   local   4
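
The GC_dist column appears to be a great-circle (spherical) distance from the base cluster, while distance_m appears to be a geodesic distance on an ellipsoid, which would explain the small differences between the two. The sketch below covers only the great-circle case, using the haversine formula, and then labels a distance as base, local or distant. The Earth radius and the 200 km limit are assumptions: the source does not state the thresholds, and 200 km is chosen here only because it is consistent with the table (18.8 km is local, 632 km is distant).

```python
import math

EARTH_RADIUS_M = 6371008.8  # mean Earth radius in metres (assumed constant)


def great_circle_m(lon1, lat1, lon2, lat2):
    """Haversine distance in metres between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2)
         * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def categorize(distance_m, local_limit_m=200_000):
    """Label a distance from the base; local_limit_m is a hypothetical cutoff."""
    if distance_m == 0:
        return "base"
    return "local" if distance_m <= local_limit_m else "distant"


# Hypothetical coordinates; the table only shows rounded values.
base = (-117.0, 32.9)
other = (-122.4, 37.8)
d = great_circle_m(*base, *other)
print(round(d), categorize(d))
```

An ellipsoidal geodesic needs an iterative method and is usually taken from a geodesy library rather than written by hand.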

3.2.3 Distance summaries

## # A tibble: 3 x 11
##   dist_cat avg_gd min_gd max_gd  sd_gd avg_GC min_GC max_GC  sd_GC  freq n_clust
##   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <int>   <int>
## 1 base     0.     0.     0.        NA  0.     0.     0.        NA     87       1
## 2 local    4.41e3 3.31e2 1.88e4  7092. 4.40e3 3.30e2 1.88e4  7084.    16       6
## 3 distant  6.71e5 6.32e5 7.10e5 55347. 6.71e5 6.31e5 7.10e5 55370.    24       2
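
The summary above can be recomputed from the per-cluster rows in 3.2.2: the distance statistics are taken across clusters within each category (not weighted by record count), freq sums the loc column, and n_clust counts the clusters. The sketch below reproduces the geodesic columns only; `summarize` is a hypothetical helper, and the sample standard deviation is returned as None where the table shows NA.

```python
import statistics
from collections import defaultdict

# (dist_cat, distance_m, loc) rows copied from the per-cluster table in 3.2.2.
rows = [
    ("base", 0.000, 87), ("local", 1851.759, 3), ("local", 1767.785, 2),
    ("local", 1876.144, 5), ("distant", 710341.286, 23),
    ("local", 330.870, 1), ("distant", 632069.323, 1),
    ("local", 1777.499, 1), ("local", 18830.318, 4),
]


def summarize(rows):
    """Summarise geodesic distance and record counts per distance category."""
    groups = defaultdict(list)
    for cat, dist, loc in rows:
        groups[cat].append((dist, loc))
    summary = {}
    for cat, items in groups.items():
        dists = [d for d, _ in items]
        summary[cat] = {
            "avg_gd": statistics.mean(dists),
            "min_gd": min(dists),
            "max_gd": max(dists),
            # Sample SD is undefined for a single cluster (NA in the table).
            "sd_gd": statistics.stdev(dists) if len(dists) > 1 else None,
            "freq": sum(loc for _, loc in items),
            "n_clust": len(items),
        }
    return summary


summary = summarize(rows)
for cat, stats in summary.items():
    print(cat, stats)
```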

3.3 Final dataset

## # A tibble: 6 x 13
##   patientID    c3 dist_cat c3_avg_gd c3_min_gd c3_max_gd c3_sd_gd c3_avg_GC
##       <dbl> <int> <fct>        <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1         4     9 base            0         0         0       NA         0 
## 2         4     9 local        4406.      331.    18830.    7092.     4402.
## 3         4     9 distant    671205.   632069.   710341.   55347.   670536.
## 4         9    22 base            0         0         0       NA         0 
## 5         9    22 local        8479.      228.    24159.    8299.     8475.
## 6        10     9 base            0         0         0       NA         0 
## # … with 5 more variables: c3_min_GC <dbl>, c3_max_GC <dbl>, c3_sd_GC <dbl>,
## #   freq <int>, n_clust <int>