Chapter 4 Mapping & Spatial Analysis
Author: David
Packages: jaspatial, leaflet, mapbox
4.1 Setup
Geospatial analysis in R can require significant setup of your computer and installation of a variety of packages. We have set up an R package, jaspatial, that installs and loads the required packages, including sf, leaflet, tigris, mapview, and rmapshaper.
To get started, install jaspatial from GitHub using the following code:
devtools::install_github("januaryadvisors/jaspatial")
After installation, load the jaspatial library and run two additional commands to get your environment ready for analysis: load_geo_packages(), which loads sf and the other packages you'll need, and set_geo_options(), which sets the correct options for packages like tigris.
#Let's load up some standard packages first
library(tidyverse)
library(here)
library(janitor)
#Then jaspatial
library(jaspatial)
load_geo_packages()
set_geo_options()
4.2 Reading shapefiles
There are several ways to read shapefiles into R, but the one I would recommend is sf::read_sf(). It accommodates several different kinds of spatial files (shapefiles, GeoJSON, etc.) and also allows you to import files directly from the web.
Below is an example of how you can import Harris County Commissioner Precinct boundaries directly from the City of Houston’s GIS website.
sf <- read_sf("https://opendata.arcgis.com/datasets/7237db114eeb416cb481f4450d8a0fa6_11.geojson")
If you're reading in a shapefile, you only need to point read_sf() at the file that ends in .shp. Shapefiles typically come bundled with several companion files (e.g., .dbf, .shx, .prj) that are important but do not need to be read into R separately.
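For example, here's a minimal sketch of reading a local shapefile (the folder and file names below are hypothetical):
#read_sf() automatically picks up the companion files sitting next to the .shp
precincts <- read_sf(here("data", "precincts", "precincts.shp"))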
4.3 Cleaning shapefiles
After you read in your shapefile, you'll first want to clean it up. Spatial files often come with different map projections. If you want to create a map or analyze multiple shapefiles at once (e.g., a point-in-polygon analysis), you'll want to standardize your shapefiles so that they share the same projection.
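You can check a layer's current projection with sf's st_crs() (layer_a and layer_b below are hypothetical placeholder names):
#Inspect the projection of the precinct shapefile we read in above
st_crs(sf)
#Two layers share a projection when st_crs(layer_a) == st_crs(layer_b) is TRUE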
The jaspatial package comes with a function called clean_shape() that performs several cleaning tasks. First, it converts the projection to WGS84. Second, it cleans the field names using janitor::clean_names() to make them more R-friendly.
clean_sf <- clean_shape(.data = sf)
names(sf)
## [1] "OBJECTID" "PCT_NO" "AREA_IN_MI" "AREA_IN_SQ_FT"
## [5] "ShapeSTArea" "ShapeSTLength" "geometry"
names(clean_sf)
## [1] "objectid" "pct_no" "area_in_mi" "area_in_sq_ft"
## [5] "shape_st_area" "shape_st_length" "geometry"
You also have the option to use a different map projection. To do so, simply add an additional argument to clean_shape():
#This is an alternative projection that's good for Texas
utm14n <- st_crs("+proj=utm +zone=14 +ellps=GRS80 +datum=NAD83 +units=m +no_defs +towgs84=0,0,0")
clean_sf_utm14n <- clean_shape(.data = sf, .crs = utm14n)
4.4 Mapping your data
When you're starting out with a new shapefile, you'll often want to take a quick peek and make sure your data looks right. Rather than building a leaflet map, which requires several lines of code (see below), you can use mapview to pull up a quick map.
mapview(clean_sf)
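If you do want a leaflet map, here's a minimal sketch of the same polygons; the tile provider and styling below are illustrative choices, not requirements:
#A basic interactive map of the precinct polygons
leaflet(clean_sf) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(
    color = "#444444", weight = 1, fillOpacity = 0.3,
    label = ~paste("Precinct", pct_no)
  )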
4.5 Point-in-polygon analysis
One of the most basic geospatial techniques is a point-in-polygon analysis. This type of analysis is useful when you have two spatial files — one with location points (longitude and latitude), and the other with polygons (e.g., administrative boundaries).
One question you might want to answer is: which locations fall within which geographic boundaries? Let's use the example of Health Department facilities in Harris County. We'd like to know which Commissioner Precinct each facility falls in.
#First, we need to read in the facilities file and clean it up
facilities <- read_sf("https://opendata.arcgis.com/datasets/dc3d27e2fb8b4c76ab6713909e7269a9_11.geojson") %>%
  clean_shape()
#Next, we perform a spatial join
joined_sf <- st_join(facilities, clean_sf)
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
## although coordinates are longitude/latitude, st_intersects assumes that they are planar
#Let's take a look at two of the columns
select(joined_sf, name, pct_no)
## Simple feature collection with 32 features and 2 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: -95.60304 ymin: 29.6563 xmax: -95.22971 ymax: 29.90181
## CRS: +proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0
## # A tibble: 32 × 3
## name pct_no geometry
## <chr> <int> <POINT [°]>
## 1 MAGNOLIA HEALTH CENTER (Dental Only) 2 (-95.30104 29.73468)
## 2 KASHMERE MULTI-SERVICE CENTER 1 (-95.31653 29.80433)
## 3 FIFTH WARD MULTI-SERVICE CENTER 1 (-95.33082 29.77226)
## 4 MAGNOLIA MULTI-SERVICE CENTER 2 (-95.30101 29.73463)
## 5 MAGNOLIA MSC WIC CENTER 2 (-95.30104 29.73455)
## 6 AIRLINE WIC CENTER 2 (-95.38422 29.85171)
## 7 SOUTHWEST WIC CENTER 3 (-95.49388 29.71094)
## 8 NORTHSIDE WIC CENTER 1 (-95.34071 29.83726)
## 9 LA NUEVA CASA DE AMIGOS WIC CENTER 2 (-95.36171 29.77545)
## 10 BRAESNER WIC CENTER 1 (-95.5289 29.67224)
## # … with 22 more rows
That might be enough for some analyses. But you might also want to summarize the data and count up how many facilities fall within each precinct.
summary <- joined_sf %>%
  st_drop_geometry() %>% #You don't have to drop the geometry but it improves runtime
  count(pct_no)
summary
## # A tibble: 3 × 2
## pct_no n
## <int> <int>
## 1 1 16
## 2 2 10
## 3 3 6
An alternative to count() is to use group_by() and summarise() — both are tidyverse verbs that you would normally use to summarize non-spatial data. This approach is useful if you want to perform additional operations beyond just counting up the number of points in each polygon (see the example after the output below).
summary_alt <- joined_sf %>%
  st_drop_geometry() %>%
  group_by(pct_no) %>%
  summarise(
    n = n()
  )
summary_alt
## # A tibble: 3 × 2
## pct_no n
## <int> <int>
## 1 1 16
## 2 2 10
## 3 3 6
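For instance, here's a hypothetical extension that also calculates each precinct's share of all facilities:
joined_sf %>%
  st_drop_geometry() %>%
  group_by(pct_no) %>%
  summarise(n = n()) %>%
  mutate(share = n / sum(n)) #After summarise(), sum(n) is the total across all precincts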
4.6 Polygon-in-polygon analysis
Another common geospatial technique is a polygon-in-polygon analysis. This comes up most often when you have summary data at one level of geography (e.g., census tracts) and wish to summarize the data at a different level of geography (e.g., neighborhoods).
In these cases, what you need to do is first create a crosswalk between the two types of geography. You want to determine what share of the original geospatial layer falls within the new layer. In the above example, we would want to know what share of each census tract falls within each neighborhood.
Once you've calculated the overlap, you redistribute the data in the original layer to the new layer based on the amount of overlap. For count data, this is a weighted sum; for percentage or rate data, a weighted average (see the toy sketch below).
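To make the weighting concrete, here's a toy sketch with made-up numbers: two tracts that each partially overlap a single neighborhood.
#Hypothetical inputs: coverage is the share of each tract inside the neighborhood
tracts <- tibble(
  geoid = c("tract1", "tract2"),
  coverage = c(0.25, 1.0),
  total_pop = c(400, 100),
  hisp_pct = c(80, 20)
)

tracts %>%
  summarise(
    est_pop = sum(coverage * total_pop), #weighted sum: 0.25*400 + 1*100 = 200
    est_hisp_pct = sum(coverage * total_pop * hisp_pct / 100) / est_pop * 100 #weighted average: 50
  )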
(Sidenote: This approach assumes that the population is evenly distributed within each census tract. This is a fair assumption for most data and is oftentimes the best we can do. But if you’re using population data, this approach can be improved by using residential location information. Rather than weighting based on the amount of area overlap, you weight based on how many residential homes overlap between geographies. See David’s repo for Vision Galveston for an example of this method.)
Here’s an example using census tract and Super Neighborhood data. In this example, we have ACS data on the number of residents who identify as Hispanic/Latino and we want to generate estimates at the Super Neighborhood level.
We'll start by pulling in the Census 2020 redistricting file (which includes a census tract shapefile) and the City of Houston Super Neighborhood file. We're going to pull the number of Hispanic residents and the total number of residents. Remember that we need to clean and standardize these shapefiles before using them together.
library(tidycensus)
cen2020 <- get_decennial(
  geography = "tract",
  variables = c(total_pop = "P1_001N", hisp = "P2_003N"),
  state = 48, county = c(201, 157, 167, 339), #City of Houston stretches across multiple counties
  cache_table = T, year = 2020, geometry = T, #This argument means we also grab the shapefile
  sumfile = "pl"
) %>%
  spread(variable, value) %>% #turn from long to wide
  clean_shape()
## Getting data from the 2020 decennial Census
supern <- read_sf("https://opendata.arcgis.com/datasets/c3bfee99cbc14a899e4a603ee73203ee_3.geojson") %>%
  clean_shape()
Next, we need to create the crosswalk. The jaspatial package provides a quick and easy function to create a crosswalk (TO BE ADDED), but here's what's going on under the hood:
options(scipen = 999) #I like to turn off scientific notation first
#First, find the intersection and calculate the overlap area
intersection <- st_intersection(cen2020, supern) %>% #Intersect
  mutate(intersect_area = st_area(.)) %>% #Calculate the intersecting area
  dplyr::select(geoid, snbname, intersect_area) %>% #Only keep what you need
  st_drop_geometry() #You can drop the geometry at this point
## although coordinates are longitude/latitude, st_intersection assumes that they are planar
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
#Second, join the intersection with original shapefile and calculate the % overlap (coverage)
crosswalk <- cen2020 %>%
  mutate(total_area = st_area(.)) %>%
  left_join(intersection, by = "geoid") %>%
  mutate(coverage = as.numeric(intersect_area / total_area)) %>%
  filter(!is.na(snbname)) %>%
  st_drop_geometry() %>%
  dplyr::select(geoid, snbname, coverage)

crosswalk
## # A tibble: 1,416 × 3
## geoid snbname coverage
## <chr> <chr> <dbl>
## 1 48201411100 GREENWAY / UPPER KIRBY AREA 0.00151
## 2 48201411100 AFTON OAKS / RIVER OAKS AREA 0.994
## 3 48201411100 NEARTOWN - MONTROSE 0.00469
## 4 48201522102 SPRING BRANCH NORTH 0.999
## 5 48201522102 SPRING BRANCH CENTRAL 0.00100
## 6 48201423402 BRAYS OAKS 1
## 7 48201433100 SHARPSTOWN 0.709
## 8 48201433100 WESTWOOD 0.256
## 9 48201433100 ALIEF 0.0348
## 10 48201222602 GREATER GREENSPOINT 1
## # … with 1,406 more rows
The coverage column now indicates the share (0-1) of each census tract that falls in each neighborhood. This is the weight you should use to redistribute the census tract data and calculate neighborhood estimates. Now you're ready to do that!
neighborhood_estimate1 <- left_join(cen2020, crosswalk, by = "geoid") %>%
  st_drop_geometry() %>%
  filter(!is.na(snbname)) %>%
  mutate(
    #Use across and multiply each variable by the coverage
    across(c("hisp", "total_pop"), ~ . * coverage)
  ) %>%
  group_by(snbname) %>%
  summarise(
    across(c("hisp", "total_pop"), sum, na.rm = T)
  )
neighborhood_estimate1
## # A tibble: 88 × 3
## snbname hisp total_pop
## <chr> <dbl> <dbl>
## 1 ACRES HOME 19503. 30765.
## 2 ADDICKS PARK TEN 14468. 22766.
## 3 AFTON OAKS / RIVER OAKS AREA 14146. 16212.
## 4 ALIEF 53430. 106706.
## 5 ASTRODOME AREA 13951. 16040.
## 6 BRAEBURN 6858. 18907.
## 7 BRAESWOOD 18904. 22286.
## 8 BRAYS OAKS 34936. 57542.
## 9 BRIAR FOREST 31931. 41909.
## 10 CARVERDALE 1984. 4938.
## # … with 78 more rows
The above example is what you would do with count data. The example below is slightly modified for data that comes as percentages. Note that you MUST have the proper denominator for the first level of geography.
#Just going to make that dataset real quick
percent_data <- cen2020 %>%
  mutate(hisp_pct = hisp / total_pop * 100)
#Now let's use it to aggregate up to the neighborhood level
neighborhood_estimate2 <- left_join(percent_data, crosswalk, by = "geoid") %>%
  st_drop_geometry() %>%
  filter(!is.na(snbname)) %>%
  mutate(
    #First, calculate the numerator for each tract
    hisp = (hisp_pct / 100) * total_pop,
    #Then weight the numerator and denominator
    across(c("hisp", "total_pop"), ~ . * coverage)
  ) %>%
  group_by(snbname) %>%
  summarise(
    across(c("hisp", "total_pop"), sum, na.rm = T)
  ) %>%
  mutate(
    #Then calculate the percentage again
    hisp_pct = hisp / total_pop * 100
  )
neighborhood_estimate2
## # A tibble: 88 × 4
## snbname hisp total_pop hisp_pct
## <chr> <dbl> <dbl> <dbl>
## 1 ACRES HOME 19503. 30765. 63.4
## 2 ADDICKS PARK TEN 14468. 22766. 63.6
## 3 AFTON OAKS / RIVER OAKS AREA 14146. 16212. 87.3
## 4 ALIEF 53430. 106706. 50.1
## 5 ASTRODOME AREA 13951. 16040. 87.0
## 6 BRAEBURN 6858. 18907. 36.3
## 7 BRAESWOOD 18904. 22286. 84.8
## 8 BRAYS OAKS 34936. 57542. 60.7
## 9 BRIAR FOREST 31931. 41909. 76.2
## 10 CARVERDALE 1984. 4938. 40.2
## # … with 78 more rows