1 Introduction

Regional urban development prediction is one of the most important elements of decision-making in the real estate and planning industry. It affects the decisions of various stakeholders, such as real estate developers, real estate investors, tenants, and planners. An accurate prediction can help developers and investors to maximize their profit by providing information on when to buy or develop the property. Households can also make a better decision whether to own or rent a house by estimating cost and future capital gain. Planners can design more efficient strategies to enhance economic and environmental resilience for the urban and suburban regions. For example, a development prediction based on land cover data can provide insights into future demand’s impact on environmentally sensitive lands. The following project shows how geospatial machine learning is used for a site-specific development prediction model.

The purpose of this project is to evaluate the development suitability of the study area by comparing predicted development demand and environmental sensitivity. The project will conduct a binary linear regression based on the following datasets: land cover, land cover change, census demographics, and distance to highways. The model trained with 2000 and 2010 datasets will be used to predict 2020 land supply and demand. Here, the ten-year growth trends between 2010 and 2020 are assumed to be similar to those between 2000 and 2010. 8.1: Environmental sensitivity of the land will be measured based on land cover type.

Austin Metropolitan Area (or Austin-Round Rock-San Marcos) was selected as a project target area. Austin MSA has been experiencing rapid economic growth and urban sprawl due to the relocations of large companies. For Austin MSA, balancing economic growth and environmental protection will become a real challenge, considering that they are one of the most natural disaster-prone metropolitan areas in the state of Texas. Although this project does not take account of recent urban growth factors (such as moving companies and employees), it can provide a general concept of how such modeling can assess the environmental suitability of potential developments.

This is a final project for CPLN 675 Land Use and Environmental Modeling course and below codes and the structure are largely based on Urban Growth Modeling from 2021_4_21_sprawl_chapter.html

1.1 Setup

library(tidyverse)
library(sf)
library(raster)
library(kableExtra)
library(tidycensus)
library(tigris)
library(FNN)
library(QuantPsyc)
library(caret)
library(yardstick)
library(pscl)
library(plotROC) 
library(ggrepel)
library(pROC)
library(grid)
library(gridExtra)
library(viridis)
library(rmarkdown)
library(knitr)
library(dplyr)

mapTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle=element_text(face="italic"),
    plot.caption=element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),axis.title = element_blank(),
    axis.text = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill=NA, size=2)
  )
}

plotTheme <- function(base_size = 12) {
  theme(
    text = element_text( color = "black"),
    plot.title = element_text(size = 14,colour = "black"),
    plot.subtitle = element_text(face="italic"),
    plot.caption = element_text(hjust=0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    panel.grid.major = element_line("grey80", size = 0.1),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = NA, fill=NA, size=2),
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(size=12),
    axis.title = element_text(size=12),
    axis.text = element_text(size=10),
    plot.background = element_blank(),
    legend.background = element_blank(),
    legend.title = element_text(colour = "black", face = "italic"),
    legend.text = element_text(colour = "black", face = "italic"),
    strip.text.x = element_text(size = 14)
  )
}

palette2 <- c("#41b6c4","#253494")
palette4 <- c("#a1dab4","#41b6c4","#2c7fb8","#253494")
palette5 <- c("#ffffcc","#a1dab4","#41b6c4","#2c7fb8","#253494")
palette10 <- c("#f7fcf0","#e0f3db","#ccebc5","#a8ddb5","#7bccc4",
               "#4eb3d3","#2b8cbe","#0868ac","#084081","#f7fcf0")

quintileBreaks <- function(df,variable) {
  as.character(quantile(df[[variable]],
                        c(.01,.2,.4,.6,.8),na.rm=T))
}

#This function can be used to convert a polygon sf to centroids xy coords.
xyC <- function(aPolygonSF) {
  as.data.frame(
    cbind(x=st_coordinates(st_centroid(aPolygonSF))[,1],
          y=st_coordinates(st_centroid(aPolygonSF))[,2]))
} 

#this function convert a raster to a data frame so it can be plotted in ggplot
rast <- function(inRaster) {
  data.frame(
    xyFromCell(inRaster, 1:ncell(inRaster)), 
    value = getValues(inRaster)) }


setwd("~/Desktop/Final_Austin")

2 Data wrangling & feature engineering

2.1 Land Cover Change (2001-2011)

The dependent variable in this model is a land cover change between 2001 and 2010. This dataset classifies land by land cover type, providing information on whether the land is developed or not developed. Austin MSA boundary data was downloaded from CAPCOG GIS Open Data (https://regional-open-data-capcog.opendata.arcgis.com/datasets/capcog-county-boundaries?geometry=-101.488%2C29.497%2C-94.023%2C31.156). 2001-2011 land cover change data was retrieved from USGS ScienceBase Catalog (ttps://www.sciencebase.gov/catalog/item/5dfc2280e4b0b207a9fe8235). National land cover data was clipped by Austin MSA shapefile and sampled to 1,000 by 1,000ft^2 fishnet, so that it can be later joined with population data. The data, which was initially categorized by 18 land cover types, was reclassified into two categories - “developed” and “undeveloped” - such that developed grid cells receive a value of 1 and all others receive a value of 0. Since this is the land cover change data, the cells with a value of 1 indicate the areas that changed from undeveloped to developed.

Some map visualizations were omitted since R Markdown could not process them.

austin_msa <- 
  st_read("austin_msa.shp")


lc_2001_2011 <- raster("nlcd_2001_to_2011_landcover_change_pixels_2011_edition_2014_10_10.img")


# re-project the MSA layer and try a clip
austin_msa <- austin_msa %>% st_transform(crs = "+proj=aea +lat_0=23 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 
              +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs ")
lc_change <- crop(x = lc_2001_2011, y = austin_msa)


# re-project both layers back to the TX southern plane
# NAD 1983 StatePlane Texas Central FIPS 4203 Feet 102739

lc_change <- projectRaster(from = lc_change, crs = "ESRI:102739")

writeRaster(lc_change, 
           filename = ("lc_change.tif"), 
           overwrite = TRUE)

lc_change <- raster("lc_change.tif")

lc_change <- projectRaster(from = lc_change, crs = "ESRI:102739")
austin_msa <- st_transform(austin_msa, crs = "ESRI:102739")

ggplot() +
  geom_sf(data=austin_msa) +
  geom_raster(data=rast(lc_change) %>% na.omit %>% filter(value > 0), 
              aes(x,y,fill=as.factor(value))) +
  scale_fill_viridis(direction = -1, discrete=TRUE, name ="Land Cover\nChange") +
  labs(title = "Land Cover Change") +
  mapTheme()

#Reclassify land cover
reclassMatrix <- matrix(c(0, 12, 0,
                          12, 24, 1,
                          24, Inf, 0),
                        ncol = 3, byrow = T)

#reclassMatrix

lc_change2 <- 
  reclassify(lc_change,reclassMatrix)

lc_change2[lc_change2 < 1] <- NA

names(lc_change2) <- "lc_change"

#ggplot() +
#  geom_sf(data=austin_msa) +
#  geom_raster(data=rast(lc_change2) %>% na.omit, 
#              aes(x,y,fill=as.factor(value))) +
#  scale_fill_viridis(discrete=TRUE, name ="Land Cover\nChange") + 
#  labs(title="Development land use change") +
#  mapTheme()

austin_msa_fishnet <- 
  st_make_grid(austin_msa, 2000) %>%
  st_sf()

austin_msa_fishnet <-
  austin_msa_fishnet[austin_msa,]

st_write(austin_msa_fishnet, dsn = "austin_fishnet.shp", layer = "Austin_highway.shp", 
         driver = "ESRI Shapefile", overwrite = TRUE)

#ggplot() +
#  geom_sf(data=austin_msa_fishnet) +
#  labs(title="Fishnet, 1000 foot resolution") +
#  mapTheme()

changePoints <-
  rasterToPoints(lc_change2) %>%
  as.data.frame() %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(austin_msa_fishnet))

fishnet <- 
  aggregate(changePoints, austin_msa_fishnet, sum) %>%
  mutate(lc_change = ifelse(is.na(lc_change),0,1),
         lc_change = as.factor(lc_change))

ggplot() +
  geom_sf(data=austin_msa) +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)$x, y=xyC(fishnet)$y, colour=lc_change)) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development"),
                      name = "") +
  labs(title = "Land cover development change", subtitle = "As fishnet centroids") +
  mapTheme()

2.2 Land Cover in 2001

Since the propensity of new development takes account of existing land cover type, 2001 land cover data was used as an independent variable. The data was retrieved from the MRLC NLCD viewer (https://www.mrlc.gov/viewer/). It was reclassified as the following table and aggregated to the fishnet.

The maps below show reclassified data by new classification.

lc_2001_box <- raster("NLCD_2001_Land_Cover_L48_20190424_HECAxKZdOYtUmzpSfILY.tiff")

lc_2001_box <- projectRaster(from = lc_2001_box, crs = "ESRI:102739")


lc_2001 <- crop(x = lc_2001_box, y = austin_msa)

writeRaster(lc_2001, 
            filename = ("lc_2001.tif"), 
            overwrite = TRUE)

lc_2001 <- raster("lc_2001.tif")

ggplot() +
  geom_sf(data=austin_msa) +
  geom_raster(data=rast(lc_2001) %>% na.omit %>% filter(value > 0), 
              aes(x,y,fill=as.factor(value))) +
  scale_fill_viridis(discrete=TRUE, name ="") +
  labs(title = "Land Cover, 2001") +
  mapTheme()

developed <- lc_2001 == 21 | lc_2001 == 22 | lc_2001 == 23 | lc_2001 == 24
forest <- lc_2001 == 41 | lc_2001 == 42 | lc_2001 == 43 
farm <- lc_2001 == 81 | lc_2001 == 82 
wetlands <- lc_2001 == 90 | lc_2001 == 95 
otherUndeveloped <- lc_2001 == 52 | lc_2001 == 71 | lc_2001 == 31 
water <- lc_2001 == 11

names(developed) <- "developed"
names(forest) <- "forest"
names(farm) <- "farm"
names(wetlands) <- "wetlands"
names(otherUndeveloped) <- "otherUndeveloped"
names(water) <- "water"

aggregateRaster <- function(inputRasterList, theFishnet) {
  #create an empty fishnet with the same dimensions as the input fishnet
  theseFishnets <- theFishnet %>% dplyr::select()
  #for each raster in the raster list
  for (i in inputRasterList) {
    #create a variable name corresponding to the ith raster
    varName <- names(i)
    #convert raster to points as an sf
    thesePoints <-
      rasterToPoints(i) %>%
      as.data.frame() %>%
      st_as_sf(coords = c("x", "y"), crs = st_crs(theFishnet)) %>%
      filter(.[[1]] == 1)
    #aggregate to the fishnet
    thisFishnet <-
      aggregate(thesePoints, theFishnet, length) %>%
      mutate(!!varName := ifelse(is.na(.[[1]]),0,1))
    #add to the larger fishnet
    theseFishnets <- cbind(theseFishnets,thisFishnet)
  }
  #output all aggregates as one large fishnet
  return(theseFishnets)
}

theRasterList <- c(developed,forest,farm,wetlands,otherUndeveloped,water)

aggregatedRasters <-
  aggregateRaster(theRasterList, austin_msa_fishnet) %>%
  dplyr::select(developed,forest,farm,wetlands,otherUndeveloped,water) %>%
  mutate_if(is.numeric,as.factor)

aggregatedRasters %>%
  gather(var,value,developed:water) %>%
  st_cast("POLYGON") %>%    #just to make sure no weird geometries slipped in
  mutate(X = xyC(.)$x,
         Y = xyC(.)$y) %>%
  ggplot() +
  geom_sf(data=austin_msa) +
  geom_point(aes(X,Y, colour=as.factor(value))) +
  facet_wrap(~var) +
  scale_colour_manual(values = palette2,
                      labels=c("Other","Land Cover"),
                      name = "") +
  labs(title = "Land cover types, 2001",
       subtitle = "As fishnet centroids") +
  mapTheme()

3 Census Demographics

Population and population change are important indicators of development demand. Census data for both 2000 and 2010 were downloaded via tidycensus package. The data was aggregated to fishnet and reinterpreted using a technique called areal weighted interpolation. In order to avoid assigning the same population value repeatedly to different fishnet grid cells, the area weighted interpolation function was used. “st_interpolate_aw assigns a proportion of a tract’s population to a grid cell weighted by the proportion of the tract that intersects the grid cell (Steif, 2019)”

austinPop00 <- 
  get_decennial(geography = "tract", variables = "P001001", year = 2000,
                state = 48, geometry = TRUE, 
                county=c("Williamson", "Travis", "Hays", "Caldwell", "Bastrop")) 



austinPop00 <- austinPop00 %>%
  rename(pop_2000 = value) %>%
  st_transform(st_crs(austin_msa_fishnet))

austinPop10 <- 
  get_decennial(geography = "tract", variables = "P001001", year = 2010,
                state = 48, geometry = TRUE,
                county=c("Williamson", "Travis", "Hays", "Caldwell", "Bastrop")) %>%
  rename(pop_2010 = value) %>%
  st_transform(st_crs(austin_msa_fishnet)) %>%
  st_buffer(-1)

grid.arrange(
  ggplot() +
    geom_sf(data = austinPop00, aes(fill=factor(ntile(pop_2000,5))), colour=NA) +
    scale_fill_manual(values = palette5,
                      labels=quintileBreaks(austinPop00,"pop_2000"),
                      name="Quintile\nBreaks") +
    labs(title="Population, Austin MSA: 2000") +
    mapTheme(), 
  
  ggplot() +
    geom_sf(data = austinPop10, aes(fill=factor(ntile(pop_2010,5))), colour=NA) +
    scale_fill_manual(values = palette5,
                      labels=quintileBreaks(austinPop10,"pop_2010"),
                      name="Quintile\nBreaks") +
    labs(title="Population, Austin MSA: 2010") +
    mapTheme(), ncol=2)

austinMSA_fishnet <- tibble::rownames_to_column(austin_msa_fishnet, "fishnetID")


fishnetPopulation00 <-
  st_interpolate_aw(austinPop00["pop_2000"],austinMSA_fishnet, extensive=TRUE) %>%
  as.data.frame(.)

fishnetPopulation00 <- tibble::rownames_to_column(fishnetPopulation00, "row_names")


fishnetPopulation00 <- fishnetPopulation00 %>%
  left_join(austinMSA_fishnet, ., by=c("fishnetID"='row_names')) %>% 
  mutate(pop_2000 = replace_na(pop_2000,0)) %>%
  dplyr::select(pop_2000)


fishnetPopulation10 <-
  st_interpolate_aw(austinPop10["pop_2010"],austinMSA_fishnet, extensive=TRUE) %>%
  as.data.frame(.)

fishnetPopulation10 <- tibble::rownames_to_column(fishnetPopulation10, "row_names")

fishnetPopulation10 <- fishnetPopulation10 %>%
  left_join(austinMSA_fishnet, ., by=c("fishnetID"='row_names')) %>% 
  mutate(pop_2010 = replace_na(pop_2010,0)) %>%
  dplyr::select(pop_2010)

fishnetPopulation <- 
  cbind(fishnetPopulation00,fishnetPopulation10) %>%
  dplyr::select(pop_2000,pop_2010) %>%
  mutate(pop_Change = pop_2010 - pop_2000)

grid.arrange(
  ggplot() +
    geom_sf(data=austinPop10, aes(fill=factor(ntile(pop_2010,5))),colour=NA) +
    scale_fill_manual(values = palette5,
                      labels=substr(quintileBreaks(austinPop10,"pop_2010"),1,4),
                      name="Quintile\nBreaks") +
    labs(title="Population, Austin MSA: 2010",
         subtitle="Represented as tracts; Boundaries omitted") +
    mapTheme(),
  
  ggplot() +
    geom_sf(data=fishnetPopulation, aes(fill=factor(ntile(pop_2010,5))),colour=NA) +
    scale_fill_manual(values = palette5,
                      labels=substr(quintileBreaks(fishnetPopulation,"pop_2010"),1,4),
                      name="Quintile\nBreaks") +
    labs(title="Population, Austin MSA: 2010",
         subtitle="Represented as fishnet gridcells; Boundaries omitted") +
    mapTheme(), ncol=2)

3.1 Highway Distance

This variable represents new development of the study area. The National Highway System (NHS) data was downloaded from Texas Department of Transportation website. It contains up-to-date interstates, principal arterial, strategic highway network, major strategic sighway setwork connectors, and intermodal connectors. Accessibility is an important determinant of urban growth and sprawl.

Below, new development is mapped with the 2020 highway system of Austin MSA.

TXhighway <-  read_sf("TXhighway/TxDOT_National_Highway_System.shp")
TXhighway <- st_transform(TXhighway, crs = "ESRI:102739")


Austin_highway <- st_crop(x = TXhighway, y = austin_msa)

st_write(Austin_highway, dsn = "Austin_highway.shp", layer = "Austin_highway.shp", 
         driver = "ESRI Shapefile", overwrite = TRUE)


austinHighways <-
  read_sf("Austin_highway.shp") %>%
  st_transform(st_crs(austin_msa)) %>%
  st_intersection(austin_msa)

ggplot() +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2],colour=lc_change),size=1.5) +
  geom_sf(data=austinHighways) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","New Development")) +
  labs(title = "New Development and Highways",
       subtitle = "As fishnet centroids") +
  mapTheme()

3.2 Distance to highways

Accessibility is an important determinant of urban growth and sprawl. Distance to highways represents the degree of accessibility. Here, “Near_Distance” function in ArcGIS was used to calculate each fishnet grid cell’s distance to its nearest highways. distance.shp was in ArcGIS and imported to R.

distance_shp <- st_read("distance.shp")


ggplot() +
  geom_sf(data=austin_msa) +
  geom_point(data=distance_shp, aes(x=xyC(distance_shp)[,1], 
                                    y=xyC(distance_shp)[,2], 
                                    colour=factor(ntile(NEAR_DIST,5))),size=1.5) +
  scale_colour_manual(values = palette5,
                      labels=substr(quintileBreaks(distance_shp,"NEAR_DIST"),1,4),
                      name="Quintile\nBreaks") +
  geom_sf(data=austinHighways, colour = "red") +
  labs(title = "Distance to Highways",
       subtitle = "As fishnet centroids; Highways visualized in red") +
  mapTheme()

3.3 The Spatial lag of development

The last variable represents the spatial lag of development. This model hypothesizes that the existing development trend is sticky, limiting future development boundaries to already developed areas. So, the variable calculates (nn_function) the distance to existing development. Under this assumption, grid cells with lower values are more likely to develop.

nn_function <- function(measureFrom,measureTo,k) {
  #convert the sf layers to matrices
  measureFrom_Matrix <-
    as.matrix(measureFrom)
  measureTo_Matrix <-
    as.matrix(measureTo)
  nn <-   
    get.knnx(measureTo, measureFrom, k)$nn.dist
  output <-
    as.data.frame(nn) %>%
    rownames_to_column(var = "thisPoint") %>%
    gather(points, point_distance, V1:ncol(.)) %>%
    arrange(as.numeric(thisPoint)) %>%
    group_by(thisPoint) %>%
    summarize(pointDistance = mean(point_distance)) %>%
    arrange(as.numeric(thisPoint)) %>% 
    dplyr::select(-thisPoint) %>%
    pull()
  
  return(output)  
}

fishnet$lagDevelopment <-
  nn_function(xyC(fishnet),
              xyC(filter(aggregatedRasters,developed==1)),
              2)

ggplot() +
  geom_sf(data=austin_msa) +
  geom_point(data=fishnet, 
             aes(x=xyC(fishnet)[,1], y=xyC(fishnet)[,2], 
                 colour=factor(ntile(lagDevelopment,5))), size=1.5) +
  scale_colour_manual(values = palette5,
                      labels=substr(quintileBreaks(fishnet,"lagDevelopment"),1,7),
                      name="Quintile\nBreaks") +
  labs(title = "Spatial lag to 2001 development",
       subtitle = "As fishnet centroids") +
  mapTheme()

3.4 MSA Counties

Austin MSA county geometries were downloaded using the tigris package. This will be used in later county-level analysis.

options(tigris_class = "sf")

studyAreaCounties <- 
  counties("Texas") %>%
  st_transform(st_crs(austin_msa)) %>%
  dplyr::select(NAME) %>%
  .[st_buffer(austin_msa,-10000), , op=st_intersects]

ggplot() +
  geom_sf(data=studyAreaCounties) +
  labs(title = "Study area counties") +
  mapTheme()

3.5 Final Dataset

Now, all the feature layers are gathered into a final dataset.

dat <- 
  cbind(fishnet, distance_shp, fishnetPopulation, aggregatedRasters) %>%
  dplyr::select(lc_change, developed, forest, farm, wetlands, otherUndeveloped, water,
                pop_2000, pop_2010, pop_Change, NEAR_DIST,lagDevelopment) %>%
  st_join(studyAreaCounties) %>%
  mutate(developed10 = ifelse(lc_change == 1 & developed == 1, 0, developed)) %>%
  filter(water == 0)

4 Exploratory Analysis

In this section, we can explore the weight of each independent variable on dependent variable - development change. As confirmed in below table, a land conversion rate of Austin MSA is much higher than Houston MSA’s (Steif, 2019).

dat %>%
  dplyr::select(NEAR_DIST,lagDevelopment,lc_change) %>%
  gather(Variable, Value, -lc_change, -geometry) %>%
  ggplot(., aes(lc_change, Value, fill=lc_change)) + 
  geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
  facet_wrap(~Variable) +
  scale_fill_manual(values = palette2,
                    labels=c("No Change","New Development"),
                    name="") +
  labs(title="New development as a function of the countinuous variables") +
  plotTheme()

dat %>%
  dplyr::select(pop_2000,pop_2010,pop_Change,lc_change) %>%
  gather(Variable, Value, -lc_change, -geometry) %>%
  ggplot(., aes(lc_change, Value, fill=lc_change)) + 
  geom_bar(position = "dodge", stat = "summary", fun.y = "mean") +
  facet_wrap(~Variable) +
  scale_fill_manual(values = palette2,
                    labels=c("No Change","New Development"),
                    name="") +
  labs(title="New development as a function of factor variables") +
  plotTheme()

dat %>%
  dplyr::select(lc_change:otherUndeveloped,developed) %>%
  gather(Land_Cover_Type, Value, -lc_change, -geometry) %>%
  st_set_geometry(NULL) %>%
  group_by(lc_change, Land_Cover_Type) %>%
  summarize(n = sum(as.numeric(Value))) %>%
  ungroup() %>%
  mutate(Conversion_Rate = paste0(round(100 * n/sum(n), 2), "%")) %>%
  filter(lc_change == 1) %>%
  dplyr::select(Land_Cover_Type,Conversion_Rate) %>%
  kable() %>% kable_styling(full_width = F)

5 Predicting for 2010

5.1 Modeling

Six logistic regression models were trained and tested on the dataset. Model6, which uses 2001 land cover types, spatial lag development, and distance to the highways had the highest McFadden R-Squared value. Therefore, model6 was employed for final prediction.

set.seed(3456)
trainIndex <- 
  createDataPartition(dat$developed, p = .50,
                      list = FALSE,
                      times = 1)
datTrain <- dat[ trainIndex,]
datTest  <- dat[-trainIndex,]

nrow(dat)

Model1 <- glm(lc_change ~ wetlands + forest  + farm + otherUndeveloped, 
              family="binomial"(link="logit"), data = datTrain)

Model2 <- glm(lc_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment, 
              family="binomial"(link="logit"), data = datTrain)

Model3 <- glm(lc_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_2000, 
              family="binomial"(link="logit"), data = datTrain)          

Model4 <- glm(lc_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment + pop_2010, 
              family="binomial"(link="logit"), data = datTrain)              

Model5 <- glm(lc_change ~ wetlands + forest  + farm + otherUndeveloped + pop_Change + lagDevelopment, 
              family="binomial"(link="logit"), data = datTrain)              

Model6 <- glm(lc_change ~ wetlands + forest  + farm + otherUndeveloped + lagDevelopment+ NEAR_DIST, 
              family="binomial"(link="logit"), data = datTrain) 



modelList <- paste0("Model", 1:6)
map_dfc(modelList, function(x)pR2(get(x)))[4,] %>%
  setNames(paste0("Model",1:6)) %>%
  gather(Model,McFadden) %>%
  ggplot(aes(Model,McFadden)) +
  geom_bar(stat="identity") +
  labs(title= "McFadden R-Squared by Model") +
  plotTheme()

The below histogram visualizes the distribution of predicted probabilities by observed classes: “no change” and “new development”. Here, two thresholds for the highest accuracy can be found: 0.4 and 0.5.

testSetProbs <- 
  data.frame(class = datTest$lc_change,
             probs = predict(Model6, datTest, type="response")) 

ggplot(testSetProbs, aes(probs)) +
  geom_density(aes(fill=class), alpha=0.5) +
  scale_fill_manual(values = palette2,
                    labels=c("No Change","New Development")) +
  labs(title = "Histogram of test set predicted probabilities",
       x="Predicted Probabilities",y="Density") +
  plotTheme()

6 Accuracy

As confirmed in below table, the threshold of 50% has a higher accuracy but also has a higher specificity. Specificity can be translated as the True Negative rate, so the model with a higher specificity can better predict “no change”. The model with a higher sensitivity (or the True Positive rate) can better predict “new development”.

options(yardstick.event_first = FALSE)

testSetProbs <- 
  testSetProbs %>% 
  mutate(predClass_40 = as.factor(ifelse(testSetProbs$probs >= 0.4,1,0)),
         predClass_50 = as.factor(ifelse(testSetProbs$probs >= 0.5,1,0))) 

testSetProbs %>%
  dplyr::select(-probs) %>%
  gather(Variable, Value, -class) %>%
  group_by(Variable) %>%
  summarize(Sensitivity = round(yardstick::sens_vec(class,factor(Value)),2),
            Specificity = round(yardstick::spec_vec(class,factor(Value)),2),
            Accuracy = round(yardstick::accuracy_vec(class,factor(Value)),2)) %>% 
  kable() %>%
  kable_styling(full_width = F)

If the planning project focuses more on predicting new developments, Threshold_40_Pct map below is a better option as it can predict sensitivity with the accuracy rate of 68%.

predsForMap <-         
  dat %>%
  mutate(probs = predict(Model6, dat, type="response") ,
         Threshold_40_Pct = as.factor(ifelse(probs >= 0.4 ,1,0)),
         Threshold_50_Pct =  as.factor(ifelse(probs >= 0.5 ,1,0))) %>%
  dplyr::select(lc_change,Threshold_40_Pct,Threshold_50_Pct) %>%
  gather(Variable,Value, -geometry) %>%
  st_cast("POLYGON")

ggplot() +
  geom_point(data=predsForMap, aes(x=xyC(predsForMap)[,1], y=xyC(predsForMap)[,2], colour=Value)) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("No Change","New Development"),
                      name="") +
  labs(title="Development predictions - Low threshold") + mapTheme()

ConfusionMatrix.metrics <-
  dat %>%
  mutate(probs = predict(Model6, dat, type="response") ,
         Threshold_40_Pct = as.factor(ifelse(probs >= 0.40 ,1,0)),
         Threshold_50_Pct =  as.factor(ifelse(probs >= 0.50 ,1,0))) %>%
  mutate(TrueP_40 = ifelse(lc_change  == 1 & Threshold_40_Pct == 1, 1,0),
         TrueN_40 = ifelse(lc_change  == 0 & Threshold_40_Pct == 0, 1,0),
         TrueP_50 = ifelse(lc_change  == 1 & Threshold_50_Pct == 1, 1,0),
         TrueN_50 = ifelse(lc_change  == 0 & Threshold_50_Pct == 0, 1,0)) %>%
  dplyr::select(., starts_with("True")) %>%
  gather(Variable, Value, -geometry) %>%
  st_cast("POLYGON") 

ggplot(data=ConfusionMatrix.metrics) +
  geom_point(aes(x=xyC(ConfusionMatrix.metrics)[,1], 
                 y=xyC(ConfusionMatrix.metrics)[,2], colour = as.factor(Value))) +
  facet_wrap(~Variable) +
  scale_colour_manual(values = palette2, labels=c("Correct","Incorrect"),
                      name="") +
  labs(title="Development predictions - Low threshold") + mapTheme()

6.1 Generalizability

Below examines whether the model can be applied to other random features with a different geographical extent. In this section, the function called spatialCV tests the model to each county. The table shows how the model performs in county-level analysis.

#counties

spatialCV <- function(dataFrame, uniqueID, dependentVariable, modelName) {
  
  #initialize a data frame 
  endList <- list()
  
  #create a list that is all the spatial group unqiue ids in the data frame (ie counties)    
  uniqueID_List <- unique(dataFrame[[uniqueID]])  
  x <- 1
  y <- length(uniqueID_List)
  
  #create a counter and while it is less than the number of counties...  
  while(x <= y) 
  {
    #call a current county    
    currentUniqueID <- uniqueID_List[x]
    #create a training set comprised of units not in that county and a test set of units
    #that are that county
    training <- dataFrame[ which(dataFrame[[uniqueID]] != uniqueID_List[x]),]
    testing <- dataFrame[ which(dataFrame[[uniqueID]] == uniqueID_List[x]),]
    #create seperate xy vectors
    trainingX <- training[ , -which(names(training) %in% c(dependentVariable))]
    testingX <- testing[ , -which(names(testing) %in% c(dependentVariable))]
    
    trainY <- training[[dependentVariable]]
    testY <- testing[[dependentVariable]]
    #Calculate predictions on the test county as part of a data frame including the observed
    #outcome and the unique county ID    
    thisPrediction <- 
      data.frame(class = testY,
                 probs = predict(modelName, testingX, type="response"),
                 county = currentUniqueID) 
    
    #Row bind the predictions to a data farme
    endList <- rbind(endList, thisPrediction)
    #iterate counter    
    x <- x + 1 
  } 
  #return the final list of counties and associated predictions  
  return (as.data.frame(endList))
}

dat01 <- na.omit(dat) 

spatialCV_counties <-
  spatialCV(dat01,"NAME","lc_change", Model6) %>%
  mutate(predClass = as.factor(ifelse(probs >= 0.4 ,1,0)))


spatialCV_metrics <-
  spatialCV_counties %>% 
  group_by(county) %>% 
  summarize(Observed_Change = sum(as.numeric(as.character(class))),
            Sensitivity = round(yardstick::sens_vec(class,predClass),2),
            Specificity = round(yardstick::spec_vec(class,predClass),2),
            Accuracy = round(yardstick::accuracy_vec(class,predClass),2)) 

spatialCV_metrics %>%
  kable() %>%
  kable_styling(full_width = F)

7 Demand Projection

7.1 Population Change

Population change data between 2010 and 2020 is an important contributor in forecasting land cover demand for 2020. Therefore, 2020 population estimates by county was retrieved from https://www.texas-demographics.com/counties_by_population.

dat <-
  dat %>%
  mutate(lagDevelopment = nn_function(xyC(.), xyC(filter(.,developed10 == 1)),2))

countyPopulation_2020 <- 
  data.frame(
    NAME = 
      c("Williamson", "Travis", "Hays", "Caldwell", "Bastrop"),
    county_projection_2020 = 
      c(547604, 1226805, 213366, 42144, 84522)) %>%
  left_join(
    dat %>%
      st_set_geometry(NULL) %>%
      group_by(NAME) %>%
      summarize(county_population_2010 = round(sum(pop_2010))))

countyPopulation_2020 %>%
  gather(Variable,Value, -NAME) %>%
  ggplot(aes(reorder(NAME,-Value),Value)) +
  geom_bar(aes(fill=Variable), stat = "identity") +
  scale_fill_manual(values = palette2,
                    labels=c("2020","2010"),
                    name="Population") +
  labs(title="Population Change by county: 2010 - 2020",
       x="County", y="Population") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  plotTheme()

7.2 Predicting development demand

2020 population data is distributed to each grid cell but weighted by the cell’s existing population data to consider the spatial lag in development. Then, 2010 population is subtracted to measure the likelihood of infill development.

dat_infill <-
  dat %>%
  #calculate population change
  left_join(countyPopulation_2020) %>%
  mutate(proportion_of_county_pop = pop_2010 / county_population_2010,
         pop_2020.infill = proportion_of_county_pop * county_projection_2020,
         pop_Change = round(pop_2020.infill - pop_2010),2) %>%
  dplyr::select(-county_projection_2020, -county_population_2010, 
                -proportion_of_county_pop, -pop_2020.infill) %>%
  #predict for 2020
  mutate(predict_2020.infill = predict(Model6,. , type="response"))

dat_infill %>%
  ggplot() +  
  geom_point(aes(x=xyC(dat_infill)[,1], y=xyC(dat_infill)[,2], colour = factor(ntile(predict_2020.infill,5)))) +
  scale_colour_manual(values = palette5,
                      labels=substr(quintileBreaks(dat_infill,"predict_2020.infill"),1,4),
                      name="Quintile\nBreaks") +
  geom_sf(data=studyAreaCounties, fill=NA, colour="black", size=1.5) +
  labs(title= "Development Demand in 2020: Predicted Probabilities") +
  mapTheme()

8 Evaluating Development Suitability

The last step is to evaluate the development suitability of study area based on predicted supply and demand. In addition to previous analyses, environmental sensitivity must be taken into consideration for resilient urban planning. By integrating these three components, planners can see which areas have both future demand and supply for development while having suitable land for development.

8.1 Environmental Sensitivity

In this project, we assume environmentally sensitive land as wetland and forest. Environmental sensitivity is determined by measuring the total amount of wetlands and fores and the amount of lost sensitive land between 2001 and 2011. The lost sensitive land between 2001 and 2011 is marked in blue in the following map.

# 2011 land cover data

lc_2011_box <- raster("NLCD_2011_Land_Cover_L48_20190424_F71SnxdMU77CTzp6Z4h4.tiff")

lc_2011_box <- projectRaster(from = lc_2011_box, crs = "ESRI:102739")


lc_2011 <- crop(x = lc_2011_box, y = austin_msa)

writeRaster(lc_2011, 
            filename = ("lc_2011.tif"), 
            overwrite = TRUE)

lc_2011 <- raster("lc_2011.tif")




developed11 <- lc_2011 == 21 | lc_2011 == 22 | lc_2011 == 23 | lc_2011 == 24
forest11 <- lc_2011 == 41 | lc_2011 == 42 | lc_2011 == 43 
farm11 <- lc_2011 == 81 | lc_2011 == 82 
wetlands11 <- lc_2011 == 90 | lc_2011 == 95 
otherUndeveloped11 <- lc_2011 == 52 | lc_2011 == 71 | lc_2011 == 31 
water11 <- lc_2011 == 11

names(developed11) <- "developed11"
names(forest11) <- "forest11"
names(farm11) <- "farm11"
names(wetlands11) <- "wetlands11"
names(otherUndeveloped11) <- "otherUndeveloped11"
names(water11) <- "water11"




theRasterList11 <- c(developed11,forest11,farm11,wetlands11,otherUndeveloped11,water11)

dat2 <-
  aggregateRaster(theRasterList11, dat) %>%
  dplyr::select(developed11,forest11,farm11,wetlands11,otherUndeveloped11,water11) %>%
  st_set_geometry(NULL) %>%
  bind_cols(.,dat) %>%
  st_sf() %>%
  st_cast("POLYGON")


dat2 <-
  dat2 %>%
  mutate(sensitive_lost11 = ifelse(forest == 1 & forest11 == 0 |
                                     wetlands == 1 & wetlands11 == 0,1,0))

ggplot() +
  geom_point(data=dat2, aes(x=xyC(dat2)[,1], y=xyC(dat2)[,2], colour=as.factor(sensitive_lost11))) +
  scale_colour_manual(values = palette2,
                      labels=c("No Change","Sensitive Lost"),
                      name = "") +
  labs(title = "Sensitive lands lost: 2001 - 2011",
       subtitle = "As fishnet centroids") +
  mapTheme()

8.2 Allocation

In this stage, we can determine development rights of the land depending on predicted demand, predicted supply, and environmental sensitivity. The map below shows the predicted development demand and suitable land for development of Williamson County in Austin MSA. The data suggests that Williamson County has a high potential for urban growth. There is a considerable portion of environmentally suitable land that is not yet developed but projected to have a high demand for development.

Williamson <-
  dat2 %>%
  mutate(Development_Demand = predict(Model6, dat2, type="response")) %>%
  filter(NAME == "Williamson") 

Will_landUse <- rbind(
  filter(Williamson, forest11 == 1 | wetlands11 == 1 ) %>%
    dplyr::select() %>% mutate(Land_Use = "Not Suitable"),
  filter(Williamson, developed11 == 1) %>%
    dplyr::select() %>% mutate(Land_Use = "Developed"))

grid.arrange(
  ggplot() +
    geom_sf(data=Williamson, aes(fill=factor(ntile(Development_Demand,5))), colour=NA) +
    geom_point(data=Will_landUse, aes(x=xyC(Will_landUse)[,1], 
                                          y=xyC(Will_landUse)[,2], colour=Land_Use),
               shape = 15, size = 0.1) +
    geom_sf(data=st_intersection(austinHighways,filter(studyAreaCounties, NAME=="Williamson")), size=0.5) +
    scale_fill_manual(values = palette5, name="Development\nDemand",
                      labels=substr(quintileBreaks(Williamson,"Development_Demand"),1,5)) +
    scale_colour_manual(values = c("black","red")) + 
    labs(title = "Development Potential, 2020: Williamson") + mapTheme() +
    guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2)),
  
  ggplot() +
    geom_sf(data=Williamson, aes(fill=factor(ntile(pop_Change,5))), colour=NA) +
    geom_point(data=Will_landUse, aes(x=xyC(Will_landUse)[,1], 
                                          y=xyC(Will_landUse)[,2], colour=Land_Use),
               shape = 15, size = 0.1) +
    geom_sf(data=st_intersection(austinHighways,filter(studyAreaCounties, NAME=="Williamson")), size=0.5) +
    scale_fill_manual(values = palette5, name="Population\nChange",
                      labels=substr(quintileBreaks(Williamson,"pop_Change"),1,5)) +
    scale_colour_manual(values = c("black","red")) + 
    labs(title = "Projected Population, 2020: Williamson") + mapTheme() +
    guides(fill = guide_legend(order = 1), colour = guide_legend(order = 2)), ncol=2)

Urban Growth Modeling - Austin Metropolitan Area

Claire Ahn

May 8, 2021