Chapter 1 CASE STUDY #1

Analysis of McDonald & Starbucks locations in NY ZIP codes

1.1 Data and Package

First, you need to install few packages.

Needed <- c("ggmap", "ggplot2", "zipcode", "gridExtra", "caTools", "ROCR", 
            "rpart", "rpart.plot", "glmnet", "useful")   
install.packages(Needed, dependencies=TRUE)   

Then load the packages into environment

library(ggmap)
library(ggplot2)
library(zipcode)
library(gridExtra)
library(caTools)
library(ROCR)
library(rpart)
library(rpart.plot)
library(glmnet)
library(useful)
library(rattle)
## Warning: Failed to load RGtk2 dynamic library, attempting to install it.

Then, load the data to analyze into environment. (Later we will provide a link to download the data either from a package or vishalsingh.org)

load("ny.RData")

To ensure the data is properly loaded.

head(ny)
##     ZIP   primary_city state1         county1    logpop female collegeplus
## 1  6390 Fishers Island     NY  Suffolk County  5.634790  54.64       43.72
## 2 10001       New York     NY New York County  9.956886  52.25       66.42
## 3 10002       New York     NY New York County 11.306332  52.22       28.10
## 4 10003       New York     NY New York County 10.918537  52.68       75.80
## 5 10004       New York     NY New York County  7.864804  40.55       78.39
## 6 10005       New York     NY New York County  8.504108  51.19       85.88
##   black  white asian under18 over65 single divorced samesex lessHS poverty
## 1  0.00 100.00  0.00   16.79   8.57  26.29     2.39    0.00   0.00    6.98
## 2 10.46  63.85 16.77   11.03  11.93  62.48    10.50    2.86   8.61   13.37
## 3  6.95  30.48 46.02   15.78  17.36  40.70     9.86    0.66  37.79   28.41
## 4  3.57  76.85 14.08    6.71  10.68  63.75     8.13    2.29   4.49    4.73
## 5  6.11  71.89 15.67   11.87   3.00  53.82     6.82    3.37   1.61    6.88
## 6  1.03  75.36 20.55    6.06   0.43  63.37     3.66    3.28   2.08    1.28
##   foreign logincome HouseValue N_Starbucks N_McDonald Has_Starbucks
## 1    6.79  11.92669     650000           0          0             0
## 2   23.73  11.12424     632400          13          8             1
## 3   44.18  10.38613     572000           1          2             1
## 4   19.65  11.39190     802300           8          4             1
## 5   27.50  11.75546     886800           4          1             1
## 6   27.78  11.67746    1000001           5          0             1
##   Has_McDonald                Has_any Validation log.HouseValue
## 1            0                Neither   Training       13.38473
## 2            1 Both McD and Starbucks   Training       13.35728
## 3            1 Both McD and Starbucks   Training       13.25689
## 4            1 Both McD and Starbucks   Training       13.59524
## 5            1 Both McD and Starbucks   Training       13.69537
## 6            0              Starbucks   Training       13.81551

1.2 Clean up geo information

To visualize the data, we need clean geo identification. Here we will use zipcode package.

ny$zip = clean.zipcodes(ny$ZIP)
data(zipcode)
ny=merge(ny,zipcode,by.x="zip",by.y="zip")

Let’s examine data points on map and see the zipcodes we have in our dataset

ggmap(get_map("New York, New York",zoom=12,color = "bw")) + 
  geom_point(data=ny, aes(x=longitude,y=latitude),  color='red') 

Now let’s check the locations of Mcdonald and Starbuck stores

ggmap(get_map("New York, New York",zoom=12,color = "bw"))+ 
  geom_point(data=ny, aes(x=longitude,y=latitude,group=city,colour=factor(Has_any)), size=4) + 
  labs(colour="Store Location")

1.3 Decision Tree Model

The validation column in data defines rows to use to train and test model. We now run a simple decision model.

Train <- ny[ny$Validation=="Training",]
Test <-ny[ny$Validation=="Validation",]

If the validation column does not exist in the data, you will need to create it. Here is a typical way to create validation column.

#set.seed(3000)
#spl = sample.split(ny$logHouseValue, SplitRatio=0.7)
#Train = subset(ny, spl==TRUE)
#Test = subset(ny, spl==FALSE)

Let’s run a CART model.

tree1<- rpart(Has_McDonald ~logpop+logincome+log.HouseValue+black+over65+lessHS+female +single+poverty, 
      data=Train, method="class")
      
prp(tree1)

We can optimize our tree using fancyRpartPlot(). This function is from rattle package.

fancyRpartPlot(tree1)