7 Review

The best way to review for the midterm will be to read the notes in this book and to carefully go through the problem set questions.

All of the questions in the problem sets are designed to be straight-forward applications of the class material. Oftentimes they are just class examples with re-named variables. Try to think about what different parts of the examples are doing. Try to edit things and see if they still work. For ex. an example makes deciles of population density, what if I made deciles of income instead?

The midterm questions will similarly be straight-forward applications of class examples. There may be cases where things are presented in a new form or combine concepts, but if you have a grasp of the in-class and problem set code there will be absolutely nothing surprising about the midterm.

My other tips for the midterm are: knit your rmarkdown often so you don’t get stuck at the end; read every question fully before starting; pay attention to extra information I’ve provided; pay attention to the words I use in the questions and how those words relate to the course material.

7.1 Review of Loops

acs <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/ACSCountyData.csv")

Let’s try another use: What is the correlation of population density and percent commuting by car within each state?

To start with, what is the correlation for all the counties? I can get that via the cor function. When using this function I add the argument use="pairwise.complete" to tell R to only worry about cases where we have values for both variables.

cor(acs$population.density, acs$percent.car.commute, use="pairwise.complete")
#> [1] -0.3972397

Predictably, there is a negative correlation. As population density gets higher the percent of people using a car to commute gets lower.

But we might be curious if that correlation is different in different places. Maybe in the South where there is less investment in public transportation there will be a lower correlation?

How would we determine this correlation just for Alabama?

To do so I need to subset both variables:

cor(acs$population.density[acs$state.abbr=="AL"],
    acs$percent.car.commute[acs$state.abbr=="AL"])
#> [1] -0.2173754

Now put this in a loop and save the results to make graphing easier. In the same loop I’m also going to record what the Census region of each state is.

states <- unique(acs$state.abbr)
within.state.cor <- rep(NA, length(states))
census.region <- rep(NA, length(states))

for(i in 1:length(states)){
  within.state.cor[i] <- cor(acs$population.density[acs$state.abbr==states[i]],
                             acs$percent.car.commute[acs$state.abbr==states[i]],
                             use="pairwise.complete")
  #Why do I need to use unique here?
  census.region[i] <- unique(acs$census.region[acs$state.abbr==states[i]])
}

cbind(states,within.state.cor, census.region)
#>       states within.state.cor      census.region
#>  [1,] "AL"   "-0.2173754217653"    "south"      
#>  [2,] "AK"   "0.306765991557315"   "west"       
#>  [3,] "AZ"   "-0.0978290084303064" "west"       
#>  [4,] "AR"   "0.105742290830552"   "south"      
#>  [5,] "CA"   "-0.653978927772517"  "west"       
#>  [6,] "CO"   "-0.0323180568609168" "west"       
#>  [7,] "CT"   "-0.67901228835113"   "northeast"  
#>  [8,] "DE"   "-0.886927793640683"  "south"      
#>  [9,] "DC"   NA                    "south"      
#> [10,] "FL"   "-0.259511132475236"  "south"      
#> [11,] "GA"   "-0.389881360777326"  "south"      
#> [12,] "HI"   "0.104587443762592"   "west"       
#> [13,] "ID"   "0.169041971631012"   "west"       
#> [14,] "IL"   "-0.685983927063127"  "midwest"    
#> [15,] "IN"   "-0.199856034276139"  "midwest"    
#> [16,] "IA"   "0.112736925807877"   "midwest"    
#> [17,] "KS"   "0.243180720391614"   "midwest"    
#> [18,] "KY"   "-0.294974731369198"  "south"      
#> [19,] "LA"   "-0.615420891410212"  "south"      
#> [20,] "ME"   "-0.197529385128518"  "northeast"  
#> [21,] "MD"   "-0.756892276791218"  "south"      
#> [22,] "MA"   "-0.889129033490708"  "northeast"  
#> [23,] "MI"   "0.101872768345178"   "midwest"    
#> [24,] "MN"   "-0.21613646697179"   "midwest"    
#> [25,] "MS"   "0.029415641383925"   "south"      
#> [26,] "MO"   "-0.201194925123038"  "midwest"    
#> [27,] "MT"   "0.440739634106128"   "west"       
#> [28,] "NE"   "0.175310702653679"   "midwest"    
#> [29,] "NV"   "0.382886407964416"   "west"       
#> [30,] "NH"   "0.310082253891291"   "northeast"  
#> [31,] "NJ"   "-0.971976737123834"  "northeast"  
#> [32,] "NM"   "0.0928055499087419"  "west"       
#> [33,] "NY"   "-0.92500515456373"   "northeast"  
#> [34,] "NC"   "-0.321664588357605"  "south"      
#> [35,] "ND"   "0.475260375708698"   "midwest"    
#> [36,] "OH"   "-0.251860957515806"  "midwest"    
#> [37,] "OK"   "0.0301127777870135"  "south"      
#> [38,] "OR"   "-0.475943129470221"  "west"       
#> [39,] "PA"   "-0.842429745927983"  "northeast"  
#> [40,] "RI"   "-0.320574836829983"  "northeast"  
#> [41,] "SC"   "-0.55083792261745"   "south"      
#> [42,] "SD"   "0.377332376588293"   "midwest"    
#> [43,] "TN"   "-0.231591746107272"  "south"      
#> [44,] "TX"   "-0.0698796287685257" "south"      
#> [45,] "UT"   "0.0372286222243687"  "west"       
#> [46,] "VT"   "-0.34891159979597"   "northeast"  
#> [47,] "VA"   "-0.712800662117221"  "south"      
#> [48,] "WA"   "-0.238851566121112"  "west"       
#> [49,] "WV"   "-0.154420314967612"  "south"      
#> [50,] "WI"   "-0.046798738079877"  "midwest"    
#> [51,] "WY"   "0.400225390529855"   "west"

Looks good! Note that we don’t get a correlation for DC because there is only one “county” In DC.

What’s the best way to look at these? Our hypothesis said that states in different regions might have different correlations, so let’s visualize by region.

I’m going to make our results into a dataframe #And remove DC.

st.cor <- cbind.data.frame(states,within.state.cor, census.region)
st.cor <- st.cor[st.cor$states!="DC",]

I want to organize this dataset by region and then by correlation size. To do so we can add a second argument to order:

st.cor <- st.cor[order(st.cor$census.region, st.cor$within.state.cor),]
st.cor
#>    states within.state.cor census.region
#> 14     IL      -0.68598393       midwest
#> 36     OH      -0.25186096       midwest
#> 24     MN      -0.21613647       midwest
#> 26     MO      -0.20119493       midwest
#> 15     IN      -0.19985603       midwest
#> 50     WI      -0.04679874       midwest
#> 23     MI       0.10187277       midwest
#> 16     IA       0.11273693       midwest
#> 28     NE       0.17531070       midwest
#> 17     KS       0.24318072       midwest
#> 42     SD       0.37733238       midwest
#> 35     ND       0.47526038       midwest
#> 31     NJ      -0.97197674     northeast
#> 33     NY      -0.92500515     northeast
#> 22     MA      -0.88912903     northeast
#> 39     PA      -0.84242975     northeast
#> 7      CT      -0.67901229     northeast
#> 46     VT      -0.34891160     northeast
#> 40     RI      -0.32057484     northeast
#> 20     ME      -0.19752939     northeast
#> 30     NH       0.31008225     northeast
#> 8      DE      -0.88692779         south
#> 21     MD      -0.75689228         south
#> 47     VA      -0.71280066         south
#> 19     LA      -0.61542089         south
#> 41     SC      -0.55083792         south
#> 11     GA      -0.38988136         south
#> 34     NC      -0.32166459         south
#> 18     KY      -0.29497473         south
#> 10     FL      -0.25951113         south
#> 43     TN      -0.23159175         south
#> 1      AL      -0.21737542         south
#> 49     WV      -0.15442031         south
#> 44     TX      -0.06987963         south
#> 25     MS       0.02941564         south
#> 37     OK       0.03011278         south
#> 4      AR       0.10574229         south
#> 5      CA      -0.65397893          west
#> 38     OR      -0.47594313          west
#> 48     WA      -0.23885157          west
#> 3      AZ      -0.09782901          west
#> 6      CO      -0.03231806          west
#> 45     UT       0.03722862          west
#> 32     NM       0.09280555          west
#> 12     HI       0.10458744          west
#> 13     ID       0.16904197          west
#> 2      AK       0.30676599          west
#> 29     NV       0.38288641          west
#> 51     WY       0.40022539          west
#> 27     MT       0.44073963          west

And now graph this:

which(st.cor$census.region=="northeast")
#> [1] 13 14 15 16 17 18 19 20 21
which(st.cor$census.region=="south")
#>  [1] 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
which(st.cor$census.region=="west")
#>  [1] 38 39 40 41 42 43 44 45 46 47 48 49 50

plot(1:50, st.cor$within.state.cor, pch=16, ylim=c(-1,1), axes=F,
     xlab="",ylab="Correlation Population Density & Percent Car Commute")
abline(h=0, lty=2, col="gray80")
axis(side=2, at=seq(-1,1,.5), las=2)
axis(side=1, at=1:50, labels=st.cor$states, cex.axis=.5)
abline(v=c(12.5,21.5,37.5), lty=3)
text(c(6,17,29,44), c(1,1,1,1), labels = c("Midwest","Northeast","South","West"))

Here is a second double loop example:

What if we wanted to create a new ACS dataset that took the mean of all these variables for all of the states. So this same dataset, at the state rather than county level.

I’m going to load a fresh copy of ACS because i’ve created a bunch of stuff above:

acs <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/ACSCountyData.csv")
names(acs)
#>  [1] "V1"                      "county.fips"            
#>  [3] "county.name"             "state.full"             
#>  [5] "state.abbr"              "state.alpha"            
#>  [7] "state.icpsr"             "census.region"          
#>  [9] "population"              "population.density"     
#> [11] "percent.senior"          "percent.white"          
#> [13] "percent.black"           "percent.asian"          
#> [15] "percent.amerindian"      "percent.less.hs"        
#> [17] "percent.college"         "unemployment.rate"      
#> [19] "median.income"           "gini"                   
#> [21] "median.rent"             "percent.child.poverty"  
#> [23] "percent.adult.poverty"   "percent.car.commute"    
#> [25] "percent.transit.commute" "percent.bicycle.commute"
#> [27] "percent.walk.commute"    "average.commute.time"   
#> [29] "percent.no.insurance"

What we wantto do is to take averages for all of the variables from population to percent no insurance

We’d want to run code that looks like this, where we want to change both the state and the variable:

mean(acs[acs$state.abbr=="AL", "population"])
#> [1] 72607.16

Let’s create a list of the variables we want means for

vars <- names(acs)[9:28]
vars
#>  [1] "population"              "population.density"     
#>  [3] "percent.senior"          "percent.white"          
#>  [5] "percent.black"           "percent.asian"          
#>  [7] "percent.amerindian"      "percent.less.hs"        
#>  [9] "percent.college"         "unemployment.rate"      
#> [11] "median.income"           "gini"                   
#> [13] "median.rent"             "percent.child.poverty"  
#> [15] "percent.adult.poverty"   "percent.car.commute"    
#> [17] "percent.transit.commute" "percent.bicycle.commute"
#> [19] "percent.walk.commute"    "average.commute.time"

And we know how to make a list of states:

states <- unique(acs$state.abbr)
states
#>  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "DC" "FL" "GA"
#> [12] "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME" "MD" "MA"
#> [23] "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY"
#> [34] "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX"
#> [45] "UT" "VT" "VA" "WA" "WV" "WI" "WY"

How can we take advantage of the loop functionality to index both of these things?

Let’s try this way first (THIS IS WRONG)

i<- 1
mean(acs[acs$state.abbr==states[i], vars[i]])
#> [1] 72607.16

What would happen if we ran this code in a loop?

That doesn’t work! We are telling R to get the mean of the first state for the first variable, then the mean of the second state for the second variable, mean of the third state for the third variable…

mean(acs[acs$state.abbr==states[1], vars[1]])
#> [1] 72607.16
mean(acs[acs$state.abbr==states[2], vars[2]])
#> [1] 7.880517

Instead we need to say: Ok, R. Start with the first variable (population). For this variable, loop through all of the states. Once you’re finished with that, move on to the second variable and loop through all of the states again.

We want to do something like this, where we are preserving the loop that already works, and thenprogressively working through each variable:

for(i in 1:length(states)){
  mean(acs[acs$state.abbr==states[i], "population"], na.rm=T)
}


for(i in 1:length(states)){
  mean(acs[acs$state.abbr==states[i], "population.density"], na.rm=T)
}


for(i in 1:length(states)){
  mean(acs[acs$state.abbr==states[i], "percent.senior"], na.rm=T)
}

To do so, we want to set up an “outer” loop that moves through the variables

We get something like this:

for(j in 1: length(vars)){
  for(i in 1:length(states)){
    mean(acs[acs$state.abbr==states[i], vars[j]], na.rm=T)
  }
}

Notice that we are now indexing by two objects: i & j. i is indexing states, j is indexing variable names.

So R will start the outside loop with j=1, it will then start the inside loop with i=1, and keep running that inside loop until i=51. Then it will go back to the start of the outside loop and make j=2…

But where are we going to save these estimates? We now have to save them in a matrix with both rows and columns.

state.demos <- matrix(NA, nrow=length(states), ncol=length(vars))

Think about what we want to do here: we want to

for(i in 1:length(states)){
  state.demos[i,1] <-  mean(acs[acs$state.abbr==states[i], vars[1]], na.rm=T)
}


for(i in 1:length(states)){
  state.demos[i,2] <-  mean(acs[acs$state.abbr==states[i], vars[2]], na.rm=T)
}


for(i in 1:length(states)){
  state.demos[i,3] <-  mean(acs[acs$state.abbr==states[i], vars[3]], na.rm=T)
}

Notice how the numbers are increasing the same way on the left and right side of the assignment operator

We need to tell R where to save each mean, with the correct row and column:

for(j in 1: length(vars)){
  
  for(i in 1:length(states)){
    state.demos[i,j] <-  mean(acs[acs$state.abbr==states[i], vars[j]], na.rm=T)
  }
  
}

It’s easy, then, to combine this matrix with the state names we have saved, and to use the vars vector to create labels.

state.demos <- cbind.data.frame(states,state.demos)

names(state.demos)[2:21] <- vars

Note, additionally, that we could have organized that loop in the opposite way, making the outer loop index through states, and the inner loop index through variables:

This gives the exact same result:

state.demos2 <- matrix(NA, nrow=length(states), ncol=length(vars))


for(i in 1: length(states)){
  
  for(j in 1:length(vars)){
    state.demos2[i,j] <-  mean(acs[acs$state.abbr==states[i], vars[j]], na.rm=T)
  }
  
}

state.demos2 <- cbind.data.frame(states,state.demos2)

names(state.demos2)[2:21] <- vars

WARNING: THIS ISN’T ACTUALLY HOW YOU WOULD FIND OUT THE AVERAGE FOR EACH STATE. Just looking at the mean of county values for each of these variables assumes each county is of the same population. They obviously are not!

7.2 If statements

The conditional statements we’ve worked with so far have evaluated our data to say, if this condition holds in this row, take this action. Sometimes we want to extend this and say, if a certain condition holds, Run this code

We can do this via if statements. Similar to when we do this for a single line of code, we start with a logical statment – a statement that produces a single T or F, and then R will only run the code if the statement is True.

So for example:

operation <- "add"
if (operation == "add") { 
  print("I'm going to add:")
  4+4
}
#> [1] "I'm going to add:"
#> [1] 8

But if we do the following it won’t work:

operation <- "multiply"

if (operation == "add") { 
  print("I'm going to add:")
  4+4
  
}

Nothing happened here, and that is by design. R thought: in order for me to run this code chunk,this logical statement has to be True, it’s not, so I’ll skip it.

But if we set up a second if statement it will work:

if (operation == "multiply") { 
  print("I'm going to multiply:")
  4*4
}
#> [1] "I'm going to multiply:"
#> [1] 16

We can even run both at the same time and only get the one that evaluates as True.

operation <- "multiply"
if (operation == "add") { 
  print("I'm going to add:")
  4+4
  
}
if (operation == "multiply") { 
  print("I'm going to multiply:")
  4*4
  
}
#> [1] "I'm going to multiply:"
#> [1] 16

operation <- "add"

if (operation == "add") { 
  print("I'm going to add:")
  4+4
  
}
#> [1] "I'm going to add:"
#> [1] 8
if (operation == "multiply") { 
  print("I'm going to multiply:")
  4*4
}

Right now our if statement says: if condition is True, run code, if it’s not, do nothing. But often times we want to say: if condition is True, run code, if it’s not, run this other code.We can accomplish this by adding an “else”

if(2+2==4){
  print("code chunk 1")
} else {
  print("code chunk 2")
}
#> [1] "code chunk 1"

if(2+2==5){
  print("code chunk 1")
} else {
  print("code chunk 2")
}
#> [1] "code chunk 2"

So if we wanted a chunk of code that adds if we tell it to, but otherwise multiplies:

operation <- "add"

if (operation == "add") { 
  print("I'm going to add:")
  4+4
  
} else { 
  print("I'm going to multiply:")
  4*4
}
#> [1] "I'm going to add:"
#> [1] 8

operation <- "subtract"

if (operation == "add") { 
  print("I'm going to add:")
  4+4
  
} else { 
  print("I'm going to multiply:")
  4*4
}
#> [1] "I'm going to multiply:"
#> [1] 16

The nice thing about these is they are infinitely stackable using else if ()

operation  <- "subtract"

if (operation == "add") { 
  print("I'm going to add:")
  4+4
  
} else if (operation=="multiply") { 
  print("I'm going to multiply:")
  4*4
} else {
  print("Please enter a valid operator.")
}
#> [1] "Please enter a valid operator."

Where is this sort of thing helpful? This is a bit of a “you’ll know it when you see it” situation. Where I use them most is inside loops. For example in election work I am often loading in many snapshots of data from across an election night, and I might only run certain functions when a state has reached a certain threshold of completeness.

Here is an example of where it might be helpful. Here are the 2020 presidential election results by county:

pres <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/CountyPresData2020.Rds")

What if I want to record the winning margin in each state. That is: regardless of whether Trump or Biden carried the state, how much did they win by?

First, let’s write code that finds the dem and rep margin in each state:

head(pres)
#>   state county fips.code biden.votes trump.votes
#> 1    AK   ED 1     02901        3477        3511
#> 2    AK  ED 10     02910        2727        8081
#> 3    AK  ED 11     02911        3130        7096
#> 4    AK  ED 12     02912        2957        7893
#> 5    AK  ED 13     02913        2666        4652
#> 6    AK  ED 14     02914        4261        6714
#>   other.votes
#> 1         326
#> 2         397
#> 3         402
#> 4         388
#> 5         395
#> 6         468
pres$total.votes <- pres$biden.votes + pres$trump.votes + pres$other.votes

state <- unique(pres$state)
dem.perc <- NA
rep.perc <- NA

i <- 1
for(i in 1:length(state)){
  dem.perc[i] <- sum(pres$biden.votes[pres$state==state[i]])/ sum(pres$total.votes[pres$state==state[i]])
  rep.perc[i] <- sum(pres$trump.votes[pres$state==state[i]])/ sum(pres$total.votes[pres$state==state[i]])
  
}
results <- cbind.data.frame(state, dem.perc, rep.perc)

Pretty good! But to also calculate the winning margin in each state our code will need to be different dependent on the code of those two columns. Specifically, if the dem is the winner then the winning margin is dem-rep but if the rep is the winner then the winning margin is rep-dem. We can accomplish this via an if statement:

state <- unique(pres$state)
dem.perc <- NA
rep.perc <- NA
winner.margin <- NA
i <- 1
for(i in 1:length(state)){
  dem.perc[i] <- sum(pres$biden.votes[pres$state==state[i]])/ sum(pres$total.votes[pres$state==state[i]])
  rep.perc[i] <- sum(pres$trump.votes[pres$state==state[i]])/ sum(pres$total.votes[pres$state==state[i]])
  
  if(dem.perc[i]>rep.perc[i]){
    winner.margin[i] <- dem.perc[i]- rep.perc[i]
  } else {
    winner.margin[i] <- rep.perc[i]-dem.perc[i]
  }
  
}
results <- cbind.data.frame(state,dem.perc, rep.perc, winner.margin)
results <- results[order(results$winner.margin),]

Why would this have been wrong?

state <- unique(pres$state)
dem.perc <- NA
rep.perc <- NA
winner.margin <- NA
i <- 1
for(i in 1:length(state)){
  dem.perc[i] <- sum(pres$biden.votes[pres$state==state[i]])/ sum(pres$total.votes[pres$state==state[i]])
  rep.perc[i] <- sum(pres$trump.votes[pres$state==state[i]])/ sum(pres$total.votes[pres$state==state[i]])
  
  if(dem.perc[i]>rep.perc[i]){
    winner.margin[i] <- dem.perc[i]- rep.perc[i]
  } 
    winner.margin[i] <- rep.perc[i]-dem.perc[i]
  
}
results <- cbind.data.frame(state,dem.perc, rep.perc, winner.margin)
results <- results[order(results$winner.margin),]