5 Cleaning & Reshaping

When we talk about cleaning data, there are 4 broad steps:

  1. Reshaping and tidying
  2. Filtering and subsetting
  3. Transforming/re-coding/cleaning variables
  4. Aggregating and/or merging the data

Steps 2 and 3 we talked about last week, and we will soon talk about step 4.

Step 1 is what we are going to talk about today, and is a topic with a good amount of complexity.

5.1 Unit of Analysis

I have discussed before the importance of knowing the unit of analysis of the data you are working with. This is really going to come into play now.

Once again, to get the unit of analysis ask: what does each row uniquely represent?

Last week when we were working with the ACS data this was very clear as each row represented one of the approximately 3000 counties in the United States.

But this can get more complicated.

Compare this dataset:

#>   county date_1_1_2021 date_1_2_2021 date_1_3_2021
#> 1     c1           990             9           215
#> 2     c2           140          1000           305
#> 3     c3           797           323           141
#> 4     c4           343            34           645
#>   date_1_4_2021
#> 1           646
#> 2             4
#> 3           260
#> 4           785

To this dataset:

#>    county          date covid.cases
#> 1      c1 date_1_1_2021         990
#> 2      c2 date_1_1_2021         140
#> 3      c3 date_1_1_2021         797
#> 4      c4 date_1_1_2021         343
#> 5      c1 date_1_2_2021           9
#> 6      c2 date_1_2_2021        1000
#> 7      c3 date_1_2_2021         323
#> 8      c4 date_1_2_2021          34
#> 9      c1 date_1_3_2021         215
#> 10     c2 date_1_3_2021         305
#> 11     c3 date_1_3_2021         141
#> 12     c4 date_1_3_2021         645
#> 13     c1 date_1_4_2021         646
#> 14     c2 date_1_4_2021           4
#> 15     c3 date_1_4_2021         260
#> 16     c4 date_1_4_2021         785

First notice: these two datasets contain the exact same information. This is data on covid cases in 4 counties for the first 4 days of 2021. The same 16 case numbers exist in both datasets.

What is the unit of analysis of the first dataset, and what is the unit of analysis of the second dataset?

The first dataset has county as the unit of observation. Each row is a county and every county is only in the dataset once.

The second dataset also has county as a variable, but does county uniquely identify each row? No! There are 4 rows for each county, one for each date. In this case the unit of analysis of the data is county-date.

These two datasets, representing the same data, have two different units of analysis.

More broadly, we might consider these two datasets as representing the same data in both “Wide” and “Long” formats.

  • Wide data has one observation per row, and columns have data for a particular variable spread out over columns.
  • Long data might have observations occur in several rows, but each variable only exists in only one column.

In the above example, the first dataset was wide with respect to county. There is only one row for county, but the data for covid cases was spread across several rows. The second dataset was long with respect to county. Each county existed across multiple rows, but case numbers are represented in only one column.

Here is another example:

#>    district candidate votes
#> 1         1     Biden  1362
#> 2         1     Trump  1073
#> 3         2     Biden  1453
#> 4         2     Trump  1147
#> 5         3     Biden  1481
#> 6         3     Trump  1076
#> 7         4     Biden  1072
#> 8         4     Trump  1495
#> 9         5     Biden  1372
#> 10        5     Trump  1286

What is the unit of analysis of this data? Can it be district? No! there are two rows for every district so district does not uniquely identify rows. What about candidate? No! The candidate also does not uniquely identify rows. Here district-candidate uniquely identifies rows.

And once again, we can represent the same information like this:

#>   district Biden Trump
#> 1        1  1362  1073
#> 2        2  1453  1147
#> 3        3  1481  1076
#> 4        4  1072  1495
#> 5        5  1372  1286

This is the same 10 vote counts now represented in a wide format.

The first dataset is long with respect to district: districts occurred across several rows, but the vote information only existed in one column.

The second dataset is wide with respect to district: districts only exist in one row, but now the vote information is spread across multiple rows.

OK, let’s get even more wild: how would we charecterize this dataset:

#>   candidate    1    2    3    4    5
#> 1     Biden 1362 1453 1481 1072 1372
#> 2     Trump 1073 1147 1076 1495 1286

This is the same information again! Now presented in a new way. Now the unit of analysis is candidate. We would say that this data is wide with respect to candidate. Each candidate is in only one row, and the vote information is spread across multiple columns, one for each district.

Which of these is the “right” way to display this data? There is no such thing! None is right or wrong. The three different versions of this vote data are all helpful for different tasks. Knowing and understanding a particular task requires, and being able to use R to switch between these differently shaped datasets is what is key.

I won’t lie: reshaping is somewhat tricky and I have to look up how to do it pretty much every time I have a task that requires it. But it is also an invaluable step in the data cleaning process.

Let’s jump into how we do reshaping. The tools we are using are also from the tidyr package that we used last week, so we will have to load that in.

First let’s load in the first example from above:

library(tidyr)
dat <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/WideExample.RDS")
dat
#>   county date_1_1_2021 date_1_2_2021 date_1_3_2021
#> 1     c1           990             9           215
#> 2     c2           140          1000           305
#> 3     c3           797           323           141
#> 4     c4           343            34           645
#>   date_1_4_2021
#> 1           646
#> 2             4
#> 3           260
#> 4           785

We are starting here with wide data that we want to make long. There is one row for each of the 4 counties, and the information for cases is spread across 4 columns.

To go from wide to long data we use the gather() command.

Here are the instructions we need to give gather()

#gather(our dataset, 
#       key = what we want to call the new column we are creating that is currently in columns,
#       value = what we want to call the new column that will hold the data currently spread across #columns, 
#       which columns contain the data currently)

Applying to this example:

gather(dat, 
       key=date,
       value=covid.cases, 
       date_1_1_2021:date_1_4_2021)
#>    county          date covid.cases
#> 1      c1 date_1_1_2021         990
#> 2      c2 date_1_1_2021         140
#> 3      c3 date_1_1_2021         797
#> 4      c4 date_1_1_2021         343
#> 5      c1 date_1_2_2021           9
#> 6      c2 date_1_2_2021        1000
#> 7      c3 date_1_2_2021         323
#> 8      c4 date_1_2_2021          34
#> 9      c1 date_1_3_2021         215
#> 10     c2 date_1_3_2021         305
#> 11     c3 date_1_3_2021         141
#> 12     c4 date_1_3_2021         645
#> 13     c1 date_1_4_2021         646
#> 14     c2 date_1_4_2021           4
#> 15     c3 date_1_4_2021         260
#> 16     c4 date_1_4_2021         785

There is nothing special about the names I gave it in this step, I could do:

gather(dat, 
       key=batman,
       value=robin, 
       date_1_1_2021:date_1_4_2021)
#>    county        batman robin
#> 1      c1 date_1_1_2021   990
#> 2      c2 date_1_1_2021   140
#> 3      c3 date_1_1_2021   797
#> 4      c4 date_1_1_2021   343
#> 5      c1 date_1_2_2021     9
#> 6      c2 date_1_2_2021  1000
#> 7      c3 date_1_2_2021   323
#> 8      c4 date_1_2_2021    34
#> 9      c1 date_1_3_2021   215
#> 10     c2 date_1_3_2021   305
#> 11     c3 date_1_3_2021   141
#> 12     c4 date_1_3_2021   645
#> 13     c1 date_1_4_2021   646
#> 14     c2 date_1_4_2021     4
#> 15     c3 date_1_4_2021   260
#> 16     c4 date_1_4_2021   785

(But I wouldn’t do that).

Like separate() from last week, gather() takes our whole dataset as an argument and outputs a whole dataset. As such, we need to save the output:

#>    county          date covid.cases
#> 1      c1 date_1_1_2021         990
#> 2      c2 date_1_1_2021         140
#> 3      c3 date_1_1_2021         797
#> 4      c4 date_1_1_2021         343
#> 5      c1 date_1_2_2021           9
#> 6      c2 date_1_2_2021        1000
#> 7      c3 date_1_2_2021         323
#> 8      c4 date_1_2_2021          34
#> 9      c1 date_1_3_2021         215
#> 10     c2 date_1_3_2021         305
#> 11     c3 date_1_3_2021         141
#> 12     c4 date_1_3_2021         645
#> 13     c1 date_1_4_2021         646
#> 14     c2 date_1_4_2021           4
#> 15     c3 date_1_4_2021         260
#> 16     c4 date_1_4_2021         785

How do we put this back into its original format?

We can do that with the spread() which takes us from long to wide data. The spread command works like:

#spread(the dataset we are operating on, 
#       key = the variable in the dataset we want to make into columns, 
#       value = the variable in the dataset we want to put into the key columns)
#

So for this case:

dat.w <- spread(dat.l,
                key=date,
                value=covid.cases)
dat.w
#>   county date_1_1_2021 date_1_2_2021 date_1_3_2021
#> 1     c1           990             9           215
#> 2     c2           140          1000           305
#> 3     c3           797           323           141
#> 4     c4           343            34           645
#>   date_1_4_2021
#> 1           646
#> 2             4
#> 3           260
#> 4           785

For spread, it does matter what the names we put in key and value are, because we are telling R to go get specific columns in the dataset.

What would we do if we wanted a dataset where the unit of analysis is date? Thinking through the spread command, the key variable is the variable we want to make into the columns. If date is the unit of analysis then we have to put county into the columns:

spread(dat.l,
                key=county,
                value=covid.cases)
#>            date  c1   c2  c3  c4
#> 1 date_1_1_2021 990  140 797 343
#> 2 date_1_2_2021   9 1000 323  34
#> 3 date_1_3_2021 215  305 141 645
#> 4 date_1_4_2021 646    4 260 785

Let’s think about the second dataset

dat2
#>    district candidate votes
#> 1         1     Biden  1362
#> 2         1     Trump  1073
#> 3         2     Biden  1453
#> 4         2     Trump  1147
#> 5         3     Biden  1481
#> 6         3     Trump  1076
#> 7         4     Biden  1072
#> 8         4     Trump  1495
#> 9         5     Biden  1372
#> 10        5     Trump  1286

Here this is long data, we have districts spanning across multiple rows, one for each candidate.

We might want this to be wide data if we want a single summation of the election for each district.

Again, to convert from long to wide we need to tell it what column we want in new columns, and where the values are to fill in those cells are:

dat2.w <- spread(dat2,
                 key=candidate,
                 value=votes)

dat2.w
#>   district Biden Trump
#> 1        1  1362  1073
#> 2        2  1453  1147
#> 3        3  1481  1076
#> 4        4  1072  1495
#> 5        5  1372  1286

We went from the unit of analysis being district-candidates, to the unit of analysis being districts.

Then for example, we could calculate each candidates percent of the total vote:

dat2.w$total.votes <- dat2.w$Biden + dat2.w$Trump
dat2.w$perc.biden <- dat2.w$Biden/dat2.w$total.votes
dat2.w$perc.trump <- dat2.w$Trump/dat2.w$total.votes

dat2.w
#>   district Biden Trump total.votes perc.biden perc.trump
#> 1        1  1362  1073        2435  0.5593429  0.4406571
#> 2        2  1453  1147        2600  0.5588462  0.4411538
#> 3        3  1481  1076        2557  0.5791944  0.4208056
#> 4        4  1072  1495        2567  0.4176081  0.5823919
#> 5        5  1372  1286        2658  0.5161776  0.4838224

You don’t often want to delete data, but let’s say we just wanted to make a nice table and only wanted district, perc.biden, perc.trump, and to mak ethe values rounded.

You can delete variables:

dat2.table <- dat2.w
dat2.table$Biden <- NULL
dat2.table$Trump <- NULL
dat2.table$total.votes <- NULL
dat2.table
#>   district perc.biden perc.trump
#> 1        1  0.5593429  0.4406571
#> 2        2  0.5588462  0.4411538
#> 3        3  0.5791944  0.4208056
#> 4        4  0.4176081  0.5823919
#> 5        5  0.5161776  0.4838224

We can also give a list of variables to keep and subset that way:

to.keep <- c("district","perc.biden","perc.trump")
dat2.table <- dat2.w[to.keep]
dat2.w <- dat2.w[to.keep]

dat2.table
#>   district perc.biden perc.trump
#> 1        1  0.5593429  0.4406571
#> 2        2  0.5588462  0.4411538
#> 3        3  0.5791944  0.4208056
#> 4        4  0.4176081  0.5823919
#> 5        5  0.5161776  0.4838224

To round a variable we use…. round()

round(717.128210, 0)
#> [1] 717

dat2.table$perc.biden <- round(dat2.table$perc.biden*100,2)
dat2.table$perc.trump <- round(dat2.table$perc.trump*100,2)

dat2.table
#>   district perc.biden perc.trump
#> 1        1      55.93      44.07
#> 2        2      55.88      44.12
#> 3        3      57.92      42.08
#> 4        4      41.76      58.24
#> 5        5      51.62      48.38

What if we wanted to study candidates, instead of districts?? It should be pretty clear what to do:

dat.2.candidates <- spread(dat2,
                           key=district,
                           value = votes)
dat.2.candidates
#>   candidate    1    2    3    4    5
#> 1     Biden 1362 1453 1481 1072 1372
#> 2     Trump 1073 1147 1076 1495 1286

Again: there is nothing right or wrong about any of these datasets. All of them just present the same information in different ways. Translating in between them

5.2 A Full Data Cleaning Example

In the second half of this week we are going to go through a full example of cleaning a dataset from start to finish.

library(tidyr)
library(lubridate)
#> Warning: package 'lubridate' was built under R version
#> 4.1.3
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

Is there a relationship between age and voting for Trump among Millenials? We know that there is definitely a relationship in the population as a whole, but does that same logic extend to within an age group that is very anti Trump?

Could definitely see older millenials maybe being more concerned about taxes?

Today we are going to work with some survey data to ultimately answer that question, but we first have to clean that data to be able to work with it.

5.2.1 Reshaping and reformatting data

Today we’ll be working again Generation Forward survey data.

genfor <- rio::import("https://github.com/marctrussler/IDS-Data/raw/main/Genfor.RDS")

The unit of analysis we want to work with are individuals, which in this case are American millenials.

Is it the case that individuals in this dataset are uniquely identified in each row? What other things seem like they need to be cleaned?

head(genfor)
#>     GENF_ID      WEIGHT1 Q0 Q1 approval.party
#> 1 792562474 0.4921835398  2  2     democratic
#> 2 792562474 0.4921835398  2  2     republican
#> 3 794325578 1.2025737846  4  5     democratic
#> 4 794325578 1.2025737846  4  5     republican
#> 5 795569196 0.1318494446  1  5     democratic
#> 6 795569196 0.1318494446  1  5     republican
#>   approval.value Q10A_1 Q10A_2 Q10A_3 Q10A_4 Q10A_5 Q10A_6
#> 1              3      0      0      0      0      0      0
#> 2              2      0      0      0      0      0      0
#> 3              3      0      0      0      0      0      0
#> 4              4      0      0      0      0      0      0
#> 5              2      0      0      0      0      0      0
#> 6              3      0      0      0      0      0      0
#>   Q10A_7 Q10A_8 Q10A_9 Q10A_10 Q10A_11 Q10A_12 Q10A_13
#> 1      0      0      0       0       0       0       0
#> 2      0      0      0       0       0       0       0
#> 3      0      0      0       0       0       0       0
#> 4      0      0      0       0       0       0       0
#> 5      0      0      0       0       0       0       0
#> 6      0      0      0       0       0       0       0
#>   Q10A_14 Q10A_15 Q10A_16 Q10A_17 Q10A_18 Q10A_19 Q10A_20
#> 1       0       0       1       0       0       0       0
#> 2       0       0       1       0       0       0       0
#> 3       0       0       0       0       0       0       0
#> 4       0       0       0       0       0       0       0
#> 5       0       0       0       0       0       0       0
#> 6       0       0       0       0       0       0       0
#>   Q10A_21 Q10A_22 partyid7       date duration     device
#> 1       0       0        5 2017-10-27       11    Desktop
#> 2       0       0        5 2017-10-27       11    Desktop
#> 3       1       0        4 2017-10-27        5 Smartphone
#> 4       1       0        4 2017-10-27        5 Smartphone
#> 5       0       1        3 2017-10-26       11 Smartphone
#> 6       0       1        3 2017-10-26       11 Smartphone
#>   gender age educ state
#> 1      1  18    9    PA
#> 2      1  18    9    PA
#> 3      2  27    6    OK
#> 4      2  27    6    OK
#> 5      2  22   10    NC
#> 6      2  22   10    NC
  1. There are two rows for every observation
  2. The contents of the Q10A variable are spread across 22 columns
  3. The column names are inconsistently formatted and not very descriptive
  4. Most of the variables could be cleaned up a bit so that they’re easier to interpret

Just as a reference for later, I’m going to save an original version of the dataset.

genfor.untouched <- genfor

Just as a proof of the above, we can compare the number of rows with how many unique entries there are for ID:

nrow(genfor)
#> [1] 3752
length(unique(genfor$GENF_ID))
#> [1] 1876

Let’s figure this out.

With regard to the approval variable, is the genforward data formatted as wide or long?

head(genfor[c("GENF_ID","approval.party","approval.value")])
#>     GENF_ID approval.party approval.value
#> 1 792562474     democratic              3
#> 2 792562474     republican              2
#> 3 794325578     democratic              3
#> 4 794325578     republican              4
#> 5 795569196     democratic              2
#> 6 795569196     republican              3

Notice that every value of row 1 equals the value in row 2, except for approval.party and approval.value.

genfor[1,] == genfor[2,] 
#>   GENF_ID WEIGHT1   Q0   Q1 approval.party approval.value
#> 1    TRUE    TRUE TRUE TRUE          FALSE          FALSE
#>   Q10A_1 Q10A_2 Q10A_3 Q10A_4 Q10A_5 Q10A_6 Q10A_7 Q10A_8
#> 1   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#>   Q10A_9 Q10A_10 Q10A_11 Q10A_12 Q10A_13 Q10A_14 Q10A_15
#> 1   TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
#>   Q10A_16 Q10A_17 Q10A_18 Q10A_19 Q10A_20 Q10A_21 Q10A_22
#> 1    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE
#>   partyid7 date duration device gender  age educ state
#> 1     TRUE TRUE     TRUE   TRUE   TRUE TRUE TRUE  TRUE

So with respect to the “approval” variables genfor is a long dataset, where each observation (i.e. each survey respondent) has two rows of data. And their approval ratings for the Democratic and Republican parties are split across the two rows.

Our first step in cleaning the data will be to condense those two rows into one.

To do this, we’ll use the spread() function from the tidyr package. We have to tell the function which variable we’ll want to use as our new column names(the “key”) and which variable contains the values that will populate those new columns (the “value”):

genfor <- spread(genfor,
                 key = approval.party,
                 value = approval.value)

head(genfor)
#>     GENF_ID      WEIGHT1 Q0 Q1 Q10A_1 Q10A_2 Q10A_3 Q10A_4
#> 1 792562474 0.4921835398  2  2      0      0      0      0
#> 2 794325578 1.2025737846  4  5      0      0      0      0
#> 3 795569196 0.1318494446  1  5      0      0      0      0
#> 4 803023033 0.8249129727  2  2      1      0      0      0
#> 5 809729125  0.214907857  4  4      0      0      0      0
#> 6 812940493 0.8648717697  1  5      0      0      1      0
#>   Q10A_5 Q10A_6 Q10A_7 Q10A_8 Q10A_9 Q10A_10 Q10A_11
#> 1      0      0      0      0      0       0       0
#> 2      0      0      0      0      0       0       0
#> 3      0      0      0      0      0       0       0
#> 4      0      0      0      0      0       0       0
#> 5      0      0      0      0      0       0       1
#> 6      0      0      0      0      0       0       0
#>   Q10A_12 Q10A_13 Q10A_14 Q10A_15 Q10A_16 Q10A_17 Q10A_18
#> 1       0       0       0       0       1       0       0
#> 2       0       0       0       0       0       0       0
#> 3       0       0       0       0       0       0       0
#> 4       0       0       0       0       0       0       0
#> 5       0       0       0       0       0       0       0
#> 6       0       0       0       0       0       0       0
#>   Q10A_19 Q10A_20 Q10A_21 Q10A_22 partyid7       date
#> 1       0       0       0       0        5 2017-10-27
#> 2       0       0       1       0        4 2017-10-27
#> 3       0       0       0       1        3 2017-10-26
#> 4       0       0       0       0        7 2017-10-29
#> 5       0       0       0       0        2 2017-11-01
#> 6       0       0       0       0        1 2017-10-29
#>   duration     device gender age educ state democratic
#> 1       11    Desktop      1  18    9    PA          3
#> 2        5 Smartphone      2  27    6    OK          3
#> 3       11 Smartphone      2  22   10    NC          2
#> 4        8 Smartphone      1  32   12    TN          4
#> 5       45 Smartphone      2  25   10    NY          1
#> 6        8 Smartphone      2  28   14    NM          2
#>   republican
#> 1          2
#> 2          4
#> 3          3
#> 4          2
#> 5          3
#> 6          4

Notice the other nice thing about the spread function is that it ignored all variables that were the same across the two rows. The only thing that happened is that we got two new variables.

5.2.2 Dealing with individual variables that are spread across multiple columns

Now let’s deal with the Q10A variable. Let’s give it a quick look:

Occasionally (NOT ALWAYS, OR EVEN USUALLY) datsets will come prepackaged with extra information contained within each variable that will tell you some information about the variable and what the values are. We can access that information using the attributes function

attributes(genfor.untouched$Q10A_1)
#> $label
#> [1] "[Abortion] What do you think is the most important problem facing this country?"
#> 
#> $format.stata
#> [1] "%8.0g"
#> 
#> $labels
#>  No Yes 
#>   0   1
attributes(genfor.untouched$Q10A_2)
#> $label
#> [1] "[National debt] What do you think is the most important problem facing this country?"
#> 
#> $format.stata
#> [1] "%8.0g"
#> 
#> $labels
#>  No Yes 
#>   0   1
attributes(genfor.untouched$Q10A_16)
#> $label
#> [1] "[Taxes] What do you think is the most important problem facing this country?"
#> 
#> $format.stata
#> [1] "%8.0g"
#> 
#> $labels
#>  No Yes 
#>   0   1

So it’s the same question, just split across multiple columns, where each column represents a different issue.

(Most of the time this information about variables will be contained in a separate pdf “codebook”.)

So we’re going to want to collapse these 22 columns into one. To do that we’ll use the gather() function, to move these data with respect to this variable into a long format.

We have to tell it the key (which is the new name we want to give our key variable), the value (which is the new name we want to give our values variable), and a vector of all the columns that we’re collapsing/gathering:

genfor <- gather(genfor,
                 key = top.issue, ## we could have called this anything
                 value = val, ## this could be anything also
                 Q10A_1:Q10A_22)

Now let’s see what we have:

nrow(genfor)
#> [1] 41272
#View(genfor)

We went from 1876 rows to tens of thousands

We’ve transferred the unit of observations into individual issue-importance level. For each individual we have a row for how they answered the issue importance question.

We need to get rid of all those extra rows. We’re not actually interested in when a person does not care about an issue. So we can drop all the rowswhere ‘val’ is zero:

genfor <- genfor[genfor$val != 0,]
nrow(genfor)
#> [1] 2149

That got us most of the way there, but we’ve still got more rows than we should? What’s going on?

#View(genfor)

All the people who didn’t answer the question were coded as 99 for every issue. So now there’s 22 rows for each of those people.

What we want is one row per person. We’ll use the duplicated() function to do this. This function takes a vector and returns a FALSE if it’s the first time this value has shown up in the vector, and a TRUE if it’s appeared before.

head(duplicated(genfor$GENF_ID))
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE

If we use the ! to flip the TRUEs and FALSEs, then we’ll have a vector we can useto ensure that we only have one row per person:

genfor <- genfor[!duplicated(genfor$GENF_ID),]
head(genfor)
#>        GENF_ID      WEIGHT1 Q0 Q1 partyid7       date
#> 4    803023033 0.8249129727  2  2        7 2017-10-29
#> 9    831421601 1.3049241841  2  1        7 2017-10-26
#> 107 1219666547 0.1913640053  3  2        2 2017-10-26
#> 149 1364524431 3.8634938329  2  1        7 2017-10-27
#> 180 1465399167  1.263106426 98 98       -1 2017-11-01
#> 182 1476757020 1.1191845832  3  3        7 2017-10-26
#>     duration     device gender age educ state democratic
#> 4          8 Smartphone      1  32   12    TN          4
#> 9         12    Desktop      2  23   10    CA          4
#> 107        9 Smartphone      1  26   11    IL          2
#> 149        8 Smartphone      2  32    9    UT          4
#> 180     8965 Smartphone      2  25    9    GA         NA
#> 182       10 Smartphone      2  19    9    WA          4
#>     republican top.issue val
#> 4            2    Q10A_1   1
#> 9            1    Q10A_1   1
#> 107          4    Q10A_1   1
#> 149          1    Q10A_1   1
#> 180         NA    Q10A_1  99
#> 182          2    Q10A_1   1

Now we have one last step. We don’t need to keep both ‘top.issue’ and ‘val’. ‘top.issue’ gives us all the information we need, except for the people whowere 99’s in ‘val’. So let’s set ‘top.issue’ to NA for those people, and then we can drop the ‘val’ variable.

genfor$top.issue[genfor$val == 99] <- NA
genfor$val <- NULL 

5.2.3 Cleaning column names

What are some of the issues with our column names?

names(genfor)
#>  [1] "GENF_ID"    "WEIGHT1"    "Q0"         "Q1"        
#>  [5] "partyid7"   "date"       "duration"   "device"    
#>  [9] "gender"     "age"        "educ"       "state"     
#> [13] "democratic" "republican" "top.issue"

Remember that the column names are simply stored as a character vector in one of the attributes of a dataframe.

We can manipulate this attribute just like any other attribute of an object.Let’s use the tolower() function to convert all the column names to lowercase:

names(genfor) <- tolower(names(genfor))

Now let’s give some of the other variables more descriptive names.

names(genfor)
#>  [1] "genf_id"    "weight1"    "q0"         "q1"        
#>  [5] "partyid7"   "date"       "duration"   "device"    
#>  [9] "gender"     "age"        "educ"       "state"     
#> [13] "democratic" "republican" "top.issue"
attributes(genfor.untouched$Q0) # this one is 2016 presidential vote
#> $label
#> [1] "Did you vote for Hillary Clinton, Donald Trump, someone else, or not vote in the 2016 presidential election?"
#> 
#> $format.stata
#> [1] "%8.0g"
#> 
#> $labels
#>                                Hillary Clinton 
#>                                              1 
#>                                   Donald Trump 
#>                                              2 
#>                                   Someone else 
#>                                              3 
#> Did not vote in the 2016 presidential election 
#>                                              4 
#>                                 SKIPPED ON WEB 
#>                                             98 
#>                                        refused 
#>                                             99
attributes(genfor.untouched$Q1) # this one is Trump approval
#> $label
#> [1] "Overall, do you approve, disapprove or neither approve nor disapprove of the way that President Donald Trump is doing his job?"
#> 
#> $format.stata
#> [1] "%8.0g"
#> 
#> $labels
#>               Strongly approve 
#>                              1 
#>               Somewhat approve 
#>                              2 
#> Neither approve nor disapprove 
#>                              3 
#>            Somewhat disapprove 
#>                              4 
#>            Strongly disapprove 
#>                              5 
#>                 SKIPPED ON WEB 
#>                             98 
#>                        refused 
#>                             99

We can, if we like, rename things in this way. Accessing the particular entry in the names vector and assigning a new value

names(genfor)[3] <- "vote2016"

A slightly quicker solution is to use the rename() function from dplyr. We have to specify the dataframe we’re manipulating, then tell the function the old and new names of the columns in the format: ‘newname = oldname’. We don’t need to use any quotation marks with this function.

#install.packages("dplyr")
genfor <- dplyr::rename(genfor,
                 weight = weight1,
                 approve.trump = q1,
                 approve.dem = democratic,
                 approve.rep = republican)
head(genfor)
#>        genf_id       weight vote2016 approve.trump partyid7
#> 4    803023033 0.8249129727        2             2        7
#> 9    831421601 1.3049241841        2             1        7
#> 107 1219666547 0.1913640053        3             2        2
#> 149 1364524431 3.8634938329        2             1        7
#> 180 1465399167  1.263106426       98            98       -1
#> 182 1476757020 1.1191845832        3             3        7
#>           date duration     device gender age educ state
#> 4   2017-10-29        8 Smartphone      1  32   12    TN
#> 9   2017-10-26       12    Desktop      2  23   10    CA
#> 107 2017-10-26        9 Smartphone      1  26   11    IL
#> 149 2017-10-27        8 Smartphone      2  32    9    UT
#> 180 2017-11-01     8965 Smartphone      2  25    9    GA
#> 182 2017-10-26       10 Smartphone      2  19    9    WA
#>     approve.dem approve.rep top.issue
#> 4             4           2    Q10A_1
#> 9             4           1    Q10A_1
#> 107           2           4    Q10A_1
#> 149           4           1    Q10A_1
#> 180          NA          NA      <NA>
#> 182           4           2    Q10A_1
names(genfor)
#>  [1] "genf_id"       "weight"        "vote2016"     
#>  [4] "approve.trump" "partyid7"      "date"         
#>  [7] "duration"      "device"        "gender"       
#> [10] "age"           "educ"          "state"        
#> [13] "approve.dem"   "approve.rep"   "top.issue"

I’ve got a personal pet-peeve about having underscores in column names. Let’s substitute the underscore in ‘genf_id’ for a period.

We’ll use gsub() to substitute the _ for a .

names(genfor) <- gsub("_",".",names(genfor))

5.2.4 Recoding values in a vector or in the column of a dataframe

The first thing we want to do is make sure our variables are the correct type of data (i.e. character, numeric, logical, etc)

summary(genfor)
#>     genf.id                   weight        vote2016     
#>  Min.   :7.926e+08   0.229665156 :  14   Min.   : 1.000  
#>  1st Qu.:2.571e+09   0.2958000119:  14   1st Qu.: 1.000  
#>  Median :4.293e+09   0.255285515 :  12   Median : 2.000  
#>  Mean   :4.303e+09   0.305879976 :  12   Mean   : 2.576  
#>  3rd Qu.:6.048e+09   0.1508269493:   9   3rd Qu.: 4.000  
#>  Max.   :7.875e+09   0.39396181  :   9   Max.   :99.000  
#>                      (Other)     :1806                   
#>  approve.trump       partyid7          date          
#>  Min.   : 1.000   Min.   :-1.000   Length:1876       
#>  1st Qu.: 3.000   1st Qu.: 2.000   Class :character  
#>  Median : 5.000   Median : 3.000   Mode  :character  
#>  Mean   : 4.721   Mean   : 3.121                     
#>  3rd Qu.: 5.000   3rd Qu.: 4.000                     
#>  Max.   :99.000   Max.   : 7.000                     
#>                                                      
#>     duration        device              gender     
#>  Min.   :    1   Length:1876        Min.   :1.000  
#>  1st Qu.:    9   Class :character   1st Qu.:1.000  
#>  Median :   14   Mode  :character   Median :2.000  
#>  Mean   : 1184                      Mean   :1.531  
#>  3rd Qu.:   82                      3rd Qu.:2.000  
#>  Max.   :19341                      Max.   :2.000  
#>  NA's   :2                                         
#>       age             educ          state          
#>  Min.   :18.00   Min.   : 1.00   Length:1876       
#>  1st Qu.:21.00   1st Qu.: 9.00   Class :character  
#>  Median :25.00   Median :10.00   Mode  :character  
#>  Mean   :25.15   Mean   :10.28                     
#>  3rd Qu.:29.00   3rd Qu.:12.00                     
#>  Max.   :34.00   Max.   :14.00                     
#>                                                    
#>   approve.dem     approve.rep     top.issue        
#>  Min.   :1.000   Min.   :1.000   Length:1876       
#>  1st Qu.:2.000   1st Qu.:3.000   Class :character  
#>  Median :2.000   Median :4.000   Mode  :character  
#>  Mean   :2.414   Mean   :3.241                     
#>  3rd Qu.:3.000   3rd Qu.:4.000                     
#>  Max.   :4.000   Max.   :4.000                     
#>  NA's   :242     NA's   :277

class(genfor$weight)
#> [1] "factor"

There’s something wrong with the ‘weight’ variable. It should be numeric, but it doesn’t have summary stats presented about it like the other numeric variables.

is.numeric(genfor$weight) 
#> [1] FALSE

is.factor(genfor$weight) 
#> [1] TRUE

So the weight variable, which should be numeric, is being stored as a factor. A factor variable is a special type of variable that contains both numeric and descriptive information. To be honest: they are a bit of a holdover from older stats programs like STATA and SPSS. They will have some important funcitonality in regression, but for right now we need to change that so that we can properly use the weights.

The obvious (but annoyingly wrong) way to do this would simply be to use the as.numeric() function to change the weights into numbers. But if you apply as.numeric() to a factor variable, it converts all the values to their arbitrary numeric values rather than their meaningful labels.

head(as.numeric(genfor$weight)) # wrong
#> [1]  992 1206  264 1540 1191 1132

We can get around this problem by first converting the weights into characters and then converting the character strings into numeric

head(as.numeric(as.character(genfor$weight)))
#> [1] 0.824913 1.304924 0.191364 3.863494 1.263106 1.119185

I can’t tell you how frustratingly common this issue is, which is why I’m point it out now. You’re most likely to encounter it when you read in a datasetfrom a spreadsheet, and R turns some of your columns into factors rather than numeric.

We can confirm that this was the right choice by looking at a summary of the two attempts:

summary(as.numeric(genfor$weight))
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>     1.0   366.8   716.5   765.1  1162.2  1627.0
summary(as.numeric(as.character(genfor$weight))) 
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#>  0.01762  0.24216  0.47743  1.00000  1.18315 10.69962

So now let’s store the corrected weights in the dataset:

genfor$weight <- as.numeric(as.character(genfor$weight))

Let’s fix the date:

head(genfor$date) 
#> [1] "2017-10-29" "2017-10-26" "2017-10-26" "2017-10-27"
#> [5] "2017-11-01" "2017-10-26"
class(genfor$date)
#> [1] "character"

genfor$date <- ymd(genfor$date) 
class(genfor$date)
#> [1] "Date"

This is particularly useful if you’re interested in subsetting the data based on the dates. Let’s say I just want to look at the people who completed in November:

table(genfor$date) 
#> 
#> 2017-10-26 2017-10-27 2017-10-28 2017-10-29 2017-10-30 
#>        542        390        107        246        229 
#> 2017-10-31 2017-11-01 2017-11-02 2017-11-03 2017-11-04 
#>        103        100         44         41         12 
#> 2017-11-05 2017-11-06 2017-11-07 2017-11-08 2017-11-09 
#>         18         10         15          7         11 
#> 2017-11-10 
#>          1

#View(genfor[genfor$date >= "2017-11-01",])

5.2.5 Recoding and replacing values

The most common form of data cleaning is recoding variables. There are tons of different ways to do this in R.

The most basic is recoding by taking advantage of the bracket notation.

Let’s work on recoding the vote2016 (formerly Q0) variable, which is a question about 2016 presidential vote

table(genfor$vote2016)
#> 
#>   1   2   3   4  98  99 
#> 896 262 213 497   6   2
attributes(genfor.untouched$Q0)
#> $label
#> [1] "Did you vote for Hillary Clinton, Donald Trump, someone else, or not vote in the 2016 presidential election?"
#> 
#> $format.stata
#> [1] "%8.0g"
#> 
#> $labels
#>                                Hillary Clinton 
#>                                              1 
#>                                   Donald Trump 
#>                                              2 
#>                                   Someone else 
#>                                              3 
#> Did not vote in the 2016 presidential election 
#>                                              4 
#>                                 SKIPPED ON WEB 
#>                                             98 
#>                                        refused 
#>                                             99

So let’s set the 98 and 99 responses to missing

genfor$vote2016[genfor$vote2016 %in% c(98,99)] <- NA
summary(genfor$vote2016)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>   1.000   1.000   2.000   2.166   4.000   4.000       8
table(genfor$vote2016)
#> 
#>   1   2   3   4 
#> 896 262 213 497

Similarly can use characters to recode

genfor$vote2016[genfor$vote2016==1] <- "Clinton"
genfor$vote2016[genfor$vote2016==2] <- "Trump"
genfor$vote2016[genfor$vote2016==3] <- "Other"
genfor$vote2016[genfor$vote2016 %in% c(4,98,99)] <- NA
table(genfor$vote2016)
#> 
#> Clinton   Other   Trump 
#>     896     213     262
head(genfor$vote2016)
#> [1] "Trump" "Trump" "Other" "Trump" NA      "Other"

Clean data often has labels that are more descriptive than just arbitrary numbers. So we might recode the gender variable:

attributes(genfor.untouched$gender)
#> $label
#> [1] "Respondent gender"
#> 
#> $format.stata
#> [1] "%8.0g"
#> 
#> $labels
#> Unknown    Male  Female 
#>       0       1       2

genfor$gender[genfor$gender == 1] <- "M"
genfor$gender[genfor$gender == 2] <- "F"
genfor$gender[genfor$gender == 0] <- "U"

My personal preference, however, is to have indicator variables for categories like this.

genfor$female <- NA
genfor$female[genfor$gender=="M" | genfor$gender=="U"] <- 0
genfor$female[genfor$gender=="F"] <- 1

his displays the same information, but in a way that we can do math on, which is helpful.

One last function… If a column contains information about more than one variable we might want to divide it into two columns using the separate() function.

Let’s imagine we wanted separate columns for year, month, and day that the respondent took the survey:

genfor2 <- separate(genfor,
         col = "date",
         into = c("year","month","day"))
head(genfor2)
#>        genf.id   weight vote2016 approve.trump partyid7
#> 4    803023033 0.824913    Trump             2        7
#> 9    831421601 1.304924    Trump             1        7
#> 107 1219666547 0.191364    Other             2        2
#> 149 1364524431 3.863494    Trump             1        7
#> 180 1465399167 1.263106     <NA>            98       -1
#> 182 1476757020 1.119185    Other             3        7
#>     year month day duration     device gender age educ
#> 4   2017    10  29        8 Smartphone      M  32   12
#> 9   2017    10  26       12    Desktop      F  23   10
#> 107 2017    10  26        9 Smartphone      M  26   11
#> 149 2017    10  27        8 Smartphone      F  32    9
#> 180 2017    11  01     8965 Smartphone      F  25    9
#> 182 2017    10  26       10 Smartphone      F  19    9
#>     state approve.dem approve.rep top.issue female
#> 4      TN           4           2    Q10A_1      0
#> 9      CA           4           1    Q10A_1      1
#> 107    IL           2           4    Q10A_1      0
#> 149    UT           4           1    Q10A_1      1
#> 180    GA          NA          NA      <NA>      1
#> 182    WA           4           2    Q10A_1      1

The function will attempt to automatically detect and guess the character that separates the columns. In this case it works just fine, but it’s not a bad idea to specify exactly what you want, so that it doesn’t make an unexpected mistake:

genfor <- separate(genfor,
                   col = "date",
                   into = c("year","month","day"),
                   sep = "-")

5.2.6 Analysis

Let’s look at doing a quick analysis. Let’s see how age relates to voting for Trump

Let’s first make a variable that is vote for trump (1) or not (0)

genfor$trump <- NA
genfor$trump[genfor$vote2016=="Trump"]<- 1
genfor$trump[genfor$vote2016!="Trump"]<- 0

table(genfor$trump)
#> 
#>    0    1 
#> 1109  262

(remember this is a survey of millenials, so very few voting for trump)

How does age relate to voting for trump?

making a scatterplot not very helpful here:

plot(genfor$age,genfor$trump)

Another possibility is a boxplot where we look at the distribution of ages for trump and not trump voters

boxplot(genfor$age ~ genfor$trump,
        xlab="Vote for Trump",
        ylab="Age",
        main="Distribution of Age for Trump and Non-Trump Voters")

Can also look at the correlation

cor(genfor$age, genfor$trump)
#> [1] NA
cor(genfor$age, genfor$trump, use="pairwise.complete")
#> [1] 0.1280754

Small positive correlation

Another way to look at the distribution of age is a density plot

plot(density(genfor$age[genfor$trump==1],na.rm=T),
     xlab="Age",
     main="Distribution of Age for Trump and Non-Trump Voters",
     col="firebrick",
     ylim=c(0,0.1))
points(density(genfor$age[genfor$trump==0],na.rm=T), type="l",
       col="dodgerblue")
legend("topright",c("Trump Voters","Non-Trump Voters"),
       lty=c(1,1), col=c("firebrick","dodgerblue"))