Lab 2 Data in R

2.1 Welcome back!

In our previous lab, we set up an RStudio Cloud session and we got familiar with the RStudio environment and some of the purpose and contents of its panes. In this Lab we will learn about R packages, how to install them and load them. Also, we will use different types of data. You will have the chance to practice with additional R operators. Lastly, we will load a real-world data set and put in practice your new skills.

2.2 R Packages

As mentioned in our last lab, R (R Core Team 2021) is a collaborative project. This means that R users are developing, maintaining, and extending the functionalities constantly. When you set up R and RStudio for the first time, as we did it last week, it comes only with the ‘basic’ functionalities by default. However, there are literally thousands of extensions that are developed by other users. In R, these non-default extensions are called packages.

Most of the times, we use packages because they simplify our work in R or they allow us to extend the capabilites of base R.

2.2.1 Installing packages

Let’s put hands-on to install and load some useful packages. We will start with tidyverse (Wickham 2021).¹

2.2.2 Activity:

Part 1. Access your lab group in R Studio Cloud

Make sure you have a free, institutional-subscription RStudio Cloud account (in case you have not created one yet, please follow the guidance provided in Lab 1);
You will receive a link from your tutor to join your lab group in a shared space. Copy and paste it in your web browser (log in if necessary). If you already joined your lab group in RStudio Cloud, simply access the ‘Lab 2’ project and omit steps 3 to 5. Otherwise, continue with steps 3, 4 and 5.
If you did not join your lab group yet, you should see the following window:

Figure 2.1: Join Space
Click on the ‘Join space’ button shown above.
Open the shared space form the left-hand side pane called ‘Quants Lab Group..’ and start the Lab 2 project by clicking on the ‘Start’ button as shown below:

Figure 2.2: Start Lab 2.

Part 2. Working on your script

Once you have accessed the ‘Lab 2’ project, write or copy the following line in your script (pane 1) and run it:

install.packages('tidyverse')

Wait until you get the message ‘The downloaded source packages are in…’. The installing process can take up to a couple of minutes to finish.
Once the package is installed, you need to load it using the library() function. Please, copy and paste the following line, and run it:

library(tidyverse)

And that’s it, tidyverse is ready to be used in your current session!

There are couple of things you should know. First, the packages need to be installed only per project in RStudio Cloud (and only once if you are working in RStudio Desktop version). However, packages must be loaded using the library() function every time you restart an R session.

Another thing to notice is that when you install a package you need to use quotation marks, whereas in library() you only need to write the plain package name within brackets. Usually, you will load the packages at the beginning of your script.

2.3 Types of variables

R can handle many classes of data. It is crucial that you can distinguish the main ones. Broadly speaking there are two types of variables,

categorical and;
numeric (formally know as interval or ratio).

Categorical variables are distinctive because they are limited in the number of categories it can take, e.g., country, name, political party, or gender. Ordinal data is a sub-type of the categorical, and it is used when the categories can be ranked and their order is meaningful, e.g., education level or level of satisfaction. Numeric values can be continuous (these are usually measured and can take infinite values, e.g. speed or time).²

In R, the basic types of data are known as ‘atomic vectors’ and there are 6 of them (logical, integer, double, character, complex and raw). In the social sciences, we often use the following: numeric, factor and character. Numeric vectors are used to represent continuous numerical data.³ On the other hand, factor vectors are used to represent categorical and ordinal data.

In R, there are couple of functions that will help us to identify the type of data. First, we have glimpse(). This prints some of the main characteristics of a data set, namely its overall dimension, name of each variable (column), the first values for each variable, and the type of the variable. Second we have the function class(), that will help us to determine the overall class(type) of on R object.

2.3.1 Activity:

We are now going to use some datasets that are pre-loaded in the R session by default. Please go to your ‘Lab_2’ project in RStudio Cloud and do the following:

We will start with a classic dataset example in R called iris. This contains measurements of various flowers species (for more info type ?iris in your console). Please go to your console and type the line below.

glimpse(iris)

What do you observe from the output?… First, it tells you the number of rows and the columns on the top. Later, it lists the name of each variable. Additionally, it tells you the type of the variable between these symbols < >. The first five variables in this dataset are of type <dbl> which is a type of numeric variable. The last, Species, is a factor <fct>. In sum, there is information of the species and four types of continuous measures associated to each flower in this dataset.
Now you know that each flower belongs to a species, but what are the specific categories in this data set? To find out, type the following in your console.

levels(iris$Species)

As you can see, there are three categories, which are three types of flower species. In R the categories in factor vectors are are called levels.

Note the syntax above. Inside the function, we used the name of the dataset followed by the dollar sign ($) which is is needed to access the specific column/variable Species.

Now, let’s get serious and explore Star Wars. Yes, the famous film series!

The starwars data set from the dplyr package contains information about the characters, including height, hair colour, and sex (to get more information type ?starwars in your console). At this time we will use a reduced version of the full data set. Please complete the following activities from your R script (pane 1).

First, we will run the next couple of lines to reduce the data set, and then we will glimpse the Star Wars characters:

starwars2 <- starwars[ ,1:11]
glimpse(starwars2)

What do you observe this time? … It seems that the data type is not consistent with their content. For example, the variables species, gender, and hair_color are of type <chr> (that is character), when according to what we just learnt they should be a factor. To transform them, we will use the function ´factor()´. This process is known as coercing a variable, that is when you change from one type to another.
Let’s coerce the species variable from character to factor and assign the result to the same column in the dataset.

starwars2$species <- factor(starwars2$species)

Let’s check if the type of variable really changed by glimpsing the data and checking the levels of species.

glimpse(starwars2)
levels(starwars2$species)

The glimpse result now is telling us that species is a <fct>, as expected. Furthermore, the levels() function reveals that there are 37 types of species, including Human, Ewok, Droid, and more.

Hopefully, these examples will help you to identify the the main vector types and more importantly to coerce them in an appropriate type. Be aware that many data sets represent categories with numeric values, for example, using ‘0’ for males and ‘1’ for females. Usually, large data sets are accompanied by extra information in a code book or documentation file, which specifies the values of the numeric code and their respective meaning. It’s important to read the code book/documentation of every dataset as the conventions and meanings can vary.

2.4 More operators and some essential symbols

A useful operator is the pipe %>%. This is part of the tidyverse package. So, it is ready for you to use. This operator passes the result of one operation to the next. Check the results of the following operations in your console:

1 %>% + 1
1 %>% + 1 %>% + 5

Observe what happened…The result from the first line was 2. This is because this line can be read as: ‘take 1, THEN sum 1’. Therefore, the result is 2.

Similarly, the second line follows this process: ‘take 1, THEN sum 1, take the result of this (which is ’2’) and THEN sum 5’. Therefore, the result is 7. This can sound a bit abstract at this point, but we will practice with some data in the next section.

2.5 Black lives matter!

In this section we will work with data originally collected by The Guardian in 2015, for more information click here. The data set we will use today is an extended version which was openly shared in GitHub by the American news website FiveThirtyEight. This data set contains information about the people that were killed by police or other law enforcement bodies in the US, such as age, gender, race/ethnicity, etc. Additionally, it includes information about the city or region where the event happened. For more information click here.

2.5.1 Downloading and reading the data

For the following excercices, please make sure that your are working in your R script.

First, we will create a new folder in our project directory to store the data. To do it from R, run this line in your script (Don’t worry if you get a warning. This appears because you already have a folder with this name):

dir.create("data")

Note that in the ‘Files’ tab of Pane 4, there is a new folder called data.

Now, download the data from the GitHub repository using the function download.file(). This function takes two arguments separated by a comma: (1) the URL and (2) the destination (including the directory, file name, and file extension), as shown below. Also, since the file we downloaded is wrapped in a .zip file, we will need to unzip it using unzip(). Copy, paste in your script, the following lines:

download.file("https://projects.fivethirtyeight.com/data-webpage-data/datasets/police-killings.zip",
              "data/police-killings.zip")
unzip("data/police-killings.zip", exdir = "data")

After following the previous steps, we are ready to read the data. As you can see in the ‘File’ tab, the data comes as a .csv file. Thus, we can use the read_csv() function included in the tidyverse package (make sure you the package is loaded in your session as explained in a previous section). We will assign the data in an object called police.

police <- read_csv("data/police-killings/police_killings.csv")

2.5.2 Examining the data

If you look at your ‘Environment’ tab in pane 2, you will see there is a new object called police, which has 467 observations and 34 variables (or columns). To start exploring the contents, we will glimpse the police data as following:

glimpse(police)

As you can see, there are several variables included in the dataset, such as age, gender, law enforcement agency (lawenforcementagency), or whether the victim was armed (armed). You will see some of these variables are not in the appropriate type. For instance, some are categorical and should be type <fct> instead of <chr>.

2.5.3 Data wrangling

Before coercing these variables, we will create a smaller subset selecting only the variables that we are interested in. To do so, we can use the select() function. The select function takes the name of the data first and then the name of the variables we want to keep (no quotation marks needed). We will select a few variables and assign the result to a new object called police_2.

police_2 <- select(police, age, gender, raceethnicity, lawenforcementagency, armed)

If you look again to the ‘Environment’ tab, there is a second data set with the same number of observations but only 5 variables. You can glimpse this object to have a better idea of its contents.

glimpse(police_2)

Having a closer look at the reduced version, we can see that in fact all the variables are of type <chr>, including age.

Let’s coerce the variables in to their correct type. We will start with age, from character to numeric:

police_2 <- police_2 %>% mutate(age = as.numeric(age))

Age is not known for some cases. Thus, it is recorded as ‘Unknown’ in the dataset. Since this is not recognized as a numeric value in the coercion process, R automatically sets it as a missing value, NA. This is why it will give you a warning message.

We can continue coercing raceethnicity and gender from character to a factor:

police_2 <- police_2 %>% mutate(raceethnicity = factor(raceethnicity))
police_2 <- police_2 %>% mutate(gender = factor(gender))

Let’s run a summary of your data. This shows the number of observations in each category or a summary of a numeric variable:

summary(police_2)

There are some interesting figures coming out from the summary. For instance, in age you can see that the youngest is… 16 years old(?!), and the oldest 87 years old. Also, the vast majority are male individuals (445 vs 22). In relation to race/ethnicity, roughly half of them is ‘White’, whereas ‘Black’ individuals represent an important share. One may question about the proportion of people killed in terms of race/ethnicity compared to the composition of the total population (considering Black is a minority group in the US).

Let’s suppose that we only want observations in which race/ethnicity is not unknown. To ‘remove’ undesired observation we can use the filter() function. We will assign the result of filter in a variable called police_2.

police_2 <- police_2 %>% filter(raceethnicity != 'Unknown')

So, what just happened in the code above? First, the pipe operator, %>%: What we are doing verbally is take the object police_2, THEN filter raceethnicity based on a condition. Later, what is happening inside filter? Lets have a look at what R does in the background for us (Artwork by @alison_horst):

Figure 2.3: Filter. Source: Artwork by Horst (n.d.).

In the example above, we are keeping the observations in raceethnicity that are NOT EQUAL to ‘Unknown’. Finally, when we assigned the result to an object named as the same as our previous object, we replaced the old dataset with the filtered version.

2.6 Activity

Discuss the following questions with your neighbour or tutor:

What is the main purpose of the functions select() and filter?
What does coerce mean in the context of R? and Why do we need to coerce some variables?
What is the mutate() function useful for?

Using the police_2 dataset:

Filter how many observations are ‘White’ in raceethnicity? How may rows/observations are left?
How many ‘Latino/Hispanic’ are there in the dataset?
Using the example of Figure 2.3, could you filter how many were killed that were (a) ‘Black’ and (b) killed by firearm (‘firearm’)?
What about ‘White’ and ‘firearm’?

This is the end of Lab 2. Again, the changes in your script should be saved automatically in R Studio Cloud. However, make sure this is the case as you were taught in Lab 1. After this, you can close the tab in your web browser. Hope you had fun!

References

Horst, Alison. n.d. “GitHub - Allisonhorst/Stats-Illustrations: R & Stats Illustrations by @Allison_horst.” Accessed July 11, 2022. https://github.com/allisonhorst/stats-illustrations.

R Core Team. 2021. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Wickham, Hadley. 2021. Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.

https://www.tidyverse.org/↩︎
For more details, please refer to the DataCamp module Introduction to Data in R.↩︎
Notice that numeric vectors can be represented as integeror double in R, their difference is of little relevance for now.↩︎