Chapter 2 Foundations of Data Analysis with R

2.1 Setting up your environment

Setting up your R environment for data analysis involves installing R, RStudio, and a few essential packages. Here is a step-by-step guide to help you get started:

2.1.1 Installing R

Visit the R Project website (https://www.r-project.org/).
Click on the “CRAN” link under “Download and Install R.”
Choose a CRAN mirror near you.
Download and install the appropriate version of R for your operating system (Windows, macOS, or Linux).
Follow the installation instructions for your chosen operating system.

2.1.2 Installing RStudio

Go to the RStudio website (https://www.rstudio.com/).
Click on the “Products” tab and select “RStudio Desktop.”
Choose the appropriate version for your operating system (Windows, macOS, or Linux).
Download and install RStudio by following the installation instructions.

2.1.3 Launching RStudio and setting up a project

Open RStudio after installation.
Click on “File” in the top menu, then select “New Project.”
Choose the project type. For data analysis, select “New Directory” and then “New Project.”
Specify the project name and location. You can create a new directory or use an existing one.
Choose an optional version control system (e.g., Git) if you plan to use one.
Click “Create Project.” This will create a new RStudio project with its own working directory.
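Because a project has its own working directory, files inside it can be referenced with relative paths. The lines below are a minimal sketch of this idea; the "data" subfolder and the "survey.csv" file name are hypothetical and only serve as an illustration.

```r
# Show the current working directory; inside an RStudio project this is
# the project folder.
project_dir <- getwd()
print(project_dir)

# Build a relative path to a hypothetical file "survey.csv" stored in a
# hypothetical "data" subfolder of the project.
data_path <- file.path("data", "survey.csv")
print(data_path)

# Check whether the file exists before trying to read it.
file.exists(data_path)
```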

2.1.4 Installing packages

You can install packages by using the install.packages() function, replacing “package_name” with the name of the package you want to install.

install.packages("package_name")

For data analysis, some essential packages include:

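tidyverse for data manipulation (a collection of packages that includes dplyr and ggplot2).
ggplot2 for data visualization (installed automatically as part of tidyverse).
psych for statistical analyses.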

Install these packages using the above method.
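If you are unsure which of these packages are already on your machine, you can check before installing. The following is a minimal sketch, assuming you only want to install what is missing; the object names `packages`, `is_installed`, and `missing_packages` are illustrative. The install line is left commented out so that nothing is downloaded until you choose to run it.

```r
# Packages named in this section.
packages <- c("tidyverse", "ggplot2", "psych")

# requireNamespace() returns TRUE when a package is available,
# without attaching it to the session.
is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)

# Keep only the packages that are not yet installed.
missing_packages <- packages[!is_installed]
print(missing_packages)

# Uncomment to install whatever is missing (requires an internet connection).
# if (length(missing_packages) > 0) install.packages(missing_packages)
```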

2.1.5 Loading packages

You can load packages using the library() function.

# Load the packages tidyverse, ggplot2, and psych
library(tidyverse)
library(ggplot2)
library(psych)

2.2 Vectors, factors, matrices, and data frames

A solid understanding of data structures is essential as they form the foundation for storing and manipulating data. R provides various data structures that cater to different types of data and analysis needs. In this section, we’ll explore four key data structures: vectors, factors, matrices, and data frames.

2.2.1 Vectors

A vector is the simplest and most fundamental data structure in R. It is a one-dimensional structure whose elements all share the same data type, such as numbers or characters. Vectors are created using the c() function. Here’s an example.

# Creating a numeric vector
numeric_vector <- c(1, 2, 3, 4, 5)

# Creating a character vector
character_vector <- c("apple", "banana", "orange")
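To see what “same data type” means in practice, you can inspect a vector with class() and length(). The sketch below reuses the two vectors from the example; `mixed_vector` and `doubled` are extra names introduced here for illustration.

```r
numeric_vector <- c(1, 2, 3, 4, 5)
character_vector <- c("apple", "banana", "orange")

# Inspect the type and length of a vector.
class(numeric_vector)     # "numeric"
class(character_vector)   # "character"
length(numeric_vector)    # 5

# Mixing types forces every element to a common type (here, character).
mixed_vector <- c(1, "apple", TRUE)
class(mixed_vector)       # "character"

# Arithmetic is applied element by element.
doubled <- numeric_vector * 2
doubled                   # 2 4 6 8 10
```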

2.2.2 Factors

Factors are used to represent categorical data and are particularly useful for data that have discrete levels or categories. They are an essential data structure for performing statistical analyses and plotting categorical data. Factors are created using the factor() function.

# Creating a factor
gender <- factor(c("male", "female", "male", "female"))
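A factor stores its categories as levels, which you can inspect and count. The following sketch builds on the gender example; `gender_counts` and `gender_ordered` are names added here for illustration, and the explicit level order is just one possible choice.

```r
gender <- factor(c("male", "female", "male", "female"))

# Levels are sorted alphabetically by default.
levels(gender)            # "female" "male"

# Count how many observations fall into each level.
gender_counts <- table(gender)
gender_counts

# Set the order of the levels explicitly with the levels argument.
gender_ordered <- factor(c("male", "female", "male", "female"),
                         levels = c("male", "female"))
levels(gender_ordered)    # "male" "female"
```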

2.2.3 Matrices

A matrix is a two-dimensional data structure composed of rows and columns. All elements in a matrix must be of the same data type. Matrices are useful for storing data in a tabular format, such as datasets with equal-length variables. You can create a matrix using the matrix() function.

# Creating a matrix
matrix_data <- matrix(1:12, nrow = 3, ncol = 4)
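Note that matrix() fills its values column by column unless told otherwise. The sketch below illustrates this with the matrix from the example; `matrix_by_row` is an additional object introduced here for comparison.

```r
matrix_data <- matrix(1:12, nrow = 3, ncol = 4)

# Number of rows and columns.
dim(matrix_data)          # 3 4

# Values are filled column by column, so the first column holds 1, 2, 3.
matrix_data[, 1]          # 1 2 3
matrix_data[2, 3]         # 8

# Use byrow = TRUE to fill the matrix row by row instead.
matrix_by_row <- matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)
matrix_by_row[1, ]        # 1 2 3 4
matrix_by_row[2, 3]       # 7
```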

2.2.4 Data frames

Data frames are perhaps the most commonly used data structure for data analysis. They are similar to matrices but can store columns of different data types, making them versatile for handling real-world datasets. Data frames maintain the tabular structure and are often used to represent datasets with variables and observations. You can create a data frame using the data.frame() function.

# Creating a data frame
data_frame <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 22),
  score = c(95, 85, 78)
)
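Once a data frame exists, a few base R functions give a quick overview of its size and column types. This is a minimal sketch using the data frame from the example; `column_classes` and `average_score` are names introduced here. The class reported for the name column depends on your R version (character in R 4.0 or later).

```r
data_frame <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 22),
  score = c(95, 85, 78)
)

# Number of observations (rows) and variables (columns).
nrow(data_frame)          # 3
ncol(data_frame)          # 3

# Compact overview of the structure of the data frame.
str(data_frame)

# Data type of each column.
column_classes <- sapply(data_frame, class)
column_classes

# Columns behave like vectors, so vector functions work on them.
average_score <- mean(data_frame$score)
average_score             # 86
```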

2.2.5 Accessing subsets of vectors and data frames

Values in vectors and data frames can be accessed in a number of ways. To access a specific value in a vector, you can use square brackets [] with the integer index.

# Create a vector
vector <- c(2, 6, 22, 12, 8)

# Using square brackets and integer index to access the value in the fourth position (12)
vector[4]
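Square brackets also accept several positions at once, negative positions, and logical conditions. The sketch below shows these variations on the same vector; they go slightly beyond the single-index case described above.

```r
vector <- c(2, 6, 22, 12, 8)

# Several positions at once.
vector[c(1, 3)]           # 2 22

# A range of positions.
vector[2:4]               # 6 22 12

# A negative index drops that position.
vector[-1]                # 6 22 12 8

# A logical condition keeps only the values for which it is TRUE.
vector[vector > 10]       # 22 12
```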

To access a specific column in a data frame, you can use the $ operator or square brackets [] with either the column name or integer index.

# Create a data set
data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Using the $ operator
data$variable1

# Using square brackets and variable name
data["variable1"]

# Using square brackets and integer index to access the first column (variable)
data[ , 1]
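These three approaches do not all return the same kind of object: the $ operator and data[ , 1] return a vector, whereas data["variable1"] returns a data frame with one column. The following sketch makes the difference visible; the double-bracket form and the drop argument are additions not covered in the text above.

```r
data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Single brackets with a name return a one-column data frame.
is.data.frame(data["variable1"])     # TRUE

# The $ operator, double brackets, and [ , 1] return a plain vector.
is.numeric(data$variable1)           # TRUE
is.numeric(data[["variable1"]])      # TRUE
is.numeric(data[ , 1])               # TRUE

# Keep the data frame structure when selecting one column by index.
is.data.frame(data[ , 1, drop = FALSE])   # TRUE

# Select several columns at once by name.
two_columns <- data[ , c("variable1", "variable3")]
names(two_columns)                   # "variable1" "variable3"
```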

To access specific rows in a data frame, you can use integer indexing with square brackets [].

# Create a data set
data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Using square brackets and integer index to access the fourth row
data[4, ]
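Row indexing works with several indices or with a logical condition as well. The sketch below shows both on the same data set; `selected_rows` and `large_rows` are names introduced here for illustration.

```r
data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Access the first and fourth rows.
selected_rows <- data[c(1, 4), ]
selected_rows

# Access all rows where variable1 is greater than 10 (rows 3 and 4).
large_rows <- data[data$variable1 > 10, ]
large_rows
```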

To access a specific value at the intersection of a row and a column, you can use both row and column indexing with square brackets [].

# Create a data set
data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Using square brackets and integer index to access the value in the second row of the first column
data[2, 1]
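The same cell can be reached by column name, and the indexing expression can also be used to replace a value. This is a minimal sketch; `data_copy` is a hypothetical copy used so that the original data set stays unchanged.

```r
data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Access the second row of variable1 by column name.
data[2, "variable1"]      # 6

# Equivalent: take the column first, then the position within it.
data$variable1[2]         # 6

# Replace a single value in a copy of the data set.
data_copy <- data
data_copy[2, 1] <- 7
data_copy[2, 1]           # 7
```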

Understanding these data structures and how to subset them is crucial for effective data analysis in R. They let you store and organize data in various formats, and they underpin a wide range of data manipulation, transformation, and visualization tasks. You will work with them constantly when extracting insights from your data.

2.3 Data manipulation with tidyverse

The tidyverse package provides a set of functions that streamline and simplify data manipulation tasks. It follows a consistent grammar that makes it easy to express complex data transformations using a small set of intuitive verbs. Here’s a step-by-step guide to manipulating data using tidyverse.

2.3.1 Install and load the `tidyverse` package

If you haven’t already installed and loaded the tidyverse package, do so using these commands.

install.packages("tidyverse")
library(tidyverse)

2.3.2 The pipe operator `%>%`

The %>% symbol, known as the pipe operator, is a fundamental feature of the tidyverse ecosystem, which includes packages like dplyr, ggplot2, and more. The pipe operator simplifies the process of chaining multiple operations together, making your code easier to read and write.

The pipe operator takes the output of one function and feeds it as the first argument to the next function. Its basic syntax is:

output <- input %>%
  function1() %>%
  function2() %>%
  function3()
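As a concrete illustration, the sketch below computes the same result twice: once with nested function calls and once with the pipe. The vector `values` and the result names are made up for this example; loading dplyr is assumed here as one way to make %>% available.

```r
library(dplyr)  # makes the %>% operator available

values <- c(1, 4, 9, 16)

# Nested calls are read from the inside out.
nested_result <- round(mean(sqrt(values)), 1)

# The pipe expresses the same steps from top to bottom.
piped_result <- values %>%
  sqrt() %>%
  mean() %>%
  round(1)

nested_result   # 2.5
piped_result    # 2.5
```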

2.3.3 Selecting columns with `select()`

Use select() to choose specific columns from your data frame.

# Create a data set
data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Select variable1 and variable2
data %>%
  select(variable1, variable2)

# Select all variables apart from variable1
data %>%
  select(!variable1)
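select() also accepts ranges of columns and helper functions such as starts_with(). The following is a short sketch of these options on the same data set; the result names are introduced here only so the output can be inspected.

```r
library(dplyr)

data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Select a range of adjacent columns.
range_selection <- data %>%
  select(variable1:variable2)
names(range_selection)    # "variable1" "variable2"

# Select all columns whose names start with "variable".
prefix_selection <- data %>%
  select(starts_with("variable"))
names(prefix_selection)   # "variable1" "variable2" "variable3"

# A minus sign is another way to drop a column.
dropped_selection <- data %>%
  select(-variable3)
names(dropped_selection)  # "variable1" "variable2"
```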

2.3.4 Filtering rows with `filter()`

Use filter() to select rows from a data frame based on specific conditions.

# Create a data set
data <- data.frame(column1 = c(2, 6, 22, 12, 8),
                   column2 = c("apples", "oranges", "apples", "pears", "oranges"),
                   column3 = c("cars", "bikes", "cars", "trains", "bikes"))

# Get rows where values in column1 are greater than 10
data %>%
  filter(column1 > 10)

# Get rows where values in column2 are "apples"
data %>%
  filter(column2 == "apples")

# Get rows where values in column3 are not "cars"
data %>%
  filter(column3 != "cars")

# Get rows where values in column1 are less than 10 and values in column2 are "oranges"
data %>%
  filter(column1 < 10) %>%
  filter(column2 == "oranges")
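Several conditions can also be combined inside a single filter() call. The sketch below uses a small hypothetical data set, `fruit_data`, with made-up column names and values, to show conditions joined by a comma (and), by | (or), and by %in% (membership in a set).

```r
library(dplyr)

# Hypothetical data set used only for this illustration.
fruit_data <- data.frame(
  amount = c(2, 6, 22, 12, 8),
  fruit = c("apples", "oranges", "apples", "pears", "oranges")
)

# Conditions separated by a comma must all be TRUE.
small_oranges <- fruit_data %>%
  filter(amount < 10, fruit == "oranges")

# The | operator keeps rows where at least one condition is TRUE.
large_or_pears <- fruit_data %>%
  filter(amount > 20 | fruit == "pears")

# %in% keeps rows whose value matches any element of a vector.
apples_or_pears <- fruit_data %>%
  filter(fruit %in% c("apples", "pears"))

nrow(small_oranges)     # 2
nrow(large_or_pears)    # 2
nrow(apples_or_pears)   # 3
```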

2.3.5 Renaming variables with `rename()`

Use rename() to change the name of columns in a data frame.

# Create a data set
data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Rename variable1 to Variable_1
data %>%
  rename(Variable_1 = variable1)
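Like the other tidyverse verbs, rename() returns a new data frame and leaves the original untouched unless you assign the result. The sketch below demonstrates this; `renamed_data` is a name chosen here for illustration.

```r
library(dplyr)

data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Assign the result to keep the new names; several columns can be
# renamed in one call using new_name = old_name pairs.
renamed_data <- data %>%
  rename(Variable_1 = variable1,
         Variable_2 = variable2)

names(renamed_data)   # "Variable_1" "Variable_2" "variable3"

# The original data frame is unchanged.
names(data)           # "variable1" "variable2" "variable3"
```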

2.3.6 Recoding variables with `recode()` and `mutate()`

Use recode() together with mutate() to change the values of a variable in a data frame.

# Create a data set
data <- data.frame(variable1 = c("apple", "banana", "orange"),
                   variable2 = c(1, 0, 1),
                   variable3 = c(2, 5, 7))

# Create a recoded character variable
data$variable1_r <- data %>%
  mutate(variable1_r = recode(variable1,
                              "apple" = "red",
                              "banana" = "yellow",
                              "orange" = "orange")) %>%
  pull(variable1_r)

# Create a recoded binary variable (numeric values must be wrapped in backticks)
data$variable2_r <- data %>%
  mutate(variable2_r = recode(variable2,
                              `0` = 1,
                              `1` = 0)) %>%
  pull(variable2_r)

# Create a reverse scored continuous variable (assuming the original variable ranges from 1-7)
data$variable3_r <- 8 - data$variable3
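The same three recodings can also be written in a single mutate() call. The following is one possible alternative, not the approach used above: it swaps in dplyr’s case_when() for the character variable and simple arithmetic for the numeric ones, assuming the binary variable is coded 0/1 and the continuous variable ranges from 1 to 7.

```r
library(dplyr)

data <- data.frame(variable1 = c("apple", "banana", "orange"),
                   variable2 = c(1, 0, 1),
                   variable3 = c(2, 5, 7))

data <- data %>%
  mutate(
    # Map each fruit to a colour; each condition is checked in order.
    variable1_r = case_when(
      variable1 == "apple" ~ "red",
      variable1 == "banana" ~ "yellow",
      variable1 == "orange" ~ "orange"
    ),
    # Flip a 0/1 variable.
    variable2_r = 1 - variable2,
    # Reverse score a variable that ranges from 1 to 7.
    variable3_r = 8 - variable3
  )

data$variable1_r   # "red" "yellow" "orange"
data$variable2_r   # 0 1 0
data$variable3_r   # 6 3 1
```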

2.3.7 Creating new variables by combining `select()` with `rowMeans()` or `rowSums()`

Use select() to pull out relevant variables and then rowMeans() to compute a mean score across them or rowSums() to compute a sum score.

# Create a data set
data <- data.frame(variable1 = c(2, 6, 22, 12, 8),
                   variable2 = c(4, 8, 4, 10, 3),
                   variable3 = c(9, 15, 9, 11, 4))

# Create a new variable with the sum score of variable1, variable2, and variable3
data$sum_score <- data %>%
  select(variable1, variable2, variable3) %>% 
  rowSums(na.rm = TRUE)

# Create a new variable with the average score of variable1, variable2, and variable3
data$mean_score <- data %>% 
  select(variable1, variable2, variable3) %>% 
  rowMeans(na.rm = TRUE)
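The na.rm = TRUE argument matters as soon as the data contain missing values. The sketch below uses a hypothetical data set, `data_na`, with one missing value to show how the row scores differ with and without it; the column names `mean_strict`, `mean_lenient`, and `sum_lenient` are made up for this example.

```r
library(dplyr)

# Hypothetical data set with one missing value in the second row.
data_na <- data.frame(variable1 = c(2, 6, 22),
                      variable2 = c(4, NA, 4),
                      variable3 = c(9, 15, 9))

# Without na.rm, any row containing a missing value gets NA.
data_na$mean_strict <- data_na %>%
  select(variable1, variable2, variable3) %>%
  rowMeans()

# With na.rm = TRUE, the mean is based on the available values only.
data_na$mean_lenient <- data_na %>%
  select(variable1, variable2, variable3) %>%
  rowMeans(na.rm = TRUE)

# With na.rm = TRUE, the sum skips the missing value.
data_na$sum_lenient <- data_na %>%
  select(variable1, variable2, variable3) %>%
  rowSums(na.rm = TRUE)

data_na$mean_strict[2]    # NA
data_na$mean_lenient[2]   # 10.5
data_na$sum_lenient[2]    # 21
```

Note that with na.rm = TRUE, rows with missing values are averaged over fewer items, so their scores are not based on the same information as complete rows.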