Chapter 3 Importing, saving, and exploring data
Now that you have a solid understanding of how to install packages and work with vectors and dataframes, we are ready to start working with real psychological data. This chapter explains how to read in your data and explore it descriptively.
3.1 Importing: Getting your data into R.
R provides various functions and packages to import data from different file formats. Here are some common methods to read data into R.
CSV files
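Reading a CSV file can be sketched with base R's read.csv(). This is a minimal illustration under stated assumptions: the temporary file and its two columns are made up so that the code runs on its own, and in practice you would pass the path to your own file (e.g., "data.csv").

```r
# Create a small example CSV file (hypothetical contents, for illustration only)
example_file <- tempfile(fileext = ".csv")
writeLines(c("participant_id,age", "1,18", "2,19"), example_file)

# Read CSV file
data <- read.csv(example_file)

# Inspect the first rows
head(data)
```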
Excel files
# Install and load the openxlsx package
install.packages("openxlsx")
library(openxlsx)
# Read Excel file
data <- read.xlsx("data.xlsx", sheet = 1)
SPSS files
# Install and load the haven package
install.packages("haven")
library(haven)
# Read SPSS file
data <- read_sav("data.sav")
R data files
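R's own file formats can be read with the base functions readRDS() (one object per .rds file) and load() (one or more objects per .RData file). The sketch below is illustrative only: the data frame and temporary file names are assumptions, and the example files are created first so that the code runs on its own.

```r
# Example data frame (hypothetical values)
data <- data.frame(participant_id = c(1, 2), age = c(18, 19))

# Create example files so the sketch runs on its own
rds_file <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")
saveRDS(data, rds_file)
save(data, file = rdata_file)

# Read a single object from an .rds file and assign it to a name of your choice
data_rds <- readRDS(rds_file)

# Load an .RData file; objects are restored under the names they were saved with
rm(data)
load(rdata_file)
```

readRDS() returns the object, so you choose its name on import, whereas load() restores objects under their original names and can overwrite objects already in your environment.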
3.1.1 Save and organize your work
Save your R script, RMarkdown document, or project by clicking on “File” and selecting “Save” or “Save As.”
Use RStudio’s project structure to organize your files, code, data, and outputs. This helps maintain a clear and structured workflow.
Objects in R, such as data frames, can be saved in a number of formats. Here are some common methods to save data from R.
CSV files
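Saving to CSV can be sketched with base R's write.csv(). The data frame and the temporary file below are assumptions made for illustration; in practice you would supply your own object and a file name such as "data.csv".

```r
# Example data frame (hypothetical values)
data <- data.frame(participant_id = c(1, 2), age = c(18, 19))

# Temporary file so the sketch runs on its own
csv_file <- tempfile(fileext = ".csv")

# Save data as CSV file; row.names = FALSE avoids writing an extra column of row numbers
write.csv(data, csv_file, row.names = FALSE)

# Read the file back in to check it
read.csv(csv_file)
```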
Excel files
# Install and load the openxlsx package
install.packages("openxlsx")
library(openxlsx)
# Save data as Excel file (written to the first sheet by default)
write.xlsx(data, "data.xlsx")
SPSS files
# Install and load the haven package
install.packages("haven")
library(haven)
# Save data as SPSS file
write_sav(data, "data.sav")
R data files
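R objects can be saved in R's own formats with the base functions saveRDS() (a single object) and save() (one or more objects). This is a minimal sketch; the data frame, the condition_labels vector, and the temporary file names are hypothetical.

```r
# Example objects (hypothetical values)
data <- data.frame(participant_id = c(1, 2), age = c(18, 19))
condition_labels <- c("high", "low")

# Save a single object as an .rds file
rds_file <- tempfile(fileext = ".rds")
saveRDS(data, rds_file)

# Save several objects together in one .RData file
rdata_file <- tempfile(fileext = ".RData")
save(data, condition_labels, file = rdata_file)
```

Unlike CSV, these formats preserve R-specific information such as factor levels and column types.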
By following these steps, you’ll set up an optimal R environment for data analysis. As you progress, you can explore more advanced techniques, packages, and tools to enhance your skills and gain deeper insights from your data.
3.2 A note on data from Qualtrics
It is recommended that you download Qualtrics data in .csv format and select the Export Values option (found under Data and Analysis -> Export & Import -> Export...). If you have exported your Qualtrics data in .csv format, you will find that the first three rows all contain headers. Reading such a file directly into R will result in gobbledygook. Here is a bit of code to read it in, remove the redundant headers, and save a trimmed file for further analyses.
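# Read data set (the first header row becomes the column names)
data <- read.csv("qualtrics_data.csv")
# Remove the two redundant header rows that remain as rows 1 and 2
data_trimmed <- data[3:nrow(data), ]
# Save trimmed data set (without adding a column of row numbers)
write.csv(data_trimmed, "data.csv", row.names = FALSE)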
You should orient yourself and ensure you have a good understanding of the variables in the dataset and how they link back to your survey. This will often require you to go back to the Qualtrics platform, navigate to your survey, and note down the indicator of each question (found in the top left-hand corner of each question). It is worth keeping in mind that these can be edited in Qualtrics to make them more interpretable. If left unedited, they will be assigned uninterpretable names (e.g., Q424).
You may need to re-code variables in your dataset to more interpretable values (e.g., 1 -> male). The numeric values Qualtrics has assigned to each response option can be found by selecting the relevant question in the main survey view on the Qualtrics platform and selecting Recode values at the bottom of the left information pane. How to re-code values in R is explained in Chapter 2.
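As a brief reminder, one way to re-code numeric values is with base R's factor(). This is a sketch only and may differ from the approach in Chapter 2; the coding 1 = male, 2 = female is an assumption for illustration, so check the Recode values pane for the codes used in your own survey.

```r
# Hypothetical numeric codes as exported from Qualtrics
data <- data.frame(gender = c(1, 2, 2, 1))

# Assumed coding for illustration: 1 = male, 2 = female
data$gender_label <- factor(data$gender,
                            levels = c(1, 2),
                            labels = c("male", "female"))

# Check the re-coded values
table(data$gender_label)
```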
Qualtrics provides a number of variables by default which you should know how to interpret, including:
- Finished is whether participants completed the entire survey (1) or stopped short of completion (0)
- Duration (in seconds) is how long it took participants to complete the survey in seconds
- DistributionChannel is whether the data were generated by previewing the survey or via a distribution link
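These variables are often used to exclude responses before analysis. The following is a minimal sketch on made-up data: the values "preview" and "anonymous" for DistributionChannel and the column Q1 are assumptions for illustration, so check the values that appear in your own export.

```r
# Hypothetical data mimicking a trimmed Qualtrics export
data_trimmed <- data.frame(
  Finished = c(1, 1, 0, 1),
  DistributionChannel = c("anonymous", "preview", "anonymous", "anonymous"),
  Q1 = c(4, 5, 2, 3)
)

# Keep completed responses that were not generated by previewing the survey
keep <- data_trimmed$Finished == 1 &
  data_trimmed$DistributionChannel != "preview"
data_clean <- data_trimmed[keep, ]

# Number of responses retained
nrow(data_clean)
```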
3.3 Good practices: Making sure your data is correct.
It is recommended that you conduct basic checks to ensure the validity of your data, including computing summary() of all relevant variables to confirm the expected values (e.g., a scale from 1-7 should not have scores of 9), and also examining their distributions via hist(). Details on these functions are provided below.
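A range check of this kind can be sketched as follows. The scores vector is made up for illustration and deliberately contains one impossible value for a 1-7 scale.

```r
# Hypothetical responses on a 1-7 scale, with one invalid score
scores <- c(3, 7, 1, 9, 5)

# The minimum and maximum reveal values outside the expected range
summary(scores)

# Find the positions of out-of-range scores
out_of_range <- which(scores < 1 | scores > 7)
out_of_range

# Look at the offending values
scores[out_of_range]
```

Here the maximum of 9 reported by summary() flags the problem, and which() identifies the fourth response as the one to inspect.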
3.4 Distributions and summary statistics: Understanding basic properties of data
Exploratory Data Analysis is a crucial initial step in the data analysis process. It involves understanding the structure, patterns, and characteristics of your data before diving into more complex analyses. In this section, we will look at how to explore data distributions and calculate summary statistics in R.
3.4.1 Distributions
Data distribution refers to how the data values are spread across different ranges. Visualizing data distributions helps identify patterns, outliers, and central tendencies within the data.
A histogram is a graphical representation of the frequency distribution of a set of values. It divides the data into intervals (bins) and displays the count of data points falling into each bin.
To create a histogram in R, you can use the hist() function, passing it a numeric vector such as a column of your data frame.
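A minimal sketch of a histogram with base R's hist() is shown here. The values are the variable1 scores from the example data set used later in this chapter; the title and axis label are arbitrary choices.

```r
# Example scores (the variable1 values from the example data set in this chapter)
variable1 <- c(9, 15, 9, 11, 4, 6, 4, 12)

# Draw the histogram; hist() also returns the bin information invisibly
h <- hist(variable1,
          main = "Distribution of variable1",
          xlab = "variable1")

# Bin boundaries and the number of observations in each bin
h$breaks
h$counts
```

The breaks argument of hist() can be used to request a different number of bins if the default grouping hides the shape of the data.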
3.4.2 Summary statistics
Summary statistics provide a concise summary of the main characteristics of a data set. They include measures of central tendency, dispersion, and shape.
The summary() function provides a quick overview of essential statistics for each variable in the dataset. For numeric variables, it reports the minimum, 1st quartile, median, mean, 3rd quartile, and maximum.
# Create a data set
data <- data.frame(participant_id = c(1, 2, 3, 4, 5, 6, 7, 8),
                   age = c(18, 19, 18, 22, 18, 19, 19, 18),
                   gender = c("male", "female", "male", "female", "female", "female", "male", "male"),
                   condition = c("high", "high", "low", "high", "low", "low", "low", "high"),
                   variable1 = c(9, 15, 9, 11, 4, 6, 4, 12))
# Calculate summary statistics for a data set
summary(data)
# Calculate summary statistics for a variable
summary(data$variable1)
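Note that summary() does not report a standard deviation. Individual statistics can be computed with the base functions mean(), median(), sd(), and range(), sketched here on the variable1 values from the example data set.

```r
# The variable1 values from the example data set
variable1 <- c(9, 15, 9, 11, 4, 6, 4, 12)

# Central tendency
mean(variable1)    # 8.75
median(variable1)  # 9

# Dispersion
sd(variable1)
range(variable1)   # 4 15
```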
The tidyverse package provides additional functionality for easy calculation of frequency counts, means, and standard deviations by group with the summarize() and group_by() functions.
# Load packages
library(tidyverse)
# Create a data set
data <- data.frame(participant_id = c(1, 2, 3, 4, 5, 6, 7, 8),
                   age = c(18, 19, 18, 22, 18, 19, 19, 18),
                   gender = c("male", "female", "male", "female", "female", "female", "male", "male"),
                   condition = c("high", "high", "low", "high", "low", "low", "low", "high"),
                   variable1 = c(9, 15, 9, 11, 4, 6, 4, 12))
# Summary ns and age
data %>%
  summarize(n = n(),
            age_mean = mean(age, na.rm = TRUE),
            age_sd = sd(age, na.rm = TRUE))
# Summary ns by gender
data %>%
  group_by(gender) %>%
  summarize(n = n())
# Summary of variable1 by condition
data %>%
  group_by(condition) %>%
  summarize(n = n(),
            variable1_mean = mean(variable1, na.rm = TRUE),
            variable1_sd = sd(variable1, na.rm = TRUE))