This guide is mainly written for academics, community based researchers, and advocates, who are interested in using R to analyze and visualize data.
R has a number of advantages for individuals working in academic settings, agencies, and community settings. First, because R is open source, it is free, and does not carry the high cost of proprietary statistical or data visualization software.
Second, using R means that one has access to a worldwide community of people who are constantly developing new R packages, and new materials for learning R.
That being said, R can have a number of drawbacks. Documentation and help files can sometimes be difficult to understand. R’s syntax, and the “R way of doing things” can present a formidable barrier.
My hope in this document is to provide an introduction to R that bypasses some of these difficulties by providing straightforward instruction focused on the likely needs of social researchers, community based researchers, and advocates. I want to help these groups of people to use R in an effective way.
I believe that it is possible to teach R in an accessible way, and that a little bit of R can take you a long way.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
This document is a brief introduction to R.[1] Commands that you actually type into R are represented in courier font. mydata is the name of your data set. x, y, and z refer to variables in your data. More documentation on any command is usually available via help(command) or ??command. The R interface makes it extremely easy to do rapid interactive data analysis. Hit “Up-Arrow” to recall the most recent command, which you can then quickly edit and resubmit. Remember also that one often submits a command or set of commands from a script window.
The general idea of many R commands is command(mydata$x, options) or command(x, data = mydata, options).
[Figure: Graphical Possibilities in Base R]
The $ sign is a kind of “connector”. mydata$x means: “The variable x in the dataset called mydata”.
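As a minimal sketch of these ideas, the example below builds a small made-up data frame (the values are invented purely for illustration) and calls one command with the `$` connector and one with a `data =` argument. The choice of `mean()` and `lm()` is my own, and other commands follow similar patterns.

```r
# Hypothetical toy data standing in for mydata
mydata <- data.frame(x = c(2, 4, 6, NA),
                     y = c(1, 3, 2, 5))

# Pattern 1: command(mydata$variable, options)
# na.rm = TRUE is an option telling mean() to ignore missing values
mean_x <- mean(mydata$x, na.rm = TRUE)

# Pattern 2: command(variables, data = mydata, options)
# The variables are named in a formula; the data set is named separately
fit <- lm(y ~ x, data = mydata)

mean_x
coef(fit)
```

In the first pattern the data set is named every time a variable is used; in the second it is named once.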
Sometimes, it is not necessary to use any options, since the authors of many R commands have done a good job of thinking about the defaults. R can make use of long pathnames to files.
Note that R uses forward slashes / instead of backslashes for directories. R uses ~ to refer to the user’s (usually your) home directory.
Most of this guide makes use of what is most often called Base R, the R that you get when you install the R software, and RStudio, on your computer.
For many social researchers, the data structure of primary interest is the data frame, and thus that is my focus here. In the interests of parsimony I do not go into a great deal of detail on R’s other data structures.
A great deal can be accomplished with Base R. However, as you grow in your use of R, you will likely need to make frequent use of libraries, which are invoked by the library(...) command. Before using a library you need to install it. For example, install.packages("ggplot2") installs the ggplot2 advanced graphics library.
You would need to install the library only once. Installation can also be accomplished from the “Packages” tab in RStudio.
Then start the library when you are using R by typing library(ggplot2).
I should mention here the libraries that make up the tidyverse, a relatively recent addition to the R ecosystem. Learning the tidyverse requires an additional investment of time; however, the tidyverse makes many improvements to the R language and its functionality.
R uses the concept of a working directory to know where to find files, and where to save files.
It is often helpful to simply set your working directory to a particular location and by default, files will be accessed from, and saved to, that directory e.g.:
getwd() # "get", or find out, your working directory
setwd("C:/Users/user1/Desktop/") # set your working directory
Note that R uses a forward slash / to specify directory paths. R does not understand the use of a backward slash \ to specify directories. R uses ~ to refer to the user’s (usually your) home directory.
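The sketch below illustrates the working directory idea. It uses R's temporary directory, `tempdir()`, as a stand-in for your own project folder, and the file name `wd_demo.csv` is made up for the example.

```r
old_wd <- getwd()     # remember where we started
setwd(tempdir())      # tempdir() stands in for your project folder

# A file name given without a path is saved in the working directory
write.csv(data.frame(id = 1:3), "wd_demo.csv", row.names = FALSE)
saved_in_wd <- file.exists(file.path(tempdir(), "wd_demo.csv"))

# path.expand() shows what ~ refers to on this computer
home_dir <- path.expand("~")

setwd(old_wd)         # go back to the original working directory
```

Restoring the original working directory at the end is optional, but it keeps the example from changing where later commands look for files.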
R is a command or syntax based program, and many advanced functions are only available via syntax.
R commands are stored in a script or code file that usually ends in .R, e.g. myRscript.R. The command file is distinct from your actual data, stored in an .RData file, e.g. mydata.RData.
Base R can sometimes be cryptic.
However, a little bit of Base R can go a long way, and you can get a great learning return for a little bit of investment in learning Base R.
A good Graphical User Interface (GUI) can make some of the base functionality of R available without the use of syntax. RCommander is the best GUI, and can be installed from the command line by typing install.packages("Rcmdr").
RCommander can make some tasks easier, but the syntax that it produces can sometimes be non-intuitive. Often it is easiest (and more in the interests of replicable research) just to learn how to write the R code that accomplishes a particular task. Further, your learning may go quicker if you bypass RCommander altogether and simply learn how to write R code.
[Figure: screenshot of RCommander]
RStudio is an Integrated Development Environment (IDE) that can be run simultaneously with RCommander and provides an easier working environment for R. If all the software is installed, start RStudio to start R, then type library(Rcmdr) to start RCommander.
[Figure: screenshot of RStudio]
Remember that R uses a forward slash / to specify directory paths. R does not understand the use of a backward slash \ to specify directories.
R most easily makes use of data in R format. Data can be loaded with the load() command.
load("the/path/to/myRfile.RData") # specific directory path and file
load("myRfile.RData") # no path indicated; file needs to be in working directory
Note, as we discuss in a little more detail below, that a single data file can contain multiple data frames.
For example, a data file called projectdata.RData could contain several data frames, each with its own name.
The name of the RData file can be very different from the name of the data frames that it contains.
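Here is a small sketch of that idea. The data frames `clients` and `agencies` are hypothetical names and values chosen for illustration, and the file is written to a temporary directory rather than a real project folder.

```r
# Two made-up data frames for one project
clients  <- data.frame(id = 1:3, age = c(25, 40, 31))
agencies <- data.frame(agency = c("A", "B"), staff = c(12, 7))

# Save both data frames into a single .RData file
rdata_file <- file.path(tempdir(), "projectdata.RData")
save(clients, agencies, file = rdata_file)

# Remove them from memory, then load them back
rm(clients, agencies)
loaded_names <- load(rdata_file)   # load() returns the names of what it restored
loaded_names
```

After `load()`, the data frames reappear under their own names, not under the name of the file.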
R can also read comma separated values (csv).
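As an illustration, the sketch below first writes a tiny made-up csv file to a temporary directory (so that the example is self-contained) and then reads it with `read.csv()`. With your own data you would skip the first step and point `read.csv()` at your own file.

```r
# Create a small example csv file (made-up values)
csv_file <- file.path(tempdir(), "mydata.csv")
writeLines(c("id,age,income",
             "1,25,30000",
             "2,40,52000",
             "3,31,41000"),
           csv_file)

# Read the csv file into a data frame
mydata <- read.csv(csv_file)
str(mydata)   # check the variables and their types
```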
R can easily import well-formatted data from other packages like SPSS, Stata, or Excel,[2] using libraries such as foreign, readxl, and haven.
haven is a new library for reading in data from statistical packages that may work better.
Once you have your data in R, it will likely make sense to save it in *.RData format for future use, using the save() command.
Note, as we alluded to earlier, that multiple data frames can be saved into a single data file.[3]
Use the Script Editor to save R commands that you want to use again, or to modify for the next project, as well as to create an “audit trail” of your work so that your workflow is documented and replicable. R commands are saved in a .R file, e.g. myscript.R.
Working with a subset of your data (i.e. fewer variables rather than many many variables) is often helpful. The subset function can be especially helpful.
mydata_subset <- subset(mydata, # name of data
age > 18, # condition(s)
select = c(id, sex, income)) # variables
You can then run functions like summary() on a subset of your data.
You can also save this subset for future use.
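The steps above can be sketched with a small made-up data frame. The variable names match the subset example, but the values, the extra `region` variable, and the temporary file location are assumptions for illustration only.

```r
# Hypothetical data with one more variable than we need
mydata <- data.frame(id     = 1:5,
                     age    = c(17, 22, 35, 18, 60),
                     sex    = factor(c("F", "M", "F", "M", "F")),
                     income = c(0, 21000, 48000, 9000, 30000),
                     region = c("N", "S", "S", "N", "N"))

# Keep people older than 18, and only three of the variables
mydata_subset <- subset(mydata,
                        age > 18,
                        select = c(id, sex, income))

summary(mydata_subset)   # descriptive statistics for the subset only

# Save the subset for future use
save(mydata_subset, file = file.path(tempdir(), "mydata_subset.RData"))
```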
R recognizes two basic kinds of variables: continuous variables (which R calls “numeric” variables) which are often scales like income, mental health, or neighborhood quality; and categorical variables (which R calls “factor variables”) like race, gender or religion.
R seems to make a stronger distinction between these two types of variables than some other statistical software.[4] It can thus sometimes be useful to change variables from one type to another:
mydata$x <- as.numeric(mydata$y)
mydata$x <- as.factor(mydata$y) # shorter syntax
# longer, more complete syntax
mydata$w <- factor(mydata$z, # original numeric variable
levels = c(0, 1, 2), # levels of numeric variable
labels = c("Group A", # labels
"Group B",
"Group C"),
ordered = TRUE) # often useful to order the levels
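One thing to watch for when converting: applying `as.numeric()` directly to a factor returns R's internal level codes (1, 2, 3, ...), not the original values. The sketch below, using made-up values, shows the difference and one common workaround.

```r
# Numeric variable turned into a labelled, ordered factor
z <- c(0, 1, 2, 1, 0)
w <- factor(z,
            levels = c(0, 1, 2),
            labels = c("Group A", "Group B", "Group C"),
            ordered = TRUE)

# as.numeric() on a factor gives the level codes, not the original 0/1/2
codes <- as.numeric(w)

# A factor whose labels are numbers stored as text
f <- factor(c("10", "20", "10"))
wrong <- as.numeric(f)                 # level codes
right <- as.numeric(as.character(f))   # the numbers themselves

codes
wrong
right
```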
Data with missing values, often represented as negative numbers (e.g. -99, -9, -8) need to be recoded so that the missing values are represented as a missing value character (“NA”) that R knows to exclude from calculations.
Sometimes you want to drop rows of data that contain missing values. This can be accomplished with na.omit().
na.omit() removes a row of data where any value is missing, so sometimes you want to work with a subset of your data before applying na.omit().
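A minimal sketch of both steps follows. The data and the particular missing-value codes (-99, -9, -8) are made up for illustration; your own data may use different codes.

```r
# Hypothetical data where negative codes stand for missing values
mydata <- data.frame(id = 1:4,
                     x  = c(5, -99, 7, -9),
                     y  = c(1, 2, -8, 4))

missing_codes <- c(-99, -9, -8)

# Recode the missing-value codes to NA, one variable at a time
mydata$x[mydata$x %in% missing_codes] <- NA
mydata$y[mydata$y %in% missing_codes] <- NA

# Many functions need to be told to skip NA values
mean_x <- mean(mydata$x, na.rm = TRUE)

# Drop every row that has a missing value in any variable
complete_rows <- na.omit(mydata)
complete_rows
```

In this toy example only one row survives `na.omit()`, which is why it can be worth subsetting to the variables you need first.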
It is often convenient to rename the variables in your data so that they have more intuitively understandable names.
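One way to do this in Base R is sketched below. The original names `v1` and `v2` and the new names are hypothetical, and there are other ways to rename variables.

```r
# Hypothetical data with uninformative variable names
mydata <- data.frame(v1 = c(25, 40),
                     v2 = c(30000, 52000))

# Replace a name by finding it in names(mydata)
names(mydata)[names(mydata) == "v1"] <- "age"
names(mydata)[names(mydata) == "v2"] <- "income"

names(mydata)
```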
It is sometimes useful to sort your data. sort(mydata$x) returns the values of x in sorted order, while mydata[order(mydata$x), ] will sort mydata by the values of x.
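The difference between sorting one variable and sorting the whole data set can be seen in this short sketch with made-up values.

```r
mydata <- data.frame(id = 1:4,
                     x  = c(3, 1, 4, 2))

# sort() returns only the sorted values of x
sorted_x <- sort(mydata$x)

# order() gives row positions, which reorder the whole data frame
mydata_sorted <- mydata[order(mydata$x), ]

# Largest values first
mydata_desc <- mydata[order(mydata$x, decreasing = TRUE), ]

mydata_sorted
```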
You can easily create new variables in R. For example, a change score between a measure collected at two time-points, like a pre-test and a post-test, is the difference between the two variables.
Similarly, you can sum the items of a scale into a scale score.
You can test the alpha reliability of this scale by first creating a data frame of only the scale items, and then passing that data frame to a function such as alpha() from library(psych).
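The sketch below works through these three steps using only Base R. The variable names (`pretest`, `posttest`, `item1` to `item3`) and values are made up, and the reliability is computed directly from the standard Cronbach's alpha formula rather than with any particular library, so treat it as an illustration, not as the only way to do this.

```r
# Hypothetical data: a pre-test, a post-test, and three scale items
mydata <- data.frame(pretest  = c(10, 12, 9, 15, 11),
                     posttest = c(14, 13, 12, 18, 11),
                     item1    = c(1, 2, 3, 4, 5),
                     item2    = c(2, 2, 3, 5, 4),
                     item3    = c(1, 3, 3, 4, 5))

# Change score: post-test minus pre-test
mydata$change <- mydata$posttest - mydata$pretest

# Data frame of only the scale items
scale_items <- mydata[, c("item1", "item2", "item3")]

# Sum the items into a scale score
mydata$scale_total <- rowSums(scale_items)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total)
k <- ncol(scale_items)
item_variances <- sapply(scale_items, var)
cronbach_alpha <- (k / (k - 1)) *
  (1 - sum(item_variances) / var(rowSums(scale_items)))

cronbach_alpha
```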
summary(mydata$x)
gives you basic descriptive statistics for a variable, such as the mean (average). This is especially useful for continuous variables. Use summary(mydata) to summarize every variable in your data.
skim(mydata) from library(skimr) or describe(mydata) from library(psych) will often give you a nicer summary of your variables that is closer to what you want for an academic paper or agency report.
You may only want to look at descriptive statistics for a subset of your data. Creating a subset and then running descriptive statistics on that subset may be helpful.
table(mydata$x)
gives you a frequency distribution for your variable. This is especially useful for factor variables. prop.table(table(mydata$x)) will give you a table of proportions. Calling up library(descr) and then using freq(mydata$x) will give you a more nicely formatted frequency distribution.
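As a brief sketch using only Base R, the example below summarizes a made-up continuous variable `x` and tabulates a made-up factor variable `g` (a hypothetical name used here so the two kinds of variable stay distinct).

```r
# Hypothetical data: one continuous variable, one factor variable
mydata <- data.frame(x = c(1.5, 2.0, 3.5, 4.0, NA),
                     g = factor(c("a", "b", "a", "a", "b")))

summary(mydata$x)   # descriptive statistics, including a count of NA values
summary(mydata)     # every variable at once

counts <- table(mydata$g)       # frequency distribution
props  <- prop.table(counts)    # proportions

counts
props
```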
Tabulating two categorical variables (factor variables) together gives you a cross-tabulation of those variables, e.g:
table(mydata$x, mydata$y) # simple table of counts
prop.table(table(mydata$x, mydata$y)) # table of cell proportions
prop.table(table(mydata$x, mydata$y),
           margin = 1) # row margins: row proportions
prop.table(table(mydata$x, mydata$y),
           margin = 2) # column margins: column proportions
Then, chisq.test() applied to the table of counts will give you a chi-square test of the relationship of x and y.
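A minimal sketch of a cross-tabulation followed by a chi-square test is below. The data are made up and far too small for a real analysis, which is why the example silences the warning R gives about small expected counts.

```r
# Hypothetical data: two factor variables
mydata <- data.frame(
  x = factor(c("yes", "yes", "no", "no", "yes", "no", "yes", "no")),
  y = factor(c("a",   "a",   "b",  "b",  "a",   "b",  "b",   "a")))

mytable <- table(mydata$x, mydata$y)            # counts

row_props <- prop.table(mytable, margin = 1)    # each row sums to 1
col_props <- prop.table(mytable, margin = 2)    # each column sums to 1

# Chi-square test of the relationship of x and y
# (suppressWarnings() only because this toy table has very small counts)
chi_result <- suppressWarnings(chisq.test(mytable))

mytable
chi_result
```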
The cor() function will give you the correlation of continuous variables x and y.[5]
The cor.test() function will test the statistical significance of this correlation.
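The sketch below shows one way these two functions might be used, with made-up data that includes a missing value. The `use = "complete.obs"` option is my assumption about how you would want missing values handled; without it, `cor()` returns NA when any value is missing.

```r
# Hypothetical data with one missing value in x
mydata <- data.frame(x = c(1, 2, 3, 4, 5, NA),
                     y = c(2.1, 3.9, 6.2, 8.1, 9.8, 12))

# Correlation, using only rows where both x and y are present
r_xy <- cor(mydata$x, mydata$y, use = "complete.obs")

# Test of the statistical significance of the correlation
cor_result <- cor.test(mydata$x, mydata$y)

r_xy
cor_result
```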
A function such as tapply() gives you a summary of continuous variable x by factor variable z.
The t.test() function runs a t-test of continuous variable x over factor variable z.
A function such as aov() runs the corresponding ANOVA of continuous variable x across factor variable z.
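These three analyses can be sketched as follows with made-up data. The particular functions used here (`tapply()`, `t.test()` with a formula, and `aov()`) are one reasonable Base R choice among several.

```r
# Hypothetical data: continuous x, two-group factor z
mydata <- data.frame(
  x = c(10, 12, 11, 13, 20, 22, 21, 23),
  z = factor(rep(c("Group A", "Group B"), each = 4)))

# Summary of x within each level of z
tapply(mydata$x, mydata$z, summary)
group_means <- tapply(mydata$x, mydata$z, mean)

# t-test of x over z
t_result <- t.test(x ~ z, data = mydata)

# Corresponding ANOVA of x across z
anova_model <- aov(x ~ z, data = mydata)

group_means
t_result
summary(anova_model)
```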
mymodel <- lm(y ~ x + z, data = mydata)
runs a regression (linear model) of y on x and z.
Type summary(mymodel) to display the results.
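A small worked sketch of a regression, with made-up data, is below. Note that because z is a factor here, R automatically enters it into the model as an indicator (dummy) variable.

```r
# Hypothetical data: continuous y and x, factor z
mydata <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  z = factor(c("a", "b", "a", "b", "a", "b")),
  y = c(2.0, 4.5, 6.1, 8.4, 9.9, 12.6))

# Regression of y on x and z
mymodel <- lm(y ~ x + z, data = mydata)

summary(mymodel)   # coefficients, standard errors, R-squared
coef(mymodel)      # just the coefficients
```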
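hist(mydata$x)
will give you a nice display of one continuous variable.
[Figure: histogram of x]
Adding options such as a title, axis labels, and a color gives a nicer looking graph.
barplot(table(mydata$x))
gives similar results when x is a factor variable.
[Figure: barplot of y]
plot(mydata$x, mydata$y)
gives you a twoway scatterplot of your data.
A more nicely labelled graph can be obtained by adding options such as main, xlab, and ylab.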
abline(lm(mydata$y~mydata$x))
will add a linear fit line to a scatterplot that you have already constructed.
abline(lm(mydata$y~mydata$x), col="gold", lwd=5)
will be a nicer looking fit line.
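Putting the graphing commands together, the sketch below draws a labelled histogram and a labelled scatterplot with a fit line. The data, titles, and colors are made up, and the graphs are written to a PDF file in a temporary directory so that the example runs the same way outside of RStudio; in interactive use you can simply omit the `pdf()` and `dev.off()` lines.

```r
# Hypothetical data
mydata <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                     y = c(2.2, 3.8, 6.1, 8.3, 9.7, 12.4, 13.8, 16.1))

plot_file <- file.path(tempdir(), "graph_demo.pdf")
pdf(plot_file)                       # send graphs to a PDF file

# Histogram with a title, axis label, and color
hist(mydata$x,
     main = "Histogram of x",
     xlab = "x",
     col  = "blue")

# Scatterplot with a title and axis labels
plot(mydata$x, mydata$y,
     main = "Scatterplot of y against x",
     xlab = "x",
     ylab = "y")

# Add a linear fit line to the scatterplot just drawn
abline(lm(mydata$y ~ mydata$x), col = "gold", lwd = 5)

invisible(dev.off())                 # close the PDF file
plot_saved <- file.exists(plot_file)
```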
[Figure: scatterplot of y against x]
[1] This document is inspired by my longstanding “Two Page Stata” document (available as PDF and HTML).
[2] These instructions assume you have used setwd() appropriately, or alternatively are specifying a full pathname and filename.
[3] Some would call this a feature of R, while others would simply say that this is another confusing aspect of R.
[4] In many cases, this is very helpful in that R recognizes that the type of variable calls for a certain kind of statistic or graph, or vice versa. In other cases, this may be the source of an error message.
[5] Here is an example where R turns a simple issue into a difficult one, and the syntax is frankly less than elegant, and not intuitive. I don’t have this syntax memorized. I use library(Rcmdr) if I need to test, or create a script for, a correlation.
15 Comments, Questions and Corrections
Comments, questions and corrections most welcome and may be sent to: Andrew Grogan-Kaylor @ http://www.umich.edu/~agrogan & @ agrogan@umich.edu.
Last updated: May 12 2019 at 14:18