2.2 Some R basics
By now you have probably realized that some pieces of R code are repeated over and over, and that it is simpler to modify them directly than to navigate the menus. For example, the functions lm and scatterplot always appear when fitting linear models and drawing scatterplots. It is important to know some R basics in order to understand what these pieces of text are actually doing. Do not worry, the menus will always be there to generate the proper code for you – but you need to have a general idea of the code.
In the following sections, type the code (rather than systematically copying and pasting it) in the 'R Script' panel and send it to the output panel (on the selected expression, either with the 'Submit' button or with 'Control' + 'R').
We begin with the lm function, since it is the one you are most used to. In the following, you should get the same outputs (which are preceded by ## [1]).
2.2.1 The lm function
We are going to employ the EU dataset from Section 2.1.2, with the case names set as the Country. In case you do not have it loaded, you can download it here as an .RData file.
# First of all, this is a comment. Its purpose is to explain what the code is doing
# Comments are preceded by a #
# lm has the syntax: lm(formula = response ~ explanatory, data = data)
# For example (you need to load the EU dataset first)
mod <- lm(formula = Seats2011 ~ Population2010, data = EU)
# We have saved the linear model into mod, which now contains all the output of lm
# You can see it by typing
mod
##
## Call:
## lm(formula = Seats2011 ~ Population2010, data = EU)
##
## Coefficients:
## (Intercept) Population2010
## 7.910e+00 1.078e-06
# mod is indeed a list of objects whose names are
names(mod)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "na.action" "xlevels" "call" "terms"
## [13] "model"
# We can access these elements by $
# For example
mod$coefficients
## (Intercept) Population2010
## 7.909890e+00 1.078486e-06
# The residuals
mod$residuals
## Germany France United Kingdom Italy Spain
## 2.86753139 -3.70310468 -1.78469388 0.01391858 -3.50839420
## Poland Romania Netherlands Greece Belgium
## 1.92718471 1.94344541 0.21421832 1.89769973 2.39942538
## Portugal Czech Republic Hungary Sweden Austria
## 2.61748660 2.75866040 3.28980283 2.01631620 2.05747784
## Bulgaria Denmark Slovakia Finland Ireland
## 1.93275540 -0.87902697 -0.76059520 -0.68132864 -0.72840765
## Lithuania Latvia Slovenia Estonia Cyprus
## 0.49978824 -1.33472983 -2.11752493 -3.35519827 -2.77607293
## Luxembourg Malta
## -2.45136132 -2.35527254
# The fitted values
mod$fitted.values
## Germany France United Kingdom Italy Spain
## 96.132469 77.703105 74.784694 72.986081 57.508394
## Poland Romania Netherlands Greece Belgium
## 49.072815 31.056555 25.785782 20.102300 19.600575
## Portugal Czech Republic Hungary Sweden Austria
## 19.382513 19.241340 18.710197 17.983684 16.942522
## Bulgaria Denmark Slovakia Finland Ireland
## 16.067245 13.879027 13.760595 13.681329 12.728408
## Lithuania Latvia Slovenia Estonia Cyprus
## 11.500212 10.334730 10.117525 9.355198 8.776073
## Luxembourg Malta
## 8.451361 8.355273
# Summary of the model
sumMod <- summary(mod)
sumMod
##
## Call:
## lm(formula = Seats2011 ~ Population2010, data = EU)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7031 -1.9511 0.0139 1.9799 3.2898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.910e+00 5.661e-01 13.97 2.58e-13 ***
## Population2010 1.078e-06 1.915e-08 56.31 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.289 on 25 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9922, Adjusted R-squared: 0.9919
## F-statistic: 3171 on 1 and 25 DF, p-value: < 2.2e-16
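As a quick check of how the pieces stored in a fitted model relate to each other, the following minimal sketch uses R's built-in cars dataset (an assumption made for illustration, since the EU dataset may not be loaded; speed and dist are the columns of cars). It verifies that the fitted values are the intercept plus the slope times the predictor, and that the fitted values plus the residuals recover the response.

```r
# Illustration with the built-in 'cars' dataset (not the EU dataset of the text)
carsMod <- lm(formula = dist ~ speed, data = cars)

# Coefficients can be accessed by name
intercept <- carsMod$coefficients["(Intercept)"]
slope <- carsMod$coefficients["speed"]

# Fitted values are intercept + slope * predictor
manualFitted <- intercept + slope * cars$speed
fittedMatch <- all.equal(unname(carsMod$fitted.values), unname(manualFitted))

# Response = fitted values + residuals
recovered <- carsMod$fitted.values + carsMod$residuals
responseMatch <- all.equal(unname(recovered), cars$dist)

fittedMatch; responseMatch
```

Both comparisons should return TRUE; all.equal is used instead of == because the values are only equal up to rounding error.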
The following table is a handy cheat sheet of equivalences between R code and some of the statistical concepts associated with linear regression.
R | Statistical concept |
---|---|
x | Predictor \(X_1,\ldots,X_n\) |
y | Response \(Y_1,\ldots,Y_n\) |
data <- data.frame(x = x, y = y) | Sample \((X_1,Y_1),\ldots,(X_n,Y_n)\) |
model <- lm(y ~ x, data = data) | Fitted linear model |
model$coefficients | Fitted coefficients \(\hat\beta_0,\hat\beta_1\) |
model$residuals | Residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\) |
model$fitted.values | Fitted values \(\hat Y_1,\ldots,\hat Y_n\) |
model$df.residual | Degrees of freedom \(n-2\) |
summaryModel <- summary(model) | Summary of the fitted linear model |
summaryModel$sigma | Fitted residual standard deviation \(\hat\sigma\) |
summaryModel$r.squared | Coefficient of determination \(R^2\) |
summaryModel$fstatistic | \(F\)-test |
anova(model) | ANOVA table |
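The equivalences in the table can be checked numerically. The sketch below simulates a small sample (the data, the seed and the true coefficients are assumptions for illustration only) and recomputes \(\hat\sigma\) and \(R^2\) by hand from the residuals, using the standard formulas \(\hat\sigma^2 = \sum_i \hat\varepsilon_i^2/(n-2)\) and \(R^2 = 1 - \sum_i \hat\varepsilon_i^2 / \sum_i (Y_i-\bar Y)^2\).

```r
# Simulated sample, for illustration only
set.seed(1)
x <- rnorm(50)
y <- 1 + 2 * x + rnorm(50)
data <- data.frame(x = x, y = y)

model <- lm(y ~ x, data = data)
summaryModel <- summary(model)

# Residual sum of squares and total sum of squares
rss <- sum(model$residuals^2)
tss <- sum((data$y - mean(data$y))^2)

# Fitted residual standard deviation, computed by hand
sigmaByHand <- sqrt(rss / model$df.residual)

# Coefficient of determination, computed by hand
r2ByHand <- 1 - rss / tss

all.equal(sigmaByHand, summaryModel$sigma)
all.equal(r2ByHand, summaryModel$r.squared)
```

With \(n = 50\) observations, model$df.residual equals \(n - 2 = 48\), which is the denominator used for \(\hat\sigma^2\).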
Do the following:
- Compute the regression of CamCom2011 on Population2010. Save that model as the variable myModel.
- Access the objects residuals and coefficients of myModel.
- Compute the summary of myModel and store it as the variable summaryMyModel.
- Access the object sigma of summaryMyModel.
- Repeat the previous steps changing the names of myModel and summaryMyModel to otherMod and infoOtherMod, respectively.
Now you know how to fit and summarize a linear model with a few keystrokes. Let’s see more of the basics of R – it will be useful for the next sections.
2.2.2 Simple computations
# These are some simple operations
# The console can act as a simple calculator
1.0 + 1.1
## [1] 2.1
2 * 2
## [1] 4
3/2
## [1] 1.5
2^3
## [1] 8
1/0
## [1] Inf
0/0
## [1] NaN
# Use ; for performing several operations in the same line
(1 + 3) * 2 - 1; 1 + 3 * 2 - 1
## [1] 7
## [1] 6
# Mathematical functions
sqrt(2); 2^0.5
## [1] 1.414214
## [1] 1.414214
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
exp(1)
## [1] 2.718282
log(10); log10(10); log2(10)
## [1] 2.302585
## [1] 1
## [1] 3.321928
sin(pi); cos(0); asin(0)
## [1] 1.224647e-16
## [1] 1
## [1] 0
# Remember to complete the expressions
1 +
(1 + 3
## Error: <text>:4:0: unexpected end of input
## 2: 1 +
## 3: (1 + 3
## ^
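The output of sin(pi) above is a tiny number rather than exactly zero, because pi and most decimal numbers are stored with finite precision. A minimal sketch of how to compare such results is given next; it relies on the base function all.equal, which compares numbers up to a small tolerance (the variable names are chosen only for this example).

```r
# sin(pi) is not exactly zero because pi is stored with finite precision
exactZero <- sin(pi) == 0

# all.equal compares numbers up to a small tolerance
nearZero <- isTRUE(all.equal(sin(pi), 0))

# The same happens with decimal fractions
exactSum <- (0.1 + 0.2) == 0.3
nearSum <- isTRUE(all.equal(0.1 + 0.2, 0.3))

exactZero; nearZero; exactSum; nearSum
```

The exact comparisons with == are FALSE, whereas the comparisons with all.equal are TRUE, so the latter is the safer choice when checking numerical results.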
2.2.3 Variables and assignment
# Any operation that you perform in R can be stored in a variable (or object)
# with the assignment operator "<-"
a <- 1
# To see the value of a variable, we simply type it
a
## [1] 1
# A variable can be overwritten
a <- 1 + 1
# Now the value of a is 2 and not 1, as before
a
## [1] 2
# Careful with capitalization
A
## Error in eval(expr, envir, enclos): object 'A' not found
# Different
A <- 3
a; A
## [1] 2
## [1] 3
# The variables are stored in your workspace (.RData file)
# A handy tip to see what variables are in the workspace
ls()
## [1] "a" "A" "appCamComNoGer2011"
## [4] "appEU2011" "appEUNoGer2011" "appUS"
## [7] "EU" "mod" "pisa"
## [10] "pisaLinearModel" "sumMod" "US"
# Now you know which variables can be accessed!
# Remove variables
rm(a)
a
## Error in eval(expr, envir, enclos): object 'a' not found
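Typing the name of a removed variable raises an error, as seen above. A possible way of checking whether a variable is defined without triggering that error is the base function exists, sketched below with a hypothetical variable name (tmpValue) chosen only for this example.

```r
# Hypothetical variable name, chosen only for this example
tmpValue <- 10

# exists() checks whether a variable is defined, without raising an error
existsBefore <- exists("tmpValue")

# Remove it and check again
rm(tmpValue)
existsAfter <- exists("tmpValue")

existsBefore; existsAfter
```

Note that exists takes the name of the variable as a character string, in quotes.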
Do the following:
- Store \(-123\) in the variable b.
- Get the log of the square of b. (Answer: 9.624369)
- Remove variable b.
2.2.4 Vectors
# These are vectors - arrays of numbers
# We combine numbers with the function c
c(1, 3)
## [1] 1 3
c(1.5, 0, 5, -3.4)
## [1] 1.5 0.0 5.0 -3.4
# A handy way of creating sequences is the operator :
# Sequence from 1 to 5
1:5
## [1] 1 2 3 4 5
# Storing some vectors
myData <- c(1, 2)
myData2 <- c(-4.12, 0, 1.1, 1, 3, 4)
myData
## [1] 1 2
myData2
## [1] -4.12 0.00 1.10 1.00 3.00 4.00
# Entry-wise operations
myData + 1
## [1] 2 3
myData^2
## [1] 1 4
# If you want to access a position of a vector, use [position]
myData[1]
## [1] 1
myData2[6]
## [1] 4
# You can also change elements
myData[1] <- 0
myData
## [1] 0 2
# Think about what you want to access...
myData2[7]
## [1] NA
myData2[0]
## numeric(0)
# If you want to access all the elements except a position, use [-position]
myData2[-1]
## [1] 0.0 1.1 1.0 3.0 4.0
myData2[-2]
## [1] -4.12 1.10 1.00 3.00 4.00
# Also with vectors as indexes
myData2[1:2]
## [1] -4.12 0.00
myData2[myData]
## [1] 0
# And also
myData2[-c(1, 2)]
## [1] 1.1 1.0 3.0 4.0
# But do not mix positive and negative indexes!
myData2[c(-1, 2)]
## Error in myData2[c(-1, 2)]: only 0's may be mixed with negative subscripts
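Besides positions, a vector can also be indexed by a logical condition. This complements the indexing shown above and is only a brief sketch; the vector v and the other names are hypothetical and chosen for this example.

```r
# Hypothetical vector, used only for this example
v <- c(-4.12, 0, 1.1, 1, 3, 4)

# A comparison returns a logical vector of the same length
isPositive <- v > 0

# Indexing with a logical vector keeps the TRUE positions
positives <- v[isPositive]

# which() gives the positions themselves
positions <- which(isPositive)

positives; positions
```

Here positives contains the entries 1.1, 1, 3 and 4, which are placed at positions 3 to 6 of v.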
Do the following:
- Create the vector \(x=(1, 7, 3, 4)\).
- Create the vector \(y=(100, 99, 98, \ldots, 2, 1)\).
- Compute \(x_2+y_4\) and \(\cos(x_3) + \sin(x_2) e^{-y_2}\). (Answers: 104, -0.9899925)
- Set \(x_{2}=0\) and \(y_{2}=-1\). Recompute the previous expressions. (Answers: 97, -0.9899925)
- Index \(y\) by \(x+1\) and store it as z. What is the output? (Answer: z is c(-1, 100, 97, 96))
2.2.5 Some functions
# Functions take arguments between parentheses and transform them into an output
sum(myData)
## [1] 2
prod(myData)
## [1] 0
# Summary of an object
summary(myData)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.5 1.0 1.0 1.5 2.0
# Length of the vector
length(myData)
## [1] 2
# Mean, variance, covariance, correlation and quantiles
mean(myData)
## [1] 1
var(myData)
## [1] 2
cov(myData, myData^2)
## [1] 4
cor(myData, myData * 2)
## [1] 1
quantile(myData)
## 0% 25% 50% 75% 100%
## 0.0 0.5 1.0 1.5 2.0
# Maximum and minimum of vectors
min(myData)
## [1] 0
which.min(myData)
## [1] 1
# Usually the functions have several arguments, which are set by "argument = value"
# In this case, the second argument is a logical flag to indicate the kind of sorting
sort(myData) # If nothing is specified, decreasing = FALSE is assumed
## [1] 0 2
sort(myData, decreasing = TRUE)
## [1] 2 0
# Don't know what the arguments of a function are? Use args and help!
args(sort)
## function (x, decreasing = FALSE, ...)
## NULL
?sort
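The following sketch relates some of the functions above to their definitions and illustrates the two ways of passing arguments. It uses a hypothetical vector v (not one of the vectors of the text) and the base function sd for the standard deviation.

```r
# Hypothetical vector, used only for this example
v <- c(0, 2, 7)

# Variance by hand: squared deviations from the mean over (n - 1)
varByHand <- sum((v - mean(v))^2) / (length(v) - 1)

# sd() is the square root of var()
sdMatches <- all.equal(sd(v), sqrt(var(v)))
varMatches <- all.equal(var(v), varByHand)

# Arguments can be passed by name or by position
byName <- sort(x = v, decreasing = TRUE)
byPosition <- sort(v, TRUE)

sdMatches; varMatches; identical(byName, byPosition)
```

Passing arguments by name is usually more readable, since sort(v, TRUE) requires remembering that the second argument of sort is decreasing.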
Do the following:
- Compute the mean, median and variance of \(y\). (Answers: 49.5, 49.5, 843.6869)
- Do the same for \(y+1\). What are the differences?
- What is the maximum of \(y\)? Where is it placed?
- Sort \(y\) increasingly and obtain the 5th and 76th positions. (Answer: c(4,75))
- Compute the covariance between \(y\) and \(y\). Compute the variance of \(y\). Why do you get the same result?
2.2.6 Matrices, data frames and lists
# A matrix is an array of vectors
A <- matrix(1:4, nrow = 2, ncol = 2)
A
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
# Another matrix
B <- matrix(1, nrow = 2, ncol = 2, byrow = TRUE)
B
## [,1] [,2]
## [1,] 1 1
## [2,] 1 1
# Binding by rows or columns
rbind(1:3, 4:6)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
cbind(1:3, 4:6)
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
# Entry-wise operations
A + 1
## [,1] [,2]
## [1,] 2 4
## [2,] 3 5
A * B
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
# Accessing elements
A[2, 1] # Element (2, 1)
## [1] 2
A[1, ] # First row
## [1] 1 3
A[, 2] # Second column
## [1] 3 4
# A data frame is a matrix with column names
# Useful when you have multiple variables
myDf <- data.frame(var1 = 1:2, var2 = 3:4)
myDf
## var1 var2
## 1 1 3
## 2 2 4
# You can change names
names(myDf) <- c("newname1", "newname2")
myDf
## newname1 newname2
## 1 1 3
## 2 2 4
# The nice thing is that you can access variables by their names with the $ operator
myDf$newname1
## [1] 1 2
# You can also create new variables (they have to be of the same
# length as the rest of the variables)
myDf$myNewVariable <- c(0, 1)
myDf
## newname1 newname2 myNewVariable
## 1 1 3 0
## 2 2 4 1
# A list is a collection of arbitrary variables
myList <- list(myData = myData, A = A, myDf = myDf)
# Access elements by names
myList$myData
## [1] 0 2
myList$A
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
myList$myDf
## newname1 newname2 myNewVariable
## 1 1 3 0
## 2 2 4 1
# Reveal the structure of an object
str(myList)
## List of 3
## $ myData: num [1:2] 0 2
## $ A : int [1:2, 1:2] 1 2 3 4
## $ myDf :'data.frame': 2 obs. of 3 variables:
## ..$ newname1 : int [1:2] 1 2
## ..$ newname2 : int [1:2] 3 4
## ..$ myNewVariable: num [1:2] 0 1
str(myDf)
## 'data.frame': 2 obs. of 3 variables:
## $ newname1 : int 1 2
## $ newname2 : int 3 4
## $ myNewVariable: num 0 1
# A less lengthy output
names(myList)
## [1] "myData" "A" "myDf"
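The operator * acts entry-wise on matrices, which should not be confused with the matrix product, computed with %*%. The sketch below contrasts both on the same two matrices used above and also shows, as a complement, the base function dim and the [[ ]] operator for accessing list elements by name or by position. The list contents are an assumption made to keep the example self-contained.

```r
# Same matrices as in the text
A <- matrix(1:4, nrow = 2, ncol = 2)
B <- matrix(1, nrow = 2, ncol = 2)

# Entry-wise product versus matrix product
entryWise <- A * B
matProd <- A %*% B

# Dimensions of a matrix (rows, columns)
dimA <- dim(A)

# List elements can also be accessed with [[ ]], by name or by position
myList <- list(A = A, B = B)
sameElement <- identical(myList[["A"]], myList$A) && identical(myList[[1]], myList$A)

entryWise; matProd; dimA; sameElement
```

Since B is a matrix of ones, A * B returns the entries of A, whereas A %*% B returns a matrix whose rows contain the row sums of A (4 and 6).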
Do the following:
- Create a matrix called M with rows given by y[3:5], y[3:5]^2 and log(y[3:5]).
- Create a data frame called myDataFrame with column names “y”, “y2” and “logy” containing the vectors y[3:5], y[3:5]^2 and log(y[3:5]), respectively.
- Create a list, called l, with entries for x and M. Access the elements by their names.
- Compute the squares of myDataFrame and save the result as myDataFrame2.
- Compute the log of the sum of myDataFrame and myDataFrame2. Answer:
## y y2 logy
## 1 9.180087 18.33997 3.242862
## 2 9.159678 18.29895 3.238784
## 3 9.139059 18.25750 3.234656