2.2 Some R basics

By now you have probably realized that some pieces of R code are repeated over and over, and that it is simpler to modify them than to navigate the menus. For example, the functions lm and scatterplot always appear in connection with linear models and scatterplots. It is important to know some of the R basics in order to understand what these pieces of text are actually doing. Do not worry, the menus will always be there to generate the proper code for you, but you need to have a general idea of the code.

In the following sections, type the code (rather than systematically copying and pasting it) in the 'R Script' panel and send it to the output panel: select the expression and either click the 'Submit' button or press 'Control' + 'R'.

We begin with the lm function, since it is the one you are most used to. In the following, you should get the same outputs (which are preceded by ##).

2.2.1 The lm function

We are going to employ the EU dataset from Section 2.1.2, with the case names set as the Country. In case you do not have it loaded, you can download it here as an .RData file.

# First of all, this is a comment. Its purpose is to explain what the code is doing
# Comments are preceded by a #

# lm has the syntax: lm(formula = response ~ explanatory, data = data)
# For example (you need to load the EU dataset first)
mod <- lm(formula = Seats2011 ~ Population2010, data = EU)

# We have saved the linear model into mod, which now contains all the output of lm
# You can see it by typing
mod
## 
## Call:
## lm(formula = Seats2011 ~ Population2010, data = EU)
## 
## Coefficients:
##    (Intercept)  Population2010  
##      7.910e+00       1.078e-06

# mod is in fact a list of objects whose names are
names(mod)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"

# We can access these elements by $
# For example
mod$coefficients
##    (Intercept) Population2010 
##   7.909890e+00   1.078486e-06

# The residuals
mod$residuals
##        Germany         France United Kingdom          Italy          Spain 
##     2.86753139    -3.70310468    -1.78469388     0.01391858    -3.50839420 
##         Poland        Romania    Netherlands         Greece        Belgium 
##     1.92718471     1.94344541     0.21421832     1.89769973     2.39942538 
##       Portugal Czech Republic        Hungary         Sweden        Austria 
##     2.61748660     2.75866040     3.28980283     2.01631620     2.05747784 
##       Bulgaria        Denmark       Slovakia        Finland        Ireland 
##     1.93275540    -0.87902697    -0.76059520    -0.68132864    -0.72840765 
##      Lithuania         Latvia       Slovenia        Estonia         Cyprus 
##     0.49978824    -1.33472983    -2.11752493    -3.35519827    -2.77607293 
##     Luxembourg          Malta 
##    -2.45136132    -2.35527254

# The fitted values
mod$fitted.values
##        Germany         France United Kingdom          Italy          Spain 
##      96.132469      77.703105      74.784694      72.986081      57.508394 
##         Poland        Romania    Netherlands         Greece        Belgium 
##      49.072815      31.056555      25.785782      20.102300      19.600575 
##       Portugal Czech Republic        Hungary         Sweden        Austria 
##      19.382513      19.241340      18.710197      17.983684      16.942522 
##       Bulgaria        Denmark       Slovakia        Finland        Ireland 
##      16.067245      13.879027      13.760595      13.681329      12.728408 
##      Lithuania         Latvia       Slovenia        Estonia         Cyprus 
##      11.500212      10.334730      10.117525       9.355198       8.776073 
##     Luxembourg          Malta 
##       8.451361       8.355273

# Summary of the model
sumMod <- summary(mod)
sumMod
## 
## Call:
## lm(formula = Seats2011 ~ Population2010, data = EU)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7031 -1.9511  0.0139  1.9799  3.2898 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.910e+00  5.661e-01   13.97 2.58e-13 ***
## Population2010 1.078e-06  1.915e-08   56.31  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.289 on 25 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9922, Adjusted R-squared:  0.9919 
## F-statistic:  3171 on 1 and 25 DF,  p-value: < 2.2e-16

The following table contains a handy cheat sheet of equivalences between R code and some of the statistical concepts associated with linear regression.

| R | Statistical concept |
|---|---|
| x | Predictor \(X_1,\ldots,X_n\) |
| y | Response \(Y_1,\ldots,Y_n\) |
| data <- data.frame(x = x, y = y) | Sample \((X_1,Y_1),\ldots,(X_n,Y_n)\) |
| model <- lm(y ~ x, data = data) | Fitted linear model |
| model$coefficients | Fitted coefficients \(\hat\beta_0,\hat\beta_1\) |
| model$residuals | Residuals \(\hat\varepsilon_1,\ldots,\hat\varepsilon_n\) |
| model$fitted.values | Fitted values \(\hat Y_1,\ldots,\hat Y_n\) |
| model$df.residual | Degrees of freedom \(n-2\) |
| summaryModel <- summary(model) | Summary of the fitted linear model |
| summaryModel$sigma | Fitted residual standard deviation \(\hat\sigma\) |
| summaryModel$r.squared | Coefficient of determination \(R^2\) |
| summaryModel$fstatistic | \(F\)-test |
| anova(model) | ANOVA table |
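
As a rough illustration of the cheat sheet, the following sketch runs through each of its entries. It assumes the built-in cars dataset as a stand-in (chosen only so that the code is self-contained; it is not the EU dataset used in this section). The last three lines check numerically that the R objects agree with the statistical concepts they are paired with; each of them should return TRUE.

```r
# Assumption: the built-in cars dataset replaces the EU data for illustration
x <- cars$speed                       # Predictor
y <- cars$dist                        # Response
data <- data.frame(x = x, y = y)      # Sample
model <- lm(y ~ x, data = data)       # Fitted linear model

model$coefficients                    # Fitted intercept and slope
head(model$residuals)                 # First residuals
head(model$fitted.values)             # First fitted values
model$df.residual                     # Degrees of freedom, n - 2

summaryModel <- summary(model)        # Summary of the fitted linear model
summaryModel$sigma                    # Fitted residual standard deviation
summaryModel$r.squared                # Coefficient of determination
summaryModel$fstatistic               # F-test
anova(model)                          # ANOVA table

# Consistency checks between the code and the concepts
# Residuals are the responses minus the fitted values
all.equal(unname(model$residuals), y - unname(model$fitted.values))
# Degrees of freedom are n - 2
model$df.residual == length(y) - 2
# sigma is the square root of the residual sum of squares over n - 2
all.equal(summaryModel$sigma, sqrt(sum(model$residuals^2) / model$df.residual))
```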

Do the following:

  • Compute the regression of CamCom2011 on Population2010. Save that model as the variable myModel.
  • Access the objects residuals and coefficients of myModel.
  • Compute the summary of myModel and store it as the variable summaryMyModel.
  • Access the object sigma of summaryMyModel.
  • Repeat the previous steps changing the names of myModel and summaryMyModel to otherMod and infoOtherMod, respectively.

Now you know how to fit and summarize a linear model with a few keystrokes. Let’s see more of the basics of R, which will be useful for the next sections.

2.2.2 Simple computations

# These are some simple operations
# The console can act as a simple calculator
1.0 + 1.1
## [1] 2.1
2 * 2
## [1] 4
3/2
## [1] 1.5
2^3
## [1] 8
1/0
## [1] Inf
0/0
## [1] NaN

# Use ; for performing several operations in the same line
(1 + 3) * 2 - 1; 1 + 3 * 2 - 1
## [1] 7
## [1] 6

# Mathematical functions
sqrt(2); 2^0.5
## [1] 1.414214
## [1] 1.414214
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
exp(1)
## [1] 2.718282
log(10); log10(10); log2(10)
## [1] 2.302585
## [1] 1
## [1] 3.321928
sin(pi); cos(0); asin(0)
## [1] 1.224647e-16
## [1] 1
## [1] 0
# Remember to complete the expressions
1 +
(1 + 3
## Error: <text>:4:0: unexpected end of input
## 2: 1 +
## 3: (1 + 3
##   ^
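
Note that sin(pi) returned 1.224647e-16 rather than exactly 0. Numbers are stored with finite precision, so tiny rounding errors like this one are to be expected. A minimal sketch of how to compare numbers up to a tolerance instead of exactly, using the base R function all.equal:

```r
# Exact comparisons can fail because of rounding errors
sin(pi) == 0
## [1] FALSE
0.1 + 0.2 == 0.3
## [1] FALSE

# all.equal compares up to a small numerical tolerance
isTRUE(all.equal(sin(pi), 0))
## [1] TRUE
isTRUE(all.equal(0.1 + 0.2, 0.3))
## [1] TRUE
```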

2.2.3 Variables and assignment

# Any operation that you perform in R can be stored in a variable (or object)
# with the assignment operator "<-"
a <- 1

# To see the value of a variable, we simply type it
a
## [1] 1

# A variable can be overwritten
a <- 1 + 1

# Now the value of a is 2 and not 1, as before
a
## [1] 2

# Careful with capitalization
A
## Error in eval(expr, envir, enclos): object 'A' not found

# Different
A <- 3
a; A
## [1] 2
## [1] 3

# The variables are stored in your workspace (.RData file)
# A handy tip to see what variables are in the workspace
ls()
##  [1] "a"                  "A"                  "appCamComNoGer2011"
##  [4] "appEU2011"          "appEUNoGer2011"     "appUS"             
##  [7] "EU"                 "mod"                "pisa"              
## [10] "pisaLinearModel"    "sumMod"             "US"
# Now you know which variables can be accessed!

# Remove variables
rm(a)
a
## Error in eval(expr, envir, enclos): object 'a' not found
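
If you are unsure whether a variable is still in the workspace, you can check before using it and so avoid the error above. A small sketch with the base R function exists; the variable name tmpValue is hypothetical and chosen only for illustration:

```r
# tmpValue is a hypothetical variable used only for this example
tmpValue <- 10
exists("tmpValue")
## [1] TRUE

# After removing it, the check fails without raising an error
rm(tmpValue)
exists("tmpValue")
## [1] FALSE
```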

Do the following:

  • Store \(-123\) in the variable b.
  • Get the log of the square of b. (Answer: 9.624369)
  • Remove variable b.

2.2.4 Vectors

# These are vectors - arrays of numbers
# We combine numbers with the function c
c(1, 3)
## [1] 1 3
c(1.5, 0, 5, -3.4)
## [1]  1.5  0.0  5.0 -3.4

# A handy way of creating sequences is the operator :
# Sequence from 1 to 5
1:5
## [1] 1 2 3 4 5

# Storing some vectors
myData <- c(1, 2)
myData2 <- c(-4.12, 0, 1.1, 1, 3, 4)
myData
## [1] 1 2
myData2
## [1] -4.12  0.00  1.10  1.00  3.00  4.00

# Entry-wise operations
myData + 1
## [1] 2 3
myData^2
## [1] 1 4

# If you want to access a position of a vector, use [position]
myData[1]
## [1] 1
myData2[6]
## [1] 4

# You also can change elements
myData[1] <- 0
myData
## [1] 0 2

# Think about what you want to access...
myData2[7]
## [1] NA
myData2[0]
## numeric(0)

# If you want to access all the elements except a position, use [-position]
myData2[-1]
## [1] 0.0 1.1 1.0 3.0 4.0
myData2[-2]
## [1] -4.12  1.10  1.00  3.00  4.00

# Also with vectors as indexes
myData2[1:2]
## [1] -4.12  0.00
myData2[myData]
## [1] 0

# And also
myData2[-c(1, 2)]
## [1] 1.1 1.0 3.0 4.0

# But do not mix positive and negative indexes!
myData2[c(-1, 2)]
## Error in myData2[c(-1, 2)]: only 0's may be mixed with negative subscripts
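
The indexing rules above can be summarized in one short sketch. The vectors v and idx are hypothetical and serve only as an illustration; the key point is that whatever is placed between the brackets is evaluated first, and the resulting numbers are then used as positions.

```r
# Hypothetical vectors, used only for illustration
v <- c(10, 20, 30, 40, 50)
idx <- c(1, 3)

v[idx]        # Positions 1 and 3
## [1] 10 30
v[idx + 1]    # The index idx + 1 is computed first: positions 2 and 4
## [1] 20 40
v[-idx]       # Everything except positions 1 and 3
## [1] 20 40 50
v[c(0, 2)]    # A zero index is silently dropped
## [1] 20
v[6]          # Positions beyond the length of the vector give NA
## [1] NA
```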

Do the following:

  • Create the vector \(x=(1, 7, 3, 4)\).
  • Create the vector \(y=(100, 99, 98, ..., 2, 1)\).
  • Compute \(x_2+y_4\) and \(\cos(x_3) + \sin(x_2) e^{-y_2}\). (Answers: 104, -0.9899925)
  • Set \(x_{2}=0\) and \(y_{2}=-1\). Recompute the previous expressions. (Answers: 97, -0.9899925)
  • Index \(y\) by \(x+1\) and store it as z. What is the output? (Answer: z is c(-1, 100, 97, 96))

2.2.5 Some functions

# Functions take arguments between parentheses and transform them into an output
sum(myData)
## [1] 2
prod(myData)
## [1] 0

# Summary of an object
summary(myData)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.5     1.0     1.0     1.5     2.0

# Length of the vector
length(myData)
## [1] 2

# Mean, variance, covariance, correlation and quantiles
mean(myData)
## [1] 1
var(myData)
## [1] 2
cov(myData, myData^2)
## [1] 4
cor(myData, myData * 2)
## [1] 1
quantile(myData)
##   0%  25%  50%  75% 100% 
##  0.0  0.5  1.0  1.5  2.0

# Maximum and minimum of vectors
min(myData)
## [1] 0
which.min(myData)
## [1] 1

# Usually the functions have several arguments, which are set by "argument = value"
# In this case, the second argument is a logical flag to indicate the kind of sorting
sort(myData) # If nothing is specified, decreasing = FALSE is assumed
## [1] 0 2
sort(myData, decreasing = TRUE)
## [1] 2 0

# Don't know what the arguments of a function are? Use args and help!
args(sort)
## function (x, decreasing = FALSE, ...) 
## NULL
?sort
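
Arguments can be passed by name or by position. The following sketch, which uses a hypothetical vector v, illustrates both styles with sort and quantile. Naming the arguments is usually the safer choice, since the call then does not depend on remembering their order.

```r
# Hypothetical vector, used only for illustration
v <- c(3, 1, 2, 5, 4)

# Argument set by name
sort(v, decreasing = TRUE)
## [1] 5 4 3 2 1

# Argument set by position: the second argument of sort is decreasing
sort(v, TRUE)
## [1] 5 4 3 2 1

# Named arguments can be given in any order
sort(decreasing = TRUE, x = v)
## [1] 5 4 3 2 1

# Other functions work in the same way: probs selects the quantile to compute
quantile(v, probs = 0.5)
## 50% 
##   3
```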

Do the following:

  • Compute the mean, median and variance of \(y\). (Answers: 49.5, 49.5, 843.6869)
  • Do the same for \(y+1\). What are the differences?
  • What is the maximum of \(y\)? Where is it placed?
  • Sort \(y\) increasingly and obtain the 5th and 76th positions. (Answer: c(4,75))
  • Compute the covariance between \(y\) and \(y\). Compute the variance of \(y\). Why do you get the same result?

2.2.6 Matrices, data frames and lists

# A matrix is an array of vectors
A <- matrix(1:4, nrow = 2, ncol = 2)
A
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# Another matrix
B <- matrix(1, nrow = 2, ncol = 2, byrow = TRUE)
B
##      [,1] [,2]
## [1,]    1    1
## [2,]    1    1

# Binding by rows or columns
rbind(1:3, 4:6)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
cbind(1:3, 4:6)
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

# Entry-wise operations
A + 1
##      [,1] [,2]
## [1,]    2    4
## [2,]    3    5
A * B
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

# Accessing elements
A[2, 1] # Element (2, 1)
## [1] 2
A[1, ] # First row
## [1] 1 3
A[, 2] # Second column
## [1] 3 4

# A data frame is a matrix with column names
# Useful when you have multiple variables
myDf <- data.frame(var1 = 1:2, var2 = 3:4)
myDf
##   var1 var2
## 1    1    3
## 2    2    4

# You can change names
names(myDf) <- c("newname1", "newname2")
myDf
##   newname1 newname2
## 1        1        3
## 2        2        4

# The nice thing is that you can access variables by their names with the $ operator
myDf$newname1
## [1] 1 2

# And create new variables also (it has to be of the same
# length as the rest of variables)
myDf$myNewVariable <- c(0, 1)
myDf
##   newname1 newname2 myNewVariable
## 1        1        3             0
## 2        2        4             1

# A list is a collection of arbitrary variables
myList <- list(myData = myData, A = A, myDf = myDf)

# Access elements by names
myList$myData
## [1] 0 2
myList$A
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
myList$myDf
##   newname1 newname2 myNewVariable
## 1        1        3             0
## 2        2        4             1

# Reveal the structure of an object
str(myList)
## List of 3
##  $ myData: num [1:2] 0 2
##  $ A     : int [1:2, 1:2] 1 2 3 4
##  $ myDf  :'data.frame':  2 obs. of  3 variables:
##   ..$ newname1     : int [1:2] 1 2
##   ..$ newname2     : int [1:2] 3 4
##   ..$ myNewVariable: num [1:2] 0 1
str(myDf)
## 'data.frame':    2 obs. of  3 variables:
##  $ newname1     : int  1 2
##  $ newname2     : int  3 4
##  $ myNewVariable: num  0 1

# A less lengthy output
names(myList)
## [1] "myData" "A"      "myDf"
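
Lists are what makes the output of lm from Section 2.2.1 so convenient to handle: a fitted model is itself a list, and the $ operator used there is ordinary list access. The sketch below shows the equivalent double-bracket notation. The names exampleList and fit are hypothetical, and the built-in cars dataset is assumed as a stand-in for the EU data so that the code is self-contained.

```r
# Hypothetical list, used only for illustration
exampleList <- list(v = c(0, 2), M = matrix(1:4, nrow = 2, ncol = 2))

# Three equivalent ways of extracting the element M
exampleList$M
exampleList[["M"]]
exampleList[[2]]

# Single brackets return a list of length one, not the element itself
exampleList["M"]

# A fitted linear model is also a list
# (assumption: the built-in cars dataset replaces the EU data)
fit <- lm(dist ~ speed, data = cars)
is.list(fit)
## [1] TRUE
identical(fit$coefficients, fit[["coefficients"]])
## [1] TRUE
```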

Do the following:

  • Create a matrix called M with rows given by y[3:5], y[3:5]^2 and log(y[3:5]).

  • Create a data frame called myDataFrame with column names "y", "y2" and "logy" containing the vectors y[3:5], y[3:5]^2 and log(y[3:5]), respectively.

  • Create a list, called l, with entries for x and M. Access the elements by their names.

  • Compute the squares of myDataFrame and save the result as myDataFrame2.

  • Compute the log of the sum of myDataFrame and myDataFrame2. Answer:

        ##         y       y2     logy
        ## 1 9.180087 18.33997 3.242862
        ## 2 9.159678 18.29895 3.238784
        ## 3 9.139059 18.25750 3.234656