Chapter 1 R basics
In this chapter, users are introduced to a series of basic concepts in using R. There are many other, dedicated introductory books for this purpose (e.g. Verzani 2005; Dalgaard 2008; Zuur, Ieno, and Meesters 2009, among others). There is also a dedicated section containing freely available, contributed manuals on the CRAN (Comprehensive R Archive Network) website.
The purpose of this chapter is not to provide a complete and exhaustive guide to R, but rather to help readers understand at least those basic concepts that facilitate using the QCA package.
Chapter 2 will help even further in this aim, especially section 2.4, which shows how the graphical user interface can be used to interactively construct written commands. To begin with, however, it is highly recommended that users at least understand what R is all about and, perhaps most importantly, how it differs from the “usual” data analysis software.
Various object types will be briefly presented in section 1.3. Each and every type of object will become relevant in the subsequent chapters, therefore a thorough understanding of those types is necessary. There are many other books that describe those objects, and such content is outside the scope of this book. Users are invited to supplement their knowledge with external reading, whenever they feel this brief introduction does not sufficiently cover the topic.
When first opening an R console, the user is immediately confronted with a puzzle: there are no specific menus, and importing a simple dataset seems like an insurmountable task. Once the data is somehow imported, the most common questions are:
- Where is the data?
- How do we look at the data?
These are natural questions for users who have worked with data analysis software like SPSS or even Excel, where the data is there to be seen immediately after opening the file. As users make progress with their R knowledge, they realize the data doesn’t actually need to be seen. In fact, seeing all of it is impossible unless the dataset is small enough to fit on the screen. Large datasets (thousands of rows and hundreds of columns) are only partially displayed, and users usually scroll left and right, as well as up and down through the dataset. Furthermore, R can open multiple datasets at the same time, or can create temporary copies of some datasets, thus one cannot possibly “look” at everything.
When large datasets are opened, the human eye cannot possibly observe all potential problems in a dataset, many of which could remain undetected. For this purpose, R does a far better job using written commands to interactively query the data for potential problems.
Specific commands are used to obtain answers to questions such as: Are there any missing values in my data? If yes, where are they located? Are all values numeric? Are all values from a specific column in the interval between 0 and 1? Which are the rows where values are larger than a certain threshold? etc.
Even if the dataset is small and fits on the screen, it is still good practice to address questions like the above using written commands, and to refrain from trusting what we see on the screen. For example, there are situations when a certain column is encoded as character, even if all values look numeric: "1", "5", "3" etc. When printed on the screen, the quotes are not automatically displayed, therefore users naturally assume the variable is numeric. A better approach is to ask R whether a certain vector (column) is numeric or not, as will be shown in section 1.3.
Before delving deeper into these issues, the simplest possible command for a first interaction with R is:
the result being:
[1] 2
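The command itself is not reproduced above. A minimal sketch, assuming it was a plain addition (any expression that evaluates to 2 would print the same result):

```r
# Assumed first command: a simple addition typed at the R prompt
1 + 1
```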
R reads the command, inspects it for possible syntax mistakes and, if no errors occur, executes it and prints the result on the screen. This is how an interpreted language like R works: it first spends some time understanding what the text command is, then executes it. The big advantage of such an approach is interactivity, and an increased flexibility.
There are many details that, once understood and mastered, help users get past the steep learning curve. Among those details, perhaps some of the most important are the concepts of a) working directory, associated with data input and output, and b) working space, with the associated types of objects.
1.1 Working directory
One of the most challenging tasks for an absolute beginner is to get data in. Graphical user interfaces from the classical software make it look like a natural activity by offering a way to navigate through different directories and files using a mouse. In R, there are two ways to read a certain file: either by using its name, if it was previously saved in the so called “working directory”, or by providing the complete path with all directories and subdirectories, including the file name. More details will be provided in section 2.4.8.
The simplest is to save it in the working directory, which can be found using this command:
[1] "/Users/jsmith/"
There are differences between operating systems, for example in Windows one might see something like:
[1] "C:/Documents and Settings/jsmith"
A data file can be saved into this default directory, or alternatively the user can set the working directory to a new location where the file is found:
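The commands for querying and setting the working directory are not shown in the text; in base R they are presumably getwd() and setwd(). A minimal sketch, in which the session's temporary directory is only a stand-in for a real data directory:

```r
# Query the current working directory (the result depends on the machine)
old_dir <- getwd()
print(old_dir)

# Set the working directory to another location; tempdir() is used here
# only as a placeholder for a directory that actually contains the data
setwd(tempdir())
print(getwd())

# Restore the original working directory
setwd(old_dir)
```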
These two commands are very useful to select files and directories using commands, and as an alternative one can use the computer’s native file browser using the command file.choose(). Instead of specifying the complete path to a certain file, the import command (suppose reading from a CSV file) can be combined with the file browser:
This command is a combination of two functions, executed in a specific order. First, the file browser is activated by the function file.choose(), and once the user selects a specific CSV file by navigating to a specific directory, the path to that file is automatically constructed and served to the function read.csv() which performs the actual data reading.
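The combination described above can be sketched as follows. Because file.choose() opens a dialog, the interactive call is guarded; the small CSV file and its name are made up for illustration so that the example also runs non-interactively:

```r
# Create a small CSV file in a temporary location (hypothetical file name)
csv_path <- file.path(tempdir(), "example.csv")
write.csv(data.frame(A = 1:3, B = c("x", "y", "z")), csv_path,
          row.names = FALSE)

# Non-interactive variant: the path is provided explicitly
read.csv(csv_path)

# Interactive variant: the file browser supplies the path to read.csv()
if (interactive()) {
    read.csv(file.choose())
}
```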
For the moment, the result of this combination of commands is printed on the screen, but the usual method is to assign this result to a certain object, as will be shown in the next section.
1.2 Workspace
Every object, created or imported, resides in the computer’s memory (RAM). In R, almost everything is an object: data files, scalars, vectors, matrices, lists, even functions. For instance, creating an object that contains a single value (a scalar) can be done using the operator “<-” for the assignment:
A new object called scalar was created in the workspace, and it contains the value 3. The list of all objects currently in the workspace can be seen using this command:
[1] "datafile" "scalar"
Some objects are read from the hard drive, while others are created directly in the R environment. With sufficient RAM available, it really is not a problem to accommodate (very) many objects, even if some of them are large. It only becomes troublesome when the workspace gets polluted with many temporary objects, which makes finding the important ones more difficult.
Individual objects can be eliminated with:
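The commands for this part are not shown. A minimal sketch of the three steps described so far (create, list, remove), using the object name scalar from the text and standard base R functions:

```r
# Create an object containing a single value, using the assignment operator
scalar <- 3

# List the objects currently in the workspace
ls()

# Remove an individual object from the workspace
rm(scalar)
```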
Left unsaved, the workspace is lost when closing R. Apart from exporting datasets using write.table() (more details in section 2.4.8), individual objects can also be saved in binary format:
It is also possible to save the entire workspace in binary format, with all containing objects:
Cleaning up the entire workspace (removing all objects at once) can be done using this command:
With binary workspace images saved on the hard drive, the opposite operation is to bring them back into R when needed. There is a single function to load both types of binary files, a single object or an entire workspace, using:
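The saving, cleaning and loading steps described above can be sketched as one round trip. The functions are standard base R; the file names and the example values are hypothetical, and the files are written to the session's temporary directory:

```r
# Two example objects; their names follow the text, their values are made up
datafile <- data.frame(A = 1:3)
scalar <- 3

# Save an individual object in binary format
save(datafile, file = file.path(tempdir(), "datafile.Rdata"))

# Save the entire workspace, with all its objects
save.image(file = file.path(tempdir(), "workspace.Rdata"))

# Remove all objects at once
rm(list = ls())

# Load the binary files back; the same function handles both kinds of file
load(file.path(tempdir(), "datafile.Rdata"))
load(file.path(tempdir(), "workspace.Rdata"))
ls()
```

Note that rm(list = ls()) gives no warning and cannot be undone, so it is safest to run it only after the objects worth keeping have been saved.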
1.3 Object types
A vector is a collection of values of a certain type (called “mode”): numeric, character, logical etc. It is probably the most common type of object in R, making an important difference from other software. In many cases, the structure of the data is restricted to a rectangular shape with cases on rows and variables on columns but R is more flexible, allowing that type of file but also all kinds of other structures.
It is not always necessary to structure values in a rectangular shape. Sometimes we might be interested in the values of a (single) simple vector, for any possible reason including to just play with it and see what happens when applying different transformation commands.
Vectors are simple, but very powerful structures. Sometimes they contain values of their own, and some other times they contain indexes referring to the positions of certain values in other vectors or other objects. The simplest way to create a vector is to use the c() function:
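The creation command is not shown here. A minimal sketch, with the values taken from the printout that follows and the name nvector taken from the text:

```r
# Combine six values into a numeric vector
nvector <- c(1, 3, 5, 7, 9, 11)
```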
The name of the vector (just as the name of any other object) is not important, in this case chosen as nvector but users are free to experiment with any other. Invoking the name will print its content on the screen:
[1] 1 3 5 7 9 11
The same values can be obtained using some other predefined functions, such as seq(), which here generates a sequence of every second value between 1 and 11:
[1] 1 3 5 7 9 11
As stated before, the most common types of vectors are numeric, logical, character or factor. The above example is numeric, and creating character vectors is just as simple:
[1] "C" "C" "B" "A" "A"
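The two printouts above can be reproduced along these lines. The exact arguments are assumptions, since the original commands are not shown, and the name cvector is hypothetical:

```r
# Every second value from 1 to 11
seq(1, 11, by = 2)

# A character vector, created with c() just like the numeric one
cvector <- c("C", "C", "B", "A", "A")
cvector
```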
Factors are a special type of vector, used to define categorical variables. The categories are called “levels” in R, and they can be either unordered (nominal variables) or ordered (ordinal variables).
[1] C C B A A
Levels: A B C
Unless otherwise specified, the levels are printed in alphabetical order but irrespective of their arrangement this kind of factor still defines a nominal variable:
[1] C C B A A
Levels: C A B
To define an ordinal variable, there is an argument called ordered that accepts a logical value:
[1] C C B A A
Levels: A < B < C
The difference between this object and the previous one is given by the “<” sign between levels. If not otherwise specified, the levels are still ordered alphabetically by default, but a preferred order can be specified:
[1] C C B A A
Levels: B < A < C
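The four factor printouts in this part can be reproduced with the base function factor(). This is a sketch: the object names cvector and fvector are hypothetical, and the arguments are inferred from the printed levels:

```r
cvector <- c("C", "C", "B", "A", "A")

# Unordered factor; levels default to alphabetical order
fvector <- factor(cvector)
fvector

# Same nominal variable, with a custom arrangement of the levels
factor(cvector, levels = c("C", "A", "B"))

# Ordered factor (ordinal variable); levels still alphabetical by default
factor(cvector, ordered = TRUE)

# Ordered factor with a preferred order of the levels
factor(cvector, ordered = TRUE, levels = c("B", "A", "C"))
```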
Matrices are another useful type of object, having two dimensions: a number of rows, and a number of columns:
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 3 7 11
One useful thing about matrices is the fact they can have names for columns, as well as for rows, and they can be assigned with:
COL1 COL2 COL3
ROW1 1 5 9
ROW2 3 7 11
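A sketch that reproduces the two matrix printouts above. The object name mobject is hypothetical, and the commands are inferred from the output rather than copied from the original:

```r
nvector <- c(1, 3, 5, 7, 9, 11)

# Arrange the six values in two rows; values are filled column by column
mobject <- matrix(nvector, nrow = 2)
mobject

# Assign names to the rows and to the columns
rownames(mobject) <- c("ROW1", "ROW2")
colnames(mobject) <- c("COL1", "COL2", "COL3")
mobject
```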
Perhaps the most important object type, used in almost all data analyses, is the data frame. Such an object has the same, familiar rectangular shape with cases on rows and variables on columns. In fact it is a list, with an additional restriction that all its components have the same length (just like all variables in a dataset have the same number of cases).
dfobject <- data.frame(A = c("urban", "urban", "rural", "rural"),
B = 12:15, C = 5:8, D = c("1", "2", "3", "4"))
rownames(dfobject) <- paste("C", 1:4, sep = "")
dfobject
A B C D
C1 urban 12 5 1
C2 urban 13 6 2
C3 rural 14 7 3
C4 rural 15 8 4
In other software, only the columns (variables) have names. In R the rows (cases) can also have names and, especially for QCA related analyses, it makes sense because each case is well known and studied, as opposed to large N quantitative analyses where cases are randomly selected from a larger population. On a moderately large number of cases, QCA works with aggregate entities (communities, countries, regions) where the case names are just as important as the columns, or causal conditions.
One very important thing to notice is the way column D is printed on the screen. It looks numeric, just as columns B and C, however it is clear from the command that it was written using characters like "1". Returning to the advice in the preamble of this chapter, we should not trust what we “see” on the screen. Instead, we can query the structure of the R objects:
chr [1:4] "1" "2" "3" "4"
It can be seen that column D is not numeric: it is stored as character (chr). In versions of R before 4.0.0, the function data.frame() automatically coerced character columns to factors, in which case the same query would report a factor. The same kind of query can be performed using the is.something() family:
[1] FALSE
This is also a very good way to verify if a seemingly numeric column is indeed numeric, and similar queries can be performed using is.character(), or is.factor(), or is.integer() etc.
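The two queries discussed above can be sketched as follows. The calls to str() and is.numeric() are inferred from the printed results; stringsAsFactors = FALSE is added here as an assumption, so that column D is kept as character in any version of R:

```r
# The data frame from the text, with character columns left as character
dfobject <- data.frame(A = c("urban", "urban", "rural", "rural"),
                       B = 12:15, C = 5:8, D = c("1", "2", "3", "4"),
                       stringsAsFactors = FALSE)
rownames(dfobject) <- paste("C", 1:4, sep = "")

# Query the structure of column D
str(dfobject$D)

# Ask directly whether the column is numeric, character or factor
is.numeric(dfobject$D)
is.character(dfobject$D)
is.factor(dfobject$D)
```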
1.4 Mathematical and logical operations
R is a vectorized language. This makes it not only very powerful, but also very easy to use. In other programming languages, to add the number 1 to a numerical vector, the commands have to explicitly iterate through each value of the vector and increase it by 1, something like:
In a vectorized language like R, this kind of iteration is redundant, because R can work with entire vectors at once:
[1] 2 4 6 8 10 12
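The contrast between the two approaches can be sketched as below. The loop is written in R only for comparison, and its exact form is an assumption, since the original iteration is not shown:

```r
nvector <- c(1, 3, 5, 7, 9, 11)

# Explicit iteration, as needed in non-vectorized languages
for (i in 1:6) {
    print(nvector[i] + 1)
}

# Vectorized equivalent: the whole vector is processed at once
nvector + 1
```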
R already knows the object in this mathematical operation is a vector, and it takes care of the iteration behind the scenes. The code is not only easier to write, but it is also a lot easier to read and understand. This is one of the many reasons why R became so popular: it is so easy to use that thousands of non-programmers have been able to quickly use it and contribute with more packages.
In the above iteration, there is a sequence operator “:” which can be easily overlooked. It generates the sequence of all numbers between 1 and 6:
[1] 1 2 3 4 5 6
R’s vectorized nature has even more advantages, because mathematical operations can be applied to two vectors at once:
[1] 2 5 6 9 10 13
The resulting vector seems strange, but reveals another very powerful feature in R called “recycling”: the values in the smaller vector (of length 2) are recycled until reaching the length of the longer vector. The values 1 and 2 are automatically reused three times and added to the next values of nvector.
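Both the sequence operator and the recycling rule can be checked directly. The second command is inferred from the printed result (1 and 2 added alternately to the values of nvector):

```r
nvector <- c(1, 3, 5, 7, 9, 11)

# The sequence operator generates all integers from 1 to 6
1:6

# Adding a vector of length 2: its values are recycled three times
nvector + 1:2
```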
Addition is one of the possible mathematical operations in R. There is also subtraction with “-”, multiplication with “*”, division with “/” and so on.
But there are also logical operations that, in combination with indexing (see next section), contribute to the dialogue between the user and the data. With very large datasets that are impossible to print on the screen, users have to interrogate the data:
Is there a value there? Is that value numeric? If yes, is it bigger than the other number? Is there any value equal to 3? Which is the first one?
These kinds of human interpretable questions can be “translated” into the R language, via logical operations applied either to entire objects or to specific values.
[1] TRUE
The above command asks R if the object nvector is numeric, which means that all its values are numbers. The result of a logical operation is TRUE if the answer is true, and FALSE if not. These kinds of operations can also be applied to each of the contained values:
[1] FALSE FALSE TRUE TRUE TRUE TRUE
There is one logical result for each value compared with 3, which means the result of this logical operation is a logical vector of the same length. The same kind of procedure can be applied for all kinds of human like questions, such as testing equality with 3:
[1] FALSE TRUE FALSE FALSE FALSE FALSE
Note the usage of the double equal “==” sign to test equality. This is necessary because in R, much like in many other programming languages, a single “=” is used to assign value(s) to an object, just like the “<-” operator.
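The three logical queries printed above can be sketched as follows, with the commands inferred from the results and from the surrounding description:

```r
nvector <- c(1, 3, 5, 7, 9, 11)

is.numeric(nvector)   # one answer for the entire object
nvector > 3           # one answer for each value
nvector == 3          # test of equality, also value by value
```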
The result is a logical vector containing 6 values, and now it is possible to ask which value is equal to 3:
[1] 2
Combining functions like which() with logical operators like “==” makes the communication with R possible and easy. Knowing the result of the command above is a vector, it is now possible to further introduce indexing in the command to ask even more complex questions like which is the first value bigger than 5, in this case being found on the fourth position:
[1] 4
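A sketch of the two which() queries, assuming the conditions named in the text (equal to 3, and bigger than 5):

```r
nvector <- c(1, 3, 5, 7, 9, 11)

# Position of the value equal to 3
which(nvector == 3)

# Positions of all values bigger than 5, keeping only the first one
which(nvector > 5)[1]
```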
1.5 Indexing and subsetting
Indexing is a powerful tool to inspect various parts of the data or to perform data modifications based on certain criteria. Understanding how indexing works is fundamental for data interrogation, extracting subsets of data, analysing the structure of the objects and generally it facilitates a communication method between R and the user.
R is a 1-based language, which means the first element of an object is indicated with the number 1. By contrast, there are 0-based languages where the first element is indicated with the number 0, while number 1 refers to the second element. This is useful information, especially for users previously exposed to other software.
Through indexing, it is possible to refer to specific values in different positions of an object. The second element of the numeric vector is 3, and that can be seen with:
[1] 3
Indexing is performed using the square brackets “[”, and the value(s) inside depend on the number of dimensions an object has. A vector has a single dimension, therefore a single number is needed to refer to a specific position. Matrices and data frames have two dimensions (rows and columns), hence two numbers are needed to indicate a specific cell at the intersection between a row and a column.
[1] 14
In the data frame object above, the value found in the third row of the second column is 14, and the positions inside the square brackets are separated by a comma. This is an important detail that defines a syntax rule which, if not properly used, can generate an error.
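The two indexing commands described so far can be sketched as below, using the objects defined earlier in the chapter (nvector with the values from its printout, and dfobject as created in section 1.3):

```r
nvector <- c(1, 3, 5, 7, 9, 11)
dfobject <- data.frame(A = c("urban", "urban", "rural", "rural"),
                       B = 12:15, C = 5:8, D = c("1", "2", "3", "4"))
rownames(dfobject) <- paste("C", 1:4, sep = "")

nvector[2]       # second element of the vector
dfobject[3, 2]   # third row, second column of the data frame
```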
The same type of indexing can be performed on matrices:
[1] 11
An interesting matrix property is the way it is stored in the memory. It looks like a two dimensional object (just like a data frame), but underneath the matrix is just a vector that only acts like it has two dimensions. The same value 11 can be obtained by indexing with the last, sixth value:
[1] 11
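A sketch of both ways of reaching the value 11, assuming a 2 x 3 matrix built from nvector (the name mobject is hypothetical):

```r
mobject <- matrix(c(1, 3, 5, 7, 9, 11), nrow = 2)

mobject[2, 3]   # second row, third column
mobject[6]      # sixth value of the underlying vector
```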
For data frames, columns can be selected (and indexed) via the second value in the square brackets. But the variables (on columns) can also be selected using the useful “$” operator before the variable name:
[1] "urban" "urban" "rural" "rural"
The command above selects a single column A from dfobject, and it is printed on the screen as a vector. The data.frame() function used to automatically convert any character variable to a factor (categorical variable) and print its levels. To prevent coercion, the logical argument stringsAsFactors needed to be set to FALSE when creating the data frame.
The vector can be further indexed via the usual method:
[1] "rural"
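The column selection and the further indexing shown above can be sketched as follows; stringsAsFactors = FALSE is set explicitly here only so the result is a character vector in any version of R:

```r
dfobject <- data.frame(A = c("urban", "urban", "rural", "rural"),
                       B = 12:15, C = 5:8, D = c("1", "2", "3", "4"),
                       stringsAsFactors = FALSE)

dfobject$A      # the entire column A, as a vector
dfobject$A[3]   # the third value of column A
```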
As mentioned, a data frame is a list with an additional constraint that all its components must have the same length. Lists can also be indexed using the double square brackets “[[”:
[1] 12 13 14 15
The result of the above command is a vector which can be further indexed using a sequence of square brackets, the example below selecting the second component (second variable column) and inside that, the third value:
[1] 14
Another feature of the indexing system in R is negative indexing. Positive indexing selects values from certain positions, as seen in the previous examples. Negative indexing, on the other hand, returns all values except those at the positions specified between the square brackets.
[1] 12 13 15
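The last three printouts can be reproduced along these lines. The commands are inferred from the results; in particular, the negative index is assumed to be applied to the second column:

```r
dfobject <- data.frame(A = c("urban", "urban", "rural", "rural"),
                       B = 12:15, C = 5:8, D = c("1", "2", "3", "4"))

dfobject[[2]]       # second component (column B) of the list
dfobject[[2]][3]    # third value of the second component
dfobject[[2]][-3]   # all values of that component except the third
```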
There are many other things about indexing, outside the scope of this section, but there is one more important feature that is worth presenting: the indexes can be single values, if a single value is to be selected from a certain position, but in other circumstances indexes can be vectors themselves (vectors of indexes).
This is an important observation that prepares other advanced concepts like subsetting, where vectors of certain positions are used to select certain observations from a dataset, or to apply certain calculations on some, or based on, certain observations. Indexing is present in every data manipulation in R, and it is well worth learning how to properly use it.
Subsetting is similar to indexing, but with the specific objective to obtain a “subset”, a certain part of a dataset that needs to be analyzed to answer specific research questions. It is sometimes called “filtering”, especially when applied to the rows of a dataset.
Subsetting can be performed using numerical vectors like in the examples above (with certain positions to keep or to eliminate), but it can also use logical vectors where any position that has a TRUE value will be preserved (or the other way round, any position with a value of FALSE is eliminated).
[1] 1 3 5 7 9 11
[1] 5 7 9 11
[1] 3 4 5 6
[1] 5 7 9 11
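One possible set of commands that yields the four printouts above, in the same order. The condition (values greater than 3) is an assumption inferred from the results:

```r
nvector <- c(1, 3, 5, 7, 9, 11)

nvector                       # the complete vector
nvector[nvector > 3]          # a logical vector used as index
which(nvector > 3)            # positions where the condition is TRUE
nvector[which(nvector > 3)]   # a numerical vector of positions used as index
```

Both forms of subsetting return the same values; the logical form is usually shorter to write, while the numerical form makes the selected positions explicit.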
It is especially useful for subsetting data frames, on both rows and columns. For columns it selects among their names (possibly with a specific pattern), and on rows it creates a subset of cases, in the example below those where the column C is greater than 6:
A B C D
C3 rural 14 7 3
C4 rural 15 8 4
Note how the column C was specified in conjunction with the name of the object and separated by a “$” sign, but otherwise it is possible to refer directly to the column name using the function with(), in this command adding the values from the columns B and C:
[1] 17 19 21 23
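The row filter and the with() call described above can be sketched as follows, with the commands inferred from the text and the printed results:

```r
dfobject <- data.frame(A = c("urban", "urban", "rural", "rural"),
                       B = 12:15, C = 5:8, D = c("1", "2", "3", "4"))
rownames(dfobject) <- paste("C", 1:4, sep = "")

# Rows where column C is greater than 6; the empty position after the
# comma means that all columns are kept
dfobject[dfobject$C > 6, ]

# Refer to the columns directly by name, adding B and C
with(dfobject, B + C)
```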
R has a base function called subset() that can be used on both vectors and rectangular objects like matrices and data frames. For a data frame, a possible command is:
A C D
C3 rural 7 3
C4 rural 8 4
The command should be self explanatory, obtaining a subset of dfobject keeping all rows where the column B has a value greater than 13, and from the columns selecting A, C and D.
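A sketch of such a command, following the description above (rows with B greater than 13, columns A, C and D) and using the documented subset and select arguments of subset():

```r
dfobject <- data.frame(A = c("urban", "urban", "rural", "rural"),
                       B = 12:15, C = 5:8, D = c("1", "2", "3", "4"))
rownames(dfobject) <- paste("C", 1:4, sep = "")

# Keep the rows where B > 13, and only the columns A, C and D
subset(dfobject, B > 13, select = c(A, C, D))
```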
1.6 Data coercion
The function is.numeric() verifies all values in a vector, but unlike most of the logical operations in section 1.4, it returns a single value indicating if the entire object is numeric or not.
This is possible because of a particular feature in R, that all values in a certain vector have to be of the same “mode”. If all values in a vector are numbers, such as those from nvector, the object is called numeric. But if a single value among all numbers is a character, the entire vector is represented as character.
Many people don’t realize that numbers can be represented as characters themselves, and most importantly the character "1" can be something different from the number 1 (especially if coerced to a factor). This conversion may happen if, by any chance, a true character is added to a numeric vector, or it replaces one of the values in the vector. At that moment, the entire vector, which previously was numeric, is transformed (coerced) to a character type.
All values in a vector have to be of the same type. One single value of a different type is sufficient to coerce the entire vector to the other type, but the coercion goes only one way: a single number in a character vector does not coerce that into numeric. That happens because all numbers can act like characters, while not all characters can act as numbers.
[1] "1" "2" "3" "a"
The object cnvector contains three numbers and one character, but due to the presence of the character, all numbers are now presented as characters. Removing the character leaves the vector with the same character type, even though all values are in fact numbers:
[1] "1" "2" "3"
Converting from one type to another is done using the “as.” family of functions, in this case with:
[1] 1 2 3
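The three printouts in this demonstration can be reproduced as below. The name cnvector comes from the text; the way the character is removed (negative indexing on the fourth position) is an assumption:

```r
# Three numbers and one character: the entire vector becomes character
cnvector <- c(1, 2, 3, "a")
cnvector

# Removing the character does not change the type of the vector
cnvector <- cnvector[-4]
cnvector

# Explicit conversion back to numeric
as.numeric(cnvector)
```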
All this demonstration reveals one of the most common situations in data analysis, especially for those users who want to “see” the data. Some objects are not always what we think they are, simply because we see them printed on the screen:
A B C D
C1 urban 12 5 1
C2 urban 13 6 2
C3 rural 14 7 3
C4 rural 15 8 4
At first sight, the fourth variable D seems to contain numbers just like B and C. However, its values only appear to be numbers, while in reality column D contains their character representation (e.g. "1", "2", "3" and "4"). This is an important observation for two reasons, with cascading effects over the next chapters:
- if at any data analysis stage we need to perform mathematical operations on a vector we think is numeric but in reality is not, an (unexpected) error will be thrown, and most novice users remain puzzled about how that error appeared “out of nowhere”
- as was shown, assuming that something is of a certain type just because we “see” it printed on the screen does not necessarily make it of that type; instead, it is by far a better idea to always check if a variable is indeed what we think it is
A special type of coercion happens between the logical and numerical vectors. In many languages, including R, logical vectors are treated as numerical binary vectors with two values: 0 as a replacement of FALSE, and 1 as a replacement of TRUE.
[1] 4
The reverse is also valid, meaning a numerical vector can also be interpreted as logical, where 0 means FALSE, any other number means TRUE, and the exclamation sign ! negates the entire vector:
[1] TRUE FALSE FALSE
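Both directions of this coercion can be sketched as follows. The commands are inferred from the two printouts; the values 0, 2 and 5 are made up for illustration, and any vector with a zero followed by two non-zero numbers gives the same result:

```r
nvector <- c(1, 3, 5, 7, 9, 11)

# Logical to numeric: TRUE counts as 1 and FALSE as 0, so the sum
# gives the number of values greater than 3
sum(nvector > 3)

# Numeric to logical: 0 is treated as FALSE, any other number as TRUE,
# and "!" negates each value
!c(0, 2, 5)
```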
1.7 User interfaces
There are various ways to work with R, and the starting window presented below is deceptively simple:
In fact, it is so basic that many first time users are puzzled: no fancy menus, and only a handful of buttons that are completely unrelated to data analysis. The starting window looks different in Windows and Linux, but the overall appearance is the same: a console, where users are expected to type commands in order to get things done.
There are very good reasons for which the base R window looks so simple. The best explanation is that a graphical user interface is completely unable to replace a command line. Writing efficient commands is an art, as there are literally thousands of different ways to obtain a certain result, via an infinitely large number of possible combinations of commands. Some are simple, some look complicated but are extremely efficient, it all depends on the user’s knowledge.
In contrast, a graphical user interface is necessarily restricted to exactly one possible way (among thousands) of obtaining a result. For simple activities such as inspecting the data, or basic data manipulations (creating or modifying variables etc.) it is relatively easy to provide a graphical user interface. But unless the work is highly standardized with exactly similar input and output, each research problem is unique and requires a custom solution.
Customizing solutions using written commands is a lot easier than using point and click graphical interfaces. In some cases, the interface is simply unable to answer very specific problems, because it has not been designed for those problems in the first place. This is the very reason for which, although R has existed for almost three decades, there are very few graphical user interfaces that are capable of a smooth interaction with the base R environment.
On the other hand, especially for those who have never used R before, the simple command line can be a deterrent. The newest interface that is currently most fashionable and widely recognized as user friendly is RStudio, presented in figure 1.2. It depends on a valid R installation, and it is built in a local web environment which solves the problem of differences between different operating systems.
Most importantly, RStudio manages to bring into a single interface not only the command console, but also an object browser, as well as a package installer, a plot window etc. It is more or less the closest possible user interface that balances the expectations of a point and click user with the inherently rich set of possibilities from the R environment.
One of the most useful features of RStudio is the History tab, where all previous commands in the Console are stored and can be inspected later. It is not shown in the figure, but it also has a modern syntax editor with bracket matching, different colors to highlight functions, comments and operators etc. This is extremely useful, especially for the possibility to quickly select the commands to be tested and immediately send them to the console. Although having a much better user experience, RStudio is also a command centered interface, for the same reasons stated above.
1.8 Working with script files
The upper left part of the RStudio window in figure 1.2 above is a text file with an extension “.R”, where specific commands can be written and saved for later reference. Commands can be selected and run in the R environment from the lower left part, either individually (specific rows) or the entire file with all commands at once.
These kinds of scripts, containing written commands, are the heart of the R environment. Not only because written commands are infinitely more flexible than a graphical user interface with precise boundaries, but also because scripts allow one of the most appreciated features of scientific research: replicability.
It is extremely difficult (if not impossible) to obtain exactly the same results using a graphical user interface, unless the exact sequence of clicks, selections and various other settings is somehow recorded, assuming the same input data is used. Users who click through a graphical dialog rarely think to document the exact sequence of operations, and simply report results as if all other users would know what the sequence is.
At the other end, commands can be saved in text files, so that anyone else could follow (simply by copy/paste) the commands and obtain exactly the same results. And syntax files have an even more important role for the user: they can also serve as documentation vehicles for particular ideas contained in different commands.
Any line which starts with a hash sign # is considered a comment (as readers might have already noticed in the previous examples), and anything on that line is not evaluated, therefore it can contain natural language about what the commands do. It is good practice to document the script files as much as possible, especially when beginning to use R. Not only does it make it easier for a different reader to understand the commands, but many times it makes it easier for the same user to understand the code some time later.
Just like data files (or R binary data files that usually have the extension “.Rdata”), syntax files can reside in a working directory, usually having the extension “.R” but none of these file extensions are mandatory. The commands inside can be run line by line, or in several batches of lines (using either RStudio or by simply copy and paste to the R console), but otherwise it is also possible to run all commands from the entire script file (possibly starting by reading the dataset, manipulating it, and finally obtaining the end results) using one single command in R:
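In base R, the function that runs an entire script file is source(). A minimal sketch, in which a two-line script is first written to a temporary file so the example is runnable (the file name and its content are hypothetical):

```r
# Write a small script to a temporary file
script_path <- file.path(tempdir(), "script.R")
writeLines(c("# add two numbers", "result <- 1 + 1"), script_path)

# Run all commands from the script file with a single command
source(script_path)
result
```

With a script saved in the working directory, the call reduces to source() with just the file name between quotes.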
Such script files are easy to exchange with peers and colleagues, and they work just the same under any operating system, making R a highly attractive and interoperable working environment for teams of people all over the world.
The new shinyapps, as well as the flexboards and probably even more spectacular R Notebooks, although requiring a more advanced R knowledge, are even more attractive ways of working with international teams by exchanging not only the script files but also their immediate, visible results.
RStudio has many more functionalities, for instance the package bookdown can be used to write articles and even books. This very book was written using this package.