1 Getting ready for econometrics
Abstract
This chapter reviews concepts about data and statistics that are important for econometrics. The central goal of econometrics, estimating the magnitude of a hypothesized relationship in a population, is discussed. The nature of datasets is reviewed. Simple summary statistics are defined, and commands for calculating summary statistics using R are introduced.
Keywords: Data, econometrics, inference, entity of observation, datasets, summary statistics, visualizations
1.1 Welcome to the study of econometrics
This book is an introductory textbook for courses in basic quantitative data analysis. The analysis is appropriate for students in Economics, Political Science, Development Studies, and Sociology. The emphasis is on regression analysis, where the analyst seeks to estimate the magnitude of a hypothesized causal relationship. Regression analysis might be used, for example, to estimate the likely effect, on future adult earnings, of enabling young people to have an additional year of education. Most people probably think the effect is positive, but is it likely large enough to justify the cost of the policies enacted or money spent?
In this chapter, we introduce many terms and concepts that you should be familiar with from previous classes in statistics and mathematics. The discussion is abbreviated because the purpose is to remind you of things that you should already know. If you feel that your knowledge of these terms and concepts is weak, then definitely take time to review. Online videos from Khan Academy (https://www.khanacademy.org/math/ap-statistics/) are an excellent way to review material in short 15-minute chunks.
This chapter also introduces you to R and RStudio, statistical computing software that is commonly used in academic, government, civil society, and corporate settings. R is an open source programming language and software environment for statistical computing and graphics. It is used by statisticians and “number crunchers” for data analysis and related applications. RStudio is a helpful interface for editing, running, and managing R code. The R computing environment provides the data management, data wrangling, analysis, and visualization tools needed for virtually all analysis. Appendix A and Appendix B cover the basics of installing the software and using it, and should be read carefully. You should install R and RStudio on a computer that you will use regularly, or establish an account in the cloud for using R online (see instructions in the appendix). In this chapter, we show you how to calculate summary statistics and how to do basic data visualization.
1.2 What is econometrics?
Econometrics is the term social scientists use to describe the application of statistical concepts and techniques to the analysis of social science data. It is analogous to terms such as biometrics (application to biological data), sabermetrics (application to sports data), and psychometrics (application to psychology and education data).
A broad statement of the goal of econometrics is to make inferences about a relationship that might exist in, or that might characterize, a population, based on data from a random sample drawn from that population. The key words in that sentence are population, random sample, relationship, and inference.
A population refers to a group of individuals, households, events, countries, or other units of observation. Population is usually used as a shorthand for a population of interest to the researcher. For any given research question, a researcher is interested in some subset of an entire population. A researcher might be interested in the population of California, or the working-age population of California, or the males of working-age in California, or the males of working-age in California who have not completed high school. A random sample is usually a subset of the population of interest chosen randomly from the population. The researcher hopes to analyze data from the sample, and use that analysis to infer attributes of the population. For example, a researcher might believe that males of working-age in California who did not complete high school are more prone to poor health outcomes than males who did complete high school. A sample of working-age males is drawn from the population, their characteristics are measured, and the differences in the average incidence of poor health then lead the researcher to make an inference about the population.
Econometrics is especially concerned with estimates of relationships, not just differences. An econometrician typically has in mind a model of how the social world works. For example, an econometrician might think that males without high school diplomas might systematically end up in jobs that do not offer health care insurance. Their lack of insurance through their employer then leads them to defer preventative health care. Underlying health conditions fester. Since treatment is usually less effective than prevention, they have more health problems. The econometrician is then interested in estimating how much of the difference in health outcomes is likely due to, or attributable to, the difference in health insurance. It may be that differential access to health insurance explains a lot of the health differences. It may be that differential access to health insurance explains very little of the health differences. (Perhaps people without insurance are more likely to be very careful to take preventative steps to make sure they do not have health problems!) Only careful analysis of data can help answer the question.
The questions econometricians ask are, more broadly, whether and how much an explanatory variable causes or affects an outcome. The first stage in econometrics is about determining a way to estimate the effect and carry out hypothesis tests. The method of ordinary least squares is a good starting point. The second stage in econometrics is a discussion of the credibility of estimates as causal effects, and what will be called the internal and external validity of estimates.
1.3 Data: Variables and observations
For econometric analysis, we use data organized into a dataset, which consists of information arranged in a structured fashion, such as a spreadsheet. In a typical dataset, each row represents information about a particular unit of observation or entity. For example, a row might represent an individual person, or a county, or a country, or a company. In a standard dataset, all of the rows refer to the same kind of entity. That is, each row is a person who is observed, or a county, or a country, or a company. The rows are not mixed, with some rows being individuals and others companies. Each column represents a particular variable, such as an individual’s earnings in dollars, or a county’s population. Arranged this way, in rows and columns, the data form a matrix. Each element of the matrix is a “cell” in a spreadsheet version of the data.
Table 1.1 displays a simple dataset, arranged in the same way it might appear in a spreadsheet. It consists of 6 rows (observations), one for each of 6 individuals identified by the variable ID. The number of observations in a dataset is often called the “sample size,” or just N for short. In this dataset, there are four variables: the ID variable, and variables for each individual’s age, education, and wage.
Table 1.1: A simple dataset

| ID | age | education | wage |
|----|-----|-----------|------|
| A  | 33  | 15        | 13   |
| B  | 35  | 11        | 15   |
| C  | 38  | 11        | 19   |
| D  | 32  | 14        | 22   |
| E  | 30  | 16        | 30   |
| F  | 30  | 17        | 33   |
To keep track of our datasets (or dataframes, as they are often called in R), we usually give each dataset a name. For this book, and for many data analysts, the common practice is to name generic data matrices as df.
Some of the values of the variables might be missing. For example, our dataframe df might look like Table 1.2.
Table 1.2: The same dataset with missing values

| ID | age | education | wage |
|----|-----|-----------|------|
| A  | 33  | 15        | 13   |
| B  | 35  | 11        | NA   |
| C  | 38  | NA        | 19   |
| D  | 32  | 14        | 22   |
| E  | 30  | 16        | NA   |
| F  | 30  | NA        | 32   |
Variables education and wage contain missing values. Missing values are sometimes written as NA, but datasets sometimes use special codes to denote missing values, such as -999. The idea is that -999 would never be a value observed for the variable for the entities being considered, so the data analyst knows that it means the observation of that variable for that entity is missing. Why code something as -999 instead of leaving the value blank? It is our understanding that when computers first started being used to analyze data, and data was entered using paper punch cards that were fed into a card-reader, there was no way to represent a value being missing other than to encode it with an unusual number, and numbers like -99 and -999 fit the bill.
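When a dataset uses a code like -999, a common first step is to recode it to NA so that R treats it as missing. Here is a minimal sketch, supposing the wage variable in our dataframe df used -999 as its missing-value code:

# Recode the special missing-value code -999 to NA
# (assumes wage has no other NA values yet)
df$wage[df$wage == -999] <- NA
# summary(df$wage) would now report the count of NA's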
Variables and data come in many varieties, and the terminology for the varieties is useful to know. Numeric variables can be continuous or discrete. A variable measuring the age in years of a person would be a discrete variable, taking on values 23, 27, or 65, for example. Year of birth would similarly be a discrete variable. A person’s weight, however, might be continuous. Depending on the accuracy of a digital scale, a measurement of 122.56735 might be observed. Would there be any need for such accuracy, of five decimal places, in the social sciences? Plainly not. A person’s weight at the fifth decimal place varies with many factors: Were they in a slightly warm room, and perspired a bit? Do they have a penny in their pocket? Did they get a haircut, or cut their fingernails? We would have to measure all of these other variables if the fifth digit of the decimal of the weight were relevant for something.
The discussion about weight reminds us that measurement of most things contains errors. With an analog scale (the old ones with a dial that pointed to numbers), your reading of weight depended on vantage point. Each time you looked you might “read” a different number. There usually was no reason to go past one decimal point, or even to go to any decimal points. You might say, “I weigh roughly 122 pounds.” Recognizing the prevalence of measurement error will be an important part of econometric practice.
Another term used in econometrics relevant to this discussion is order of magnitude. Usually one thinks of each decimal place, each multiple of 10, as an order of magnitude different. So 2.0 is an order of magnitude bigger than 0.2. 57.5 is (roughly) an order of magnitude bigger than 8.3. There is nothing precise about the usage, but you will likely encounter it in ordinary speech. In econometrics, the term comes up frequently when discussing the importance of the finding from data analysis. When someone remarks that, “The impact of the policy change was to increase hourly wages by 0.05%,” another person might respond, “That percent change is an order of magnitude that is not worth discussing further.”
A rule of thumb when presenting or discussing numeric results is to round them to the nearest relevant order of magnitude. When discussing annual incomes in the United States in 2023, for example, a person might say, “I earn $65,000.” They will not say, “I earn $62,320.57.” They know, from immersion in the culture, the right order of magnitude. Yet, one continues to see published econometric work that proudly declares to have calculated an average annual income of $78,903.7651, or similar nonsense. Avoid being a person who reports numbers with irrelevant decimal places.
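In R, this kind of rounding is easy with the base functions round() and signif(); a quick sketch:

# Round to a sensible order of magnitude
signif(78903.7651, 2)   # 79000: keep two significant digits
round(62320.57, -3)     # 62000: round to the nearest thousand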
Several special kinds of variables are commonly used in econometrics. One is called a binary or dummy variable, and takes on values 0 and 1. In a dummy variable, the value 1 stands for one category and the value 0 for another category. For example, in a dataset concerning restaurants, fast food restaurants might be coded as 1 and regular table service restaurants coded as 0. By convention, dummy variables are named as the value the variable takes on when it is equal to 1, so this variable might be called fastfood. That way the analyst knows right away that the value 1 corresponds to the fast food restaurants. A logical variable is a special kind of dummy variable indicating whether a condition is true or false. Most software automatically converts logical variables to 0 and 1 for numeric operations, with 1 being the true case.
A dummy variable is a particular instance of the more general categorical or factor variable, that has indicators for different categories. A categorical variable, for example, might be region = (N, S, N, S, W, E, E, E, S, W), where the N stands for the northern region, the W stands for the western region, and so on. Finally, another kind of variable is a date variable. Date variables have special formatting and need to be used carefully. Is 11/12/14 referring to 11 December 1914 or to 12 November 2014? Without more understanding of how the variable is formatted, we cannot know.
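To make these types concrete, here is a minimal R sketch (the values are hypothetical, except for region, which uses the example above):

# A dummy variable: 1 = fast food, 0 = table service
fastfood <- c(1, 0, 0, 1)
# A logical variable; TRUE counts as 1 in numeric operations, so sum() gives 3
open_late <- c(TRUE, FALSE, TRUE, TRUE)
sum(open_late)
# A categorical (factor) variable for region
region <- factor(c("N", "S", "N", "S", "W", "E", "E", "E", "S", "W"))
table(region)
# A date variable; the ISO format YYYY-MM-DD avoids ambiguity
d <- as.Date("2014-11-12")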
Datasets like those in Tables 1.1 and 1.2, where each row corresponds to an entity or unit of observation, are called cross-section datasets. The rows represent a cross-section of some population. This is the kind of dataset analyzed in this book.
We should mention that not all datasets are as simple as cross-section datasets. Two other kinds of datasets are common in econometrics. In a time series dataset, each row represents the values of the variables for a single entity in a specific time period. The time period might be a day, month, or year. In a cross-section time-series dataset, also called a panel dataset, the two are combined: each row contains the values of the variables for one entity observed in one time period, and there are many entities and many time periods. In the case of a panel dataset, two identifiers are needed, one for the entity and another for the time period. A panel dataset that tracks each specific unit (such as a given person) over multiple time periods is also called a longitudinal dataset. A panel dataset that has different entities observed at different times is sometimes called a repeated cross-section panel dataset.
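A minimal sketch of a panel dataframe in R (with hypothetical values) shows why two identifiers are needed:

# Each row is one entity in one year: the pair (id, year) identifies a row
panel <- data.frame(
  id   = c("A", "A", "B", "B"),
  year = c(2021, 2022, 2021, 2022),
  wage = c(13, 14, 15, 17)
)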
The analysis of time series is common in macroeconomics. Important questions in macroeconomics include how GDP per capita, inflation, and unemployment change over time. The business cycle is a key phenomenon to explain. The analysis of panel data has become much more common in econometrics, as the digitization of modern economic life means that data that contains repeat observations at different times of the same entities has become ubiquitous. Analyzing these different kinds of datasets requires more advanced statistical techniques than those developed in this book.
1.4 Summary statistics
Most students enrolled in a course in econometrics have already taken a class in basic statistics. Indeed, such classes are also increasingly offered in high school. In this section, we review basic statistics and show how to use R to calculate summary statistics.
1.4.1 Summation notation and rules
The summation operator is used extensively in statistics and econometrics. Summation notation is an efficient shorthand for sums of variables or expressions. Let \(X_i\) be the value of a variable \(X\) for observation i, where \(i = 1, 2, \ldots, n\) (there are n total observations). For example, \(X_i\) might be the population of state i, in which case \(n = 51\) (including Washington, DC). Using summation notation, the sum can be written: \[\sum_{i=1}^n X_i = \sum X_i = X_1 + X_2 + \ldots + X_n\]
Read this as “the sum from i equals 1, to i equals n, of X-sub-i.”
Here’s a specific example, using the variable age in the dataframe df (Table 1.1), denoting each element in \(age = (33, 35, 38, 32, 30, 30)\) as \(age_i\): \[\sum_{i=1}^6 age_i = 33 + 35 + 38 + 32 + 30 + 30 = 198\] Here are some rules for using the summation operator (the sum is from i=1 to n if not written out): \[\sum_{i=1}^n a = a + a + \ldots + a = na\] \[\sum(X_i + Y_i) = \sum X_i + \sum Y_i\] \[\sum aX_i = a \sum X_i\] Keep in mind the following to avoid common mistakes: \[\sum X_iY_i \neq X_i \sum Y_i\] \[\sum X_iY_i \neq \sum X_i \sum Y_i\] \[\sum X_i^2 \neq \left(\sum X_i\right)^2\] It is good to remember the following two summations (the sum is from i=1 to n), \[\sum 1 = n\] \[\sum 0 = 0\]

1.4.2 Calculating summary statistics for a variable
We can use the summation operator to calculate summary statistics for variables. Let us continue with the variable \(age\). We then have, \[mean(age) = \overline{age} = \frac{1}{6} \sum_{i=1}^6 age_i = \frac{198}{6} = 33\] where a bar over a variable, \(\overline{age}\), denotes the mean or average of the variable.
The variance of a variable is defined as the sum of squared deviations from the mean, divided by the number of observations minus one \((n-1)\). So we have, \[var(age) = \frac{1}{6-1} \sum_{i=1}^6 (age_i - \overline{age})^2 = \frac{0^2 + 2^2 + 5^2 + (-1)^2 + (-3)^2 + (-3)^2}{5} = 9.6\] The standard deviation of a variable is the square root of the variance, \[sd(age) = \sqrt{var(age)} = 3.1\] Often the variance of a variable is written as the square of the standard deviation. And so we have, generally, for a variable \(X\): \[s_X^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \overline{X})^2\] To standardize a variable, we subtract its mean and divide by its standard deviation. The standardized variable is sometimes called the z-score of the original variable: \[\text{standardized } X_i = \text{z-score}_i = \frac{X_i - \overline{X}}{s_X}\]
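These calculations are easy to verify in R, using the age values from Table 1.1; a minimal sketch:

# Verify the summary statistics for age by hand
age <- c(33, 35, 38, 32, 30, 30)
sum(age)                      # 198
mean(age)                     # 33
var(age)                      # 9.6 (denominator is n-1 = 5)
sd(age)                       # about 3.1
(age - mean(age)) / sd(age)   # z-scores; scale(age) gives the same values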
Now is also a good time to introduce two formulas that are common in descriptive statistics: covariance and correlation. Both are measures of association between two variables.
\[covariance(X, Y) = cov(X,Y) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i-\overline{X})(Y_i-\overline{Y})\]
\[correlation(X, Y) = \rho_{XY} = \frac{cov(X, Y)}{s_X \, s_Y}\]
The correlation coefficient takes the covariance and divides by the product of the standard deviations for each variable. Thus, the correlation coefficient is standardized and dimensionless, and varies from -1 to 1. A correlation coefficient of .87 would be referred to as a large positive correlation; a correlation coefficient of -.18 would be referred to as a relatively small negative correlation. There is no rule, of course, for what is meant by large or small; they are terms of art.
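Both measures are built into R; here is a minimal sketch using the age and wage columns from Table 1.1:

# Covariance and correlation between age and wage
age  <- c(33, 35, 38, 32, 30, 30)
wage <- c(13, 15, 19, 22, 30, 33)
cov(age, wage)                          # -17.2
cor(age, wage)                          # about -0.69
cov(age, wage) / (sd(age) * sd(wage))   # same as cor(age, wage)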
1.5 Summary statistics and data visualization in R
Let us use R to calculate some summary statistics, and to create data visualizations such as histograms, box plots, and scatter plots. You should read Appendix A which explains how to get started using R and RStudio, and Appendix B, which explains how to use R Markdown for generating reports. Being comfortable using the software is essential for success in econometrics, so we urge you to not procrastinate! We will, as in the appendices, read in the Kenya DHS 2022 dataset that is described in Appendix C. We will then write the commands to create a new variable.
Open a new script in RStudio. The script should have a section that loads packages, using the library() commands, and sets options. The first commands in your script, then, should be these, which you can type, or copy and paste into the script:
# Load the packages
# (must have been installed)
library(tidyverse)
library(modelsummary)
library(sandwich)
library(estimatr)
library(haven)
# turn off scientific notation except for big numbers, and round decimals
options(scipen = 9)
options(digits = 3)
# Read Kenya earnings data from a website
url <- "https://github.com/mkevane/econ42/raw/main/kenya_earnings.csv"
kenya <- read.csv(url)
Notice that when reading in the data we first assign the web address (the Uniform Resource Locator, or URL) to the object url, and then use the command kenya <- read.csv(url) to read the data addressed by the URL and assign it to the object kenya in the environment.
Let us create a new variable measuring the years of work experience for each individual. The dataset does not have such a variable. Ideally, this variable would measure the work experience directly, based on the individual’s experience, and not counting time when the individual was sick, or having a baby, or unemployed, or just taking a break from working. Such a measure might also add up part-time work experience. But such a variable is rarely available. Instead, we calculate the number of years that the person was potentially in the full-time work force. We do this by taking the person’s age and then subtracting the 6 years of early childhood and then subtracting the number of years in school. So a person who is 27 years old, and who completed 4 years of school, would have experience equal to 17. A person who was 50 years old, with 12 years of schooling, would have experience equal to 32.
Here is the line of code, which you can copy and paste into your script. Check, after running the command, that there is a new variable in the kenya dataframe.
# Create a variable for work experience
kenya$exper <- kenya$age - 6 - kenya$educ_yrs
Notice the syntax here, which uses the assignment operator <-, and the two-part address syntax for identifying and creating variables: the dataframe name, a dollar sign $, and then the variable name.
We can now create a table of summary statistics (often called a table of descriptive statistics), which usually includes the mean, standard deviation, and median for each variable. We can create such a table using the command datasummary in the package modelsummary, one of the packages that we will load in every script (Arel-Bundock 2022). Run the following code:
# Subset to observations with earnings of at most 1,000 USD
kenya_sub <- subset(kenya, earnings_usd <= 1000)
datasummary(All(kenya_sub) ~ Mean + SD + Median,
            data = kenya_sub,
            title = "Kenya earnings dataset")
Table 1.3: Kenya earnings dataset

|                       | Mean   | SD     | Median |
|-----------------------|--------|--------|--------|
| educ_yrs              | 9.54   | 3.95   | 10.00  |
| age                   | 32.78  | 9.16   | 32.00  |
| month_interview       | 4.24   | 1.38   | 4.00   |
| wealth_group          | 3.19   | 1.35   | 3.00   |
| female                | 0.57   | 0.50   | 1.00   |
| rural                 | 0.57   | 0.50   | 1.00   |
| earnings_usd          | 220.17 | 213.26 | 139.53 |
| years_lived_elsewhere | 28.07  | 10.38  | 27.00  |
| christian_main        | 0.57   | 0.50   | 1.00   |
| christian_evang       | 0.29   | 0.46   | 0.00   |
| muslim                | 0.08   | 0.27   | 0.00   |
| kikuyu                | 0.17   | 0.38   | 0.00   |
| exper                 | 17.24  | 10.58  | 16.00  |
Notice that we restrict, or subset, the dataset to only include observations with monthly earnings less than or equal to $1,000. That is, we exclude about 5% of the sample (about 1,000 observations) that had extremely high earnings (by Kenyan standards).
The results are in Table 1.3. The syntax All(kenya_sub) is what tells R to calculate summary statistics for all the numeric variables in the kenya_sub dataframe. You can add other summary statistics (see the modelsummary help online). The table will be visible in the viewer pane in RStudio (lower right window). You should run the code and confirm that the mean of monthly earnings is about 220 USD and that the median age is 32. We can see that the variable female has mean equal to 0.57; in other words, 57% of the surveyed individuals are female. The mean of a dummy variable is the same as the proportion of observations that have value equal to 1, and that proportion multiplied by 100 is a percent.
You can also type the following commands to verify the results in the table.
median(kenya$earnings_usd[kenya$earnings_usd<=1000])
## [1] 140
median(kenya$age[kenya$earnings_usd<=1000])
## [1] 32
mean(kenya$female[kenya$earnings_usd<=1000], na.rm=TRUE)
## [1] 0.567
The option na.rm=TRUE is needed in the mean command; the option tells R to ignore missing values. The mean of female is again calculated to be approximately 0.57.
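A quick illustration of why the option matters:

# A missing value makes the mean NA unless we remove it
mean(c(1, NA, 3))                # returns NA
mean(c(1, NA, 3), na.rm = TRUE)  # returns 2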
1.5.1 Visualization of key variables
Visualizing your data is an important first step in any analysis. By examining plots, we can often identify important relationships in the data as well as some potential sources of problems in making inferences about the broader population. For example, a plot might reveal an extreme outlier observation.
R is renowned for its graphics capabilities. There are many options for creating plots in R. We use, in this textbook, a command called ggplot, which is part of the ggplot2 package, which is included in the tidyverse package. We use the Kenya DHS 2022 data to generate histograms, box plots, and scatter plots.
A histogram is a chart of the distribution of a variable in the sample, with the height of each bar showing the number or percentage of observations in various ranges of values. Run the following line of code:
ggplot(data=subset(kenya, earnings_usd<=1000), aes(earnings_usd)) +
geom_histogram(breaks=seq(0, 1000, by = 50),
col="black", fill="gray", alpha = .5) +
labs(x="Earnings", y="Count") +
theme_bw()

Figure 1.1: Histogram of monthly earnings in Kenya DHS 2022 sample
We limit the range of earnings to be at most 1,000 with the option data=subset(kenya, earnings_usd<=1000). A plot, reproduced in Figure 1.1, should appear in the Session Management area (lower right). Click on the “Plots” tab if it does not appear. Notice that the distribution of monthly earnings is very skewed towards low earnings. People in Kenya are very poor, by global standards.
There are some things to note about the ggplot command:
• This command has the usual R structure: a command name, with additional information in parentheses.
• Inside the parentheses, the dataframe name you want to examine (kenya) comes first. (If you left out the data= part, would the histogram still run?) Then comes the aes() part, where aes stands for aesthetic, here interpreted as a mapping of variables to various parts of the plot. The variables to be used in the plot are listed here. In the histogram case, there is just one variable, earnings_usd.
• In the geometry option, we use geom_histogram, and then we have the option of how to group values (in bins of width 50) and to limit the histogram to a highest value of 1,000 (there are some outliers with very high earnings).
• Since we tell ggplot which dataset to use with the data= option, this allows us to name the variables without using the dataset as a prefix (e.g., earnings_usd instead of kenya$earnings_usd).
• Further options for formatting the graph are added with + signs.
• The title and axis label options create titles for the graph as a whole and for the x- and y-axes respectively. These titles should be in “quotes”.
• Our last option is a theme for how the graph is displayed. There are many themes supported by ggplot and you should try other themes to see what they look like. For example, try theme_classic().
Click on the Zoom button of the Plots window to see a larger, scalable version of your plot. Click on the Export tab. You can save your plot as a pdf file or as an image file, such as JPEG, or just copy it to your clipboard to paste into a document.
The code tells R to use the “aes” information and apply a particular set of “geometry” choices for a histogram; namely, have bins going from 0 to 1000 in intervals of 50, make the outline of each bar black, and have the inside of the bars be gray (change the alpha value to .8 and see what happens). If you want to make graphs look even nicer, visit the ggplot site at http://ggplot2.tidyverse.org/reference/.
There are other ways to make histograms. R comes bundled with a simpler histogram command, in base R. Just type and run:
hist(kenya$age[kenya$earnings_usd<=1000],
main="", xlab="Age")

Figure 1.2: Distribution of ages of adults interviewed in Kenya DHS 2022 sample
In the plot window you will see a histogram similar to the one generated by the ggplot command, but now for age. Figure 1.2 displays the histogram. The sample age ranges from 15 to 54, but most of the observations are between 20 and 50. The hist() function does not have as many options as ggplot, but is pretty mnemonic! Note that in the hist() command there is not a separate option for naming the dataframe, so the variable that you want to use to draw the histogram has to be properly addressed, by first indicating the dataframe, then the $ sign, then the variable name: kenya$age.
As mentioned earlier, one of the great features of R is that, as open source software, users are constantly coming up with packages to improve it. The package ggplot2 is one such user-written package. The plethora of packages can, however, be confusing. We recommend sticking to ggplot, and learning its syntax.
We are usually most interested in the relationship between variables, such as earnings and education. A scatter plot is a good way to visualize the relationship. Run the scatter plot command in the script to see the relationship between education and earnings. Notice we now use the option “geom_point()”. The size, color, and shape options can all be changed.
ggplot(data=subset(kenya,earnings_usd<1000),
aes(x=educ_yrs,y=earnings_usd)) +
geom_point(size = 2, col = "firebrick", shape = 1)+
labs(x="Years of education", y="Earnings")+
theme_bw()

Figure 1.3: Scatterplot of monthly earnings and education attainment in Kenya DHS 2022 sample, for people earning under $1000
R automatically scales the plot axes to the range of values for X and Y.
Does the plot in Figure 1.3 suggest that higher education (educ_yrs) is associated with higher earnings? There does seem to be a positive relationship. However, since education is an integer variable, many of the dots are “on top” of each other. It is hard to tell how many observations are in the lower earnings range, since they may all be piled up there. We will later see some techniques to improve the visualization here.
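As a quick preview, one such technique is to jitter the points and make them partly transparent, so that piles of overlapping observations become visible; a minimal sketch:

# Jitter the points horizontally and add transparency to reduce overplotting
ggplot(data=subset(kenya, earnings_usd<1000),
       aes(x=educ_yrs, y=earnings_usd)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.2, col = "firebrick") +
  labs(x="Years of education", y="Earnings") +
  theme_bw()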
A closing remark. It can be frustrating and exasperating that syntax exhibits so many different ways to code the same thing, but they have to be coded the right way in each syntax style. You may think to yourself, “Why wrong thing code not always right me think bad.” Yup.
Review terms and concepts:
• econometrics
• inference
• dataset or dataframe
• unit of observation or entity of observation
• types of variables (continuous, discrete, factor, categorical, date, binary, dummy, and logical)
• measurement error
• order of magnitude
• missing values (NA)
• standardized variable, or z-score
• cross-section, time-series, and panel datasets
• summation sign
• summary statistics (mean, variance, standard deviation)
• histogram
• scatterplot