This is just a personal opinion, but I recommend using R Markdown when you write up solutions that contain R code. R Markdown is an authoring framework that makes it easy to produce reproducible documents in various formats (e.g., PDF (LaTeX), HTML, Word, PowerPoint), and plenty of free official resources are available online:
There are also many useful extensions (e.g., bookdown, rticles, blogdown). BUT, you don't have to feel forced to use R Markdown for your submission. It's true that the initial learning (or switching) cost is not small, and I would rather have you focus on the course material than get distracted by an oft-frustrating learning process. Any format is fine (as far as I can tell what's written). But if you are going to do serious research using R in the future, R Markdown is definitely one of the best tools to use, if not the best.
Import the data. This sample data is a bit different from the original, so the original may need a small amount of cleaning.
mydata <- read.csv("sample.csv")
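If you are working with the original data instead, a quick inspection helps you spot what needs cleaning. A minimal sketch (just standard base R checks, nothing specific to this dataset):

```r
# Quick sanity checks before running any regressions
str(mydata)             # variable types and a preview of values
summary(mydata)         # ranges, means, and NA counts per column
colSums(is.na(mydata))  # where the missing values are
```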
Basically, there are two ways:
Here, let me show you how to do the latter. It's simple: just use the lm function.
fit1 <- lm(civic ~ female + yob, mydata)
fit2 <- lm(hawthorne ~ female + yob, mydata)
fit3 <- lm(self ~ female + yob, mydata)
fit4 <- lm(neighbor ~ female + yob, mydata)
stargazer::stargazer(fit1, fit2, fit3, fit4,
type = "html", # change to "latex" or "text"
title = "Table 1: Balance Check",
keep.stat = "N")
*Dependent variable:*

|              | civic    | hawthorne | self     | neighbor |
|--------------|----------|-----------|----------|----------|
|              | (1)      | (2)       | (3)      | (4)      |
| female       | -0.015   | 0.014     | -0.004   | -0.011   |
|              | (0.011)  | (0.011)   | (0.011)  | (0.011)  |
| yob          | -0.0003  | -0.001*   | -0.00001 | -0.0003  |
|              | (0.0004) | (0.0004)  | (0.0004) | (0.0004) |
| Constant     | 0.740    | 1.424*    | 0.123    | 0.657    |
|              | (0.776)  | (0.773)   | (0.769)  | (0.755)  |
| Observations | 3,000    | 3,000     | 3,000    | 3,000    |

Note: \*p<0.1; \*\*p<0.05; \*\*\*p<0.01
Be careful when you interpret the results. Usually, we “stargaze” (ah-oh!), but we don’t when we want to check the balance.2
Just calculate. R is really good at calculation, way better than I am. Just in case, you might find the following functions useful. Of course, if you prefer a more "modern" way to calculate (e.g., using the tidyverse), that's totally fine. But please don't do crazy things such as unnecessarily wrapping C/C++ or Python just to show off your computing ability ;-).
mean(mydata$voted) # average
subset(mydata, civic == 1) # subset data: subset(original data, conditions)
subset(mydata, civic == 1 | hawthorne == 1) # OR (logical disjunction)
subset(mydata, civic == 1 & hawthorne == 1) # AND (logical conjunction)
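For reference, here is a sketch of the "modern" equivalent using dplyr, assuming the same column names as above; either style is totally fine:

```r
library(dplyr)

# Turnout among those who received the "civic duty" mailing
mydata |>
  filter(civic == 1) |>            # same as subset(mydata, civic == 1)
  summarise(turnout = mean(voted)) # same as mean() on the subset
```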
> “People will try to scare you by challenging how you constructed your standard errors.”
> — Yujung Hwang / 황유정 (@YujungHwang3) July 3, 2021

I laughed out loud at this 🤣

> Who's a good model? #RStats #MachineLearning
> — #RStats Question A Day (@data_question) July 24, 2021
For the standard errors, we can use the vcovHC function in the sandwich package, which estimates a heteroskedasticity-consistent variance-covariance matrix from the regression results. Astute attendees must have noticed that vcovHC stands for Heteroskedasticity-Consistent Variance-COVariance matrix estimation.
result <- lm(voted ~ civic, mydata)
library(sandwich)
vcm <- vcovHC(result, type = "HC2") # estimate the variance-covariance matrix
v <- diag(vcm) # we only need the diagonal elements
seHC <- sqrt(v) # take the square root
stargazer::stargazer(result, type = "html",
se = list(seHC),
keep.stat = c("N"),
title = "Table 2: Linear Regression Result",
digits = 5)
*Dependent variable:*

|              | voted      |
|--------------|------------|
| civic        | -0.01869   |
|              | (0.02682)  |
| Constant     | 0.32408*** |
|              | (0.00907)  |
| Observations | 3,000      |

Note: \*p<0.1; \*\*p<0.05; \*\*\*p<0.01
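As a quick sanity check on the table above, the coeftest function from the lmtest package prints a coefficient table directly from any variance-covariance matrix you pass in. This is just an alternative display of the same HC2 standard errors, not a different estimator:

```r
library(lmtest)
library(sandwich)

# Coefficients, HC2 standard errors, t-statistics, and p-values in one call
coeftest(result, vcov = vcovHC(result, type = "HC2"))
```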
Just the same as (d) with more independent variables.
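For instance, the call would look like the following. Which covariates to include depends on the question; the set below is only an illustration using variables that appear earlier in this note:

```r
# Same outcome as before, now with additional controls
result_full <- lm(voted ~ civic + female + yob, mydata)
summary(result_full)
```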
Let me give you another example. Say you have data on individuals' annual wages and education. For the sake of simplicity, assume the education variable tells us only whether an individual went to college (but not graduate school) and whether they went to graduate school. Namely, it has three categories: high school diploma or lower, college degree, and graduate degree. If you are interested in the wage returns to graduate education, which of the following models is more appropriate (here, assume there is no concern about endogeneity)? Why?
Don’t confuse the two different baselines.
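To see what "baseline" means concretely, here is a self-contained sketch with simulated data (the variable names and effect sizes are made up). R drops the first factor level, so each dummy coefficient is measured relative to that baseline; changing the baseline with relevel changes the coefficients' meaning, not the model's fit:

```r
set.seed(1)
educ <- factor(sample(c("HS", "College", "Grad"), 500, replace = TRUE),
               levels = c("HS", "College", "Grad"))
wage <- 30 + 5 * (educ == "College") + 12 * (educ == "Grad") + rnorm(500, sd = 3)

coef(lm(wage ~ educ))   # baseline = HS: "educGrad" is Grad vs. high school

educ2 <- relevel(educ, ref = "College")
coef(lm(wage ~ educ2))  # baseline = College: "educ2Grad" is Grad vs. college
```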
My hypothesis is that many people attending this session will have a sandwich for lunch. Again, we can use the sandwich package, this time with the vcovCL function. For comparison, let us align three columns with different SEs.
se <- sqrt(diag(vcov(result))) # usual SEs
seCL <- sqrt(diag(vcovCL(result, cluster = ~ hh_id)))
stargazer::stargazer(result, result, result,
type = "html",
se=list(se, seHC, seCL),
keep.stat = c("N"),
title = "Table 3: Different SEs",
add.lines = list(c("SE type", 'Normal', 'HC2', "CL")),
digits = 5)
*Dependent variable:*

|              | voted      |            |            |
|--------------|------------|------------|------------|
|              | (1)        | (2)        | (3)        |
| civic        | -0.01869   | -0.01869   | -0.01869   |
|              | (0.02713)  | (0.02682)  | (0.02695)  |
| Constant     | 0.32408*** | 0.32408*** | 0.32408*** |
|              | (0.00905)  | (0.00907)  | (0.00908)  |
| SE type      | Normal     | HC2        | CL         |
| Observations | 3,000      | 3,000      | 3,000      |

Note: \*p<0.1; \*\*p<0.05; \*\*\*p<0.01
Observe that \(\text{"Standard" SEs} < \text{HC SEs} < \text{CL SEs}\) in this case. How could this possibly affect the conclusions you draw from statistical tests? How does it actually turn out?
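One way to make this concrete: since the coefficient estimate is identical across columns, only the SE changes, so a larger SE mechanically shrinks the t-statistic and inflates the p-value. A sketch using the objects computed above (se, seHC, seCL), with a normal approximation for the p-values:

```r
b   <- coef(result)[["civic"]]
ses <- c(Normal = se[["civic"]], HC2 = seHC[["civic"]], CL = seCL[["civic"]])

t_stats <- b / ses
p_vals  <- 2 * pnorm(-abs(t_stats))  # normal approximation
round(rbind(t = t_stats, p = p_vals), 4)
```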
Sometimes we see a different spelling of "heteroskedasticity" or "homoskedasticity": "heteroscedasticity" or "homoscedasticity". There used to be a controversy over which spelling is correct until McCulloch (1985) (in Econometrica!) pointed out that "-skedasticity" is correct in terms of etymology. The "-skedastic" part comes from the Greek for "scatter" (\(\sigma\kappa\varepsilon\delta\acute{\alpha}\nu\nu\upsilon\mu\iota\)), and in English the counterpart of the Greek letter "\(\kappa\)" is "k", so "-skedasticity" should be the right one. There is also a fascinating paper that studies how the different spellings have been used in the academic literature, counting frequencies with the Google Books Ngram Viewer (see Paloyo (2011)).
Graduate School of Economics, Waseda University, ritsu.kitagawa@fuji.waseda.jp↩
Here, we use the stargazer function from the stargazer package. This package is actually a little obsolete, but we use it anyway.↩