Linear Regression in Stata and R
Winter 2022
1 Introduction
Source: XKCD
Welcome to your guide to learning linear regression in Stata and R. This website houses all the information you need to learn the basics of coding linear regression in Stata and R. It will not contain all the information taught in class, but it will allow you to bridge that knowledge into running linear regressions on your own.
The Stata labs on this website were adapted from materials by Ewurama Okai.
1.1 Labs
This is a 10-week course with 9 labs. Each lab will focus on some topic related to coding linear regression. By the end of the course, you should be able to run a linear regression project from start to finish with reproducible code.
Each lab will contain links to download script files (in .do or .r format), overviews of key concepts, and application questions.
Lab Topics
Note: All lab topics are tentative and subject to change.
- Lab 1: Data cleaning review & writing clean code
- Lab 2: Running a basic linear regression
- Lab 3: Testing the assumptions of linear regression
- Lab 4: Transforming variables & displaying results with margins
- Lab 5: Exporting tables & reproducible code
- Lab 6: Evaluating linearity & interactions
- Lab 7: Robust standard errors & multicollinearity
- Lab 8: Review & requested topics TBD
- Lab 9: Comparing models & running a project from start to finish
1.2 Finding Data
When selecting data, consider:
- The research question you would like to answer
- The model type you will be applying (linear regression in this class)
- The unit/level of analysis in the dataset (individual? school? district? state?)
- The main independent and dependent variables you want to analyze
- Other relevant variables to include in your model
Some places to find datasets:
- Inter-university Consortium for Political and Social Research
- National Center for Education Statistics
- UNData
- World Values Survey
- General Social Survey
- Princeton’s Office of Population Research Data Archive
- Harvard Dataverse
- U.S. Government’s Open Data
- Chicago Open Data
- COVID-19 Open Data Repository
1.3 Notes on Statistical Significance
Statistical significance is a yes/no test: did the result meet the significance threshold you set or not? A smaller p-value does not necessarily mean the association is more meaningful.
As social scientists, we need to pay more attention to whether something is socially or sociologically significant. We do this by paying attention to the interpretation of coefficients and effect size. This is especially important because it is relatively easy to get statistically significant results with large samples. In the world of “big data,” this will come up more and more.
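To make the point about large samples concrete, here is a minimal sketch in plain Python (standard library only). It relies on the identity that, in a simple linear regression with one predictor, the t statistic for the slope equals r·sqrt(n − 2) / sqrt(1 − r²), where r is the sample correlation between x and y. The correlation of 0.02, the sample sizes, and the function names are all hypothetical choices made for illustration, and the p-value uses a normal approximation rather than the exact t distribution.

```python
import math


def t_statistic(r: float, n: int) -> float:
    """t statistic for the slope in a one-predictor linear regression,
    given the sample correlation r between x and y and the sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def two_sided_p_normal(t: float) -> float:
    """Two-sided p-value from a normal approximation to the t distribution.
    This is an assumption made to avoid extra libraries; it is rough for small n."""
    return math.erfc(abs(t) / math.sqrt(2))


R = 0.02  # hypothetical correlation: x explains only 0.04% of the variance in y

# Hold the effect size fixed and change only the sample size.
for n in (100, 10_000, 1_000_000):
    t = t_statistic(R, n)
    p = two_sided_p_normal(t)
    print(f"n = {n:>9,}  t = {t:6.2f}  p = {p:.4f}  p < 0.05: {p < 0.05}")
```

The effect size is identical in every row; only n changes. The same tiny correlation is far from significant at n = 100 and clearly significant at n = 1,000,000, which is why the size and interpretation of the coefficient matter more than the p-value alone.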
Some questions to ask yourself when writing papers, from Bernardi, Chakhaia, and Leopold (2017):
- Do you avoid interpreting a statistically insignificant coefficient as evidence of no effect?
- Do you avoid using the adjective “significant” in an ambiguous way?
- Do you avoid justifying the inclusion of variables in your model on the basis of statistical significance of their estimates?
- Do you report coefficients in some useful and intelligible form that makes it easier to understand how large the effect is?
- Do you discuss the substantive significance of the model coefficients?