Syllabus
Current as of
2024-05-13
Lecture
MW 12:00-1:00 (Location TBD)
Recitations
R: 10:15-11:15 | R: 12:00-1:00 | F: 10:15-11:15 | F: 12:00-1:00
Dr. Marc Trussler
Fox-Fels Hall 32 (3814 Walnut Street)
Office Hours: M 9-11am
TA: Dylan Radley
Fox-Fels Hall 35 (3814 Walnut Street)
Office Hours: M 3-4pm; W 10-11am and 3-4pm
Course Description
Understanding and interpreting large, quantitative data sets is increasingly central in social science and the business world. Whether one seeks to understand political communication, international trade, inter-group conflict, or a host of other issues, the availability of large quantities of digital data has revolutionized how questions are asked and answered. The ability to quickly and accurately find, collect, manage, and analyze data is now a fundamental skill for quantitative researchers. The answers to a range of important questions lie in publicly available data sets, whether they are election returns, survey results, journalists’ dispatches, or a range of other data types.
Becoming an effective Data Scientist requires two related, but distinct, skill sets: technical proficiency and theoretical knowledge of statistics. Most courses try to teach both at once. This course will instead focus primarily on the first: building your skills in data acquisition, management, and visualization. Leaving this course, students will be able to acquire, format, analyze, and visualize various types of data using the statistical programming language R.
A secondary learning goal of this class is to be able to write and talk about statistics in a concise and clear fashion. Being able to run the most complicated statistics in the world is unhelpful if you cannot explain (particularly to non-specialists) what you have found and why they should care. Too many high school and college classes emphasize long essays, when the primary skill you will need is to write short reports (or, let’s be honest, emails) to quickly communicate an idea or finding. In this class we will emphasize this type of writing.
While this course is not a statistics class, we will discuss (in non-technical terms) the fundamental nature of statistics, particularly the important concepts of uncertainty and causality. The expectation is that you take further courses to build on this knowledge. PSCI 3800 “Applied Data Science” and PSCI 1801 “Statistical Methods” are designed to be direct follow-ups to this course.
While no background in statistics, political science, or computer science is required, students are expected to be generally familiar with contemporary computing environments (e.g. know how to use a computer, download new software, find the path to saved files etc.) and have a willingness to learn a wide variety of data science tools. Instructions will follow on software to be installed prior to the first class.
Expectations and policies
Course Slack Channel
We will use Slack to communicate with the class. You will receive an invitation to join our channel shortly after the start of class. One of the better things to come out of the pandemic is the use of Slack for classroom communications. It is a really good tool for sending quick and informal messages to individual students or groups (or for you to message us). It also allows you to collaborate with other students in the class, and is a great place to get simple questions answered.
Because we will be making announcements via Slack, it is extremely important you get this set up.
Format/Attendance
The course will have two components: weekly lectures and a recitation.
The lectures will be in person. They will be more instructional/lecture based in format, though there is an expectation of some amount of participation and feedback. The lectures will not be recorded, though I will post R scripts and my notes.
The recitations will also be in person. You are required to register for one of the recitation sections. Attendance will not be taken, though you are highly encouraged to participate. The purpose of the recitations is to provide a smaller class format for you to ask questions, practice techniques, and debug code with the TA. The answers to problem sets will also be covered in these sessions.
Academic integrity
We expect all students to abide by the rules of the University and to follow the Code of Academic Integrity.
For Problem Sets: Collaboration on problem sets is permitted. Ultimately, however, the write-up and code that you turn in must be your own creation. Please write the names of any students you worked with at the top of each problem set. Note that while collaborating is an important aspect of learning, I would encourage each of you to spend a good deal of time trying to work through problems on your own before moving to work collaboratively. Collaboration with others is a good step when you are really, really stuck (or when you think you are making a stupid mistake in your code and need to see where it is). There is value in helping one another, but be aware of not knowing what you don’t know. The same is true about asking for help from one of the instructors. Of course we’re more than willing to lend a hand, but we’re not always going to be here. Learning how to troubleshoot problems on your own is an extremely valuable skill.
For Exams: Collaboration on the take home exam is cheating. Anyone caught collaborating (and I have caught many) will be immediately referred to the University’s disciplinary system.
Re-grading of assignments
All student work will be assessed using fair criteria that are uniform across the class. If, however, you are unsatisfied with the grade you received on a particular assignment (beyond simple clerical errors), you can request a re-grade using the following protocol. First, you may not send any grade complaints or requests for re-grades until at least 24 hours after the graded assignment was returned to you. After that, you must document your specific grievances in writing by submitting a PDF or Word document to the teaching staff. In this document you should explain exactly which parts of the assignment you believe were mis-graded, and provide documentation for why your answers were correct. We will then re-score the entire assignment (including portions for which you did not have grievances), and the new score will be the one you receive on the assignment (even if it is lower than your original score).
Late policy
Notwithstanding everything below: exceptions to all of these policies will be made for health reasons, extraordinary family circumstances, and religious holidays. The teaching staff are extremely reasonable and lenient, as long as you discuss potential issues with us before the deadline.
For problem sets: You are granted 5 “grace days” to use over the course of the semester when you need to turn problem sets in late. You can use at most 3 grace days on any given assignment. You do not have to ask to use these days. Grace days are counted in whole days, so if a problem set is turned in at 5:01pm the day it is due (i.e. 1 minute late) you will have used 1 grace day. If you turn the problem set in at 5:01pm the day after it is due (i.e. 24 hours and 1 minute late) you will have used 2 grace days, and so on. Choosing to not complete a problem set (see policy below) does not affect your grace days. Once you are out of grace days, subsequent late problem sets will be graded as incomplete.
The nature of the take home midterm and final paper does not allow for any extensions.
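As an unofficial illustration of the grace-day counting described above, here is a minimal Python sketch. The function name `grace_days_used` and the example deadline are hypothetical; the only rule it encodes is the one stated in the policy, that any part of a 24-hour period past the deadline counts as a whole day.

```python
import math
from datetime import datetime


def grace_days_used(deadline: datetime, submitted: datetime) -> int:
    """Whole grace days consumed by one submission (0 if on time)."""
    if submitted <= deadline:
        return 0
    seconds_late = (submitted - deadline).total_seconds()
    # Any started 24-hour period counts as a full grace day.
    return math.ceil(seconds_late / 86400)


# Hypothetical deadline, used only for illustration.
deadline = datetime(2024, 2, 7, 17, 0)
one_minute_late = grace_days_used(deadline, datetime(2024, 2, 7, 17, 1))
next_day_late = grace_days_used(deadline, datetime(2024, 2, 8, 17, 1))
print(one_minute_late, next_day_late)
```

The sketch only counts days for a single submission. The cap of 3 days per assignment and the semester total of 5 still apply on top of that count.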
Assessment and grading
The grading assessments are designed to test two learning goals: technical proficiency in R, and the ability to communicate clearly about statistics.
All problem sets and the midterm will be graded anonymously. Please turn in these assignments on Canvas with your student number, not your name.
-
Participation (5%)
This portion of your grade mixes two components:
Traditional participation including: asking and answering questions in lecture and in recitations, asking and answering questions on the course Slack, attending office hours, or working with teaching staff on your final paper and presentation.
The completion of weekly “check-in” quizzes on Canvas. These will be available each week, will only take a few minutes, and will be graded by completion (not correctness).
-
Problem sets (32%)
Four problem sets.
Completed using R Markdown. Submissions will include a knitted HTML file and the associated .Rmd file.
Scored out of 100. Having answers that strictly produce the “Correct” output from R will result in a grade of 90/100. 90+ grades are reserved for submissions that have all the correct answers, have code that is cleanly and effectively written, and have written explanations that clearly and concisely articulate the findings.
You are free to do as many of the problem sets as you like. If you do not complete a problem set, the percentage points for that assignment will be transferred to the midterm (for PS1 and PS2) or the final paper (for PS3 and PS4). For example, if you don’t complete PS2, the midterm would then be worth 36% of your final grade. If you don’t complete PS3 and PS4, the final paper would be worth 41% of your final grade.
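The weight transfer described above can be sketched as follows. This is an illustration in Python, not an official grade calculator. The names (`BASE_WEIGHTS`, `adjusted_weights`, and so on) are hypothetical, and the sketch assumes each of the four problem sets carries an equal 8 percentage points, which is what the 36% and 41% examples imply.

```python
# Base weights from this syllabus, in percentage points.
BASE_WEIGHTS = {
    "participation": 5,
    "problem_sets": 32,
    "midterm": 28,
    "presentation": 10,
    "final_paper": 25,
}
PS_WEIGHT = 8  # assumption: 32 points split equally across 4 problem sets

# Where the points of a skipped problem set are moved.
TRANSFER_TARGET = {
    "PS1": "midterm",
    "PS2": "midterm",
    "PS3": "final_paper",
    "PS4": "final_paper",
}


def adjusted_weights(skipped):
    """Return the grade weights after skipping the given problem sets."""
    weights = dict(BASE_WEIGHTS)
    for ps in set(skipped):  # set() so a repeated name is counted once
        weights["problem_sets"] -= PS_WEIGHT
        weights[TRANSFER_TARGET[ps]] += PS_WEIGHT
    return weights


print(adjusted_weights(["PS2"])["midterm"])
print(adjusted_weights(["PS3", "PS4"])["final_paper"])
```

Whatever combination is skipped, the weights still sum to 100, so the final grade stays on the same scale.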
-
Midterm (28%)
This will be an open-book, 24-hour take-home test. The test will open on March 11th at 1:00pm and close on March 15th at 11:59pm. You can select any 24-hour period within this window to do the test. The latest you can open the test and still have 24 hours to complete it is therefore March 14th at 11:59pm. You may not work with other students on this exam. It will take a form similar to the problem sets.
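To make the timing arithmetic above concrete, here is a small Python sketch. The constant and function names are hypothetical, the year 2024 is an assumption, and so is the behavior for late starts (the sketch assumes a test opened after the latest full start is simply cut off when the window closes).

```python
from datetime import datetime, timedelta

# Window from the syllabus; the year is an assumption for illustration.
WINDOW_OPENS = datetime(2024, 3, 11, 13, 0)
WINDOW_CLOSES = datetime(2024, 3, 15, 23, 59)
TEST_LENGTH = timedelta(hours=24)


def latest_full_start() -> datetime:
    """Latest start time that still leaves the full 24 hours."""
    return WINDOW_CLOSES - TEST_LENGTH


def time_available(start: datetime) -> timedelta:
    """Working time for a test opened at `start` (assumed cut off at close)."""
    if not (WINDOW_OPENS <= start <= WINDOW_CLOSES):
        raise ValueError("start time is outside the exam window")
    return min(TEST_LENGTH, WINDOW_CLOSES - start)


print(latest_full_start())
print(time_available(datetime(2024, 3, 15, 11, 59)))
```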
-
Final Presentation (10%)
Presentations will be about the same topic as your final paper (see below). These presentations will take place in week 12 during your usual recitation period. Each presentation will be no more than 3 minutes long (with a strict cutoff). You will present exactly one slide with one figure on it that you think best tells the “story” of your final paper. The goal is to walk the audience through why you have a question they want to know the answer to, and why you have the data to answer it. This format is commonly used in “Three Minute Thesis” competitions. An example of this format is posted on Canvas.
-
Final Paper (25%)
Due: May 1st. The final project of this course is to produce a short (fewer than 1000 words) data-journalism style blog post that makes use of data. For this project you will find your own data and use it to produce a series of figures and tables to support an argument suitable for a non-technical audience. This project brings together the two learning goals of this course: the technical ability to find, clean, and present data, and the ability to write about your findings in a clear and persuasive way. Accordingly, you will be graded on both the quality and rigor of your statistical findings and the presentation and writing of the piece. To emphasize: a major component of this project and of your grade is determined by how you write your results up. 1000 words is short for a final project. As such, I would highly encourage you to start work on this early. Part of the goal of the problem sets is to have you think a lot about how to present statistics in an approachable and non-technical way. Many undergrads spend 95% of their time writing and 5% of their time editing. (In your working life post-undergrad these two percentages will be almost exactly flipped!) Given the amount of time and the light word count, my expectation is that you meet with the teaching team to talk about your research question relatively early, and spend the majority of the time editing your work, not writing.
Grade scale
Letter grades at the conclusion of the class will be assigned using the following scale. I do not round grades. If your grade meets or exceeds one of the thresholds below, you will receive the corresponding letter grade.
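\[\begin{aligned}
97 \leq \text{Grade}: &\ \text{A+}\\
93 \leq \text{Grade} < 97: &\ \text{A}\\
90 \leq \text{Grade} < 93: &\ \text{A-}\\
87 \leq \text{Grade} < 90: &\ \text{B+}\\
83 \leq \text{Grade} < 87: &\ \text{B}\\
80 \leq \text{Grade} < 83: &\ \text{B-}\\
77 \leq \text{Grade} < 80: &\ \text{C+}\\
73 \leq \text{Grade} < 77: &\ \text{C}\\
70 \leq \text{Grade} < 73: &\ \text{C-}\\
67 \leq \text{Grade} < 70: &\ \text{D+}\\
63 \leq \text{Grade} < 67: &\ \text{D}\\
60 \leq \text{Grade} < 63: &\ \text{D-}\\
\text{Grade} < 60: &\ \text{F}
\end{aligned}\]
Computing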
The course will require students to have access to a personal computer in order to run the statistics software. If this is not possible, please consult with one of the instructors as soon as possible. Support to cover course costs is available through Student Financial Services.
We will use R in this class, which you can download for free at https://www.r-project.org/. R is completely open source and has an almost endless set of resources online. Virtually any data science job you could apply to nowadays will require some background in R programming.
While R is the language we will use, RStudio is a free program that makes it considerably easier to work with R. After installing R, install RStudio from https://www.rstudio.com. Please have both R and RStudio installed by the end of the first week of classes.
If you’re having trouble installing either program, there are more detailed installation instructions on the course Canvas page.
Textbook
The reading load for this course will be relatively light, with the expectation that your primary task outside of class hours will be working on problem sets and reviewing material. That being said, textbook chapters that supplement the lectures are included, and reading through them before lecture will be helpful.
We will be using two supplementary textbooks for this course. Both are available for free online through the library website.
Programming Skills for Data Science: Start Writing Code to Wrangle, Analyze, and Visualize Data with R by Michael Freeman and Joel Ross
The Book of R: A First Course in Programming and Statistics by Tilman M. Davies
Three additional books that I have found helpful in my development as a data scientist:
Data Analysis for Social Science: A Friendly and Practical Introduction. Elena Llaudet & Kosuke Imai
The Functional Art: An introduction to information graphics and visualization by Alberto Cairo
On Writing Well by William Zinsser
Class Schedule
Week 1: January 22 - January 24
What is Data Science? What is Data?
Leonard Mlodinow. The Drunkard’s Walk: How Randomness Rules Our Lives. (Excerpts on Canvas)
Week 5: February 19 - February 21
Cleaning and Reshaping
Freeman and Ross Chapter 12 (tidyr reshaping)
Week 6: February 26 - February 28
For/If
Davies Chapter 10
Problem Set 2 Due Wednesday 7pm.
February 27: Drop period ends
Week 8: March 11 - March 13
Review/Midterm
Wednesday class will be replaced with additional office hours.
First Midterm Exam period Monday 1:30pm to Friday 11:59pm.
Week 9: March 18 - March 20
Collecting and Merging Data
Freeman and Ross Chapter 11.5
March 22: Grade type change deadline.
Week 10: March 25 - March 27
Writing and Visualizing Our Findings
Zinsser. On Writing Well. (Excerpts on Canvas).
Badger et al. 2018. . NYT Upshot.
Problem Set 3 Due Wednesday 7pm.
Week 12: April 8 - April 10
Regression I
Freeman and Ross Chapter 16
Final Project Presentations During your Recitation Times.
Slides to be submitted by Wednesday 7pm.