Spark Social Science Manual
Research Programming, The Urban Institute
September 16, 2016
Chapter 1 Introduction
Most social scientists perform statistical analysis on personal computers or on servers dedicated to statistical processing. While these environments do provide effective research platforms for the majority of projects, their inherent constraints on data size and processing speed limit the ability of social scientists to perform empirical analysis on large data sets.
We begin this guide by discussing alternative frameworks for storing data and performing statistical analysis, when social scientists may benefit from these approaches, and the solution developed by Research Programming to address the massive data needs of policy researchers at The Urban Institute (UI).
We then provide an overview of the framework that supports the solution developed by Research Programming and best practices for utilizing this solution. While it is not necessary to understand the mechanisms behind the framework in detail to leverage the UI solution described below, a basic understanding will help researchers perform their analyses more efficiently.
1.2 Making distributed computing straightforward & cost-effective
The solution developed by Research Programming to address massive data needs relies on cloud-based, distributed computing rather than an on-site server. The Spark distributed computing platform stores data across a cluster of machines and distributes processing tasks across that same cluster.
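To give a concrete sense of this model, the sketch below uses PySpark, the Python interface to Spark (one of the language options compared later in this manual); the application name and partition count are illustrative assumptions.

```python
# Minimal PySpark sketch of the distributed model described above: the
# driver program coordinates the work while the computation itself runs
# in parallel on the cluster's worker machines.
from pyspark.sql import SparkSession

# The application name is an arbitrary placeholder.
spark = SparkSession.builder.appName("distributed-example").getOrCreate()

# parallelize() splits the data into partitions spread across the cluster;
# each worker sums its own partitions and the partial sums are combined.
rdd = spark.sparkContext.parallelize(range(1, 1000001), numSlices=100)
print(rdd.sum())  # 500000500000
```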
The cloud computing services hosted by AWS allow researchers to “spin up” and shut down groupings of virtual servers, called Elastic Compute Cloud (EC2) instances, as needed for analysis while permanently storing data in the AWS data storage infrastructure, S3. The instance types available to researchers are defined by combinations of memory, storage, and networking capacity,1 and researchers are encouraged to consult with Research Programming to identify which instance type is most appropriate for their project.
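For readers curious what spinning up such a cluster looks like programmatically, the sketch below uses the boto3 AWS SDK's EMR client. The cluster name, instance types and counts, log bucket, and IAM roles shown are placeholder assumptions rather than Urban Institute defaults; in practice, researchers should coordinate with Research Programming rather than launch clusters themselves.

```python
# Hypothetical sketch of launching a Spark cluster on AWS EMR with boto3.
# Every name, count, and bucket below is a placeholder assumption.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="example-spark-cluster",               # placeholder cluster name
    ReleaseLabel="emr-5.0.0",                   # EMR release that bundles Spark
    Applications=[{"Name": "Spark"}],           # install Spark during configuration
    LogUri="s3://example-bucket/emr-logs/",     # placeholder log location
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m4.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m4.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,    # keep the cluster up for interactive work
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # identifier used to monitor or shut down the cluster
```

When an analysis is finished, the same client's terminate_job_flows call shuts the cluster down so that computing charges stop while the data remain in S3.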
This cloud-based, distributed framework effectively removes any computing constraint on the size of data that researchers can store and analyze. Researchers can rent any number of machines from AWS and then use Spark, which coordinates tasks between machines, to implement their data analysis and manipulation. Data is stored in S3 and then distributed in memory across the cluster machines when tasks need to be performed. This can be scaled to a massive degree – the largest known cluster to have utilized Spark comprises 8,000 machines, and Spark has been shown to perform well when processing up to several petabytes (one petabyte is 1 million gigabytes) of data.2
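To make the split between permanent storage in S3 and in-memory processing on the cluster concrete, the sketch below (again PySpark; the bucket, file, and column names are hypothetical) reads a data set from S3, caches it in cluster memory, and computes a grouped summary.

```python
# Hypothetical PySpark sketch: the data set lives permanently in S3 and is
# pulled into memory across the cluster only while it is being analyzed.
# The bucket, file, and column names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-example").getOrCreate()

# Read a CSV file directly from S3; on an EMR cluster the s3:// scheme is
# handled by the cluster's built-in S3 connector.
df = spark.read.csv("s3://example-bucket/survey_data.csv",
                    header=True, inferSchema=True)

# cache() keeps the partitions in memory across the cluster, so repeated
# queries do not re-read the file from S3.
df.cache()

# The grouping and averaging run in parallel on the worker machines.
df.groupBy("state").mean("income").show()
```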
Spark is installed during AWS cluster configuration, so once a cluster has finished spinning up, researchers can immediately begin working with their data in the distributed environment, accessing Spark through a common programming language of their choice. We describe and compare the supported programming languages in subsequent sections of this manual.
1. A complete list of EC2 instance types can be found at https://aws.amazon.com/ec2/instance-types/.