1.5 Datasets for the course
This section lists all the relevant datasets used in the course, each with a short description and a download link. To download a dataset, simply save its link as a file from your browser.
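Once saved, a file can be read into R with a reader that matches its extension. The following is a minimal sketch, not the course's own loading code. The helper readCourseData is hypothetical, the .txt branch assumes whitespace-separated columns with a header row (the actual layout of each file should be checked first), and the .xlsx branch assumes that the readxl package is installed. The demonstration uses a temporary toy file so that the sketch runs without downloading anything.

```r
# Hypothetical helper (not part of the course material): chooses a reader
# from the file extension and returns the loaded data.
readCourseData <- function(path, ...) {
  ext <- tolower(tools::file_ext(path))
  switch(ext,
    csv = read.csv(path, ...),
    # Assumption: whitespace-separated columns with a header row
    txt = read.table(path, header = TRUE, ...),
    xlsx = {
      if (!requireNamespace("readxl", quietly = TRUE)) {
        stop("Package 'readxl' is needed to read .xlsx files")
      }
      as.data.frame(readxl::read_excel(path, ...))
    },
    rdata = {
      # load() restores objects by name, so collect them in a named list
      env <- new.env()
      load(path, envir = env)
      mget(ls(env), envir = env)
    },
    stop("Unsupported extension: ", ext)
  )
}

# Toy file standing in for a downloaded dataset (values are made up)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(0.5, 1.2, 2.7)), tmp, row.names = FALSE)
demo <- readCourseData(tmp)
str(demo)
```

For a downloaded file, the call would be along the lines of readCourseData("pisa.csv"), with the path adjusted to wherever the file was saved.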
pisa.csv
(download). Contains 65 rows corresponding to the countries that took part in the PISA study. Each row has the variables Country, MathMean, MathShareLow, MathShareTop, ReadingMean, ScienceMean, GDPp, logGDPp and HighIncome. The variable logGDPp is the logarithm of GDPp, which is taken in order to avoid scale distortions.
US_apportionment.xlsx
(download). Contains the 50 US states entitled to representation in the US House of Representatives. The recorded variables are State, Population2010 and Seats2013–2023.
EU_apportionment.txt
(download). Contains 28 rows with the member states of the EU (Country), the number of seats assigned in different years (Seats2011, Seats2014), the Cambridge Compromise apportionment (CamCom2011) and the states' populations (Population2010, Population2013).
least-squares.RData
(download). Contains a single data.frame, named leastSquares, with 50 observations of the variables x, yLin, yQua and yExp. These are generated as \(X\sim\mathcal{N}(0,1)\), \(Y_\mathrm{lin}=-0.5+1.5X+\varepsilon\), \(Y_\mathrm{qua}=-0.5+1.5X^2+\varepsilon\) and \(Y_\mathrm{exp}=-0.5+1.5\cdot2^X+\varepsilon\), with \(\varepsilon\sim\mathcal{N}(0,0.5^2)\). The purpose of the dataset is to illustrate least squares fitting.
assumptions.RData
(download). Contains the data frame assumptions with 200 observations of the variables x1, …, x9 and y1, …, y9. The purpose of the dataset is to identify which of the regressions y1 ~ x1, …, y9 ~ x9 fulfill the assumptions of the linear model. The dataset moreAssumptions.RData (download) has the same structure.
cpus.txt
(download) and gpus.txt (download). The datasets contain 102 and 35 rows, respectively, of commercial CPUs and GPUs released from the first models up to the present. The variables in the datasets are Processor, Transistor count, Date of introduction, Manufacturer, Process and Area.
hap.txt
(download). Contains data for 20 advanced economies in the time period 1946–2009, measured for 31 variables. Among those, the variable dRGDP represents the real GDP growth (as a percentage) and debtgdp represents the percentage of public debt with respect to the GDP.
wine.csv
(download). The dataset is formed by the auction Price of 27 red Bordeaux vintages, five vintage descriptors (WinterRain, AGST, HarvestRain, Age, Year) and the population of France in the year of the vintage (FrancePop).
Boston.xlsx
(download). The dataset contains 14 variables describing 506 suburbs in Boston. Among those variables, medv is the median house value, rm is the average number of rooms per house and crim is the per capita crime rate. The full description is available in ?Boston.
assumptions3D.RData
(download). Contains the data frame assumptions3D with 200 observations of the variables x1.1, …, x1.8, x2.1, …, x2.8 and y.1, …, y.8. The purpose of the dataset is to identify which of the regressions y.1 ~ x1.1 + x2.1, …, y.8 ~ x1.8 + x2.8 fulfill the assumptions of the linear model.
challenger.txt
(download). Contains data for 23 Space-Shuttle launches. There are 8 variables. Among them: temp, the temperature in Celsius degrees at the time of launch, and fail.field and fail.nozzle, indicators of whether there were incidents in the O-rings of the field joints and nozzles of the solid rocket boosters.
eurojob.txt
(download). Contains data on employment in 26 European countries. There are 9 variables, giving the percentage of employment in 9 sectors: Agr (Agriculture), Min (Mining), Man (Manufacture), Pow (Power), Con (Construction), Ser (Services), Fin (Finance), Soc (Social) and Tra (Transport).
Chile.txt
(download). Contains data for 2700 respondents to a survey on voting intentions in the 1988 Chilean national plebiscite. There are 8 variables: region, population, sex, age, education, income, statusquo (scale of support for the status quo) and vote. The variable vote is a factor with levels A (abstention), N (against Pinochet), U (undecided) and Y (for Pinochet). Available in R through the package car and data(Chile).
USArrests.txt
(download). Arrest statistics for Assault, Murder and Rape in each of the 50 US states in 1973. The percent of the population living in urban areas, UrbanPop, is also given. Available in R through data(USArrests).
USJudgeRatings.txt
(download). Lawyers’ ratings of state judges in the US Superior Court. The dataset contains 43 observations of 12 variables measuring the performance of the judge when conducting a trial. Available in R through data(USJudgeRatings).
la-liga-2015-2016.xlsx
(download). Contains 19 performance metrics for the 20 football teams in La Liga 2015/2016.
pisaUS2009.csv
(download). Reading scores of 3663 US students in the PISA test, with 23 variables describing the student profile and family background.
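The generating model stated above for least-squares.RData can be sketched in R as follows. This is an illustration only, not the script that produced the course file. The seed and the name leastSquaresSim are assumptions, and so is the choice to reuse one error vector for the three responses, since the description writes the same \(\varepsilon\) in each formula without saying whether it is shared.

```r
set.seed(42)  # Arbitrary seed (an assumption), only for reproducibility
n <- 50       # Number of observations stated for leastSquares

x <- rnorm(n)               # X ~ N(0, 1)
eps <- rnorm(n, sd = 0.5)   # epsilon ~ N(0, 0.5^2); shared across responses (assumption)

# Simulated counterpart of the leastSquares data frame (hypothetical name)
leastSquaresSim <- data.frame(
  x = x,
  yLin = -0.5 + 1.5 * x + eps,
  yQua = -0.5 + 1.5 * x^2 + eps,
  yExp = -0.5 + 1.5 * 2^x + eps
)

# Least squares fit of the linear response; the estimates should lie
# near the true intercept -0.5 and slope 1.5, up to sampling error
fit <- lm(yLin ~ x, data = leastSquaresSim)
coef(fit)
```

Fitting yQua ~ x or yExp ~ x in the same way shows what a straight-line fit looks like when the true relation is not linear.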