D Reporting with R and R Commander
A nice feature of R Commander is that it integrates seamlessly with R Markdown, which can create .html, .pdf and .docx reports directly from the outputs of R. Depending on the kind of report we want, we will need the following auxiliary software:
- .html. No extra software is required.
- .docx and .rtf. You must install Pandoc, a document converter. Download it here.
- .pdf (only recommended for experts). An installation of LaTeX, in addition to Pandoc, is needed. Download LaTeX here.
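Before generating a report, it can help to check from R which of these tools are already visible. The following is a minimal sketch, assuming the rmarkdown package may or may not be installed; the helper name check_report_tools is hypothetical, and looking for pdflatex on the PATH is only one possible way of detecting a LaTeX installation.

```r
# Hypothetical helper: report which auxiliary tools R can find
check_report_tools <- function() {
  # Guard the call in case the rmarkdown package itself is not installed
  has_rmarkdown <- requireNamespace("rmarkdown", quietly = TRUE)
  has_pandoc <- has_rmarkdown && isTRUE(rmarkdown::pandoc_available())
  # Assumption: a LaTeX installation exposes 'pdflatex' on the PATH
  has_latex <- unname(nzchar(Sys.which("pdflatex")))
  c(pandoc = has_pandoc, latex = has_latex)
}

# Named logical vector: TRUE means the tool was found
check_report_tools()
```

If pandoc is FALSE, the .docx and .rtf reports will likely fail; if latex is FALSE, the .pdf report will likely fail.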
The workflow is simple. Once you have done some statistical analysis, either by using R Commander’s menus or R code directly, you will end up with an R script, on the 'R Script' tab, that contains all the commands you have run so far. Then switch to the 'R Markdown' tab and you will see the commands you have entered in a different layout, which essentially encapsulates the code into chunks delimited by ```{r} and ```. This will generate a report once you click on the 'Generate report' button.
Let’s illustrate this process through an example. Suppose we were analyzing the Boston dataset, as we did in Section 3.1.2. Ideally, our final script would be something like this:
# A simple and non-exhaustive analysis for the price of the houses in the Boston
# dataset. The purpose is to quantify, by means of a multiple linear model,
# the effect of 14 variables in the price of a house in the suburbs of Boston.
# Import data
library(MASS)
data(Boston)
# Make a multiple linear regression of medv in the rest of variables
mod <- lm(medv ~ ., data = Boston)
summary(mod)
# Check the linearity assumption
plot(mod, 1) # Clear non-linearity
# Let's consider the transformations given in Harrison and Rubinfeld (1978)
modTransf <- lm(I(log(medv * 1000)) ~ I(rm^2) + age + log(dis) +
                  log(rad) + tax + ptratio + I(black / 1000) +
                  I(log(lstat / 100)) + crim + zn + indus + chas +
                  I((10 * nox)^2), data = Boston)
summary(modTransf)
# The non-linearity is more subtle now
plot(modTransf, 1)
# Look for the best model in terms of the BIC
modTransfBIC <- stepwise(modTransf)
summary(modTransfBIC)
# Let's explore the most significant variables, to see if the model can be
# reduced drastically in complexity
mod3D <- lm(I(log(medv * 1000)) ~ I(log(lstat / 100)) + crim, data = Boston)
summary(mod3D)
# With only 2 variables, we explain 72% of the variability.
# Compared with the 80% explained with 10 variables, this is an important
# improvement in terms of simplicity.
# Let's add these variables to the dataset, so we can call scatterplotMatrix
# and scatter3d through R Commander's menu
Boston$logMedv <- log(Boston$medv * 1000)
Boston$logLstat <- log(Boston$lstat / 100)
# Visualize the pair-by-pair relations of the response and two predictors
scatterplotMatrix(~ crim + logLstat + logMedv, reg.line = lm, smooth = FALSE,
                  spread = FALSE, span = 0.5, ellipse = FALSE,
                  levels = c(.5, .9), id.n = 0, diagonal = 'histogram',
                  data = Boston)
# Visualize the full relation between the response and the two predictors
scatter3d(logMedv ~ crim + logLstat, data = Boston, fit = "linear",
          residuals = TRUE, bg = "white", axis.scales = TRUE, grid = TRUE,
          ellipsoid = FALSE)
This contains all the major points of the analysis, which can now be expanded and detailed. You can download the script here, open it through 'File' -> 'Open script file...' and run it yourself in R Commander. If you do so, and then switch to the 'R Markdown' tab, you will see this:
---
title: "Replace with Main Title"
author: "Your Name"
date: "AUTOMATIC"
---
```{r, echo = FALSE, message = FALSE}
# Include this code chunk as-is to set options
knitr::opts_chunk$set(comment=NA, prompt=TRUE)
library(Rcmdr)
library(car)
library(RcmdrMisc)
```
```{r, echo = FALSE}
# Include this code chunk as-is to enable 3D graphs
library(rgl)
knitr::knit_hooks$set(webgl = hook_webgl)
```
```{r}
# A simple and non-exhaustive analysis for the price of the houses in the Boston
```
```{r}
# dataset. The purpose is to quantify, by means of a multiple linear model,
```
```{r}
# the effect of 14 variables in the price of a house in the suburbs of Boston.
```
```{r}
# Import data
```
```{r}
library(MASS)
```
```{r}
data(Boston)
```
```{r}
# Make a multiple linear regression of medv in the rest of variables
```
```{r}
mod <- lm(medv ~ ., data = Boston)
```
```{r}
summary(mod)
```
[More outputs - omitted]
The complete, lengthy file can be downloaded here. This is an R Markdown file, which has the extension .Rmd. As you can see, by default, R Commander will generate a code chunk like
```{r}
code line
```
for each code line you run in R Commander. You will probably want to modify this crude report manually by merging chunks of code, removing comments, or adding more information between chunks of code. To do so, go to 'Edit' -> 'Edit Markdown document'. Here you can also remove unnecessary chunks of code resulting from mistakes or irrelevant analyses.
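The one-chunk-per-line layout, and the effect of merging adjacent chunks, can be imitated in a few lines of base R. This is only an illustrative sketch: the names chunk_fence, wrap_lines_in_chunks and merge_adjacent_chunks are hypothetical, and the code is not how R Commander builds its document internally. It also assumes that every line consisting only of a fence closes a chunk.

```r
# Three backticks, built programmatically to keep this listing readable
chunk_fence <- strrep("`", 3)

# Hypothetical helper: wrap every script line in its own chunk
wrap_lines_in_chunks <- function(script_lines) {
  open <- paste0(chunk_fence, "{r}")
  unlist(lapply(script_lines, function(line) c(open, line, chunk_fence)),
         use.names = FALSE)
}

# Hypothetical helper: merge chunks that directly follow each other
merge_adjacent_chunks <- function(rmd_lines) {
  open <- paste0(chunk_fence, "{r}")
  n <- length(rmd_lines)
  # A closing fence immediately followed by a plain opening fence is redundant
  close_then_open <- which(rmd_lines[-n] == chunk_fence & rmd_lines[-1] == open)
  drop <- c(close_then_open, close_then_open + 1)
  if (length(drop) == 0) rmd_lines else rmd_lines[-drop]
}

crude <- wrap_lines_in_chunks(c("library(MASS)", "data(Boston)"))
merged <- merge_adjacent_chunks(crude)
writeLines(merged)
```

Here the two single-line chunks collapse into one chunk holding both commands, which is the kind of manual clean-up described above.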
The following file (download) could be a final report. Pay attention to the numerous changes with respect to the previous one:
---
title: "What makes a house valuable?"
subtitle: "A reproducible analysis in the Boston suburbs"
author: "Outstanding student 1, Awesome student 2 and Great student 3"
date: "31/11/16"
---
```{r, echo = FALSE, message = FALSE, warning = FALSE}
# include this code chunk as-is to set options
knitr::opts_chunk$set(comment=NA, prompt=TRUE)
library(Rcmdr)
library(car)
library(RcmdrMisc)
```
```{r, echo = FALSE, message = FALSE, warning = FALSE}
# include this code chunk as-is to enable 3D graphs
library(rgl)
knitr::knit_hooks$set(webgl = hook_webgl)
```
This short report shows a simple and non-exhaustive analysis for the price of
the houses in the `Boston` dataset. The purpose is to quantify, by means of a
multiple linear model, the effect of 14 variables in the price of a house in
the suburbs of Boston.
We start by importing the data into R and considering a multiple linear
regression of `medv` (median house value) on the rest of the variables:
```{r}
# Import data
library(MASS)
data(Boston)
```
```{r}
mod <- lm(medv ~ ., data = Boston)
summary(mod)
```
The variables `indus` and `age` are non-significant in this model. Also,
although the adjusted R-squared is high, there seems to be a clear
non-linearity:
```{r}
plot(mod, 1)
```
In order to bypass the non-linearity, we are going to consider the
non-linear transformations given in Harrison and Rubinfeld (1978)
for both the response and the predictors:
```{r}
modTransf <- lm(I(log(medv * 1000)) ~ I(rm^2) + age + log(dis) +
                  log(rad) + tax + ptratio + I(black / 1000) +
                  I(log(lstat / 100)) + crim + zn + indus + chas +
                  I((10 * nox)^2), data = Boston)
summary(modTransf)
```
The adjusted R-squared is now higher and, what is more important, the
non-linearity now is more subtle (it is still not linear but closer
than before):
```{r}
plot(modTransf, 1)
```
However, `modTransf` has more non-significant variables. Let's see if
we can improve on the previous model by removing some of the
non-significant variables. To do so, we look for the best model in
terms of the Bayesian Information Criterion (BIC) by `stepwise`:
```{r}
modTransfBIC <- stepwise(modTransf, trace = 0)
summary(modTransfBIC)
```
The resulting model has a slightly higher adjusted R-squared than `modTransf`
with all the variables significant.
We explore the most significant variables to see if the model can be reduced
drastically in complexity.
```{r}
mod3D <- lm(I(log(medv * 1000)) ~ I(log(lstat / 100)) + crim, data = Boston)
summary(mod3D)
```
It turns out that **with only 2 variables, we explain 72% of the variability**.
Compared with the 80% explained with 10 variables, this is an important
improvement in terms of simplicity: the logarithm of `lstat` (percent of lower
status of the population) and `crim` (crime rate) alone explain 72% of the
variability in the house prices.
We add these variables to the dataset, so we can call `scatterplotMatrix` and
`scatter3d` through R Commander,
```{r}
Boston$logMedv <- log(Boston$medv * 1000)
Boston$logLstat <- log(Boston$lstat / 100)
```
and conclude with the visualization of:
1. the pair-by-pair relations of the response and the two predictors;
2. the full relation between the response and the two predictors.
```{r, warning = FALSE}
# 1
scatterplotMatrix(~ crim + logLstat + logMedv, reg.line = lm, smooth = FALSE,
                  spread = FALSE, span = 0.5, ellipse = FALSE,
                  levels = c(.5, .9), id.n = 0, diagonal = 'histogram',
                  data = Boston)
```
```{r, webgl = TRUE}
# 2
scatter3d(logMedv ~ crim + logLstat, data = Boston, fit = "linear",
          residuals = TRUE, bg = "white", axis.scales = TRUE, grid = TRUE,
          ellipsoid = FALSE)
```
When we click on 'Generate report' for the above R Markdown file, we should get the following output files:
- .html: visualize and download. Once it is produced, this file is difficult to modify, but very easy to distribute (anyone with a browser can see it).
- .docx: visualize and download. Easy to modify in a document processor like Microsoft Office. Easy to distribute.
- .rtf: download. Easy to modify in a document processor, not very elegant.
- .pdf: visualize and download. Elegant and easy to distribute, but hard to modify once it is produced.
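An .Rmd file can also be rendered from the R console with rmarkdown::render(), which is one way to produce several of these formats in a row. The following is a minimal sketch, assuming the rmarkdown package and the auxiliary software listed earlier are installed; the file name report.Rmd and the helper render_report are hypothetical, and this is not necessarily what the 'Generate report' button does internally.

```r
# Map each file extension discussed above to an rmarkdown output format
report_formats <- c(
  html = "html_document",
  docx = "word_document",
  rtf  = "rtf_document",
  pdf  = "pdf_document"
)

# Hypothetical helper: render 'rmd_file' once per requested extension
render_report <- function(rmd_file, extensions = c("html", "docx")) {
  unknown <- setdiff(extensions, names(report_formats))
  if (length(unknown) > 0) {
    stop("Unsupported extension(s): ", paste(unknown, collapse = ", "))
  }
  if (!requireNamespace("rmarkdown", quietly = TRUE)) {
    stop("The 'rmarkdown' package is required to render the report")
  }
  # render() returns the path of the file it creates
  vapply(extensions, function(ext) {
    rmarkdown::render(rmd_file, output_format = report_formats[[ext]],
                      quiet = TRUE)
  }, character(1))
}

# Usage (not run here): render_report("report.Rmd", c("html", "pdf"))
```

Requesting only the formats whose auxiliary software is installed avoids a failed render halfway through the list.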
For advanced users, there is a lot of information here on mastering R Markdown by using RStudio, a more advanced framework than R Commander.