Predicting stock prices has always been an intriguing and challenging task for financial analysts and investors. With the advent of modern computing tools and techniques, statistical models for stock price prediction have become increasingly popular. One such model is the ARIMA (AutoRegressive Integrated Moving Average) model, which is widely used in time series analysis to forecast future values from historical data. This project presents an R implementation of the ARIMA model to predict the NASDAQ Composite index. The ARIMA model is fitted to the historical data, and the predicted values are compared with the actual values to evaluate the model's accuracy. The study provides insight into the potential of ARIMA modeling for predicting the NASDAQ Composite and similar financial instruments.
# Packages used throughout: quantmod (getSymbols, chartSeries),
# tseries (adf.test), forecast (auto.arima, Arima, forecast)
library(quantmod)
library(tseries)
library(forecast)
df <- read.csv("data_nasdaq.csv")
head(df)
## Date Open High Low Close Adj.Close Volume
## 1 1971-02-05 100.00 100.00 100.00 100.00 100.00 0
## 2 1971-02-08 100.84 100.84 100.84 100.84 100.84 0
## 3 1971-02-09 100.76 100.76 100.76 100.76 100.76 0
## 4 1971-02-10 100.69 100.69 100.69 100.69 100.69 0
## 5 1971-02-11 101.45 101.45 101.45 101.45 101.45 0
## 6 1971-02-12 102.05 102.05 102.05 102.05 102.05 0
getSymbols("^IXIC",src="yahoo",from="2015-01-01",to = "2022-11-28")
## [1] "^IXIC"
chartSeries(IXIC,TA = NULL)
# Daily log returns in percent; [-1] drops the leading NA produced by diff()
IXIC_returns <- 100*diff(log(IXIC$IXIC.Close))[-1]
chartSeries(IXIC_returns)
The Augmented Dickey-Fuller (ADF) test is a statistical hypothesis test used to determine whether a time series is stationary. Stationarity, meaning that the statistical properties of the series do not change over time, is a critical assumption in time series analysis. In the context of stock price prediction with ARIMA, it is essential to ensure that the series is stationary before fitting the model, because ARIMA assumes the data can be modeled as a combination of autoregressive (AR) and moving average (MA) processes.
The ADF test determines whether the time series has a unit root, which would make it non-stationary. If the test's null hypothesis is rejected, the series can be treated as stationary and the ARIMA model can be applied to the data directly. If the null hypothesis is not rejected, the series is likely non-stationary, and the data must be transformed or differenced to achieve stationarity before fitting the model.
print(adf.test(IXIC_returns))
## Warning in adf.test(IXIC_returns): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: IXIC_returns
## Dickey-Fuller = -12.423, Lag order = 12, p-value = 0.01
## alternative hypothesis: stationary
At significance level α = 0.01 (the warning notes that the true p-value is even smaller than the printed 0.01), we reject the null hypothesis of a unit root and classify the returns series as stationary.
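Had the test failed to reject the null, the series would have needed differencing before fitting ARIMA. A minimal sketch of that step, shown purely for illustration (the name IXIC_diff is hypothetical):
# Hypothetical: difference the series once and re-test for stationarity
IXIC_diff <- na.omit(diff(IXIC_returns))
print(adf.test(IXIC_diff))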
The Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots are used to determine the appropriate order of the ARIMA model, which is essential for accurate stock price prediction.
# ACF and PACF of the returns, side by side
par(mfrow = c(1, 2))
acf(IXIC_returns)
pacf(IXIC_returns)
par(mfrow = c(1, 1))
The ACF and PACF plots guide the choice of model order: for an MA(q) process, the ACF cuts off after lag q while the PACF dampens exponentially, so we use the ACF to identify the MA order. Here, the significant spikes appear only at the first lags, which suggests that the higher-order autocorrelation is effectively explained by the first-lag autocorrelation.
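Based on such a reading of the plots, one could also fit a candidate model by hand before turning to automatic selection. A minimal sketch using forecast::Arima() (the order c(0, 0, 2) anticipates the model selected below; manual_fit is a hypothetical name):
# Hypothetical manual fit: MA(2) with a constant, suggested by the ACF/PACF reading
manual_fit <- Arima(IXIC_returns, order = c(0, 0, 2), include.mean = TRUE)
summary(manual_fit)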
Now, we fit our model to the returns data.
The auto.arima() function in R uses a combination of unit root tests, minimization of the AICc, and maximum likelihood estimation (MLE) to obtain an ARIMA model. If d = 0, the constant c is included; if d ≥ 1, the constant c is set to zero.
Applying auto.arima() to the dataset: the function automatically selects the best ARIMA model for a given time series, using a stepwise algorithm to search through possible combinations of p, d, and q and choosing the model with the minimum information criterion (AICc by default; AIC or BIC can also be used).
modelfit <- auto.arima(IXIC_returns, trace = TRUE)
##
## Fitting models using approximations to speed things up...
##
## ARIMA(2,0,2) with non-zero mean : 6939.848
## ARIMA(0,0,0) with non-zero mean : 6976.437
## ARIMA(1,0,0) with non-zero mean : 6941.313
## ARIMA(0,0,1) with non-zero mean : 6945.625
## ARIMA(0,0,0) with zero mean : 6976.362
## ARIMA(1,0,2) with non-zero mean : 6939.477
## ARIMA(0,0,2) with non-zero mean : 6938.632
## ARIMA(0,0,3) with non-zero mean : 6939.186
## ARIMA(1,0,1) with non-zero mean : 6939.795
## ARIMA(1,0,3) with non-zero mean : 6939.461
## ARIMA(0,0,2) with zero mean : 6938.839
##
## Now re-fitting the best model(s) without approximations...
##
## ARIMA(0,0,2) with non-zero mean : 6938.636
##
## Best model: ARIMA(0,0,2) with non-zero mean
# Alternative: fit the Box-Cox-transformed price series directly
#modelfit <- auto.arima(IXIC$IXIC.Close, lambda = "auto", trace = TRUE)
summary(modelfit)
## Series: IXIC_returns
## ARIMA(0,0,2) with non-zero mean
##
## Coefficients:
## ma1 ma2 mean
## -0.1259 0.0687 0.0435
## s.e. 0.0223 0.0228 0.0292
##
## sigma^2 = 1.912: log likelihood = -3465.31
## AIC=6938.62 AICc=6938.64 BIC=6961
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 4.46426e-05 1.381669 0.9375665 -Inf Inf 0.6753624 -0.001765822
From the model summary we can check the significance of the estimated coefficients via approximate t-ratios (coefficient divided by its standard error): for ma1, |-0.1259|/0.0223 ≈ 5.6, and for ma2, 0.0687/0.0228 ≈ 3.0, so both MA coefficients are clearly significant. The mean, at 0.0435/0.0292 ≈ 1.5, is not significant at conventional levels.
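A quick sketch of this computation in R, using the coefficient vector and covariance matrix stored on the fitted model object (z_stats is a hypothetical name):
# Approximate z-statistics: coefficient estimates divided by their standard errors
z_stats <- modelfit$coef / sqrt(diag(modelfit$var.coef))
round(z_stats, 2)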
After fitting an ARIMA model to a time series dataset, the next step is to perform diagnostics on the model residuals to evaluate its accuracy and goodness of fit.
# Diagnostics on Residuals - diff between the observations and the fitted values
plot(resid(modelfit), ylab = "Residuals", main = "Residuals (ARIMA(0,0,2)) vs. Time")
Having examined the correlation structure of the data, we now compare the residuals against a normal density curve.
# Histogram of Residuals & Normality Assumption
hist(resid(modelfit),freq=F,main="Histogram of Residuals")
e=resid(modelfit)
curve(dnorm(x, mean=mean(e), sd=sd(e)), add=TRUE, col="darkred")
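A normal Q-Q plot is another common visual check of the normality assumption; a brief sketch:
# Q-Q plot of the residuals against the theoretical normal quantiles
qqnorm(e, main = "Normal Q-Q Plot of Residuals")
qqline(e, col = "darkred")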
The tsdiag() function in R generates a set of diagnostic plots for time series models, which can be helpful in identifying potential issues with the model and assessing its goodness of fit.
# Diagnostic tests for the fitted ARIMA model
tsdiag(modelfit)
Of these three plots, we focus on the Ljung-Box p-values.
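As an aside, the forecast package also provides checkresiduals(), which combines a residual time plot, an ACF plot, and a histogram with a Ljung-Box test in a single call:
# One-stop residual diagnostics from the forecast package
checkresiduals(modelfit)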
The Box-Ljung test is a statistical test commonly used to assess the goodness of fit of the model. It examines whether the autocorrelations of the residuals of the fitted model are significantly different from zero; if they are, the model is not adequately capturing the dynamics of the time series.
For the Ljung-Box test, the null hypothesis is:
H0: the data points are independently distributed (no autocorrelation).
# Ljung-Box test on the residuals (default lag = 1)
Box.test(modelfit$residuals, type="Ljung-Box")
##
## Box-Ljung test
##
## data: modelfit$residuals
## X-squared = 0.0062113, df = 1, p-value = 0.9372
The test statistic is X-squared = 0.0062113 with one degree of freedom (df = 1). The p-value is 0.9372, which is greater than the 0.05 significance level. We therefore fail to reject the null hypothesis: there is no evidence of significant autocorrelation in the residuals.
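In practice it is also worth testing at higher lags, with fitdf set to the number of estimated ARMA coefficients (two here, for the MA(2) terms) so that the degrees of freedom are adjusted for the fitted model. A sketch, output not shown:
# Ljung-Box at 10 lags, adjusting df for the two estimated MA coefficients
Box.test(modelfit$residuals, lag = 10, type = "Ljung-Box", fitdf = 2)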
The plot(as.ts(IXIC_returns)) call in R creates a time series plot of the NASDAQ returns, converted to a ts object with as.ts(); the model's fitted values are then overlaid in red.
plot(as.ts(IXIC_returns))
lines(modelfit$fitted,col="red", lty = 5)
The forecast() function in R generates forecasts from an ARIMA model; forecast(modelfit, h = 30) produces 30-day-ahead forecasts for the model stored in the modelfit object.
plot(forecast(modelfit,h=30))
As we can see, the blue line represents the mean (point) forecast, while the shaded bands are the 80% and 95% prediction intervals produced by default.
# Generate 30-day forecasts using the ARIMA model
fc <- forecast(modelfit, h = 30)
# Extract the point forecasts from the forecast object
pred_vals <- fc$mean
# Exponentiate the point forecasts. Caveat: the returns were defined as
# 100 * log differences, so exp(pred_vals / 100) would give the true gross
# return factors; exp(pred_vals), as used here, treats them as unscaled log returns
pred_vals_orig <- exp(pred_vals)
pred_vals_orig
## Time Series:
## Start = 1990
## End = 2019
## Frequency = 1
## [1] 1.205274 1.009139 1.044466 1.044466 1.044466 1.044466 1.044466 1.044466
## [9] 1.044466 1.044466 1.044466 1.044466 1.044466 1.044466 1.044466 1.044466
## [17] 1.044466 1.044466 1.044466 1.044466 1.044466 1.044466 1.044466 1.044466
## [25] 1.044466 1.044466 1.044466 1.044466 1.044466 1.044466
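Since the returns were defined as 100 times the daily log differences, the forecasts can instead be turned into an implied price path for the index. A sketch, starting from the last observed close (last_close and implied_path are hypothetical names):
# Implied price path: undo the 100x scaling, accumulate the log returns,
# and apply them to the last observed closing level
last_close <- as.numeric(tail(IXIC$IXIC.Close, 1))
implied_path <- last_close * exp(cumsum(fc$mean / 100))
implied_path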