Introduction

  The original dataset is “Monthly Counts of Deaths by Select Causes, 2014-2019,” from catalog.data.gov [link]. The counts cover the United States only. The analysis is done on a subset of this data: specifically, monthly death counts by homicide.
plot(homicides, main = "Homicides Per Month (2014-2019)")

The data ranges from 2014 through 2019. Homicides increase over time with a roughly parabolic trend, and the seasonality appears to be yearly, with large dips around January.
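These features can be checked directly with a classical decomposition. A minimal sketch, assuming `homicides` is the monthly `ts` object plotted above (i.e., built with `frequency = 12`):

```r
# Split the series into trend, seasonal, and remainder components.
# Assumes `homicides` is a monthly ts object,
# e.g. ts(counts, start = c(2014, 1), frequency = 12).
homicides.dec <- decompose(homicides)
plot(homicides.dec)

# The seasonal component makes the January dips explicit:
round(homicides.dec$seasonal[1:12])
```

The rising trend panel and the repeating seasonal panel correspond to the d = 1 and D = 1 differences used below.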

SARIMA Modeling

Fitting the Model

  For the SARIMA model, the best fit was achieved with a SARIMA(2,1,0)(0,1,1)[12] model, using the notation SARIMA(p,d,q)(P,D,Q)[S]. I originally fit a third-order AR component, similar to my previous ARMA model, but it did not fit well; the model above has the lowest AIC. Because the data has both yearly seasonality and a trend, it needs one seasonal and one non-seasonal difference: S = 12 and d = D = 1. Adding a seasonal MA term of order 1 also decreased the AIC, so Q = 1.
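The search over candidate orders can be sketched by comparing AICs directly. This assumes the `astsa` package and the `homicides` series from above; the candidate set shown here is illustrative, not the exhaustive list that was tried:

```r
library(astsa)  # provides sarima()

# Candidate (p, q, Q) orders, with d = D = 1 and S = 12 held fixed.
candidates <- list(c(3, 0, 0), c(2, 0, 0), c(2, 0, 1), c(1, 0, 1))

aics <- sapply(candidates, function(ord) {
  fit <- sarima(homicides, ord[1], 1, ord[2],
                0, 1, ord[3], 12, details = FALSE)
  fit$AIC
})
names(aics) <- sapply(candidates, paste, collapse = ",")
aics  # the (2, 0, 1) candidate, i.e. SARIMA(2,1,0)(0,1,1)[12], minimizes AIC
</imports>
```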

SARIMA model: \[ (1-\phi_1B-\phi_2B^2)(1-B)(1-B^{12})x_t=(1+\Theta_1B^{12})w_t \] such that: \[ (1+0.7173B+0.3790B^2)(1-B)(1-B^{12})x_t=(1-0.7089B^{12})w_t, ~~ w_t \sim^{iid} N(0,4702) \]

model.sarima <- sarima(homicides, 2, 1, 0, 0, 1, 1, 12, details = FALSE)
model.sarima
## $fit
## 
## Call:
## arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, Q), period = S), 
##     include.mean = !no.constant, transform.pars = trans, fixed = fixed, optim.control = list(trace = trc, 
##         REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ar1      ar2     sma1
##       -0.7173  -0.3790  -0.7089
## s.e.   0.1254   0.1257   0.2202
## 
## sigma^2 estimated as 4702:  log likelihood = -337.56,  aic = 683.11
## 
## $degrees_of_freedom
## [1] 56
## 
## $ttable
##      Estimate     SE t.value p.value
## ar1   -0.7173 0.1254 -5.7225  0.0000
## ar2   -0.3790 0.1257 -3.0161  0.0038
## sma1  -0.7089 0.2202 -3.2198  0.0021
## 
## $AIC
## [1] 11.57817
## 
## $AICc
## [1] 11.58556
## 
## $BIC
## [1] 11.71902
res <- model.sarima$fit$residuals  # avoid masking base residuals()
par(mfrow = c(2, 2))

plot(res, main = "Residuals")
qqnorm(res)
qqline(res)
acf.res <- acf1(res)

lags <- 1:35
p.values <- sapply(lags, function(l) Box.test(res, lag = l, type = "Ljung-Box")$p.value)
plot(lags, p.values, ylim = c(0, 1), main = "Ljung-Box Test P-values for the Residuals")
abline(h = 0.05, col = "blue", lty = 2)

This model fits quite well. The spike around lag 2 in the residual ACF and the heavy lower tail of the Q-Q plot stand out, but neither looks like a serious problem.

Prediction

Below is the forecast for the next 24 months, along with the prediction standard errors.

sarima.for(homicides, 24, 2, 1, 0, 0, 1, 1, 12)

## $pred
##           Jan      Feb      Mar      Apr      May      Jun      Jul      Aug
## 2020 1675.143 1442.789 1579.335 1593.680 1740.108 1767.190 1851.684 1774.708
## 2021 1749.584 1501.415 1630.197 1656.105 1797.181 1823.720 1910.632 1832.127
##           Sep      Oct      Nov      Dec
## 2020 1735.190 1719.117 1702.715 1739.025
## 2021 1792.789 1777.166 1760.374 1796.794
## 
## $se
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 2020  69.02733  71.71663  77.28531  87.39357  91.97341  97.65345 103.55022
## 2021 137.68426 142.96222 148.82627 155.45543 160.84281 166.37141 171.84417
##            Aug       Sep       Oct       Nov       Dec
## 2020 108.29493 113.22138 117.97420 122.38194 126.71824
## 2021 176.94911 181.99801 186.93381 191.68848 196.33372

Comparing Models

  The largest difference between the two methods is the use of differencing when fitting the SARIMA model, which allowed for a much better fit. The residuals of the ARMA model were correlated when they should not have been, a consequence of the poorly fit trend. Similarly, the parabolic trend used in the first model did not capture the overall trend very well: its forecast followed the parabola downward, when in reality the trend continues upward, as in the figure above. The fit would have been slightly better with a higher-order polynomial trend, as mentioned before.
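The advantage of differencing over polynomial detrending can be seen directly: one seasonal difference followed by one ordinary difference leaves a series with no visible trend or seasonality. A sketch using the same `homicides` series:

```r
# Apply the seasonal difference (lag 12) and then the first difference,
# matching the d = D = 1, S = 12 structure of the fitted model.
homicides.diff <- diff(diff(homicides, lag = 12))

plot(homicides.diff,
     main = "Seasonally and First-Differenced Homicides",
     ylab = "Differenced counts")
abline(h = 0, col = "blue", lty = 2)
```

The differenced series should fluctuate around zero, which is exactly what the explicit parabolic trend failed to deliver.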

Conclusion

  In general, differencing is much easier than explicit trend fitting and is more effective at capturing complicated trends. Seasonal AR and MA components also add important correlation structure to a model. A more complicated SARIMA model is not always needed, but it can offer many benefits.