Forecasting COVID-19 Confirmed Cases in Ghana: A Model Selection Approach

© 2021 The Authors. This article is licensed under a Creative Commons Attribution 4.0 License.

Abstract. This study seeks to determine an appropriate statistical technique for forecasting the cumulative confirmed cases of Coronavirus in Ghana. Cumulative daily data spanning March 12, 2020, to August 4, 2020, were retrieved from the Center for Systems Science and Engineering at Johns Hopkins University. Four statistical forecasting techniques, namely Autoregressive Integrated Moving Average (ARIMA), Artificial Neural Network (ANN), Exponential Smoothing (ETS) and Autoregressive Fractional Integrated Moving Average (ARFIMA), were fitted to the COVID-19 series. Their respective forecast accuracy measures were compared to select the appropriate technique for forecasting the COVID-19 cases. Our findings revealed that the ARFIMA technique was a suitable statistical model for predicting COVID-19 cases in Ghana. The "best" model for forecasting is ARFIMA (2, 0.49, 4), which passed all the needed diagnostic tests. Unequal weights were estimated to derive a combined model from all four forecasting techniques. A 149-day forecast of cumulative cases from the "best" model and the combined model revealed that the number of confirmed COVID-19 cases would increase slightly until the end of 2020.


INTRODUCTION
The occurrence of the COVID-19 pandemic has given the 21st-century generation a feel of the 1918 Spanish flu. COVID-19 was first seen in December 2019, when a cluster of pneumonia cases of unknown cause, with similar clinical manifestations suggesting viral pneumonia, appeared in Wuhan City, Hubei Province, China. According to the WHO [1] situation report, SARS-CoV-2, the virus that causes COVID-19, belongs to the β-coronaviruses, which are typical RNA viruses, and can spread from person to person [2]. According to Fernandes [3], the COVID-19 outbreak has caused serious global socio-economic turmoil. Globally, as of August 7, 2020, about 21.88 million confirmed cases of COVID-19 had been recorded in about 215 countries, with more than 773,926 deaths and 14.6 million recoveries [4,5]. The Africa subregion accounts for 5.14% of the global confirmed cases, 3.32% of the deaths and 5.75% of the recoveries [4,5]. The confirmed cases in Ghana stood at 42,653, with 239 deaths and 40,567 recoveries, as of August 7, 2020. Many countries, including Ghana, have responded by implementing self-isolation measures, social distancing, and mask wearing to prevent further spread [6].
Decision-makers are confronted with considerable uncertainty in deciding how to deal with the pandemic when health resources are scarce. In this regard, it is essential in practice to construct statistical models that are accurate and realistic enough to help forecast the future behaviour of the pandemic in terms of the possible number of daily cases. This can assist the medical system to better plan healthcare resources for new patients. Such statistical predictive models are useful both for forecasting and for controlling the global epidemic threat.
Some studies have modelled and forecasted the COVID-19 pandemic using time series analysis methods [7,8,9,10,11]. Although all countries are dealing with the same SARS-CoV-2, the prediction of future outbreaks seems to differ according to the unique pattern of cases in each country. However, there is limited evidence on the statistical methods that best predict SARS-CoV-2 infections in Ghana and other African countries.
This study aims to compare the performance of four different time series methods and determine the appropriate or "best" method for forecasting the confirmed cases of COVID-19 in Ghana. Within each time series technique, competing models are constructed, and information criteria are used to select the "within-best" model. The out-of-sample error metrics of these forecasting techniques are then compared to choose the overall "best" forecast model for the COVID-19 cases in Ghana. Therefore, in this study, much attention is given to how the "best" forecast method is selected.

METHODS AND MATERIALS
Dataset and Approach of Analysis. We focus on the confirmed cumulative daily COVID-19 cases in Ghana from March 12, 2020, to August 4, 2020. The data from March 12, 2020, to July 9, 2020, were used as training data for fitting the models, while the daily confirmed cases from July 10, 2020, to August 4, 2020, were used as test data for comparing the forecast performance of the models.
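As a quick arithmetic check on the sample sizes implied by these dates, the sketch below counts the daily observations in the training and test windows. It also counts the days from August 5 to December 31, 2020, on the assumption that this is the forecast window behind the 149-day horizon; the function name `days_inclusive` is introduced here for illustration only.

```python
from datetime import date


def days_inclusive(start: date, end: date) -> int:
    """Number of daily observations from start to end, both dates included."""
    return (end - start).days + 1


# Training and test windows as stated in the text
train_days = days_inclusive(date(2020, 3, 12), date(2020, 7, 9))
test_days = days_inclusive(date(2020, 7, 10), date(2020, 8, 4))

# Assumed forecast window: day after the last observation to year end
horizon = days_inclusive(date(2020, 8, 5), date(2020, 12, 31))

print(train_days, test_days, train_days + test_days)  # 120 26 146
print(horizon)  # 149
```

Under these assumptions the training set holds 120 observations, the test set 26, and the forecast horizon covers 149 days.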
The procedure used to analyze the dataset in this study is as follows:
1. The COVID-19 confirmed daily cases are plotted to observe the trend pattern and other features.
2. Three different unit root tests of stationarity are performed on the time series data.
3. For each forecasting technique employed, competing models are fitted to the cumulative COVID-19 case series; the "best" model is selected using the minimum information criterion.
4. Where the information criteria disagree, we compare the forecast accuracy measures of the models suggested by each information criterion. The final "best" model is the one with the minimum forecast accuracy measure.
5. The forecast performance of the "best" models from each forecasting technique in step 3 is compared using their error metrics.
6. The overall "best" forecasting technique from step 5 is used to forecast COVID-19 confirmed cases.
Unit Root Tests. The Augmented Dickey-Fuller (ADF), Phillips-Perron (PP) and Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests are three of the most commonly used unit root tests. The ADF and PP tests share the null hypothesis that the given time series has a unit root (that is, it is not stationary); the alternative hypothesis is that the series does not have a unit root (that is, it is stationary). The KPSS test, however, has the null hypothesis that the series is stationary, against the alternative that the series is not stationary.
Autoregressive Integrated Moving Average (ARIMA). The ARIMA (p, d, q) model can be written in backshift notation as
$$\phi(B)(1 - B)^d y_t = \theta(B)\varepsilon_t$$
where $\phi(B)$ is the autoregressive part in backshift form, $(1 - B)^d$ is the differencing of order $d$, $\theta(B)$ is the moving average part of the ARIMA model, and $\varepsilon_t$ is a white noise error term.

Exponential Smoothing Technique (ETS). The method in [12] extends simple exponential smoothing to allow the forecasting of data with a trend. This method involves a forecast equation and two smoothing equations (one for the level and one for the trend):
Forecast equation: $$\hat{y}_{t+h|t} = \ell_t + h b_t$$
Level equation: $$\ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + b_{t-1})$$
Trend equation: $$b_t = \beta^*(\ell_t - \ell_{t-1}) + (1 - \beta^*) b_{t-1}$$
where $\ell_t$ denotes an estimate of the level of the series at time $t$, $b_t$ denotes an estimate of the trend (slope) of the series at time $t$, $\alpha$ is the smoothing parameter for the level, $0 \le \alpha \le 1$, and $\beta^*$ is the smoothing parameter for the trend, $0 \le \beta^* \le 1$.
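To make the forecast, level and trend recursions of the linear trend method concrete, the following is a minimal sketch in plain Python. The initial values (level set to the first observation, trend set to the first difference), the function name `holt_linear`, the smoothing parameters and the toy series are assumptions chosen for illustration; they are not the fitted values or the software used in this study.

```python
def holt_linear(y, alpha, beta, horizon):
    """Linear trend exponential smoothing; returns forecasts for 1..horizon steps ahead."""
    if len(y) < 2:
        raise ValueError("need at least two observations")
    level = y[0]          # assumed initial level
    trend = y[1] - y[0]   # assumed initial trend
    for obs in y[1:]:
        prev_level = level
        # Level: weighted average of the observation and the one-step forecast
        level = alpha * obs + (1 - alpha) * (prev_level + trend)
        # Trend: weighted average of the latest level change and the previous trend
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # Forecast: last level plus h times the last trend
    return [level + h * trend for h in range(1, horizon + 1)]


# Toy, perfectly linear series (hypothetical data)
toy_series = [10.0, 12.0, 14.0, 16.0, 18.0]
forecasts = holt_linear(toy_series, alpha=0.8, beta=0.2, horizon=3)
print(forecasts)
```

On a perfectly linear series the recursions reproduce the line, so the three forecasts continue it (approximately 20, 22 and 24).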
According to [13], the simple exponential smoothing method is defined by cell (N, N), Holt's linear method by cell (A, N), the damped trend method by cell (Ad, N), Holt-Winters' additive method by cell (A, A), and Holt-Winters' multiplicative method is given by cell (A, M) in Table 1.

Autoregressive Fractional Integrated Moving Average (ARFIMA). The ARFIMA model is considered a long memory model. The idea of setting $d = 1$ or $d = 2$ to make a series stationary has been extended to the class of fractionally integrated ARMA, or ARFIMA, models, in which $-0.5 < d < 0.5$; when $d$ is negative, the series is termed antipersistent [14]. Now, $d$ becomes a parameter to be estimated.
Neural Network Models. A neural network is a network of "neurons" organized in layers. The predictors (or inputs) form the bottom layer, and the forecasts (or outputs) form the top layer (Figure 1). The simplest networks contain no hidden layers and are equivalent to linear regressions. With time series data, lagged values of the series can be used as inputs to a neural network; this is called a neural network autoregression, or NNAR, model. The notation NNAR (p, k) indicates that there are p lagged inputs and k nodes in the hidden layer.
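The structure of an NNAR (p, k) model can be illustrated with a short sketch: the series is turned into rows of p lagged inputs, and each row is passed through k hidden nodes to a single output. The sigmoid activation, the function names and the weights below are assumptions for illustration (the weights are arbitrary and untrained); this is not the network fitted in this study.

```python
import math


def lagged_inputs(series, p):
    """Pair each block of p consecutive past values with the next value as target."""
    rows = [series[t - p:t] for t in range(p, len(series))]
    targets = [series[t] for t in range(p, len(series))]
    return rows, targets


def nnar_forward(x, hidden_weights, hidden_bias, out_weights, out_bias):
    """One forward pass: p inputs -> k sigmoid hidden nodes -> one linear output."""
    hidden = []
    for w_row, b in zip(hidden_weights, hidden_bias):
        z = b + sum(w * xi for w, xi in zip(w_row, x))
        hidden.append(1.0 / (1.0 + math.exp(-z)))
    return out_bias + sum(w * h for w, h in zip(out_weights, hidden))


# Hypothetical scaled series and arbitrary weights for an NNAR(3, 2) layout
toy_series = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
rows, targets = lagged_inputs(toy_series, p=3)
prediction = nnar_forward(
    rows[0],
    hidden_weights=[[0.5, -0.2, 0.1], [0.3, 0.3, -0.4]],
    hidden_bias=[0.0, 0.1],
    out_weights=[0.7, -0.5],
    out_bias=0.2,
)
print(rows[0], targets[0], prediction)
```

In practice the weights are learned from the training data, and results depend on the random starting values, which is why a seed is usually fixed.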

Figure 1
Model Selection Criteria. In this study, three information criteria are utilized: the Akaike Information Criterion (AIC), the corrected Akaike Information Criterion (AICc) and the Bayesian Information Criterion (BIC). The AIC is given by (4):
$$\mathrm{AIC} = -2\log L(\hat{\theta}) + 2k \qquad (4)$$
where $L(\hat{\theta})$ is the likelihood of the fitted model, and $k$ is the number of unknown parameters free to vary.
The AICc is computed as (5):
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n-k-1} \qquad (5)$$
where $n$ is the total number of observations, while the BIC is given by (6):
$$\mathrm{BIC} = -2\log L(\hat{\theta}) + k\log n \qquad (6)$$
Forecast Accuracy Measures. Three error metrics, namely root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE), were employed to measure the predictive performance of the models, as given in (7)-(9). The RMSE measures the spread of the forecast errors about the actual data points, and so indicates how far the forecasted values of an estimated model are from the real data points. It is computed as (7):
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{t=1}^{m} e_t^2} \qquad (7)$$
where $e_t = Y_t - \hat{Y}_t$ is the error and $m$ is the number of forecasts being evaluated.
The MAPE measures the size of the forecast error as a percentage. It is computed using the formula (8):
$$\mathrm{MAPE} = \frac{100}{m}\sum_{t=1}^{m}\left|\frac{e_t}{Y_t}\right| \qquad (8)$$
The MAE is a scale-dependent measure based on the absolute errors and is computed as (9):
$$\mathrm{MAE} = \frac{1}{m}\sum_{t=1}^{m}\left|e_t\right| \qquad (9)$$
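The three error metrics can be computed directly from paired actual and forecast values. The sketch below uses the standard definitions of RMSE, MAE and MAPE; the function names and the numbers are hypothetical and are not taken from the tables of this study.

```python
import math


def forecast_errors(actual, forecast):
    """Errors e_t = Y_t - Yhat_t for paired actual and forecast values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return [a - f for a, f in zip(actual, forecast)]


def rmse(actual, forecast):
    errors = forecast_errors(actual, forecast)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(actual, forecast):
    errors = forecast_errors(actual, forecast)
    return sum(abs(e) for e in errors) / len(errors)


def mape(actual, forecast):
    """Mean absolute percentage error; actual values must be non-zero."""
    errors = forecast_errors(actual, forecast)
    return 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / len(errors)


# Hypothetical test-set values
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
print(rmse(actual, forecast), mae(actual, forecast), mape(actual, forecast))
```

Because RMSE and MAE are scale-dependent while MAPE is a percentage, the three measures need not rank competing models identically, which is one reason to report all of them.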

RESULTS AND DISCUSSIONS
Firstly, four different univariate time series techniques were employed to model and forecast COVID-19 cases in Ghana. These time series techniques are ARIMA, ETS, ANN and ARFIMA. The predictive performance of the various methods was used in selecting the "best" method for forecasting COVID-19 cases in Ghana. For each forecasting technique, appropriate competing models were constructed, and their information criteria were recorded. The model with the smallest information criterion was chosen as the "best" model for forecasting the COVID-19 time series data. The time series models were run in the R software, including the ARFIMA package.
Time Series Plot. Figure 2 shows a strong upward trend in COVID-19 cases from 2020-03-12 to 2020-08-04, which indicates that the series is not stationary. This is confirmed by the results of the three unit root tests (ADF, PP and KPSS) presented in Table 2, where the p-values are all greater than the 5% level of significance. Thus, there is not enough evidence to reject the null hypothesis that the COVID-19 series of Ghana is non-stationary. Nonetheless, a first difference of the series made it stationary according to the ADF and PP tests, whereas the KPSS test still indicated non-stationarity until the series was differenced a second time.
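Differencing is the operation behind the "first difference" and "second difference" mentioned here. The sketch below applies it to a short, hypothetical cumulative series; note that the first difference of cumulative counts is simply the series of daily new cases. The function name `difference` and the data are illustrative assumptions.

```python
def difference(series, order=1):
    """Apply the differencing operator (1 - B) to the series `order` times."""
    out = list(series)
    for _ in range(order):
        out = [later - earlier for earlier, later in zip(out, out[1:])]
    return out


# Hypothetical cumulative confirmed cases over six days
cumulative = [2, 6, 11, 19, 27, 52]
first_diff = difference(cumulative, order=1)   # daily new cases
second_diff = difference(cumulative, order=2)  # day-to-day change in new cases
print(first_diff)   # [4, 5, 8, 8, 25]
print(second_diff)  # [1, 3, 0, 17]
```

Each round of differencing shortens the series by one observation, which slightly reduces the sample available for fitting.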

Model Selection
In statistical model building, the standard practice is to fit several candidate models to a dataset and to choose the "best" model using the minimum information criterion.
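A minimal sketch of this selection step is given below, using the usual forms of AIC, AICc and BIC in terms of the maximized log-likelihood, the number of free parameters k and the sample size n. The candidate names, log-likelihoods and parameter counts are hypothetical, chosen so that the criteria disagree; they are not the values reported in the tables of this study.

```python
import math


def aic(loglik, k):
    return -2.0 * loglik + 2.0 * k


def aicc(loglik, k, n):
    return aic(loglik, k) + (2.0 * k * (k + 1)) / (n - k - 1)


def bic(loglik, k, n):
    return -2.0 * loglik + k * math.log(n)


# Hypothetical candidates: name -> (maximized log-likelihood, number of parameters)
candidates = {
    "model_a": (-100.0, 3),
    "model_b": (-97.0, 5),
    "model_c": (-104.0, 2),
}
n_obs = 120  # assumed number of training observations

scores = {
    name: {"AIC": aic(ll, k), "AICc": aicc(ll, k, n_obs), "BIC": bic(ll, k, n_obs)}
    for name, (ll, k) in candidates.items()
}
best_by = {
    crit: min(scores, key=lambda name: scores[name][crit])
    for crit in ("AIC", "AICc", "BIC")
}
print(best_by)
```

With these toy numbers AIC and AICc prefer the larger model while BIC, which penalizes parameters more heavily, prefers the smaller one. This is the kind of disagreement that is resolved in this study by comparing out-of-sample forecast accuracy.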
Modelling with ARIMA Model. From the stationarity tests in Table 2, the ADF and PP tests suggest a differencing order of 1, while the KPSS test suggests a differencing order of 2. Hence, two sets of competing models are built, one for each differencing order, and their respective information criteria are computed. Their forecast performance metrics then determine which model is chosen for the ARIMA technique. Table 3 presents results with a differencing order of 1; all three information criteria (AIC, AICc and BIC) suggest ARIMA (1, 1, 2) as the "best" model. Table 4 presents results with a differencing order of 2, as suggested by the KPSS test; all three information criteria (AIC, AICc and BIC) indicate ARIMA (0, 2, 2) as the "best" model.

To select the appropriate ARIMA model for COVID-19 cases in Ghana, the forecast values of the two models with different orders of differencing were then compared. Their accuracy measures were computed using the three error metrics (RMSE, MAE and MAPE). From Table 5, it is evident that ARIMA (1, 1, 2) is the "best" model since it had the minimum error metrics.
Modelling with Artificial Neural Network. Several competing artificial neural networks were constructed after setting a random seed, and the NNAR (3, 1, 2) model was considered the "best" since it had the smallest forecast accuracy measures in Table 7.
Modelling with ARFIMA. The optimal fractional differencing parameter (d) was estimated to be 0.49, and nine competing models were constructed. The information criteria suggested two different models: AIC suggested ARFIMA (2, 0.49, 4), while BIC suggested ARFIMA (2, 0.49, 0), as presented in Table 8.

We estimated the forecast performance of the two ARFIMA models suggested by the information criteria (AIC and BIC), and the minimum performance metric was used to select the "best" ARFIMA model. From the results presented in Table 9, ARFIMA (2, 0.49, 4) was chosen as the "best" model for the ARFIMA technique.
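To give a feel for what a fractional differencing parameter of d = 0.49 implies, the sketch below computes the first few coefficients of the binomial expansion of $(1 - B)^d$, using the standard recursion $\pi_0 = 1$, $\pi_j = \pi_{j-1}(j - 1 - d)/j$. The function name `frac_diff_weights` is introduced here for illustration; this shows the operator only, not the estimation routine used in this study.

```python
def frac_diff_weights(d, n_terms):
    """First n_terms coefficients pi_j in (1 - B)^d = sum_j pi_j B^j."""
    weights = [1.0]
    for j in range(1, n_terms):
        weights.append(weights[-1] * (j - 1 - d) / j)
    return weights


# Integer differencing: the expansion stops after the lag-1 term
print(frac_diff_weights(1.0, 4))

# Fractional differencing with d = 0.49: weights decay slowly and never vanish
print([round(w, 5) for w in frac_diff_weights(0.49, 6)])
```

For d = 1 the weights reduce to the ordinary first difference (1, -1, 0, ...), whereas for d = 0.49 distant observations keep small, slowly decaying weights. That slow decay is what gives ARFIMA models their long memory.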

We compare the forecast performance of the "best" models from the four different time series modelling techniques using the three performance metrics computed from the test data. The time series technique with the minimum performance metrics is selected as the "best" method. From Table 10, the ARFIMA (2, 0.49, 4) model has the smallest error metric values among the four forecasting techniques. Hence, it is concluded that ARFIMA (2, 0.49, 4) is the "best" model for forecasting COVID-19 confirmed cases in Ghana. Figure 3 presents the diagnostic checks on the residuals of the chosen model, ARFIMA (2, 0.49, 4), carried out to see whether it violates any of the assumptions underlying the model. From Figure 3, we observed the following:
1. There is no apparent trend in the plot of the standardized residuals over the days.
2. A plot of the ACF of the residuals confirms that none of the lags is statistically significant, implying that the residuals are not correlated.
3. The box plot shows that most of the errors are approximately normally distributed, except for a few potential outliers observed near the midpoint.
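The ACF check in item 2 can be sketched as follows: compute the sample autocorrelations of the residuals and compare each with the approximate 95% bounds of ±1.96/√n that are usually drawn on an ACF plot. The function names and the residual values are hypothetical; the actual residuals of the fitted model are not reproduced here.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations r_1..r_max_lag of the series x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(1, max_lag + 1)
    ]


def significance_bound(n):
    """Approximate 95% bound for the autocorrelations of white noise."""
    return 1.96 / math.sqrt(n)


# Hypothetical residuals
residuals = [0.3, -0.1, 0.4, -0.5, 0.2, 0.1, -0.3, 0.0, 0.2, -0.2]
acf_values = sample_acf(residuals, max_lag=3)
bound = significance_bound(len(residuals))
flags = [abs(r) > bound for r in acf_values]
print(acf_values, bound, flags)
```

A lag whose autocorrelation falls outside the bounds would be flagged as significant; residuals from an adequate model should produce few or no such flags.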
The test of autocorrelations provides an essential diagnostic tool. Therefore, the Box-Ljung test was used to check for autocorrelation under the hypotheses $H_0$: residuals are not autocorrelated versus $H_1$: residuals are autocorrelated. From the results presented in Table 11, the null hypothesis of no residual autocorrelation is not rejected, since the p-value (0.9786) is greater than the 5% significance level.
In Figure 4, the forecasts of cumulative confirmed COVID-19 cases from the four forecasting techniques (i.e., their respective "best" models), from August 5, 2020 (05/08/2020) to the end of December, are presented. The NNAR technique gives the lowest forecast values, while the ETS technique gives the highest. The forecast of the overall "best" model, ARFIMA (2, 0.49, 4), is slightly above that of NNAR.
In Table 13, we combined the forecast values of the "best" models from the four respective forecasting methods. Unequal weights are estimated from the MAPE values in Table 10: the MAPE of the overall "best" forecasting technique (ARFIMA (2, 0.49, 4)) is subtracted from that of each of the other forecasting techniques to obtain the difference (d) reported in Table 12.
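The general idea of combining forecasts with unequal weights can be sketched as below. Since the exact mapping from the MAPE differences to weights is not spelled out here, the sketch swaps in a common alternative, inverse-MAPE weighting, in which techniques with smaller error receive larger weight. This is an assumption for illustration, not the scheme of Table 12, and all names and numbers are hypothetical.

```python
def inverse_error_weights(errors):
    """Weights proportional to 1/error, normalized to sum to one."""
    inverse = {name: 1.0 / err for name, err in errors.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def combine(forecasts, weights):
    """Weighted average of the point forecasts for a single day."""
    return sum(weights[name] * forecasts[name] for name in forecasts)


# Hypothetical out-of-sample MAPE values (percent) for four techniques
mape_by_technique = {"tech_1": 1.0, "tech_2": 2.0, "tech_3": 4.0, "tech_4": 4.0}
# Hypothetical point forecasts of cumulative cases for one future day
day_forecasts = {"tech_1": 50000.0, "tech_2": 52000.0, "tech_3": 56000.0, "tech_4": 48000.0}

weights = inverse_error_weights(mape_by_technique)
combined = combine(day_forecasts, weights)
print(weights, combined)
```

Whatever weighting rule is used, the weights should be non-negative and sum to one so that the combined forecast stays within the range of the individual forecasts.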
The forecast values from the overall "best" forecasting technique (ARFIMA (2, 0.49, 4)) are similar to those of the combined model.
Some studies have modelled the COVID-19 pandemic using time series analysis methods [7,8,9,10]. In this study, four competing forecasting techniques (ARIMA, ETS, NNAR, and ARFIMA) were fitted to the COVID-19 confirmed cases so that the appropriate or "best" forecasting technique could be used to forecast COVID-19 cases in Ghana.
Although researchers such as [9,10,11] have used some of these techniques to model and forecast COVID-19 cases in other countries, their selection of the appropriate method was not exhaustive.
Here, several competing models were constructed for each forecasting technique, and the "best" model was selected to represent that technique. The out-of-sample forecast performance of these respective "best" models was then compared, and the one with the minimum error metrics was selected as the overall "best" forecasting technique.
Therefore, the ARFIMA technique was selected as the overall "best" forecasting technique for the COVID-19 cases in Ghana in this study. To the best of our knowledge, this study is the first to construct time series models for Ghana's COVID-19 cases and to select the ARFIMA technique as the appropriate forecasting technique.

CONCLUSION
The COVID-19 pandemic has been spreading rapidly across different parts of the world, and Ghana has not been spared. The pandemic continues to cause havoc, especially in the economic development of the country. Hence, prediction of cases is vital for stakeholders in the public and private sectors of Ghana. Therefore, this research sought to identify an appropriate statistical technique for forecasting the cumulative daily cases of Coronavirus in Ghana. To this end, four competing forecasting techniques (ARIMA, ETS, NNAR, and ARFIMA) were applied to the COVID-19 series from 2020-03-12 to 2020-08-04 and compared to choose the proper method. Our findings revealed that the ARFIMA technique is the appropriate statistical technique for forecasting COVID-19 cases in Ghana. The "best" model for forecasting is ARFIMA (2, 0.49, 4), which passed all the needed diagnostic tests. A 149-day forecast from the "best" model revealed that the number of cases of COVID-19 will still be on the rise.