Multi-Step Polynomial Regression Method To Model and Forecast Malaria Incidence
By
Chandrajit Chatterjee
M Sc (I), Statistics
University of Madras, Chennai
Introduction to malaria
Malaria is a communicable disease caused
by parasitic infection (mostly by the
parasite Plasmodium falciparum).
It causes 300-500 million cases of infection
around the world and kills 1.5 to 2.7 million
people each year.
Drug resistance in the parasite and the lack of a
single, universally accepted control measure
have aggravated the global situation of
malaria incidence.
The premier cause of concern is the high
incidence rate of the disease in children below 5
Four species of pathogens are found to cause
the major part of malaria around the world,
namely Plasmodium falciparum, P. vivax, P.
malariae and P. ovale.
Our aim:
We aim at establishing a simple method that bypass the
The Data:
Malaria incidence data from the Corporation of Chennai.
(I) The first data set (large: over 30 time points) consists of the
monthly slide positivity rate (SPR) of malaria (all types) in Chennai over 37
months from Jan 2002 to Jan 2005.
(II) The second data set (small: fewer than 30 time points) consists of
deaths due to Plasmodium vivax distributed over the 10 zones of
Chennai city for the 12 months of the year 2006, and the population in
these months for all zones.
Climatic data were obtained for the same time points from the following
websites:
www.waterportal-india.org
www.wunderground.com
www.imd.ernet.in
www.worldweather.com
The population of Chennai city for the period between Jan 2002
and Jan 2005 was derived from census data from 1901 to
2001 from the website ‘www.gisd.tn.nic.in/census-paper1,
census-paper2’.
A third-order polynomial was fitted to these data to impute the
population at the required time points, as in fig 2.
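The imputation step can be sketched as follows. This is a minimal illustration only: the census counts below are hypothetical placeholders (not the Chennai figures), and the names `census_years`, `census_pop` and `impute_population` are assumptions introduced for the example. Note that the months from 2002 to 2005 lie beyond the last census year, so the fitted cubic is being extrapolated there.

```python
import numpy as np

# Hypothetical decadal census counts (placeholders, NOT the Chennai data).
census_years = np.array([1951, 1961, 1971, 1981, 1991, 2001], dtype=float)
census_pop = np.array([1.4e6, 1.7e6, 2.5e6, 3.3e6, 3.8e6, 4.3e6])

# Centre the years so that the powers of x stay well conditioned.
year0 = census_years.mean()
coeffs = np.polyfit(census_years - year0, census_pop, deg=3)

def impute_population(year, month):
    """Evaluate the fitted cubic at a fractional year (month is 1..12)."""
    t = year + (month - 1) / 12.0
    return float(np.polyval(coeffs, t - year0))

# Monthly estimates for Jan 2002 .. Jan 2005 (37 time points).
months = [(2002 + k // 12, k % 12 + 1) for k in range(37)]
monthly_pop = [impute_population(y, m) for (y, m) in months]
```

Centring the years before fitting is a numerical convenience and does not change the fitted curve.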
FIG 5: SCATTER BETWEEN log(SPR) AND Population
FIG 6: SCATTER BETWEEN log(SPR) AND Min Temp
Model construction and refinement
We deviate here from other similar work in the field in that
we consider a non-linear model while keeping the methodology
simple: the functional form of the model need not be
pre-defined, as it is determined in the process.
In the model-refinement stage of our multi-step regression
procedure, we treat each functional form of an independent
variable as a separate independent variable.
When R2 is not a plausible option for comparison of the
predictions made, we have taken the adjusted R2 * as the
criterion of comparison.
We then construct and refine the model based on the principle
of ‘progressive improvement of residual sum of squares in
stepwise induction of variables’, as depicted in the next
flowchart.
* Adjusted R2 is a refinement over R2 in that it uses the unbiased
estimators for the sum of squares used in R2, calculated as
1-[(1-R2)*(n-1)/(n-p-1)], where p is the number of independent
variables in the model.
Flowchart: stepwise induction of variables
1. Start with the model containing the initial variable (say Xi) with the
   highest R-square.
2. Induce Xi^2. Is R2 increasing?
   YES: accept Xi^2, then induce all higher orders of Xi (one by one), as
        defined by its relation with the dependent variable, dependent on R2.
   NO:  remove Xi^2.
3. Induce Xj. Is R2 increasing?
   YES: accept Xj, then induce all higher orders of Xj (one by one) as
        before, dependent on R2.
   NO:  remove Xj.
4. Are all variables exhausted?
   NO:  induce a new variable and its function-forms (return to step 3).
   YES: STOP inducing.
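The flowchart logic can be sketched in code as below. This is an illustrative simplification, not the authors' implementation. In-sample R2 can never decrease when a term is added to a least-squares fit, so the sketch uses the adjusted R2 as the acceptance criterion. It also considers only integer powers up to `max_order`, and tries every power rather than stopping at the first rejected one. The function names and the synthetic data are assumptions made for the example.

```python
import numpy as np

def adjusted_r2(y, X):
    """OLS with intercept; returns 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def stepwise_induction(y, variables, max_order=4):
    """variables: dict name -> 1-D array. Returns (accepted terms, best score)."""
    # Start from the variable whose single-variable model fits best.
    order = sorted(variables,
                   key=lambda k: adjusted_r2(y, variables[k][:, None]),
                   reverse=True)
    accepted, columns, best = [], [], -np.inf
    for name in order:
        for power in range(1, max_order + 1):
            trial = columns + [variables[name] ** power]
            score = adjusted_r2(y, np.column_stack(trial))
            if score > best:      # criterion increased: accept the term
                columns, best = trial, score
                accepted.append((name, power))
            # otherwise the term is dropped and the next one is tried
    return accepted, best

# Synthetic demonstration data (hypothetical, not the malaria data).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
z = rng.uniform(-1, 1, 60)
y = 1 + 2 * x + 3 * x ** 2 + 0.01 * rng.normal(size=60)
accepted, best = stepwise_induction(y, {"x": x, "z": z})
```

Other functional forms, such as the logarithm of a variable, can be passed in as additional entries of `variables`, which matches the idea of treating each functional form as a separate independent variable.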
In this particular analysis the final model after refinement is:
log(SPR) = -26.492 + 5.343E-21*(POPULATION)^3 + 22.008*log(MAX TEMP)
           - 1.652*[log(MAX TEMP)]^4 + 1.097E-02*(MIN TEMP)
           - 5.24E-07*[MIN TEMP]^3 - 5.03E-02*log(RAINFALL)
           + 9.128E-02*[log(RAINFALL)]^2 - 2.45E-02*[log(RAINFALL)]^3.
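For readers who want to evaluate the fitted equation, a small sketch follows. The coefficients are copied from the model above, with the powers (3, 4, 3, 2, 3) taken from its superscripts. Two things are assumptions: the slides do not state the base of the logarithm (base 10 is assumed here and can be swapped through the `log` argument), and the units of the inputs are not given. The function name `predict_log_spr` is hypothetical.

```python
import math

def predict_log_spr(population, max_temp, min_temp, rainfall, log=math.log10):
    """Evaluate the fitted model; `log` is ASSUMED to be base 10."""
    l_max = log(max_temp)   # requires max_temp > 0
    l_rain = log(rainfall)  # requires rainfall > 0
    return (-26.492
            + 5.343e-21 * population ** 3
            + 22.008 * l_max
            - 1.652 * l_max ** 4
            + 1.097e-02 * min_temp
            - 5.24e-07 * min_temp ** 3
            - 5.03e-02 * l_rain
            + 9.128e-02 * l_rain ** 2
            - 2.45e-02 * l_rain ** 3)
```

Because rainfall enters only through its logarithm, months with zero recorded rainfall cannot be evaluated as written; how such months were handled is not stated in the slides.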
Results:
The model gave an R2 of 44.16% and an adjusted R2 of 42.51%,
which suggests that the model's degree of prediction will not
vary much when applied to a new data set of a similar kind.
ANOVA indicates an F value of 2.720 (F(sig) = 0.024), which is
significant at the 5% significance level, indicating a good
explanatory nature of the regression.
95% confidence intervals over all 36 time points contain the
predicted responses, indicating significantly correct response
prediction (process: slide 17).
The forecasted value for Jan ‘05 also falls within the 95% C.I. of the
response, indicating a fairly good degree of forecast by the model.
FIG 7: PLOT OF FIT OF PREDICTED VALUES OF MODEL AGAINST OBSERVED
VALUES
Legends:
Dashed series with circular markers-predicted values
Straight line with square markers-observed values
Green circle-forecasted value
Light blue square-observed
The error bars are constructed according to 95% confidence intervals of
predicted values.
Testing the model predictions:
Let a general regression model be :
8.72E07*(max hum)
3
5.81E11*(max hum)
4
2.98E10*(max temp)
NOTE: After the predictions for scaled p.v. deaths are obtained from
the model one needs to multiply the values by population total to
obtain total p.v. deaths.
Results:
The model gave a coefficient of determination (R2) of 82.565% and an
adjusted R2 of 60.77%, which as before is taken to indicate the goodness
of the model in application to a new data set.
ANOVA indicates a regression F value of 3.78 against a critical
value of 0.11 at the 5% level, thereby signaling a good regressive nature of
the model.
The 95% C.I.s over the 10 time points include the predicted P.V. deaths
(except for 1 value), and the observed values also lie within the C.I.s,
thereby validating the confidence intervals at level 0.05.
The 95% C.I. contains both the forecasted values of November and
FIG 11: PLOT OF FIT OF PREDICTED VALUES OF MODEL AGAINST OBSERVED
VALUES.
Legends:
Dashed series with circular markers-predicted values
Straight line with square markers-observed values
Green circle-forecasted value
Blue square-observed values in forecasting horizon
The error bars are constructed according to 95% confidence intervals of
predicted values.
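The adjusted R2 reported for the P.V. deaths model can be cross-checked with the formula 1-[(1-R2)*(n-1)/(n-p-1)]. In the sketch below, n = 10 is the number of fitted time points mentioned in the results. The value p = 5 is an assumption, since the full list of model terms is not available here. The helper name `adjusted_r2_from_r2` is hypothetical.

```python
def adjusted_r2_from_r2(r2, n, p):
    """Adjusted R2 from R2, n observations and p independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# R2 = 82.565%, n = 10 time points, p = 5 (ASSUMED number of predictors).
check = adjusted_r2_from_r2(0.82565, n=10, p=5)
print(round(check, 4))  # approx. 0.6077, i.e. the reported 60.77%
```

With so few time points, the penalty factor (n-1)/(n-p-1) is large, which is why the adjusted value falls well below the raw R2.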
A special study of individual zones
To analyze the zones separately, we would need to construct a
model for each zone, which is cumbersome and may be redundant.
We hence did a Factor Analysis on the ten zones' data of deaths
due to P.V. (scaled by population) together with the climatic
factors, with Principal Component Analysis as the extraction
method. The data set was normalized before the analysis. The aim
was to reduce the large analytical data set of intercorrelated
variables to a smaller and interpretable set of ‘factors’.
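A minimal sketch of factor extraction by principal components is given below, under stated assumptions. It normalizes each variable, eigen-decomposes the correlation matrix, and scales the leading eigenvectors by the square roots of their eigenvalues to obtain unrotated loadings. The random demonstration matrix (12 monthly rows, 10 zone columns plus 3 climate columns), the number of factors, and the function name are all hypothetical. Any rotation or retention rule used in the original analysis is not reproduced.

```python
import numpy as np

def pca_factor_loadings(data, n_factors):
    """data: (n_obs, n_vars). Returns (loadings, explained variance fractions)."""
    # Normalize each variable to zero mean and unit variance.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = (z.T @ z) / (z.shape[0] - 1)           # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)       # eigenvalues in ascending order
    eigvals = np.clip(eigvals, 0.0, None)         # guard against tiny negatives
    idx = np.argsort(eigvals)[::-1][:n_factors]   # indices of the largest ones
    loadings = eigvecs[:, idx] * np.sqrt(eigvals[idx])
    explained = eigvals[idx] / eigvals.sum()
    return loadings, explained

# Hypothetical stand-in data: 12 months x (10 zones + 3 climate variables).
rng = np.random.default_rng(0)
demo = rng.normal(size=(12, 13))
loadings, explained = pca_factor_loadings(demo, n_factors=3)
```

Each row of `loadings` describes how strongly one zone or climatic variable is associated with each extracted factor, which is what allows zones with similar loadings to be interpreted together.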