Multi-Step Polynomial Regression Method To Model and Forecast Malaria Incidence


Regression model for predicting Malaria incidence in Chennai

By
Chandrajit Chatterjee
M Sc (I), Statistics
University of Madras, Chennai
Introduction to malaria
• Malaria is a communicable disease caused by parasitic infection (mostly by the parasite Plasmodium falciparum).
• It causes 300-500 million cases of infection around the world and kills 1.5 to 2.7 million people each year.
• Drug resistance of the parasite and the lack of a single, universally accepted control measure have aggravated the global situation of malaria incidence.
• The premier cause of concern is the high incidence rate of the disease in children below 5 years of age.
• Four species of the pathogen cause the major part of malaria around the world, namely Plasmodium falciparum, P. vivax, P. malariae and P. ovale.
• In India and other warmer parts of the world the falciparum species is far more predominant than its counterparts (death is caused by cerebral malaria and renal failure).
• The parasite is transmitted from person to person by mosquitoes of the genus Anopheles.
• Out of the 422 Anopheles species existing worldwide only 40 are important from
MALARIA IN INDIA
Malaria is prevalent in all parts of the country except in areas more than 5000 feet above sea level.
In India, malaria incidence has remained consistently high since 1993, except for the period between 1995 and 1999 (Fig. 1).

Fig 1: THE STATE OF MALARIA IN INDIA FROM 1961 TO 2006
Malaria control: need of a model
• A complete orientation towards eradication requires understanding the causative factors, their degree of influence and the disease transmission dynamics over a time horizon; that is where the need for a model arises.
• Modern modeling probably began with Harvey in the 17th century, when he used quantitative reasoning to prove the circulation of blood.
• There are 2 broad methods of modeling: one uses abstract differential equations (mathematical modeling) and the other uses data on previous incidences and related causes (data-based or statistical modeling).

Our aim:
We aim at establishing a simple method that bypasses the
The Data:
Malaria incidence data were obtained from the Corporation of Chennai.

(I) The first data set (large: over 30 time points) consists of the monthly slide positivity rate (SPR) of malaria (all types) in Chennai over 37 months from Jan 2002 to Jan 2005.

(II) The second data set (small: fewer than 30 time points) consists of deaths due to Plasmodium vivax distributed over the 10 zones of Chennai city for the 12 months of the year 2006, together with the population of all zones in these months.

Climatic data were obtained for the same time points from the following websites:
www.waterportal-india.org
www.wunderground.com
www.imd.ernet.in
www.worldweather.com
Contd….
• The population of Chennai city for the period between Jan 2002 and Jan 2005 was obtained from census data for 1901 to 2001, taken from the website ‘www.gisd.tn.nic.in/census-paper1, census-paper2’.
• A third order polynomial was fitted to the census data to impute the population at the required time points, as in Fig 2.

Fig 2: IMPUTATION OF POPULATION (Dashed-observed; Straight line-predicted).
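A minimal sketch of this kind of imputation, assuming NumPy is available. The census figures below are synthetic placeholders generated from a smooth curve, not the Chennai census data, and the helper name `impute_population` is hypothetical. Note that evaluating the fitted curve at 2002 to 2005 extrapolates beyond the last census year.

```python
import numpy as np

def impute_population(years, population, query_years, order=3):
    """Fit a polynomial of the given order to census counts and
    evaluate it at the query years (fractional years allowed)."""
    years = np.asarray(years, dtype=float)
    # Centre and scale the years so the fit stays well conditioned.
    centre, scale = years.mean(), years.std()
    coeffs = np.polyfit((years - centre) / scale, population, order)
    q = (np.asarray(query_years, dtype=float) - centre) / scale
    return np.polyval(coeffs, q)

# Synthetic decennial "census" values (placeholders, NOT real data).
census_years = np.arange(1901, 2002, 10)
t = (census_years - 1901) / 100.0
census_pop = 0.5e6 + 1.0e6 * t + 2.0e6 * t**2 + 0.8e6 * t**3

# The 37 monthly time points from Jan 2002 to Jan 2005 as fractional years.
months = 2002 + np.arange(37) / 12.0
imputed = impute_population(census_years, census_pop, months)
```

Because the synthetic series is itself a cubic, the fit reproduces it exactly; real census counts would leave residuals that should be inspected before the imputed values are used.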
Tools of analysis:
• Microsoft Excel 2000 and 2007
• SPSS for Windows 11.0
• MATLAB 7.5.0

The method in brief

Our methodology consists of the following steps:
• Selection of variables
• Identification of the relationship between the model variables
• Initial identification of the model
• Model refinement
• Prediction and forecasting with the help of the model equation
• Testing the correctness of prediction with standard methods
Variable selection and identification of relationship

The procedure begins with variable selection:
• First order models are considered first, i.e. simple linear regressions between the dependent variable and each factor individually; the first variable is selected purely on the basis of the coefficient of determination*.
• In the next step we consider regression models of order 2 with the first variable already in the model and choose the second variable on the basis of the partial t-statistic**. We go on increasing the order in this way until all variables are exhausted or no remaining variable satisfies the criterion, whichever comes first.
* Coefficient of determination: a measure of goodness of regression fit, denoted R-square.
** Partial t-statistic: a t test that determines the influence of a particular variable in a multi-variable model.
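The selection procedure described above can be sketched as follows, assuming NumPy. The function names, the synthetic data and the cut-off `t_min = 2.0` (a rough stand-in for a tabulated critical value of the t distribution) are all assumptions made for illustration; the slides do not state the threshold that was used.

```python
import numpy as np

def ols(X, y):
    """Least-squares fit with intercept; returns R^2 and the t-statistics of the slopes."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return r2, (beta / se)[1:]

def forward_select(X, y, t_min=2.0):
    """Pick the first variable by R^2, then add variables by partial t-statistic."""
    remaining = list(range(X.shape[1]))
    first = max(remaining, key=lambda j: ols(X[:, [j]], y)[0])
    chosen = [first]
    remaining.remove(first)
    while remaining:
        # Partial |t| of each candidate when added to the current model.
        scores = {j: abs(ols(X[:, chosen + [j]], y)[1][-1]) for j in remaining}
        best = max(scores, key=scores.get)
        if scores[best] < t_min:
            break  # no remaining variable satisfies the criterion
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Synthetic data (NOT the Chennai data): columns 2 and 0 drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(36, 4))
y = 3.0 * X[:, 2] + 1.5 * X[:, 0] + rng.normal(scale=0.5, size=36)
selected = forward_select(X, y)
```

The returned list gives the column indices in the order in which they entered the model.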
Discussions - Analysis I-Slide Positivity Rates:
• After variable selection we have:
  - Rainfall
  - Maximum temperature
  - Minimum humidity
  - Population
• The optimum relations between the dependent variable and the selected independent variables are then determined from a study of the following scatter plots. (In our method, the functional forms of the individual relationships are determined by trial and error.)
• The model refinement procedure was then run to arrive at the final model.
FIG 3: SCATTER BETWEEN log(SPR) AND log(Max Temp). A 4th order polynomial explains the relationship between the two variables with an R square of 49%.
FIG 4: SCATTER BETWEEN log(SPR) AND log(Rainfall). A 3rd order polynomial explains 18% of the variability between the two.
FIG 5: SCATTER BETWEEN log(SPR) AND Population
FIG 6: SCATTER BETWEEN log(SPR) AND Min Temp
Model construction and refinement
• We deviate here from other similar work in the field in that we consider a non-linear model as it is, while keeping the methodology simple: the functional form of the model need not be pre-defined, as it is determined in the process.
• In the model refinement stage of our multi-step regression procedure, each functional form of an independent variable is treated as a separate independent variable.
• When R² is not a plausible option for comparing the predictions made, we have taken the adjusted R²* as the criterion of comparison.
• We then construct and refine the model based on the principle of ‘progressive improvement of residual sum of squares in stepwise induction of variables’, as depicted in the next flowchart.
* Adjusted R² is a refinement over R² in that it uses the unbiased estimators for the sum of squares used in R². It is calculated as $1 - (1 - R^2)\,\frac{n-1}{n-p-1}$, where n is the number of data points and p is the number of independent variables in the model.
• Start: model with the initial variable (say Xi), the one with the highest R-square.
• Induce Xi². Is R² increasing? YES: accept Xi². NO: remove Xi².
• Induce all higher orders of Xi (one by one), as defined by its relation with the dependent variable, each dependent on R².
• Induce Xj. Is R² increasing? YES: accept Xj, then induce all higher orders of Xj (one by one) as before, dependent on R². NO: remove Xj.
• Are all variables exhausted? NO: induce a new variable and its function-forms, and repeat. YES: STOP inducing.
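The refinement loop can be sketched roughly as below, assuming NumPy. Two simplifications are assumptions of this sketch rather than statements about the original procedure: every power of every variable is tried as a separate candidate term, and the acceptance test uses adjusted R², because plain R² can never decrease when a term is added to a least-squares model. The names `adjusted_r2` and `refine` and the synthetic data are hypothetical.

```python
import numpy as np

def adjusted_r2(columns, y):
    """Adjusted R^2 of a least-squares fit of y on the given columns plus intercept."""
    n, p = len(y), len(columns)
    A = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def refine(variables, y, max_order):
    """variables: dict name -> 1-D array, ordered by the variable-selection step.
    max_order: dict name -> highest polynomial order suggested by the scatter plot.
    Returns the accepted (name, power) terms."""
    terms, columns, best = [], [], -np.inf
    for name, x in variables.items():
        for power in range(1, max_order[name] + 1):
            candidate = x ** power
            score = adjusted_r2(columns + [candidate], y)
            if score > best:          # criterion improved: accept the term
                terms.append((name, power))
                columns.append(candidate)
                best = score
            # otherwise the term is removed again, i.e. simply not kept
    return terms

# Synthetic data (NOT the Chennai data): y depends on x1 and x1^3 only.
rng = np.random.default_rng(0)
x1 = np.linspace(-1.0, 1.0, 30)
x2 = rng.normal(size=30)
y = 2.0 * x1 - 3.0 * x1**3 + rng.normal(scale=0.1, size=30)
terms = refine({"x1": x1, "x2": x2}, y, {"x1": 3, "x2": 2})
```

The first term is always accepted, which mirrors starting from the model with the initial variable; later terms stay only if they improve the criterion.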
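In this particular analysis the final model after refinement is:

$$
\begin{aligned}
\log(\text{SPR}) = {} & -26.492 + 5.343\times10^{-21}\,(\text{POPULATION})^{3} + 22.008\,\log(\text{MAX TEMP}) \\
& - 1.652\,[\log(\text{MAX TEMP})]^{4} + 1.097\times10^{-2}\,(\text{MIN TEMP}) \\
& - 5.24\times10^{-7}\,[\text{MIN TEMP}]^{3} - 5.03\times10^{-2}\,\log(\text{RAINFALL}) \\
& + 9.128\times10^{-2}\,[\log(\text{RAINFALL})]^{2} - 2.45\times10^{-2}\,[\log(\text{RAINFALL})]^{3}
\end{aligned}
$$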

NOTE: SPR values are recovered from the predicted log(SPR) by exponentiation.
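As an illustration, the fitted equation can be evaluated directly. The sketch below reads the equation with the cube on POPULATION, the fourth power on log(MAX TEMP), the cube on MIN TEMP, and the square and cube on log(RAINFALL). The base of the logarithm is not stated in the slides, so base 10 is an assumption here, as are the function name `predict_spr` and the units of the inputs.

```python
import math

def predict_spr(population, max_temp, min_temp, rainfall, base=10.0):
    """Evaluate the fitted equation and invert the logarithm to recover SPR.
    `base` is an assumption: the slides do not say which logarithm was used.
    max_temp and rainfall must be positive for the logarithm to exist."""
    lt = math.log(max_temp, base)
    lr = math.log(rainfall, base)
    log_spr = (-26.492
               + 5.343e-21 * population ** 3
               + 22.008 * lt
               - 1.652 * lt ** 4
               + 1.097e-02 * min_temp
               - 5.24e-07 * min_temp ** 3
               - 5.03e-02 * lr
               + 9.128e-02 * lr ** 2
               - 2.45e-02 * lr ** 3)
    return log_spr, base ** log_spr
```

The function returns both the predicted log(SPR) and the back-transformed SPR, so the exponentiation step in the note is explicit.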

Results:
• The model gave an R² of 44.16% and an adjusted R² of 42.51%, which indicates that the model's degree of prediction should not vary much when it is applied to a new data set of a similar kind.
• ANOVA gives an F value of 2.720 (F(sig) = 0.024), which is significant at the 5% level and indicates that the regression explains the data well.
• The 95% confidence intervals over all 36 time points contain the predicted responses, indicating significantly correct response prediction (for the process, see slide 17).
• The forecasted value for Jan ’05 also falls in the 95% C.I. of the response, indicating a fairly good degree of forecast by the model.
FIG 7: PLOT OF FIT OF PREDICTED VALUES OF MODEL AGAINST OBSERVED VALUES

Legends:
• Dashed series with circular markers: predicted values
• Straight line with square markers: observed values
• Green circle: forecasted value
• Light blue square: observed value
• The error bars are constructed according to 95% confidence intervals of predicted values.
Testing the model predictions:
Let a general regression model be:

$$Y_i = \alpha + \sum_{j} X_{ij}\,\beta_j + \varepsilon_i; \quad i = 1(1)m,\; j = 1(1)n$$

n = number of independent variables
m = number of data points
$S = \sum_i r_i^2$, where $r_i$ are the residuals

Then, by the Gauss-Markov GLM,

$$\hat{\beta} = (X^T X)^{-1} X^T Y \quad\text{and}\quad \hat{\sigma}^2 = \frac{S}{m-n}$$

Given independent variable values $x_{dj}$ for a given time point d, write $Z = (1, x_{d1}, \ldots, x_{dn})$. The $100(1-\alpha)\%$ C.I. for a predicted response at level $\alpha$ is

$$Z^T\hat{\beta} \pm t_{\alpha/2;\,(m-n)}\;\hat{\sigma}\,\sqrt{1 + Z^T (X^T X)^{-1} Z}$$

where $X = ((X_{ij}))$, $j = 1(1)n$, $i = 1(1)m$, is the matrix of independent variables.
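A sketch of this interval, assuming NumPy. Two readings are assumptions of the sketch: the design matrix carries a leading column of ones for the intercept, and the residual degrees of freedom are taken as the number of data points minus the number of fitted coefficients (intercept included). The critical t value is supplied by the caller from tables; the function name and the small data set are hypothetical.

```python
import numpy as np

def prediction_interval(X, y, x_new, t_crit):
    """X: (m, n) predictors; y: (m,) responses; x_new: (n,) predictors at the
    new time point; t_crit: upper alpha/2 point of Student's t with the
    residual degrees of freedom, taken from tables by the caller."""
    m = len(y)
    A = np.column_stack([np.ones(m), X])        # design matrix with intercept
    xtx_inv = np.linalg.inv(A.T @ A)
    beta = xtx_inv @ A.T @ y                    # least-squares estimate
    resid = y - A @ beta
    dof = m - A.shape[1]
    sigma = np.sqrt(resid @ resid / dof)
    z = np.concatenate([[1.0], np.atleast_1d(x_new)])
    centre = z @ beta
    half = t_crit * sigma * np.sqrt(1.0 + z @ xtx_inv @ z)
    return centre - half, centre + half

# Hypothetical data with one predictor, for illustration only.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
# 3.182 is the tabulated upper 2.5% point of Student's t with 3 degrees of freedom.
low, high = prediction_interval(X, y, x_new=[6.0], t_crit=3.182)
```

An observed value at the new time point that falls between `low` and `high` is consistent with the model at the chosen level.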
Analysis II-Total P.V. deaths in Chennai city
• We had data for 10 months on total deaths due to Plasmodium vivax, scaled by total population (this is our dependent variable), in Chennai, and we wanted to see the credibility of the model's forecasts for the next 2 time points.
• We demonstrate here that our method can work equally well for smaller data sets, for which normality may not be assumed as readily as for the earlier data set of 36 time points, which had a good chance of tending towards normality.
• In this analysis, with the same process, variable selection yielded:
  - Minimum Temperature
  - Maximum Temperature
  - Maximum Humidity
• The initial relationships were studied through scatter plots and the model was then refined by the multi-step procedure.
Fig 8: A 6th order polynomial explains 85.8% of the variability between P.V. deaths (scaled) and Min Temp.
Fig 9: A 4th order polynomial explains 30.6% of the variability between scaled P.V. deaths and Max Temp.
Fig 10: A 3rd order polynomial explains 61.8% of the relationship


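The final model equation was:

$$
\begin{aligned}
\text{(Scaled) PV deaths} = {} & 2.309\times10^{-4} + 8.313\times10^{-6}\,(\text{min temp}) + 7.375\times10^{-13}\,(\text{min temp})^{\ldots} \\
& - 8.72\times10^{-7}\,(\text{max hum}) - 5.81\times10^{-11}\,(\text{max hum})^{\ldots} - 2.98\times10^{-10}\,(\text{max temp})^{4}
\end{aligned}
$$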

NOTE: After the predictions for scaled P.V. deaths are obtained from the model, the values must be multiplied by the total population to obtain total P.V. deaths.
Results:
• The model gave a coefficient of determination of 82.565% and an adjusted R² of 60.77%, which as before indicates how well the model should perform when applied to a new data set.
• ANOVA indicates an F value of 3.78 for the regression with a significance value of 0.11, so the regression is significant at roughly the 11% level rather than at the 5% level.
• The 95% C.I. over the 10 time points include all the predicted P.V. deaths (except 1 value), and the observed values also lie in the C.I., thereby validating the confidence intervals at level 0.05.
• The 95% C.I. contains both the forecasted values of November and December.
FIG 11: PLOT OF FIT OF PREDICTED VALUES OF MODEL AGAINST OBSERVED VALUES.

Legends:
• Dashed series with circular markers: predicted values
• Straight line with square markers: observed values
• Green circle: forecasted value
• Blue square: observed values in forecasting horizon
• The error bars are constructed according to 95% confidence intervals of predicted values.
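The summary statistics reported for this analysis can be cross-checked with a few lines of arithmetic. The sketch below assumes n = 10 fitted time points and p = 5 terms in the final equation, and uses the usual identities linking R², adjusted R² and the regression F statistic; the function names are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 from R^2, n data points and p independent terms."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def f_statistic(r2, n, p):
    """Regression F statistic expressed through R^2."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

r2, n, p = 0.82565, 10, 5        # reported R^2; n and p as assumed above
adj = adjusted_r2(r2, n, p)
f_val = f_statistic(r2, n, p)
```

Under these assumptions `adj` comes out at about 0.6077 and `f_val` at about 3.79, in line with the reported adjusted R² of 60.77% and F value of 3.78.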
A special study of individual zones
To analyze the zones separately we would need to construct a model for each zone. However, that is cumbersome and may be redundant.

We hence did a Factor Analysis (FA) on the deaths due to P.V. (scaled by population) in the ten zones together with the climatic factors, using Principal Component Analysis as the extraction method, to reduce the large data set of intercorrelated variables to a smaller and interpretable set of ‘factors’. The data set was normalized before the analysis.

Subsequently we did a hierarchical Cluster Analysis to reconfirm the results from FA.

The clustering of the variables obtained from this was used to frame models with our methodology, and the predictions were tested through the same method as before to yield the fits shown later.
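The grouping step can be illustrated with a simplified stand-in, assuming NumPy: principal component loadings computed from the correlation matrix of the normalized variables, with each variable assigned to the component on which it loads most. This is not the SPSS factor analysis used for the slides (no rotation is applied and the hierarchical clustering step is not shown), and the data below are synthetic placeholders.

```python
import numpy as np

def pca_loadings(data, n_components=2):
    """data: (time points, variables). Returns (loadings, eigenvalues) for the
    leading principal components of the correlation matrix."""
    data = np.asarray(data, dtype=float)
    # Normalise each variable to zero mean and unit variance.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    corr = np.corrcoef(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    return loadings, eigvals[order]

# Synthetic example (NOT the Chennai data): variables 0-2 follow one hidden
# driver, variables 3-4 follow another.
rng = np.random.default_rng(1)
a, b = rng.normal(size=(2, 200))
noise = 0.2 * rng.normal(size=(200, 5))
data = np.column_stack([a, a, a, b, b]) + noise
loadings, eigvals = pca_loadings(data)
groups = np.abs(loadings).argmax(axis=1)   # component on which each variable loads most
```

Variables that share a component in `groups` are candidates for being modeled together, which is the role the clusters play in the zone-wise models.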

The Dendrogram in cluster analysis looks as below from which two


Where
HT - highest temperature,
TL- lowest temperature,
LH - lowest humidity,
HH -highest humidity,
TR- total rainfall,
Z1-Z10 - scaled values of P.V. deaths over the 10 zones of Chennai city.
CLUSTER 1-TOT DEATHS OF ZONES 2,3,5,6,7,8-FITTED

Dashed series with circular markers-predicted
Straight line with square markers-observed
Error bars constructed on the basis of 95% C.I. of predicted values
CLUSTER 2-DEATHS OF ZONE 1-FITTED

Dashed series with circular markers-predicted
Straight line with square markers-observed
Error bars constructed on the basis of 95% C.I. of predicted values
CONCLUSION :
• The method yielded good fits to both large and small data sets, with equally apt predictions, and for two different formats of incidence of the disease: one with SPR values and the other with deaths.
• The response predictions of our model lie, in the majority of cases, in the 95% C.I. of the response predictions; the tests that we have applied to the predicted responses have their basis in the most general format of linear models, the Gauss-Markov models.
• The method may well be applicable to still larger data sets with several other variables such as socio-economic factors, geographical influences, etc., and one may follow it to establish a regression model for future predictions of malaria incidence. As we have also shown, using methods like FA and cluster analysis and then adopting our method yields predictions of equal caliber.
Statistics is like clay in the hands of a sculptor, he can
create a God but the demon is waiting to be created.
