Data Mining Final Report


UNIVERSITI UTARA MALAYSIA


COLLEGE OF ARTS AND SCIENCES
SQIT 3033
KNOWLEDGE ACQUISITION IN DECISION
MAKING
GROUP A
TITLE: MEDICAL DATA (GROUP PROJECT)

LECTURER NAME:
DR. IZWAN NIZAL MOHD SHAHARANEE

STUDENT NAME :
KANAGAMBAL D/O SUBRAMANIAM (211619)
KOGILAVANI D/O THIRUMALAIRAJAN (211292)
RENUGAA D/O S.MUTHANAGOPAL (211473)



CONTENTS

NO  TITLE
1.  1.0: INTRODUCTION
2.  2.0: PROBLEM STATEMENT
    2.1: OBJECTIVES
    2.2: JUSTIFY THE USE OF SPECIFIC SOLUTION TECHNIQUES OR PROBLEM SOLVING PROCEDURES IN YOUR WORK
3.  3.0: RESEARCH METHODOLOGY
    3.1: DATA MINING TECHNIQUE TO SOLVE THE PROBLEM
    3.2: STEPS INVOLVED IN SAS ENTERPRISE MINER
    3.3: ANALYSIS OF DATA
4.  4.0: CONCLUSION
5.  5.0: REFERENCES

1.0: INTRODUCTION

Medicine is the field concerned with the science and practice of healing, and a medical examination assesses a person's state of physical health or fitness. Health care is very important for human beings to live a long life, and health factors play a major part in deciding the human life span. Many factors are involved in determining the life cycle of human beings, and various types of diseases are identified as major contributors to whether the patients diagnosed with them live or die.
Gary Null, PhD; Carolyn Dean, MD, ND; Martin Feldman, MD; Debora Rasio, MD; and Dorothy Smith, PhD state that a group of researchers meticulously reviewed the statistical evidence and that their findings are shocking. These researchers authored a paper titled Death by Medicine that presents compelling evidence that today's system frequently causes more harm than good. This fully referenced report shows the number of people having in-hospital adverse reactions to prescribed drugs to be 2.2 million per year. The number of unnecessary antibiotics prescribed annually for viral infections is 20 million, the number of unnecessary medical and surgical procedures performed annually is 7.5 million, and the number of people exposed to unnecessary hospitalization annually is 8.9 million.
The most stunning statistic, however, is that the total number of deaths caused by conventional medicine is an astounding 783,936 per year. It is now evident that the American medical system is the leading cause of death and injury in the US. (By contrast, the number of deaths attributable to heart disease in 2001 was 699,697, while the number of deaths attributable to cancer was 553,251.) By exposing these statistics in painstaking detail, the authors provide a basis for competent and compassionate medical professionals to recognize the inadequacies of today's system and at least attempt to institute meaningful reforms.
Medicine and medicines are closely related, and both are of great importance to human life. The medical field therefore has to be more conscious about identifying diseases at the right time and prescribing the right medicine to cure the illness. This can help people achieve a better lifestyle and a longer, more secure life.

2.0: PROBLEM STATEMENT
In this modern age, huge developments and improvements have been identified, and a great deal of research has been done on medicines for illnesses in order to increase patients' lifetimes. Researchers frequently modify medicines and always want to improve them for a better cure and a longer life. However, unpredictable things still happen to patients, who may live or die; even as researchers find more alternatives for curing diseases, the percentage of people who survive is still decreasing.
So, here we are going to carry out a study based on a secondary medical data set that records various causes for the life or death of the patients. The variables are status, death cause, age at coronary heart disease diagnosis, sex, age at start, height, weight, diastolic, systolic, MRW, smoking, age at death, cholesterol, cholesterol status, BP status, weight status and smoking status.
Status is our target variable, and it indicates whether the patient is alive or dead. The second variable is death cause, which is categorized by chronic disease: cancer, coronary heart disease, cerebral vascular disease, others and unknown causes.
The third is the age at coronary heart disease diagnosis, which ranges between 32 and 90 years old. The fourth input is sex, representing male or female. The fifth is the age at which the illness starts, ranging from 28 to 62 years old. The sixth and seventh inputs are the height and weight of the patients, in the ranges 51.5-76.5 and 67-300 respectively.
The eighth and ninth inputs are the diastolic and systolic rates. When your heart beats, it contracts
and pushes blood through the arteries to the rest of your body. This force creates pressure on
the arteries. This is called systolic blood pressure. A normal systolic blood pressure is 120 or
below. A systolic blood pressure of 120-139 means you have normal blood pressure that is
higher than ideal or borderline high blood pressure. Even people with this level are at a
greater risk of developing heart disease. A systolic blood pressure number of 140 or higher,
on repeated measurements, is considered to be hypertension, or high blood pressure. The
diastolic blood pressure number or the bottom number indicates the pressure in the arteries
when the heart rests between beats. A normal diastolic blood pressure number is 80 or less. A
diastolic blood pressure between 80 and 89 is normal but higher than ideal. A diastolic blood
pressure number of 90 or higher, on repeated measurements, is considered to be hypertension
or high blood pressure.

The tenth input is MRW, ranging from 67 to 268. The eleventh input is the smoking rate of the patients. Next is the age at death, ranging from 36 to 93 years old. The next inputs are the cholesterol level and cholesterol status: the level ranges from 96 to 568, and the status is borderline, desirable or high. The next input is BP status, with the levels high, normal and optimal. The weight status of the patients is classified as underweight, normal or overweight, and the smoking status as non-smoker, light, moderate, heavy or very heavy, according to the weight and smoking data.
So, we are going to use this data set to classify the status of the patients, either dead or alive, using data mining techniques.















2.1: OBJECTIVES
1. To develop a classification model to determine the status of people, whether alive or dead, based on medical causes.
2. To identify the best model for determining the medical causes.
3. To identify the significant variables in determining whether a person is alive.

















2.2: JUSTIFY THE USE OF SPECIFIC SOLUTION TECHNIQUES OR
PROBLEM SOLVING PROCEDURES IN YOUR WORK
Decision tree
A decision tree is a tree-shaped structure that represents a set of decisions or a prediction of data trends. It is suitable for describing sequences of interrelated decisions or predicting future data trends, and it has the capability to classify entities into specific classes based on the features of those entities. Each tree consists of three types of nodes: the root node, internal nodes and terminal nodes (leaves). The topmost node is the root node, and it represents all of the rows in the dataset. Nodes with child nodes are internal nodes, while nodes without child nodes are called terminal nodes or leaves. A common algorithm for building a decision tree selects a subset of instances from the training data to construct an initial tree. The remaining training instances are then used to test the accuracy of the tree. If any instance is incorrectly classified, the instance is added to the current set of training data and the process is repeated. A main goal is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization. Decision trees have been successfully applied to real problems, are easy to understand, and map nicely to a set of production rules.
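To illustrate this procedure outside SAS Enterprise Miner, the following minimal Python sketch fits a decision tree to classify patient status; the file name medical.csv and the column selection are assumptions based on the variables described in Section 2.0, not part of our actual workflow.

# Illustrative sketch only (not the SAS Enterprise Miner flow used in this project).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

medical = pd.read_csv("medical.csv")                    # hypothetical file name
X = medical[["AgeAtStart", "Smoking", "Systolic"]]      # assumed input columns
y = medical["Status"]                                   # target: alive or dead

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

tree = DecisionTreeClassifier(max_depth=3)              # a shallow tree limits levels and nodes
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))

Limiting the depth of the tree reflects the goal mentioned above of minimizing the number of tree levels and nodes.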
Regression
The Regression node in Enterprise Miner does either linear or logistic regression depending
upon the measurement level of the target variable. Linear regression is done if the target
variable is an interval variable. In linear regression the model predicts the mean of the target
variable at the given values of the input variables. Logistic regression is done if the target
variable is a discrete variable. In logistic regression the model predicts the probability of a
particular level(s) of the target variable at the given values of the input variables. Because the
predictions are probabilities, which are bounded by 0 and 1 and are not linear in this space,
the probabilities must be transformed in order to be adequately modelled. The most common
transformation for a binary target is the logit transformation. Probit and complementary log-
log transformations are also available in the regression node. There are three variable
selection methods available in the Regression node of Enterprise Miner. Forward first selects
the best one-variable model. Then it selects the best two variables among those that contain
the first selected variable. This process continues until it reaches the point where no
additional variables have a p-value less than the specified entry p-value.

Backward starts with the full model. Next, the variable that is least significant, given the
other variables, is removed from the model. This process continues until all of the remaining
variables have a p-value less than the specified stay p-value. Stepwise is a modification of the
forward selection method. The difference is that variables already in the model do not
necessarily stay there. After each variable is entered into the model, this method looks at all
the variables already included in the model and deletes any variable that is not significant at
the specified level. The process ends when none of the variables outside the model has a p-
value less than the specified entry value and every variable in the model is significant at the
specified stay value.
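The forward-selection idea can be sketched in Python as follows; this is only an illustration of the entry p-value rule described above, and the file name, column names and 0.05 entry p-value are assumptions, not settings taken from our project.

# Sketch of p-value-based forward selection for a binary logistic model (illustration only).
import pandas as pd
import statsmodels.api as sm

medical = pd.read_csv("medical.csv")                               # hypothetical file name
X = medical[["AgeAtStart", "Smoking", "Systolic", "Diastolic"]]    # assumed candidate inputs
y = (medical["Status"] == "Dead").astype(int)                      # assumed coding of the target

entry_p = 0.05
selected = []
while True:
    remaining = [c for c in X.columns if c not in selected]
    if not remaining:
        break
    # Fit one extra candidate variable at a time and record its p-value.
    pvals = {}
    for c in remaining:
        fit = sm.Logit(y, sm.add_constant(X[selected + [c]])).fit(disp=0)
        pvals[c] = fit.pvalues[c]
    best = min(pvals, key=pvals.get)
    if pvals[best] >= entry_p:          # stop: no candidate meets the entry p-value
        break
    selected.append(best)
print("Selected variables:", selected)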
Neural Network
An artificial neural network is a network of many simple processors ("units"), each possibly
having a small amount of local memory. The units are connected by communication channels
("connections") that usually carry numeric (as opposed to symbolic) data encoded by various
means. The units operate only on their local data and on the inputs they receive via the
connections. The restriction to local operations is often relaxed during training. More
specifically, neural networks are a class of flexible, nonlinear regression models, discriminant
models, and data reduction models that are interconnected in a nonlinear dynamic system.
Neural networks are useful tools for interrogating increasing volumes of data and for learning
from examples to find patterns in data. By detecting complex nonlinear relationships in data,
neural networks can help make accurate predictions about real-world problems.
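As a rough analogue outside SAS, the sketch below fits a feed-forward network with a single hidden layer of three units (the Enterprise Miner default described later in this report); the file and column names are assumptions.

# Minimal feed-forward neural network sketch (not the SAS implementation).
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

medical = pd.read_csv("medical.csv")                               # hypothetical file name
X = medical[["AgeAtStart", "Smoking", "Systolic", "Weight"]]       # assumed inputs
y = medical["Status"]                                              # alive / dead target

# Standardize the inputs, then fit one hidden layer with three neurons.
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=1))
net.fit(X, y)
print("Training accuracy:", net.score(X, y))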









3.0: RESEARCH METHODOLOGY
In this project, our group conducted a study to gather information about patients who are alive or dead with different types of medical causes. The causes include cancer, cerebral vascular disease, coronary heart disease, others and unknown.
After gathering the data, we process it using the KDD process. The KDD process is a method used to uncover hidden information in a database; it helps to convert unknown or hidden patterns into a useful, understandable and informative form. The KDD process has five stages: selection, preprocessing, transformation, data mining, and interpretation & evaluation.

a) Selection
Data selection is defined as the process of determining the appropriate data
type and source, as well as suitable instruments to collect data. Data selection
precedes the actual practice of data collection. The data is obtained from the
UCI Machine Learning Repository.

b) Pre-processing
Data pre-processing is a data mining technique that involves transforming raw data
into an understandable format. Real-world data is often incomplete, inconsistent,
and/or lacking in certain behaviours or trends, and is likely to contain many errors.
Data pre-processing is a proven method of resolving such issues. Data pre-processing can be categorized into a few methods:

Data cleaning
Data cleaning is a process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. The purpose of data cleaning is to identify incomplete, incorrect, inaccurate or irrelevant parts of the data and then replace, modify, or delete this dirty data. Data cleaning can handle incomplete, noisy and inconsistent data.
o Incomplete data refers to missing values, which happen due to improper data collection methods. During data collection, there may be no recorded value for a certain attribute, causing the data to be incomplete. To overcome this problem, we can use the mean value, estimate the probable value using regression, use a constant value, or ignore the missing record. For example, a missing cholesterol reading could be replaced with the mean of the recorded values (see the sketch after this list).
o Noisy data is random error or variance in the data. This happens due to corrupted data transmission or technological limitations. While entering data into software such as SPSS or SAS, we may key in wrong values, and this causes noisy data. To solve this problem, we can use a binning method or an outlier removal method.
o Inconsistent data means the data contains replicated or possibly redundant records. The method to overcome this problem is removing the redundant or replicated data.
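A minimal pandas sketch of these three cleaning ideas is shown below; the file name, column names and thresholds are assumptions used only for illustration.

# Illustrative cleaning sketch (assumed file, columns and thresholds).
import pandas as pd

df = pd.read_csv("medical.csv")                                    # hypothetical raw data

# Incomplete data: replace missing interval values with the column mean.
df["Cholesterol"] = df["Cholesterol"].fillna(df["Cholesterol"].mean())

# Noisy data: a simple outlier-removal step that drops implausible readings.
df = df[(df["Systolic"] > 60) & (df["Systolic"] < 300)]

# Inconsistent data: drop replicated / redundant records.
df = df.drop_duplicates()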

Data integration involves combining data residing in different sources and providing users with a unified view of these data. Data comes from different sources with different naming standards, which can cause inconsistencies and redundancies. There are several ways to handle this problem:
- Consolidate the different sources into one repository (using metadata).
- Correlation analysis (measure the strength of the relationship between different attributes).

Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form. The basic concept is the reduction of multitudinous amounts of data down to the meaningful parts. This increases efficiency by reducing a huge data set into a smaller representation. Several techniques can be used in data reduction, such as data cube aggregation, dimension reduction, data compression and discretization.
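As a small illustration of integration followed by reduction (with assumed source files, an assumed PatientID key and an arbitrary correlation threshold), two sources could be combined and a redundant attribute dropped as follows.

# Sketch: combine two assumed sources, then reduce by dropping a redundant attribute.
import pandas as pd

exam = pd.read_csv("exam.csv")            # hypothetical source 1: physical measurements
history = pd.read_csv("history.csv")      # hypothetical source 2: smoking history

# Integration: a unified view of both sources keyed on an assumed patient identifier.
merged = exam.merge(history, on="PatientID", how="inner")

# Correlation analysis: if two attributes are nearly perfectly correlated, keep only one.
if merged["Weight"].corr(merged["MRW"]) > 0.95:
    merged = merged.drop(columns=["MRW"])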




In our medical data, we used data integration by eliminating the unrelated variables from our data.

c) Transformation
The transformation process, which is also known as data normalization, basically re-scales the data into a suitable range. This process is important because it can increase the processing speed and reduce the memory allocation. There are several transformation methods:
- Decimal Scaling
- Min-Max
- Z-Score
- Logarithmic Normalization
We chose Min-Max normalization for our data. Min-Max normalization is a linear transformation of the original input into a newly specified range. The formula used is given below.
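The original figure showing the formula is not reproduced here; the standard Min-Max formula, which rescales a value X of an attribute into a new range [new_min, new_max], is:

X' = ((X - X_min) / (X_max - X_min)) * (new_max - new_min) + new_min

where X_min and X_max are the minimum and maximum values of the original attribute.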






d) Data Mining
Data mining is the use of algorithms to extract information and patterns as part of the KDD process. This step applies algorithms to the transformed data to generate the desired results. In this project, we use SAS Enterprise Miner to build the models to be compared. In SAS Enterprise Miner, one of the methods we use is the decision tree, which was described in Section 2.2: a tree-shaped structure of root, internal and terminal nodes that classifies entities into specific classes based on their features. The resulting trees are usually quite understandable and can easily be used to obtain a better understanding of the phenomenon in question.










DATA VARIABLES
The table of data variables (not reproduced here) lists the inputs described in Section 2.0; our target for prediction is the Status variable.


e) Interpretation & evaluation process
In the interpretation & evaluation process, some data mining output is in a format that is not human-understandable, and interpretation is needed for better understanding. So, we convert the output into an easily understood medium.





3.2: STEPS INVOLVED IN SAS ENTERPRISE MINER
I. Firstly, we open SAS Enterprise Miner. Then we click File and create a new project: from the SAS menu bar, select File -> New -> Project. We name our project GroupProject.

II. Click Create. The GroupProject project opens with an initial untitled diagram, which we name Data.










The SAS Enterprise Miner window contains the following interface components:

Project Navigator enables you to manage projects and diagrams, add tools to the
Diagram Workspace, and view HTML reports that are created by the Reporter node.
Note that when a tool is added to the Diagram Workspace, the tool is referred to as a
node. The Project Navigator has three tabs:
Diagrams tab - lists the current project and the diagrams within the project. By
default, the project window opens with the Diagrams tab activated.
Tools tab - contains the Enterprise Miner tools palette. This tab enables you to see
all of the tools (or nodes) that are available in Enterprise Miner. The tools are
grouped according to the SEMMA data-mining methodology. Many of the
commonly used tools are shown on the Tools Bar at the top of the window. You
can add additional tools to the Tools Bar by dragging them from the Tools tab
onto the Tools Bar. In addition, you can rearrange the tools on the Tools Bar by
dragging each tool to a new location on the Tools Bar.
Reports tab - displays the HTML reports that are generated by using the Reporter
node.
Diagram Workspace - enables you to build, edit, run, and save process flow diagrams.
Tools Bar - contains a customizable subset of Enterprise Miner tools that are commonly
used to build process flow diagrams in the Diagram Workspace. You can add or delete
tools from the Tools Bar.

Progress Indicator - displays a progress indicator bar that indicates the execution status
of an Enterprise Miner task.
Message Panel - displays messages about the execution of an Enterprise Miner task.
Connection Status Indicator - displays the remote host name and indicates whether the
connection is active for a client/server project.

III. The Sample Nodes that we used in our Group Project:
a) Input Data Source

The Input Data Source node reads data sources and defines their attributes for later
processing by Enterprise Miner. This node can perform various tasks:
Access SAS data sets and data marts. Data marts can be defined by using the SAS Data
Warehouse Administrator, and they can be set up for Enterprise Miner by using the
Enterprise Miner Warehouse Add-ins.
Automatically create a metadata sample for each variable when you import a dataset with
the Input Data Source node. By default, Enterprise Miner obtains the metadata sample by
taking a random sample of 2,000 observations from the data set that is identified in the
Input Data Source. Optionally, you can request larger samples. If the data is smaller than
2,000 observations, the entire data set is used.
Use the metadata sample to set initial values for the measurement level and the model
role for each variable. You can change these values if you are not satisfied with the
automatic selections that are made by the node.
Display summary statistics for interval and class variables.
Define target profiles for each target in the input data set.






b) Data Partition

The Data Partition node enables you to partition data sets into training, test, and validation
data sets. The training data set is used for preliminary model fitting. The validation data set is
used to monitor and tune the model weights during estimation and is also used for model
assessment. The test data set is an additional data set that you can use for model assessment.
This node uses simple random sampling, stratified random sampling, or a user-defined
partition to create training, test, or validation data sets. Specify a user-defined partition if you
have determined which observations should be assigned to the training, validation, or test
data sets. This assignment is identified by a categorical variable that is in the raw data set.

c) Replacement

The Replacement node enables you to impute values for observations that have missing
values. You can replace missing values for interval variables with the mean, median,
midrange, mid-minimum spacing, or distribution-based replacement, or you can use a
replacement M-estimator such as Tukey's biweight, Huber's, or Andrews' wave. You can
also estimate the replacement values for each interval input by using a tree-based imputation
method. Missing values for class variables can be replaced with the most frequently occurring
value, distribution-based replacement, tree-based imputation, or a constant.





d) Transform Variables

The Transform Variables node enables you to transform variables. For example, you can
transform variables by taking the square root of a variable, by taking the natural logarithm,
maximizing the correlation with the target, or normalizing a variable. Additionally, the node
supports user-defined formulas for transformations and enables you to group interval-valued
variables into buckets or quantiles. This node also automatically places interval variables into
buckets by using a decision tree-based algorithm. Transforming variables to similar scale and
variability may improve the fit of models and, subsequently, the classification and prediction
precision of fitted models.
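A rough pandas equivalent of some of these transformations is sketched below; the file and column names are assumptions and do not reflect the exact transformations chosen in our flow.

# Sketch of typical variable transformations (assumed file and columns).
import numpy as np
import pandas as pd

medical = pd.read_csv("medical.csv")                         # hypothetical file name

medical["log_weight"] = np.log(medical["Weight"])            # natural logarithm
medical["sqrt_chol"] = np.sqrt(medical["Cholesterol"])       # square root
medical["mrw_quartile"] = pd.qcut(medical["MRW"], q=4)       # group an interval variable into quantile buckets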

e) Regression

The Regression node enables you to fit both linear and logistic regression models to your
data. You can use both continuous and discrete variables as inputs. The node supports the
stepwise, forward, and backward-selection methods. A point-and-click interaction builder
enables you to create higher-order modelling terms









f) Decision Tree

The Tree node enables you to perform multi-way splitting of your database, based on
nominal, ordinal, and continuous variables. This is the SAS implementation of decision trees,
which represents a hybrid of the best of CHAID, CART, and C4.5 algorithms. The node
supports both automatic and interactive training. When you run the Tree node in automatic
mode, it automatically ranks the input variables by the strength of their contribution to the
tree. This ranking can be used to select variables for use in subsequent modelling. In addition,
dummy variables can be generated for use in subsequent modelling. Using interactive
training, you can override any automatic step by defining a splitting rule or by pruning a node
or subtree.

g) Neural Network

The Neural Network node enables you to construct, train, and validate multilayer feed-
forward neural networks. By default, the Neural Network node automatically constructs a
multilayer feed-forward network that has one hidden layer consisting of three neurons. In
general, each input is fully connected to the first hidden layer, each hidden layer is fully
connected to the next hidden layer, and the last hidden layer is fully connected to the output.
The Neural Network node supports many variations of this general form.





h) Assessment

The Assessment node provides a common framework for comparing models and predictions
from any of the modelling nodes (Regression, Tree, Neural Network, and User Defined
Model nodes). The comparison is based on the expected and actual profits or losses that
would result from implementing the model. The node produces the following charts that help
to describe the usefulness of the model: lift, profit, return on investment, receiver operating
curves, diagnostic charts, and threshold-based charts. The Reporter node assembles the
results from a process flow analysis into an HTML report that can be viewed with a Web
browser. Each report contains header information, an image of the process flow diagram, and
a separate report for each node in the flow including node settings and results. Reports are
managed in the Reports tab of the Project Navigator.
IV) After that, we drag and drop an Input Data Source node onto the workspace. We use the Input Data Source node to access the medical data set. After opening the data source node, we choose the data set EMDATA.MEDICAL.


Next, we continue by setting the model roles in the Input Data Source node. We assign Status as the target; the Status variable becomes the target variable, and its measurement level is binary. The cholesterol status, BP status, weight status and smoking status variables are measured as nominal. The unwanted variables such as Deathcause, Ageatdeath and Agechodiag will be rejected, as they do not contribute to the model.



Using the Cursor
The shape of the cursor changes depending on where it is positioned. The behavior of the
mouse commands depends on the shape of the cursor as well as on the selection state of the
node over which the cursor is positioned. Right-click in an open area to see the pop-up menu
as shown below. You can connect the node where the cursor is positioned (beginning node)
to any other node (ending node) as follows:
i. Ensure that the beginning node is deselected. It is much easier to drag a line when the
node is deselected. If the beginning node is selected, click in an open area of the
workspace to deselect it.
ii. Position the cursor on the edge of the icon that represents the beginning node (until
the cross-hair appears).
iii. Press the left mouse button and immediately begin to drag in the direction of the
ending node. Note: If you do not begin dragging immediately after pressing the left
mouse button, you will only select the node. Dragging a selected node will generally
result in moving the node (that is, no line will form).
iv. Release the mouse button after you reach the edge of the icon that represents the
ending node.
v. Click away from the arrow. Initially, the connection will appear as follows. After you
click away from the line in an open area of the workspace, the finished arrow forms.




Select View distribution to see the distribution of values for Status in the metadata sample. A distribution chart is shown.

We need to investigate the number of levels, the percentage of missing values, and the sort order of each variable. Status is a binary target with two levels, and one of these levels is modelled as the target event. Close the Input Data Source node, and save changes when you are prompted.





V) After that, we use the Data Partition node to partition the MEDICAL data set into training, test and validation sets. The training data is used for preliminary model fitting, the validation data is used to tune the model weights during estimation, and the test data is used for model assessment. We set the percentage of training data to 70% and test data to 30%, and leave validation empty.
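Outside Enterprise Miner, the same 70%/30% partition could be sketched as follows; stratifying on the target keeps a similar alive/dead mix in both sets (the file name is an assumption).

# Sketch of a 70% training / 30% test partition (illustration only).
import pandas as pd
from sklearn.model_selection import train_test_split

medical = pd.read_csv("medical.csv")                         # hypothetical file name
X = medical.drop(columns=["Status"])                         # all inputs
y = medical["Status"]                                        # target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=1)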










VI) Add a Regression node. Then, connect the Data Partition node to the Regression node.

The Regression node fits models for interval, ordinal, nominal, and binary targets. Since we selected a binary variable (Status) as the target in the Input Data Source node (EMDATA.MEDICALDATA), the Regression node will fit (by default) a binary logistic regression model with main effects for each input variable.



Stepwise here refers to a modification of the forward selection method. The difference is that variables already in the model do not necessarily stay there. After each variable is entered into the model, this method looks at all the variables already included in the model and deletes any variable that is not significant at the specified level. The process ends when none of the variables outside the model has a p-value less than the specified entry value and every variable in the model is significant at the specified stay value.
The regression process is then repeated with the method set to Forward. Forward first selects the best one-variable model. Then it selects the best two variables among those that contain the first selected variable. This process continues until it reaches the point where no additional variables have a p-value less than the specified entry p-value.
Different from the others, the Backward method starts with the full model. Next, the variable that is least significant, given the other variables, is removed from the model. This process continues until all of the remaining variables have a p-value less than the specified stay p-value.
stay p-value.













By default, the node uses Deviation coding for categorical input variables. Right-click the
Regression node and select Run. When the run is complete, click Yes to view the results. The
Estimates tab in the Regression Results Browser displays bar charts of effect T-scores and
parameter estimates.



     VARIABLE NAME                  EFFECT T-SCORE
X1   AgeAtStart                     19.9558
X2   Smoking                         9.3465
X3   Systolic                        5.4062
X4   Sex = Female                   -4.8960
X5   Height                         -4.1090
X6   Weight                          3.6369
X7   MRW                            -3.5919
X8   Cholesterol                     1.9340
X9   Intercept (Status = DEAD)       1.7756
X10  Diastolic                       1.3250



The T-scores are plotted (from left to right) in decreasing order of their absolute values. The
higher the absolute value is, the more important the variable is in the regression model. In this
data, the variables X1=Ageatstart, X2= Smoking and X3= Systolic are the most important
model predictors.
Next, right-click the Regression node in the Diagram Workspace and select Model Manager. In the Model Manager, select Tools -> Lift Chart. A cumulative % Response chart appears. By default, this chart arranges observations into deciles based on their predicted probability of response, and then plots the actual percentage of respondents.

In the lift chart, the individuals are sorted in descending order of their predicted probability of the target event (death). The plotted values are the cumulative actual percentages of that event. If the model is useful, the proportion of individuals with the event will be relatively high in the top deciles and the plotted curve will be decreasing. In this case, the default regression is not useful. Applying a default regression model directly to the training data set is not appropriate here, because regression models ignore observations that have a missing value for at least one input variable. We should consider performing imputation before fitting a regression model. In Enterprise Miner, we can use the Replacement node to perform imputation.
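The cumulative % response calculation behind such a lift chart can be sketched as follows; the synthetic scores and outcomes are only placeholders for a model's predicted probabilities and the observed events.

# Sketch: cumulative % response by decile (synthetic example data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
scores = rng.random(200)                        # placeholder predicted probabilities
events = rng.binomial(1, scores)                # placeholder observed outcomes (1 = event)

lift = pd.DataFrame({"score": scores, "event": events})
# Decile 1 holds the 10% of observations with the highest predicted probability.
lift["decile"] = pd.qcut(lift["score"].rank(method="first", ascending=False),
                         10, labels=False) + 1

by_decile = lift.groupby("decile")["event"].agg(["sum", "count"])
cum_pct_response = 100 * by_decile["sum"].cumsum() / by_decile["count"].cumsum()
print(cum_pct_response)                         # should decrease if the model is useful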



VII) Add an Insight node and connect it to the Data Partition node. Run the flow from the Insight node by right-clicking the Insight node and selecting Run. Select Yes when you are prompted to see the results. The output is shown below.


VIII) Add a Replacement node. This allows you to replace missing values for each variable. This replacement is necessary in order to use all of the observations in the training data set when you build a regression or neural network model. By default, Enterprise Miner uses a sample from the training data set to select the values for replacement. The following statements are true:
- Observations that have a missing value for an interval variable have the missing value replaced with the mean of the sample for the corresponding variable.
- Observations that have a missing value for a binary, nominal, or ordinal variable have the missing value replaced with the most commonly occurring non-missing level of the corresponding variable in the sample.
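A minimal sketch of this default behaviour outside SAS (mean for interval variables, most frequent level for class variables) might look as follows; the file and column names are assumptions.

# Sketch: mean imputation for interval inputs, most-frequent imputation for class inputs.
import pandas as pd
from sklearn.impute import SimpleImputer

medical = pd.read_csv("medical.csv")                                   # hypothetical file name

interval_cols = ["Height", "Weight", "Cholesterol"]                    # assumed interval inputs
class_cols = ["BP_Status", "Smoking_Status"]                           # assumed class inputs

medical[interval_cols] = SimpleImputer(strategy="mean").fit_transform(medical[interval_cols])
medical[class_cols] = SimpleImputer(strategy="most_frequent").fit_transform(medical[class_cols])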

IX) Performing Variable Transformation
After you have viewed the results in Insight, it is clear that some input variables have highly skewed distributions. In highly skewed distributions, a small percentage of the points may have a great deal of influence. Sometimes, performing a transformation on an input variable can yield a better-fitting model. In order to do that, we add a Transform Variables node as shown below.






After connecting the node, open it by right-clicking on it and selecting Open. The Variables tab is shown by default; it displays statistics for the interval-level variables, including the mean, standard deviation, skewness, and kurtosis (calculated from the metadata sample).
Open the Regression node. The Variables tab is active by default. Change the status of all input variables except the M_ variables to don't use. Close the Regression node and save the model. Run the flow from the Assessment node and select Yes to view the results. Create a lift chart for the stepwise regression model.

X) Add a default Tree node, connect the Data Partition node to the Tree node, and then connect the Tree node to the Assessment node. Decision trees handle missing values directly, while regression and neural network models ignore all incomplete observations (observations that have a missing value for one or more input variables). The flow should now appear like the following flow.









Results for the tree with two branches:


Lift Chart







Competing Splits View:




Results for the tree with three branches:







Lift Chart


Competing Split View:







XI) Add a default Neural Network node. Then, connect the Neural Network node to the Assessment node.

Run the flow from the Neural Network node. Select Yes when you are prompted to view the
results. The default Neural Network node fits a multilayer perceptron (MLP) model with no
direct connections, and the number of hidden layers is data dependent. In this case, the Neural
Network node fitted an MLP model with a single hidden layer. By default, the Tables tab in
the Neural Network Results Browser displays various statistics of the fitted model. Click the
Weights tab. The Weights tab displays the weights (parameter estimates) of the connections.
The following display shows the weights of the connections from each variable to the single
hidden layer. Each level of each status variable is also connected to the hidden layer. The
Neural Network node iteratively adjusts the weights of the connections to minimize the error
function.







The table below shows the neural network results for the medical data:






Weights:


The output contains weights for 43 variables. From the SAS output, we find that the highest value is 2.39012, from variable 43, which is H12.






4.0: CONCLUSION















By referring to the Assessment node and the Regression node, the misclassification rates for the training data set and the testing data set are as follows:
Tool                           Training   Testing
Decision Tree (2 branches)     0.24273    0.26807
Decision Tree (3 branches)     0.24495    0.28534
Neural Network                 0.24602    0.25655
Regression (Stepwise)          0.25534    0.25019
Regression (Forward)           0.25535    0.25016
Regression (Backward)          0.25151    0.25784

For the selected model, the misclassification rate in the training data set is 0.25535 and the misclassification rate in the testing data set is 0.25016.
The forward regression is classified as the best model because it has the lowest misclassification rate in the testing data set compared with the other models. It can also be said that it has the highest accuracy in the testing data set compared with the other models.
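These rates are simply one minus the classification accuracy on each partition; outside SAS, the same kind of figures could be checked with a short sketch like the one below (the file, columns and model choice are assumptions, not our Enterprise Miner settings).

# Sketch: misclassification rate = 1 - accuracy on training and test partitions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

medical = pd.read_csv("medical.csv")                        # hypothetical file name
X = medical[["AgeAtStart", "Smoking", "Systolic"]]          # assumed inputs
y = medical["Status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, random_state=1)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_error = 1 - accuracy_score(y_train, model.predict(X_train))
test_error = 1 - accuracy_score(y_test, model.predict(X_test))
print(f"Training misclassification: {train_error:.5f}  Testing: {test_error:.5f}")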







5.0: REFERENCES
http://en.wikipedia.org/wiki/Medicine
http://www.medicinenet.com/script/main/hp.asp
http://www.saedsayad.com/decision_tree.htm
http://www.sas.com/technologies/analytics/datamining/miner/
