This document outlines a group project on medical data conducted by three students at Universiti Utara Malaysia. It aims to develop a classification model to determine patients' status (alive or dead) based on medical causes using a secondary dataset containing information on patients such as age, sex, blood pressure, cholesterol levels, weight, and smoking status. Decision trees and regression techniques will be used to identify the best model for determining medical causes and significant variables for predicting survival. The objectives are to classify status, identify the best predictive model, and determine factors associated with being alive.
TABLE OF CONTENTS
1. 1.0: INTRODUCTION (3-4)
2. 2.0: PROBLEM STATEMENT (5-6)
   2.1: OBJECTIVES (7)
   2.2: JUSTIFY THE USE OF SPECIFIC SOLUTION TECHNIQUES OR PROBLEM SOLVING PROCEDURES IN YOUR WORK (8-9)
3. 3.0: RESEARCH METHODOLOGY
   3.1: DATA MINING TECHNIQUE TO SOLVE THE PROBLEM
   3.2: STEPS INVOLVED IN SAS ENTERPRISE MINER
   3.3: ANALYSIS OF DATA
1.0: INTRODUCTION
Medicine is the field relating to the science and practice of treating illness, and a medical examination assesses a person's state of physical health or fitness. Health care is essential for a long life, and health factors are a major part of what determines the human life span. Many factors are involved in deciding the life cycle of human beings, and various types of diseases have been identified as major contributors to whether the patients diagnosed with them live or die. Gary Null, PhD; Carolyn Dean, MD, ND; Martin Feldman, MD; Debora Rasio, MD; and Dorothy Smith, PhD state that a group of researchers meticulously reviewed the statistical evidence and that their findings are shocking. These researchers authored a paper titled Death by Medicine that presents compelling evidence that today's system frequently causes more harm than good. This fully referenced report shows the number of people having in-hospital adverse reactions to prescribed drugs to be 2.2 million per year. The number of unnecessary antibiotics prescribed annually for viral infections is 20 million. The number of unnecessary medical and surgical procedures performed annually is 7.5 million. The number of people exposed to unnecessary hospitalization annually is 8.9 million. The most stunning statistic, however, is that the total number of deaths caused by conventional medicine is an astounding 783,936 per year. It is now evident that the American medical system is the leading cause of death and injury in the US. (By contrast, the number of deaths attributable to heart disease in 2001 was 699,697, while the number of deaths attributable to cancer was 553,251.) By exposing these statistics in painstaking detail, the authors provide a basis for competent and compassionate medical professionals to recognize the inadequacies of today's system and at least attempt to institute meaningful reforms. Medicine and medicines are closely related, and both are of great importance to human life. The medical field therefore has to be more conscientious about identifying diseases at the right time and prescribing the right medicine to cure the illness. This can help patients lead a better lifestyle and secure a longer life.
2.0: PROBLEM STATEMENT
In this modern age, huge developments and improvements have been made, and a great deal of research has been carried out on medicines for various illnesses in order to extend patients' lifetimes. Researchers frequently modify medicines, always aiming for better cures and longer lives. However, unpredictable outcomes still occur, and patients may live or die regardless of treatment. Even as researchers find more alternatives for curing diseases, the percentage of people who survive is still decreasing. We are therefore going to carry out a study based on a secondary medical dataset that records various causes associated with patients living or dying. The variables are status, death cause, age at coronary heart disease diagnosis, sex, age at start, height, weight, diastolic, systolic, MRW, smoking, age at death, cholesterol, cholesterol status, BP status, weight status and smoking status. Status is our target variable, meaning the status of the patient: alive or dead. The second variable is death cause, which is categorized by chronic disease: cancer, coronary heart disease, cerebral vascular disease, other and unknown causes. The third is age at coronary heart disease diagnosis, which ranges between 32 and 90 years old. The fourth input is sex, representing male or female. The fifth is the age at which the illness started, ranging from 28 to 62 years old. The sixth and seventh inputs are the height and weight of the patients, ranging from 51.5 to 76.5 and from 67 to 300 respectively. The eighth and ninth inputs are the diastolic and systolic blood pressure. When the heart beats, it contracts and pushes blood through the arteries to the rest of the body; this force creates pressure on the arteries and is called systolic blood pressure. A normal systolic blood pressure is 120 or below. A systolic blood pressure of 120-139 means you have normal blood pressure that is higher than ideal, or borderline high blood pressure; even people at this level are at a greater risk of developing heart disease. A systolic blood pressure of 140 or higher, on repeated measurements, is considered hypertension, or high blood pressure. The diastolic blood pressure, or bottom number, indicates the pressure in the arteries when the heart rests between beats. A normal diastolic blood pressure is 80 or less. A diastolic blood pressure between 80 and 89 is normal but higher than ideal. A diastolic blood pressure of 90 or higher, on repeated measurements, is considered hypertension or high blood pressure.
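These cut-offs can be expressed directly as a small rule. Below is a minimal Python sketch of such a blood-pressure classifier; the function name and the exact handling of the boundary values are our own assumptions, and the labels simply mirror the BP status values recorded in the dataset.

```python
def classify_bp(systolic, diastolic):
    """Classify blood pressure status using the thresholds quoted above.

    The boundary handling at exactly 120/80 is an assumption; the labels
    mirror the BP status values (optimal, normal, high) in the dataset.
    """
    if systolic >= 140 or diastolic >= 90:
        return "high"       # hypertension on repeated measurements
    if systolic >= 120 or diastolic >= 80:
        return "normal"     # higher than ideal / borderline
    return "optimal"


print(classify_bp(112, 72))   # optimal
print(classify_bp(132, 84))   # normal
print(classify_bp(150, 95))   # high
```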
The tenth input is MRW, ranging from 67 to 268. The eleventh input is the smoking rate of the patients. Next is the age at death, ranging from 36 to 93 years old. The next inputs are the cholesterol level and cholesterol status; the level ranges from 96 to 568, and the status is borderline, desirable or high. The next input is BP status, with the values high, normal and optimal. Finally, the weight status and smoking status of the patients are classified as underweight, normal or overweight and as non-smoker, light, moderate, heavy or very heavy according to the weight and smoking data. We are going to use this dataset to classify the status of each patient, either dead or alive, using data mining techniques.
2.1: OBJECTIVES
1. To develop a classification model to determine the status of patients, whether alive or dead, based on medical causes.
2. To identify the best model for determining the medical causes.
3. To identify the significant variables in determining which patients are alive.
2.2: JUSTIFY THE USE OF SPECIFIC SOLUTION TECHNIQUES OR PROBLEM SOLVING PROCEDURES IN YOUR WORK
Decision tree
A decision tree is a tree-shaped structure that represents a set of decisions or predictions of data trends. It is suitable for describing sequences of interrelated decisions or predicting future data trends, and it has the capability to classify entities into specific classes based on their features. Each tree consists of three types of nodes: root node, internal node and terminal node (leaf). The topmost node is the root node, and it represents all of the rows in the dataset. Nodes with child nodes are internal nodes, while nodes without child nodes are called terminal nodes or leaves. A common algorithm for building a decision tree selects a subset of instances from the training data to construct an initial tree. The remaining training instances are then used to test the accuracy of the tree. If any instance is incorrectly classified, the instance is added to the current set of training data and the process is repeated. A main goal is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization. Decision trees have been successfully applied to real problems, are easy to understand, and map nicely to a set of production rules.
Regression
The Regression node in Enterprise Miner does either linear or logistic regression depending upon the measurement level of the target variable. Linear regression is done if the target variable is an interval variable; in linear regression the model predicts the mean of the target variable at the given values of the input variables. Logistic regression is done if the target variable is a discrete variable; in logistic regression the model predicts the probability of a particular level (or levels) of the target variable at the given values of the input variables. Because the predictions are probabilities, which are bounded by 0 and 1 and are not linear in this space, the probabilities must be transformed in order to be adequately modelled. The most common transformation for a binary target is the logit transformation; probit and complementary log-log transformations are also available in the Regression node. There are three variable selection methods available in the Regression node of Enterprise Miner. Forward selection first selects the best one-variable model. It then selects the best two variables among those that contain the first selected variable. This process continues until it reaches the point where no additional variables have a p-value less than the specified entry p-value.
Backward selection starts with the full model. The variable that is least significant, given the other variables, is then removed from the model. This process continues until all of the remaining variables have a p-value less than the specified stay p-value. Stepwise selection is a modification of the forward selection method; the difference is that variables already in the model do not necessarily stay there. After each variable is entered into the model, this method looks at all the variables already included in the model and deletes any variable that is not significant at the specified level. The process ends when none of the variables outside the model has a p-value less than the specified entry value and every variable in the model is significant at the specified stay value.
Neural Network
An artificial neural network is a network of many simple processors ("units"), each possibly having a small amount of local memory. The units are connected by communication channels ("connections") that usually carry numeric (as opposed to symbolic) data encoded by various means. The units operate only on their local data and on the inputs they receive via the connections; the restriction to local operations is often relaxed during training. More specifically, neural networks are a class of flexible, nonlinear regression, discriminant and data reduction models that are interconnected in a nonlinear dynamic system. Neural networks are useful tools for interrogating increasing volumes of data and for learning from examples to find patterns in data. By detecting complex nonlinear relationships in data, neural networks can help make accurate predictions about real-world problems.
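As a rough illustration of the forward selection procedure described above, the sketch below re-implements the idea in Python with statsmodels logistic regression. This is not the SAS Regression node itself; the 0.05 entry p-value and the column names in the usage comment are assumptions for illustration only.

```python
# A minimal sketch of forward selection for a binary logistic regression.
import pandas as pd
import statsmodels.api as sm


def forward_select(X, y, entry_p=0.05):
    selected, remaining = [], list(X.columns)
    while remaining:
        # p-value of each candidate variable when added to the current model
        pvals = {}
        for cand in remaining:
            model = sm.Logit(y, sm.add_constant(X[selected + [cand]])).fit(disp=0)
            pvals[cand] = model.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= entry_p:   # stop: no variable clears the entry p-value
            break
        selected.append(best)
        remaining.remove(best)
    return selected


# Example usage (hypothetical file and column names):
# data = pd.read_csv("medical.csv")
# y = (data["Status"] == "Dead").astype(int)
# print(forward_select(data[["AgeAtStart", "Systolic", "Smoking"]], y))
```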
3.0: RESEARCH METHODOLOGY
In this project, our group gathered information about patients who are alive or dead with different types of medical causes. The causes include cancer, cerebral vascular disease, coronary heart disease, others and unknown. After gathering the data, we process the data using the KDD process. The KDD process is a method used to discover hidden information in databases; it helps to convert unknown or hidden patterns into a useful, understandable and informative form. The KDD process has five stages: selection, pre-processing, transformation, data mining, and interpretation & evaluation.
a) Selection Data selection is defined as the process of determining the appropriate data type and source, as well as suitable instruments to collect data. Data selection precedes the actual practice of data collection. The data is obtained from the UCI Machine Learning Repository.
b) Pre-processing
Data pre-processing is a data mining technique that involves transforming raw data into an understandable format. Real-world data is often incomplete, inconsistent, and/or lacking in certain behaviours or trends, and is likely to contain many errors. Data pre-processing is a proven method of resolving such issues. Data pre-processing can be categorized into a few methods:
Data cleaning
Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. The purpose of data cleaning is to identify incomplete, incorrect, inaccurate or irrelevant parts of the data and then replace, modify, or delete this dirty data. Data cleaning can handle incomplete, noisy and inconsistent data.
o Incomplete data refers to missing values, which occur due to improper data collection methods. During data collection, some attributes may have no recorded value, causing the data to be incomplete. To overcome this problem, we can replace the missing value with the mean, estimate the probable value using regression, use a constant value, or ignore the missing record. For example:
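A minimal pandas sketch of mean-value replacement; the column name and values are hypothetical.

```python
# Replace a missing interval value with the mean of the observed values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"Cholesterol": [210, np.nan, 181, 245, np.nan]})

df["Cholesterol"] = df["Cholesterol"].fillna(df["Cholesterol"].mean())
print(df)
```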
o Noisy data is random error or variance in the data. This happens due to corrupted data transmission or technological limitations. While entering data into software such as SPSS or SAS, we may key in wrong values, which causes noisy data. To solve this problem, we can use the binning method or outlier removal.
o Inconsistent data means the data contains replicated or possibly redundant records. The method to overcome this problem is to remove the redundant or replicated data.
Data integration
Data integration involves combining data residing in different sources and providing users with a unified view of these data. Data comes from different sources with different naming standards, which causes inconsistencies and redundancies. There are several ways to handle this problem:
- Consolidate the different sources into one repository (using metadata).
- Correlation analysis (measure the strength of the relationship between different attributes).
Data reduction
Data reduction is the transformation of numerical or alphabetical digital information derived empirically or experimentally into a corrected, ordered, and simplified form. The basic concept is the reduction of multitudinous amounts of data down to the meaningful parts. This increases efficiency and can reduce a huge data set into a smaller representation. Several techniques can be used in data reduction, such as data cube aggregation, dimension reduction, data compression and discretization.
In our medical data, we used data integration, in which we eliminated the unrelated variables from our data.
c) Transformation
The transformation process, also known as data normalization, basically re-scales the data into a suitable range. This process is important because it can increase processing speed and reduce memory allocation. There are several transformation methods:
- Decimal Scaling
- Min Max
- Z Score
- Logarithmic Normalization
We chose Min Max normalization for our data. Min Max normalization is a linear transformation of the original input to a newly specified range. The formula used is:
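The standard Min Max formula maps a value x to x' = (x - min) / (max - min) * (new_max - new_min) + new_min. A minimal Python sketch, with made-up values and the default [0, 1] target range:

```python
# Min Max normalization of a list of values to a new range.
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


print(min_max([96, 210, 568]))   # cholesterol-like values rescaled to [0, 1]
```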
d) Data Mining
Data mining is the use of algorithms to extract information and patterns as part of the KDD process. This step applies algorithms to the transformed data to generate the desired results. In this project, we are using SAS Enterprise Miner to build the comparison models. In SAS Enterprise Miner, we are using the decision tree method. A decision tree is a tree-shaped structure that represents a set of decisions or predictions of data trends. It is suitable for describing sequences of interrelated decisions or predicting future data trends, and it has the capability to classify entities into specific classes based on their features. Each tree consists of three types of nodes: root node, internal node and terminal node (leaf). The topmost node is the root node, and it represents all of the rows in the dataset. Nodes with child nodes are internal nodes, while nodes without child nodes are called terminal nodes or leaves. A common algorithm for building a decision tree selects a subset of instances from the training data to construct an initial tree. The remaining training instances are then used to test the accuracy of the tree. If any instance is incorrectly classified, the instance is added to the current set of training data and the process is repeated. A main goal is to minimize the number of tree levels and tree nodes, thereby maximizing data generalization. Decision trees have been successfully applied to real problems, are easy to understand, and map nicely to a set of production rules. The resulting trees are usually quite understandable and can easily be used to obtain a better understanding of the phenomenon in question.
DATA VARIABLES
From the table of data variables, our target is to predict the Status variable.
e) Interpretation & evaluation
In the interpretation & evaluation process, some data mining output is in a format that is not human-understandable and needs interpretation for better understanding. So, we convert the output into an easily understood medium.
3.2: STEPS INVOLVED IN SAS ENTERPRISE MINER
I. Firstly, we open SAS Enterprise Miner. Then we click File and create a new project: from the SAS menu bar, select File > New > Project. We name our project GroupProject.
II. Click Create. The GroupProject project opens an initial untitled diagram, and we name it Data.
The SAS Enterprise Miner window contains the following interface components:
Project Navigator - enables you to manage projects and diagrams, add tools to the Diagram Workspace, and view HTML reports that are created by the Reporter node. Note that when a tool is added to the Diagram Workspace, the tool is referred to as a node. The Project Navigator has three tabs:
- Diagrams tab - lists the current project and the diagrams within the project. By default, the project window opens with the Diagrams tab activated.
- Tools tab - contains the Enterprise Miner tools palette. This tab enables you to see all of the tools (or nodes) that are available in Enterprise Miner. The tools are grouped according to the SEMMA data-mining methodology. Many of the commonly used tools are shown on the Tools Bar at the top of the window. You can add additional tools to the Tools Bar by dragging them from the Tools tab onto the Tools Bar. In addition, you can rearrange the tools on the Tools Bar by dragging each tool to a new location on the Tools Bar.
- Reports tab - displays the HTML reports that are generated by using the Reporter node.
Diagram Workspace - enables you to build, edit, run, and save process flow diagrams.
Tools Bar - contains a customizable subset of Enterprise Miner tools that are commonly used to build process flow diagrams in the Diagram Workspace. You can add or delete tools from the Tools Bar.
Progress Indicator - displays a progress indicator bar that indicates the execution status of an Enterprise Miner task. Message Panel - displays messages about the execution of an Enterprise Miner task. Connection Status Indicator - displays the remote host name and indicates whether the connection is active for a client/server project.
III. The nodes that we used in our Group Project:
a) Input Data Source
The Input Data Source node reads data sources and defines their attributes for later processing by Enterprise Miner. This node can perform various tasks:
- Access SAS data sets and data marts. Data marts can be defined by using the SAS Data Warehouse Administrator, and they can be set up for Enterprise Miner by using the Enterprise Miner Warehouse Add-ins.
- Automatically create a metadata sample for each variable when you import a data set with the Input Data Source node. By default, Enterprise Miner obtains the metadata sample by taking a random sample of 2,000 observations from the data set that is identified in the Input Data Source. Optionally, you can request larger samples. If the data is smaller than 2,000 observations, the entire data set is used.
- Use the metadata sample to set initial values for the measurement level and the model role for each variable. You can change these values if you are not satisfied with the automatic selections that are made by the node.
- Display summary statistics for interval and class variables.
- Define target profiles for each target in the input data set.
b) Data Partition
The Data Partition node enables you to partition data sets into training, test, and validation data sets. The training data set is used for preliminary model fitting. The validation data set is used to monitor and tune the model weights during estimation and is also used for model assessment. The test data set is an additional data set that you can use for model assessment. This node uses simple random sampling, stratified random sampling, or a user-defined partition to create training, test, or validation data sets. Specify a user-defined partition if you have determined which observations should be assigned to the training, validation, or test data sets. This assignment is identified by a categorical variable that is in the raw data set.
c) Replacement
The Replacement node enables you to impute values for observations that have missing values. You can replace missing values for interval variables with the mean, median, midrange, mid-minimum spacing, or a distribution-based replacement, or you can use a replacement M-estimator such as Tukey's biweight, Huber's, or Andrews' wave. You can also estimate the replacement values for each interval input by using a tree-based imputation method. Missing values for class variables can be replaced with the most frequently occurring value, distribution-based replacement, tree-based imputation, or a constant.
d) Transform Variables
The Transform Variables node enables you to transform variables. For example, you can transform variables by taking the square root of a variable, by taking the natural logarithm, maximizing the correlation with the target, or normalizing a variable. Additionally, the node supports user-defined formulas for transformations and enables you to group interval-valued variables into buckets or quantiles. This node also automatically places interval variables into buckets by using a decision tree-based algorithm. Transforming variables to similar scale and variability may improve the fit of models and, subsequently, the classification and prediction precision of fitted models.
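As a rough illustration of one such transformation, the sketch below applies a natural logarithm to a right-skewed variable; the sample values are made up, and this is not the SAS node itself.

```python
# Log transformation to reduce right skewness in an interval variable.
import numpy as np

weight = np.array([67, 80, 95, 120, 300], dtype=float)   # right-skewed values
log_weight = np.log(weight)    # the natural logarithm compresses large values
print(log_weight.round(3))
```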
e) Regression
The Regression node enables you to fit both linear and logistic regression models to your data. You can use both continuous and discrete variables as inputs. The node supports the stepwise, forward, and backward selection methods. A point-and-click interaction builder enables you to create higher-order modelling terms.
f) Decision Tree
The Tree node enables you to perform multiway splitting of your data based on nominal, ordinal, and continuous variables. This is the SAS implementation of decision trees, which represents a hybrid of the best of the CHAID, CART, and C4.5 algorithms. The node supports both automatic and interactive training. When you run the Tree node in automatic mode, it automatically ranks the input variables by the strength of their contribution to the tree. This ranking can be used to select variables for use in subsequent modelling. In addition, dummy variables can be generated for use in subsequent modelling. Using interactive training, you can override any automatic step by defining a splitting rule or by pruning a node or subtree.
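The Tree node itself is configured through the Enterprise Miner interface. Purely as an illustration of the underlying idea, here is a minimal scikit-learn sketch that fits a small decision tree to a made-up binary Status target and prints its splitting rules and variable importances; it is not the SAS Tree node.

```python
# A small decision tree on made-up medical-style data.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "AgeAtStart": [34, 58, 41, 62, 45, 55, 39, 60],
    "Systolic":   [118, 160, 124, 150, 132, 145, 120, 155],
    "Smoking":    [0, 30, 5, 20, 0, 15, 0, 25],
    "Status":     ["Alive", "Dead", "Alive", "Dead", "Alive", "Dead", "Alive", "Dead"],
})

X, y = data[["AgeAtStart", "Systolic", "Smoking"]], data["Status"]
tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=1).fit(X, y)

print(export_text(tree, feature_names=list(X.columns)))   # splitting rules
print(dict(zip(X.columns, tree.feature_importances_)))    # variable importance
```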
g) Neural Network
The Neural Network node enables you to construct, train, and validate multilayer feed-forward neural networks. By default, the Neural Network node automatically constructs a multilayer feed-forward network that has one hidden layer consisting of three neurons. In general, each input is fully connected to the first hidden layer, each hidden layer is fully connected to the next hidden layer, and the last hidden layer is fully connected to the output. The Neural Network node supports many variations of this general form.
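Again purely as an illustration of the idea rather than the SAS node, the sketch below fits a feed-forward network with one hidden layer of three neurons (the default described above) to a made-up binary target.

```python
# A multilayer perceptron with a single hidden layer of three neurons.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X = np.array([[34, 118], [58, 160], [41, 124], [62, 150],
              [45, 132], [55, 145], [39, 120], [60, 155]], dtype=float)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])   # 0 = Alive, 1 = Dead

X_scaled = StandardScaler().fit_transform(X)   # networks train better on scaled inputs
net = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=1)
net.fit(X_scaled, y)

print(net.predict(X_scaled))
print([w.shape for w in net.coefs_])   # connection weight matrices per layer
```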
h) Assessment
The Assessment node provides a common framework for comparing models and predictions from any of the modelling nodes (Regression, Tree, Neural Network, and User Defined Model nodes). The comparison is based on the expected and actual profits or losses that would result from implementing the model. The node produces the following charts that help to describe the usefulness of the model: lift, profit, return on investment, receiver operating curves, diagnostic charts, and threshold-based charts.
The Reporter node assembles the results from a process flow analysis into an HTML report that can be viewed with a Web browser. Each report contains header information, an image of the process flow diagram, and a separate report for each node in the flow, including node settings and results. Reports are managed in the Reports tab of the Project Navigator.
IV) After that, we drag and drop an Input Data Source node onto the workspace. We use the Input Data Source node to access the medical data sets. After opening the Input Data Source node, we choose the data set EMDATA.MEDICAL.
Next, we continue by setting the model roles in the Input Data Source node. We assign Status as the target; the Status variable becomes the target variable, and its measurement level is binary. Cholesterol status, BP status, weight status and smoking status are measured as nominal. The unwanted variables, such as Deathcause, Ageatdeath and Agechodiag, are rejected because they do not contribute to the model.
Using the Cursor
The shape of the cursor changes depending on where it is positioned. The behavior of the mouse commands depends on the shape of the cursor as well as on the selection state of the node over which the cursor is positioned. Right-click in an open area to see the pop-up menu. You can connect the node where the cursor is positioned (beginning node) to any other node (ending node) as follows:
i. Ensure that the beginning node is deselected. It is much easier to drag a line when the node is deselected. If the beginning node is selected, click in an open area of the workspace to deselect it.
ii. Position the cursor on the edge of the icon that represents the beginning node (until the cross-hair appears).
iii. Press the left mouse button and immediately begin to drag in the direction of the ending node. Note: If you do not begin dragging immediately after pressing the left mouse button, you will only select the node. Dragging a selected node will generally result in moving the node (that is, no line will form).
iv. Release the mouse button after you reach the edge of the icon that represents the ending node.
v. Click away from the arrow. Initially, the connection appears as a line; after you click away from the line in an open area of the workspace, the finished arrow forms.
Select View distribution to see the distribution of values for Status in the metadata sample. A distribution is shown.
We need to investigate the number of levels, the percentage of missing values, and the sort order of each variable. Status is a binary target with two levels, and the level of interest is set as the target event. Close the Input Data Source node, and save changes when you are prompted.
V) After that, we use the Data Partition node to partition the MEDICAL data set into training, test and validation sets. The training data is used for preliminary model fitting, the validation data is used to tune model weights during estimation, and the test data is used for model assessment. We set the percentage of training data to 70% and test data to 30%, and leave the validation set empty.
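The same 70/30 partition can be illustrated outside Enterprise Miner; the sketch below uses scikit-learn's stratified train/test split on a made-up data frame.

```python
# A stratified 70/30 train/test split, mirroring the partition settings above.
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "AgeAtStart": [34, 58, 41, 62, 45, 55, 39, 60, 47, 52],
    "Status":     ["Alive", "Dead", "Alive", "Dead", "Alive",
                   "Dead", "Alive", "Dead", "Alive", "Dead"],
})

train, test = train_test_split(
    data, test_size=0.30, stratify=data["Status"], random_state=1)

print(len(train), "training rows and", len(test), "test rows")
```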
VI) Add a Regression node. Then, connect the Data Partition node to the Regression node.
The Regression node fits models for interval, ordinal, nominal, and binary targets. Since we selected a binary variable (Status) as the target in the Input Data Source node (EMDATA.MEDICALDATA), the Regression node will fit (by default) a binary logistic regression model with main effects for each input variable.
Stepwise here refers to a modification of the forward selection method. The difference is that variables already in the model do not necessarily stay there. After each variable is entered into the model, this method looks at all the variables already included in the model and deletes any variable that is not significant at the specified level. The process ends when none of the variables outside the model has a p-value less than the specified entry value and every variable in the model is significant at the specified stay value. The regression process is then continued by setting the method to Forward. Forward selection first selects the best one-variable model, and then selects the best two variables among those that contain the first selected variable. This process continues until it reaches the point where no additional variables have a p-value less than the specified entry p-value. Different from the others, the Backward model starts with the full model. Next, the variable that is least significant, given the other variables, is removed from the model. This process continues until all of the remaining variables have a p-value less than the specified stay p-value.
By default, the node uses Deviation coding for categorical input variables. Right-click the Regression node and select Run. When the run is complete, click Yes to view the results. The Estimates tab in the Regression Results Browser displays bar charts of effect T-scores and parameter estimates.
The T-scores are plotted (from left to right) in decreasing order of their absolute values. The higher the absolute value, the more important the variable is in the regression model. In this data, the variables X1 = Ageatstart, X2 = Smoking and X3 = Systolic are the most important model predictors. Next, right-click the Regression node in the Diagram Workspace and select Model Manager. In the Model Manager, select Tools > Lift Chart. A cumulative % Response chart appears. By default, this chart arranges observations into deciles based on their predicted probability of response, and then plots the actual percentage of respondents.
From the lift chart, the individuals are sorted in descending order of their predicted probability of the target event (death). The plotted values are the cumulative actual proportions of the event. If the model is useful, the proportion of individuals with the event will be relatively high in the top deciles and the plotted curve will be decreasing. In this case, the default regression is not useful. Applying a default regression model directly to the training data set is not appropriate here, because regression models ignore observations that have a missing value for at least one input variable. We should consider performing imputation before fitting a regression model. In Enterprise Miner, we can use the Replacement node to perform imputation.
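For reference, a cumulative % response chart of this kind can be reproduced by sorting on the predicted probability, cutting the observations into deciles and accumulating the actual event rate. The sketch below shows the computation on made-up scores and outcomes.

```python
# Cumulative event rate by decile, the quantity behind a % response chart.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
scores = pd.DataFrame({
    "p_dead": rng.random(1000),            # predicted probability of the event
    "dead":   rng.integers(0, 2, 1000),    # actual outcome (0/1)
}).sort_values("p_dead", ascending=False)

scores["decile"] = np.repeat(np.arange(1, 11), 100)   # 10 equal-sized groups
cum_response = scores.groupby("decile")["dead"].mean().expanding().mean()
print(cum_response)   # cumulative actual event rate by decile (equal-sized groups)
```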
VII. Add an Insight node and connect it to the Data Partition node. Run the flow from the Insight node by right-clicking the Insight node and selecting Run. Select Yes when you are prompted to see the results. The output is shown below.
VIII) Add a Replacement node. This allows you to replace missing values for each variable. This replacement is necessary to use all of the observations in the training data set when you build a regression or neural network model. By default, Enterprise Miner uses a sample from the training data set to select the values for replacement. The following statements are true:
- Observations that have a missing value for an interval variable have the missing value replaced with the mean of the sample for the corresponding variable.
- Observations that have a missing value for a binary, nominal, or ordinal variable have the missing value replaced with the most commonly occurring non-missing level of the corresponding variable in the sample.
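The same default behaviour (mean for interval variables, most frequent level for class variables) can be illustrated with scikit-learn's SimpleImputer; the tiny data frame below is made up, and this is not the SAS Replacement node.

```python
# Mean imputation for an interval variable, mode imputation for a class variable.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "Cholesterol": [210.0, np.nan, 181.0, 245.0],
    "Smoking_Status": ["Non-smoker", "Heavy", np.nan, "Heavy"],
})

df["Cholesterol"] = SimpleImputer(strategy="mean").fit_transform(
    df[["Cholesterol"]]).ravel()
df["Smoking_Status"] = SimpleImputer(strategy="most_frequent").fit_transform(
    df[["Smoking_Status"]]).ravel()
print(df)
```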
IX) Performing Variable Transformation. After you have viewed the results in Insight, it is clear that some input variables have highly skewed distributions. In highly skewed distributions, a small percentage of the points may have a great deal of influence. Sometimes, performing a transformation on an input variable can yield a better fitting model. To do that, we add a Transform Variables node as shown below.
After connecting the node, open it by right-clicking on it and selecting Open. The Variables tab is shown by default; it displays statistics for the interval-level variables, including the mean, standard deviation, skewness, and kurtosis (calculated from the metadata sample). Open the Regression node. The Variables tab is active by default. Change the status of all input variables except the M_ variables to don't use. Close the Regression node and save the model. Run the flow from the Assessment node and select Yes to view the results. Create a lift chart for the stepwise regression model.
X) Add a default Tree node, connect the Data Partition node to the Tree node, and then connect the Tree node to the Assessment node. Decision trees handle missing values directly, while regression and neural network models ignore all incomplete observations (observations that have a missing value for one or more input variables). The flow should now appear like the following flow.
Result for the tree with two branches:
Lift Chart
Competing Splits View:
Result for the tree with three branches:
Lift Chart
Competing Splits View:
XI. Add a default Neural Network node. Then, connect the Neural Network node to the Assessment node.
Run the flow from the Neural Network node. Select Yes when you are prompted to view the results. The default Neural Network node fits a multilayer perceptron (MLP) model with no direct connections, and the number of hidden layers is data dependent. In this case, the Neural Network node fitted an MLP model with a single hidden layer. By default, the Tables tab in the Neural Network Results Browser displays various statistics of the fitted model. Click the Weights tab. The Weights tab displays the weights (parameter estimates) of the connections. The following display shows the weights of the connections from each variable to the single hidden layer. Each level of each status variable is also connected to the hidden layer. The Neural Network node iteratively adjusts the weights of the connections to minimize the error function.
The table below shows the neural network results for the medical data:
Weights:
We have results for 43 variables. From the SAS output, we find that the highest value is 2.39012, from variable 43, which is H12.
Conclusion
By referring to the Assessment node and the Regression node, the misclassification rates in the training data set and in the testing data set are as below:

Tool                           Training    Testing
Decision Tree with 2 branches  0.24273     0.26807
Decision Tree with 3 branches  0.24495     0.28534
Neural Network                 0.24602     0.25655
Regression (Stepwise)          0.25534     0.25019
Regression (Forward)           0.25535     0.25016
Regression (Backward)          0.25151     0.25784
Misclassification rate in training data set = 0.25535
Misclassification rate in testing data set = 0.25016
The regression with the forward selection method is classified as the best model because it has the lowest misclassification rate in the testing data set compared to the other models. It can also be said that it has the highest accuracy in the testing data set compared to the other models.
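For reference, the misclassification rate reported above is simply the share of observations whose predicted status differs from the actual status; a minimal sketch with made-up predictions:

```python
# Misclassification rate = wrongly predicted observations / total observations.
actual    = ["Dead", "Alive", "Dead", "Alive", "Dead", "Alive", "Dead", "Alive"]
predicted = ["Dead", "Alive", "Alive", "Alive", "Dead", "Dead", "Dead", "Alive"]

errors = sum(a != p for a, p in zip(actual, predicted))
print("Misclassification rate:", errors / len(actual))   # 2 / 8 = 0.25
```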