DISEASE PREDICTION SYSTEM USING MACHINE
LEARNING
ABSTRACT
Information technology has substantially changed the health sector. The goal of this
research is to create a diagnosis model for a variety of diseases based on their
symptoms. To create such a model, the system uses data mining techniques such as
classification. The intelligent agent is trained on datasets containing extensive data
about patient diseases that have been gathered, refined, and categorized. After the
data is split, the machine learning models are evaluated by accuracy. The Support
Vector Classifier, Gaussian Naive Bayes Classifier, and Random Forest Classifier are
evaluated with cross-validation. Based on the results, the patient can then contact a
doctor for further treatment. The system is an example of how technology and medical
expertise can be combined toward the goal that "prediction is better than cure."
The Disease Prediction System is based on several prediction models that predict the
user's disease from the symptoms the user enters as input. Predictive models built
with machine learning classification algorithms analyze the symptoms provided by the
user and output the name and probability of the disease. Disease prediction is
implemented with the Naive Bayes Classifier, Decision Tree, and Random Forest
algorithms. The Naive Bayes classifier is used to calculate the probability of the
predicted disease. The model uses a dataset of 132 symptoms from which users can
select their symptoms. The user does not need a medical report to use the system,
because the prediction is based on symptoms alone, which saves money. The system also
has an easy-to-use interface, so any user can use it to predict common diseases.
INTRODUCTION
1.1 About
The healthcare and medical sector has a growing need for data mining. When data mining
methods are applied correctly, valuable information can be extracted from large
databases, which helps medical practitioners make early decisions and improve health
services. The aim is to use classification to assist the physician. Diseases and
health problems such as malaria, dengue, impetigo, diabetes, migraine, jaundice, and
chickenpox significantly affect a person's health and can lead to death if ignored.
The healthcare industry can make effective decisions by "mining" the large databases
it holds, that is, by extracting the hidden patterns and relationships in them.
With the number of patients and diseases rising every year, medical systems are
overloaded and have become expensive in many countries. Most diseases require a
consultation with a doctor to be treated. With sufficient data, predicting a disease
with an algorithm can be easy and cheap. Predicting a disease from its symptoms is an
integral part of treatment.
There are times when we suddenly need a doctor but none is available, and we are left
in trouble. The proposed system offers a user-friendly way to get immediate help and
advice on health issues through an online healthcare system.
Nowadays, with the help of statistics and posterior distributions, such problems can
be solved swiftly and easily. Just as statistics has been successful in economics,
social science, and a few other fields, classification has been used in medicine to
solve problems that are tedious to settle with classical statistics. The
classification rules used for disease prediction are learned from the training
samples themselves.
It is estimated that more than 70% of people in India are prone to common ailments
such as viral infections, flu, cough, and cold at intervals of about two months.
Because many people do not realize that these general ailments can be symptoms of
something more harmful, 25% of this population dies or develops a serious medical
problem after ignoring early symptoms. This is a serious situation that can become
alarming if people continue to ignore these conditions. Identifying or predicting a
disease at an early stage is therefore very important for avoiding complications and
deaths. The systems available today are either dedicated to a particular disease or,
in the case of generalized disease prediction, still in the research and development
stage.
The main motive of the proposed system is to predict commonly occurring diseases in
their early phase, because when they are not checked or examined they can turn into
more dangerous diseases and can even cause death. The system applies data mining
techniques and the Decision Tree, Naive Bayes, and Random Forest algorithms. It
predicts the most probable disease from the symptoms given by the user and suggests
precautionary measures to keep the disease from worsening. It can also help doctors
analyze patterns of disease in society. This project is dedicated to a disease
prediction system that applies data mining techniques in the early stages of dataset
preparation, while the main model is trained using Machine Learning (ML) algorithms
to predict general diseases.
Data mining algorithms such as Decision Tree, Random Forest, and Naive Bayes can help
remedy this situation. Hence, we have developed an automated system that can discover
and extract hidden knowledge associated with diseases from a historical
(disease-symptom) database according to the rule set of the respective algorithm.
1.2 Data Mining and Machine Learning Algorithm
Data mining and machine learning algorithms are used for disease prediction in this
project. Different data mining and machine learning techniques are used to clean and
evaluate the dataset, and the ML model is then assessed on the basis of its train
score and test score.
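As a rough illustration of evaluating a model by train and test score, the sketch below splits records into a training set and a test set and computes accuracy. The helper names `train_test_split` and `accuracy`, the 80/20 split, and the toy data are assumptions made for illustration; they are pure-Python stand-ins, not the project's actual code.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


rows = list(range(10))  # stand-in for 10 labelled records
train_rows, test_rows = train_test_split(rows, test_fraction=0.2)
print(len(train_rows), len(test_rows))       # 8 2
print(accuracy([1, 0, 1, 1], [1, 1, 1, 1]))  # 0.75
```

Computing `accuracy` once on the training rows and once on the held-out test rows gives the train score and test score; a large gap between the two is a common warning sign of overfitting.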
1.2.1 Data Analysis and Data Mining
Data mining is a process in which structured data is prepared from raw, unstructured
data so that meaningful information can be extracted and used in the project.
Organizing the data and examining it is the way to find out what information the data
contains and what it does not. There are many different ways to carry out data
analysis. In the analysis phase, the data is used to draw conclusions. Data analysis
is the process of inspecting, cleaning, transforming, and modeling data with the
objective of highlighting useful information, suggesting conclusions, and supporting
decision making. It has multiple facets and approaches, encompassing diverse
techniques under a variety of names in different business, science, and social
science domains.
Data mining is the discovery of previously unknown information in databases. Data
mining functions include clustering, classification, prediction, and association. An
important application of data mining is the mining of association rules. Association
rules were first introduced in 1993 and are used to identify relationships among a
set of items in a database. These relationships are not based on inherent properties
of the data, but on the co-occurrence of data items. Data mining offers new
perspectives for data analysis; its main role is to extract and discover new
knowledge from data. In the past few years, capabilities for data collection and data
generation have advanced, and data collection tools have provided us with huge
amounts of data. Data mining processes integrate techniques from multiple disciplines
such as statistics, machine learning, database technology, pattern recognition,
neural networks, information retrieval, and spatial data analysis. Data mining
techniques have been used in many fields, such as business management, science,
engineering, banking, data management, and administration.
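To make the idea of co-occurrence concrete, the following sketch computes the two usual measures of an association rule, support and confidence, over a handful of records. The symptom names and records are hypothetical and serve only as an illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in items."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) from co-occurrence counts."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base > 0 else 0.0


# Hypothetical symptom records, one set per patient.
records = [
    {"fever", "cough"},
    {"fever", "cough", "headache"},
    {"fever"},
    {"cough"},
]
print(support(records, {"fever", "cough"}))       # 0.5
print(confidence(records, {"fever"}, {"cough"}))  # 2 of the 3 fever records list cough
```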
1.2.2 Machine Learning Algorithms
ML is a subfield of Artificial Intelligence (AI) concerned with computation and
analysis. ML algorithms are used to find patterns and structures in the dataset
provided to the model. They allow a system to be trained and tested on a large amount
of data, and the resulting model can then be used for decision making; the more
relevant data the model has been trained on, the better suited it is for that
purpose. ML algorithms have proven to be very helpful in today's world.
ML algorithms are organized into different types based on the desired outcome of the
algorithm. Common algorithm types include:
• Supervised learning: A supervised learning algorithm applies what it has learned
from labelled examples to new data in order to predict future events. Starting from
the analysis of a known training dataset, the algorithm can provide targets for new
values after sufficient training.
• Unsupervised learning: Unsupervised machine learning algorithms are used when the
information used for training is neither classified nor labelled. Such an algorithm
shows how the system can infer a function that describes a hidden structure in
unlabelled data.
• Semi-supervised learning: This category falls between supervised and unsupervised
learning. It combines labelled and unlabelled examples to generate a function or
classifier that is used to build a model for prediction or classification.
• Reinforcement learning: Here the algorithm learns a policy for how to act given an
observation of the world. Every action has some impact on the environment, and the
environment provides feedback that guides the learning algorithm.
• Transduction: This is similar to supervised learning but does not explicitly
construct a function; instead, it tries to predict new outputs based on training
inputs, training outputs, and new inputs.
• Learning to learn: Here the algorithm learns its own inductive bias based on
previous experience.
The performance and computational analysis of ML algorithms is studied in a branch of
theoretical computer science known as computational learning theory.
Machine learning is about designing algorithms that allow a computer to learn.
Learning does not necessarily involve consciousness; it is a matter of finding
statistical regularities or other patterns in the data. Thus, many machine learning
algorithms barely resemble how a human might approach a learning task. However,
learning algorithms can give insight into the relative difficulty of learning in
different environments.
Machine learning is made up of three parts:
• The computational algorithm at the core of making determinations.
• Variables and features that make up the decision.
• Base knowledge for which the answer is known that enables (trains) the system to learn.
Initially, the model is fed parameter data for which the answer is known. The algorithm
is then run, and adjustments are made until the algorithm’s output (learning) agrees with the
known answer. At this point, increasing amounts of data are input to help the system learn
and process higher computational decisions.
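The loop of feeding in known answers and adjusting until the output agrees can be sketched with a perceptron, one of the simplest trainable models. The perceptron, the logical AND data, and the learning rate are assumptions chosen for illustration only; the project's own models are the classifiers described later.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 when the weighted sum is positive."""
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0


def train(samples, labels, lr=1.0, max_epochs=100):
    """Adjust weights until every known answer is reproduced."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            err = y - predict(weights, bias, x)
            if err != 0:
                # Move the decision boundary toward the known answer.
                weights = [w + lr * err * xi for w, xi in zip(weights, x)]
                bias += lr * err
                errors += 1
        if errors == 0:  # output agrees with all known answers
            break
    return weights, bias


samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]  # logical AND
weights, bias = train(samples, labels)
print([predict(weights, bias, x) for x in samples])  # [0, 0, 0, 1]
```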
LITERATURE SURVEY
2.1 Introduction to Old Models
• The model proposed by [1] demonstrates important ML approaches to disease
prediction. It is based on the K-Nearest Neighbour (KNN) and Convolutional Neural
Network (CNN) algorithms, which differs from the approach used in our project. The
CNN uses both structured and unstructured data for disease prediction, which makes
it more time consuming.
The accuracy reported for the system proposed by [1] is very high: above 95% for
the KNN algorithm and 100% for the CNN algorithm. Accuracy this high for an ML
model suggests that the model may be overfitting.
Figure 2.1: Accuracy of The Model
• The model proposed by [2] is used for disease prediction. It uses ML algorithms
such as iForest to correct dataset problems and SMOTET to balance the dataset, and
then applies an ensemble learning technique. The input to the ML model is taken only
from the electronic reports produced by a blood examination of the patient. Inputs
used by this model include glucose level, cholesterol, lipoprotein, and blood
pressure, which can only be obtained through a physical examination of the patient.
• The model proposed by [3] uses big data analytics and deep learning models for
disease prediction. Because the dataset is large, big data techniques such as
MapReduce are used, and deep learning models are then applied on top of them, which
makes the overall process large and time consuming. This model needs a full medical
examination of the patient. The patient's full medical history is taken as input,
stored with the help of big data tools, and then used by the deep learning models to
predict the disease. The model also needs the patient's complete medical record, such
as the medications the patient was taking and the list of doctors visited, which
helps in a proper analysis of the patient's problem.
• The model proposed by [4] uses ML algorithms such as Random Forest, Logistic
Regression, and Decision Tree to predict heart disease, breast cancer, and diabetes.
Each algorithm has its own way of predicting the disease and is used accordingly. A
different dataset is used for each disease; for example, heart disease and breast
cancer each have their own dataset. The algorithms achieve different accuracies on
the different datasets and are selected according to their accuracy.
• The model proposed by [5] uses data mining and classification algorithms for
disease prediction. It is mainly used for the prediction of heart disease, using the
Decision Tree and Naive Bayes algorithms. Various data mining techniques are also
used to correct and balance the dataset so that the system predicts the correct
disease. This model also needs the patient's blood report. Inputs used in this system
include cholesterol level, blood pressure, and glucose level. The model has an
accuracy of 91% for the Decision Tree algorithm and 87% for the Naive Bayes
algorithm, but its scope is very limited: it can only predict diseases related to the
heart, diabetes, and breast cancer, and cannot predict general diseases.
• The model proposed by [6] uses the Support Vector Machine (SVM) technique for
disease prediction. The dataset used in this model contains some general lifestyle
indicators, such as eating habits and physical activity, rated from 1 to 5, where 1
is excellent and 5 is very bad. The model predicts whether a person's lifestyle is
healthy and whether he or she has any disease; it does not predict the name of the
disease or the specific problem the patient is facing. Data is collected from the
user through a form and then used by the SVM for prediction. This model focuses on
the user's lifestyle, that is, how physically active the user is in day-to-day life
and how much stress he or she has, and predicts the user's health on that basis.
• The model proposed by [7] uses big data techniques to predict disorders such as
thyroid disease and other chronic diseases. It uses Mahout on Hadoop for disease
prediction. Mahout provides the data mining techniques, which makes the system
efficient and powerful. In this model, Mahout analyzes the data stored in HBase, and
diseases are predicted on that basis. Because the dataset is very large, the overall
system is time consuming and its system requirements are high, so it needs a fast,
high-performance processing environment to run.
2.2 Identification of Research Gap and Problems
• The model given by [1] uses the KNN and CNN algorithms and is more time consuming
because it involves both structured and unstructured data, so processing takes longer
than for a dataset that contains only structured data. The proposed project uses only
structured data, and its classification algorithms are Decision Tree, Naive Bayes,
and Random Forest. The accuracy of the model given by [1] is above 90%, which we take
as a sign of overfitting, whereas the proposed model has an accuracy of about 86%,
which is good enough for a disease prediction model.
• The model given by [2] has a very limited scope, as it is meant only for the
prediction of diabetes and hypertension, whereas the proposed model is used for the
prediction of basic general diseases. The model given by [2] needs the patient's
blood report and uses ensemble learning techniques, whereas the proposed model does
not need any blood report or the physical presence of the patient. The proposed
system contains a list of symptoms from which users select the symptoms they are
experiencing, so the disease can be predicted easily, and the algorithms used differ
from those of the given model. The inputs required by the given model are based on
the user's medical report, such as cholesterol and blood glucose, whereas the
proposed system does not require any blood report for the prediction of disease.
• The model given by [3] uses a very large dataset managed with big data analytics,
which makes the system slow and demanding in terms of system requirements. The deep
learning algorithms used in that project are FISM, NAIS, and DeepICF. The proposed
model instead uses classification algorithms that are lightweight for a PC and run
faster than large deep learning techniques and big data analytics, which take more
time and space.
• The model given by [4] is best suited for predicting breast cancer, diabetes, and
heart-related problems, and has a different dataset for each of the three kinds of
disease. The proposed model instead predicts general diseases from symptoms and has
a single dataset for all diseases. The accuracy of the model given by [4] is very
high in some cases, so it may be overfitting, whereas the accuracy of the proposed
model is about 86%, which is good for a disease prediction model.
Figure 2.2: Accuracy Of The Breast Cancer Dataset
• The model given by [5] uses the KNN algorithm for the prediction of heart-related
diseases, with inputs such as high cholesterol, high blood sugar, diabetes, smoking
habits, and excessive alcohol consumption. This model also gives information about
cardiovascular diseases, cardiac arrest, and many other heart-related problems. Its
accuracy is high for the Decision Tree algorithm, at about 91%. The proposed system,
however, can predict general diseases and is therefore more widely useful than a
system that is helpful only for heart diseases.
• The model given by [6] uses the SVM algorithm for prediction. It is used to predict
a person's lifestyle and whether the person is suffering from any disease. The input
is given as ratings from 1 to 5, where 1 is excellent and 5 is very bad, and the
rated factors include lack of physical activity, obesity, stress, and smoking. The
dataset of the proposed model instead records with 0 or 1 whether each symptom is
present, which supports disease prediction better: the given model only predicts a
person's lifestyle, such as whether he or she is physically active, and the chance
that the person is prone to a disease, whereas the proposed system predicts the
disease itself.
Figure 2.3: Dataset For [6]
• The model given by [7] uses a very large dataset and big data analytics for the
prediction of disorders. It uses Mahout on the Hadoop file system, as Mahout contains
the data mining and analysis techniques needed for the prediction. Because a huge
amount of data is associated with this model, processing it all is difficult and the
overall speed of the system is low. This model predicts chronic disorders such as
thyroid disease and also needs a medical examination of the user. The proposed
system, by contrast, is fast because it uses lightweight algorithms, has higher
efficiency, and predicts more commonly occurring diseases that can later cause
serious problems for the patient. The system also helps users get medical advice, as
doctors are registered with the system, which supports better diagnosis and
treatment.
SYSTEM ANALYSIS
3.1 Major Inputs Required
The inputs required for the project are:
• Software Inputs:
– Jupyter Notebook
– Python version 3
– Pip version 3
– Pip virtual environment
– Flask
• Hardware Inputs:
– Windows/Linux/Mac OS
– At least 4 GB of RAM
– At least 512 GB of storage
– At least an integrated graphics card
• User Inputs:
– Basic Details
– Symptoms
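Since the dataset records each symptom as 0 or 1, the symptoms a user selects have to be turned into a 0/1 vector before they reach the model. A minimal sketch of that encoding follows; the five symptom names and the function `encode_symptoms` are hypothetical, and the real dataset has 132 symptom columns.

```python
# Hypothetical symptom vocabulary; the real dataset has 132 columns.
SYMPTOMS = ["fever", "cough", "headache", "skin_rash", "nausea"]


def encode_symptoms(selected, vocabulary=SYMPTOMS):
    """Return a 0/1 list with a 1 for each selected symptom."""
    unknown = set(selected) - set(vocabulary)
    if unknown:
        raise ValueError(f"unknown symptoms: {sorted(unknown)}")
    chosen = set(selected)
    return [1 if name in chosen else 0 for name in vocabulary]


print(encode_symptoms(["cough", "nausea"]))  # [0, 1, 0, 0, 1]
```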
3.2 Feasibility Analysis
3.2.1 Technical Feasibility
The project is technically feasible, as it can be built using existing technologies.
It is a web-based application that uses the Flask framework. The technology required
by Disease Predictor is available, and hence it is technically feasible.
3.2.2 Economic Feasibility
The project is economically feasible, as its only cost is hosting. As the number of
data samples increases, more time and processing power are consumed, in which case a
better processor might be needed.
3.2.3 Operational Feasibility
The project is operationally feasible, as it requires only that the user have basic
knowledge of computers and the Internet. Disease Predictor is based on a
client-server architecture, where the clients are the users and the server is the
machine on which the dataset and project are stored.
3.3 Algorithms Used
In this project, different ML algorithms are used, along with several data mining
techniques that check whether the dataset is balanced and whether the data is
structured for disease prediction. The ML algorithms used in this project are:
• Logistic Regression: Logistic regression is used to find the probability that an
event will occur. For example, the outcome "the school will open" is coded as 1 and
"the school will remain closed" as 0, and logistic regression estimates the
probability of the outcome 1. To compute this probability, the algorithm uses the
sigmoid function, i.e.
\[ \sigma(t) = \frac{1}{1 + e^{-t}} \tag{3.1} \]
Figure 3.1: Sigmoid Activation Function
There are different types of logistic regression in machine learning, each with its
specific uses:
– Binary Logistic Regression: In binary logistic regression there are only two
possible outcomes, i.e. an event either happens or does not.
– Multinomial Logistic Regression: In multinomial logistic regression there are
three or more possible outcomes with no natural order. For example, a person can be
purely vegetarian, purely non-vegetarian, or eat both.
– Ordinal Logistic Regression: Ordinal logistic regression applies when there are
three or more ordered categories, such as rating the food of a hotel from 1 to 10,
where 1 is best and 10 is worst.
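The sigmoid function of Equation 3.1 and the binary decision built on it can be sketched as follows. The 0.5 threshold and the function names are assumptions for illustration; the score `t` would normally be a weighted sum of the input features.

```python
import math


def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t)), written to avoid overflow."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def predict_class(t, threshold=0.5):
    """Binary decision from the score t."""
    return 1 if sigmoid(t) >= threshold else 0


print(sigmoid(0))           # 0.5
print(predict_class(2.0))   # 1
print(predict_class(-2.0))  # 0
```

Splitting the computation by the sign of `t` keeps `math.exp` from overflowing for large negative scores while returning the same value as the textbook formula.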
• Naive Bayes Classifier: Naive Bayes is a classification technique in ML that
assigns an input to the most probable class. It is based on Bayes' theorem, which
states that:
\[ P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)} \tag{3.2} \]
Using Bayes' theorem we can find the probability of X given that Y has occurred,
provided that P(Y|X), P(X), and P(Y) are known. For example:
Table 3.1: Prediction Using Naive Bayes
S.No.  Temperature  Weather   Rain  Humid  Play Tennis
1      Hot          Sunny     No    High   Yes
2      Mild         Overcast  Yes   Low    Yes
3      Mild         Rainy     Yes   High   No
4      Cold         Rainy     Yes   High   No
5      Hot          Sunny     No    Low    Yes
This is the kind of input that is given to the Naive Bayes Classifier. The ML
algorithm applies Bayes' theorem to such a table; for example, it can estimate the
probability of playing tennis when the temperature is mild, the weather is rainy,
rain is yes, and humidity is high. Prediction in the ML model works in the same way.
There are three different types of Naive Bayes classifier:
– Multinomial Naive Bayes: This is mainly used for classifying documents into
categories such as sports, politics, or technology, based on counts such as word
frequencies. This method of classification is very widely used in machine learning.
– Bernoulli Naive Bayes: This is very similar to multinomial Naive Bayes, but each
feature takes only two values, yes or no (for example, whether a word is present).
– Gaussian Naive Bayes: This type of Naive Bayes is used when the features take
continuous values, whereas the other two types are used for discrete features.
In this way the Naive Bayes Classifier is used in different forms in ML models for
prediction and analysis.
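The tennis example can be worked through in a few lines. The sketch below applies Bayes' theorem with the naive independence assumption to the five rows of Table 3.1 and the query (Mild, Rainy, Yes, High). Laplace smoothing with `alpha=1.0` is an assumption added here so that a feature value never seen with a class does not force its probability to zero; the function name is illustrative.

```python
from collections import Counter

# Rows of Table 3.1: (Temperature, Weather, Rain, Humid) -> Play Tennis
ROWS = [
    (("Hot", "Sunny", "No", "High"), "Yes"),
    (("Mild", "Overcast", "Yes", "Low"), "Yes"),
    (("Mild", "Rainy", "Yes", "High"), "No"),
    (("Cold", "Rainy", "Yes", "High"), "No"),
    (("Hot", "Sunny", "No", "Low"), "Yes"),
]


def naive_bayes_posterior(rows, query, alpha=1.0):
    """Return {label: P(label | query)} using Laplace smoothing alpha."""
    class_counts = Counter(label for _, label in rows)
    # Distinct values seen for each feature, needed for smoothing.
    values = [{features[i] for features, _ in rows} for i in range(len(query))]
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(label)
        for i, value in enumerate(query):
            matches = sum(1 for f, l in rows if l == label and f[i] == value)
            # Smoothed likelihood P(feature_i = value | label)
            score *= (matches + alpha) / (n_label + alpha * len(values[i]))
        scores[label] = score
    total = sum(scores.values())  # plays the role of P(query)
    return {label: s / total for label, s in scores.items()}


posterior = naive_bayes_posterior(ROWS, ("Mild", "Rainy", "Yes", "High"))
prediction = max(posterior, key=posterior.get)
print(prediction, round(posterior[prediction], 2))  # No 0.91
```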
• Decision Tree Algorithm: The decision tree algorithm is a supervised learning
algorithm that can be used for both classification and regression problems. To
predict a value, we start from the root of the tree, follow the branches that match
the input through successive sub-trees, and finally reach a leaf, which gives the
predicted value. An overview of the decision tree is shown in Figure 3.2.
Figure 3.2: Decision Tree Overview
The decision tree supports classification by sorting the values while traversing down
the tree until the prediction is reached. Building starts from the root node. At each
step down the tree, the entropy and information gain of the candidate splits are
calculated, the split with the lowest entropy (equivalently, the highest information
gain) is selected, and the calculation is repeated for the resulting branches.
\[ \text{Entropy} = -\sum_{i=1}^{c} p_i \log_2 p_i \tag{3.3} \]
\[ \text{Information gain} = \text{entropy(parent)} - [\text{average entropy(children)}] \tag{3.4} \]

When the entropy is high, the class labels in the dataset are highly mixed and it is
hard to predict the answer. Information gain measures how much a split reduces this
entropy; if the information gain is high, the split separates the classes well and it
is easier to predict the answer. These steps are repeated until the decision tree
reaches a leaf node, i.e. the final predicted value. The different types of decision
tree are:
– Categorical Variable Decision Tree: This type of decision tree has a categorical
target, for example a choice between 0 and 1.
– Continuous Variable Decision Tree: This type of decision tree has a continuous
target, which is why it is known as a continuous variable decision tree.
In this way the decision tree algorithm is used in classification and regression problems.
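Equations 3.3 and 3.4 can be checked numerically on the Weather and Play Tennis columns of Table 3.1. In this sketch the average over children is weighted by the number of rows in each child, which is the usual convention and is an assumption about what Equation 3.4 intends; the function names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted average child entropy."""
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    n = len(labels)
    children = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - children


# Weather and Play Tennis columns of Table 3.1.
weather = ["Sunny", "Overcast", "Rainy", "Rainy", "Sunny"]
play = ["Yes", "Yes", "No", "No", "Yes"]
print(round(entropy(play), 3))                    # 0.971
print(round(information_gain(weather, play), 3))  # 0.971
```

Here every Weather value leads to a group with a single class, so the children have zero entropy and the information gain equals the parent entropy; a tree would split on Weather first.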
• Random Forest Algorithm: The random forest algorithm is a supervised learning
algorithm used for both classification and regression. It builds on the decision tree
algorithm: it creates multiple decision trees and combines their predictions, as
shown in Figure 3.3.
Figure 3.3: Random Forest Overview
As shown in the figure, a random forest contains multiple decision trees. Individual
trees may not perform well on their own, so the forest combines their outputs: for
classification the class predicted by the majority of trees is chosen, and for
regression the tree outputs are averaged.
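In a random forest used for classification, the individual tree predictions are typically combined by majority vote. The sketch below shows only that aggregation step, not tree construction; the disease labels and the function name are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the most trees.

    Ties are broken in favour of the label that appears first.
    """
    if not tree_predictions:
        raise ValueError("need at least one prediction")
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical outputs of five trees for one patient.
votes = ["Malaria", "Dengue", "Malaria", "Malaria", "Dengue"]
print(majority_vote(votes))  # Malaria
```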
CHAPTER 4
SYSTEM DESIGN
The whole project can be divided into two parts, the Machine Learning Model and the
User Interface, which are described below.
4.1 Machine Learning Model
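[Block diagram: the dataset of symptoms and diseases passes through data
pre-processing into the machine learning algorithm, which produces the predictive
model; test data passes through test data pre-processing into the predictive model,
which outputs the predicted disease.]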
Figure 4.1: Detailed Design Of Model
Data mining techniques are used in the project to see whether the dataset is suitable
for prediction. The data mining libraries used in the project are:
1. Scipy: This is used for scientific computing in the Python programming language.
It is a collection of mathematical algorithms and convenience functions built on
Numpy. Some of the functionality it provides: Special Functions (special),
Integration (integrate), Optimization (optimize), Interpolation (interpolate),
Fourier Transforms (fft), Signal Processing (signal), Linear Algebra (linalg),
Statistics (stats), File IO (io) etc. In this project the stats (Statistics) module
of this package is primarily used.
2. Sklearn: This stands for Scikit-learn and is built on the Scipy package. It is the
primary package used in this project. It provides an interface for supervised and
unsupervised learning algorithms. The following groups of functionality are provided
by sklearn: Clustering, Cross Validation, Datasets, Dimensionality Reduction,
Ensemble Methods, Feature Extraction, Feature Selection, Parameter Tuning, Manifold
Learning, and Supervised Models.
3. Numpy: It is a library for the Python programming language that adds support for
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. It provides functions for Array
Objects, Routines, Constants, Universal Functions, Packaging etc. In this project it is
used for performing multi-dimensional array operations.
4. Pandas: This library provides high-performance, easy-to-use data structures and data
analysis tools for the Python programming language. It provides functionalities
such as manipulating tables, creating plots, calculating summary statistics, reshaping
tables, combining data from tables, handling time series data and manipulating
textual data. In this project it is used for reading csv files, comparing the null and
alternative hypotheses etc.
5. Matplotlib: It is a library for creating static, animated, and interactive
visualizations in the Python programming language. In this project it is used for
creating simple plots and subplots, and its object is used alongside the seaborn
object to call functions such as show, grid etc. The %matplotlib inline magic command
is also used so that plots are displayed directly below the notebook cells that
create them.
6. Seaborn: It provides an interface for making statistical graphs that are more
attractive and informative. It is based on the matplotlib module. These graphs are
easier to interpret. It provides different presentation formats for data such as
Relational, Categorical, Distribution, Regression and Multiples, and controls the
style and color of all these types. In this project it is used for creating complex
plots that use various attributes.
7. Warnings: The warnings module is used for handling any warnings that may arise when
the program is running. The base Warning class is a subclass of Exception.
8. Stats: This module provides statistics functionality in the Python programming
language and is included in the scipy package. It is not imported as a whole; rather,
the required functions are imported as and when required, e.g. for Measures of
Central Tendency and Measures of Variability. The functions used range from simple
concepts like mean, median, mode, variance, percentiles, skew, kurtosis, range,
Cumulative Distribution Function (CDF), Probability Density Function (PDF) and stats
(used for returning mean, variance, skew, kurtosis), to complex hypothesis tests like
chi2_contingency (used for the chi-square test), ttest_ind (used for performing the
t-test) and ks_2samp (used for performing the Kolmogorov-Smirnov test).
9. Model selection: This module helps in choosing the best model. It is part of the
sklearn package. It provides functions such as train_test_split, which splits the
data into training and test sets so that the model is evaluated on data it has not
seen during training, and cross_val_score, which computes the accuracy of a model
using cross-validation. It also includes techniques for obtaining a more reliable
accuracy estimate, such as K-Fold cross-validation, which is used in this project.
It is used together with the models from modules such as linear_model and svm.
10. Naive Bayes: It is the module that implements the Naive Bayes algorithm. It is also
part of the sklearn package. In this project the multinomial variant is used, and it
is one of the most crucial algorithms in the project.
11. Tree: It is the module that contains the functionality associated with decision
trees. It is included in the sklearn package. The most important algorithm in this
module is the Decision Tree Classifier, which gives very high accuracy and is one of
the most used algorithms for projects like this.
12. Linear model: It is the module that implements the regression algorithms. It is
also included in the sklearn package. It is used in this project to implement Logistic
Regression.
13. Ensemble: It is the module that includes the ensemble methods. It is defined in the
sklearn package. As ensemble techniques are used to improve the accuracy of the
models, the Gradient Boosting Classifier and the Random Forest Classifier (a very
important algorithm that provides very high accuracy) are used in this project.
14. Metrics: It is the module used for measuring the accuracy of the model. The function
accuracy_score is the most basic of them all. It is included in the sklearn package.
15. Joblib: It is a package that provides lightweight pipelining in the Python
programming language. Older versions of sklearn bundled it in the externals package;
it is now installed as a separate package. It provides transparent disk-caching of
output values, easy simple parallel computing, and logging and tracing of the
execution. In this project it is used to save the trained model to disk so that the
user interface can load it and perform predictions accordingly.
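The K-Fold evaluation mentioned under Model selection can be sketched as follows. This is a minimal illustration and not the project's code: the synthetic symptom matrix, the label rule, the number of folds and the random seeds are all assumptions.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the symptom dataset: 200 patients, 10 binary symptoms.
rng = np.random.default_rng(42)
X = rng.integers(0, 2, size=(200, 10))
y = X[:, 0] + 2 * X[:, 1]  # four hypothetical disease labels (0..3)

# Five folds: each fold is used once for validation and four times for training.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(MultinomialNB(), X, y, cv=cv, scoring="accuracy")

print("accuracy per fold:", np.round(fold_scores, 3))
print("mean accuracy:", round(float(fold_scores.mean()), 3))
```

Reporting the mean over the folds gives a steadier estimate than a single train/test split, because every sample is used for validation exactly once.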
All these libraries are used to create a model with the help of the dataset. The model
created by applying all these data mining techniques is saved as a binary file with the
help of the joblib library. The file is not human-readable and is saved without an
extension, which discourages accidental modification; since joblib files are based on
pickle, the file should still be loaded only from a trusted source. This binary file is
then used by the UI.
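Saving and reloading a trained model with joblib might look like the sketch below. The file name disease_model, the temporary directory and the toy training data are hypothetical; only joblib.dump and joblib.load are taken from the library itself.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: 100 patients, 8 binary symptoms, 4 disease labels.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(100, 8))
y = X[:, 0] + 2 * X[:, 1]

model = DecisionTreeClassifier(random_state=1).fit(X, y)

# Hypothetical file name without an extension, written to a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "disease_model")
joblib.dump(model, model_path)

# Later (for example inside the UI process) the same file is loaded again.
restored = joblib.load(model_path)
same_predictions = bool(np.array_equal(model.predict(X), restored.predict(X)))
print("restored model gives identical predictions:", same_predictions)
```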
4.2 User Interface
The UI is developed with the help of Flask. The ML model developed with the data
mining techniques and the ML algorithms is used in the UI. The same version of the
joblib library that was used to create the model is installed in the Python environment
in which Flask runs, after which the binary file of the model can be loaded and used
for the prediction of disease.
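A Flask endpoint that serves predictions could be sketched as below. The route name /predict, the JSON payload format, the four-symptom list and the tiny stand-in model are all assumptions for illustration; in the project the model would instead be loaded from the joblib binary file.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.naive_bayes import MultinomialNB

# Hypothetical symptom columns, in the order the model was trained on.
SYMPTOMS = ["fever", "cough", "headache", "nausea"]

# Stand-in model trained on toy rows; the project would call joblib.load(...) here.
X = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]])
y = np.array(["flu", "flu", "migraine", "migraine"])
model = MultinomialNB().fit(X, y)

app = Flask(__name__)


@app.route("/predict", methods=["POST"])
def predict():
    # Expected body (assumed format): {"symptoms": ["fever", "cough"]}
    payload = request.get_json(silent=True) or {}
    selected = set(payload.get("symptoms", []))
    # Encode the selected symptoms as a 0/1 row in training column order.
    features = np.array([[1 if name in selected else 0 for name in SYMPTOMS]])
    disease = model.predict(features)[0]
    return jsonify({"disease": str(disease)})
```

During development the route can be exercised without starting a server by using app.test_client() and posting a JSON body to /predict.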
Methodology
[Flow diagram: Input Symptoms, Data preprocessing, Decision tree / Random forest /
Naïve Bayes, Output Disease]
Fig. 1. PREDICTION MODEL
A. Input (Symptoms):
While designing the model we have assumed that the user has a clear idea about the
symptoms he or she is experiencing. The prediction model developed considers 95
symptoms, from which the user can select the symptoms he or she has as the input.
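One common way to turn the selected symptoms into model input is a fixed-length 0/1 vector with one position per known symptom. The sketch below assumes that encoding; the five symptom names and the helper encode_symptoms are hypothetical stand-ins for the 95 symptoms used in the project.

```python
import numpy as np

# Hypothetical subset of symptom names; the project uses 95 of them.
SYMPTOM_COLUMNS = ["itching", "skin_rash", "cough", "high_fever", "headache"]


def encode_symptoms(selected, columns=SYMPTOM_COLUMNS):
    """Return a 0/1 vector with a 1 for every symptom the user selected."""
    unknown = [name for name in selected if name not in columns]
    if unknown:
        raise ValueError(f"Unknown symptoms: {unknown}")
    chosen = set(selected)
    return np.array([1 if name in chosen else 0 for name in columns], dtype=int)


vector = encode_symptoms(["cough", "high_fever"])
print(vector)
```

Rejecting unknown names early keeps a misspelled symptom from being silently treated as absent.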
B. Data preprocessing:
The data mining technique that transforms or encodes the raw data into a form which
can be easily interpreted by the algorithm is called data preprocessing. The
preprocessing techniques used in the presented work are:
• Data Cleaning: Data is cleansed through processes such as filling in missing
values, thus resolving the inconsistencies in the data.
• Data Reduction: The analysis becomes hard when dealing with a huge database.
Hence, we eliminate those independent variables (symptoms) which might have
little or no impact on the target variable (disease). In the present work, 95 of 132
symptoms closely related to the diseases are selected.
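The two preprocessing steps can be sketched as follows. The report does not state how missing values were filled or how the 95 symptoms were chosen, so both choices here are assumptions: missing entries are filled with 0 (symptom absent) and the reduction uses a chi-square ranking via SelectKBest. The synthetic table and column names are also hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical table: 150 patients, 12 binary symptom columns and a disease label.
rng = np.random.default_rng(7)
symptom_names = [f"symptom_{i}" for i in range(12)]
data = pd.DataFrame(
    rng.integers(0, 2, size=(150, 12)).astype(float), columns=symptom_names
)
data["disease"] = (data["symptom_0"] + 2 * data["symptom_1"]).astype(int)
data.iloc[::10, 3] = np.nan  # simulate missing entries in symptom_3

# Data cleaning (assumed rule): treat a missing entry as "symptom absent".
features = data[symptom_names].fillna(0).astype(int)
target = data["disease"]

# Data reduction (assumed method): keep the 8 symptoms most related to the label.
selector = SelectKBest(score_func=chi2, k=8).fit(features, target)
kept = [name for name, keep in zip(symptom_names, selector.get_support()) if keep]
reduced = features[kept]

print("kept symptoms:", kept)
```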
C. Models selected:
The system is trained to predict the diseases using three algorithms:
• Decision Tree Classifier
• Random Forest Classifier
• Naïve Bayes Classifier
A comparative study is presented at the end of the work, analyzing the performance of
each algorithm on the considered data.
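A comparison of the three classifiers on a held-out test set might be set up as in the sketch below. The synthetic data, the 70/30 split and the random seeds are assumptions, so the printed accuracies say nothing about the results obtained in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: 300 patients, 10 binary symptoms, 4 disease labels.
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(300, 10))
y = X[:, 0] + 2 * X[:, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=3),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=3),
    "Naive Bayes": MultinomialNB(),
}

# Train each model on the same split and record its test accuracy.
results = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, clf.predict(X_test))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Using one shared split keeps the comparison fair, since every model sees exactly the same training and test rows.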
D. Output (diseases):
Once the system is trained with the training set using the mentioned algorithms, a rule
set is formed. When the user's symptoms are given as an input to the model, those
symptoms are processed according to the rule set developed, thus making
classifications and predicting the most likely disease.
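Returning the most likely disease together with its probability can be sketched with predict_proba. The helper most_likely_disease, the four symptom columns and the toy training rows are hypothetical and serve only to show the mechanics.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy training rows over four hypothetical symptoms: fever, cough, headache, nausea.
X = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]])
y = np.array(["flu", "flu", "migraine", "migraine"])
model = MultinomialNB().fit(X, y)


def most_likely_disease(fitted_model, symptom_vector):
    """Return (disease name, probability) for the highest-probability class."""
    probabilities = fitted_model.predict_proba([symptom_vector])[0]
    best = int(np.argmax(probabilities))
    return str(fitted_model.classes_[best]), float(probabilities[best])


disease, probability = most_likely_disease(model, [1, 1, 0, 0])
print(f"{disease} (probability {probability:.2f})")
```

The probability is the model's own estimate for the given symptoms and should be shown to the user as an indication, not as a diagnosis.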
CHAPTER 6
CONCLUSION
This project proposed a system that predicts disease based on previous cases in
the medical history and connects the patients registered on the network with the
best doctors in the relevant specialized field, sparing the patient the trouble of
first visiting a general physician. A disease prediction web application based on
machine learning algorithms was successfully built. Support Vector Classifier,
Naive Bayes Classifier, and Random Forest Classifier were used to train three
different models, which were then combined to create a more accurate and
effective system for classifying patient data. This matters because medical data
is growing at an exponential rate, and existing data must be processed in order
to predict the exact disease based on symptoms. By providing the patient's
symptoms as input, we were able to obtain an accurate general prediction of
disease risk.
CHAPTER 7
FUTURE ENHANCEMENT
In today's world most data is computerized, but it is distributed and not utilized
properly. By analysing the data that is already present, we can also discover unknown
patterns. The primary motive of this project is the prediction of diseases with a high
rate of accuracy. For predicting the disease, we can use machine learning algorithms
such as logistic regression and naive Bayes, as implemented in sklearn. The future scope
of the project is the prediction of diseases using advanced techniques and algorithms
with lower time complexity.
CAD (computer-aided diagnosis) is increasingly beneficial, as such systems are sometimes
better diagnosticians than doctors. Machine learning and its different branches are used
in cancer detection as well, where they assist in making decisions on critical cases or
on therapies. Artificial intelligence plays an important role in the development of many
health-related procedures and methods, and is now common in surgery, for example in
robotic surgery. Since the population is growing, we need technology which can help us
meet the expectations of patients: effective cures, better health, and smooth, easily
approachable access to health care.
REFERENCES
[1] Dahiwade, D., Patle, G., and Meshram, E. (2019). “Designing disease prediction model
using machine learning approach.” 2019 3rd International Conference on Computing
Method- ologies and Communication (ICCMC), IEEE. 1211–1215.
[2] Fitriyani, N. L., Syafrudin, M., Alfian, G., and Rhee, J. (2019). “Development of disease
prediction model based on ensemble learning approach for diabetes and hypertension.” IEEE
Access, 7, 144777–144789.
[3] Hong, W., Xiong, Z., Zheng, N., and Weng, Y. (2019). “A medical-history-based potential
disease prediction algorithm.” IEEE Access, 7, 131094–131101.
[4] Kohli, P. S. and Arora, S. (2018). “Application of machine learning in disease predic-
tion.” 2018 4th International Conference on Computing Communication and Automation
(ICCCA), IEEE. 1–4.
[5] Krishnan.J, M. “Prediction of heart disease using machine learning algorithms.
[6] Patil, M., Lobo, V. B., Puranik, P., Pawaskar, A., Pai, A., and Mishra, R. (2018). “A
proposed model for lifestyle disease prediction using support vector machine.” 2018 9th
In- ternational Conference on Computing, Communication and Networking Technologies
(IC- CCNT), IEEE. 1–6.
[7] Shobana, V. and Kumar, N. (2017). “A personalized recommendation engine for predic-
tion of disorders using big data analytics.” 2017 International Conference on Innovations
in Green Energy and Healthcare Technologies (IGEHT), IEEE. 1–4.
[8] A. K. M Sazzadur Rahman, M. Mehedi Hasan, S. Asaduzzaman, M. Asaduzzaman, and

S. Akhter Hossain, “An analysis of computational intelligence techniques for diabetes
prediction Machine Learning View project An analysis of computational intelligence
techniques for diabetes prediction,” Int. J. Eng. &Technology, vol. 7, no. 4, pp. 6229–
6232, 2018.
[9] G. H. Tang, A. B. M. Rabie, and U. Hägg, “Indian hedgehog: A Mechanotransduction

Mediator in Condylar Cartilage,” J. Dent. Res., vol. 83, no. 5, pp. 434– 438, 2004
[10] Y. Karaca and C. Cattani, “7. Naive Bayesian classifier,” Computer Methods Data
Analysis
[11] Purushottam, K. Saxena, and R. Sharma, “Efficient Heart Disease Prediction

System,” Procedia Computer Science, vol. 85, pp. 962–969, 2016
[12] K. Deepika and S. Seema, “Predictive analytics to prevent and control chronic
diseases,” Proc. 2016 2nd Int. Conf. Appl. Theor. Comput. Commun. Technology
iCATccT 2016, no. January 2016, pp. 381386, 2017
[13] “Analysis and Prediction of Various Heart Diseases Using DNFS Techniques,” vol.
2, no. 1, pp. 1–7, 2015. Proceedings of the International Conference on Electronics and
Sustainable Communication Systems (ICESC 2020) IEEE Xplore Part
NumberCFP20V66-ART; ISBN: 978-1-7281-4108-4978
