SUMMARY
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
In this section, we will load and view the CSV file and its contents.
Filename: heart.csv
Contents:
We have 14 attributes/features including the target, such as age, sex, cholesterol level,
chest pain type (cp), exercise-induced angina (exang), oldpeak, maximum heart rate (thalach), fasting blood sugar (fbs), slope, thal, etc.
dataframe = pd.read_csv("/content/heart.csv")
dataframe.head(10)
Output:
Inference: The dataset contains three types of data: continuous, ordinal, and binary.
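To check this programmatically, a small sketch (assuming dataframe is loaded as above) is to count distinct values per column: binary features have two, ordinal ones a handful, and continuous ones many.
# Count distinct values per column to separate binary, ordinal,
# and continuous features
print(dataframe.nunique().sort_values())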
Data Analysis
dataframe.info()
Output:
Now, let us look at whether the dataset has null values or not.
dataframe.isna().sum()
Output:
Inference: From this output, our data contains no null values and no duplicates, so the data is clean and ready for analysis.
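The isna() check above covers missing values; to verify the duplicate claim as well, a one-line sketch:
# Count fully duplicated rows in the dataset
print(dataframe.duplicated().sum())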
Correlation Matrix
Visualizing the correlations between the data features will reveal which ones matter most for the target.
plt.figure(figsize=(15, 10))
sns.heatmap(dataframe.corr(), linewidth=.01, annot=True, cmap="winter")
plt.savefig('correlationfigure')  # save before plt.show(), which clears the figure
plt.show()
Output:
Inference:
From the above heatmap, we can see that chest pain (cp) and the target have a positive
correlation: patients with higher chest-pain values have a greater chance of having
heart disease. In addition to chest pain, thalach, slope, and restecg have a positive correlation
with the target. Exercise-induced angina (exang) and the target have a negative correlation, which makes sense:
when we exercise, the heart requires more blood, but narrowed arteries slow down the blood
flow, so angina during exercise signals disease. Besides exang, ca, oldpeak, and thal also have a negative correlation with the target.
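To rank these relationships numerically instead of reading them off the heatmap, a quick sketch:
# Correlation of every feature with the target, strongest positive first
print(dataframe.corr()['target'].sort_values(ascending=False))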
Let us see each feature's distribution with the help of histograms.
dataframe.hist(figsize=(12,12))
plt.savefig('featuresplot')
Output:
Train-Test Split
from sklearn.model_selection import train_test_split

# Separate the features from the target column
X = dataframe.drop('target', axis=1)
y = dataframe['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=40)
We split the whole dataset into a train set and a test set, with 75% for training and 25% for testing.
The train set is used to fit our classifiers, and the test set is used to evaluate how well the trained models generalize.
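As a side note, if the classes were imbalanced, a stratified split would keep the 0/1 ratio identical in both sets. A minimal sketch (the stratify argument is an addition, not part of the original split):
from sklearn.model_selection import train_test_split

# Stratified variant of the same split: preserves the target class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40, stratify=y)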
Algorithm Implementation
1. Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

lr = LogisticRegression(warm_start=False)
model1=lr.fit(X_train,y_train)
prediction1=model1.predict(X_test)
cm=confusion_matrix(y_test,prediction1)
cm
sns.heatmap(cm, annot=True,cmap='winter',linewidths=0.3,
linecolor='black',annot_kws={"size": 20})
# In scikit-learn's confusion matrix, rows are true labels and columns are
# predictions, so cm[0][0] counts true negatives and cm[1][1] true positives.
TN = cm[0][0]
TP = cm[1][1]
FN = cm[1][0]
FP = cm[0][1]
Output:
print(classification_report(y_test, prediction1))
Inference: From the above report, the accuracy of the Logistic Regression classifier is
about 80%.
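Since we extracted TP, TN, FP, and FN above, the headline metrics can also be reproduced by hand from the confusion matrix; a short sketch:
# Accuracy, precision, and recall computed directly from the matrix entries
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print(round(accuracy * 100, 2), round(precision * 100, 2), round(recall * 100, 2))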
2. Decision Tree
from sklearn.tree import DecisionTreeClassifier

tree_model = DecisionTreeClassifier(max_depth=5, criterion='entropy')
# Note: fitting on the full (X, y) means the model also sees the test rows,
# so the test accuracy reported below is optimistic.
m = tree_model.fit(X, y)
prediction=m.predict(X_test)
cm= confusion_matrix(y_test,prediction)
sns.heatmap(cm, annot=True,cmap='winter',linewidths=0.3,
linecolor='black',annot_kws={"size": 20})
print(classification_report(y_test, prediction))
TN = cm[0][0]
TP = cm[1][1]
FN = cm[1][0]
FP = cm[0][1]
Output:
Inference: From the above report, the accuracy of the Decision Tree classifier is about
92% (an optimistic figure, since this model was fitted on the full dataset, including the test rows).
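Because this tree was fitted on the full dataset, a leakage-free estimate is worth checking. One common sketch is k-fold cross-validation on the training split only:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 5-fold cross-validation never scores the model on rows used for fitting
scores = cross_val_score(DecisionTreeClassifier(max_depth=5, criterion='entropy'),
                         X_train, y_train, cv=5)
print(round(scores.mean() * 100, 2))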
3. Random Forest
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=500, criterion='entropy', max_depth=8,
                             min_samples_split=5)
model3 = rfc.fit(X_train, y_train)
prediction3 = model3.predict(X_test)
cm3 = confusion_matrix(y_test, prediction3)
sns.heatmap(cm3, annot=True,cmap='winter',linewidths=0.3,
linecolor='black',annot_kws={"size": 20})
TN = cm3[0][0]
TP = cm3[1][1]
FN = cm3[1][0]
FP = cm3[0][1]
print(round(accuracy_score(y_test, prediction3) * 100, 2))
Output:
Let us see the classification report for Random Forest Classifier:
print(classification_report(y_test, prediction3))
Output:
Inference: From the above report, the accuracy of the Random Forest classifier is
about 80%.
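Random Forests also expose which features drove the predictions. A quick sketch, assuming model3 is the fitted forest from above:
import pandas as pd

# Rank features by the forest's impurity-based importance scores
importances = pd.Series(model3.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))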
4. Support Vector Machines (SVM)
from sklearn.svm import SVC

svm = SVC(C=12, kernel='linear')
model4 = svm.fit(X_train, y_train)
prediction4=model4.predict(X_test)
cm4= confusion_matrix(y_test,prediction4)
sns.heatmap(cm4, annot=True,cmap='winter',linewidths=0.3,
linecolor='black',annot_kws={"size": 20})
TN = cm4[0][0]
TP = cm4[1][1]
FN = cm4[1][0]
FP = cm4[0][1]
Output:
Let us see the classification report of SVM:
print(classification_report(y_test, prediction4))
Output:
Inference: From the above report, the accuracy of the Support Vector Machine classifier
is about 82%.
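SVMs are sensitive to feature scale, so accuracy often improves when inputs are standardized first. A minimal sketch using a pipeline (the scaling step is an addition, not part of the original code):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize each feature, then fit the same linear SVM
scaled_svm = make_pipeline(StandardScaler(), SVC(C=12, kernel='linear'))
scaled_svm.fit(X_train, y_train)
print(round(scaled_svm.score(X_test, y_test) * 100, 2))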
We compared four machine learning algorithms: Logistic Regression,
Random Forest, Support Vector Machines, and Decision Trees. From the final results, we got
Logistic Regression at 80%, Random Forest at 80%, Support Vector Machines at 82%, and
Decision Trees at 92%. We can conclude that the Decision Tree algorithm performed best on this dataset, keeping in mind that its score is inflated by having been fitted on the full data.
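A small sketch that puts the four quoted scores side by side (the numbers are the ones reported above, not recomputed):
# Reported test accuracies, sorted for a quick comparison
results = {'Logistic Regression': 80, 'Random Forest': 80,
           'SVM': 82, 'Decision Tree': 92}
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {acc}%")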
Now, we can apply the best-performing algorithm (i.e., the Decision Tree classifier) to new
patient records and check whether the model produces the expected output.
CASE 1 – For Affected Data
input = (63, 3, 145, 233, 150, 2.3, 0)
input_as_numpy = np.asarray(input)
# Reshape to a single sample; the model expects one value per feature it was
# trained on, so this tuple must match the training feature count.
input_reshaped = input_as_numpy.reshape(1, -1)
pre1 = tree_model.predict(input_reshaped)
if (pre1[0] == 1):
    print("The person is likely to have heart disease")
else:
    print("The person is unlikely to have heart disease")
Output:
CASE 2 – For Normal Data
input = (72, 1, 125, 200, 150, 1.3, 1)
input_as_numpy = np.asarray(input)
input_reshaped = input_as_numpy.reshape(1, -1)
pre1 = tree_model.predict(input_reshaped)
if (pre1[0] == 1):
    print("The person is likely to have heart disease")
else:
    print("The person is unlikely to have heart disease")
Output:
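To make ad-hoc predictions like these less error-prone, one option is a small helper that builds a one-row DataFrame with the training column names; a sketch (the helper name and messages are illustrative):
import pandas as pd

def predict_patient(values):
    # values must contain one entry per column the model was trained on
    row = pd.DataFrame([values], columns=X.columns)
    return "Heart disease" if tree_model.predict(row)[0] == 1 else "Normal"

Using a DataFrame instead of a bare array keeps the feature order explicit and avoids scikit-learn's feature-name warnings.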
Conclusion
Finally, we can conclude that real-time predictors are becoming essential in the healthcare sector.
In this project, we were able to predict heart disease from a patient's
data using the Decision Tree algorithm, thereby making reasonably accurate heart disease
predictions with machine learning. I hope you enjoyed the blog. Let us know your thoughts in the comments!