Results

Decision Tree results: supervised learning

1. `dtModel = tree.DecisionTreeClassifier()`: This line initializes a Decision Tree Classifier model. It creates an instance of the `DecisionTreeClassifier` class from the `tree` module of scikit-learn, the popular Python machine learning library.
2. `dtModel.fit(features_train, ifMalware_train)`: Here, the model is trained on the training data.
`features_train` represents the feature data (input) used for training the model, and
`ifMalware_train` contains the corresponding labels (output) for each data point. The `fit()`
function is used to train the classifier on this data.
3. `dtPredict = dtModel.predict(features_test)`: After training, the model is used to make
predictions on the test data, which is stored in `features_test`. The predicted labels for the test
data are stored in `dtPredict`.
4. `(ifMalware_test != dtPredict).sum()`: This line compares the predicted labels `dtPredict`
with the actual labels of the test data, `ifMalware_test`. The expression `(ifMalware_test !=
dtPredict)` creates a boolean array indicating which entries were mislabeled (predictions that
do not match the true labels), and `.sum()` then counts the number of mislabeled entries.
5. `print("Number of mislabeled out of a total of %d test entries: %d" % (features_test.shape[0],
(ifMalware_test != dtPredict).sum()))`: This line prints the number of mislabeled entries out of
the total number of test entries. It gives an idea of how well the model performed on the test
data.
6. `successRate = 100 * f1_score(ifMalware_test, dtPredict, average='micro')`: Here, the F1-score is calculated as a measure of the model's performance. The F1-score is a metric that combines precision and recall to assess the model's accuracy. The `f1_score()` function comes from scikit-learn, and `average='micro'` specifies the micro-average, which aggregates the contributions of all samples and, for single-label classification, is equal to overall accuracy.
7. `print("The Success Rate was calculated as % : " + str(successRate) + " with the Decision
Tree.")`: The final line of code prints the success rate of the model using the F1-score. It shows
how well the Decision Tree Classifier performed on the test data.
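Taken together, the steps above amount to a short script along the following lines (a sketch, assuming `features_train`, `features_test`, `ifMalware_train`, and `ifMalware_test` have already been prepared):

```python
from sklearn import tree
from sklearn.metrics import f1_score

# Initialize and train the Decision Tree Classifier on the training data
dtModel = tree.DecisionTreeClassifier()
dtModel.fit(features_train, ifMalware_train)

# Predict labels for the held-out test data
dtPredict = dtModel.predict(features_test)

# Count and report the mislabeled test entries
print("Number of mislabeled out of a total of %d test entries: %d"
      % (features_test.shape[0], (ifMalware_test != dtPredict).sum()))

# Micro-averaged F1-score reported as a percentage
successRate = 100 * f1_score(ifMalware_test, dtPredict, average='micro')
print("The Success Rate was calculated as % : " + str(successRate) + " with the Decision Tree.")
```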
The provided result indicates that out of 2635 test entries, the Decision Tree Classifier
mislabeled 537 of them. The success rate, which is calculated as the F1-score, is approximately
79.62%.

The F1-score is a metric commonly used in binary classification problems that combines precision and recall. It ranges from 0 to 1, with higher values indicating better performance. In this case, a success rate of around 79.62% suggests that the Decision Tree model achieved a reasonable level of accuracy on the test data.
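As a side note, for single-label classification the micro-averaged F1-score is mathematically the same as plain accuracy, so the success rate reported here is simply the fraction of correctly classified test entries expressed as a percentage:

```python
from sklearn.metrics import accuracy_score, f1_score

# Micro-averaged F1 and accuracy both reduce to
# (number of correct predictions) / (total number of predictions).
micro_f1 = f1_score(ifMalware_test, dtPredict, average='micro')
accuracy = accuracy_score(ifMalware_test, dtPredict)
assert abs(micro_f1 - accuracy) < 1e-12
```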

Random Forest results: supervised learning


1. `rfModel = RandomForestClassifier()`: This line initializes a Random Forest Classifier
model. It creates an instance of the RandomForestClassifier class from the scikit-learn library.
2. `rfModel.fit(features_train, ifMalware_train)`: The model is trained on the training data.
`features_train` represents the feature data (input) used for training the model, and
`ifMalware_train` contains the corresponding labels (output) for each data point. The `fit()`
function is used to train the random forest classifier on this data.
3. `rfPredict = rfModel.predict(features_test)`: After training, the model is used to make
predictions on the test data, which is stored in `features_test`. The predicted labels for the test
data are stored in `rfPredict`.
4. `(ifMalware_test != rfPredict).sum()`: This line compares the predicted labels `rfPredict` with
the actual labels of the test data, `ifMalware_test`. The expression `(ifMalware_test !=
rfPredict)` creates a boolean array indicating which entries were mislabeled (predictions that
do not match the true labels), and `.sum()` then counts the number of mislabeled entries.
5. `print("Number of mislabeled out of a total of %d test entries: %d" % (features_test.shape[0],
(ifMalware_test != rfPredict).sum()))`: This line prints the number of mislabeled entries out of
the total number of test entries, which gives an idea of how well the Random Forest model
performed on the test data.
6. `successRate = 100 * f1_score(ifMalware_test, rfPredict, average='micro')`: As with the previous model, the F1-score is calculated as a measure of the model's performance. The F1-score is a metric that combines precision and recall to assess the model's accuracy. The `f1_score()` function comes from scikit-learn, and `average='micro'` specifies the micro-average, which aggregates the contributions of all samples and, for single-label classification, is equal to overall accuracy.
7. `print("The Success Rate was calculated as % : " + str(successRate) + " with the Random
Forest")`: The final line of code prints the success rate of the Random Forest model using the
F1-score. It shows how well the Random Forest Classifier performed on the test data.
The provided result indicates that out of 2635 test entries, the Random Forest Classifier
mislabeled 724 of them. The success rate, which is calculated as the F1-score, is approximately
72.52%.

The performance of the Random Forest Classifier is somewhat lower than that of the Decision Tree Classifier, which achieved a success rate of around 79.62%. In this case, the Random Forest model misclassified more data points on the test set, resulting in a lower success rate.
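The corresponding code differs from the Decision Tree sketch above only in the classifier that is instantiated (again a sketch, reusing the same training and test variables):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

# Train a Random Forest on the same training data and evaluate it on the test set
rfModel = RandomForestClassifier()
rfModel.fit(features_train, ifMalware_train)
rfPredict = rfModel.predict(features_test)

print("Number of mislabeled out of a total of %d test entries: %d"
      % (features_test.shape[0], (ifMalware_test != rfPredict).sum()))

successRate = 100 * f1_score(ifMalware_test, rfPredict, average='micro')
print("The Success Rate was calculated as % : " + str(successRate) + " with the Random Forest")
```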

k-Nearest Neighbors (KNN) results: supervised learning

This part of the code applies the k-Nearest Neighbors (KNN) algorithm to a binary classification task, namely classifying samples as 'legitimate' or 'malicious' (represented by 0 or 1) based on various features. The code also performs hyperparameter tuning using cross-validation to find the optimal value of the parameter 'k' (the number of neighbors) for the KNN model.
1. Data Preparation: The code loads the data and separates the feature matrix 'X' and the target
vector 'y'. It drops the 'Name', 'md5', and 'legitimate' columns from the feature matrix, as they
are not used as input features for the model.
2. Train-Test Split: The data is split into training and test sets using the `train_test_split`
function from scikit-learn. The training set (X_train and y_train) will be used to train the KNN
model, and the test set (X_test and y_test) will be used to evaluate its performance.
3. Feature Scaling: The feature matrices X_train and X_test are standardized using
`StandardScaler` from scikit-learn. This step ensures that all features are on the same scale,
which can improve the performance of distance-based algorithms like KNN.
4. KNN Model Training: The KNN classifier is initialized with `n_neighbors=7`,
`metric='minkowski'`, and `p=2` (Euclidean distance). The model is then trained on the
standardized training data using `fit`.
5. KNN Model Evaluation: The trained model is used to make predictions on the test data
(X_test), and the predicted labels are stored in 'y_pred'.
6. Confusion Matrix: The code calculates the confusion matrix using `confusion_matrix` from
scikit-learn. The confusion matrix helps evaluate the performance of the classifier by showing
the counts of true positives, true negatives, false positives, and false negatives.
7. Hyperparameter Tuning: The code performs hyperparameter tuning for the KNN model to
find the optimal number of neighbors (k). It iterates through a list of odd numbers from 1 to 49
(inclusive) and uses 20-fold cross-validation to calculate the mean accuracy for each 'k'. The
optimal value of 'k' is chosen based on the one that yields the lowest misclassification error.
8. Visualization: The code then plots the misclassification error vs. the number of neighbors (k)
to visualize the relationship and identify the optimal 'k' with the lowest error.
9. Print Results: The code prints the optimal number of neighbors (k) and the success rate (mean
accuracy) obtained from cross-validation with the optimal 'k' value.
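Steps 1 through 6 above correspond to a pipeline along these lines (a sketch: the file name, the use of the 'legitimate' column as the target, and the test-set proportion are assumptions not stated in the walkthrough):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# 1. Data preparation: drop identifier/label columns from the feature matrix
data = pd.read_csv('data.csv')                      # assumed file name
X = data.drop(['Name', 'md5', 'legitimate'], axis=1).values
y = data['legitimate'].values                       # assumed target column

# 2. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)            # assumed split ratio

# 3. Feature scaling, important for distance-based algorithms such as KNN
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# 4.-5. Train the KNN classifier and predict labels for the test set
classifier = KNeighborsClassifier(n_neighbors=7, metric='minkowski', p=2)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# 6. Confusion matrix: true/false positives and negatives
print(confusion_matrix(y_test, y_pred))
```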
The KNN algorithm is a simple yet effective classification algorithm. The hyperparameter
tuning using cross-validation helps identify the most suitable value for 'k,' which can
significantly impact the model's performance. By plotting the misclassification error vs. 'k', the
code visually demonstrates how the choice of 'k' affects the accuracy of the model.
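A sketch of the tuning and plotting steps described above (the candidate values of 'k' and the 20 folds follow the walkthrough; `X_train` and `y_train` come from the previous sketch, and the plotting details are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 7. Hyperparameter tuning: 20-fold cross-validation over odd values of k from 1 to 49
neighbors = list(range(1, 50, 2))
cv_scores = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=20, scoring='accuracy')
    cv_scores.append(scores.mean())

# 8. Visualization: misclassification error (1 - mean accuracy) vs. number of neighbors
errors = [1 - s for s in cv_scores]
plt.plot(neighbors, errors)
plt.xlabel('Number of neighbors (k)')
plt.ylabel('Misclassification error')
plt.show()

# 9. Report the optimal k and the corresponding cross-validated success rate
best_k = neighbors[errors.index(min(errors))]
print('The optimal number of neighbors is', best_k)
print('Success rate: %.2f%%' % (100 * max(cv_scores)))
```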
The provided success rate of approximately 96.77% with 2 neighbors for the K-Nearest
Neighbors (KNN) model suggests that the model achieved a high level of accuracy on the
dataset. This means that the model correctly classified about 96.77% of the samples in the test
set.
More details:
1. High Accuracy: A success rate of 96.77% indicates that the KNN model performed
exceptionally well on the test data, demonstrating a strong ability to distinguish between
'legitimate' and 'malicious' instances based on the provided features.
2. Appropriate Choice of 'k': The code used 2 neighbors for the KNN model, which appears to
be a suitable choice for this dataset. In some cases, using a smaller 'k' value can be beneficial,
especially if the decision boundaries are complex and data points are tightly clustered.
However, using a smaller 'k' can also make the model sensitive to noise or outliers in the data.
3. Robustness through Cross-Validation: The success rate is obtained through cross-validation,
which helps provide a more reliable estimate of the model's performance. By performing 20-
fold cross-validation and calculating the average accuracy over the folds, the success rate is less
likely to be affected by the specific data split into training and test sets.
4. Model Limitations: While a high success rate is impressive, it is essential to be aware of the
limitations of any machine learning model. In some cases, the success rate might be misleading,
especially if the dataset is imbalanced, or if the features do not capture all the underlying
patterns in the data.
To compare the three models:
1. Decision Tree Classifier:
- Success Rate: Approximately 79.62%
- Comment: The Decision Tree model achieved a moderate success rate, indicating that it
performed reasonably well on the test data. It is a simple and interpretable model, but it may
have limitations in capturing complex relationships in the data.
2. Random Forest Classifier:
- Success Rate: Approximately 72.52%
- Comment: The Random Forest model achieved a lower success rate compared to the
Decision Tree model. Random Forests are more complex ensembles of decision trees and are
often more robust against overfitting. However, in this case, it may not have generalized as well
to the test data, possibly due to suboptimal hyperparameter settings or data-specific challenges.
3. K-Nearest Neighbors (KNN) Classifier:
- Success Rate: Approximately 96.77% with 2 neighbors
- Comment: The KNN model achieved the highest success rate among the three models. KNN
is a simple instance-based learning algorithm and can be effective when there are clear patterns
or clusters in the data. In this case, it appears to have performed remarkably well in
distinguishing between 'legitimate' and 'malicious' instances.
Comparison:
- The KNN model outperformed both the Decision Tree and Random Forest models in terms
of success rate. It achieved the highest accuracy, making it a strong candidate for this specific
classification task.
- The Decision Tree and Random Forest models may have their strengths in interpretability and
handling complex relationships in the data. However, they did not perform as well as the KNN
model on this particular dataset.
