
NLP and classification [Practical MCQ] (Version: 0)

TEST

Correct Answer
Answered in 20.43 minutes

Question 1/15

What does the CountVectorizer output X represent in the code snippet below?

from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
data = ["Machine learning is fascinating.", "Natural language processing and machine learning are closely related."]

# Initialise the CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform the data
X = vectorizer.fit_transform(data)

# Get the feature names
feature_names = vectorizer.get_feature_names_out()

A list of all words used across the documents.

The count of unique words in each document.

The frequency of each word in each document.

A binary indication of whether a word appears in a document or not.

Explanation:

- The frequency of each word in each document is correct. The matrix X is a sparse matrix where each row represents a document and each column represents a word from the entire set of documents, containing the frequency of each word's appearances in each document.

- The count of unique words in each document is incorrect. CountVectorizer by default counts the occurrences of words in each document, not the number of distinct words present.

- A list of all words used across the documents is incorrect. The output X is not a list but a sparse matrix of word frequencies; the word list itself comes from get_feature_names_out().

- A binary indication of whether a word appears in a document or not is incorrect. This would be true if the binary=True parameter were used in CountVectorizer. By default, it counts occurrences, not just presence.
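
To see the difference between raw counts and binary presence concretely, the short sketch below (an illustrative addition, not part of the original exam code) densifies X and repeats the vectorisation with binary=True:

from sklearn.feature_extraction.text import CountVectorizer

data = ["Machine learning is fascinating.",
        "Natural language processing and machine learning are closely related."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data)

# Densify the sparse matrix to inspect per-document word frequencies
print(vectorizer.get_feature_names_out())
print(X.toarray())  # rows = documents, columns = word counts

# With binary=True the matrix holds only 0/1 presence flags instead of counts
binary_vectorizer = CountVectorizer(binary=True)
print(binary_vectorizer.fit_transform(data).toarray())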

Question 2/15

Modify the code below to compute and print the accuracy.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialise the Logistic Regression model
logreg = LogisticRegression(solver='liblinear')

# Train the model
logreg.fit(X_train, y_train)

# Predict the test set results
y_pred = logreg.predict(X_test)
# insert code here

What is the accuracy of the logistic regression model on the test data?

0.958

0.975

0.962

0.945

Explanation:

- 0.958 is correct. When the code to compute accuracy is added, we obtain an accuracy of around 0.958 for this specific split of the data and these model parameters.

- 0.975, 0.962, and 0.945 are incorrect. These values might seem plausible but do not match the output of the actual computation given the specific random state and solver settings.
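
For reference, a minimal completion of the "# insert code here" line, consistent with the explanation above (the exact print format is an assumption):

# Compute and print the test-set accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.3f}")  # expected to print approximately 0.958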

Question 3/15

What is the value of True Positive (TP) in the confusion matrix generated by the RandomForestClassifier below?
Modify the code to print the value.

from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialise and train the RandomForestClassifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = rf_classifier.predict(X_test)

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)
# insert code here

130

97

113

118

Explanation:

- 113 is correct. Adding a line to print cm[1, 1] will display the number of True Positives, which matches this answer for the given data generation and model training settings.

- 118, 97, and 130 are incorrect. These values do not match the True Positive count for the given random state and classifier settings.
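
For reference, the completion described above is a single line; with labels (0, 1), scikit-learn places the True Positives for class 1 at cm[1, 1]:

# Print the True Positive count (replaces "# insert code here")
print("True Positives:", cm[1, 1])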

Question 4/15

What is the best value of the parameter 'C' for the SVC according to the grid search? Modify the code to print
the best parameter.

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
import numpy as np

# Load a dataset
digits = load_digits()
X = digits.data
y = digits.target

# Initialise an SVC (Support Vector Classifier) with a linear kernel
svm = SVC(kernel='linear')

# Define parameter range for C (regularisation parameter)
param_grid = {'C': np.logspace(-3, 3, 7)}

# Set up the grid search with cross-validation
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')

# Fit grid search
grid_search.fit(X, y)
# insert code here

0.001

0.1

1.0

0.01

Explanation:

- 0.001 is correct. A 'C' value of 0.001 provides the best balance between complexity and accuracy for this model, based on the output of grid_search.best_params_['C']. A common result for many problems where a moderate amount of regularisation is beneficial is 1.0; the lower value of 'C' in this result therefore suggests that stronger regularisation is optimal for the dataset used.

- The other values are possible options within the range but do not represent the best regularisation parameter for maximising the accuracy of the SVC on this dataset.
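
For reference, the completion is a print of the best parameter found (the second print is an optional illustrative addition):

# Print the best value of C found by the grid search
print("Best C:", grid_search.best_params_['C'])
# Optionally, the corresponding cross-validated accuracy
print("Best CV accuracy:", grid_search.best_score_)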

Question 5/15

Which code snippet can be used to fill in the missing lines of code to train the SVM classifier, predict the test set
results, and print the classification report?

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialise the SVM classifier with a radial basis function kernel
svm_rbf = SVC(kernel='rbf')

# Fit the classifier to the training data
# [Your Code Here] - Line to add for fitting the model

# Predict the test set results
# [Your Code Here] - Line to add for making predictions

# Generate and print the classification report
# [Your Code Here] - Line to add for printing the classification report

svm_rbf.train(X_train, y_train)
y_pred = svm_rbf.classify(X_test)
report = classification_report(y_test, y_pred)
print(report)

svm_rbf.train(X_train, y_train)
y_pred = svm_rbf.test(X_test)
print(classification_report(y_pred, y_test))

svm_rbf.fit(X_train, y_train)
y_pred = svm_rbf.predict(X_test)
report = classification_report(y_pred, y_test)
print(report)

svm_rbf.fit(X_train, y_train)
y_pred = svm_rbf.predict(X_test)
print(classification_report(y_test, y_pred))

Explanation:

- The lines in the correct option rightly implement the standard scikit-learn methods for fitting a model, predicting outcomes, and printing a classification report that compares predictions with true labels.

- The other options are all incorrect because each includes at least one incorrect method or argument mismatch (such as using non-existent methods like train or classify, or passing the arguments to classification_report in the wrong order).

Question 6/15

Given the code below, your task is to select the function from the options provided that correctly completes the
task by:

i) Creating a function that determines which classifier (KNN or Naive Bayes) has a higher F1 score, or if they
have equal scores.

ii) Printing the name of the classifier along with its F1 score in the format: 'ClassifierName has the higher F1
score of Score' or 'Both classifiers have the same F1 score of Score'.

iii) Executing the function.

Select the appropriate code snippet from the options below.


from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialise KNN and Naive Bayes classifiers
knn = KNeighborsClassifier(n_neighbors=5)
nb = GaussianNB()

# Train both classifiers on the training data
knn.fit(X_train, y_train)
nb.fit(X_train, y_train)

# Predict test set results for both classifiers
y_pred_knn = knn.predict(X_test)
y_pred_nb = nb.predict(X_test)

# Calculate F1 scores for both classifiers
f1_knn = f1_score(y_test, y_pred_knn)
f1_nb = f1_score(y_test, y_pred_nb)

# [Your Code Here]

def best_f1_score(f1_knn, f1_nb):
    print(f"KNN: {f1_knn}, Naive Bayes: {f1_nb}")
best_f1_score(f1_knn, f1_nb)

def print_best_classifier(f1_knn, f1_nb):
    if f1_knn > f1_nb:
        print(f"KNN has the higher F1 score of {f1_knn}")
    elif f1_knn < f1_nb:
        print(f"Naive Bayes has the higher F1 score of {f1_nb}")
    else:
        print(f"Both classifiers have the same F1 score of {f1_knn}")
print_best_classifier(f1_knn, f1_nb)

def compare_f1_scores(f1_knn, f1_nb):
    if f1_knn >= f1_nb:
        print(f"KNN has the higher F1 score of {f1_knn}")
    else:
        print(f"Naive Bayes has the higher F1 score of {f1_nb}")
compare_f1_scores(f1_knn, f1_nb)

def evaluate_classifiers(f1_knn, f1_nb):
    if f1_knn > f1_nb:
        print(f"Naive Bayes has the higher F1 score of {f1_nb}")
    elif f1_nb > f1_knn:
        print(f"KNN has the higher F1 score of {f1_knn}")
    else:
        print(f"Both classifiers have the same F1 score of {f1_nb}")
evaluate_classifiers(f1_knn, f1_nb)

Explanation:

-
def print_best_classifier(f1_knn, f1_nb):
    if f1_knn > f1_nb:
        print(f"KNN has the higher F1 score of {f1_knn}")
    elif f1_knn < f1_nb:
        print(f"Naive Bayes has the higher F1 score of {f1_nb}")
    else:
        print(f"Both classifiers have the same F1 score of {f1_knn}")
print_best_classifier(f1_knn, f1_nb)

Correct because it accurately checks and prints which classifier has a higher F1 score, or whether they are the same, following the specified format.

-
def compare_f1_scores(f1_knn, f1_nb):
    if f1_knn >= f1_nb:
        print(f"KNN has the higher F1 score of {f1_knn}")
    else:
        print(f"Naive Bayes has the higher F1 score of {f1_nb}")
compare_f1_scores(f1_knn, f1_nb)

Incorrect because it treats a tie as a KNN win, which is misleading and does not handle the equality scenario correctly in terms of printing.

-
def best_f1_score(f1_knn, f1_nb):
    print(f"KNN: {f1_knn}, Naive Bayes: {f1_nb}")
best_f1_score(f1_knn, f1_nb)

Does not fulfil the requirement to evaluate which classifier is better; it merely prints the scores without comparison.

-
def evaluate_classifiers(f1_knn, f1_nb):
    if f1_knn > f1_nb:
        print(f"Naive Bayes has the higher F1 score of {f1_nb}")
    elif f1_nb > f1_knn:
        print(f"KNN has the higher F1 score of {f1_knn}")
    else:
        print(f"Both classifiers have the same F1 score of {f1_nb}")
evaluate_classifiers(f1_knn, f1_nb)

Incorrectly swaps the names of the classifiers in the output, providing false information about which classifier performs better.

Question 7/15

Which of the following options will complete the missing code lines to:

i) train the MLPClassifier,

ii) predict the test set labels,

iii) count the number of misclassified samples,

iv) call the function to print the results.


from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
import numpy as np

# Generate a two-moon dataset
X, y = make_moons(n_samples=1000, noise=0.2, random_state=42)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Initialise the MLPClassifier with one hidden layer of 10 neurons
mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000, random_state=42)

# [Your Code Here] - Train the MLPClassifier on the scaled training data

# [Your Code Here] - Predict the labels for the scaled test data

# [Your Code Here] - Print the number of misclassified samples in the test set

mlp.fit(X_train_scaled, y_train)
y_pred = mlp.predict(X_test_scaled)
print(np.sum(y_test != y_pred))

mlp.train(X_train_scaled, y_train)
y_pred = mlp.test(X_test_scaled)
print(np.count_nonzero(y_test == y_pred))

mlp.fit(X_train_scaled, y_train)
y_pred = mlp.predict(X_test_scaled)
misclassified = np.where(y_test != y_pred, 1, 0)
print(misclassified.sum())

mlp.train(X_train_scaled, y_train)
y_pred = mlp.classify(X_test_scaled)
print((y_test - y_pred).count_nonzero())

Explanation:

- The correct snippet trains the MLPClassifier with fit, predicts using predict, and counts the number of mismatches correctly by applying np.sum to the boolean condition y_test != y_pred.

- The other snippets either use non-existent methods (train, classify, test), count the wrong quantity (np.count_nonzero(y_test == y_pred) counts correct predictions, not misclassifications), or incorrectly sum up the condition.

Question 8/15

Before running the final line of the code in the snippet below to fit the grid_search object, you are asked to perform the following tasks directly in the code:

1. Modify the param_grid to include a new parameter, 'max_features', with values ranging from 1 to 4.
2. Fit the grid_search to the training data.
3. After fitting, extract and print the best parameter combination and the corresponding cross-validation score.

Which of the following options correctly completes these tasks?

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Set up a basic decision tree classifier
dt = DecisionTreeClassifier(random_state=42)

# Define a parameter grid over which to optimise the decision tree
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 10, 20]
}

# Set up the GridSearchCV
grid_search = GridSearchCV(dt, param_grid, cv=5)

param_grid.update({'max_features': [1, 2]})
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(f"Best parameters found: {best_params}, Score: {grid_search.best_score_}")

param_grid['max_features'] = [1, 2, 3, 4]
grid_search = GridSearchCV(dt, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Optimal Parameters: {grid_search.best_params_}, CV Accuracy: {grid_search.best_score_}")

param_grid['max_features'] = range(1, 5)
grid_search.fit(X_train, y_train)
print(f"Best Params: {grid_search.best_params_}, CV Score: {grid_search.best_score_}")

param_grid = {'max_features': [1, 2, 3, 4]}
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-validation Score:", grid_search.best_score_)

Explanation:

-
param_grid['max_features'] = range(1, 5)
grid_search.fit(X_train, y_train)
print(f"Best Params: {grid_search.best_params_}, CV Score: {grid_search.best_score_}")

The correct option efficiently updates the param_grid, fits the model, and then directly prints the best parameters and the cross-validation score in a succinct and clear format. (This works because grid_search holds a reference to the same param_grid dictionary, so mutating it before calling fit takes effect.)

-
param_grid.update({'max_features': [1, 2]})
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
print(f"Best parameters found: {best_params}, Score: {grid_search.best_score_}")

Correct in execution, but verbose: using update() is unnecessary for adding a single parameter, and the limited range [1, 2] does not cover the requested values 1 to 4.

-
param_grid = {'max_features': [1, 2, 3, 4]}
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
print("Best Cross-validation Score:", grid_search.best_score_)

Incorrect. This redefines param_grid entirely, losing the previous parameters.

-
param_grid['max_features'] = [1, 2, 3, 4]
grid_search = GridSearchCV(dt, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Optimal Parameters: {grid_search.best_params_}, CV Accuracy: {grid_search.best_score_}")

Incorrect, as it unnecessarily re-initialises the GridSearchCV, which is redundant rather than efficient coding practice.

Question 9/15

You are fine-tuning a decision tree classifier for a marketing dataset. To prevent overfitting and ensure robust
generalisability, you must adjust the depth of the decision tree after its initialisation but before it is fitted with
data. Considering the decision tree `dt` has already been initialised with a random state, which of the following is
the correct way to modify the tree's maximum depth?

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load data
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialise decision tree classifier
dt = DecisionTreeClassifier(random_state=42)

# [Your Code Here]

dt.set_params(max_depth=5).fit(X_train, y_train)

dt.max_depth = 42

dt.set_params(max_depth=5)

dt = DecisionTreeClassifier(max_depth=5, random_state=42)

Explanation:

- dt.set_params(max_depth=5)

Correct. This method updates the max_depth parameter appropriately without needing to re-initialise the classifier.

- dt = DecisionTreeClassifier(max_depth=5, random_state=42)

Incorrect. This line re-initialises the classifier, which is not needed, as dt has already been initialised.

- dt.max_depth = 42

Incorrect. Direct attribute assignment bypasses the set_params interface, and the value 42 does not match the intended maximum depth of 5.

- dt.set_params(max_depth=5).fit(X_train, y_train)

Incorrect. Although this sets the parameter and fits the model, the question asks only to adjust the depth before fitting, making this option partially correct but exceeding the required action.
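
A minimal sketch of the intended flow, continuing the setup above (the final accuracy print is an illustrative addition, not part of the question):

# Adjust the depth after initialisation but before fitting
dt.set_params(max_depth=5)

# Fit and evaluate on the held-out test set
dt.fit(X_train, y_train)
print("Test accuracy:", dt.score(X_test, y_test))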

Question 10/15

Suppose you are analysing the performance of a new email spam detection system using precision and recall.
You have already computed these metrics, and you are about to explore their trade-offs to optimise the
classifier's threshold. Given the code snippet below, identify the correct function call that would allow you to
adjust and visualise the precision-recall trade-off.

from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Generate synthetic data for binary classification
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train a RandomForest classifier
classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Predict probabilities for the test set
y_scores = classifier.predict_proba(X_test)[:, 1]

# [Your Code Here] - Generate precision and recall values for various thresholds

precision_recall_curve(classifier, X_test, y_test)

precision, recall = precision_recall_curve(y_test, y_scores)

plt.plot(precision_recall_curve(y_test, y_scores))

precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

Explanation:

- precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

Correct. This call correctly uses the precision_recall_curve function to compute precision and recall for various threshold values, using the probability scores from the classifier.

- precision_recall_curve(classifier, X_test, y_test)

Incorrect. This is a syntactically incorrect use of the precision_recall_curve function, which expects true labels and scores, not an estimator.

- plt.plot(precision_recall_curve(y_test, y_scores))

Incorrect. While this attempts to plot the precision-recall curve, it misuses the function in the plotting context and lacks variable assignment.

- precision, recall = precision_recall_curve(y_test, y_scores)

Incorrect. This option omits the crucial 'thresholds' output (and would raise a ValueError when unpacking three return values into two names), so it cannot be used to analyse the trade-off between precision and recall.
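
To actually visualise the trade-off mentioned in the question, a short sketch like the following could follow the correct call (the labels and title are illustrative assumptions):

# Compute precision and recall across all candidate thresholds
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

# precision and recall each have one more entry than thresholds, so drop the last
plt.plot(thresholds, precision[:-1], label="Precision")
plt.plot(thresholds, recall[:-1], label="Recall")
plt.xlabel("Decision threshold")
plt.legend()
plt.title("Precision-recall trade-off")
plt.show()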

Question 11/15
You are tasked with enhancing the robustness of a logistic regression model by incorporating feature scaling.
You're currently working with a dataset that has significantly varying scales among its features, which can affect
the model's performance. Below is a preliminary setup for the logistic regression model. Identify the correct
sequence of steps to integrate feature scaling into the modelling process.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialise the Logistic Regression model
lr = LogisticRegression()

# [Your Code Here] - Apply feature scaling to the training data
# [Your Code Here] - Fit the model on the scaled training data
# [Your Code Here] - Apply the same scaling to the test data

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
lr.fit(X_train_scaled, y_train)
scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
lr.fit(X_train_scaled, y_train)

scaler = StandardScaler()
X_train_scaled = scaler.transform(X_train)
lr.fit(X_train_scaled, y_train)
X_test_scaled = scaler.fit_transform(X_test)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lr.fit(X_scaled, y)

Explanation:

-
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
lr.fit(X_train_scaled, y_train)

Correct. This sequence fits the scaler on the training data, applies the transformation to both the training and test datasets in the correct order, and fits the logistic regression model on the scaled training data, avoiding data leakage and ensuring consistent preprocessing.

-
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
lr.fit(X_train_scaled, y_train)
scaler = StandardScaler()
X_test_scaled = scaler.fit_transform(X_test)

Incorrect. This option initially seems correct, as it scales and fits the model on the training data. However, re-initialising and refitting the scaler on the test data violates the principle of using the same scaler for both training and testing, which can mislead and degrade model performance.

-
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
lr.fit(X_scaled, y)

Incorrect. This approach applies scaling to the entire dataset before the train-test split, leading to potential data leakage.

-
scaler = StandardScaler()
X_train_scaled = scaler.transform(X_train)
lr.fit(X_train_scaled, y_train)
X_test_scaled = scaler.fit_transform(X_test)

Incorrect. This option applies transformation and fitting in the wrong order: it does not fit the scaler on the training data before transforming it (calling transform on an unfitted scaler raises an error) and then erroneously refits the scaler on the test data.
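
As a side note beyond the options, the same leak-free sequence can be expressed with a scikit-learn Pipeline, which fits the scaler on the training data only and reuses it automatically at prediction time; a minimal sketch of that alternative:

from sklearn.pipeline import make_pipeline

# The pipeline chains scaling and classification into one estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)
print("Test accuracy:", pipe.score(X_test, y_test))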

Question 12/15

You are fine-tuning a support vector machine (SVM) classifier to categorise images based on their content. The
dataset consists of various animal images, and you suspect that different kernel functions might yield better
classification accuracy. You decide to test which SVM kernel—linear or radial basis function (RBF)—works best
for your specific dataset. Below is your initial code setup:

from sklearn.svm import SVC
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a dataset of digit images
digits = load_digits()
X = digits.data
y = digits.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialise two SVM classifiers, one with a linear kernel and another with an RBF kernel
svm_linear = SVC(kernel='linear')
svm_rbf = SVC(kernel='rbf')

# [Your Code Here] - Train both classifiers on the training data
# [Your Code Here] - Predict the test set results with both classifiers
# [Your Code Here] - Calculate and print the accuracy scores for both classifiers

Which of the following options correctly completes the task of training both SVM classifiers, predicting the test set results, and calculating the accuracy for each?

svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_train)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_train)
print("Accuracy with Linear Kernel:", accuracy_score(y_train, y_pred_linear))
print("Accuracy with RBF Kernel:", accuracy_score(y_train, y_pred_rbf))

svm_linear.train(X_train, y_train)
svm_rbf.train(X_train, y_train)
y_pred_linear = svm_linear.classify(X_test)
y_pred_rbf = svm_rbf.classify(X_test)
print("Linear Kernel Accuracy:", accuracy_score(y_test, y_pred_linear))
print("RBF Kernel Accuracy:", accuracy_score(y_test, y_pred_rbf))

svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)
y_pred_linear = svm_linear.test(X_test)
y_pred_rbf = svm_rbf.test(X_test)
print("Linear Test Accuracy:", accuracy_score(y_test, y_pred_linear))
print("RBF Test Accuracy:", accuracy_score(y_test, y_pred_rbf))

svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)
print("Linear Accuracy:", accuracy_score(y_test, y_pred_linear))
print("RBF Accuracy:", accuracy_score(y_test, y_pred_rbf))

Explanation:

-
svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)
print("Linear Accuracy:", accuracy_score(y_test, y_pred_linear))
print("RBF Accuracy:", accuracy_score(y_test, y_pred_rbf))

Correct. This option uses the fit and predict methods on both SVM classifiers and then correctly computes and prints the accuracy scores for each kernel type.

-
svm_linear.train(X_train, y_train)
svm_rbf.train(X_train, y_train)
y_pred_linear = svm_linear.classify(X_test)
y_pred_rbf = svm_rbf.classify(X_test)
print("Linear Kernel Accuracy:", accuracy_score(y_test, y_pred_linear))
print("RBF Kernel Accuracy:", accuracy_score(y_test, y_pred_rbf))

svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)
y_pred_linear = svm_linear.test(X_test)
y_pred_rbf = svm_rbf.test(X_test)
print("Linear Test Accuracy:", accuracy_score(y_test, y_pred_linear))
print("RBF Test Accuracy:", accuracy_score(y_test, y_pred_rbf))

Both incorrect. They use non-existent train, classify, and test methods, which are not part of scikit-learn's SVM API.

-
svm_linear.fit(X_train, y_train)
y_pred_linear = svm_linear.predict(X_train)
svm_rbf.fit(X_train, y_train)
y_pred_rbf = svm_rbf.predict(X_train)
print("Accuracy with Linear Kernel:", accuracy_score(y_train, y_pred_linear))
print("Accuracy with RBF Kernel:", accuracy_score(y_train, y_pred_rbf))

Incorrect. It uses the training data for prediction and accuracy assessment instead of the test data, which is a fundamental error in the context of model evaluation.

Question 13/15

You are currently evaluating two classifiers, K-Nearest Neighbours (KNN) and Naive Bayes, for a project that
involves classifying texts into different categories based on their content. To finalise your model selection, you
decide to visually compare their performance using a bar chart. Below is the setup for calculating the accuracy
of both models on your dataset. Complete the code by adding the necessary lines to plot the accuracies in a bar
chart:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load data
data = fetch_20newsgroups(subset='all')
X = data.data
y = data.target

# Create train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorise text data
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Initialise classifiers
knn = KNeighborsClassifier()
nb = MultinomialNB()

# Train classifiers
knn.fit(X_train_tfidf, y_train)
nb.fit(X_train_tfidf, y_train)

# Predict and calculate accuracy
knn_accuracy = accuracy_score(y_test, knn.predict(X_test_tfidf))
nb_accuracy = accuracy_score(y_test, nb.predict(X_test_tfidf))

# [Your code here] - Plot the accuracies in a bar chart

Which snippet of code will correctly plot the accuracies of KNN and Naive Bayes classifiers in a bar chart?

plt.bar(['KNN', 'Naive Bayes'], [knn_accuracy, nb_accuracy])
plt.xlabel('Accuracy')
plt.ylabel('Classifier')
plt.title('Classifier Accuracy Comparison')
plt.show()

plt.bar(['KNN', 'Naive Bayes'], [knn_accuracy, nb_accuracy])
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.title('Classifier Accuracies')
plt.show()

acc_data = [knn_accuracy, nb_accuracy]
labels = ['KNN', 'Naive Bayes']
plt.barh(labels, acc_data)
plt.xlabel('Accuracy')
plt.ylabel('Classifier')
plt.title('Accuracy Comparison')
plt.show()

plt.plot(['KNN', 'Naive Bayes'], [knn_accuracy, nb_accuracy])
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.title('Comparison of Classifier Performance')
plt.show()

Explanation:

-
plt.bar(['KNN', 'Naive Bayes'], [knn_accuracy, nb_accuracy])
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.title('Classifier Accuracies')
plt.show()

Correct. This option properly uses the plt.bar function to create a bar chart displaying the accuracies of the KNN and Naive Bayes classifiers, with appropriate labels and titles.

-
plt.plot(['KNN', 'Naive Bayes'], [knn_accuracy, nb_accuracy])
plt.xlabel('Classifier')
plt.ylabel('Accuracy')
plt.title('Comparison of Classifier Performance')
plt.show()

Incorrect. It uses plt.plot, which generates a line graph instead of a bar chart.

-
acc_data = [knn_accuracy, nb_accuracy]
labels = ['KNN', 'Naive Bayes']
plt.barh(labels, acc_data)
plt.xlabel('Accuracy')
plt.ylabel('Classifier')
plt.title('Accuracy Comparison')
plt.show()

Incorrect. It uses plt.barh for a horizontal bar chart, which was not specified in the task.

-
plt.bar(['KNN', 'Naive Bayes'], [knn_accuracy, nb_accuracy])
plt.xlabel('Accuracy')
plt.ylabel('Classifier')
plt.title('Classifier Accuracy Comparison')
plt.show()

Incorrect. The axis labels are swapped, which could lead to confusion in interpreting the chart.

Question 14/15

You are tasked with evaluating a simple binary classification model using a confusion matrix. The dataset
involves predicting whether a given email is spam or not. To better understand the model's performance, you
plan to extract specific metrics from the confusion matrix, specifically True Positives (TP) and False Positives
(FP). Below is your initial code setup:

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Generate synthetic binary classification data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train a Random Forest classifier
classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)

# Predict the test set results
y_pred = classifier.predict(X_test)

# Generate the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# [Your code here] - Extract and print True Positives and False Positives

Which snippet of code correctly extracts and prints the True Positives (TP) and False Positives (FP) from the confusion matrix?


print("TP:", cm[2, 2])


print("FP:", cm[1, 2])

tp = cm[1, 1]
fp = cm[0, 1]
print("True Positives:", tp)
print("False Positives:", fp)

print("TP:", cm[1][1])
print("FP:", cm[2][1])

print("True Positives:", cm[2][2])


print("False Positives:", cm[1][2])

Explanation:

- The right option correctly identifies TP and FP from the confusion matrix, where TP is at cm[1, 1] (second row, second column) and FP is at cm[0, 1] (first row, second column); these are the standard positions in a binary confusion matrix as laid out by scikit-learn.

- The other options attempt to access indices that are out of bounds for a typical 2x2 confusion matrix used in binary classification (index 2 does not exist), leading to an error or incorrect data extraction. These answers are clearly incorrect, with no ambiguity or overlap with correct coding practices.
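
A common alternative worth knowing (a side note, not one of the exam options) is to unpack all four cells at once with ravel():

# For binary labels (0, 1), confusion_matrix flattens row by row: tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()
print("True Positives:", tp)
print("False Positives:", fp)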

Question 15/15

You are refining a logistic regression model to predict customer churn. The dataset includes various customer
interaction metrics. To enhance your model, explore how polynomial features can improve prediction accuracy.
This approach allows the model to capture complex interactions between variables.

Here is your setup:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

# Generate synthetic data for binary classification
X, y = make_classification(n_samples=1000, n_features=3, n_classes=2, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Apply polynomial features manually
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

What is the correct procedure to fit a logistic regression model on the training data after transforming it with polynomial features, and how should predictions be made on the test data?

model = LogisticRegression()
model.fit(X_train_poly, y_train)
y_pred = model.predict(X_test_poly)

model = LogisticRegression()
model.fit(X_train_poly, y_test)
y_pred = model.predict(X_test_poly)

model = LogisticRegression()
model.fit(X_test_poly, y_test)
y_pred = model.predict(X_train_poly)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Explanation:

-
model = LogisticRegression()
model.fit(X_train_poly, y_train)
y_pred = model.predict(X_test_poly)

Correct, as it follows the necessary steps: the polynomial transformation is applied to the training data, the logistic regression model is fitted on this transformed data, and the same transformation is applied to the test data before making predictions. This ensures the model is trained and tested on the same type of features, which is crucial for accurate model evaluation.

-
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

Incorrect. It does not apply the polynomial transformation to the data.

-
model = LogisticRegression()
model.fit(X_test_poly, y_test)
y_pred = model.predict(X_train_poly)

Incorrect. It uses the test data for model training.

-
model = LogisticRegression()
model.fit(X_train_poly, y_test)
y_pred = model.predict(X_test_poly)

Incorrect. It uses the test set labels for training the model (and the number of rows in X_train_poly and y_test would not even match).
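
The same pattern is often packaged as a Pipeline so that the polynomial expansion and the model are fitted together; below is a minimal sketch of that variant (an illustrative addition, not one of the options; max_iter is raised as a convergence precaution):

from sklearn.pipeline import make_pipeline

# The pipeline fits PolynomialFeatures on the training data only,
# then applies the same expansion automatically at predict time
poly_model = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression(max_iter=1000))
poly_model.fit(X_train, y_train)
y_pred = poly_model.predict(X_test)
print("Test accuracy:", poly_model.score(X_test, y_test))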
