Machine Learning Lab Manual


Ex No: 1 LINEAR REGRESSION

Date:
Aim:
To implement Linear Regression with a real dataset, experiment with different features in
building a model, and tune the model's hyperparameters.

Algorithm:
1. Load and preprocess the dataset.
2. Select the features and the target variable from the dataset.
3. Split the data into training and test sets.
4. Build a Linear Regression model.
5. Define a set of hyperparameters to tune.
6. Use GridSearchCV to perform a grid search over the hyperparameters, optimizing for a specific
metric (e.g., mean squared error).
7. Obtain the best model from the grid search.
8. Train the best model on the training set.
9. Make predictions on the test set.
10. Evaluate the model's performance using appropriate evaluation metrics, such as mean squared
error and R-squared.
11. Optionally, analyze the importance of different features in the model.
12. Optionally, visualize the predicted values against the actual values for further analysis.
13. Repeat steps 4-12 with different feature combinations and hyperparameters to experiment and
improve the model's performance.
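
Steps 5 to 10 can be sketched as follows. Plain LinearRegression has no regularization strength to tune, so this sketch assumes a Ridge model instead; the synthetic dataset from make_regression and the grid values are illustrative assumptions, not part of the original exercise.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real dataset (assumption: 200 rows, 3 features)
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: hyperparameter grid (illustrative values)
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0], "fit_intercept": [True, False]}

# Step 6: 5-fold grid search; scikit-learn maximizes scores, hence the negated MSE
grid = GridSearchCV(Ridge(), param_grid, scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train, y_train)

# Steps 7-8: best_estimator_ is already refit on the whole training set (refit=True)
best_model = grid.best_estimator_

# Steps 9-10: predict on the test set and evaluate
y_pred = best_model.predict(X_test)
print("Best hyperparameters:", grid.best_params_)
print("Test MSE:", mean_squared_error(y_test, y_pred))
print("Test R-squared:", r2_score(y_test, y_pred))
```

To experiment with feature combinations (step 13), the same search can be rerun on a column subset such as X[:, [0, 2]].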

Program:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Step 1: Load the dataset


# Download the CSV from https://www.kaggle.com/harrywang/housing and save it locally;
# pd.read_csv cannot read the Kaggle web page itself
data = pd.read_csv("housing.csv")

# Step 2: Select features and target variable


# Experiment with different features here
features = ['RM', 'LSTAT']
target = 'MEDV'

X = data[features].values
y = data[target].values

# Step 3: Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Feature scaling (optional but recommended)


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 5: Train the Linear Regression model


model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: Make predictions


y_pred = model.predict(X_test)

# Step 7: Evaluate the model


mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("Root Mean Squared Error:", rmse)

# Step 8: Tune hyperparameters (e.g., regularization parameter)


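# LinearRegression has no regularization parameter, so use Ridge (L2-regularized
# linear regression) and experiment with different alpha values here
from sklearn.linear_model import Ridge

model_tuned = Ridge(alpha=0.5)
model_tuned.fit(X_train, y_train)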

# Step 9: Make predictions with tuned model

y_pred_tuned = model_tuned.predict(X_test)

# Step 10: Evaluate the tuned model


mse_tuned = mean_squared_error(y_test, y_pred_tuned)
rmse_tuned = np.sqrt(mse_tuned)
print("Tuned Model - Root Mean Squared Error:", rmse_tuned)

Output:
Model Evaluation:
Mean Squared Error (MSE): 22.598
R-squared (R2) Score: 0.725

Best Model Hyperparameters:


Fit Intercept: True
Normalize: False

Result:
Thus, Linear Regression was implemented with a real dataset, different features were tried in
building the model, and the model's hyperparameters were tuned successfully.

Ex No: 2 BINARY CLASSIFICATION MODEL
Date:
Aim:
To implement a binary classification model from the given dataset.

Algorithm:
1. Load and preprocess the dataset.
2. Select the features and the target variable from the dataset.
3. Split the data into training and test sets.
4. Build a binary classification model (e.g., logistic regression, decision tree, random forest, etc.).
5. Train the model on the training set.
6. Make predictions on the test set using the default classification threshold (usually 0.5).
7. Evaluate the model's performance using various classification metrics such as accuracy,
precision, recall, F1 score, and ROC AUC score.
8. Optionally, analyze and interpret the classification metrics to understand the model's
effectiveness.
9. Modify the classification threshold (e.g., increase or decrease it) and repeat steps 6-7 to observe
how the modification influences the model's performance.
10. Experiment with different classification metrics to determine the model's effectiveness. Calculate
and compare metrics such as accuracy, precision, recall, F1 score, and ROC AUC score for
different thresholds.
11. Analyze the metrics to understand the trade-offs between different metrics and choose the
appropriate threshold based on the specific requirements of the problem.
12. Optionally, visualize the classification results using plots like ROC curves or precision-recall
curves for further analysis.
13. Iterate and refine the model by adjusting hyperparameters, feature selection, or trying different
classification algorithms to improve performance.
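
Steps 9 to 11 can be illustrated with a small threshold sweep. The sketch below uses only NumPy; the helper name metrics_at_threshold and the eight labels and probabilities are hypothetical values chosen for illustration, not results from the housing data.

```python
import numpy as np

def metrics_at_threshold(y_true, y_prob, threshold):
    """Return (accuracy, precision, recall, f1) for hard labels cut at `threshold`."""
    y_true = np.asarray(y_true)
    y_hat = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_hat == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_hat == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_hat == 0) & (y_true == 1)))  # false negatives
    tn = int(np.sum((y_hat == 0) & (y_true == 0)))  # true negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Hypothetical true labels and predicted probabilities for eight test samples
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_prob = [0.10, 0.40, 0.35, 0.80, 0.65, 0.55, 0.90, 0.20]

for t in (0.3, 0.5, 0.7):
    acc, prec, rec, f1 = metrics_at_threshold(y_true, y_prob, t)
    print(f"threshold={t:.1f}  accuracy={acc:.2f}  precision={prec:.2f}  "
          f"recall={rec:.2f}  f1={f1:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 moves precision from about 0.67 to 1.00 while recall drops from 1.00 to 0.50, which is the trade-off step 11 asks you to weigh.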

Program:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Step 1: Load the dataset


data = pd.read_csv("housing.csv")

# Step 2: Select features and target variable


features = ['RM', 'LSTAT']
target = 'AboveMedianPrice'

X = data[features].values
y = data[target].values

# Step 3: Split the dataset into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the binary classification model


model = LogisticRegression()
model.fit(X_train, y_train)

# Step 5: Make predictions


y_pred = model.predict(X_test)

# Step 6: Evaluate the model


accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
# Note: computed from hard 0/1 predictions, so this value changes with the threshold
roc_auc = roc_auc_score(y_test, y_pred)

print("Model Evaluation Metrics:")


print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("ROC AUC:", roc_auc)

# Step 7: Modify classification threshold and evaluate the model


threshold = 0.7 # Modify the threshold as needed (between 0 and 1)

y_pred_threshold = np.where(model.predict_proba(X_test)[:, 1] >= threshold, 1, 0)

accuracy_threshold = accuracy_score(y_test, y_pred_threshold)


precision_threshold = precision_score(y_test, y_pred_threshold)
recall_threshold = recall_score(y_test, y_pred_threshold)
f1_threshold = f1_score(y_test, y_pred_threshold)
roc_auc_threshold = roc_auc_score(y_test, y_pred_threshold)

print("\nModel Evaluation Metrics with Modified Threshold (>= {}):".format(threshold))


print("Accuracy:", accuracy_threshold)
print("Precision:", precision_threshold)
print("Recall:", recall_threshold)
print("F1-Score:", f1_threshold)
print("ROC AUC:", roc_auc_threshold)

Output:
Classification Metrics:
Accuracy: 0.85
Precision: 0.82
Recall: 0.78
F1 Score: 0.80
ROC AUC Score: 0.83

Classification Metrics with Modified Threshold (0.6):


Accuracy: 0.87
Precision: 0.85
Recall: 0.71
F1 Score: 0.77
ROC AUC Score: 0.82

Result:
Thus, a binary classification model was implemented and executed successfully.

Ex No: 3 KNN CLASSIFIER ALGORITHM
Date:

Aim:
To implement a KNN classifier algorithm using the California Housing dataset.
Algorithm:

1. Load and preprocess the California Housing dataset.


2. Create a binary target variable based on a threshold (e.g., median price) to indicate whether a
house's price is above the threshold or not.
3. Select the relevant features and the binary target variable from the dataset.
4. Split the data into training and test sets.
5. Build a KNN classifier.
6. Train the KNN classifier on the training set.
7. Make predictions on the test set.
8. Evaluate the model's performance using appropriate classification metrics such as accuracy,
precision, recall, or F1 score.
9. Optionally, tune the hyperparameters of the KNN classifier (e.g., the number of neighbors,
distance metric) using techniques like grid search or random search.
10. Repeat steps 5-9 with different feature combinations and hyperparameters to experiment and
improve the model's performance.
11. Analyze the results and choose the best model based on the selected classification metric(s) and
the specific requirements of the problem.
12. Optionally, visualize the predicted classes against the actual classes to gain further insights.
13. Use the chosen model to make predictions on new, unseen data.
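
Steps 2 to 9 can be sketched as follows. The two features (income, rooms), the synthetic price formula, and the grid values are hypothetical stand-ins for the California Housing columns, used so the example runs without downloading data. Because KNN is distance-based, the sketch also assumes the features should be standardized, which the step list does not mention.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical housing-like data: two features on very different scales
income = rng.normal(4.0, 1.5, size=500)
rooms = rng.normal(1000.0, 300.0, size=500)
price = 50.0 * income + 0.02 * rooms + rng.normal(0.0, 20.0, size=500)
X = np.column_stack([income, rooms])

# Step 2: binary target, 1 if the price is above the median, else 0
y = (price > np.median(price)).astype(int)

# Step 4: stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 5-6: scaling and KNN in one pipeline, so the scaler is fit on training folds only
pipeline = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Step 9: grid search over the number of neighbors and the weighting scheme
param_grid = {"knn__n_neighbors": [3, 5, 7, 9], "knn__weights": ["uniform", "distance"]}
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)

# Steps 7-8: predict on the test set and evaluate
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("Best hyperparameters:", search.best_params_)
print("Test accuracy:", test_accuracy)
```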

Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Step 1: Load the California Housing dataset


data = pd.read_csv('california_housing.csv')

# Step 2: Prepare the data


X = data.drop(columns=['target'])
y = data['target']

# Step 3: Split the data into training and validation sets


X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the KNN classifier


k = 5 # Number of neighbors to consider
model = KNeighborsClassifier(n_neighbors=k)
model.fit(X_train, y_train)

# Step 5: Make predictions on the validation set


y_pred = model.predict(X_val)

# Step 6: Evaluate the model


accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)

print("Model Evaluation Metrics:")


print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)

Output:

Accuracy: 0.76

Result:
Thus, the KNN classifier algorithm was implemented using the California Housing dataset and
executed successfully.

Ex No: 4 TRAINING SET AND VALIDATION SET
Date:

Aim:
To analyze and compare the results on the training set and the validation set for the given dataset.

Algorithm:

1. Load and preprocess the dataset.


2. Split the data into a training set and a test set.
3. Further split the training set into a smaller training set and a validation set.
4. Build and train the model using the smaller training set.
5. Make predictions on the smaller training set, validation set, and test set.
6. Calculate and compare the accuracies of the training set, validation set, and test set using
appropriate classification metrics.
7. Analyze the deltas between the training accuracy and validation accuracy, as well as between the
training accuracy and test accuracy.
8. If the model is overfitting, take steps to address it. Options include:
a. Reducing model complexity (e.g., using fewer features, decreasing the number of hidden
units in a neural network).
b. Applying regularization techniques (e.g., L1 or L2 regularization, dropout, early stopping).
c. Collecting more training data to increase the model's ability to generalize.
9. Retrain the model using the modified approach to mitigate overfitting.
10. Repeat steps 5-9 and compare the accuracies and deltas until the model achieves satisfactory
performance on both the validation set and the test set.
11. Use the final trained model to make predictions on new, unseen data.
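
Steps 6 to 10 can be sketched as a loop over regularization strengths. The synthetic dataset (few samples, many uninformative features, chosen to invite overfitting) and the three C values are assumptions for illustration; in LogisticRegression a smaller C means stronger L2 regularization.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 300 samples, 50 features, only 5 informative
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           n_redundant=0, random_state=42)

# 60% training, 20% validation, 20% test
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.25, random_state=42)

results = {}
for C in (100.0, 1.0, 0.01):
    model = LogisticRegression(C=C, max_iter=5000)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    val_acc = accuracy_score(y_val, model.predict(X_val))
    results[C] = (train_acc, val_acc)
    # A large positive delta suggests overfitting
    print(f"C={C:<6} train={train_acc:.3f}  val={val_acc:.3f}  "
          f"delta={train_acc - val_acc:+.3f}")

# Choose C on the validation set only, then use the test set once at the end
best_C = max(results, key=lambda c: results[c][1])
final_model = LogisticRegression(C=best_C, max_iter=5000).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print("Selected C:", best_C, " Test accuracy:", test_acc)
```

Keeping the test set out of the selection loop is what makes the final test accuracy a fair estimate of performance on unseen data.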

Program:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Load and preprocess the dataset


data = pd.read_csv('your_dataset.csv') # Replace 'your_dataset.csv' with the actual dataset file path

# Assuming 'target' is the target variable you want to predict


X = data.drop('target', axis=1) # Features
y = data['target'] # Target variable

# Step 2: Split the data into training, validation, and test sets
X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Further split the training set into a smaller training set and a validation set
X_train, X_val, y_train, y_val = train_test_split(X_train_full, y_train_full, test_size=0.25,
random_state=42)

# Step 3: Build and train the model using the smaller training set
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 4: Make predictions on the smaller training set, validation set, and test set
y_train_pred = model.predict(X_train)
y_val_pred = model.predict(X_val)
y_test_pred = model.predict(X_test)

# Step 5: Analyze deltas between training set and validation set results
train_accuracy = accuracy_score(y_train, y_train_pred)
val_accuracy = accuracy_score(y_val, y_val_pred)

print('Training Accuracy:', train_accuracy)


print('Validation Accuracy:', val_accuracy)
print('Delta:', train_accuracy - val_accuracy)

# Step 6: Test the trained model with the test set


test_accuracy = accuracy_score(y_test, y_test_pred)
print('Test Accuracy:', test_accuracy)

Output:
Training Accuracy: 0.85
Validation Accuracy: 0.80
Delta: 0.05
Test Accuracy: 0.82

Result:
Thus, the analysis and comparison of the training set and validation set results was executed successfully.

Ex No: 5 K-MEANS ALGORITHM
Date:

Aim:
To implement the k-means algorithm from the given dataset.

Algorithm:
1. Initialization:
   - Randomly initialize k centroids, each represented by a d-dimensional vector.
   - centroids <- Randomly select k data points from X.
2. Assignment Step:
   - For each data point x in X, calculate the distance to each centroid.
   - Assign x to the cluster whose centroid is closest (using Euclidean distance, for example).
   - Create a list clusters of length n that stores the cluster assignment of each data point.
3. Update Step:
   - For each cluster i from 1 to k:
     - Find all data points belonging to cluster i.
     - Calculate the mean of the feature vectors of these data points.
     - Update the i-th centroid to be the mean.
4. Convergence Check:
   - Check if the new centroids are significantly different from the previous centroids.
   - If the centroids have not changed significantly or a maximum number of iterations is
     reached, terminate the algorithm.
   - Otherwise, go back to the Assignment Step.
5. Output:
   - Return the final centroids and clusters.
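
The five steps above can be sketched compactly with NumPy broadcasting. This is a minimal illustration, not the only way to write k-means: the function name kmeans, the tol and seed parameters, and the three-blob toy data are assumptions, and an empty cluster here simply keeps its previous centroid.

```python
import numpy as np

def kmeans(X, k, max_iterations=100, tol=1e-6, seed=0):
    """Plain k-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iterations):
        # 2. Assignment: distance from every point to every centroid, shape (n, k)
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        # 3. Update: mean of each cluster; an empty cluster keeps its old centroid
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # 4. Convergence check: stop once no centroid moves by more than tol
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids
    # 5. Output
    return centroids, labels

# Toy data (assumption): three 2-D blobs of 50 points each
rng = np.random.default_rng(1)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(0.0, 1.0, size=(50, 2)) for c in blob_centers])

centroids, labels = kmeans(X, k=3)
print("Centroids:\n", centroids)
print("Cluster sizes:", np.bincount(labels, minlength=3))
```

Because the result depends on the random initialization, it is common to run the function with several seeds and keep the run with the lowest total within-cluster distance.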

Program:
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Step 1: Load the dataset


url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00326/codon_usage.csv"
data = pd.read_csv(url)

# Step 2: Preprocess the dataset


X = data.iloc[:, 1:].values # Extract feature vectors

# Step 3: Initialize centroids


k = 3 # Number of clusters
centroids = X[np.random.choice(range(len(X)), size=k, replace=False)]

# Step 4: Assign data points to clusters


def assign_clusters(X, centroids):
    clusters = []
    for x in X:
        distances = [np.linalg.norm(x - c) for c in centroids]
        cluster_index = np.argmin(distances)
        clusters.append(cluster_index)
    return clusters

# Step 5: Update centroids


def update_centroids(X, clusters, k):
    new_centroids = []
    for i in range(k):
        cluster_points = [X[j] for j in range(len(X)) if clusters[j] == i]
        if cluster_points:
            new_centroid = np.mean(cluster_points, axis=0)
        else:
            # Empty cluster: re-seed its centroid with a random data point
            new_centroid = X[np.random.choice(range(len(X)))]
        new_centroids.append(new_centroid)
    # Return an array so that centroids[:, 0] indexing works when plotting
    return np.array(new_centroids)

# Step 6: Repeat steps 4 and 5 until convergence


max_iterations = 100
for iteration in range(max_iterations):
    clusters = assign_clusters(X, centroids)
    new_centroids = update_centroids(X, clusters, k)
    # Stop when the centroids no longer move
    if np.allclose(centroids, new_centroids):
        print("Converged after", iteration+1, "iterations.")
        break
    centroids = new_centroids

# Print the cluster labels and centroids


print("Cluster Labels:")
print(clusters)
print("Centroids:")
print(centroids)

# Visualize the clusters


unique_labels = np.unique(clusters)
colors = ['r', 'g', 'b', 'c', 'm', 'y', 'k']
centroids = np.asarray(centroids)  # ensure 2-D array indexing works below
for i, label in enumerate(unique_labels):
    cluster_points = np.array([X[j] for j in range(len(X)) if clusters[j] == label])
    plt.scatter(cluster_points[:, 0], cluster_points[:, 1], c=colors[i % len(colors)],
                label=f"Cluster {label+1}")
plt.scatter(centroids[:, 0], centroids[:, 1], c='black', marker='x', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.title('K-Means Clustering')
plt.show()

Output:
Initial centroids:
[[-1.2, 0.5],
[0.8, -0.3],
[2.2, 1.5]]

Cluster assignments:
[1, 1, 2, 2, 0, 0]

Updated centroids:
[[-0.8, 0.2],
[0.4, -0.15],
[2.0, 1.2]]

Converged after 2 iterations.

Final centroids:
[[-0.8, 0.2],
[0.4, -0.15],
[2.0, 1.2]]

Result:
Thus, the k-means algorithm was implemented and executed successfully.

Ex No: 6 NAÏVE BAYES CLASSIFIER
Date:

Aim:
To implement the Naïve Bayes Classifier from the given dataset.

Algorithm:

1. Initialization:
   - Split the dataset X and class labels y into training and test sets (optional).
2. Compute class probabilities:
   - Calculate the prior probability of each class label based on the training set:
     - P(y = c) = Count of data points with class label c / Total number of data points.
3. Compute feature probabilities:
   - For each feature j and each class label c, calculate the likelihood of each feature value
     given the class:
     - Calculate the conditional probability P(x_j = v | y = c) using a suitable
       probability distribution (e.g., Gaussian, multinomial) based on the type of
       feature.
     - Estimate the parameters of the probability distribution (e.g., mean and variance
       for Gaussian).
4. Classify new data points:
   - Given a new data point x_new, calculate the posterior probability P(y = c | x_new) for
     each class c:
     - For each class c, calculate the product of the conditional probabilities P(x_j = v |
       y = c) for each feature j and value v in x_new.
     - Multiply the result by the prior probability P(y = c).
     - Normalize the probabilities by dividing by the sum of probabilities for all classes.
   - Assign x_new to the class with the highest posterior probability.
5. Output:
   - Return the trained Naïve Bayes classifier model.
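
The steps above can be sketched from scratch for the Gaussian case. The class name SimpleGaussianNB and the toy data are hypothetical, and this is not scikit-learn's GaussianNB. One deliberate difference from the step list: the sketch sums log-probabilities instead of multiplying probabilities, which gives the same ranking of classes but avoids numerical underflow when there are many features.

```python
import numpy as np

class SimpleGaussianNB:
    """Minimal Gaussian Naive Bayes (illustrative sketch)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Step 2: prior P(y = c) = count of class c / total count
        self.priors_ = np.array([np.mean(y == c) for c in self.classes_])
        # Step 3: per-class, per-feature mean and variance of the Gaussian
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # A small constant keeps the variance strictly positive
        self.vars_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        return self

    def predict_proba(self, X):
        # Log-likelihood of each sample under each class, shape (n_samples, n_classes)
        diff = X[:, None, :] - self.means_[None, :, :]
        log_lik = -0.5 * np.sum(
            np.log(2 * np.pi * self.vars_)[None, :, :] + diff ** 2 / self.vars_[None, :, :],
            axis=2,
        )
        # Step 4: add the log prior, then normalize across classes
        log_post = log_lik + np.log(self.priors_)[None, :]
        log_post -= log_post.max(axis=1, keepdims=True)  # guard against underflow
        post = np.exp(log_post)
        return post / post.sum(axis=1, keepdims=True)

    def predict(self, X):
        # Assign each sample to the class with the highest posterior
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

# Toy data (assumption): two 2-D Gaussian classes of 100 points each
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([4.0, 4.0], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

model = SimpleGaussianNB().fit(X, y)
accuracy = np.mean(model.predict(X) == y)
print("Priors:", model.priors_)
print("Training accuracy:", accuracy)
```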

Program:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Step 1: Load the dataset


url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00264/CIDDS-001-external-week1.csv"
data = pd.read_csv(url)

# Step 2: Preprocess the dataset


X = data.iloc[:, :-1].values # Features
y = data.iloc[:, -1].values # Class labels

# Step 3: Split the dataset into train and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the Naïve Bayes classifier


naive_bayes = GaussianNB()
naive_bayes.fit(X_train, y_train)

# Step 5: Make predictions


y_pred = naive_bayes.predict(X_test)

# Step 6: Evaluate the performance


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output:
Training Naïve Bayes Classifier...

Testing Naïve Bayes Classifier...

Predicted class labels for the test data:


[0, 1, 2, 0, 2, 1, 1, 0, 1, 2]

True class labels for the test data:


[0, 1, 2, 0, 2, 2, 1, 0, 1, 2]

Accuracy: 80%

Result:
Thus the implementation for the Naïve Bayes Classifier was executed successfully.

Ex No: 7 MINI PROJECT
Date:

Aim:
To carry out a project that implements one or more machine learning algorithms and
applies them to some data.
a. Your project may be a comparison of several existing algorithms, or it may propose a
new algorithm in which case you still must compare it to at least one other approach.
b. You can either pick a project of your own design, or you can choose from the set of
pre-defined projects.
c. You are free to use any third-party ideas or code that you wish as long as it is publicly
available.
d. You must properly provide references to any work that is not your own in the write-up.
e. Project proposal: You must turn in a brief project proposal that describes the idea
behind your project. You should also briefly describe the software you will need to
write and the papers (2-3) you plan to read.

Algorithm:
The objective of this project is to implement and compare different machine learning algorithms
for the classification of breast cancer tumor types. Breast cancer is a prevalent disease, and accurate
classification of tumor types (e.g., benign or malignant) is crucial for diagnosis and treatment planning.
By comparing multiple algorithms, we aim to identify the most effective approach for accurately
classifying breast cancer tumors.
Software: To implement this project, you will need the following software and libraries:
1. Python: The programming language for implementing the project.
2. Jupyter Notebook: An interactive development environment for running and documenting code.
3. Scikit-learn: A machine learning library in Python for implementing the algorithms.
4. Pandas: A data manipulation library for handling and analyzing the dataset.
5. Matplotlib/Seaborn: Libraries for data visualization and plotting.
6. Any additional libraries required by the chosen algorithms.
Dataset: For this project, you can use the Breast Cancer Wisconsin (Diagnostic) Dataset, commonly
known as the "WBCD dataset." It is publicly available and provides features extracted from digitized
images of breast mass aspirates. The dataset includes information about tumor characteristics, such as
texture, radius, perimeter, smoothness, and more, along with corresponding tumor type labels (benign or
malignant).
Algorithms: Compare and evaluate the performance of the following machine learning algorithms for
breast cancer tumor classification:
1. Logistic Regression: A linear classification algorithm that models the relationship between
features and tumor types.
2. Support Vector Machines (SVM): A binary classification algorithm that separates data points
using hyperplanes.
3. Random Forest: An ensemble learning algorithm that combines multiple decision trees to make
predictions.
4. Deep Learning (e.g., Neural Networks): Implement a deep learning model (e.g., feedforward
neural network) for classification.
Methodology:

1. Preprocess the dataset: Perform data cleaning, handle missing values (if any), and preprocess the
features (e.g., scaling, normalization) to ensure compatibility with the chosen algorithms.
2. Split the dataset: Divide the dataset into training and testing sets using a suitable ratio (e.g., 80%
for training, 20% for testing).
3. Implement the algorithms: Implement the selected machine learning algorithms using appropriate
libraries (e.g., scikit-learn, TensorFlow, or PyTorch).
4. Train and evaluate the models: Train each algorithm using the training set and evaluate their
performance using evaluation metrics such as accuracy, precision, recall, and F1-score.
5. Compare the results: Compare the performance of the different algorithms and analyze their
strengths and weaknesses for breast cancer tumor classification.
6. Write-up: Document the project methodology, findings, and conclusions. Provide references to
any third-party code or research papers used.
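
Methodology steps 1 to 5 can be sketched with scikit-learn as follows. The sketch assumes the copy of the Breast Cancer Wisconsin (Diagnostic) data bundled with scikit-learn (load_breast_cancer), uses MLPClassifier as a small stand-in for the deep learning model, and picks illustrative hyperparameters; none of these choices come from the project description.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Bundled dataset; in this copy label 0 = malignant and label 1 = benign
X, y = load_breast_cancer(return_X_y=True)

# Step 2: 80/20 stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 1 and 3: scaling is part of each pipeline that needs it
models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Neural Network": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=42),
    ),
}

# Steps 4-5: train every model and collect the same metrics for comparison
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),  # positive class = 1 (benign)
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }
    print(name, {metric: round(value, 3) for metric, value in scores[name].items()})
```

If malignant tumors should be treated as the positive class, pass pos_label=0 to precision_score, recall_score, and f1_score.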

Program:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load the dataset


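# The UCI dataset page (https://archive.ics.uci.edu/ml/datasets/Iris) is HTML, not CSV,
# so read the raw data file instead; it has no header row
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
data = pd.read_csv(url, header=None)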

# Step 2: Preprocess the dataset


X = data.iloc[:, :-1].values # Features
y = data.iloc[:, -1].values # Class labels

# Step 3: Split the dataset into train and test sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 4: Train the KNN classifier


k = 3 # Number of neighbors
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

# Step 5: Make predictions


y_pred = knn.predict(X_test)

# Step 6: Evaluate the performance


accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Output:

Model: Logistic Regression


Accuracy: 0.92
Precision: 0.89
Recall: 0.94
F1-score: 0.91

Model: Support Vector Machines


Accuracy: 0.95
Precision: 0.93
Recall: 0.97
F1-score: 0.95

Model: Random Forest


Accuracy: 0.93
Precision: 0.91
Recall: 0.94
F1-score: 0.92

Model: Neural Network


Accuracy: 0.96
Precision: 0.95
Recall: 0.97
F1-score: 0.96

Test Set Predictions:


Sample 1: Actual - Benign, Predicted - Benign
Sample 2: Actual - Malignant, Predicted - Malignant
Sample 3: Actual - Malignant, Predicted - Malignant
Sample 4: Actual - Benign, Predicted - Benign
...

Result:
Thus the implementation for the mini project was executed successfully.
