
Scikit-learn:

Introduction to Scikit-learn

Scikit-learn is one of the most widely used open-source machine learning libraries in Python. It
was developed as part of the SciPy ecosystem and is designed to offer simple and efficient tools
for data analysis and machine learning. Scikit-learn's appeal lies in its easy-to-understand API,
extensive algorithm options, and strong community support, making it an excellent choice for both
beginners and experienced data scientists.

PACKAGES IN SCIKIT-LEARN:

Scikit-learn offers a comprehensive collection of packages (also called modules) that provide
functionality for different stages of the machine learning workflow, from data preprocessing to
model evaluation. Below is an elaboration of the main packages in Scikit-learn, along with
examples to demonstrate their use.

1. sklearn.preprocessing: Data Preprocessing

The preprocessing module contains functions to transform raw data into a format more suitable for
machine learning algorithms. These transformations include scaling, normalizing, encoding
categorical variables, and imputing missing values.

Key Functions:

• StandardScaler: Standardizes features by removing the mean and scaling to unit variance.
• MinMaxScaler: Scales features to a fixed range, typically [0, 1].
• OneHotEncoder: Converts categorical variables into a one-hot numeric array.
• LabelEncoder: Encodes target labels with values between 0 and n_classes - 1.
• SimpleImputer: Fills missing values in the dataset (see the imputation example below).

Example: Data Scaling


Python

from sklearn.preprocessing import StandardScaler
import numpy as np

# Example data
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Apply standard scaling
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)
Example: One-Hot Encoding
Python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Example categorical data
data = np.array([['Male'], ['Female'], ['Female'], ['Male']])

# Apply one-hot encoding
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(data).toarray()

print(encoded_data)
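Example: Imputing Missing Values

SimpleImputer, listed above, fills in missing entries; note that in current Scikit-learn releases it is imported from sklearn.impute rather than sklearn.preprocessing. Below is a minimal sketch, assuming mean imputation on a small array with one missing value.

Python

from sklearn.impute import SimpleImputer
import numpy as np

# Example data with a missing value (np.nan)
data = np.array([[1, 2], [np.nan, 3], [7, 6]])

# Replace missing values with the column mean
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(data)

print(imputed_data)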

2. sklearn.model_selection: Model Selection and Evaluation


This module provides tools for splitting datasets, evaluating models, and performing
hyperparameter tuning.

Key Functions:

• train_test_split: Splits the dataset into training and testing sets.
• cross_val_score: Evaluates a model using cross-validation.
• GridSearchCV: Performs exhaustive hyperparameter tuning via grid search.
• RandomizedSearchCV: Performs hyperparameter tuning using a randomized search over the parameter space.
• KFold: Splits data into k consecutive folds for cross-validation.

Example: Splitting Data for Training and Testing


Python

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training data size:", X_train.shape)
print("Test data size:", X_test.shape)

Example: GridSearchCV for Hyperparameter Tuning
Python

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the model
rf = RandomForestClassifier()

# Define the hyperparameter grid
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}

# Perform grid search (X_train and y_train come from the split above)
grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Print best hyperparameters
print("Best Parameters:", grid_search.best_params_)

3. sklearn.metrics: Model Evaluation Metrics

The metrics module provides various tools for evaluating the performance of machine learning
models. These include accuracy, precision, recall, F1 score, ROC curves, and regression metrics.

Key Functions:

• accuracy_score: Measures the accuracy of a classification model.
• confusion_matrix: Computes the confusion matrix for classification models.
• mean_squared_error: Measures the mean squared error for regression models.
• roc_auc_score: Calculates the Area Under the Receiver Operating Characteristic Curve (AUC-ROC).

Example: Confusion Matrix and Accuracy Score


Python

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Confusion matrix
print(confusion_matrix(y_true, y_pred))

# Accuracy score
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

4. sklearn.ensemble: Ensemble Methods

The ensemble module provides a collection of algorithms that combine the predictions of multiple
models to improve performance. These include popular ensemble techniques like bagging and
boosting.

Key Algorithms:

• RandomForestClassifier: Builds a forest of decision trees for classification tasks.
• RandomForestRegressor: Builds a forest of decision trees for regression tasks.
• GradientBoostingClassifier: Builds an ensemble of weak learners (decision trees) and optimizes them using gradient boosting.
• AdaBoostClassifier: Performs adaptive boosting to improve model accuracy.

Example: Random Forest Classifier


Python

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train RandomForest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Predict using the trained model
predictions = rf.predict(X)
print(predictions)
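Example: Gradient Boosting Classifier

GradientBoostingClassifier from the list above follows the same fit/predict pattern. A minimal sketch on the Iris dataset, with assumed values for n_estimators and learning_rate:

Python

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train a gradient boosting classifier
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X, y)

# Predict the first five samples
print(gb.predict(X[:5]))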

5. sklearn.linear_model: Linear Models


This module contains a wide variety of linear models for regression and classification, such as
linear regression, logistic regression, ridge regression, and more.

Key Algorithms:

• LinearRegression: Fits a linear model to minimize the sum of squared errors.
• LogisticRegression: Performs classification by fitting a logistic curve to the data.
• Ridge: A linear model with L2 regularization.
• Lasso: A linear model with L1 regularization.

Example: Linear Regression


Python

from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data (X: features, y: target variable)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 3, 3, 5])

# Train the model
model = LinearRegression()
model.fit(X, y)

# Predict new values
predictions = model.predict([[6]])
print("Prediction for input 6:", predictions)

6. sklearn.tree: Decision Trees


The tree module provides tools for building decision tree models, both for classification and
regression tasks.

Key Algorithms:

• DecisionTreeClassifier: A decision tree for classification tasks.
• DecisionTreeRegressor: A decision tree for regression tasks.

Example: Decision Tree Classifier


Python

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Predict using the trained model
predictions = clf.predict(X)
print(predictions)
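Example: Decision Tree Regressor

DecisionTreeRegressor works the same way for continuous targets. A minimal sketch on hypothetical one-dimensional data, with max_depth limited to keep the tree small:

Python

from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Hypothetical one-dimensional regression data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.5, 3.0, 4.5, 6.0, 7.5])

# Limit depth to avoid memorizing every training point
reg = DecisionTreeRegressor(max_depth=2)
reg.fit(X, y)

print(reg.predict([[3.5]]))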

7. sklearn.svm: Support Vector Machines (SVM)

The svm module provides Support Vector Machine (SVM) algorithms for classification, regression,
and outlier detection.

Key Algorithms:

• SVC: Support Vector Classification.
• SVR: Support Vector Regression.

Example: Support Vector Classifier


Python

from sklearn.svm import SVC
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train a support vector classifier
svc = SVC(kernel='linear')
svc.fit(X, y)

# Predict using the trained model
predictions = svc.predict(X)
print(predictions)
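Example: Support Vector Regression

SVR mirrors the SVC interface for regression targets. A minimal sketch with hypothetical data and a linear kernel:

Python

from sklearn.svm import SVR
import numpy as np

# Hypothetical regression data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.1])

# Fit a linear-kernel support vector regressor
svr = SVR(kernel='linear')
svr.fit(X, y)

print(svr.predict([[6]]))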

8. sklearn.cluster: Clustering
The cluster module provides tools for performing unsupervised learning tasks like clustering, which
groups similar data points together.

Key Algorithms:

• KMeans: A popular clustering algorithm that partitions data into k clusters.
• DBSCAN: A density-based clustering algorithm.
• AgglomerativeClustering: A hierarchical clustering algorithm.

Example: K-Means Clustering


Python

from sklearn.cluster import KMeans
import numpy as np

# Example data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Fit KMeans with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# Predicted cluster labels
print(kmeans.labels_)
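Example: DBSCAN Clustering

Unlike KMeans, DBSCAN does not need the number of clusters up front; it groups points by density and labels outliers as -1. A minimal sketch on the same data, with assumed eps and min_samples values:

Python

from sklearn.cluster import DBSCAN
import numpy as np

# Same example data as above
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Points within eps=3 of each other form a cluster; -1 marks noise
db = DBSCAN(eps=3, min_samples=2)
db.fit(X)

print(db.labels_)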

9. sklearn.decomposition: Dimensionality Reduction

This module includes algorithms for reducing the number of features in a dataset while retaining
as much information as possible. Dimensionality reduction is useful for data visualization and
speeding up machine learning algorithms.

Key Algorithms:

• PCA: Principal Component Analysis for feature extraction.
• TruncatedSVD: A variant of SVD for sparse data.
• FactorAnalysis: Performs factor analysis.

Example: Principal Component Analysis (PCA)


Python

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data

# Reduce the dimensionality to 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced)
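Example: Checking Explained Variance

After fitting PCA, the explained_variance_ratio_ attribute shows how much of the original variance each retained component captures, which helps decide whether two components are enough. A minimal sketch:

Python

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Fit PCA with 2 components on the iris features
pca = PCA(n_components=2).fit(load_iris().data)

# Fraction of total variance captured by each component
print(pca.explained_variance_ratio_)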

Conclusion

Scikit-learn is packed with powerful packages that allow users to implement a wide range of
machine learning algorithms. Each package serves a critical role in the machine learning
pipeline—from preprocessing data to fine-tuning models and making predictions. The consistency
of its API makes it a highly usable library for both beginners and professionals.

How to Install Scikit-learn

Scikit-learn can be installed easily using Python's package management tools like pip or conda (for
Anaconda users). Follow the steps below to install Scikit-learn.

1. Installing Scikit-learn with pip

If you are using the default Python installation, you can install Scikit-learn via pip.

Bash

pip install scikit-learn

This will install Scikit-learn along with its dependencies like NumPy, SciPy, and joblib.

2. Installing Scikit-learn with conda (Anaconda users)

If you're using the Anaconda distribution of Python, you can install Scikit-learn using conda:

Bash

conda install scikit-learn

This will install the version of Scikit-learn that is compatible with Anaconda and any necessary
dependencies.

Verifying Installation
Once the installation is complete, you can verify it by checking the installed version:

Python

import sklearn
print(sklearn.__version__)

If no errors occur, and a version number is printed, Scikit-learn is installed successfully.

Basic Operations in Scikit-learn


After installing Scikit-learn, you can start with some basic operations such as loading datasets,
preprocessing data, and building simple models. Below are a few common operations to get
started.

1. Loading Datasets

Scikit-learn provides several built-in datasets that are useful for practice and experimentation. You
can load these datasets using the sklearn.datasets module.

Example: Loading the Iris Dataset


Python

from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Features and target
X = iris.data    # Feature matrix
y = iris.target  # Target vector

print(X.shape)  # Output: (150, 4)
print(y.shape)  # Output: (150,)

2. Data Preprocessing

Before feeding data into machine learning algorithms, it's common to preprocess it. Scikit-learn
provides several preprocessing tools.

Example: Standardizing Data


Python

from sklearn.preprocessing import StandardScaler

# Example data
data = [[1, 2], [3, 4], [5, 6]]

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)

Example: One-Hot Encoding


Python

from sklearn.preprocessing import OneHotEncoder

# Example categorical data
data = [['male'], ['female'], ['female'], ['male']]

# Initialize the encoder
encoder = OneHotEncoder()

# Fit and transform the data
encoded_data = encoder.fit_transform(data).toarray()

print(encoded_data)

3. Splitting the Data into Training and Testing Sets


A common practice is to split your dataset into training and testing subsets. Scikit-learn provides
train_test_split to easily perform this split.

Example: Splitting Data


Python

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape)  # Output: (105, 4)
print(X_test.shape)   # Output: (45, 4)

4. Building a Simple Classifier

Let's build a simple classifier using the famous K-Nearest Neighbors (KNN) algorithm to classify
the Iris dataset.

Example: KNN Classifier


Python

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Train the classifier
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

5. Evaluating the Model

Once you build a model, you can evaluate its performance using various metrics like accuracy,
precision, recall, F1 score, and confusion matrix.

Example: Confusion Matrix and Classification Report


Python

from sklearn.metrics import confusion_matrix, classification_report

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
print("Confusion Matrix:")
print(conf_matrix)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

6. Making Predictions

Once you’ve trained your model, you can use it to make predictions on new, unseen data.

Example: Making Predictions on New Data


Python

# Example new data point
new_data = [[5.1, 3.5, 1.4, 0.2]]

# Predict the class using the trained KNN model
prediction = knn.predict(new_data)
print("Predicted class:", prediction)

7. Hyperparameter Tuning

You can use GridSearchCV or RandomizedSearchCV to find the best hyperparameters for your model.

Example: Hyperparameter Tuning with GridSearchCV


Python

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Define the hyperparameters to search
param_grid = {'n_neighbors': [3, 5, 7]}

# Initialize GridSearchCV
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)

# Fit the model (X_train and y_train come from the earlier split)
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

Conclusion

Scikit-learn is an easy-to-use library that simplifies the process of building and evaluating machine
learning models. By following these steps, you can quickly get started with Scikit-learn, from
installation to training a model and evaluating its performance.
