
Scikit-learn:

Introduction to Scikit-learn

Scikit-learn is one of the most widely used open-source machine learning libraries in Python. It
was developed as part of the SciPy ecosystem and is designed to offer simple and efficient tools
for data analysis and machine learning. Scikit-learn's appeal lies in its easy-to-understand API,
extensive algorithm options, and strong community support, making it an excellent choice for both
beginners and experienced data scientists.

PACKAGES IN SCIKIT-LEARN:

Scikit-learn offers a comprehensive collection of packages (also called modules) that provide
functionality for different stages of the machine learning workflow, from data preprocessing to
model evaluation. Below is an elaboration of the main packages in Scikit-learn, along with
examples to demonstrate their use.

1. sklearn.preprocessing: Data Preprocessing

The preprocessing module contains functions to transform raw data into a format more suitable for
machine learning algorithms. These transformations include scaling, normalizing, encoding
categorical variables, and imputing missing values.

Key Functions:

• StandardScaler: Standardizes features by removing the mean and scaling to unit variance.
• MinMaxScaler: Scales features to a fixed range, typically [0, 1].
• OneHotEncoder: Converts categorical variables into a one-hot numeric array.
• LabelEncoder: Encodes target labels with values between 0 and n_classes - 1.
• SimpleImputer: Fills missing values in the dataset (see the imputation example below).

Example: Data Scaling


Python

from sklearn.preprocessing import StandardScaler
import numpy as np

# Example data
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Apply standard scaling
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

print(scaled_data)
Example: One-Hot Encoding
Python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Example categorical data
data = np.array([['Male'], ['Female'], ['Female'], ['Male']])

# Apply one-hot encoding
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(data).toarray()

print(encoded_data)
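Example: Imputing Missing Values

SimpleImputer, listed above, fills in missing entries; note that in current Scikit-learn releases it is imported from sklearn.impute rather than sklearn.preprocessing. Below is a minimal sketch, assuming mean imputation on a small array with one missing value.

Python

from sklearn.impute import SimpleImputer
import numpy as np

# Example data with a missing value (np.nan)
data = np.array([[1, 2], [np.nan, 3], [7, 6]])

# Replace missing values with the column mean
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(data)

print(imputed_data)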

2. sklearn.model_selection: Model Selection and Evaluation


This module provides tools for splitting datasets, evaluating models, and performing
hyperparameter tuning.

Key Functions:

• train_test_split: Splits the dataset into training and testing sets.
• cross_val_score: Evaluates a model using cross-validation.
• GridSearchCV: Performs exhaustive hyperparameter tuning via grid search.
• RandomizedSearchCV: Performs hyperparameter tuning using a randomized search over the parameter space.
• KFold: Splits data into k consecutive folds for cross-validation.

Example: Splitting Data for Training and Testing


Python

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training data size:", X_train.shape)
print("Test data size:", X_test.shape)

Example: GridSearchCV for Hyperparameter Tuning
Python

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the model
rf = RandomForestClassifier()

# Define the hyperparameter grid
param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [None, 10, 20]}

# Perform grid search (X_train and y_train come from the split above)
grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Print best hyperparameters
print("Best Parameters:", grid_search.best_params_)

3. sklearn.metrics: Model Evaluation Metrics

The metrics module provides various tools for evaluating the performance of machine learning
models. These include accuracy, precision, recall, F1 score, ROC curves, and regression metrics.

Key Functions:

• accuracy_score: Measures the accuracy of a classification model.
• confusion_matrix: Computes the confusion matrix for classification models.
• mean_squared_error: Measures the mean squared error for regression models.
• roc_auc_score: Calculates the Area Under the Receiver Operating Characteristic Curve (AUC-ROC).

Example: Confusion Matrix and Accuracy Score


Python

from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Confusion matrix
print(confusion_matrix(y_true, y_pred))

# Accuracy score
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

4. sklearn.ensemble: Ensemble Methods

The ensemble module provides a collection of algorithms that combine the predictions of multiple
models to improve performance. These include popular ensemble techniques like bagging and
boosting.

Key Algorithms:

• RandomForestClassifier: Builds a forest of decision trees for classification tasks.
• RandomForestRegressor: Builds a forest of decision trees for regression tasks.
• GradientBoostingClassifier: Builds an ensemble of weak learners (decision trees) and optimizes them using gradient boosting.
• AdaBoostClassifier: Performs adaptive boosting to improve model accuracy.

Example: Random Forest Classifier


Python

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train RandomForest classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Predict using the trained model
predictions = rf.predict(X)
print(predictions)
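Example: Gradient Boosting Classifier

GradientBoostingClassifier from the list above follows the same fit/predict pattern. A minimal sketch on the Iris dataset, with assumed values for n_estimators and learning_rate:

Python

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train a gradient boosting classifier
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb.fit(X, y)

# Predict the first five samples
print(gb.predict(X[:5]))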

5. sklearn.linear_model: Linear Models


This module contains a wide variety of linear models for regression and classification, such as
linear regression, logistic regression, ridge regression, and more.

Key Algorithms:

• LinearRegression: Fits a linear model to minimize the sum of squared errors.
• LogisticRegression: Performs classification by fitting a logistic curve to the data.
• Ridge: A linear model with L2 regularization.
• Lasso: A linear model with L1 regularization.

Example: Linear Regression


Python

from sklearn.linear_model import LinearRegression
import numpy as np

# Sample data (X: features, y: target variable)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 3, 3, 5])

# Train the model
model = LinearRegression()
model.fit(X, y)

# Predict new values
predictions = model.predict([[6]])
print("Prediction for input 6:", predictions)

6. sklearn.tree: Decision Trees


The tree module provides tools for building decision tree models, both for classification and
regression tasks.

Key Algorithms:

• DecisionTreeClassifier: A decision tree for classification tasks.
• DecisionTreeRegressor: A decision tree for regression tasks.

Example: Decision Tree Classifier


Python

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X, y)

# Predict using the trained model
predictions = clf.predict(X)
print(predictions)
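Example: Decision Tree Regressor

DecisionTreeRegressor works the same way for continuous targets. A minimal sketch on hypothetical one-dimensional data, with max_depth limited to keep the tree small:

Python

from sklearn.tree import DecisionTreeRegressor
import numpy as np

# Hypothetical one-dimensional regression data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.5, 3.0, 4.5, 6.0, 7.5])

# Limit depth to avoid memorizing every training point
reg = DecisionTreeRegressor(max_depth=2)
reg.fit(X, y)

print(reg.predict([[3.5]]))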

7. sklearn.svm: Support Vector Machines (SVM)

The svm module provides Support Vector Machine (SVM) algorithms for classification, regression,
and outlier detection.

Key Algorithms:

• SVC: Support Vector Classification.
• SVR: Support Vector Regression.

Example: Support Vector Classifier


Python

from sklearn.svm import SVC
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Train a support vector classifier
svc = SVC(kernel='linear')
svc.fit(X, y)

# Predict using the trained model
predictions = svc.predict(X)
print(predictions)
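Example: Support Vector Regression

SVR mirrors the SVC interface for regression targets. A minimal sketch with hypothetical data and a linear kernel:

Python

from sklearn.svm import SVR
import numpy as np

# Hypothetical regression data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.1])

# Fit a linear-kernel support vector regressor
svr = SVR(kernel='linear')
svr.fit(X, y)

print(svr.predict([[6]]))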

8. sklearn.cluster: Clustering
The cluster module provides tools for performing unsupervised learning tasks like clustering, which
groups similar data points together.

Key Algorithms:

• KMeans: A popular clustering algorithm that partitions data into k clusters.
• DBSCAN: A density-based clustering algorithm.
• AgglomerativeClustering: A hierarchical clustering algorithm.

Example: K-Means Clustering


Python

from sklearn.cluster import KMeans
import numpy as np

# Example data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Fit KMeans with 2 clusters
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans.fit(X)

# Predicted cluster labels
print(kmeans.labels_)
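Example: DBSCAN Clustering

Unlike KMeans, DBSCAN does not need the number of clusters up front; it groups points by density and labels outliers as -1. A minimal sketch on the same data, with assumed eps and min_samples values:

Python

from sklearn.cluster import DBSCAN
import numpy as np

# Same example data as above
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Points within eps=3 of each other form a cluster; -1 marks noise
db = DBSCAN(eps=3, min_samples=2)
db.fit(X)

print(db.labels_)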

9. sklearn.decomposition: Dimensionality Reduction

This module includes algorithms for reducing the number of features in a dataset while retaining
as much information as possible. Dimensionality reduction is useful for data visualization and
speeding up machine learning algorithms.

Key Algorithms:

• PCA: Principal Component Analysis for feature extraction.
• TruncatedSVD: A variant of SVD for sparse data.
• FactorAnalysis: Performs factor analysis.

Example: Principal Component Analysis (PCA)


Python

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X = iris.data

# Reduce the dimensionality to 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced)
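Example: Checking Explained Variance

After fitting PCA, the explained_variance_ratio_ attribute shows how much of the original variance each retained component captures, which helps decide whether two components are enough. A minimal sketch:

Python

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# Fit PCA with 2 components on the iris features
pca = PCA(n_components=2).fit(load_iris().data)

# Fraction of total variance captured by each component
print(pca.explained_variance_ratio_)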

Conclusion

Scikit-learn is packed with powerful packages that allow users to implement a wide range of
machine learning algorithms. Each package serves a critical role in the machine learning
pipeline—from preprocessing data to fine-tuning models and making predictions. The consistency
of its API makes it a highly usable library for both beginners and professionals.

How to Install Scikit-learn

Scikit-learn can be installed easily using Python's package management tools like pip or conda (for
Anaconda users). Follow the steps below to install Scikit-learn.

1. Installing Scikit-learn with pip

If you are using the default Python installation, you can install Scikit-learn via pip.

Bash

pip install scikit-learn

This will install Scikit-learn along with its dependencies like NumPy, SciPy, and joblib.

2. Installing Scikit-learn with conda (Anaconda users)

If you're using the Anaconda distribution of Python, you can install Scikit-learn using conda:

Bash

conda install scikit-learn

This will install the version of Scikit-learn that is compatible with Anaconda and any necessary
dependencies.

Verifying Installation
Once the installation is complete, you can verify it by checking the installed version:

Python

import sklearn
print(sklearn.__version__)

If no errors occur, and a version number is printed, Scikit-learn is installed successfully.

Basic Operations in Scikit-learn


After installing Scikit-learn, you can start with some basic operations such as loading datasets,
preprocessing data, and building simple models. Below are a few common operations to get
started.

1. Loading Datasets

Scikit-learn provides several built-in datasets that are useful for practice and experimentation. You
can load these datasets using the sklearn.datasets module.

Example: Loading the Iris Dataset


Python

from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()

# Features and target
X = iris.data    # Feature matrix
y = iris.target  # Target vector

print(X.shape)  # Output: (150, 4)
print(y.shape)  # Output: (150,)

2. Data Preprocessing

Before feeding data into machine learning algorithms, it's common to preprocess it. Scikit-learn
provides several preprocessing tools.

Example: Standardizing Data


Python

from sklearn.preprocessing import StandardScaler

# Example data
data = [[1, 2], [3, 4], [5, 6]]

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the data
scaled_data = scaler.fit_transform(data)

print(scaled_data)

Example: One-Hot Encoding


Python

from sklearn.preprocessing import OneHotEncoder

# Example categorical data
data = [['male'], ['female'], ['female'], ['male']]

# Initialize the encoder
encoder = OneHotEncoder()

# Fit and transform the data
encoded_data = encoder.fit_transform(data).toarray()

print(encoded_data)

3. Splitting the Data into Training and Testing Sets


A common practice is to split your dataset into training and testing subsets. Scikit-learn provides
train_test_split to easily perform this split.

Example: Splitting Data


Python

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape)  # Output: (105, 4)
print(X_test.shape)   # Output: (45, 4)

4. Building a Simple Classifier

Let's build a simple classifier using the famous K-Nearest Neighbors (KNN) algorithm to classify
the Iris dataset.

Example: KNN Classifier


Python

from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the KNN classifier
knn = KNeighborsClassifier(n_neighbors=3)

# Train the classifier
knn.fit(X_train, y_train)

# Make predictions
y_pred = knn.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

5. Evaluating the Model

Once you build a model, you can evaluate its performance using various metrics like accuracy,
precision, recall, F1 score, and confusion matrix.

Example: Confusion Matrix and Classification Report


Python

from sklearn.metrics import confusion_matrix, classification_report

# Generate the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the confusion matrix
print("Confusion Matrix:")
print(conf_matrix)

# Print the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))

6. Making Predictions

Once you’ve trained your model, you can use it to make predictions on new, unseen data.

Example: Making Predictions on New Data


Python

# Example new data point
new_data = [[5.1, 3.5, 1.4, 0.2]]

# Predict the class using the trained KNN model
prediction = knn.predict(new_data)
print("Predicted class:", prediction)

7. Hyperparameter Tuning

You can use GridSearchCV or RandomizedSearchCV to find the best hyperparameters for your model.

Example: Hyperparameter Tuning with GridSearchCV


Python

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Define the hyperparameters to search
param_grid = {'n_neighbors': [3, 5, 7]}

# Initialize GridSearchCV
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)

# Fit the model (X_train and y_train come from the earlier split)
grid_search.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid_search.best_params_)

Conclusion

Scikit-learn is an easy-to-use library that simplifies the process of building and evaluating machine
learning models. By following these steps, you can quickly get started with Scikit-learn, from
installation to training a model and evaluating its performance.
