Vtu ML
Introduction to Scikit-learn
Scikit-learn is one of the most widely used open-source machine learning libraries in Python. It
was developed as part of the SciPy ecosystem and is designed to offer simple, efficient tools for
data analysis and machine learning. Its appeal lies in a consistent, easy-to-understand API, a
broad selection of algorithms, and strong community support, which makes it a good choice for both
beginners and experienced data scientists.
Scikit-learn is organized into packages (also called modules), each covering a different stage of
the machine learning workflow, from data preprocessing to model evaluation. The main packages are
described below, with examples that demonstrate their use.
1. sklearn.preprocessing: Data Preprocessing
The preprocessing module contains functions that transform raw data into a format better suited to
machine learning algorithms. These transformations include scaling, normalizing, and encoding
categorical variables. Imputation of missing values is handled by the related sklearn.impute module.
Key Functions:
StandardScaler: Standardizes features by removing the mean and scaling to unit variance.
MinMaxScaler: Scales features to a fixed range, typically [0, 1].
OneHotEncoder: Converts categorical variables into a one-hot numeric array.
LabelEncoder: Encodes target labels with values between 0 and n_classes - 1.
SimpleImputer: Fills missing values in the dataset (imported from sklearn.impute, not sklearn.preprocessing).
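import numpy as np
from sklearn.preprocessing import StandardScaler
# Example data
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Standardize each column to zero mean and unit variance (StandardScaler is the assumed scaler here)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print(scaled_data)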
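The other helpers named in the Key Functions list follow the same fit/transform pattern. The following is a minimal sketch, assuming small made-up inputs (the arrays and labels are illustrative and not from the original text); note that SimpleImputer is imported from sklearn.impute.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Illustrative data (assumed for this sketch)
features = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])
labels = ["cat", "dog", "cat", "bird"]

# Replace the missing value with the mean of its column: (2 + 6) / 2 = 4
imputed = SimpleImputer(strategy="mean").fit_transform(features)

# Rescale each column to the range [0, 1]
scaled = MinMaxScaler().fit_transform(imputed)

# Map string labels to integers; classes are stored in sorted order
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

print(imputed)
print(scaled)
print(encoded_labels, encoder.classes_)
```

Because every transformer exposes fit, transform, and fit_transform, they can be swapped or chained without changing the surrounding code.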
Example: One-Hot Encoding
Python
print(encoded_data)
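A minimal sketch of what a one-hot encoding example might look like, assuming a hypothetical single-column color feature (the data is illustrative). The result is converted with toarray() because OneHotEncoder returns a sparse matrix by default.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column (one feature, four samples)
colors = [["red"], ["green"], ["blue"], ["green"]]

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it to a dense array for printing
encoded_data = encoder.fit_transform(colors).toarray()

# One output column per category, in sorted order
print(encoder.categories_)
print(encoded_data)
```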
Key Functions:
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
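Assuming this section describes sklearn.model_selection, a short sketch of two of its common helpers may be useful here: train_test_split for a hold-out split and cross_val_score for k-fold cross-validation. The choice of KNeighborsClassifier, the 30% test size, and the random seed are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)

# 5-fold cross-validation on the training portion only
scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=5)
print("Mean CV accuracy:", scores.mean())
```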
Example: GridSearchCV for Hyperparameter Tuning
Python
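A minimal sketch of a grid search, assuming an SVC classifier on the Iris dataset; the estimator and the parameter grid values are illustrative choices, not taken from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid; these candidate values are assumptions
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Try every combination in the grid with 5-fold cross-validation
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X, y)

print("Best parameters:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)
```

GridSearchCV fits one model per parameter combination per fold, so the cost grows quickly with the size of the grid.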
3. sklearn.metrics: Model Evaluation
The metrics module provides tools for evaluating the performance of machine learning models. These
include accuracy, precision, recall, F1 score, ROC curves, and regression metrics.
Key Functions:
from sklearn.metrics import confusion_matrix, accuracy_score
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
# Confusion matrix
print(confusion_matrix(y_true, y_pred))
# Accuracy score
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)
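Precision, recall, and F1 score can be computed for the same labels. The sketch below reuses the y_true and y_pred values from the example above; for these labels there are 3 true positives, 0 false positives, and 1 false negative.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]

# Precision = TP / (TP + FP) = 3 / 3
precision = precision_score(y_true, y_pred)
# Recall = TP / (TP + FN) = 3 / 4
recall = recall_score(y_true, y_pred)
# F1 is the harmonic mean of precision and recall
f1 = f1_score(y_true, y_pred)

print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
```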
4. sklearn.ensemble: Ensemble Methods
The ensemble module provides a collection of algorithms that combine the predictions of multiple
models to improve performance. These include popular ensemble techniques like bagging and
boosting.
Key Algorithms:
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
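A minimal sketch of an ensemble classifier on the Iris dataset, assuming a random forest (a bagging-style method); the split ratio, the number of trees, and the random seeds are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train 100 decision trees on bootstrap samples and combine their votes
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

forest_accuracy = accuracy_score(y_test, forest.predict(X_test))
print("Accuracy:", forest_accuracy)
```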
Key Algorithms:
import numpy as np
# Sample data (X: features, y: target variable)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 3, 3, 5])
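Assuming the sample data above was meant for a linear regression example, a minimal sketch with LinearRegression could look like this. The least-squares fit for these five points has slope 0.9 and intercept 0.1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data (X: features, y: target variable)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 3, 3, 5])

# Fit an ordinary least-squares line y = slope * x + intercept
model = LinearRegression()
model.fit(X, y)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)

# Predict the target for an unseen input
prediction = model.predict(np.array([[6]]))
print("Prediction for x = 6:", prediction[0])
```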
Key Algorithms:
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
7. sklearn.svm: Support Vector Machines
The svm module provides Support Vector Machine (SVM) algorithms for classification, regression,
and outlier detection.
Key Algorithms:
SVR: Support Vector Regression.
from sklearn.datasets import load_iris
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
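A minimal sketch of SVM classification on the Iris dataset, assuming the SVC class with an RBF kernel; the kernel, the split ratio, and the random seed are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Support vector classifier with a radial basis function kernel
svm_clf = SVC(kernel="rbf", C=1.0)
svm_clf.fit(X_train, y_train)

svm_accuracy = accuracy_score(y_test, svm_clf.predict(X_test))
print("Accuracy:", svm_accuracy)
```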
8. sklearn.cluster: Clustering
The cluster module provides tools for performing unsupervised learning tasks like clustering, which
groups similar data points together.
Key Algorithms:
import numpy as np
# Example data
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
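A minimal sketch of clustering the example data above, assuming the KMeans algorithm with two clusters; the number of clusters, n_init, and the random seed are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Example data: two well-separated groups of points
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Partition the points into two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)

# Cluster index assigned to each sample (which group gets 0 or 1 is arbitrary)
print(kmeans.labels_)
# Coordinates of the two cluster centers
print(kmeans.cluster_centers_)
```

The label numbers themselves carry no meaning; only the grouping does.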
9. sklearn.decomposition: Dimensionality Reduction
This module includes algorithms for reducing the number of features in a dataset while retaining
as much information as possible. Dimensionality reduction is useful for data visualization and
speeding up machine learning algorithms.
Key Algorithms:
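from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
# Load dataset
iris = load_iris()
X = iris.data
pca = PCA(n_components=2)  # assumed: PCA reducing the 4 features to 2 components
X_reduced = pca.fit_transform(X)
print(X_reduced)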
Conclusion
Scikit-learn is packed with powerful packages that allow users to implement a wide range of
machine learning algorithms. Each package serves a critical role in the machine learning
pipeline—from preprocessing data to fine-tuning models and making predictions. The consistency
of its API makes it a highly usable library for both beginners and professionals.
How to Install Scikit-learn
Scikit-learn can be installed easily using Python's package management tools like pip or conda (for
Anaconda users). Follow the steps below to install Scikit-learn.
If you are using the default Python installation, you can install Scikit-learn via pip.
Bash
pip install scikit-learn
This will install Scikit-learn along with its dependencies like NumPy, SciPy, and joblib.
If you're using the Anaconda distribution of Python, you can install Scikit-learn using conda:
Bash
conda install scikit-learn
This will install the version of Scikit-learn that is compatible with Anaconda and any necessary
dependencies.
Verifying Installation
Once the installation is complete, you can verify it by checking the installed version:
Python
import sklearn
print(sklearn.__version__)
1. Loading Datasets
Scikit-learn provides several built-in datasets that are useful for practice and experimentation. You
can load these datasets using the sklearn.datasets module.
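As a short sketch of this step, the Iris dataset can be loaded and inspected as follows. It contains 150 samples with 4 features each.

```python
from sklearn.datasets import load_iris

# Load the built-in Iris dataset
iris = load_iris()

# Feature matrix and target vector
print(iris.data.shape)
print(iris.target.shape)

# Human-readable names for the columns and classes
print(iris.feature_names)
print(iris.target_names)
```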
2. Data Preprocessing
Before feeding data into machine learning algorithms, it's common to preprocess it. Scikit-learn
provides several preprocessing tools.
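from sklearn.preprocessing import StandardScaler
# Example data
data = [[1, 2], [3, 4], [5, 6]]
scaler = StandardScaler()  # StandardScaler is an assumed choice of scaler
scaled_data = scaler.fit_transform(data)
print(scaled_data)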
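from sklearn.preprocessing import OneHotEncoder
# Example categorical data
data = [['male'], ['female'], ['female'], ['male']]
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(data).toarray()  # convert the sparse result to a dense array
print(encoded_data)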
3. Splitting Data into Training and Test Sets
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the dataset into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
4. Building a Model
Let's build a simple classifier using the famous K-Nearest Neighbors (KNN) algorithm to classify
the Iris dataset.
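from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split 70/30 and train a KNN classifier (n_neighbors=3 is an assumed, illustrative value)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
# Make predictions
y_pred = knn.predict(X_test)
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)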
5. Evaluating the Model
Once you build a model, you can evaluate its performance using various metrics like accuracy,
precision, recall, F1 score, and confusion matrix.
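A minimal sketch of such an evaluation, assuming a KNN classifier trained on a 70/30 split of the Iris dataset (the classifier settings and the random seed are illustrative). classification_report summarizes precision, recall, and F1 score per class in one call.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1 score as formatted text
report = classification_report(y_test, y_pred, target_names=iris.target_names)
print(report)
```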
6. Making Predictions
Once you’ve trained your model, you can use it to make predictions on new, unseen data.
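# Illustrative measurements for one new flower (sepal length, sepal width, petal length, petal width)
new_data = [[5.1, 3.5, 1.4, 0.2]]
prediction = knn.predict(new_data)
print("Predicted class:", prediction)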
7. Hyperparameter Tuning
You can use GridSearchCV or RandomizedSearchCV to find the best hyperparameters for your model.
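from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
param_grid = {'n_neighbors': [3, 5, 7, 9]}  # illustrative candidate values
# Initialize GridSearchCV and run the search on the training data from the earlier split
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
# Best parameters
print("Best Parameters:", grid_search.best_params_)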
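RandomizedSearchCV works the same way but samples a fixed number of candidates instead of trying every combination. The sketch below assumes a KNN classifier on the Iris dataset; the parameter ranges, n_iter, and the random seed are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Illustrative search space: 20 neighbor counts x 2 weighting schemes
param_distributions = {
    "n_neighbors": list(range(1, 21)),
    "weights": ["uniform", "distance"],
}

# Evaluate only 10 randomly sampled combinations with 5-fold cross-validation
random_search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions,
    n_iter=10,
    cv=5,
    random_state=42,
)
random_search.fit(X, y)

print("Best Parameters:", random_search.best_params_)
print("Best CV score:", random_search.best_score_)
```

Sampling keeps the cost fixed at n_iter candidates, which helps when the full grid would be too large to search exhaustively.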
Conclusion
Scikit-learn is an easy-to-use library that simplifies the process of building and evaluating machine
learning models. By following these steps, you can quickly get started with Scikit-learn, from
installation to training a model and evaluating its performance.