
1(a) Explain Supervised Learning and Unsupervised Learning

1(b) Outline the concepts involved in the following in the context of preparing the data for Machine Learning algorithms. Write code-snippet as applicable.

i. Data Cleaning

ii. Handling text and categorical attributes

iii. Feature scaling


In the context of preparing data for Machine Learning algorithms, the following concepts are
essential:

i. Data Cleaning:
Data cleaning involves handling missing values, removing duplicates, and dealing with outliers in the
dataset. It is crucial to ensure the data is accurate and reliable for training machine learning models.
Here is a code snippet demonstrating data cleaning using pandas in Python:

```python
import pandas as pd

# Load the dataset
df = pd.read_csv('dataset.csv')

# Handle missing values by dropping incomplete rows
df.dropna(inplace=True)

# Remove duplicate rows
df.drop_duplicates(inplace=True)

# Deal with outliers: keep only rows whose z-score lies within +/-3
def remove_outliers(df, column):
    z_scores = (df[column] - df[column].mean()) / df[column].std()
    return df[(z_scores > -3) & (z_scores < 3)]

df = remove_outliers(df, 'feature_column')
```
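
Dropping rows with `dropna` discards data that may be useful. A common alternative is to impute missing values instead; here is a minimal sketch using scikit-learn's `SimpleImputer` (the column names `feature1` and `feature2` are hypothetical):

```python
from sklearn.impute import SimpleImputer

# Replace missing values in the numeric columns with the column median
imputer = SimpleImputer(strategy='median')
df[['feature1', 'feature2']] = imputer.fit_transform(df[['feature1', 'feature2']])
```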

ii. Handling text and categorical attributes:

Text and categorical attributes need to be converted into numerical representations for machine learning algorithms to process them effectively. This can be done using techniques like one-hot encoding or label encoding. Here is an example code snippet using pandas and scikit-learn for one-hot encoding:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Load the dataset
df = pd.read_csv('dataset.csv')

# Perform one-hot encoding on the categorical column 'category'
encoded_df = pd.get_dummies(df, columns=['category'])

# Alternatively, use scikit-learn's OneHotEncoder
encoder = OneHotEncoder()
encoded_features = encoder.fit_transform(df[['category']]).toarray()
```
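
The text above also mentions label encoding. For categories with a natural order, scikit-learn's `OrdinalEncoder` maps each category to an integer instead of creating one column per category; a minimal sketch, reusing the `category` column from the example above:

```python
from sklearn.preprocessing import OrdinalEncoder

# Map each category to an integer code (0, 1, 2, ...)
ordinal_encoder = OrdinalEncoder()
category_codes = ordinal_encoder.fit_transform(df[['category']])
```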

iii. Feature scaling:

Feature scaling is essential to ensure all features have the same scale, preventing certain features
from dominating the model training process. Common techniques include standardization (scaling
features to have zero mean and unit variance) or normalization (scaling features to a range between
0 and 1). Here is a code snippet using scikit-learn for feature scaling:

```python
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit and transform the features (zero mean, unit variance)
scaled_features = scaler.fit_transform(df[['feature1', 'feature2']])
```
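
For normalization to a range between 0 and 1, mentioned above as the other common technique, scikit-learn provides `MinMaxScaler`; a minimal sketch over the same (hypothetical) columns:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to the range [0, 1]
min_max_scaler = MinMaxScaler()
normalized_features = min_max_scaler.fit_transform(df[['feature1', 'feature2']])
```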

By incorporating these concepts into the data preparation process, you can ensure that your data is
clean, appropriately encoded, and scaled for training machine learning models effectively.
1(c) Identify why model evaluation using k-Fold Cross Validation is considered a better alternative over a standard one set of train and validation set.

Model evaluation using k-Fold Cross Validation is considered a better alternative over a standard one
set of train and validation set for the following reasons:

1. **Better Utilization of Data**: In k-Fold Cross Validation, the dataset is divided into k subsets
(folds), and the model is trained and evaluated k times, using each fold as a validation set once and
the remaining folds as the training set. This ensures that each data point is used for validation exactly
once, leading to better utilization of the available data.

2. **Reduced Variance**: By averaging the evaluation metrics over k iterations, k-Fold Cross Validation provides a more reliable estimate of model performance compared to a single train-validation split. This reduces the variance of the evaluation metrics and provides a more stable assessment of the model's generalization performance.

3. **Mitigating Overfitting**: Using multiple validation sets in k-Fold Cross Validation helps in
mitigating the risk of overfitting to a specific validation set. The model is evaluated on different
subsets of data, which can help in identifying whether the model is generalizing well across different
data distributions.

4. **Robustness**: k-Fold Cross Validation provides a more robust evaluation of the model's performance as it is less sensitive to the randomness in the data split compared to a single train-validation split. This robustness leads to a more reliable assessment of the model's ability to generalize to unseen data.

5. **Hyperparameter Tuning**: k-Fold Cross Validation is commonly used in hyperparameter tuning processes, such as grid search or random search, to find the optimal hyperparameters for the model. By evaluating the model on multiple folds, it helps in selecting hyperparameters that perform well across different data subsets.

Overall, k-Fold Cross Validation is preferred over a single fixed train/validation split because it provides a more reliable estimate of model performance, better data utilization, reduced variance, and robustness in evaluating the model's generalization capabilities, as the sketch below illustrates.
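
A minimal sketch of k-Fold evaluation using scikit-learn's `cross_val_score`; the dataset and model here are toy stand-ins for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data and model purely for illustration
X, y = make_classification(n_samples=200, random_state=42)
model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation: each fold serves as the validation set once;
# the mean and std summarize performance and its variability
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
print(scores.mean(), scores.std())
```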

2(a) Explain any four of the main challenges of machine learning.


2(b) Explain the difference between random sampling and stratified sampling. Write code-snippet to implement stratified sampling.

Random sampling involves randomly selecting data points from a dataset without considering any
specific characteristics or attributes. On the other hand, stratified sampling involves dividing the
dataset into subgroups based on important attributes and then randomly sampling from each
subgroup to ensure that the sample is representative of the overall population.

To implement stratified sampling in Python using Scikit-Learn, you can use the `train_test_split`
function with the `stratify` parameter. Here is an example code snippet:

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Assuming 'data' is your dataset and 'target_column' is the important
# attribute to stratify on
X = data.drop(columns=['target_column'])
y = data['target_column']

# stratify=y preserves the class proportions of y in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
```

In this code snippet, `train_test_split` is used to split the dataset into training and test sets while
ensuring that the stratification is based on the 'target_column' (important attribute). This helps in
creating a representative test set for more accurate evaluation of the model.
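
To verify that the split is representative, the class proportions of the full dataset and the test set can be compared; under stratified sampling they should be nearly identical (a quick check, continuing the names above):

```python
# Class proportions before and after splitting
print(y.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```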

2(c) Identify how Grid Search and Randomized Search help in Fine-Tuning a model. Write code-snippet as applicable.

Grid Search and Randomized Search help in fine-tuning a model by systematically searching through
hyperparameter combinations to find the best set of values that optimize the model's performance.
Grid Search exhaustively searches through all specified hyperparameter values, while Randomized
Search evaluates a fixed number of combinations by randomly selecting values for each
hyperparameter at every iteration.

Here is an example code snippet for Grid Search:

```python
from sklearn.model_selection import GridSearchCV

# full_pipeline, housing and housing_labels are assumed to be defined
# earlier (a preprocessing + random-forest pipeline and its training data)
param_grid = [
    {'preprocessing__geo_n_clusters': [5, 8, 10],
     'random_forest__max_features': [2, 4, 6]},
]

grid_search = GridSearchCV(full_pipeline, param_grid, cv=3,
                           scoring='neg_root_mean_squared_error')
grid_search.fit(housing, housing_labels)
```

And for Randomized Search:

```python
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

# Hyperparameter values are sampled from the given integer ranges
param_distribs = {
    'preprocessing__geo_n_clusters': randint(low=3, high=50),
    'random_forest__max_features': randint(low=2, high=20),
}

rnd_search = RandomizedSearchCV(full_pipeline, param_distributions=param_distribs,
                                n_iter=16, cv=3,
                                scoring='neg_root_mean_squared_error',
                                random_state=42)
rnd_search.fit(housing, housing_labels)
```
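
After either search has been fitted, the best hyperparameter combination, its cross-validated score, and the refit best model can be read off the search object (a short usage sketch using standard scikit-learn attributes):

```python
# Best hyperparameter combination and its cross-validated score
print(rnd_search.best_params_)
print(rnd_search.best_score_)

# The best estimator, refit on the full training set
final_model = rnd_search.best_estimator_
```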

3(a) Apply Find-S algorithm to construct a maximally specific hypothesis. Show its working over training instances in Table 1.

3(b) Apply List-Then-Eliminate algorithm to find the set of hypotheses consistent with the training instances shown in Table 1.


3(c) Outline the concepts involved in the following. Write code-snippet as applicable.


i. Measuring accuracy using Cross-Validation

ii. Confusion Matrix

iii. Precision and Recall


4(a) Identify how the number of semantically distinct hypotheses in the hypotheses space is only 973 considering the training instances in Table 1. Also explain how this number will increase with the inclusion of additional categorical attributes in the training instances.

4(b) Apply Candidate-Elimination algorithm to construct the boundary of the most specific and most general hypothesis sets by referring to training instances in Table 1.


4(c) Outline the concept of Precision/Recall trade-off with an example.
CHAPTER: Machine Learning Landscape

1. Explain Machine Learning with examples of applications.


2. Explain Supervised Learning and Unsupervised Learning with examples. [repeated]

3. Discuss any four main challenges of Machine Learning. [repeated]

4. Discuss the differences between Online and Batch Learning systems.


5. Explain Overfitting and Underfitting training data with appropriate figures. [repeated]

6. Contrast Instance-based versus Model-based learning systems.


CHAPTER: End-to-end Machine Learning Project

1. Explain with expressions a few of the popular metrics used in regression tasks.

2. Explain the difference between random sampling and stratified sampling and write code-snippet to show usage of the latter one. [repeated]

3. In the context of preparing the data for Machine Learning algorithms, write a note on

a. Data Cleaning

b. Handling text and categorical attributes

c. Feature scaling [repeated]

4. Explain why model evaluation using k-Fold Cross Validation is considered a better alternative over a standard one set of train and validation set. [repeated]

5. With the code snippets, show how Grid Search and Randomized Search help in Fine-Tuning a model. [repeated]

MODULE 2

CHAPTER: Concept Learning and Learning Problems

1. Apply Find-S algorithm to construct a maximally specific hypothesis and show its working by taking the EnjoySport concept and training instances given in Table 1. [repeated]

2. Identify how the number of semantically distinct hypotheses in the hypotheses space is only 973 considering the training instances in Table 1. Also explain how this number will increase with the inclusion of additional categorical attributes in the training instances. [repeated]

3. Apply List-Then-Eliminate algorithm to find the set of hypotheses consistent with the training example shown in Table 1. [repeated]

4. Apply Candidate-Elimination algorithm to construct the boundary of the most specific and most general hypothesis sets by referring to training instances in Table 1. [repeated]

CHAPTER: Classification

1. Using code snippets, outline the concepts involved in

a. Measuring accuracy using Cross-Validation

b. Confusion Matrix

c. Precision and Recall

d. F1 Score [repeated]

2. Outline the concept of Precision/Recall trade-off with an example.

3. With a code snippet, explain how Multilabel classification is different from Multiclass-Multioutput classification.
