TD2345
The challenge
In this exercise we'll implement logistic regression and apply it to a classification task.
In the first part of this exercise, we'll build a logistic regression model to predict
whether an administration employee will get promoted or not. You have to
determine each employee's chance of promotion based on their results on two
domain-related exams, age and sex. You have historical data from previous employees
that you can use as a training set for logistic regression. To accomplish this, we're
going to build a classification model that estimates the probability of promotion
based on these features.
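For reference, the model described above can be written in the usual logistic regression form. The notation is an assumption made here for illustration: x is the feature vector of one employee with a leading 1 for the bias term, theta is the parameter vector, and y = 1 denotes a promotion.

```latex
\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
h_\theta(x) = \sigma(\theta^\top x) \approx P(y = 1 \mid x;\ \theta)

\hat{y} =
\begin{cases}
1 & \text{if } h_\theta(x) \ge 0.5 \\
0 & \text{otherwise}
\end{cases}
```

The 0.5 threshold is the conventional default choice, not something the exercise prescribes.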
Load the data from here, examine it using pandas methods, check your variable
types and convert categorical variables to numerical ones
Scale your dataset using MinMaxScaler from sklearn.preprocessing
Draw a scatter plot of the two exam scores and use color coding to show whether
each example is positive (promoted) or negative (not promoted)
Draw a scatter plot of the two exam scores and use color coding to show whether
each example is male or female
Implement the sigmoid function and test it (by plotting it) using generated
data
Write the cost function that takes (X, y, theta) as inputs
Extract your dataset (to be used in ML) and labels from the dataframe and
name them (X, y).
Initialize your parameter vector theta as a zero np.array of size 5 (=dimension
of your data + 1 (bias))
Compute the initial cost (with initial values of theta)
Copy the function below in your code. It computes the gradient during the
optimization process
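The steps above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the gradient function the exercise refers to. The real dataset is replaced by synthetic data, and the column names (exam1, exam2, age, sex, promoted) are hypothetical. The cost used is the standard mean cross-entropy, and plotting is left out to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for the real dataset (column names are assumptions).
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "exam1": rng.uniform(30, 100, n),
    "exam2": rng.uniform(30, 100, n),
    "age": rng.integers(22, 60, n),
    "sex": rng.choice(["F", "M"], n),
})
df["promoted"] = (df["exam1"] + df["exam2"] > 130).astype(int)

# Convert the categorical variable to a numerical one.
df["sex"] = df["sex"].map({"F": 0, "M": 1})

# Scale every feature to the [0, 1] range.
features = ["exam1", "exam2", "age", "sex"]
X_scaled = MinMaxScaler().fit_transform(df[features])

# Prepend a column of ones so that theta[0] acts as the bias.
X = np.hstack([np.ones((n, 1)), X_scaled])
y = df["promoted"].to_numpy()


def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))


def cost(X, y, theta):
    """Mean cross-entropy between labels y and predictions sigmoid(X @ theta)."""
    h = sigmoid(X @ theta)
    eps = 1e-12  # guards against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))


def gradient(X, y, theta):
    """Gradient of the cost above with respect to theta."""
    return X.T @ (sigmoid(X @ theta) - y) / len(y)


# 4 features + 1 bias = 5 parameters, all starting at zero.
theta = np.zeros(X.shape[1])
initial_cost = cost(X, y, theta)
print(f"initial cost: {initial_cost:.4f}")
```

With theta at zero every prediction is 0.5, so the initial cost equals log(2) regardless of the data, which makes a convenient sanity check for the cost function.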
Clustering is a Machine Learning technique that involves the grouping of data points.
Given a set of data points, we can use a clustering algorithm to classify each data
point into a specific group. In theory, data points that are in the same group should
have similar properties and/or features, while data points in different groups should
have highly dissimilar properties and/or features. Clustering is a method of
unsupervised learning and is a common technique for statistical data analysis used in
many fields.
Perform k-means clustering on the dataset using only two features (the scores of
the first and second exams) with the fit_predict method of kmeans
Try multiple values of k. For each value of k and for each cluster, plot the
histograms of Age, Sex and Salary. Inspect the cluster centers using
the cluster_centers_ attribute of the fitted kmeans object.
For each value of k, compute the silhouette score with
sklearn.metrics.silhouette_score to evaluate your clustering. Plot the silhouette
scores against k and choose the value of k that achieves the best clustering
(hint: it is the value that maximizes the silhouette score).
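The clustering steps above can be sketched as follows. The data here is a hypothetical stand-in for the two scaled exam scores (three synthetic blobs), and the range of k values, n_init and random_state are arbitrary choices made for the illustration. Histograms and the silhouette plot are omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical stand-in for the two exam scores: three synthetic blobs.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]])
scores_2d = np.vstack(
    [c + 0.05 * rng.standard_normal((50, 2)) for c in blob_centers]
)

silhouettes = {}
for k in range(2, 7):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(scores_2d)  # cluster index for each point
    silhouettes[k] = silhouette_score(scores_2d, labels)
    # kmeans.cluster_centers_ holds one (exam1, exam2) center per cluster.
    print(k, round(silhouettes[k], 3), kmeans.cluster_centers_.shape)

# The best k is the one with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print("best k:", best_k)
```

The silhouettes dictionary can be passed directly to a plotting library to draw the score against k. Note that the silhouette score is only defined for at least two clusters, which is why the loop starts at k = 2.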
COURSE
Lecture notes for classification algorithms
John Klein web page
Very good tutorials on Hugo Larochelle's YouTube page
For more details, you can check the first chapter of my PhD thesis