Machine Learning With Python
Machine learning can be described as a form of statistical analysis, often even utilizing
well-known and familiar techniques, that has a bit of a different focus than traditional
analytical practice in applied disciplines. The key notion is that flexible, automatic
approaches are used to detect patterns within the data, with a primary focus on making
predictions on future data. Python versions of the model examples are available here. In
addition, Marcio Mourao has provided additional Python examples.
As for prerequisite knowledge, I will assume a basic familiarity with regression analyses
typically presented in applied disciplines. Regarding programming, none is really
required to follow most of the content here. Note that I will not do much explaining of the
code, as I will be more concerned with getting to a result than clearly detailing the path
to it.
The book discusses many methods that have their bases in different fields: statistics,
pattern recognition, neural networks, artificial intelligence, signal processing, control,
and data mining. In the past, research in these different communities followed different
paths with different emphases. In this book, the aim is to incorporate them together to
give a unified treatment of the problems and the proposed solutions to them.
The book can be used for a one-semester course by sampling from the chapters. I very
much enjoyed writing this book; I hope you will enjoy reading it.
Note: External sources of text and images as a contribution for this book are
clearly mentioned inline along with the respective text and images.
ABOUT AUTHOR
Ajit Singh
Assistant Professor (Ad-hoc)
Department of Computer Application
Patna Women's College, Patna, Bihar.
20+ Years of strong teaching experience for Under Graduate and Post Graduate courses of Computer
Science across several colleges of Patna University and NIT Patna, Bihar, IND.
[Memberships]
1. InternetSociety (2168607) - Delhi/Trivandrum Chapters
2. IEEE (95539159)
3. International Association of Engineers (IAENG-233408)
4. Eurasia Research STRA-M19371
5. Member – IoT Council
6. ORCID https://orcid.org/0000-0002-6093-3457
7. Python Software Foundation
8. Data Science Foundation
9. Non Fiction Authors Association (NFAA-21979)
CONTENTS
1 Introduction to Machine Learning 7
Machine learning within data science
IT/computing science tools
Statistics and applied mathematics
Data analysis methodology
2 Python language 12
Set up your programming environment using Anaconda
Import libraries
Data types
Math
Comparisons and Boolean operations
Conditional statements
Lists
Tuples
Strings
Dictionaries
Sets
Functions
Loops
List comprehensions
Exceptions handling
Basic operating system interfaces (os)
Object Oriented Programming (OOP)
Exercises
5 Matplotlib: Plotting 33
Preamble about the F-distribution
Basic plots
Scatter (2D) plots
Saving Figures
Exploring data (with seaborn)
Density plot with one figure containing multiple axes
6 Univariate statistics 43
Estimators of the main statistical measures
Main distributions
Testing pairwise associations
Non-parametric test of pairwise associations
Linear model
Linear model with stats models
Multiple comparisons
Exercise
7 Multivariate statistics 65
Linear Algebra
Mean vector
Covariance matrix
Precision matrix
Mahalanobis distance
Multivariate normal distribution
Exercises
9 Clustering 84
K-means clustering
Hierarchical clustering
Gaussian mixture models
Model selection
CHAPTER
ONE
INTRODUCTION TO MACHINE LEARNING
The learning that is being done is always based on some sort of observations or data, such as
examples (the most common case in this book), direct experience, or instruction. So in general,
machine learning is about learning to do better in the future based on what was experienced in
the past.
The emphasis of machine learning is on automatic methods. In other words, the goal is to
devise learning algorithms that do the learning automatically without human intervention or
assistance. The machine learning paradigm can be viewed as "programming by example." Often
we have a specific task in mind, such as spam filtering. But rather than program the computer to
solve the task directly, in machine learning, we seek methods by which the computer will come
up with its own program based on examples that we provide.
In 1959, Arthur Samuel defined machine learning as a “Field of study that gives computers the
ability to learn without being explicitly programmed”. Tom M. Mitchell provided a widely quoted,
more formal definition: “A computer program is said to learn from experience E with respect to
some class of tasks T and performance measure P, if its performance at tasks in T, as
measured by P, improves with experience E”. This definition is notable for its defining machine
learning in fundamentally operational rather than cognitive terms, thus following Alan Turing's
proposal in his paper "Computing Machinery and Intelligence" that the question "Can machines
think?" be replaced with the question "Can machines do what we (as thinking entities) can do?"
Machine learning is a core subarea of artificial intelligence. It is very unlikely that we will be able
to build any kind of intelligent system capable of any of the facilities that we associate with
intelligence, such as language or vision, without using learning to get there. These tasks are
otherwise simply too difficult to solve. Further, we would not consider a system to be truly
intelligent if it were incapable of learning since learning is at the core of intelligence.
Although a subarea of AI, machine learning also intersects broadly with other fields, especially
statistics, but also mathematics, physics, theoretical computer science and more.
Topic spotting: categorize news articles (say) as to whether they are about politics, sports,
entertainment, etc.
Spoken language understanding: within the context of a limited domain, determine the meaning
of something uttered by a speaker to the extent that it can be classified into one of a fixed set of
categories.
Medical diagnosis: diagnose a patient.
Customer segmentation: predict, for instance, which customers will respond to a particular
promotion.
Fraud detection: identify credit card transactions (for instance) which may be fraudulent in
nature.
Weather prediction: predict, for instance, whether or not it will rain tomorrow.
Although much of what we will talk about will be about classification problems, there are other
important learning problems. In classification, we want to categorize objects into fixed
categories. In regression, on the other hand, we are trying to predict a real value. For instance,
we may wish to predict how much it will rain tomorrow. Or, we might want to predict how much
a house will sell for.
A richer learning scenario is one in which the goal is actually to behave intelligently, or to make
intelligent decisions. For instance, a robot needs to learn to navigate through its environment
without colliding with anything. To use machine learning to make money on the stock market,
we might treat investment as a classification problem (will the stock go up or down) or a
regression problem (how much will the stock go up), or, dispensing with these intermediate
goals, we might want the computer to learn directly how to decide to make investments so as to
maximize wealth.
Learning algorithms should also be as general purpose as possible. We are looking for
algorithms that can be easily applied to a broad class of learning problems, such as those listed
above.
Of course, we want the result of learning to be a prediction rule that is as accurate as possible
in the predictions that it makes.
Occasionally, we may also be interested in the interpretability of the prediction rules produced by
learning. In other words, in some contexts (such as medical diagnosis), we want the computer to find
prediction rules that are easily understandable by human experts.
Also, humans often have trouble expressing what they know, but have no difficulty labelling items.
For instance, it is easy for all of us to label images of letters by the character represented, but we
would have a great deal of trouble explaining how we do it in precise terms.
Another reason to study machine learning is the hope that it will provide insights into the general
phenomenon of learning. Some of the questions that might be answered include:
1. What are the intrinsic properties of a given learning problem that make it hard or easy to
solve?
2. How much do you need to know ahead of time about what is being learned in order to be
able to learn it effectively?
3. Why are "simpler" hypotheses better?
Machine learning tasks are typically classified into three broad categories, depending on the
nature of the learning “signal” or “feedback” available to a learning system.
These are:
• Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a “teacher”, and the goal is to learn a general rule that maps inputs to
outputs.
• Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to
find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden
patterns in data) or a means towards an end.
• Reinforcement learning: A computer program interacts with a dynamic environment in which
it must perform a certain goal (such as driving a vehicle), without a teacher explicitly telling it
whether it has come close to its goal or not. Another example is learning to play a game by
playing against an opponent.
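As an illustration of the first two categories, here is a minimal sketch (assuming scikit-learn and its bundled iris dataset, which are not otherwise used in this chapter):

# Sketch: supervised vs. unsupervised learning with scikit-learn (illustrative only)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: example inputs X and desired outputs y are given by a "teacher"
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))

# Unsupervised: only X is given; the algorithm looks for structure on its own
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])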
Machine learning within data science
1. Exploratory analysis: Unsupervised learning. Discover the structure within the data. E.g.:
Experience (in years in a company) and salary are correlated.
2. Predictive analysis: Supervised learning. This is sometimes described as to “learn from the
past to predict the future”. Scenario: a company wants to detect potential future clients
among a base of prospects. Retrospective data analysis: given the base of prospected
companies (with their characteristics: size, domain, localization, etc.), some became clients,
some did not. Is it possible to learn to predict those that are more likely to become clients
from their company characteristics? The training data consist of a set of n training samples.
Each sample xi is a vector of p input features (company characteristics) and a target
feature yi ∈ {Yes, No} (whether they became a client or not), as sketched below.
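A hedged sketch of this scenario (the prospect data are simulated and the logistic-regression model is an arbitrary illustrative choice):

# Sketch: n training samples, p company features, binary target (became a client or not)
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n, p = 200, 3                       # n prospects, p characteristics (size, domain, localization, ...)
X = rng.normal(size=(n, p))         # input features x_i (simulated)
y = (X[:, 0] + rng.normal(scale=.5, size=n) > 0).astype(int)   # 1 = became a client (simulated)

model = LogisticRegression().fit(X, y)            # learn from the retrospective data
new_prospects = rng.normal(size=(5, p))           # new companies to score
print(model.predict_proba(new_prospects)[:, 1])   # estimated probability of becoming a client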
Data analysis methodology
DIKW Pyramid: Data, Information, Knowledge, and Wisdom
Methodology
1. Discuss with your customer:
• Understand his needs.
• Formalize his needs into a learning problem.
• Define with your customer the learning dataset required for the project.
• Goto 1. Until convergence of both sides (you and the customer).
2. In a document, formalize (i) the project objectives; (ii) the required learning dataset, more
specifically the input data and the target variables; (iii) the conditions that define the
acquisition of the dataset. In this document, warn the customer that the learned algorithms
may not work on new data acquired under different conditions.
3. Read your learning dataset (level D of the pyramid) provided by the customer.
4. Clean your data (QC: Quality Control) (reach level I of the pyramid).
5. Explore data (visualization, PCA) and perform basic univariate statistics (reach level K of
the pyramid).
6. Perform more complex multivariate machine learning.
7. Model validation. First deliverable: the predictive model with its performance on the training
dataset.
8. Apply on new data (level W of the pyramid).
CHAPTER
TWO
PYTHON LANGUAGE
Set up your programming environment using Anaconda
Python 3.7:
bash Anaconda3-3.7.0-Linux-x86_64.sh
3. Add anaconda path in your PATH variable in your .bashrc file:
Python 2.7:
export PATH="${HOME}/anaconda2/bin:$PATH"
Python 3.7:
export PATH="${HOME}/anaconda3/bin:$PATH"
4. Optional: install additional packages:
Using conda:
conda install seaborn
Using pip:
pip install -U --user seaborn
Optional:
pip install -U --user nibabel pip install -U --
user nilearn
5. Python editor Spyder:
• Consoles/Open IPython console.
• Left panel: text editor
• Right panel: IPython console
• F9: run selection or current line (in recent versions of Spyder)
6. Python interpreter: python, or ipython, which is the same as python with many useful features.
Import libraries
# 'generic import' of math module
import math
math.sqrt(25)
# import a function from math
from math import sqrt
sqrt(25)  # no longer have to reference the module
# import multiple functions at once
from math import cos, floor
# import all functions in a module (generally discouraged)
from os import *
# define an alias
import numpy as np
# show all functions in math module
content = dir(math)
Data types
# determine the type of an object
type(2) # returns 'int'
type(2.0) # returns 'float'
type('two') # returns 'str'
type(True) # returns 'bool'
type(None) # returns 'NoneType'
# check if an object is of a given type
isinstance(2.0, int) # returns False
isinstance(2.0, (int, float)) # returns True
# convert an object to a given type
float(2)
int(2.9)
str(2.9)
# zero, None, and empty containers are converted to False
bool(0)
bool(None)
bool('')  # empty string
bool([])  # empty list
bool({})  # empty dictionary
# non-empty containers and non-zeros are converted to True
bool(2)
bool('two')
bool([2])
True
Math
# basic operations
10 + 4 # add (returns 14)
10 - 4 # subtract (returns 6)
10 * 4 # multiply (returns 40)
10 ** 4 # exponent (returns 10000)
10 / 4 # divide (returns 2 in Python 2 because both operands are 'int'; returns 2.5 in Python 3)
10 / float(4) # divide (returns 2.5)
5%4 # modulo (returns 1) - also known as the remainder
# force '/' in Python 2.x to perform 'true division' (unnecessary in Python 3.x)
from __future__ import division
10 / 4 # true division (returns 2.5)
10 // 4 # floor division (returns 2)
2
Conditional statements
x=3
# if statement
if x >0:
print('positive')
# if/else statement
if x > 0:
print('positive')
else:
print('zero or negative')
# if/elif/else statement
if x > 0:
print('positive')
elif x == 0:
print('zero')
else:
print('negative')
# single-line if statement (sometimes discouraged)
if x > 0: print('positive')
# single-line if/else statement (sometimes discouraged)
# known as a 'ternary operator'
'positive' if x > 0 else 'zero or negative'
positive
positive
positive
positive
'positive'
Lists
## properties: ordered, iterable, mutable, can contain multiple data types
# create an empty list (two ways)
empty_list = []
empty_list = list()
# create a list
simpsons = ['homer', 'marge', 'bart']
# examine a list
simpsons[0] # print element 0 ('homer')
len(simpsons) # returns the length (3)
# modify a list (does not return the list)
simpsons.append('lisa') # append element to end
simpsons.extend(['itchy', 'scratchy']) # append multiple elements to end
simpsons.insert(0, 'maggie') # insert element at index 0 (shifts everything right)
simpsons.remove('bart') # searches for first instance and removes it
simpsons.pop(0) # removes element 0 and returns it
del simpsons[0] # removes element 0 (does not return it)
simpsons[0] = 'krusty' # replace element 0
# concatenate lists (slower than 'extend' method)
neighbors = simpsons + ['ned','rod','todd']
# find elements in a list
simpsons.count('lisa') # counts the number of instances
simpsons.index('itchy') # returns index of first instance
# list slicing [start:end:stride]
weekdays = ['mon','tues','wed','thurs','fri']
weekdays[0] # element 0
weekdays[0:3] # elements 0, 1, 2
weekdays[:3] # elements 0, 1, 2
weekdays[3:] # elements 3, 4
weekdays[-1] # last element (element 4)
weekdays[::2] # every 2nd element (0, 2, 4)
weekdays[::-1] # backwards (4, 3, 2, 1, 0)
# alternative method for returning the list backwards
list(reversed(weekdays))
# sort a list in place (modifies but does not return the list)
simpsons.sort()
simpsons.sort(reverse=True) # sort in reverse
simpsons.sort(key=len) # sort by a key
# return a sorted list (but does not modify the original list)
sorted(simpsons)
sorted(simpsons, reverse=True)
sorted(simpsons, key=len)
# create a second reference to the same list
num = [1, 2, 3]
same_num = num
same_num[0] = 0 # modifies both 'num' and 'same_num'
# copy a list (three ways)
new_num = num.copy()
new_num = num[:]
new_num = list(num)
# examine objects
id(num) == id(same_num) # returns True
id(num) == id(new_num) # returns False
num is same_num # returns True
num is new_num # returns False
num == same_num # returns True
num == new_num # returns True (their contents are equivalent)
# concatenate +, replicate *
[1, 2, 3] + [4, 5, 6]
["a"] * 2 + ["b"] * 3
['a', 'a', 'b', 'b', 'b']
Tuples
Like lists, but they don’t change size. Properties: ordered, iterable, immutable, can contain
multiple data types.
# create a tuple
digits = (0, 1, 'two') # create a tuple directly
digits = tuple([0, 1, 'two'])  # create a tuple from a list
zero = (0,)  # trailing comma is required to indicate it's a tuple
# examine a tuple
digits[2] # returns 'two'
len(digits) # returns 3
digits.count(0) # counts the number of instances of that value (1)
digits.index(1) # returns the index of the first instance of that value (1)
# elements of a tuple cannot be modified
# digits[2] = 2  # throws an error
# concatenate tuples
digits = digits + (3, 4)
# create a single tuple with elements repeated (also works with lists)
(3, 4) * 2 # returns (3, 4, 3, 4)
# tuple unpacking
bart = ('male', 10, 'simpson') # create a tuple
Strings
Properties: iterable, immutable
from __future__ import print_function
# create a string
s = str(42)  # convert another data type into a string
s = 'I like you'
# examine a string
s[0] # returns 'I'
len(s) # returns 10
# string slicing like lists
s[:6] # returns 'I like'
s[7:] # returns 'you'
s[-1] # returns 'u' (the last character)
# basic string methods (does not modify the original string)
s.lower() # returns 'i like you'
s.upper() # returns 'I LIKE YOU'
s.startswith('I') # returns True
s.endswith('you') # returns True
s.isdigit() # returns False (returns True if every character in the string is a digit)
s.find('like') # returns index of first occurrence (2), but doesn't support regex
s.find('hate') # returns -1 since not found
s.replace('like','love') # replaces all instances of 'like' with 'love'
# split a string into a list of substrings separated by a delimiter
s.split(' ') # returns ['I','like','you']
s.split() # same thing
s2 = 'a, an, the'
s2.split(',') # returns ['a',' an',' the']
# join a list of strings into one string using a delimiter
stooges = ['larry','curly','moe']
' '.join(stooges) # returns 'larry curly moe'
# concatenate strings
s3 = 'The meaning of life is'
s4 = '42'
s3 + ' ' + s4 # returns 'The meaning of life is 42'
s3 + ' ' + str(42) # same thing
# remove whitespace from start and end of a string
s5 = ' ham and cheese '
s5.strip() # returns 'ham and cheese'
# string substitutions: all of these return 'raining cats and dogs'
'raining %s and %s' % ('cats','dogs') # old way
'raining {} and {}'.format('cats','dogs') # new way
'raining {arg1} and {arg2}'.format(arg1='cats',arg2='dogs') # named arguments
# string formatting
# more examples: http://mkaz.com/2012/10/10/python-string-format/
'pi is {:.2f}'.format(3.14159) # returns 'pi is 3.14'
# normal strings versus raw strings
print('first line\nsecond line')  # normal strings allow for escaped characters
print(r'first line\nfirst line')  # raw strings treat backslashes as literal characters
Dictionaries
Properties: unordered, iterable, mutable, can contain multiple data types. Made up of key-value
pairs: keys must be unique and can be strings, numbers, or tuples; values can be any type.
# create an empty dictionary (two ways)
empty_dict = {}
empty_dict = dict()
# create a dictionary (two ways)
family = {'dad':'homer', 'mom':'marge', 'size':6}
family = dict(dad='homer', mom='marge', size=6)
# convert a list of tuples into a dictionary
list_of_tuples = [('dad','homer'), ('mom','marge'), ('size', 6)]
family = dict(list_of_tuples)
# examine a dictionary
family['dad'] # returns 'homer'
len(family) # returns 3
family.keys() # returns list: ['dad', 'mom', 'size']
family.values() # returns list: ['homer', 'marge', 6]
family.items() # returns list of tuples:
#[('dad', 'homer'), ('mom', 'marge'), ('size', 6)]
'mom' in family # returns True
'marge' in family # returns False (only checks keys)
# modify a dictionary (does not return the dictionary)
family['cat'] = 'snowball' # add a new entry
family['cat'] = 'snowball ii' # edit an existing entry
del family['cat'] # delete an entry
family['kids'] = ['bart', 'lisa'] # value can be a list
family.pop('dad') # removes an entry and returns the value ('homer')
family.update({'baby':'maggie', 'grandpa':'abe'}) # add multiple entries
# accessing values more safely with 'get'
family['mom'] # returns 'marge'
family.get('mom') # same thing
try:
family['grandma'] # throws an error
except KeyError as e:
print("Error", e)
family.get('grandma') # returns None
family.get('grandma', 'not found') # returns 'not found' (the default)
# accessing a list element within a dictionary
family['kids'][0] # returns 'bart'
family['kids'].remove('lisa') # removes 'lisa'
# string substitution using a dictionary
'youngest child is %(baby)s' % family # returns 'youngest child is maggie'
Error 'grandma'
'youngest child is maggie'
Sets
Like dictionaries, but with keys only (no values). Properties: unordered, iterable, mutable, can
contain multiple data types; made up of unique elements (strings, numbers, or tuples).
# create an empty set
empty_set = set()
# create a set
languages = {'python', 'r', 'java'} # create a set directly
snakes = set(['cobra', 'viper', 'python']) # create a set from a list
# examine a set
len(languages) # returns 3
'python' in languages # returns True
# set operations
languages & snakes # returns intersection: {'python'}
languages | snakes # returns union: {'cobra', 'r', 'java', 'viper', 'python'}
languages - snakes # returns set difference: {'r', 'java'}
snakes - languages # returns set difference: {'cobra', 'viper'}
# modify a set (does not return the set)
languages.add('sql') # add a new element
languages.add('r') # try to add an existing element (ignored, no error)
languages.remove('java') # remove an element
try:
languages.remove('c') # try to remove a non-existing element (throws an error)
except KeyError as e:
print("Error", e)
languages.discard('c') # removes an element if present, but ignored otherwise
languages.pop() # removes and returns an arbitrary element
languages.clear() # removes all elements
languages.update(['go', 'spark']) # add multiple elements (pass a list or set)
# get a sorted list of unique elements from a list
sorted(set([9, 0, 2, 1, 0])) # returns [0, 1, 2, 9]
Error 'c'
[0, 1, 2, 9]
Functions
# define a function with no arguments and no return values
def print_text(): print('this is text')
# call the function
print_text()
# define a function with one argument and no return values
def print_this(x): print(x)
# call the function
print_this(3) #prints 3
n = print_this(3) # prints 3, but doesn't assign 3 to n
# because the function has no return statement
# define a function with one argument and one return value
def square_this(x):
return x ** 2
# include an optional docstring to describe the effect of a function
def square_this(x):
"""Return the square of a number."""
return x ** 2
# call the function
square_this(3) # prints 9
var = square_this(3) # assigns 9 to var, but does not print 9
# default arguments
def power_this(x, power=2):
return x ** power
power_this(2) #4
power_this(2, 3) # 8
# use 'pass' as a placeholder if you haven't written the function body
def stub():
pass
# return two values from a single function
def min_max(nums):
return min(nums), max(nums)
# return values can be assigned to a single variable as a tuple
nums = [1, 2, 3]
min_max_num = min_max(nums) # min_max_num = (1, 3)
# return values can be assigned into multiple variables using tuple unpacking
min_num, max_num = min_max(nums) # min_num = 1, max_num = 3
this is text
3
3
Loops
# range returns a sequence of integers (a list in Python 2, a lazy range object in Python 3)
range(0, 3) # returns [0, 1, 2]: includes first value but excludes second value
range(3) # same thing: starting at zero is the default
range(0, 5, 2) # returns [0, 2, 4]: third argument specifies the 'stride'
# for loop (not recommended)
fruits = ['apple', 'banana', 'cherry']
for i in range(len(fruits)):
print(fruits[i].upper())
# alternative for loop (recommended style)
for fruit in fruits:
print(fruit.upper())
# use range when iterating over a large sequence to avoid actually creating the integer list in memory
for i in range(10**6):
pass
# iterate through two things at once (using tuple unpacking)
family = {'dad':'homer', 'mom':'marge', 'size':6}
for key, value in family.items():
print(key, value)
# use enumerate if you need to access the index value within the loop
for index, fruit in enumerate(fruits):
print(index, fruit)
# for/else loop
for fruit in fruits:
if fruit == 'banana':
print("Found the banana!")
break # exit the loop and skip the 'else' block
else:
# this block executes ONLY if the for loop completes without hitting 'break'
print("Can't find the banana")
# while loop
count = 0
while count < 5:
print("This will print 5 times")
count += 1 # equivalent to 'count = count + 1'
APPLE
BANANA
CHERRY
APPLE
BANANA
CHERRY
mom marge
dad homer
size 6
0 apple
1 banana
2 cherry
Found the banana!
This will print 5 times
This will print 5 times
This will print 5 times
This will print 5 times
This will print 5 times
List comprehensions
# for loop to create a list of cubes
nums = [1, 2, 3, 4, 5]
cubes = []
for num in nums:
cubes.append(num**3)
# equivalent list comprehension
cubes = [num**3 for num in nums] # [1, 8, 27, 64, 125]
# for loop to create a list of cubes of even numbers
cubes_of_even = []
for num in nums:
if num % 2 == 0:
cubes_of_even.append(num**3)
# equivalent list comprehension
# syntax: [expression for variable in iterable if condition]
cubes_of_even = [num**3 for num in nums if num % 2 == 0] # [8, 64]
# for loop to cube even numbers and square odd numbers
cubes_and_squares = []
for num in nums:
if num % 2 == 0:
cubes_and_squares.append(num**3)
else:
cubes_and_squares.append(num**2)
# equivalent list comprehension (using a ternary expression)
# syntax: [true_condition if condition else false_condition for variable in iterable]
cubes_and_squares = [num**3 if num % 2 == 0 else num**2 for num in nums] # [1, 8, 9, 64, 25]
Exceptions handling
dct = dict(a=[1, 2], b=[4, 5])
key = 'c'
try:
    dct[key]
except:
    print("Key %s is missing. Add it with empty value" % key)
    dct['c'] = []
print(dct)
Key c is missing. Add it with empty value
{'c': [], 'b': [4, 5], 'a': [1, 2]}
Basic operating system interfaces (os)
import os
import tempfile
tmpdir = tempfile.gettempdir()
# list containing the names of the entries in the directory given by path.
os.listdir(tmpdir)
# Change the current working directory to path.
os.chdir(tmpdir)
# Get current working directory.
print('Working dir:', os.getcwd())
# Join paths
mytmpdir = os.path.join(tmpdir, "foobar")
# Create the directory if it does not exist
if not os.path.exists(mytmpdir):
    os.mkdir(mytmpdir)
filename = os.path.join(mytmpdir, "myfile.txt")
print(filename)
# Write
lines = ["Dans python tout est bon", "Enfin, presque"]
## write line by line
fd = open(filename, "w")
fd.write(lines[0] + "\n")
fd.write(lines[1] + "\n")
fd.close()
## use a context manager to automatically close your file
with open(filename, 'w') as f:
for line in lines:
f.write(line + '\n')
# Read
## read one line at a time (entire file does not have to fit into memory)
f = open(filename, "r")
f.readline() # one string per line (including newlines)
f.readline() # next line
f.close()
## read the whole file at once, return a list of lines
f = open(filename, 'r')
f.readlines() # one list, each line is one string
f.close()
Object Oriented Programming (OOP)
import math
class Shape2D:
def area(self):
raise NotImplementedError()
# __init__ is a special method called the constructor
# Inheritance + Encapsulation
class Square(Shape2D):
def __init__(self, width):
self.width = width
def area(self): return self.width **2
class Disk(Shape2D):
def __init__(self, radius):
self.radius = radius
def area(self):
return math.pi * self.radius ** 2
shapes = [Square(2), Disk(3)]
# Polymorphism
print([s.area() for s in shapes])

s = Shape2D()
try:
    s.area()
except NotImplementedError as e:
    print("NotImplementedError")
[4, 28.274333882308138]
NotImplementedError
Exercises
Exercise 1: functions
Create a function that acts as a simple calculator. If the operation is not specified, default to
addition. If the operation is misspecified, return an error message. Ex: calc(4, 5, "multiply")
returns 20; calc(3, 5) returns 8; calc(1, 2, "something") returns an error message.
Exercise 3: File I/O
Copy/paste the BSD 4-clause license into a text file. Read the file (assuming this file could be
huge) and count the occurrences of each word within the file. Words are separated by
whitespace or new line characters.
Exercise 4: OOP
1. Create a class Employee with 2 attributes provided in the constructor: name and
years_of_service, and one method salary, computed as 1500 + 100 * years_of_service.
2. Create a subclass Manager which redefines the salary method as 2500 + 120 * years_of_service.
3. Create a small dictionary-based database where the key is the employee’s name.
Populate the database with: samples = Employee(‘lucy’, 3), Employee(‘john’, 1),
Manager(‘julie’, 10), Manager(‘paul’, 3)
4. Return a table made of name, salary rows, i.e. a list of lists [[name, salary], ...]
5. Compute the average salary
CHAPTER
THREE
NUMPY: ARRAYS AND MATRICES
Create arrays
import numpy as np

# create ndarrays from lists
# note: every element must be the same type (will be converted if possible)
data1 = [1, 2, 3, 4, 5] # list
arr1 = np.array(data1) # 1d array
data2 = [range(1, 5), range(5, 9)] # list of lists
arr2 = np.array(data2) # 2d array
arr2.tolist() # convert array back to list
# examining arrays
arr1.dtype # int64 (int32 on Windows), since the list contains integers
arr2.dtype # int32
arr2.ndim #2
arr2.shape # (2, 4) - axis 0 is rows, axis 1 is columns
arr2.size # 8 - total number of elements
len(arr2) # 2 - size of first dimension (aka axis)
# create special arrays
np.zeros(10)
np.zeros((3, 6))
np.ones(10)
np.linspace(0, 1, 5) # 0 to 1 (inclusive) with 5 points
np.logspace(0, 3, 4) # 10^0 to 10^3 (inclusive) with 4 points
# arange is like range, except it returns an array (not a list)
int_array = np.arange(5)
float_array = int_array.astype(float)
Reshaping
matrix = np.arange(10, dtype=float).reshape((2, 5))
print(matrix.shape)
print(matrix.reshape(5, 2))
# Add an axis
a = np.array([0, 1])
a_col = a[:, np.newaxis]
# array([[0],
# [1]])
# Transpose
a_col.T
#array([[0, 1]])
Stack arrays
Stack flat arrays in columns
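The code for this subsection is missing from the original; below is a minimal sketch of the usual NumPy stacking functions, assuming two small flat arrays a and b:

import numpy as np

a = np.array([0, 1])
b = np.array([2, 3])

np.stack((a, b), axis=1)   # stack as columns: array([[0, 2], [1, 3]])
np.column_stack((a, b))    # same result
np.vstack((a, b))          # stack as rows: array([[0, 1], [2, 3]])
np.concatenate((a, b))     # flat concatenation: array([0, 1, 2, 3])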
Selection
Single item
arr1[0] # 0th element (slices like a list)
arr2[0, 3] # row 0, column 3: returns 4
arr2[0][3] # alternative syntax
Slicing
arr2[0, :] # row 0: returns 1d array ([1, 2, 3, 4])
arr2[:, 0] # column 0: returns 1d array ([1, 5])
arr2[:, :2] # columns strictly before index 2 (first 2 columns)
arr2[:, 2:] # columns from index 2 (included) onwards
arr2[:, 1:4] # columns between index 1 (included) and 4 (excluded)
# boolean (logical) indexing
arr = arr2.copy()  # work on a copy of arr2 (arr is otherwise undefined here)
arr[arr > 5] = 0   # set all elements greater than 5 to 0
print(arr)
names = np.array(['Bob', 'Joe', 'Will', 'Bob'])
names == 'Bob' # returns a boolean array
names[names != 'Bob'] # logical selection
(names == 'Bob') | (names == 'Will') # keywords "and/or" don't work with boolean arrays
names[names != 'Bob'] = 'Joe' # assign based on a logical selection
np.unique(names) # set function
Vectorized operations
nums = np.arange(5)
nums * 10 # multiply each element by 10
nums = np.sqrt(nums) # square root of each element
np.ceil(nums) # also floor, rint (round to nearest int)
np.isnan(nums) # checks for NaN
nums + np.arange(5) # add element-wise
np.maximum(nums, np.array([1, -2, 3, -4, 5])) # compare element-wise (the 3 is an assumed value missing in the original)
# Compute Euclidean distance between 2 vectors
vec1 = np.random.randn(10)
vec2 = np.random.randn(10)
dist = np.sqrt(np.sum((vec1 - vec2) ** 2))
# math and stats
rnd = np.random.randn(4, 2) # random normals in 4x2 array
rnd.mean()
rnd.std()
rnd.argmin() # index of minimum element
rnd.sum()
rnd.sum(axis=0) # sum of columns
rnd.sum(axis=1) # sum of rows
# methods for boolean arrays
(rnd > 0).sum() # counts number of positive values
(rnd > 0).any() # checks if any value is True
(rnd > 0).all() # checks if all values are True
# reshape, transpose, flatten
nums = np.arange(32).reshape(8, 4) # creates 8x4 array
nums.T # transpose
nums.flatten() # flatten
# random numbers
np.random.seed(12234) # Set the seed
np.random.rand(2, 3) # 2 x 3 matrix in [0, 1]
np.random.randn(10) # random normals (mean 0, sd 1)
np.random.randint(0, 2, 10) # 10 randomly picked 0 or 1
Exercises
Given the array:
X = np.random.randn(4, 2) # random normals in 4x2 array
• For each column find the row index of the minimum value.
• Write a function standardize(X) that returns an array whose columns are centered and
scaled (by their standard deviation).
CHAPTER
FOUR
PANDAS: DATA MANIPULATION
Create DataFrame
import pandas as pd

columns = ['name', 'age', 'gender', 'job']
user1 = pd.DataFrame([['alice', 19, "F", "student"],['john', 26, "M", "student"]],columns=columns)
user2 = pd.DataFrame([['eric', 22, "M", "student"],['paul', 58, "F", "manager"]],columns=columns)
user3 = pd.DataFrame(dict(name=['peter', 'julie'], age=[33, 44],
gender=['M', 'F'], job=['engineer', 'scientist']))
Concatenate DataFrame
user1.append(user2)
users = pd.concat([user1, user2, user3])
print(users)
# age gender job name
#0 19 F student alice
#1 26 M student john
#0 22 M student eric
#1 58 F manager paul
#0 33 M engineer peter
#1 44 F scientist julie
Join DataFrame
user4 = pd.DataFrame(dict(name=['alice', 'john', 'eric', 'julie'], height=[165, 180, 175, 171]))
print(user4)
# height name
#0 165 alice
#1 180 john
#2 175 eric
#3 171 julie
# Use intersection of keys from both frames
merge_inter = pd.merge(users, user4, on="name")
print(merge_inter)
# age gender job name height
#0 19 F student alice 165
#1 26 M student john 180
#2 22 M student eric 175
#3 44 F scientist julie 171
# Use union of keys from both frames
users = pd.merge(users, user4, on="name", how='outer')
print(users)
# age gender job name height
#0 19 F student alice 165
#1 26 M student john 180
#2 22 M student eric 175
#3 58 F manager paul NaN
#4 33 M engineer peter NaN
#5 44 F scientist julie 171
Summarizing
Columns selection
users['gender'] # select one column
type(users['gender']) # Series
users.gender # select one column using the DataFrame attribute
# select multiple columns
users[['age', 'gender']] # select two columns
Rows selection
# iloc is strictly integer position based
df = users.copy()
df.iloc[0] # first row
df.iloc[0, 0] # first item of first row
df.iloc[0, 0] = 55
for i in range(users.shape[0]):
row = df.iloc[i]
row.age *= 100 # setting a copy, not the original frame data
print(df) # df is not modified
# ix supports mixed integer and label based access.
df = users.copy()
df.ix[0] # first row
df.ix[0, "age"] # first item of first row
df.ix[0, "age"] = 55
for i in range(df.shape[0]):
df.ix[i, "age"] *= 10
print(df) # df is modified
Rows selection / filtering
# simple logical filtering
users[users.age < 20]  # only show users with age < 20
young_bool = users.age < 20  # or, create a Series of booleans...
users[young_bool]  # ...and use that Series to filter rows
users[users.age < 20].job  # select one column from the filtered results
# advanced logical filtering
users[users.age < 20][['age', 'job']]  # select multiple columns
users[(users.age > 20) & (users.gender == 'M')]  # use multiple conditions
users[users.job.isin(['student', 'engineer'])]  # filter specific values
Sorting
df = users.copy()
df.age.sort_values() # only works for a Series
df.sort_values(by='age') # sort rows by a specific column
df.sort_values(by='age', ascending=False) # use descending order instead
df.sort_values(by=['job', 'age']) # sort by multiple columns
df.sort_values(by=['job', 'age'], inplace=True) # modify df
Reshaping by pivoting
# “Unpivots” a DataFrame from wide format to long (stacked) format,
stacked = pd.melt(users, id_vars="name", var_name="variable", value_name="value")
print(stacked)
# name variable value
#0 alice age 19
#1 john age 26
#2 eric age 22
#3 paul age 58
#4 peter age 33
#5 julie age 44
#6 alice gender F
# ...
#11 julie gender F
#12 alice job student
# ...
#17 julie job scientist
#18 alice height 165
# ...
#23 julie height 171
# “pivots” a DataFrame from long (stacked) format to wide format,
print(stacked.pivot(index='name', columns='variable', values='value'))
#variable age gender height job
#name
#alice 19 F 165 student
#eric 22 M 175 student
#john 26 M 180 student
#julie 44 F 171 scientist
#paul 58 F NaN manager
#peter 33 M NaN engineer
Quality control: duplicate data
df = users.append(df.iloc[0], ignore_index=True)
print(df.duplicated()) # Series of booleans
# (True if a row is identical to a previous row)
df.duplicated().sum() # count of duplicates
df[df.duplicated()] # only show duplicates
df.age.duplicated() # check a single column for duplicates
df.duplicated(['age', 'gender']).sum() # specify columns for finding duplicates
df = df.drop_duplicates() # drop duplicate rows
Rename values
df = users.copy()
print(df.columns)
df.columns = ['age', 'genre', 'travail', 'nom', 'taille']
df.travail = df.travail.map({ 'student':'etudiant', 'manager':'manager',
'engineer':'ingenieur', 'scientist':'scientific'})
assert df.travail.isnull().sum() == 0
Groupby
for grp, data in users.groupby("job"):
print(grp, data)
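Groupby is usually combined with aggregation functions; a short sketch reusing the users frame defined above:

users.groupby("job")["age"].mean()                          # mean age per job
users.groupby("job").size()                                 # number of users per job
users.groupby("job").agg({'age': ['mean', 'min', 'max']})   # several statistics at once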
File I/O
csv
import tempfile, os.path
tmpdir = tempfile.gettempdir()
csv_filename = os.path.join(tmpdir, "users.csv")
users.to_csv(csv_filename,index=False)
other = pd.read_csv(csv_filename)
Read csv from url
url = 'https://raw.github.com/neurospin/pystatsml/master/data/salary_table.csv'
salary = pd.read_csv(url)
Excel
xls_filename = os.path.join(tmpdir, "users.xlsx")
users.to_excel(xls_filename, sheet_name='users', index=False)
pd.read_excel(xls_filename, sheetname='users')
# Multiple sheets
with pd.ExcelWriter(xls_filename) as writer:
users.to_excel(writer, sheet_name='users', index=False)
df.to_excel(writer, sheet_name='salary', index=False)
pd.read_excel(xls_filename, sheetname='users')
pd.read_excel(xls_filename, sheetname='salary')
Exercises
Data Frame
1. Read the iris dataset at ‘https://raw.github.com/neurospin/pystatsml/master/data/iris.csv‘
2. Print column names
3. Get numerical columns
4. For each species compute the mean of numerical columns and store it in a stats table like:
species sepal_length sepal_width petal_length petal_width
0 Setosa 5.006 3.428 1.462 0.246
1 Versicolor 5.936 2.770 4.260 1.326
2 Virginica 6.588 2.974 5.552 2.026
Missing data
Add some missing data to the previous table users:
df = users.copy()
df.ix[[0, 2], "age"] = None
df.ix[[1, 3], "gender"] = None
1. Write a function fillmissing_with_mean(df) that fills all missing values of numerical columns
with the mean of the current column.
2. Save the original users and “imputed” frame in a single excel file “users.xlsx” with 2 sheets:
original, imputed.
CHAPTER
FIVE
MATPLOTLIB: PLOTTING
Sources - Nicolas P. Rougier: http://www.labri.fr/perso/nrougier/teaching/matplotlib -
https://www.kaggle.com/benhamner/d/uciml/iris/python-data-visualizations
Basic plots

import numpy as np
import matplotlib.pyplot as plt

# The data plotted below are assumed (not shown in the original): a sine and a cosine curve
x = np.linspace(0, 10, 50)
sinus = np.sin(x)
cosinus = np.cos(x)

# Step by step
plt.plot(x, sinus, label='sinus', color='blue', linestyle='--', linewidth=2)
plt.plot(x, cosinus, label='cosinus', color='red', linestyle='-', linewidth=2)
plt.legend()
plt.show()
Scatter (2D) plots
Load dataset
import pandas as pd
try:
salary = pd.read_csv("../data/salary_table.csv")
except:
url = 'https://raw.github.com/duchesnay/pylearn-doc/master/data/salary_table.csv'
salary = pd.read_csv(url)
df = salary
Simple scatter with colors
colors = colors_edu = {'Bachelor':'r', 'Master':'g', 'Ph.D':'blue'}
plt.scatter(df['experience'], df['salary'], c=df['education'].apply(lambda x: colors[x]), s=100)
<matplotlib.collections.PathCollection at 0x7fc113f387f0>
Scatter plot with colors and symbols
## Figure size
plt.figure(figsize=(6,5))
## Define colors / symbols manually
symbols_manag = dict(Y='*', N='.')
colors_edu = {'Bachelor':'r', 'Master':'g', 'Ph.D':'blue'}
## group by education x management => 6 groups
for values, d in salary.groupby(['education','management']):
edu, manager = values
plt.scatter(d['experience'], d['salary'], marker=symbols_manag[manager],
            color=colors_edu[edu], s=150, label=manager + "/" + edu)
## Set labels
plt.xlabel('Experience')
plt.ylabel('Salary')
plt.legend(loc=4) # lower right
plt.show()
Saving Figures
### bitmap format
plt.plot(x,sinus)
plt.savefig("sinus.png")
plt.close()
# Prefer vectorial format (SVG: Scalable Vector Graphics) can be edited with
# Inkscape, Adobe Illustrator, Blender, etc.
plt.plot(x, sinus)
plt.savefig("sinus.svg")
plt.close()
# Or pdf
plt.plot(x, sinus)
plt.savefig("sinus.pdf")
plt.close()
Exploring data (with seaborn)
import seaborn as sns
g = sns.PairGrid(salary, hue="management")
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
g.add_legend()
<seaborn.axisgrid.PairGrid at 0x7f3fda195da0>
CHAPTER
SIX
UNIVARIATE STATISTICS
Basic univariate statistics are required to explore a dataset:
• Discover associations between a variable of interest and potential predictors. It is strongly
recommended to start with simple univariate methods before moving to complex
multivariate predictors.
• Assess the prediction performances of machine learning predictors.
• Most of the univariate statistics are based on the linear model, which is one of the main
models in machine learning.
Variance
Var(X) = E((X − E(X))²) = E(X²) − (E(X))²

The estimator is σ̂x² = (1/(n − 1)) Σᵢ (xᵢ − x̄)².
Note here the subtracted 1 degree of freedom (df) in the divisor. In standard statistical practice,
df = 1 provides an unbiased estimator of the variance of a hypothetical infinite population. With df
= 0 it instead provides a maximum likelihood estimate of the variance for normally distributed
variables.
Standard deviation
Std(X) = √Var(X)

The estimator is simply σ̂x = √σ̂x².
Covariance
Cov(X,Y ) = E((X − E(X))(Y − E(Y ))) = E(XY ) − E(X)E(Y ).
Properties:
Cov(X,X) = Var(X)
Cov(X,Y ) = Cov(Y,X)
Cov(cX,Y ) = cCov(X,Y )
Cov(X + c,Y ) = Cov(X,Y )
The estimator with df = 1 is

Ĉov(X, Y) = (1/(n − 1)) Σᵢ (xᵢ − x̄)(yᵢ − ȳ).

Correlation

Cor(X, Y) = Cov(X, Y) / (Std(X) Std(Y))

The estimator is

ρ̂ = Ĉov(X, Y) / (σ̂x σ̂y).
Standard Error (SE)
The standard error (SE) is the standard deviation (of the sampling distribution) of a statistic. For the mean, SE(X̄) = Std(X)/√n.
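A minimal NumPy sketch of these estimators on simulated data (the samples, the seed and the ddof=1 choice are illustrative):

import numpy as np

np.random.seed(42)
x = np.random.normal(size=100)
y = 0.5 * x + np.random.normal(size=100)

xbar = np.mean(x)                    # mean
var_x = np.var(x, ddof=1)            # unbiased variance (df = 1)
std_x = np.std(x, ddof=1)            # standard deviation
cov_xy = np.cov(x, y, ddof=1)[0, 1]  # covariance
cor_xy = np.corrcoef(x, y)[0, 1]     # correlation
se_xbar = std_x / np.sqrt(len(x))    # standard error of the mean
print(xbar, var_x, std_x, cov_xy, cor_xy, se_xbar)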
Exercises
• Generate 2 random samples: x ∼ N(1.78,0.1) and y ∼ N(1.66,0.1), both of size 10.
Main distributions
Normal distribution
The normal distribution is useful because of the central limit theorem (CLT) which states that,
given certain conditions, the arithmetic mean of a sufficiently large number of iterates of
independent random variables, each with a well-defined expected value and well-defined
variance, will be approximately normally distributed, regardless of the underlying distribution.
Parameters: µ mean (location) and σ² > 0 variance. Estimators: x̄ and σ̂x.
The F-distribution
The F-distribution plays a central role in hypothesis testing, answering the question: are two
variances equal?
import numpy as np
from scipy.stats import f
import matplotlib.pyplot as plt
%matplotlib inline
fvalues = np.linspace(.1, 5, 100)
# pdf(x, df1, df2): Probability density function at x of F.
plt.plot(fvalues, f.pdf(fvalues, 1, 30), 'b-', label="F(1, 30)")
plt.plot(fvalues, f.pdf(fvalues, 5, 30), 'r-', label="F(5, 30)")
plt.legend()
# cdf(x, df1, df2): Cumulative distribution function of F.
proba_at_f_inf_3 = f.cdf(3, 1, 30) # P(F(1,30) < 3)
# ppf(q, df1, df2): Percent point function (inverse of cdf) at q of F.
f_at_proba_inf_95 = f.ppf(.95, 1, 30) # q such that P(F(1,30) < q) = .95
assert f.cdf(f_at_proba_inf_95, 1, 30) == .95
# sf(x, df1, df2): Survival function (1 - cdf) at x of F.
proba_at_f_sup_3 = f.sf(3, 1, 30) # P(F(1,30) > 3)
assert proba_at_f_inf_3 + proba_at_f_sup_3 == 1
# p-value: P(F(1, 30)) < 0.05
low_proba_fvalues = fvalues[fvalues > f_at_proba_inf_95]
plt.fill_between(low_proba_fvalues, 0, f.pdf(low_proba_fvalues, 1, 30), alpha=.8, label="P < 0.05")
plt.show()
The Student's t-distribution
The distribution of the difference between an estimated parameter and its true (or assumed)
value divided by the standard deviation of the estimated parameter (standard error) follows a
t-distribution. Is this parameter different from a given value?
• An ordinal variable is similar to a categorical variable. The difference between the two is
that there is a clear ordering of the variables. For example, suppose you have a variable,
economic status, with three categories (low, medium and high). In addition to being able to
classify people into these three categories, you can order the categories as low, medium
and high.
• A continuous or quantitative variable x ∈ R is one that can take any value in a range of
possible values, possibly infinite. E.g.: Salary, Experience in years.
Although the parent population does not need to be normally distributed, the distribution of the
population of sample means, x, is assumed to be normal. By the central limit theorem, if the
sampling of the parent population is independent then the sample means will be approximately
normal.
Exercise
• Given the following samples, test whether its true mean is 1.75.
Warning, when computing the std or the variance, set ddof=1. The default value, ddof=0, leads
to the biased estimator of the variance.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(seed=42) # make example reproducible
n = 100
x = np.random.normal(loc=1.78, scale=.1, size=n)
• Compute the t-value (tval)
• Plot the T(n-1) distribution for 100 t-values within [-10, 10]. Draw P(T(n-1) > tval), i.e.
color the surface defined by the x values larger than tval below the T(n-1). Use the code below.
# compute with scipy
tval, pval = stats.ttest_1samp(x, 1.75)
#tval = 2.1598800019529265 # assume the t-value
tvalues = np.linspace(-10, 10, 100)
plt.plot(tvalues, stats.t.pdf(tvalues, n-1), 'b-', label="T(n-1)")
upper_tval_tvalues = tvalues[tvalues > tval]
plt.fill_between(upper_tval_tvalues, 0, stats.t.pdf(upper_tval_tvalues, n-1), alpha=.8, label="p-value")
plt.legend()
Equal or unequal sample sizes, equal variances (pooled t-test)
The two-sample t statistic is

t = (x̄ − ȳ) / (s_p √(1/n_x + 1/n_y)),

where

s_p = √( ((n_x − 1) s_x² + (n_y − 1) s_y²) / (n_x + n_y − 2) )

is an estimator of the common standard deviation of the two samples: it is defined in this way so
that its square is an unbiased estimator of the common variance whether or not the population
means are the same.

Equal or unequal sample sizes, unequal variances (Welch’s t-test)
Welch’s t-test defines the t statistic as

t = (x̄ − ȳ) / √(s_x²/n_x + s_y²/n_y).

To compute the p-value one needs the degrees of freedom associated with this variance
estimate. It is approximated using the Welch–Satterthwaite equation:

ν ≈ (s_x²/n_x + s_y²/n_y)² / ( (s_x²/n_x)²/(n_x − 1) + (s_y²/n_y)²/(n_y − 1) ).
Exercise
Given the following two samples, test whether their means are equal using the standard t-test,
assuming equal variance.
import scipy.stats as stats
nx, ny = 50, 25
x = np.random.normal(loc=1.76, scale=0.1, size=nx)
y = np.random.normal(loc=1.70, scale=0.12, size=ny)
# Compute with scipy
tval, pval = stats.ttest_ind(x, y, equal_var=True)
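For comparison, a short sketch of Welch's t-test (equal_var=False) next to the pooled-variance test; the samples are regenerated here and purely illustrative:

import numpy as np
import scipy.stats as stats

np.random.seed(seed=42)
x = np.random.normal(loc=1.76, scale=0.1, size=50)
y = np.random.normal(loc=1.70, scale=0.12, size=25)

tval_student, pval_student = stats.ttest_ind(x, y, equal_var=True)   # pooled variance
tval_welch, pval_welch = stats.ttest_ind(x, y, equal_var=False)      # Welch-Satterthwaite df
print("Student: t = %.3f, p = %.4f" % (tval_student, pval_student))
print("Welch  : t = %.3f, p = %.4f" % (tval_welch, pval_welch))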
3. F-test
Source: https://en.wikipedia.org/wiki/F-test
The ANOVA F-test can be used to assess whether any of the strategies is on average superior,
or inferior, to the others versus the null hypothesis that all four strategies yield the same mean
response (increase of business volume). This is an example of an “omnibus” test, meaning that
a single test is performed to detect any of several possible differences. Alternatively, we could
carry out pair-wise tests among the strategies. The advantage of the ANOVA F-test is that we
do not need to pre-specify which strategies are to be compared, and we do not need to adjust
for making multiple comparisons. The disadvantage of the ANOVA F-test is that if we reject the
null hypothesis, we do not know which strategies can be said to be significantly different from
the others. The formula for the one-way ANOVA F-test statistic is

F = explained variance / unexplained variance,

or

F = between-group variability / within-group variability.
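As a sketch (the three groups below are simulated, not taken from the text), the one-way ANOVA F-test can be computed with scipy.stats.f_oneway:

import numpy as np
import scipy.stats as stats

np.random.seed(seed=42)
g1 = np.random.normal(loc=0.0, scale=1.0, size=30)
g2 = np.random.normal(loc=0.0, scale=1.0, size=30)
g3 = np.random.normal(loc=0.8, scale=1.0, size=30)   # one group with a shifted mean

fval, pval = stats.f_oneway(g1, g2, g3)
print("F = %.3f, p-value = %.4f" % (fval, pval))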
Chi-square test (categorical ~ categorical)
import numpy as np
import pandas as pd
import scipy.stats as stats
# Dataset:
# 15 samples:
# 10 first with canalar tumor, 5 last without
canalar_tumor = np.array([1] * 10 + [0] * 5)
# 8 first with metastasis, 6 without, the last with.
meta = np.array([1] * 8 + [0] * 6 + [1])
crosstab = pd.crosstab(canalar_tumor, meta, rownames=['canalar_tumor'], colnames=['meta'])
print("Observed table:")
print("---------------")
print(crosstab)
chi2, pval, dof, expected = stats.chi2_contingency(crosstab)
print("Statistics:")
print("-----------")
print("Chi2 = %f, pval = %f" % (chi2, pval))
print("Expected table:")
print("---------------")
print(expected)
Observed table:
---------------
meta            0  1
canalar_tumor
0               4  1
1               2  8
Statistics:
-----------
Chi2 = 2.812500, pval = 0.093533
Expected table:
---------------
[[ 2. 3.]
[ 4. 6.]]
Exercise
Write a function univar_stat(df,target,variables) that computes the parametric statistics and p-
values between the target variable (provided as string) and all variables (provided as a list of
string) of the pandas DataFrame df. The target is a quantitative variable but variables may be
quantitative or qualitative. The function returns a DataFrame with four columns: variable, test,
value, p_value.
Apply it to the salary dataset available at
https://raw.github.com/neurospin/pystatsml/master/data/salary_table.csv, with target being S: salaries
for IT staff in a corporation.
Non-parametric test of pairwise associations: Spearman rank-order correlation
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
np.random.seed(seed=42) # make example reproducible
n = 50
noutliers = 10
x = np.random.normal(size=n)
y = 2 * x + np.random.normal(size=n)
y[:noutliers] = np.random.normal(loc=-10, size=noutliers) # add 10 outliers
outlier = np.array(["N"] * n)
outlier[:noutliers] = "Y"
# Compute with scipy
cor, pval = stats.spearmanr(x, y)
print("Non-Parametric Spearman cor test, cor: %.4f, pval: %.4f" % (cor, pval))
# Plot distribution + pairwise scatter plot
df = pd.DataFrame(dict(x=x, y=y, outlier=outlier))
g = sns.PairGrid(df, hue="outlier")
g.map_diag(plt.hist)
g.map_offdiag(plt.scatter)
g = g.add_legend()
# Compute the parametric Pearson cor test
cor, pval = stats.pearsonr(x, y)
print("Parametric Pearson cor test: cor: %.4f, pval: %.4f" % (cor, pval))
Non-Parametric Spearman cor test, cor: 0.2996, pval: 0.0345
Wilcoxon signed-rank test (quantitative ~ cte)
Source: https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test
The Wilcoxon signed-rank test is a non-parametric statistical hypothesis test used when
comparing two related samples, matched samples, or repeated measurements on a single
sample to assess whether their population mean ranks differ (i.e. it is a paired difference test). It
is equivalent to a one-sample test of the difference of paired samples.
It can be used as an alternative to the paired Student’s t-test, t-test for matched pairs, or the t-
test for dependent samples when the population cannot be assumed to be normally distributed.
When to use it? Observe the data distribution: - presence of outliers - the distribution of the
residuals is not Gaussian
It has a lower sensitivity compared to t-test. May be problematic to use when the sample size is
small.
Null hypothesis H0: difference between the pairs follows a symmetric distribution around zero.
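As an illustration (not part of the original text), a sketch with scipy.stats.wilcoxon on simulated paired measurements:

import numpy as np
import scipy.stats as stats

np.random.seed(seed=42)
before = np.random.normal(loc=3.0, size=30)
after = before + np.random.normal(loc=0.2, size=30)   # paired sample with a small shift

stat, pval = stats.wilcoxon(before, after)
print("Wilcoxon statistic = %.1f, p-value = %.4f" % (stat, pval))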
Linear model
For the regression case, the statistical model is as follows. Given a (random) sample
(yᵢ, x¹ᵢ, ..., xᵖᵢ), i = 1, ..., n, the relation between the observations yᵢ and the independent
variables xʲᵢ is formulated as

yᵢ = β₀ + β₁x¹ᵢ + ... + βₚxᵖᵢ + εᵢ,    i = 1, ..., n.
• An independent variable (IV) is exactly what it sounds like. It is a variable that stands alone
and is not changed by the other variables you are trying to measure. For example,
someone’s age might be an independent variable. Other factors (such as what they eat,
how much they go to school, how much television they watch) aren’t going to change a
person’s age. In fact, when you are looking for some kind of relationship between variables
you are trying to see if the independent variable causes some kind of change in the other
variables, or dependent variables. In Machine Learning, these variables are also called the
predictors.
• A dependent variable is exactly what it sounds like. It is something that depends on other
factors. For example, a test score could be a dependent variable because it could change
depending on several factors such as how much you studied, how much sleep you got the
night before you took the test, or even how hungry you were when you took it. Usually
when you are looking for a relationship between two things you are trying to find out what
makes the dependent variable change the way it does. In Machine Learning this variable is
called a target variable.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
url= 'https://raw.github.com/neurospin/pystatsml/master/data/salary_table.csv'
salary = pd.read_csv(url)
The least-squares estimates are β = Cov(x, y)/Var(x) and, plugging in β, the intercept β₀ = ȳ − β x̄.
from scipy import stats
import numpy as np
y, x = salary.salary, salary.experience
beta, beta0, r_value, p_value, std_err = stats.linregress(x, y)
print("y = %f x + %f, r: %f, r-squared: %f,\np-value: %f, std_err: %f" % (beta, beta0, r_value, r_value**2, p_value, std_err))
# plotting the line
yhat = beta * x + beta0 # regression line
plt.plot(x, yhat, 'r-', x, y,'o')
plt.xlabel('Experience (years)')
plt.ylabel('Salary')
plt.show()
3. F-Test
3.1 Goodness of fit
The goodness of fit of a statistical model describes how well it fits a set of observations.
Measures of goodness of fit typically summarize the discrepancy between observed values and
the values expected under the model in question. We will consider the explained variance, also
known as the coefficient of determination, denoted R², pronounced R-squared.
The total sum of squares, SStot is the sum of the sum of squares explained by the regression,
SSreg, plus the sum of squares of residuals unexplained by the regression, SSres, also called the
SSE, i.e. such that
SStot = SSreg + SSres
The mean of y is ȳ = (1/n) Σᵢ yᵢ.

The total sum of squares is the total squared sum of deviations from the mean of y, i.e.

SStot = Σᵢ (yᵢ − ȳ)².

The regression sum of squares, also called the explained sum of squares, is

SSreg = Σᵢ (ŷᵢ − ȳ)²,

where ŷᵢ = βxᵢ + β₀ is the estimated value of salary given a value of experience xᵢ. The residual
sum of squares is SSres = Σᵢ (yᵢ − ŷᵢ)², and R² = SSreg/SStot = 1 − SSres/SStot.
Test
Let σ̂² = SSres/(n − 2) be an estimator of the variance of ε. The 2 stems from the number of
estimated parameters: intercept and coefficient.
• Unexplained variance: SSres/σ̂² ∼ χ²ₙ₋₂.
• Explained variance: SSreg/σ̂² ∼ χ²₁. The single degree of freedom comes from the difference
between the total (∼ χ²ₙ₋₁) and the unexplained (∼ χ²ₙ₋₂) variance, i.e. (n − 1) − (n − 2) degrees of freedom.
The F statistic is the ratio F = (SSreg/1) / (SSres/(n − 2)).
Using the F-distribution, compute the probability of observing a value greater than F under H0,
i.e.: P(x > F|H0), i.e. the survival function (1 − Cumulative Distribution Function) at x of the given
F-distribution.
Exercise
Compute:
• y¯ : y_mu
• SStot: ss_tot
• SSreg: ss_reg
• SSres: ss_res
• Check partition of variance formula based on sum of
squares by using assert np.allclose(val1,val2,atol=1e-05)
• Compute R² and compare it with the r_value above
• Compute the F score
• Compute the p-value:
• Plot the F(1,n) distribution for 100 f values within [10,25]. Draw P(F(1,n) > F), i.e. color the
surface defined by the x values larger than F below the F(1,n).
• P(F(1,n) > F) is the p-value, compute it.
Multiple regression
Theory
Multiple Linear Regression is the most basic supervised learning algorithm.
Given: a set of training data {x1,...,xN} with corresponding targets {y1,...,yN}.
In linear regression, we assume that the model that generates the data involves only a linear
combination of the input variables, i.e.
y(xᵢ, β) = β₀ + β₁x¹ᵢ + ... + βₚxᴾᵢ,
or, simplified
y(xᵢ, β) = β₀ + Σ_{j=1}^{P−1} βⱼ xʲᵢ.
Extending each sample with an intercept, xᵢ := [1, xᵢ] ∈ R^{P+1}, allows us to use a more general
notation based on linear algebra and write it as a simple dot product:

y(xᵢ, β) = xᵢᵀ β,

where β ∈ R^{P+1} is a vector of weights that define the P + 1 parameters of the model. From now
on we have P regressors + the intercept.
Minimize the Mean Squared Error MSE loss:
MSE(β) = 1/N ∑_{i=1}^{N} (yi − xiᵀβ)².
The β that minimises the MSE can be found by:
∇β MSE(β) = 0    (6.4)
∇β 1/N (y − Xβ)ᵀ(y − Xβ) = 0    (6.5)
∇β (yᵀy − 2βᵀXᵀy + βᵀXᵀXβ) = 0    (6.6)
−2Xᵀy + 2XᵀXβ = 0    (6.7)
XᵀXβ = Xᵀy    (6.8)
β = (XᵀX)⁻¹ Xᵀy    (6.9)
Multiple regression
Interface with Numpy
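The statsmodels example below needs a design matrix X (whose first column is an intercept) and a target y. A minimal sketch of a simulated dataset, with assumed sizes and coefficients chosen only to roughly resemble the printed summary, also checks the closed-form solution derived above:

import numpy as np
import statsmodels.api as sm

np.random.seed(seed=42)  # make the example reproducible
N, P = 50, 3
X = np.random.normal(size=(N, P))                          # three predictors
betastar = np.array([.5, .5, .25])                         # assumed "true" coefficients
y = 10 + np.dot(X, betastar) + np.random.normal(size=N)    # intercept of 10 plus noise
X = sm.add_constant(X)                                     # prepend the 'const' column of ones

# Closed-form OLS solution beta = (X'X)^-1 X'y, using the pseudo-inverse
beta_hat = np.dot(np.linalg.pinv(np.dot(X.T, X)), np.dot(X.T, y))
print(beta_hat)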
import statsmodels.api as sm

## Fit and summary:
model = sm.OLS(y, X).fit()
print(model.summary())

# prediction of new values
ypred = model.predict(X)

# residuals + prediction == true values
assert np.all(ypred + model.resid == y)
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.363
Model: OLS Adj. R-squared: 0.322
Method: Least Squares F-statistic: 8.748
Date: Fri, 06 Jan 2017 Prob (F-statistic): 0.000106
Time: 10:42:31 Log-Likelihood: -71.271
No. Observations: 50 AIC: 150.5
Df Residuals: 46 BIC: 158.2
Df Model: 3
Covariance Type: nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         10.1474      0.150     67.520      0.000         9.845    10.450
x1             0.5794      0.160      3.623      0.001         0.258     0.901
x2             0.5165      0.151      3.425      0.001         0.213     0.820
x3             0.1786      0.144      1.240      0.221        -0.111     0.469
==============================================================================
Omnibus: 2.493 Durbin-Watson: 2.369
Prob(Omnibus): 0.288 Jarque-Bera (JB): 1.544
Skew: 0.330 Prob(JB): 0.462
Kurtosis: 3.554 Cond. No. 1.27
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
One-way AN(C)OVA
• ANOVA: one categorical independent variable, i.e. one factor.
• ANCOVA: ANOVA with some covariates.
import statsmodels.formula.api as smfrmla
oneway = smfrmla.ols('salary ~ management + experience', salary).fit()
print(oneway.summary())
aov = sm.stats.anova_lm(oneway, typ=2) # Type 2 ANOVA DataFrame
print(aov)
OLS Regression Results
==============================================================================
Dep. Variable: salary R-squared: 0.865
Model: OLS Adj. R-squared: 0.859
Method: Least Squares F-statistic: 138.2
Date: Fri, 06 Jan 2017 Prob (F-statistic): 1.90e-19
Time: 10:45:13 Log-Likelihood: -407.76
No. Observations: 46 AIC: 821.5
Df Residuals: 43 BIC: 827.0
Df Model: 2
Covariance Type: nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------
Intercept        1.021e+04    525.999     19.411      0.000      9149.578  1.13e+04
management[T.Y]  7145.0151    527.320     13.550      0.000      6081.572  8208.458
experience        527.1081     51.106     10.314      0.000       424.042   630.174
==============================================================================
Omnibus: 11.437 Durbin-Watson: 2.193
Prob(Omnibus): 0.003 Jarque-Bera (JB): 11.260
Skew: -1.131 Prob(JB): 0.00359
Kurtosis: 3.872 Cond. No. 22.4
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                  sum_sq    df           F        PR(>F)
management  5.755739e+08   1.0  183.593466  4.054116e-17
experience  3.334992e+08   1.0  106.377768  3.349662e-13
Residual    1.348070e+08  43.0         NaN           NaN
Two-way AN(C)OVA
ANCOVA with two categorical independent variables, i.e. two factors.
import statsmodels.formula.api as smfrmla
twoway = smfrmla.ols('salary ~ education + management + experience', salary).fit()
print(twoway.summary())
aov = sm.stats.anova_lm(twoway, typ=2) # Type 2 ANOVA DataFrame
print(aov)
                  sum_sq    df           F        PR(>F)
education   9.152624e+07   2.0   43.351589  7.672450e-11
management  5.075724e+08   1.0  480.825394  2.901444e-24
experience  3.380979e+08   1.0  320.281524  5.546313e-21
Residual    4.328072e+07  41.0         NaN           NaN
Factor coding
See
http://statsmodels.sourceforge.net/devel/contrasts.html
By default Pandas uses “dummy coding”. Explore:
print(twoway.model.data.param_names)
print(twoway.model.data.exog[:10, :])
['Intercept', 'education[T.Master]', 'education[T.Ph.D]', 'management[T.Y]', 'experience']
[[ 1. 0. 0. 1. 1.]
[ 1. 0. 1. 0. 1.]
[ 1. 0. 1. 1. 1.]
[ 1. 1. 0. 0. 1.]
[ 1. 0. 1. 0. 1.]
[ 1. 1. 0. 1. 2.]
[ 1. 1. 0. 0. 2.]
[ 1. 0. 0. 0. 2.]
[ 1. 0. 1. 0. 2.]
[ 1. 1. 0. 0. 3.]]
Contrasts and post-hoc tests
# t-test of the specific contribution of experience:
ttest_exp = twoway.t_test([0, 0, 0, 0, 1])
ttest_exp.pvalue, ttest_exp.tvalue
print(ttest_exp)
# Alternatively, you can specify the hypothesis tests using a string
twoway.t_test('experience')
# Post-hoc is salary of Master different salary of Ph.D?
# ie. t-test salary of Master = salary of Ph.D.
print(twoway.t_test('education[T.Master] = education[T.Ph.D]'))
Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
c0           546.1840     30.519     17.896      0.000       484.549   607.819
==============================================================================
Test for Constraints
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
c0           147.8249    387.659      0.381      0.705      -635.069   930.719
==============================================================================
Multiple comparisons
import numpy as np
np.random.seed(seed=42) # make example reproducible
# Dataset
n_samples, n_features = 100, 1000
n_info = int(n_features/10) # number of features with information
n1, n2 = int(n_samples/2), n_samples - int(n_samples/2)
snr = .5
Y = np.random.randn(n_samples, n_features)
grp = np.array(["g1"] * n1 + ["g2"] * n2)
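The FDR example below needs a vector pvals of per-feature p-values. A minimal sketch (the injected signal and the use of scipy.stats.ttest_ind are assumptions) that adds information to the first n_info features of group g1 and computes them:

import scipy.stats as stats

# Add some signal to the informative features of group g1 only
Y[grp == "g1", :n_info] += snr
# Two-sample t-test (equal variance) for each feature (column) of Y
tvals, pvals = stats.ttest_ind(Y[grp == "g1", :], Y[grp == "g2", :], equal_var=True)
print("Nb of uncorrected p-values < 0.05:", np.sum(pvals < 0.05))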
The False discovery rate (FDR) correction for multiple comparisons
FDR-controlling procedures are designed to control the expected proportion of rejected null
hypotheses that were incorrect rejections (“false discoveries”). FDR-controlling procedures
provide less stringent control of Type I errors compared to the familywise error rate (FWER)
controlling procedures (such as the Bonferroni correction), which control the probability of at
least one Type I error. Thus, FDR-controlling procedures have greater power, at the cost of
increased rates of Type I errors.
import statsmodels.sandbox.stats.multicomp as multicomp

# multipletests returns (reject, corrected p-values, ...): keep the corrected p-values
_, pvals_fdr, _, _ = multicomp.multipletests(pvals, alpha=0.05, method='fdr_bh')
TP = np.sum(pvals_fdr[:n_info] < 0.05)  # True Positives
FP = np.sum(pvals_fdr[n_info:] < 0.05)  # False Positives
print("FDR correction, FP: %i, TP: %i" % (FP, TP))
FDR correction, FP: 3, TP: 20
Exercise
This exercise has two goals: to apply your knowledge of statistics and to practice vectorized numpy operations.
Given the dataset provided for multiple comparison, compute the two-sample t-test (assuming
equal variance) for each (column) feature of Y given the two groups defined by grp. You should
return two vectors of size n_features: one for the t-values and one for the p-values.
CHAPTER
SEVEN
MULTIVARIATE STATISTICS
Multivariate statistics includes all statistical techniques for analyzing samples made of two or
more variables. The data set (a N ×P matrix X) is a collection of N independent samples column
vectors [x1,...,xi,...,xN] of length P.
Linear Algebra
Euclidean norm and distance
The Euclidean norm of a vector a ∈ R^P is denoted
ǁaǁ2 = √(∑i ai²).
The dot product of two vectors a and b is a · b = ∑i ai bi = ǁaǁ2 ǁbǁ2 cos θ, where θ is the angle between them.
In particular, if a and b are orthogonal, then the angle between them is 90° and
a · b = 0.
At the other extreme, if they are codirectional, then the angle between them is 0° and
a · b = ǁaǁ2 ǁbǁ2.
The scalar projection (or scalar component) of a Euclidean vector a in the direction of a
Euclidean vector b is given by
ab = ǁaǁ2 cosθ,
where θ is the angle between a and b.
Fig. 7.1: Projection.
import numpy as np
np.random.seed(42)
a = np.random.randn(10)
b = np.random.randn(10)
np.dot(a, b)
-4.0857885326599241
Mean vector
The mean (P × 1) column-vector µ whose estimator is
x¯ = 1/N ∑_{i=1}^{N} xi.
Covariance matrix
• The covariance matrix ΣXX is a symmetric positive semi-definite matrix whose element in the j,k position is the covariance between the jth and kth elements of a random vector, i.e. the jth and kth columns of X.
• The covariance matrix generalizes the notion of covariance to multiple dimensions.
• The covariance matrix describes the shape of the sample distribution around the mean assuming an elliptical distribution:
ΣXX = E[(X − E(X))ᵀ (X − E(X))],
whose estimator SXX is a P × P matrix given by
SXX = 1/(N − 1) Xcᵀ Xc,
where Xc is the column-centered data matrix and
sjk = 1/(N − 1) ∑_{i=1}^{N} (xij − x¯j)(xik − x¯k)
is an estimator of the covariance between the jth and kth variables.
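A minimal numpy sketch of these estimators on an arbitrary simulated dataset:

import numpy as np

np.random.seed(42)
# Simulated dataset: N = 100 samples of a 3-dimensional random vector
X = np.random.multivariate_normal(mean=[1, 1, 1],
                                  cov=[[1, .8, 0], [.8, 1, 0], [0, 0, 1]],
                                  size=100)
xbar = X.mean(axis=0)                    # estimated mean vector
Xc = X - xbar                            # centered data
S = np.dot(Xc.T, Xc) / (X.shape[0] - 1)  # empirical covariance matrix
print(xbar.round(2))
print(S.round(2))
print(np.allclose(S, np.cov(X.T)))       # np.cov uses the same (N - 1) normalization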
Precision matrix
In statistics, precision is the reciprocal of the variance, and the precision matrix is the matrix
inverse of the covariance matrix.
It is related to the partial correlations, which measure the degree of association between two variables while controlling for the effect of the other variables.
import numpy as np
Cov = np.array([[1.0, 0.9, 0.9, 0.0, 0.0, 0.0],
[0.9, 1.0, 0.9, 0.0, 0.0, 0.0],
[0.9, 0.9, 1.0, 0.0, 0.0, 0.0],
[0.0, 0.0, 0.0, 1.0, 0.9, 0.0],
[0.0, 0.0, 0.0, 0.9, 1.0, 0.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])
print("# Precision matrix:")
Prec=np.linalg.inv(Cov)
print(Prec.round(2))
print("# Partial correlations:")
Pcor = np.zeros(Prec.shape)
Pcor[::] = np.NaN
Mahalanobis distance
• The Mahalanobis distance is a measure of the distance between two points x and µ where
the dispersion (i.e. the covariance structure) of the samples is taken into account.
• The dispersion is considered through covariance matrix.
This is formally expressed as
DM(x, µ) = √((x − µ)ᵀ Σ⁻¹ (x − µ)).
Intuitions
• Distances along the principal directions of dispersion are contracted since they correspond
to likely dispersion of points.
• Distances orthogonal to the principal directions of dispersion are dilated since they correspond to unlikely dispersion of points.
For example
DM(1) = √(1ᵀ Σ⁻¹ 1).
ones = np.ones(Cov.shape[0])
d_euc = np.sqrt(np.dot(ones, ones))
d_mah = np.sqrt(np.dot(np.dot(ones, Prec), ones))
print("Euclidian norm of ones=%.2f. Mahalanobis norm of ones=%.2f" % (d_euc, d_mah))
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import pystatsml.plot_utils
%matplotlib inline
np.random.seed(40)
colors = sns.color_palette()
mean = np.array([0, 0])
Cov = np.array([[1, .8],[.8, 1]])
samples = np.random.multivariate_normal(mean, Cov, 100)
x1 = np.array([0, 2])
x2 = np.array([2, 2])
plt.scatter(samples[:, 0], samples[:, 1], color=colors[0])
plt.scatter(mean[0], mean[1], color=colors[0], s=200, label="mean")
plt.scatter(x1[0], x1[1], color=colors[1], s=200, label="x1")
plt.scatter(x2[0], x2[1], color=colors[2], s=200, label="x2")
# plot covariance ellipsis
pystatsml.plot_utils.plot_cov_ellipse(Cov, pos=mean, facecolor='none', linewidth=2,
edgecolor=colors[0])
# Compute distances
d2_m_x1 = scipy.spatial.distance.euclidean(mean, x1)
d2_m_x2 = scipy.spatial.distance.euclidean(mean, x2)
Covi = scipy.linalg.inv(Cov)
dm_m_x1 = scipy.spatial.distance.mahalanobis(mean, x1, Covi)
dm_m_x2 = scipy.spatial.distance.mahalanobis(mean, x2, Covi)
# Plot distances
vm_x1 = (x1 - mean) / d2_m_x1
vm_x2 = (x2 - mean) / d2_m_x2
jitter = .1
plt.plot([mean[0] - jitter, d2_m_x1 * vm_x1[0] - jitter],
         [mean[1], d2_m_x1 * vm_x1[1]], color='k')
plt.plot([mean[0] - jitter, d2_m_x2 * vm_x2[0] - jitter],
         [mean[1], d2_m_x2 * vm_x2[1]], color='k')
plt.plot([mean[0] + jitter, dm_m_x1 * vm_x1[0] + jitter],
         [mean[1], dm_m_x1 * vm_x1[1]], color='r')
plt.plot([mean[0] + jitter, dm_m_x2 * vm_x2[0] + jitter],
         [mean[1], dm_m_x2 * vm_x2[1]], color='r')
plt.legend(loc='lower right')
plt.text(-6.1, 3, 'Euclidian: d(m, x1) = %.1f < d(m, x2) = %.1f' % (d2_m_x1, d2_m_x2), color='k')
plt.text(-6.1, 3.5, 'Mahalanobis: d(m, x1) = %.1f > d(m, x2) = %.1f' % (dm_m_x1, dm_m_x2), color='r')
plt.axis('equal')
print('Euclidian d(m, x1) = %.2f < d(m, x2) = %.2f' % (d2_m_x1, d2_m_x2))
print('Mahalanobis d(m, x1) = %.2f > d(m, x2) = %.2f' % (dm_m_x1, dm_m_x2))
If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the
Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is
called a normalized Euclidean distance.
More generally, the Mahalanobis distance is a measure of the distance between a point x and a distribution N(µ, Σ). It is a multi-dimensional generalization of the idea of measuring how many
standard deviations away x is from the mean. This distance is zero if x is at the mean, and
grows as x moves away from the mean: along each principal component axis, it measures the
number of standard deviations from x to the mean of the distribution.
Multivariate normal distribution
The distribution, or probability density function (PDF) (sometimes just density), of a continuous
random variable is a function that describes the relative likelihood for this random variable to
take on a given value.
The multivariate normal distribution, or multivariate Gaussian distribution, of a P-dimensional random vector x = [x1, x2, ..., xP]ᵀ is
N(x | µ, Σ) = 1 / ((2π)^{P/2} |Σ|^{1/2}) exp(−1/2 (x − µ)ᵀ Σ⁻¹ (x − µ)).
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import multivariate_normal
from mpl_toolkits.mplot3d import Axes3D

def multivariate_normal_pdf(X, mean, sigma):
    """Multivariate normal probability density function over X (n_samples x n_features)"""
    P = X.shape[1]
    det = np.linalg.det(sigma)
    norm_const = 1.0 / (((2*np.pi) ** (P/2)) * np.sqrt(det))
    X_mu = X - mean
    inv = np.linalg.inv(sigma)
    d2 = np.sum(np.dot(X_mu, inv) * X_mu, axis=1)
    return norm_const * np.exp(-0.5 * d2)
# mean and covariance
mu = np.array([0, 0])
sigma = np.array([[1, -.5],[-.5, 1]])
# x, y grid
x, y = np.mgrid[-3:3:.1, -3:3:.1]
X = np.stack((x.ravel(), y.ravel())).T
norm = multivariate_normal_pdf(X, mu, sigma).reshape(x.shape)
# Do it with scipy
norm_scpy = multivariate_normal(mu, sigma).pdf(np.stack((x, y), axis=2))
assert np.allclose(norm, norm_scpy)
# Plot
fig = plt.figure(figsize=(10, 7))
ax = fig.gca(projection='3d')
surf = ax.plot_surface(x, y, norm, rstride=3,cstride=3,
cmap=plt.cm.coolwarm, linewidth=1, antialiased=False
)
ax.set_zlim(0, 0.2)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')
plt.title('Bivariate Normal/Gaussian distribution')
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
plt.show()
Exercises
Dot product and Euclidean norm
Given a = [2, 1]ᵀ and b = [1, 1]ᵀ
1. Write a function euclidian(x) that computes the Euclidean norm of a vector x.
2. Compute the Euclidean norm of a.
3. Compute the Euclidean distance ǁa − bǁ2.
4. Compute the projection of b in the direction of the vector a: ba.
5. Simulate a dataset X of N = 100 samples of 2-dimensional vectors.
6. Project all samples in the direction of the vector a.
Covariance matrix and Mahalanobis norm
1. Sample a dataset X of N = 100 samples of a 2-dimensional random vector drawn from a Gaussian distribution N(µ, Σ), where µ = [1, 1]ᵀ and Σ is a given 2 × 2 covariance matrix.
2. Compute the mean vector x¯ and center X. Compare the estimated mean x¯ to the true
mean, µ.
3. Compute the empirical covariance matrix S. Compare the estimated covariance matrix S to
the true covariance matrix, Σ.
4. Compute S⁻¹ (Sinv), the inverse of the covariance matrix, by using scipy.linalg.inv(S).
5. Write a function mahalanobis(x,xbar,Sinv) that computes the Mahalanobis distance of a
vector x to the mean, x¯ .
6. Compute the Mahalanobis and Euclidian distances of each sample xi to the mean x¯ . Store
the results in a 100 × 2 dataframe.
CHAPTER
EIGHT
DIMENSIONALITY REDUCTION AND FEATURE EXTRACTION
Introduction
In machine learning and statistics, dimensionality reduction or dimension reduction is the
process of reducing the number of features under consideration, and can be divided into feature
selection (not addressed here) and feature extraction.
Feature extraction starts from an initial set of measured data and builds derived values
(features) intended to be informative and non-redundant, facilitating the subsequent learning
and generalization steps, and in some cases leading to better human interpretations. Feature
extraction is related to dimensionality reduction.
The goal is to learn a transformation that extracts a few relevant features. This is generally done
by exploiting the covariance ΣXX between the input features.
Matrix factorization
The singular value decomposition (SVD) factorizes the (centered) data matrix X as X = UDVᵀ, where the columns of U are the left singular vectors, D is the diagonal matrix of singular values, and the columns of V (the rows of Vᵀ) are the right singular vectors.
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

np.random.seed(42)

# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])
# PCA using SVD
X -= X.mean(axis=0) # Centering is required
U, s, Vh = scipy.linalg.svd(X, full_matrices=False)
# U : Unitary matrix having left singular vectors as columns.
#Of shape (n_samples,n_samples) or (n_samples,n_comps), depending on # full_matrices.
# s : The singular values, sorted in non-increasing order. Of shape (n_comps,), # with n_comps =
# min(n_samples, n_features).
# Vh: Unitary matrix having right singular vectors as rows.
# Of shape (n_features, n_features) or (n_comps, n_features) depending
# on full_matrices.
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(U[:, 0], U[:, 1], s=50)
plt.axis('equal')
plt.title("U: Rotated and scaled data")
plt.subplot(132)
# Project data
PC = np.dot(X, Vh.T)
plt.scatter(PC[:, 0], PC[:, 1], s=50)
plt.axis('equal')
plt.title("Rotated data (PCs)")
plt.xlabel("Princ. Comp. 1 (PC1)")
plt.ylabel("Princ. Comp. 2 (PC2)")
plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], s=50)
for i in range(Vh.shape[0]):
plt.arrow(x=0, y=0, dx=Vh[i, 0], dy=Vh[i, 1], head_width=0.2, head_length=0.2, linewidth=2, fc='r', ec='r')
plt.text(Vh[i, 0], Vh[i, 1],'v%i' % (i+1), color="r", fontsize=15, horizontalalignment='right', verticalalignment='top')
plt.axis('equal')
plt.ylim(-4, 4)
plt.title("Original data, with PC dir. u1, u2")
plt.xlabel("experience")
plt.ylabel("salary")
plt.tight_layout()
Principal components analysis (PCA)
Sources:
• C. M. Bishop Pattern Recognition and Machine Learning, Springer, 2006
• Everything you did and didn’t know about PCA
• Principal Component Analysis in 3 Simple Steps
Principles
• Principal components analysis is the main method used for linear dimension reduction.
• The idea of principal component analysis is to find the K principal component directions (called the loadings) VK×P that capture the variation in the data as much as possible.
• It converts a set of N P-dimensional observations XN×P of possibly correlated variables into a set of N K-dimensional samples CN×K, where K < P. The new variables are linearly uncorrelated. The columns of CN×K are called the principal components.
• The dimension reduction is obtained by using only K < P components that exploit
correlation (covariance) among the original variables.
• PCA is mathematically defined as an orthogonal linear transformation VK×P that transforms
the data to a new coordinate system such that the greatest variance by some projection of
the data comes to lie on the first coordinate (called the first principal component), the
second greatest variance on the second coordinate, and so on.
CN×K = XN×PVP×K
• PCA can be thought of as fitting a P-dimensional ellipsoid to the data, where each axis of
the ellipsoid represents a principal component. If some axis of the ellipse is small, then the
variance along that axis is also small, and by omitting that axis and its corresponding
principal component from our representation of the dataset, we lose only a
commensurately small amount of information.
• Finding the K largest axes of the ellipse permits projecting the data onto a space of dimensionality K < P while maximizing the variance of the projected data.
Dataset pre-processing
Centering
Consider a data matrix, X, with column-wise zero empirical mean (the sample mean of each column has been shifted to zero), i.e. X is replaced by X − 1x¯ᵀ.
Standardizing
Optionally, standardize the columns, i.e., scale them by their standard deviation. Without standardization, a variable with a high variance will capture most of the effect of the PCA. The principal direction will be aligned with this variable. Standardization will, however, raise noise variables to the same level as informative variables.
The covariance matrix of centered standardized data is the correlation matrix.
Each P-dimensional data point xi is then projected onto v, where the coordinate (in the coordinate system of v) is a scalar value, namely xiᵀv. I.e., we want to find the vector v that maximizes these coordinates along v, which we will see corresponds to maximizing the variance of the projected data. This is equivalently expressed as
v = arg max_{ǁvǁ=1} 1/N ∑i (xiᵀ v)².
We can write this in matrix form as
v = arg max_{ǁvǁ=1} 1/N ǁXvǁ² = arg max_{ǁvǁ=1} vᵀ SXX v,
where SXX is a biased estimate of the covariance matrix of the data, i.e.
SXX = 1/N XᵀX.
We now maximize the projected variance vᵀ SXX v with respect to v. Clearly, this has to be a constrained maximization to prevent ǁvǁ2 → ∞. The appropriate constraint comes from the normalization condition vᵀv = 1. To enforce this constraint, we introduce a Lagrange multiplier that we shall denote by λ, and then make an unconstrained maximization of
vᵀ SXX v − λ(vᵀv − 1).
By setting the gradient with respect to v equal to zero, we see that this quantity has a stationary
point when
SXXv = λv.
We note that v is an eigenvector of SXX.
If we left-multiply the above equation by vᵀ and make use of vᵀv = 1, we see that the variance is given by
vᵀ SXX v = λ,
and so the variance will be at a maximum when v is equal to the eigenvector corresponding to
the largest eigenvalue, λ. This eigenvector is known as the first principal component.
We can define additional principal components in an incremental fashion by choosing each new
direction to be that which maximizes the projected variance amongst all possible directions that
are orthogonal to those already considered. If we consider the general case of a K-dimensional
projection space, the optimal linear projection for which the variance of the projected data is
maximized is now defined by the K eigenvectors, v1,...,vK, of the data covariance matrix SXX that
corresponds to the K largest eigenvalues, λ1 ≥ λ2 ≥ ··· ≥ λK.
Back to SVD
The sample covariance matrix of centered data X is given by SXX = 1/(N − 1) XᵀX. Using the SVD X = UDVᵀ:
XᵀX = (UDVᵀ)ᵀ(UDVᵀ)
    = VDUᵀUDVᵀ
    = VD²Vᵀ
VᵀXᵀXV = D²
Vᵀ SXX V = 1/(N − 1) D².
Considering only the kth right-singular vector vk associated with the singular value dk,
vkᵀ SXX vk = 1/(N − 1) dk².
It turns out that if you have done the singular value decomposition then you already have the eigenvalue decomposition of XᵀX, where:
• The eigenvectors of SXX are the right singular vectors, V, of X.
• The eigenvalues, λk, of SXX, i.e. the variances of the components, are equal to 1/(N − 1) times the squared singular values, dk².
Moreover, computing PCA with the SVD does not require forming the matrix XᵀX, so computing the SVD is now the standard way to calculate a principal components analysis from a data matrix, unless only a handful of components are required.
PCA outputs
The SVD or the Eigen decomposition of the data covariance matrix provides three main
quantities:
1. Principal component directions or loadings are the eigenvectors of XᵀX. The VK×P or the right-singular vectors of an SVD of X are called principal component directions of X. They are generally computed using the SVD of X.
2. Principal components is the N × K matrix C which is obtained by projecting X onto the
principal components directions, i.e.
CN×K = XN×PVP×K.
Since X = UDVᵀ and V is orthogonal (VᵀV = I):
CN×K = UD.
3. The variance of each component is given by the eigenvalues λk, k = 1, ..., K. It can be obtained from the singular values:
var(ck) = 1/(N − 1) dk².    (8.5)
c = Xv    (8.8)
Xᵀc = XᵀXv    (8.9)
(XᵀX)⁻¹Xᵀc = v    (8.10)
Another way to evaluate the contribution of the original variables in each PC can be obtained by
computing the correlation between the PCs and the original variables, i.e. columns of X,
denoted xj, for j = 1,...,P. For the kth PC, compute and plot the correlations with all original
variables
cor(ck, xj), j = 1, ..., P.
These quantities are sometimes called the correlation loadings.
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
np.random.seed(42)
# dataset
n_samples = 100
experience = np.random.normal(size=n_samples)
salary = 1500 + experience + np.random.normal(size=n_samples, scale=.5)
X = np.column_stack([experience, salary])
# PCA with scikit-learn
pca = PCA(n_components=2)
pca.fit(X)
print(pca.explained_variance_ratio_)
PC = pca.transform(X)
plt.subplot(121)
plt.scatter(X[:, 0], X[:, 1])
plt.xlabel("x1");
plt.ylabel("x2")
plt.subplot(122)
plt.scatter(PC[:, 0], PC[:, 1])
plt.xlabel("PC1 (var=%.2f)" % pca.explained_variance_ratio_[0])
plt.ylabel("PC2 (var=%.2f)" % pca.explained_variance_ratio_[1])
plt.axis('equal')
plt.tight_layout()
[ 0.93646607 0.06353393]
Exercises
Write a basic PCA class
Write a class BasicPCA with two methods:
• fit(X) that estimates the data mean, principal components directions V and the explained
variance of each component.
• transform(X) that projects the data onto the principal components.
Check that your BasicPCA gave similar results, compared to the results from sklearn.
Multidimensional scaling (MDS)
The purpose of MDS is to find a low-dimensional projection of the data in which the pairwise distances between data points are preserved, as closely as possible (in a least-squares sense).
• Let D be the (N × N) pairwise distance matrix where dij is a distance between points i and j.
• The MDS concept can be extended to a wide variety of data types specified in terms of a
similarity matrix.
Given the dissimilarity (distance) matrix DN×N = [dij], MDS attempts to find K-dimensional
projections of the N points x1,...,xN ∈ RK, concatenated in an XN×K matrix, so that dij ≈ ǁxi − xjǁ are as
close as possible. This can be obtained by the minimization of a loss function called the stress
function
stress(X) = ∑_{i≠j} (dij − ǁxi − xjǁ)².
The Sammon mapping normalizes each term by the original distance:
stressSammon(X) = 1/(∑_{i≠j} dij) ∑_{i≠j} (dij − ǁxi − xjǁ)² / dij.
The Sammon mapping performs better at preserving small distances compared to the least-
squares scaling.
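The plot below assumes a list of city names and a precomputed pairwise distance matrix D, to which scikit-learn's metric MDS is applied. A minimal sketch of such a setup (the eurodist file path and the MDS settings are assumptions):

import numpy as np
import pandas as pd
from sklearn.manifold import MDS

# Assumed data file: first column holds city names, the remaining columns the
# pairwise road distances between the cities.
url = 'https://raw.github.com/neurospin/pystatsml/master/data/eurodist.csv'
df = pd.read_csv(url)
city = df.iloc[:, 0]              # city names
D = np.array(df.iloc[:, 1:])      # precomputed (N x N) distance matrix

mds = MDS(dissimilarity='precomputed', n_components=2, random_state=42, max_iter=3000, eps=1e-9)
Xr = mds.fit_transform(D)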
Xr[:, 0] *= -1
plt.scatter(Xr[:, 0], Xr[:, 1])
for i in range(len(city)):
plt.text(Xr[i, 0], Xr[i, 1], city[i])
plt.axis('equal')
Determining the number of components
We must choose K* ∈ {1, ..., K}, the number of required components. Plot the values of the stress function obtained using k ≤ N − 1 components; in general, start with K ≤ 4 components. Choose K* where you can clearly distinguish an elbow in the stress curve.
Thus, in the plot below, we choose to retain information accounted for by the first two
components, since this is where the elbow is in the stress curve.
k_range = range(1, min(5, D.shape[0]-1))
stress = [MDS(dissimilarity='precomputed', n_components=k, random_state=42, max_iter=300,
              eps=1e-9).fit(D).stress_ for k in k_range]
print(stress)
plt.plot(k_range, stress)
plt.xlabel("k")
plt.ylabel("stress")
Exercises
Apply MDS from sklearn on the iris dataset available at:
https://raw.github.com/neurospin/pystatsml/master/data/iris.csv
• Center and scale the dataset.
• Compute Euclidean pairwise distances matrix.
• Select the number of components.
• Show that classical MDS on Euclidean pairwise distances matrix is equivalent to PCA.
Isomap
Isomap is a nonlinear dimensionality reduction method that combines a procedure to compute
the distance matrix with MDS. The distances calculation is based on geodesic distances
evaluated on neighbourhood graph:
1. Determine the neighbors of each point. All points in some fixed radius or K nearest
neighbors.
2. Construct a neighbourhood graph. Each point is connected to another if it is one of its K nearest neighbours. The edge length equals the Euclidean distance.
3. Compute the shortest path between pairs of points dij to build the distance matrix D.
4. Apply MDS on D.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import manifold, datasets
X, color = datasets.samples_generator.make_s_curve(1000, random_state=42)
fig = plt.figure(figsize=(10, 5))
plt.suptitle("Isomap Manifold Learning", fontsize=14)
ax = fig.add_subplot(121, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap=plt.cm.Spectral)
ax.view_init(4, -72)
plt.title('2D "S shape" manifold in 3D')
Y = manifold.Isomap(n_neighbors=10, n_components=2).fit_transform(X)
ax = fig.add_subplot(122)
plt.scatter(Y[:, 0], Y[:, 1], c=color, cmap=plt.cm.Spectral)
plt.title("Isomap")
plt.xlabel("First component")
plt.ylabel("Second component")
plt.axis('tight')
(-5.4131242078919239,
5.2729984345096854,
-1.2877687637642998,
1.2316524684384262)
CHAPTER
NINE
CLUSTERING
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects
in the same group (called a cluster) are more similar (in some sense or another) to each other
than to those in other groups (clusters). Clustering is one of the main tasks of exploratory data
mining, and a common technique for statistical data analysis, used in many fields, including
machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.
Sources: http://scikit-learn.org/stable/modules/clustering.html
K-means clustering
Source: C. M. Bishop Pattern Recognition and Machine Learning, Springer, 2006
Suppose we have a data set X = {x1,··· ,xN} that consists of N observations of a random D-
dimensional Euclidean variable x. Our goal is to partition the data set into some number, K, of
clusters, where we shall suppose for the moment that the value of K is given. Intuitively, we
might think of a cluster as comprising a group of data points whose inter-point distances are
small compared to the distances to points outside of the cluster. We can formalize this notion by
first introducing a set of D-dimensional vectors µk, where k = 1,...,K, in which µk is a prototype
associated with the kth cluster. As we shall see shortly, we can think of the µk as representing the
centres of the clusters. Our goal is then to find an assignment of data points to clusters, as well
as a set of vectors {µk}, such that the sum of the squares of the distances of each data point to
its closest prototype vector µk, is at a minimum.
It is convenient at this point to define some notation to describe the assignment of data points to
clusters. For each data point xi , we introduce a corresponding set of binary indicator variables rik
∈ {0,1}, where k = 1,...,K, that describes which of the K clusters the data point xi is assigned to, so
that if data point xi is assigned to cluster k then rik = 1, and rij = 0 for j ̸= k. This is known as the 1-
of-K coding scheme. We can then define an objective function, denoted inertia, as
J = ∑_{i=1}^{N} ∑_{k=1}^{K} rik ǁxi − µkǁ²,
which represents the sum of the squares of the Euclidean distances of each data point to its
assigned vector µk. Our goal is to find values for the {rik} and the {µk} so as to minimize the
function J. We can do this through an iterative procedure in which each iteration involves two
successive steps corresponding to successive optimizations with respect to the rik and the µk .
First we choose some initial values for the µk. Then in the first phase we minimize J with respect
to the rik, keeping the µk fixed. In the second phase we minimize J with respect to the µk, keeping
rik fixed. This two-stage optimization process is then repeated until convergence. We shall see
that these two stages of updating rik and µk correspond respectively to the expectation (E) and
maximization (M) steps of the expectation maximisation (EM) algorithm, and to emphasize this
we shall use the terms E step and M step in the context of the K-means algorithm.
Consider first the determination of the rik. Because J is a linear function of rik, this optimization can be performed easily to give a closed form solution. The terms involving different i are
independent and so we can optimize for each i separately by choosing rik to be 1 for whichever
value of k gives the minimum value of ||xi −µk||2 . In other words, we simply assign the ith data
point to the closest cluster centre.
Now consider the optimization of the µk with the rik held fixed. The objective function J is a
quadratic function of µk, and it can be minimized by setting its derivative with respect to µk to
zero giving
2 ∑i rik (xi − µk) = 0,
which we can easily solve for µk to give
µk = ∑i rik xi / ∑i rik.
The denominator in this expression is equal to the number of points assigned to cluster k, and
so this result has a simple interpretation, namely set µk equal to the mean of all of the data
points xi assigned to cluster k. For this reason, the procedure is known as the K-means
algorithm.
The two phases of re-assigning data points to clusters and re-computing the cluster means are
repeated in turn until there is no further change in the assignments (or until some maximum
number of iterations is exceeded). Because each phase reduces the value of the objective
function J, convergence of the algorithm is assured. However, it may converge to a local rather
than global minimum of J.
from sklearn import cluster, datasets
import matplotlib.pyplot as plt
import seaborn as sns # nice color
%matplotlib inline
iris = datasets.load_iris()
X = iris.data[:, :2]  # use only 'sepal length and sepal width'
y_iris = iris.target
km2 = cluster.KMeans(n_clusters=2).fit(X)
km3 = cluster.KMeans(n_clusters=3).fit(X)
km4 = cluster.KMeans(n_clusters=4).fit(X)
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=km2.labels_)
plt.title("K=2, J=%.2f" % km2.inertia_)
plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=km3.labels_)
plt.title("K=3, J=%.2f" % km3.inertia_)
plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], c=km4.labels_)
plt.title("K=4, J=%.2f" % km4.inertia_)
<matplotlib.text.Text at 0x7fe4ad47b710>
Exercises
1. Analyse clusters
• Analyse the plot above visually. What would a good value of K be?
• If you instead consider the inertia, the value of J, what would a good value of K be?
• Explain why there is such difference.
• For K = 2 why did K-means clustering not find the two “natural” clusters? See the
assumptions of K-means: http://scikit-
learn.org/stable/auto_examples/cluster/plot_kmeans_assumptions.html# example-cluster-plot-
kmeans-assumptions-py
2. Re-implement the K-means clustering algorithm (homework)
Write a function kmeans(X,K) that return an integer vector of the samples’ labels.
Hierarchical clustering
Hierarchical clustering is an approach to clustering that build hierarchies of clusters in two main
approaches:
• Agglomerative: A bottom-up strategy, where each observation starts in their own cluster,
and pairs of clusters are merged upwards in the hierarchy.
• Divisive: A top-down strategy, where all observations start out in the same cluster, and then
the clusters are split recursively downwards in the hierarchy.
In order to decide which clusters to merge or to split, a measure of dissimilarity between clusters is introduced. More specifically, this comprises a distance measure and a linkage criterion. The distance measure is just what it sounds like, and the linkage criterion is essentially a function of
the distances between points, for instance the minimum distance between points in two clusters,
the maximum distance between points in two clusters, the average distance between points in
two clusters, etc. One particular linkage criterion, the Ward criterion, will be discussed next.
Ward clustering
Ward clustering belongs to the family of agglomerative hierarchical clustering algorithms. This means that it is based on a “bottom-up” approach: each sample starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
In Ward clustering, the criterion for choosing the pair of clusters to merge at each step is the
minimum variance criterion. Ward’s minimum variance criterion minimizes the total within-cluster
variance by each merge. To implement this method, at each step: find the pair of clusters that
leads to minimum increase in total within-cluster variance after merging. This increase is a
weighted squared distance between cluster centers.
The main advantage of agglomerative hierarchical clustering over K-means clustering is that you
can benefit from known neighbourhood information, for example, neighbouring pixels in an
image.
from sklearn import cluster, datasets
import matplotlib.pyplot as plt
import seaborn as sns # nice color
iris = datasets.load_iris()
X = iris.data[:, :2] # 'sepal length (cm)''sepal width (cm)'
y_iris = iris.target
ward2 = cluster.AgglomerativeClustering(n_clusters=2, linkage='ward').fit(X)
ward3 = cluster.AgglomerativeClustering(n_clusters=3, linkage='ward').fit(X)
ward4 = cluster.AgglomerativeClustering(n_clusters=4, linkage='ward').fit(X)
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=ward2.labels_)
plt.title("K=2")
plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=ward3.labels_)
plt.title("K=3")
plt.subplot(133)
plt.scatter(X[:, 0], X[:, 1], c=ward4.labels_) # .astype(np.float))
plt.title("K=4")
<matplotlib.text.Text at 0x7fe4ace5cfd0>
Gaussian mixture models
The Gaussian mixture model (GMM) is a simple linear superposition of Gaussian components
over the data, aimed at providing a rich class of density models. We turn to a formulation of
Gaussian mixtures in terms of discrete latent variables: the K hidden classes to be discovered.
Differences compared to K-means:
• Whereas the K-means algorithm performs a hard assignment of data points to clusters, in
which each data point is associated uniquely with one cluster, the GMM algorithm makes a
soft assignment based on posterior probabilities.
• Whereas the classic K-means is only based on Euclidean distances, the classic GMM uses Mahalanobis distances that can deal with non-spherical distributions. (It should be noted that the Mahalanobis distance could also be plugged into an improved version of K-means clustering.) The Mahalanobis distance is unitless and scale-invariant, and takes into account the correlations of the data set.
The Gaussian mixture distribution can be written as a linear superposition of K Gaussians in the
form:
p(x) = ∑_{k=1}^{K} N(x | µk, Σk) p(k),
where:
• N(x | µk, Σk) is the multivariate Gaussian distribution defined over a P-dimensional vector x of continuous variables.
• The p(k) are the mixing coefficients, also known as the class probability of class k, and they sum to one: ∑_{k=1}^{K} p(k) = 1.
• N(x | µk, Σk) = p(x | k) is the conditional distribution of x given a particular class k.
The goal is to maximize the log-likelihood of the GMM:
log ∏_{i=1}^{N} p(xi) = ∑_{i=1}^{N} log ∑_{k=1}^{K} N(xi | µk, Σk) p(k).
To compute the class parameters p(k), µk, Σk we sum over all samples, weighting each sample i by its responsibility or contribution to class k, p(k | xi), such that for each point its contribution to all classes sums to one: ∑k p(k | xi) = 1. This contribution is the conditional probability of class k given x, p(k | x) (sometimes called the posterior). It can be computed using Bayes’ rule:
p(k | x) = N(x | µk, Σk) p(k) / ∑_{k'=1}^{K} N(x | µk', Σk') p(k').
Since the class parameters, p(k), µk and Σk, depend on the responsibilities p(k |x) and the
responsibilities depend on class parameters, we need a two-step iterative algorithm: the
expectation-maximization (EM) algorithm. We discuss this algorithm next.
1. E step. For each sample i and each class k, compute the responsibilities p(k | xi) using the current values of the parameters p(k), µk, Σk.
2. M step. For each class, re-estimate the parameters p(k), µk, Σk using the current responsibilities, and check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied return to step 1.
import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns  # nice color
from sklearn.mixture import GaussianMixture
import pystatsml.plot_utils
colors = sns.color_palette()
iris = datasets.load_iris()
X = iris.data[:, :2] # 'sepal length (cm)''sepal width (cm)'
y_iris = iris.target
gmm2 = GaussianMixture(n_components=2, covariance_type='full').fit(X)
gmm3 = GaussianMixture(n_components=3, covariance_type='full').fit(X)
gmm4 = GaussianMixture(n_components=4, covariance_type='full').fit(X)
plt.figure(figsize=(9, 3))
plt.subplot(131)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm2.predict(X)])
for i in range(gmm2.covariances_.shape[0]):
    pystatsml.plot_utils.plot_cov_ellipse(cov=gmm2.covariances_[i, :], pos=gmm2.means_[i, :],
                                          facecolor='none', linewidth=2, edgecolor=colors[i])
    plt.scatter(gmm2.means_[i, 0], gmm2.means_[i, 1], edgecolor=colors[i], marker="o", s=100,
                facecolor="w", linewidth=2)
plt.title("K=2")
plt.subplot(132)
plt.scatter(X[:, 0], X[:, 1], c=[colors[lab] for lab in gmm3.predict(X)])
for i in range(gmm3.covariances_.shape[0]):
    pystatsml.plot_utils.plot_cov_ellipse(cov=gmm3.covariances_[i, :], pos=gmm3.means_[i, :],
                                          facecolor='none', linewidth=2, edgecolor=colors[i])
Model selection
Bayesian information criterion
In statistics, the Bayesian information criterion (BIC) is a criterion for model selection among a
finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the
likelihood function and it is closely related to the Akaike information criterion (AIC).
X = iris.data
y_iris = iris.target
bic = list() #print(X)
ks = np.arange(1, 10)
for k in ks:
gmm = GaussianMixture(n_components=k, covariance_type='full')
gmm.fit(X)
bic.append(gmm.bic(X))
k_chosen = ks[np.argmin(bic)]
plt.plot(ks, bic)
plt.xlabel("k")
plt.ylabel("BIC")
print("Choose k=", k_chosen)
Choose k= 2
CHAPTER
TEN
Overfitting
In statistics and machine learning, overfitting occurs when a statistical model describes random
errors or noise instead of the underlying relationships. Overfitting generally occurs when a
model is excessively complex, such as having too many parameters relative to the number of
observations. A model that has been over fit will generally have poor predictive performance, as
it can exaggerate minor fluctuations in the data.
A learning algorithm is trained using some set of training samples. If the learning algorithm has the capacity to overfit the training samples, the performance on the training sample set will improve while the performance on an unseen test sample set will decline.
The overfitting phenomenon has three main explanations: excessively complex models, multicollinearity, and high dimensionality.
Model complexity
Complex learners with too many parameters relative to the number of observations may overfit
the training dataset.
Multicollinearity
Predictors are highly correlated, meaning that one can be linearly predicted from the others. In
this situation the coefficient estimates of the multiple regression may change erratically in
response to small changes in the model or the data. Multicollinearity does not reduce the
predictive power or reliability of the model as a whole, at least not within the sample data set; it
only affects computations regarding individual predictors. That is, a multiple regression model
with correlated predictors can indicate how well the entire bundle of predictors predicts the
outcome variable, but it may not give valid results about any individual predictor, or about which
predictors are redundant with respect to others. In case of perfect multicollinearity the predictor
matrix is singular and therefore cannot be inverted. Under these circumstances, for a general
linear model y = Xβ+ε, the ordinary least-squares estimator, βOLS = (XTX)−1XTy, does not exist.
An example where correlated predictors may produce an unstable model follows:
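As a minimal sketch on simulated data, the snippet below fits ordinary least squares on two nearly collinear predictors over a few random draws; the individual coefficient estimates fluctuate strongly even though their sum stays close to the true total effect.

import numpy as np
from sklearn import linear_model

np.random.seed(42)
n = 30
for draw in range(3):
    x1 = np.random.randn(n)
    x2 = x1 + np.random.randn(n) * 0.01        # x2 is almost a copy of x1
    y = x1 + x2 + np.random.randn(n) * 0.5     # true model uses both predictors
    X = np.column_stack([x1, x2])
    ols = linear_model.LinearRegression().fit(X, y)
    print("draw %i: coef = %s, sum = %.2f" % (draw, ols.coef_.round(1), ols.coef_.sum()))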
High dimensionality
High dimensions means a large number of input features. Linear predictors associate one parameter with each input feature, so a high-dimensional situation (P, the number of features, is large) with a relatively small number of samples N (the so-called large P, small N situation) generally leads to overfitting of the training data. Thus it is generally a bad idea to add many input features into the learner. This phenomenon is called the curse of dimensionality.
One of the most important criteria to use when choosing a learning algorithm is based on the
relative size of P and N.
• Remember that the “covariance” matrix XᵀX used in the linear model is a P × P matrix of rank min(N, P). Thus if P > N the equation system is over-parameterized and admits an infinity of solutions that might be specific to the learning dataset. See also ill-conditioned or singular matrices.
• The sampling density of N samples in a P-dimensional space is proportional to N^{1/P}. Thus a high-dimensional space becomes very sparse, leading to poor estimations of sample densities.
• Another consequence of the sparse sampling in high dimensions is that all sample points are close to an edge of the sample. Consider N data points uniformly distributed in a P-dimensional unit ball centered at the origin. Suppose we consider a nearest-neighbour estimate at the origin. The median distance from the origin to the closest data point is given by the expression
d(P, N) = (1 − (1/2)^{1/N})^{1/P}.
A more complicated expression exists for the mean distance to the closest point. For N = 500, P
= 10 , d(P,N) ≈ 0.52, more than halfway to the boundary. Hence most data points are closer to
the boundary of the sample space than to any other data point. The reason that this presents a
problem is that prediction is much more difficult near the edges of the training sample. One must
extrapolate from neighbouring sample points rather than interpolate between them. (Source: T
Hastie, R Tibshirani, J Friedman. The Elements of Statistical Learning: Data Mining, Inference, and
Prediction. Second Edition, 2009.)
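The exercises and the ridge, lasso and elastic-net examples below rely on two helper functions, fit_on_increasing_size and plot_r2_snr. A minimal sketch of helpers with these signatures (the dataset sizes, the signal construction and the plotting details are assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

def fit_on_increasing_size(model, n_samples=100, n_features_info=10, step=10, n_features_max=300):
    """Fit `model` on simulated datasets with an increasing number of features.

    Only the first `n_features_info` features carry signal; the others are pure noise.
    Returns the number of features, train/test R2 and the SNR for each dataset size.
    """
    np.random.seed(42)
    n_features = np.arange(10, n_features_max + 1, step)
    r2_train, r2_test, snr = [], [], []
    for p in n_features:
        X = np.random.randn(n_samples, p)
        beta = np.zeros(p)
        beta[:n_features_info] = 1.0                   # informative features
        signal, noise = np.dot(X, beta), np.random.randn(n_samples)
        y = signal + noise
        snr.append(signal.std() / noise.std())
        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=42)
        model.fit(Xtr, ytr)
        r2_train.append(r2_score(ytr, model.predict(Xtr)))
        r2_test.append(r2_score(yte, model.predict(Xte)))
    return n_features, np.array(r2_train), np.array(r2_test), np.array(snr)

def plot_r2_snr(n_features, r2_train, r2_test, argmax, snr, ax):
    """Plot train/test R2 and the SNR as a function of the number of features."""
    ax.plot(n_features, r2_train, label="Train R2")
    ax.plot(n_features, r2_test, label="Test R2")
    ax.plot(n_features, snr, linestyle="--", label="SNR")
    ax.axvline(argmax, color="k", linewidth=1)         # number of features with best test R2
    ax.set_xlabel("Number of input features")
    ax.legend()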
Exercises
Study the code above and:
• Describe the datasets: N: nb_samples, P: nb_features.
• What is n_features_info?
• Give the equation of the generative model.
• What is modified by the loop?
• What is the SNR?
Comment the graph above, in terms of training and test performances:
• How does the train and test performances change as a function of x?
• Is it the expected results when compared to the SNR?
• What can you conclude?
Ridge regression (ℓ2-regularization)
Ridge regression penalizes the MSE loss with the ℓ2 norm of the coefficients:
min_β Ridge(β) = (y − Xβ)ᵀ(y − Xβ) + λ ǁβǁ2².
∇β Ridge(β) = 0    (10.6)
∇β ((y − Xβ)ᵀ(y − Xβ) + λβᵀβ) = 0    (10.7)
∇β (yᵀy − 2βᵀXᵀy + βᵀXᵀXβ + λβᵀβ) = 0    (10.8)
−2Xᵀy + 2XᵀXβ + 2λβ = 0    (10.9)
−Xᵀy + (XᵀX + λI)β = 0    (10.10)
(XᵀX + λI)β = Xᵀy
β = (XᵀX + λI)⁻¹ Xᵀy
• The solution adds a positive constant to the diagonal of XᵀX before inversion. This makes the problem nonsingular, even if XᵀX is not of full rank, and was the main motivation behind ridge regression.
• Increasing λ shrinks the β coefficients toward 0.
• This approach penalizes the objective function by the Euclidean (ℓ2) norm of the coefficients such that solutions with large coefficients become unattractive.
The ridge penalty shrinks the coefficients toward zero. The figure illustrates the OLS solution in the left pane, the ℓ1 and ℓ2 penalties in the middle pane, and the penalized OLS in the right pane. The right pane shows how the penalties shrink the coefficients toward zero. The black points are the minimum found in each case, and the white points represent the true solution used to generate the data.
Fig. 10.1: Shrinkages
import matplotlib.pyplot as plt
import numpy as np
import sklearn.linear_model as lm

# lambda is alpha!
mod = lm.Ridge(alpha=10)
# Fit models on dataset
n_features, r2_train, r2_test, snr = fit_on_increasing_size(model=mod)
argmax = n_features[np.argmax(r2_test)]
# plot
fig, axis = plt.subplots(1, 2, figsize=(9, 3))
# Left pane: all features
plot_r2_snr(n_features, r2_train, r2_test, argmax, snr, axis[0])
# Right pane: Zoom on 100 first features
plot_r2_snr(n_features[n_features <= 100], r2_train[n_features <= 100],
            r2_test[n_features <= 100], argmax, snr[n_features <= 100], axis[1])
plt.tight_layout()
Exercise
What benefit has been obtained by using ℓ2 regularization?
Sparsity of the ℓ1 norm
Occam’s razor
Occam’s razor (also written as Ockham’s razor, and lex parsimoniae in Latin, which means law
of parsimony) is a problem solving principle attributed to William of Ockham (1287-1347), who
was an English Franciscan friar and scholastic philosopher and theologian. The principle can be
interpreted as stating that among competing hypotheses, the one with the fewest assumptions
should be selected.
Principle of parsimony
The simplest of two competing theories is to be preferred. Definition of parsimony: Economy of
explanation in conformity with Occam’s razor.
Among possible models with similar loss, choose the simplest one:
• Choose the model with the smallest coefficient vector, i.e. smallest ℓ2 (ǁβǁ2) or ℓ1 (ǁβǁ1) norm of β, i.e. ℓ2 or ℓ1 penalty. See also the bias-variance tradeoff.
• Choose the model that uses the smallest number of predictors. In other words, choose the
model that has many predictors with zero weights. Two approaches are available to obtain
this: (i) Perform a feature selection as a preprocessing prior to applying the learning
algorithm, or (ii) embed the feature selection procedure within the learning process.
Optimization issues
Section to be completed
• No more closed-form solution.
• Convex but not differentiable.
• Requires specific optimization algorithms, such as the fast iterative shrinkage-thresholding algorithm (FISTA): Amir Beck and Marc Teboulle, A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems, SIAM J. Imaging Sci., 2009.
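As a sketch of the ℓ1 penalty in practice, scikit-learn's Lasso can be run on the same simulation as the ridge example above (the alpha value is an arbitrary assumption); the elastic-net example that follows combines the ℓ1 and ℓ2 penalties.

import matplotlib.pyplot as plt
import numpy as np
import sklearn.linear_model as lm

mod = lm.Lasso(alpha=.1)   # lambda is alpha; the value is an arbitrary choice
# Fit models on dataset
n_features, r2_train, r2_test, snr = fit_on_increasing_size(model=mod)
argmax = n_features[np.argmax(r2_test)]
# plot
fig, axis = plt.subplots(1, 2, figsize=(9, 3))
plot_r2_snr(n_features, r2_train, r2_test, argmax, snr, axis[0])
plot_r2_snr(n_features[n_features <= 100], r2_train[n_features <= 100],
            r2_test[n_features <= 100], argmax, snr[n_features <= 100], axis[1])
plt.tight_layout()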
import matplotlib.pyplot as plt
import numpy as np
import sklearn.linear_model as lm

mod = lm.ElasticNet(alpha=.5, l1_ratio=.5)
# Fit models on dataset
n_features, r2_train, r2_test, snr = fit_on_increasing_size(model=mod)
argmax = n_features[np.argmax(r2_test)]
# plot
fig, axis = plt.subplots(1, 2, figsize=(9, 3))
# Left pane: all features
plot_r2_snr(n_features, r2_train, r2_test, argmax, snr, axis[0])
# Right pane: Zoom on 100 first features
plot_r2_snr(n_features[n_features <= 100], r2_train[n_features <= 100],
            r2_test[n_features <= 100], argmax, snr[n_features <= 100], axis[1])
plt.tight_layout()
CHAPTER
ELEVEN
LINEAR CLASSIFICATION
A linear classifier achieves its classification decision of yˆi based on the value of a linear
combination of the input features of a given sample xi, such that
yˆi = f(w · xi),
where w · xi := wTxi is the dot product between w and xi.
Let SW be the scatter “within-class” matrix, given by
SW = Xcᵀ Xc,    (11.2)
where Xc is the (N × P) matrix of data centered on their respective class means:
Xc = [X0 − µ0; X1 − µ1],
where X0 and X1 are the (N0 × P) and (N1 × P) matrices of samples of classes C0 and C1, and µ0 and µ1 are their respective means.
Let SB be the scatter “between-class” matrix, given by
SB = (µ1 − µ0)(µ1 − µ0)ᵀ.
The linear combination of features wᵀx has means wᵀµi for i = 0, 1, and variance wᵀXcᵀXcw. Fisher defined the separation between these two distributions to be the ratio of the variance between the classes to the variance within the classes:
FFisher(w) = (wᵀ SB w) / (wᵀ SW w).
Setting the gradient ∇w FFisher(w) = 0 leads to the generalized eigenvalue problem
SB w = λ SW w.
Since we do not care about the magnitude of w, only its direction, we replaced the scalar factor (wᵀ SB w)/(wᵀ SW w) by λ.
In the multiple-class case, the solutions w are determined by the eigenvectors of SW −1SB that
correspond to the K − 1 largest eigenvalues.
However, in the two-class case (in which SB = (µ1 − µ0)(µ1 − µ0)ᵀ) it is easy to show that w = SW⁻¹(µ1 − µ0) is the unique eigenvector of SW⁻¹SB:
SW⁻¹(µ1 − µ0)(µ1 − µ0)ᵀ w = λw
SW⁻¹(µ1 − µ0)(µ1 − µ0)ᵀ SW⁻¹(µ1 − µ0) = λ SW⁻¹(µ1 − µ0),
where here
λ = (µ1 − µ0)ᵀ SW⁻¹(µ1 − µ0),
which leads to the result
w ∝ SW⁻¹(µ1 − µ0).
Fig. 11.1: The Fisher most discriminant projection
Exercise
Write a class FisherLinearDiscriminant that implements the Fisher’s linear discriminant analysis.
This class must be compliant with the scikit-learn API by providing two methods: fit(X,y) which
fits the model and returns the object itself; predict(X) which returns a vector of the predicted
values. Apply the object on the dataset presented for the LDA.
Linear discriminant analysis (LDA)
The posterior probability of class Ck given x is obtained from Bayes’ rule:
p(Ck | x) = p(x | Ck) p(Ck) / p(x),
where p(x) is the marginal distribution obtained by summing over the classes: p(x) = ∑k p(x | Ck) p(Ck). As usual, the denominator in Bayes’ theorem can be found in terms of the quantities appearing in the numerator. LDA is a generative model since the class-conditional distributions can be used to generate samples of each class.
LDA is useful to deal with imbalanced group sizes (e.g. N1 ≫ N0) since prior probabilities can be used to explicitly re-balance the classification by setting p(C0) = p(C1) = 1/2 or whatever seems relevant.
LDA can be generalized to the multiclass case with K > 2.
With N1 = N0, LDA leads to the same solution as Fisher’s linear discriminant.
Exercise
How many parameters need to be estimated to perform an LDA?
%matplotlib inline
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# Dataset
n_samples, n_features = 100, 2
mean0, mean1 = np.array([0, 0]), np.array([0, 2])
Cov = np.array([[1, .8],[.8, 1]])
np.random.seed(42)
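A minimal sketch that draws the two Gaussian classes from the means and covariance defined above, builds the dataset X, y (also reused by the logistic regression example below) and fits scikit-learn's LDA:

X0 = np.random.multivariate_normal(mean0, Cov, n_samples)   # class C0
X1 = np.random.multivariate_normal(mean1, Cov, n_samples)   # class C1
X = np.vstack([X0, X1])
y = np.array([0] * n_samples + [1] * n_samples)

lda = LDA()
lda.fit(X, y)
y_pred_lda = lda.predict(X)
errors = y_pred_lda != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_lda)))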
Logistic regression
Logistic regression is a generalized linear model, i.e. a linear model with a link function that maps the output of a linear multiple regression to the posterior probability of each class p(Ck | x) ∈ [0, 1], using the logistic sigmoid function:
p(Ck | w, xi) = 1 / (1 + exp(−w · xi)).
Logistic regression seeks to minimize the negative log-likelihood −log L(w) as its loss function.
Logistic regression is a discriminative model since it focuses only on the posterior probability of each class p(Ck | x). It only requires estimating the P weights of the w vector. Thus it should be favoured over LDA with many input features. In small dimensions and balanced situations it would provide similar predictions to LDA.
However imbalanced group sizes cannot be explicitly controlled. It can be managed using a
reweighting of the input samples.
from sklearn import linear_model
logreg = linear_model.LogisticRegression(C=1e8)
# This class implements regularized logistic regression. C is the Inverse of regularization strength.
# Large value => no regularization.
logreg.fit(X, y)
y_pred_logreg = logreg.predict(X)
errors = y_pred_logreg != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_logreg)))
print(logreg.coef_)
Exercise
Explore the Logistic Regression parameters and propose a solution for the case of a highly imbalanced training dataset N1 ≫ N0 when we know that in reality both classes have the same probability p(C1) = p(C0).
Overfitting
VC dimension (for Vapnik–Chervonenkis dimension) is a measure of the capacity (complexity,
expressive power, richness, or flexibility) of a statistical classification algorithm, defined as the
cardinality of the largest set of points that the algorithm can shatter.
Theorem: Linear classifier in RP have VC dimension of P +1. Hence in dimension two (P = 2) any
random partition of 3 points can be learned.
Lasso logistic regression (L1-regularization)
The objective function to be minimized is now the combination of the logistic loss logL(w) with a
penalty of the L1 norm of the weights vector. In the two-class case, using the 0/1 coding we
obtain:
min Logistic Lasso(w) = −log L(w) + λ ǁwǁ1    (11.14)
                      = −∑_{i=1}^{N} {yi w · xi − log(1 + exp(w · xi))} + λ ǁwǁ1    (11.15)
from sklearn import linear_model
lrl1 = linear_model.LogisticRegression(penalty='l1')
# This class implements regularized logistic regression. C is the Inverse of regularization strength.
# Large value => no regularization.
lrl1.fit(X, y)
y_pred_lrl1 = lrl1.predict(X)
errors = y_pred_lrl1 != y
print("Nb errors=%i, error
rate=%.2f" % (errors.sum(),
errors.sum() / len(y_pred_lrl1)))
print(lrl1.coef_)
Nb errors=27, error rate=0.27
[[-0.11335795  0.68150965  0.          0.          0.19754476  0.36480308
Here we introduced the slack variables ξi, with ξi = 0 for points that are on or inside the correct margin boundary and ξi = |yi − (w · xi)| for other points. Thus:
1. If yi(w · xi) ≥ 1 then the point lies outside the margin but on the correct side of the decision
boundary. In this case ξi = 0. The constraint is thus not active for this point. It does not
contribute to the prediction.
2. If 1 > yi(w · xi) ≥ 0 then the point lies inside the margin and on the correct side of the
decision boundary. In this case 0 < ξi ≤ 1. The constraint is active for this point. It does
contribute to the prediction as a support vector.
3. If 0 < yi(w · xi)) then the point is on the wrong side of the decision boundary
(misclassification). In this case 0 < ξi > 1. The constraint is active for this point. It does
contribute to the prediction as a support vector.
This loss is called the hinge loss, defined as:
max(0,1 − yi (w · xi))
So the linear SVM is close to ridge logistic regression, using the hinge loss instead of the logistic loss. Both will provide very similar predictions.
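A minimal sketch (assuming scikit-learn's LinearSVC, whose default is an ℓ2 penalty with the squared hinge loss) fitted on the same X, y for comparison with the regularized logistic regressions above:

from sklearn import svm

svmlin = svm.LinearSVC(C=1.0)    # l2 penalty, (squared) hinge loss
svmlin.fit(X, y)
y_pred_svmlin = svmlin.predict(X)
errors = y_pred_svmlin != y
print("Nb errors=%i, error rate=%.2f" % (errors.sum(), errors.sum() / len(y_pred_svmlin)))
print(svmlin.coef_)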
Exercise
Compare the predictions of logistic regression (LR) and their SVM counterparts, i.e. L2 LR vs L2 SVM and L1 LR vs L1 SVM.
Exercise
Compare the predictions of Elastic-net logistic regression (LR) and Hinge-loss Elastic-net:
• Compute the correlation between pairs of weight vectors.
• Compare the predictions of the two classifiers using their decision functions:
– Compute the correlation between the decision functions.
– Plot the pairwise decision functions of the classifiers.
• Conclude on the differences between the two losses.
Imagine a study evaluating a new test that screens people for a disease. Each person taking the
test either has or does not have the disease. The test outcome can be positive (classifying the
person as having the disease) or negative (classifying the person as not having the disease).
The test results for each subject may or may not match the subject’s actual status. In that
setting:
• True positive (TP): Sick people correctly identified as sick
• False positive (FP): Healthy people incorrectly identified as sick
• True negative (TN): Healthy people correctly identified as healthy
• False negative (FN): Sick people incorrectly identified as healthy
• Accuracy (ACC):
ACC = (TP + TN) / (TP + FP + FN + TN)
• Sensitivity (SEN) or recall of the positive class or true positive rate (TPR) or hit rate:
SEN = TP / P = TP / (TP+FN)
• Specificity (SPC) or recall of the negative class or true negative rate:
SPC = TN / N = TN / (TN+FP)
• Precision or positive predictive value (PPV):
PPV = TP / (TP + FP)
• Balanced accuracy (bACC): a useful performance measure that avoids inflated performance
estimates on imbalanced datasets (Brodersen et al. (2010), “The balanced accuracy and its
posterior distribution”). It is defined as the arithmetic mean of sensitivity and specificity, or
equivalently the average accuracy obtained on each class:
bACC = 1/2 * (SEN + SPC)
• F1 score (or F-score): the harmonic mean of precision and recall; it is useful to deal with
imbalanced datasets.
The four outcomes can be formulated in a 2×2 contingency table or confusion matrix; see
https://en.wikipedia.org/wiki/Sensitivity_and_specificity.
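As a small, purely illustrative example (the labels below are made up), these quantities can be computed directly from scikit-learn's confusion matrix:
import numpy as np
from sklearn import metrics

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])   # 1 = sick, 0 = healthy
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])   # hypothetical test results

# Confusion matrix with rows = true class, columns = predicted class
tn, fp, fn, tp = [float(v) for v in metrics.confusion_matrix(y_true, y_pred).ravel()]

acc = (tp + tn) / (tp + fp + fn + tn)
sen = tp / (tp + fn)          # sensitivity / recall of the positive class
spc = tn / (tn + fp)          # specificity / recall of the negative class
ppv = tp / (tp + fp)          # precision
bacc = (sen + spc) / 2        # balanced accuracy

print("ACC=%.2f SEN=%.2f SPC=%.2f PPV=%.2f bACC=%.2f" % (acc, sen, spc, ppv, bacc))
print("F1=%.2f" % metrics.f1_score(y_true, y_pred))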
In this case it is recommended to use the AUC of a ROC analysis, which basically provides a
measure of the overlap of the two classes when points are projected on the discriminative axis.
For more detail on ROC and AUC see:
https://en.wikipedia.org/wiki/Receiver_operating_characteristic.
from sklearn import metrics
score_pred = np.array([.1, .2, .3, .4, .5, .6, .7, .8])
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
thres = .9
y_pred = (score_pred > thres).astype(int)
print("Predictions:", y_pred)
metrics.accuracy_score(y_true, y_pred)
# The overall precision and recall on each individual class
p, r, f, s = metrics.precision_recall_fscore_support(y_true, y_pred)
print("Recalls:", r)
# 100% of specificity, 0% of sensitivity
# However AUC=1 indicating a perfect separation of the two classes
auc = metrics.roc_auc_score(y_true, score_pred)
print("AUC:", auc)
Predictions: [0 0 0 0 0 0 0 0]
Recalls: [ 1. 0.]
AUC: 1.0
Imbalanced classes
Learning with discriminative (logistic regression, SVM) methods is generally based on
minimizing the misclassification of training samples, which may be unsuitable for imbalanced
datasets where the recognition might be biased in favour of the most numerous class. This
problem can be addressed with a generative approach, which typically requires more
parameters to be determined leading to reduced performances in high dimension.
Dealing with imbalanced classes may be addressed in three main ways (see Japkowicz and
Stephen (2002) for a review): resampling, reweighting and one-class learning.
In sampling strategies, either the minority class is oversampled, the majority class is
undersampled, or some combination of the two is deployed. Undersampling (Zhang and Mani, 2003)
the majority class would lead to a poor usage of the left-out samples. Sometimes one cannot
afford such a strategy since we also face a small sample size problem, even for the majority
class. Informed oversampling, which goes beyond a trivial duplication of minority class samples,
requires the estimation of class conditional distributions in order to generate synthetic samples.
Here generative models are required. An alternative, proposed in (Chawla et al., 2002), generates
samples along the line segments joining any/all of the k nearest minority-class neighbors. Such a
procedure blindly generalizes the minority area without regard to the majority class, which may
be particularly problematic with high-dimensional and potentially skewed class distributions.
Reweighting, also called cost-sensitive learning, works at the algorithmic level by adjusting the
costs of the various classes to counter the class imbalance. Such reweighting can be
implemented within SVM (Chang and Lin, 2001) or logistic regression (Friedman et al., 2010)
classifiers. Most classifiers of scikit-learn offer such reweighting possibilities.
The class_weight parameter can be set to the "balanced" mode, which uses the values of y to
automatically adjust the weights inversely proportional to the class frequencies in the input data,
as N/(2 Nk).
import numpy as np
from sklearn import linear_model
from sklearn import datasets
from sklearn import metrics
import matplotlib.pyplot as plt
# dataset
X, y = datasets.make_classification(n_samples=500, n_features=5,
n_informative=2, n_redundant=0,
n_repeated=0, n_classes=2,
random_state=1, shuffle=False)
print(*["#samples of class %i = %i;" % (lev, np.sum(y == lev)) for lev in np.unique(y)])
print('# No Reweighting balanced dataset')
lr_inter = linear_model.LogisticRegression(C=1)
lr_inter.fit(X, y)
p, r, f, s = metrics.precision_recall_fscore_support(y, lr_inter.predict(X))
print("SPC: %.3f; SEN: %.3f" % tuple(r))
print('# => The predictions are balanced in sensitivity and specificity\n')
# Create an imbalanced dataset by subsampling class 0: keep only a small fraction
# of class 0's samples and all of class 1's samples.
n0 = int(np.rint(np.sum(y == 0) / 20))
subsample_idx = np.concatenate((np.where(y == 0)[0][:n0], np.where(y == 1)[0]))
Ximb = X[subsample_idx, :]
yimb = y[subsample_idx]
print(*["#samples of class %i = %i;" % (lev, np.sum(yimb == lev))
for lev in np.unique(yimb)])
print('# No Reweighting on imbalanced dataset')
lr_inter = linear_model.LogisticRegression(C=1)
lr_inter.fit(Ximb, yimb)
p, r, f, s = metrics.precision_recall_fscore_support(yimb, lr_inter.predict(Ximb))
print("SPC: %.3f; SEN: %.3f" % tuple(r))
print('# => Sensitivity >> specificity\n')
print('# Reweighting on imbalanced dataset')
lr_inter_reweight = linear_model.LogisticRegression(C=1, class_weight="balanced")
lr_inter_reweight.fit(Ximb, yimb)
p, r, f, s = metrics.precision_recall_fscore_support(yimb, lr_inter_reweight.predict(Ximb))
print("SPC: %.3f; SEN: %.3f" % tuple(r))
print('# => The predictions are balanced in sensitivity and specificity\n')
File "<ipython-input-34-2de881c6d3f4>", line 43 lr_inter_reweight = linear_model.LogisticRegression(C=1,
clas\boldsymbol{S_W} ,→eight="balanced")
, ^
SyntaxError: unexpected character after line continuation character
CHAPTER
TWELVE
K(xi, xj) = exp(−ǁxi − xjǁ² / (2σ²)) = exp(−γ ǁxi − xjǁ²)    (12.1)
where σ (or γ) defines the kernel width parameter. Basically, we consider a Gaussian function
centered on each training sample xi. It has a ready interpretation as a similarity measure, as it
decreases with the squared Euclidean distance between the two feature vectors.
Non-linear SVMs also exist for regression problems.
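As a quick, illustrative check of the kernel definition (12.1), the Gaussian kernel matrix computed explicitly with NumPy matches scikit-learn's rbf_kernel for the same gamma; the data and gamma value below are arbitrary:
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, euclidean_distances

X = np.random.randn(5, 2)
gamma = 0.5  # gamma = 1 / (2 * sigma**2)

# K(xi, xj) = exp(-gamma * ||xi - xj||^2), computed explicitly
K_manual = np.exp(-gamma * euclidean_distances(X, X, squared=True))
K_sklearn = rbf_kernel(X, X, gamma=gamma)

print(np.allclose(K_manual, K_sklearn))  # True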
Fig. 12.1: Support Vector Machines.
import numpy as np
from sklearn.svm import SVC
from sklearn import datasets
import matplotlib.pyplot as plt
# dataset
X, y = datasets.make_classification(n_samples=10, n_features=2,n_redundant=0, n_classes=2,
random_state=1, shuffle=False)
clf = SVC(kernel='rbf')
#, gamma=1)
clf.fit(X, y)
print("#Errors: %i" % np.sum(y != clf.predict(X)))
clf.decision_function(X)
# Useful internals:
# Array of support vectors
clf.support_vectors_
# indices of support vectors within original X
np.all(X[clf.support_,:] == clf.support_vectors_)
#Errors: 0
True
Random forest
A random forest is a meta-estimator that fits a number of decision-tree learners on various
sub-samples of the dataset and uses averaging to improve the predictive accuracy and control
over-fitting.
Decision trees are simple to understand and interpret; however, they tend to overfit the training
set. Leo Breiman proposed random forests to deal with this issue.
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 100)
forest.fit(X, y)
print("#Errors: %i" % np.sum(y != forest.predict(X)))
CHAPTER
THIRTEEN
RESAMPLING METHODS
The test error is the average error that results from using a learning method to predict the
response on new samples, that is, on samples that were not used in training the method. Given a
data set, the use of a particular learning method is warranted if it results in a low test error. The
test error can be easily calculated if a designated test set is available. Unfortunately, this is
usually not the case.
Thus the original dataset is generally split into a training and a test (or validation) data set. A
large training set (80%) with a small test set (20%) might provide a poor estimation of the
predictive performance. On the contrary, a large test set with a small training set might produce
a poorly estimated learner. This is why, in situations where we cannot afford such a split, it is
recommended to use a cross-validation scheme to estimate the predictive power of a learning
algorithm.
Cross-Validation (CV)
A cross-validation scheme randomly divides the set of observations into K groups, or folds, of
approximately equal size. The first fold is treated as a validation set, and the method f() is fitted
on the remaining union of K − 1 folds: (f(X−K, y−K)).
The mean error measure (generally a loss function) is evaluated on the observations of the
held-out fold. For each sample i we consider the model estimated on the data set that did
not contain it, noted −K(i). This procedure is repeated K times; each time, a different group of
observations is treated as a test set. Then we compare the predicted value f(X−K(i)) = ŷi with the
true value yi using an error function L(). The cross-validation estimate of the prediction error is

CV(f) = (1/N) ∑_{i=1}^{N} L(yi, f(X−K(i))).
This validation scheme is known as the K-Fold CV. Typical choices of K are 5 or 10, [Kohavi
1995]. The extreme case where K = N is known as leave-one-out cross-validation, LOO-CV.
CV for regression
Usually the error function L() is the r-squared score. However, other functions could be used.
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.cross_validation import KFold  # sklearn.model_selection in recent scikit-learn
X, y = datasets.make_regression(n_samples=100, n_features=100, n_informative=10, random_state=42)
model = lm.Ridge(alpha=10)
cv = KFold(len(y), n_folds=5, random_state=42)
y_test_pred = np.zeros(len(y))
y_train_pred = np.zeros(len(y))
for train, test in cv:
    X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
    model.fit(X_train, y_train)
    y_test_pred[test] = model.predict(X_test)
    y_train_pred[train] = model.predict(X_train)
print("Train r2:%.2f" % metrics.r2_score(y, y_train_pred))
print("Test r2:%.2f" % metrics.r2_score(y, y_test_pred))
Train r2:0.99
Test r2:0.72
CV for classification
With classification problems it is essential to sample folds where each set contains
approximately the same percentage of samples of each target class as the complete set. This is
called stratification. In this case, we will use StratifiedKFold, which is a variation of K-fold that
returns stratified folds.
Usually the error functions L() are, at least, the sensitivity and the specificity. However, other
functions could be used.
from sklearn.cross_validation import StratifiedKFold  # sklearn.model_selection in recent scikit-learn
# X, y is assumed to be a classification dataset, e.g. the one built with
# datasets.make_classification() above.
model = lm.LogisticRegression(C=1)
cv = StratifiedKFold(y, n_folds=5)
y_test_pred = np.zeros(len(y))
y_train_pred = np.zeros(len(y))
for train, test in cv:
    X_train, X_test, y_train, y_test = X[train, :], X[test, :], y[train], y[test]
    model.fit(X_train, y_train)
    y_test_pred[test] = model.predict(X_test)
    y_train_pred[train] = model.predict(X_train)
recall_test = metrics.recall_score(y, y_test_pred, average=None)
recall_train = metrics.recall_score(y, y_train_pred, average=None)
acc_test = metrics.accuracy_score(y, y_test_pred)
print("Train SPC:%.2f; SEN:%.2f" % tuple(recall_train))
print("Test SPC:%.2f; SEN:%.2f" % tuple(recall_test))
print("Test ACC:%.2f" % acc_test)
Train SPC:1.00; SEN:1.00
Test SPC:0.80; SEN:0.82
Test ACC:0.81
Scikit-learn provides a user-friendly function to perform CV:
from sklearn.cross_validation import cross_val_score
scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
scores.mean()
# provide CV and score
def balanced_acc(estimator, X, y):
    '''
    Balanced accuracy scorer
    '''
    return metrics.recall_score(y, estimator.predict(X), average=None).mean()
scores = cross_val_score(estimator=model, X=X, y=y, cv=cv, scoring=balanced_acc)
print("Test ACC:%.2f" % scores.mean())
Test ACC:0.81
Note that with the Scikit-learn user-friendly function we average the scores obtained on
individual folds, which may provide slightly different results than the overall average presented
earlier.
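The small sketch below is added for illustration (it is not part of the original text): averaging per-fold accuracies and computing the accuracy of the pooled out-of-fold predictions can give slightly different numbers when fold sizes are unequal. The dataset is synthetic and arbitrary.
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
from sklearn.cross_validation import StratifiedKFold  # sklearn.model_selection in recent scikit-learn

X, y = datasets.make_classification(n_samples=101, n_features=5, random_state=42)
model = lm.LogisticRegression(C=1)

cv = StratifiedKFold(y, n_folds=5)
fold_scores = []
y_test_pred = np.zeros(len(y))
for train, test in cv:
    model.fit(X[train, :], y[train])
    y_test_pred[test] = model.predict(X[test, :])
    fold_scores.append(metrics.accuracy_score(y[test], y_test_pred[test]))

print("Mean of per-fold accuracies:   %.4f" % np.mean(fold_scores))
print("Accuracy of pooled predictions: %.4f" % metrics.accuracy_score(y, y_test_pred))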
2. Model selection: estimating the performance of different models in order to choose the best
one. One special case of model selection is the selection of the model's hyper-parameters.
Indeed, remember that most learning algorithms have hyper-parameters (typically the
regularization parameter) that have to be set.
Generally we must address the two problems (model assessment and model selection)
simultaneously. The usual approach for both problems is to randomly divide the dataset into
three parts: a training set, a validation set, and a test set (see the sketch after the list below).
• The training set (train) is used to fit the models;
• the validation set (val) is used to estimate the prediction error for model selection or to
determine the hyper-parameters over a grid of possible values;
• the test set (test) is used for assessment of the generalization error of the final chosen
model.
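A minimal sketch of such a three-way split, using scikit-learn's train_test_split twice (the 60/20/20 proportions and the synthetic dataset are arbitrary choices):
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in recent scikit-learn
from sklearn import datasets

X, y = datasets.make_classification(n_samples=1000, n_features=10, random_state=42)

# First split off 40%, then cut that part in half: 60% train / 20% validation / 20% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(y_train), len(y_val), len(y_test))  # 600 200 200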
One inner CV loop is used for model selection. For each run of the outer loop, the inner loop performs L
splits of the dataset (X−K, y−K) into a training set (X−K,−L, y−K,−L) and a validation set (X−K,L, y−K,L).
Implementation with scikit-learn
Note that the inner CV loop combined with the learner forms a new learner with an automatic
model (parameter) selection procedure. This new learner can be easily constructed using Scikit-
learn: the learner is wrapped inside a GridSearchCV class.
Then the new learner can be plugged into the classical outer CV loop.
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
from sklearn.grid_search import GridSearchCV
import sklearn.metrics as metrics
from sklearn.cross_validation import KFold
# Dataset
noise_sd = 10
Test r2:0.55
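The code block above is truncated at the dataset definition. As a minimal sketch of the nested-CV pattern it describes, assuming a Ridge learner and an arbitrary alpha grid (both are illustrative choices, not the original code):
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
from sklearn.grid_search import GridSearchCV           # sklearn.model_selection in recent scikit-learn
from sklearn.cross_validation import cross_val_score   # idem

X, y = datasets.make_regression(n_samples=100, n_features=100,
                                n_informative=10, noise=10, random_state=42)

# Inner CV: selects the regularization strength alpha on each outer training set
ridge_cv = GridSearchCV(lm.Ridge(), cv=5,
                        param_grid={'alpha': [0.01, 0.1, 1, 10, 100]})

# Outer CV: estimates the generalization performance of the whole procedure
scores = cross_val_score(estimator=ridge_cv, X=X, y=y, cv=5)
print("Test r2:%.2f" % scores.mean())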
Random Permutations
A permutation test is a type of non-parametric randomization test in which the null distribution of
a test statistic is estimated by randomly permuting the observations.
Permutation tests are highly attractive because they make no assumptions other than that the
observations are independent and identically distributed under the null hypothesis.
1. Compute an observed statistic tobs on the data.
2. Use randomization to compute the distribution of t under the null hypothesis: perform N
random permutations of the data and, for each permuted sample i, compute the statistic ti.
This procedure provides the distribution of t under the null hypothesis H0: P(t|H0).
3. Compute the p-value = P(t ≥ tobs|H0) = |{ti ≥ tobs}| / N, where the ti include tobs.
plt.legend(loc="upper left")
# One-tailed empirical p-value
pval_perm = np.sum(perms >= perms[0]) / perms.shape[0]
# Compare with Pearson's correlation test
_, pval_test = stats.pearsonr(x, y)
print("Permutation two tailed p-value=%.5f. Pearson test p-value=%.5f" % (2 * pval_perm, pval_test))
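The fragment above depends on variables (x, y, perms) defined in code that is not reproduced here. A self-contained sketch of the same idea, with an arbitrary synthetic dataset and number of permutations, might look like this:
import numpy as np
from scipy import stats

np.random.seed(42)
n, nperm = 100, 10000
x = np.random.randn(n)
y = 0.3 * x + np.random.randn(n)   # weakly correlated with x

tobs = np.corrcoef(x, y)[0, 1]     # observed statistic

# Null distribution: correlation after randomly permuting y
tperm = np.array([np.corrcoef(x, np.random.permutation(y))[0, 1]
                  for _ in range(nperm)])

# Two-tailed empirical p-value, compared with the parametric Pearson test
pval_perm = np.sum(np.abs(tperm) >= np.abs(tobs)) / float(nperm)
_, pval_test = stats.pearsonr(x, y)
print("Permutation p-value=%.5f. Pearson test p-value=%.5f" % (pval_perm, pval_test))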
Exercise
Given the logistic regression presented above and its validation using a 5-fold CV:
1. Compute the p-value associated with the prediction accuracy using a permutation test.
2. Compute the p-value associated with the prediction accuracy using a parametric test.
Bootstrapping
Bootstrapping is a random sampling with replacement strategy which provides a non-parametric
method to assess the variability of performance scores, such as standard errors or confidence
intervals.
A great advantage of the bootstrap is its simplicity. It is a straightforward way to derive estimates
of standard errors and confidence intervals for complex estimators of complex parameters of the
distribution, such as percentile points, proportions, odds ratios, and correlation coefficients.
1. Perform B samplings, with replacement, of the dataset.
2. For each sample i, fit the model and compute the scores.
3. Assess standard errors and confidence intervals of the scores using the scores obtained on
the B resampled datasets.
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
import sklearn.metrics as metrics
import pandas as pd
# Regression dataset
n_features = 5
n_features_info = 2
n_samples = 100
X = np.random.randn(n_samples, n_features)
beta = np.zeros(n_features)
beta[:n_features_info] = 1
Xbeta = np.dot(X, beta)
eps = np.random.randn(n_samples)
y = Xbeta + eps
# Fit model on all data (!! risk of overfit)
model = lm.RidgeCV()
model.fit(X, y)
print("Coefficients on all data:")
print(model.coef_)
# Bootstrap loop
nboot = 100 # !! Should be at least 1000
scores_names = ["r2"]
scores_boot = np.zeros((nboot, len(scores_names)))
coefs_boot = np.zeros((nboot, X.shape[1]))
orig_all = np.arange(X.shape[0])
for boot_i in range(nboot):
    boot_tr = np.random.choice(orig_all, size=len(orig_all), replace=True)
    boot_te = np.setdiff1d(orig_all, boot_tr, assume_unique=False)
    Xtr, ytr = X[boot_tr, :], y[boot_tr]
    Xte, yte = X[boot_te, :], y[boot_te]
    model.fit(Xtr, ytr)
    y_pred = model.predict(Xte).ravel()
    scores_boot[boot_i, :] = metrics.r2_score(yte, y_pred)
    coefs_boot[boot_i, :] = model.coef_

# Compute Mean, SE, CI
scores_boot = pd.DataFrame(scores_boot, columns=scores_names)
scores_stat = scores_boot.describe(percentiles=[.99, .95, .5, .1, .05, 0.01])
print("r-squared: Mean=%.2f, SE=%.2f, CI=(%.2f %.2f)" %
      tuple(scores_stat.loc[["mean", "std", "5%", "95%"], "r2"]))
coefs_boot = pd.DataFrame(coefs_boot)
coefs_stat = coefs_boot.describe(percentiles=[.99, .95, .5, .1, .05, 0.01])
print("Coefficients distribution")
print(coefs_stat)
Coefficients on all data:
[ 0.98143428  0.84248041  0.12029217  0.09319979  0.08717254]
r-squared: Mean=0.57, SE=0.09, CI=(0.39 0.70)
Coefficients distribution
                0           1           2           3           4
count  100.000000  100.000000  100.000000  100.000000  100.000000
mean     0.975189    0.831922    0.116888    0.099109    0.085516
std      0.106367    0.096548    0.108676    0.090312    0.091446
min      0.745082    0.593736   -0.112740   -0.126522   -0.141713
1%       0.770362    0.640142   -0.088238   -0.094403   -0.113375
5%       0.787463    0.657473   -0.045593   -0.046201   -0.090458
10%      0.829129    0.706492   -0.037838   -0.020650   -0.044990
50%      0.980603    0.835724    0.133070    0.093240    0.088968
95%      1.127518    0.999604    0.278735    0.251137    0.221887
99%      1.144834    1.036715    0.292784    0.291197    0.287006
max      1.146670    1.077265    0.324374    0.298135    0.289569
CHAPTER
FOURTEEN
import numpy as np
# dataset
np.random.seed(42)
n_samples, n_features, n_features_info = 100, 5, 3
X = np.random.randn(n_samples, n_features)
beta = np.zeros(n_features)
beta[:n_features_info] = 1
Xbeta = np.dot(X, beta)
eps = np.random.randn(n_samples)
y = Xbeta + eps
X[:, 0] *= 1e6 # inflate the first feature
X[:, 1] += 1e6 # bias the second feature
y = 100 * y + 1000 # bias and scale the output
import sklearn.linear_model as lm
from sklearn import preprocessing
from sklearn.cross_validation import cross_val_score
print("== Linear regression: scaling is not required ==")
model =lm.LinearRegression()
model.fit(X, y)
print("Coefficients:", model.coef_, model.intercept_)
print("Test R2:%.2f" % cross_val_score(estimator=model, X=X, y=y, cv=5).mean())
print("== Lasso without scaling ==")
model = lm.LassoCV()
model.fit(X, y)
print("Coefficients:", model.coef_, model.intercept_)
print("Test R2:%.2f" % cross_val_score(estimator=model, X=X, y=y, cv=5).mean())
print("== Lasso with scaling ==")
model = lm.LassoCV()
scaler = preprocessing.StandardScaler()
Xc = scaler.fit(X).transform(X)
model.fit(Xc, y)
print("Coefficients:", model.coef_, model.intercept_)
print("Test R2:%.2f" % cross_val_score(estimator=model, X=Xc, y=y, cv=5).mean())
== Linear regression: scaling is not required ==
Coefficients: [  1.05421281e-04   1.13551103e+02   9.78705905e+01   1.60747221e+01
  -7.23145329e-01] -113550117.827
Test R2:0.77
== Lasso without scaling ==
Coefficients: [  8.61125764e-05   0.00000000e+00   0.00000000e+00   0.00000000e+00
   0.00000000e+00] 986.15608907
Test R2:0.09
== Lasso with scaling ==
Coefficients: [ 87.46834069  105.13635448   91.22718731    9.22953206   -0.        ] 982.302793647
Test R2:0.77
Scikit-learn pipelines
Sources: http://scikit-learn.org/stable/modules/pipeline.html
Note that statistics such as the mean and standard deviation are computed from the training
data, not from the validation or test data. The validation and test data must be standardized
using the statistics computed from the training data. Thus Standardization should be merged
together with the learner using a Pipeline.
A Pipeline chains multiple estimators into one. All estimators in a pipeline, except the last one,
must have the fit() and transform() methods. The last one must implement at least the fit() and
predict() methods.
Standardization of input features
from sklearn import preprocessing
import sklearn.linear_model as lm
from sklearn.pipeline import make_pipeline
model = make_pipeline(preprocessing.StandardScaler(), lm.LassoCV())
# or
from sklearn.pipeline import Pipeline
model = Pipeline([('standardscaler', preprocessing.StandardScaler()),
                  ('lassocv', lm.LassoCV())])
scores = cross_val_score(estimator=model, X=X, y=y, cv=5)
print("Test r2:%.2f" % scores.mean())
Test r2:0.77
Features selection
An alternative to feature selection based on the ℓ1 penalty is to use a preprocessing step of
univariate feature selection.
Such methods, called filters, are a simple, widely used method for supervised dimension
reduction [26]. Filters are univariate methods that rank features according to their ability to
predict the target, independently of the other features. This ranking may be based on parametric
(e.g., t-tests) or non-parametric (e.g., Wilcoxon tests) statistical methods. Filters are
computationally efficient and more robust to overfitting than multivariate methods.
import numpy as np
import sklearn.linear_model as lm
from sklearn import preprocessing
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
np.random.seed(42)
n_samples, n_features, n_features_info = 100, 100, 3
X = np.random.randn(n_samples, n_features)
beta = np.zeros(n_features)
beta[:n_features_info] = 1
Xbeta = np.dot(X, beta)
eps = np.random.randn(n_samples)
y = Xbeta + eps
X[:, 0] *= 1e6 # inflate the first feature
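The block above stops right after building the dataset. A minimal sketch of the filter approach it sets up, assuming an arbitrary choice of k = 10 selected features and a plain linear regression as the final learner:
import numpy as np
from sklearn import preprocessing
import sklearn.linear_model as lm
from sklearn.cross_validation import cross_val_score   # sklearn.model_selection in recent scikit-learn
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline

np.random.seed(42)
n_samples, n_features, n_features_info = 100, 100, 3
X = np.random.randn(n_samples, n_features)
beta = np.zeros(n_features)
beta[:n_features_info] = 1
y = np.dot(X, beta) + np.random.randn(n_samples)

# Filter: keep the 10 features most associated with y (univariate F-test),
# then fit a linear regression on the selected, standardized features.
anova_lm = Pipeline([
    ('standardscaler', preprocessing.StandardScaler()),
    ('selectkbest', SelectKBest(f_regression, k=10)),
    ('lm', lm.LinearRegression()),
])
scores = cross_val_score(estimator=anova_lm, X=X, y=y, cv=5)
print("Test r2:%.2f" % scores.mean())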
Regression pipelines with CV for parameters selection
Now we combine standardization of the input features, feature selection and a learner with
hyper-parameters within a pipeline, which is wrapped in a grid-search procedure to select the best
hyper-parameters based on an (inner) CV. The whole is plugged into an outer CV.
import numpy as np
from sklearn import datasets
import sklearn.linear_model as lm
from sklearn import preprocessing
from sklearn.cross_validation import cross_val_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
import sklearn.metrics as metrics
# Datasets
n_samples, n_features, noise_sd = 100, 100, 20
X, y, coef = datasets.make_regression(n_samples=n_samples, n_features=n_features, noise=noise_sd,
n_informative=5, random_state=42, coef=True)
# Use this to tune the noise parameter such that snr < 5
print("SNR:", np.std(np.dot(X, coef)) / noise_sd)
print("=============================")
print("== Basic linear regression ==")
print("=============================")
scores = cross_val_score(estimator=lm.LinearRegression(), X=X, y=y, cv=5)
print("Test r2:%.2f" % scores.mean())
print("==============================================")
print("== Scaler + anova filter + ridge regression ==")
print("==============================================")
anova_ridge = Pipeline([
    ('standardscaler', preprocessing.StandardScaler()),
    ('selectkbest', SelectKBest(f_regression)),
    ('ridge', lm.Ridge())
])
param_grid = {'selectkbest__k': np.arange(10, 110, 10),
              'ridge__alpha': [.001, .01, .1, 1, 10, 100]}
# Expects execution in IPython; for plain Python remove the %time magic
print("----------------------------")
print("-- Parallelize inner loop --")
print("----------------------------")
anova_ridge_cv = GridSearchCV(anova_ridge, cv=5, param_grid=param_grid, n_jobs=-1)
%time scores = cross_val_score(estimator=anova_ridge_cv, X=X, y=y, cv=5)
print("Test r2:%.2f" % scores.mean())
print("----------------------------")
print("-- Parallelize outer loop --") print("----------------------------")
anova_ridge_cv = GridSearchCV(anova_ridge, cv=5, param_grid=param_grid)
%time scores = cross_val_score(estimator=anova_ridge_cv, X=X, y=y, cv=5, n_jobs=-1)
print("Test r2:%.2f" % scores.mean())
print("=====================================")
print("== Scaler + Elastic-net regression ==")
print("=====================================")
alphas = [.0001, .001, .01, .1, 1, 10, 100, 1000]
l1_ratio = [.1, .5, .9]
print("----------------------------")
print("-- Parallelize outer loop --")
print("----------------------------")
enet = Pipeline([
    ('standardscaler', preprocessing.StandardScaler()),
    ('enet', lm.ElasticNet(max_iter=10000)),
])
param_grid = {'enet__alpha': alphas,
              'enet__l1_ratio': l1_ratio}
enet_cv = GridSearchCV(enet, cv=5, param_grid=param_grid)
%time scores = cross_val_score(estimator=enet_cv, X=X, y=y, cv=5, n_jobs=-1)
print("Test r2:%.2f" % scores.mean())
print("-----------------------------------------------")
print("-- Parallelize outer loop + built-in CV --")
print("-- Remark: scaler is only done on outer loop --")
print("-----------------------------------------------")
enet_cv = Pipeline([
('standardscaler', preprocessing.StandardScaler()),
('enet', lm.ElasticNetCV(max_iter=10000, l1_ratio=l1_ratio, alphas=alphas)),
])
%time scores = cross_val_score(estimator=enet_cv, X=X, y=y, cv=5)
print("Test r2:%.2f" % scores.mean())
SNR: 3.28668201676
=============================
== Basic linear regression ==
=============================
Test r2:0.29
==============================================
== Scaler + anova filter + ridge regression ==
==============================================
----------------------------
-- Parallelize inner loop --
----------------------------
CPU times: user 6.06 s, sys: 836 ms, total: 6.9 s
Wall time: 7.97 s
Test r2:0.86
----------------------------
-- Parallelize outer loop --
----------------------------
CPU times: user 270 ms, sys: 129 ms, total: 399 ms
Wall time: 3.51 s
Test r2:0.86
=====================================
== Scaler + Elastic-net regression ==
=====================================
----------------------------
-- Parallelize outer loop --
----------------------------
CPU times: user 44.4 ms, sys: 80.5 ms, total: 125 ms
Wall time: 1.43 s
Test r2:0.82
-----------------------------------------------
-- Parallelize outer loop + built-in CV --
-- Remark: scaler is only done on outer loop --
-----------------------------------------------
CPU times: user 227 ms, sys: 0 ns, total: 227 ms
Wall time: 225 ms
Test r2:0.82
def balanced_acc(estimator, X, y):
    '''
    Balanced accuracy scorer
    '''
    return metrics.recall_score(y, estimator.predict(X), average=None).mean()
print("=============================")
print("== Basic logistic regression ==")
print("=============================")
scores = cross_val_score(estimator=lm.LogisticRegression(C=1e8, class_weight='balanced'),
print("----------------------------") print("-- Parallelize outer loop --")
print("----------------------------")
lasso = Pipeline([
('standardscaler', preprocessing.StandardScaler()),
('lasso', lm.LogisticRegression(penalty='l1', class_weight='balanced')),
])
param_grid = {'lasso__C':Cs}
enet_cv = GridSearchCV(lasso, cv=5, param_grid=param_grid, scoring=balanced_acc) %time
scores = cross_val_score(estimator=enet_cv, X=X, y=y, cv=5,\ scoring=balanced_acc, n_jobs=-1)
print("Test bACC:%.2f" % scores.mean())
print("-----------------------------------------------")
print("-- Parallelize outer loop + built-in CV --")
print("-- Remark: scaler is only done on outer loop --")
print("-----------------------------------------------")
lasso_cv = Pipeline([
('standardscaler', preprocessing.StandardScaler()),
('lasso', lm.LogisticRegressionCV(Cs=Cs, scoring=balanced_acc)),
])
%time scores = cross_val_score(estimator=lasso_cv, X=X, y=y, cv=5)
print("Test bACC:%.2f" % scores.mean())
print("=============================================") print("== Scaler +
Elasticnet logistic regression ==")
print("=============================================")
print("----------------------------")
print("-- Parallelize outer loop --") print("----------------------------")
enet = Pipeline([
('standardscaler', preprocessing.StandardScaler()),
('enet', lm.SGDClassifier(loss="log", penalty="elasticnet", alpha=0.0001, l1_ratio=0.15,
class_weight='balanced')),
])
param_grid = {'enet__alpha':alphas,'enet__l1_ratio':l1_ratio}
enet_cv = GridSearchCV(enet, cv=5, param_grid=param_grid, scoring=balanced_acc)
%time scores = cross_val_score(estimator=enet_cv, X=X, y=y, cv=5, scoring=balanced_acc, n_jobs=-1)
print("Test bACC:%.2f" % scores.mean())
=============================
== Basic logistic regression ==
=============================
Test bACC:0.52
=======================================================
== Scaler + anova filter + ridge logistic regression ==
=======================================================
----------------------------
-- Parallelize inner loop --
----------------------------
CPU times: user 3.02 s, sys: 562 ms, total: 3.58 s
Wall time: 4.43 s
Test bACC:0.67
----------------------------
-- Parallelize outer loop --
----------------------------
CPU times: user 59.3 ms, sys: 114 ms, total: 174 ms
Wall time: 1.88 s
Test bACC:0.67
========================================
== Scaler + lasso logistic regression ==
========================================
----------------------------
-- Parallelize outer loop --
----------------------------
CPU times: user 81 ms, sys: 96.7 ms, total: 178 ms
Wall time: 484 ms
Test bACC:0.57
-----------------------------------------------
-- Parallelize outer loop + built-in CV --
-- Remark: scaler is only done on outer loop --
-----------------------------------------------
CPU times: user 575 ms, sys: 3.01 ms, total: 578 ms
Wall time: 327 ms
Test bACC:0.60
=============================================
== Scaler + Elasticnet logistic regression ==
=============================================
----------------------------
-- Parallelize outer loop --
----------------------------
CPU times: user 429 ms, sys: 100 ms, total: 530 ms
Wall time: 979 ms
Test bACC:0.61
CHAPTER
FIFTEEN
CASE STUDIES OF ML
Read dataset
from __future__ import print_function
import pandas as pd
import numpy as np
url = 'https://raw.github.com/neurospin/pystatsml/master/data/default%20of%20credit%20card%20clients.xls'
data = pd.read_excel(url, skiprows=1, sheetname='Data')  # recent pandas: sheet_name='Data'
df = data.copy()
target = 'default payment next month'
print(df.columns)
#Index(['ID', 'LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0',
#'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
#'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
#'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
#'default payment next month'],
#dtype='object')
Missing data
print(df.isnull().sum())
#ID 0
#LIMIT_BAL 0
#SEX 0
#EDUCATION 468
#MARRIAGE 377
#AGE 0
#PAY_0 0
#PAY_2 0
#PAY_3 0
#PAY_4 0
#PAY_5 0
#PAY_6 0
#BILL_AMT1 0
#BILL_AMT2 0
#BILL_AMT3 0
#BILL_AMT4 0
#BILL_AMT5 0
#BILL_AMT6 0
#PAY_AMT1 0
#PAY_AMT2 0
#PAY_AMT3 0
#PAY_AMT4 0
#PAY_AMT5 0
#PAY_AMT6 0
#default payment next month 0
#dtype: int64
df.ix[df["EDUCATION"].isnull(), "EDUCATION"] = df["EDUCATION"].mean() df.ix[df["MARRIAGE"].isnull(),
"MARRIAGE"] = df["MARRIAGE"].mean() print(df.isnull().sum().sum()) # O
describe_factor(df[target]) {0: 23364, 1: 6636}
Univariate analysis
Machine Learning with SVM
On this large dataset, we can afford to set aside some test samples. This will also save
computation time. However we will have to do some manual work.
import numpy as np
from sklearn import datasets
import sklearn.svm as svm
from sklearn import preprocessing
from sklearn.cross_validation import cross_val_score, train_test_split
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV
import sklearn.metrics as metrics
def balanced_acc(estimator, X, y):
return metrics.recall_score(y, estimator.predict(X), average=None).mean()
print("===============================================")
print("== Put aside half of the samples as test set ==")
print("===============================================")
# X, y are assumed to have been built from df (features and the target column) in the section above
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0, stratify=y)
print("=================================")
print("== Scale trainin and test data ==")
print("=================================")
scaler = preprocessing.StandardScaler()
Xtrs = scaler.fit(Xtr).transform(Xtr)
Xtes = scaler.transform(Xte)
print("=========") print("== SVM
==") print("=========")
svc = svm.LinearSVC(class_weight='balanced', dual=False)
%time scores = cross_val_score(estimator=svc,\
X=Xtrs, y=ytr, cv=2, scoring=balanced_acc)
print("Validation bACC:%.2f" % scores.mean())
#CPU times: user 1.01 s, sys: 39.7 ms, total: 1.05 s
#Wall time: 112 ms
#Validation bACC:0.67
svc_rbf = svm.SVC(kernel='rbf', class_weight='balanced')
%time scores = cross_val_score(estimator=svc_rbf, X=Xtrs, y=ytr, cv=2, scoring=balanced_acc)
print("Validation bACC:%.2f" % scores.mean())
#CPU times: user 10.2 s, sys: 136 ms, total: 10.3 s
#Wall time: 10.3 s
#Validation bACC:0.71
svc_lasso = svm.LinearSVC(class_weight='balanced', penalty='l1', dual=False)
%time scores = cross_val_score(estimator=svc_lasso, X=Xtrs, y=ytr, cv=2, scoring=balanced_acc)
print("Validation bACC:%.2f" % scores.mean())
#CPU times: user 4.51 s, sys: 168 ms, total: 4.68 s
#Wall time: 544 ms
#Validation bACC:0.67
print("========================") print("== SVM
CV Grid search ==")
print("========================")
Cs = [0.001, .01, .1, 1, 10, 100, 1000]
param_grid = {'C':Cs}
print("-------------------") print("-- SVM Linear L2 --")
print("-------------------")
svc_cv = GridSearchCV(svc, cv=3, param_grid=param_grid, scoring=balanced_acc, n_jobs=-1)
# What are the best parameters?
%time svc_cv.fit(Xtrs, ytr).best_params_
#CPU times: user 211 ms, sys: 209 ms, total: 421 ms
#Wall time: 1.07 s
#{'C': 0.01}
scores = cross_val_score(estimator=svc_cv, X=Xtrs, y=ytr, cv=2, scoring=balanced_acc)
print("Validation bACC:%.2f" % scores.mean())
#Validation bACC:0.67
print("-------------") print("-- SVM RBF --")
print("-------------")
svc_rbf_cv = GridSearchCV(svc_rbf, cv=3, param_grid=param_grid, scoring=balanced_acc, n_jobs=-1)
# What are the best parameters?
%time svc_rbf_cv.fit(Xtrs, ytr).best_params_
#Wall time: 1min 10s
#Out[6]: {'C': 1}
# reduce the grid search
svc_rbf_cv.param_grid = {'C': [0.1, 1, 10]}
scores = cross_val_score(estimator=svc_rbf_cv, X=Xtrs, y=ytr, cv=2, scoring=balanced_acc)
print("Validation bACC:%.2f" % scores.mean())
#Validation bACC:0.71
print("-------------------") print("-- SVM Linear L1 --")
print("-------------------")
svc_lasso_cv = GridSearchCV(svc_lasso, cv=3, param_grid=param_grid, scoring=balanced_acc, n_jobs=-1)
# What are the best parameters?
%time svc_lasso_cv.fit(Xtrs, ytr).best_params_
#CPU times: user 514 ms, sys: 181 ms, total: 695 ms
#Wall time: 2.07 s
#Out[10]: {'C': 0.1}
# reduce the grid search
svc_lasso_cv.param_grid = {'C': [0.1, 1, 10]}
scores = cross_val_score(estimator=svc_lasso_cv, X=Xtrs, y=ytr, cv=2, scoring=balanced_acc)
print("Validation bACC:%.2f" % scores.mean())
#Validation bACC:0.67
print("SVM-RBF, test bACC:%.2f" % balanced_acc(svc_rbf_cv, Xtes, yte))
# SVM-RBF, test bACC:0.70
print("SVM-Lasso, test bACC:%.2f" % balanced_acc(svc_lasso_cv, Xtes, yte))
# SVM-Lasso, test bACC:0.67