Jade Abbott - ML's Hidden Tasks


ML

Alice was excited!

Lots of tutorials
Loads of resources
Endless examples
Fast-paced research

How to even data science?

Challenge

How to make this work in the real world?
Machine Learning’s Surprises
A Checklist for Developers
when Building ML Systems
Hi, I’m Jade Abbott

@alienelf

masakhane.io
Surprises while...

Trying to deploy the model
Trying to improve the model
After deployment of the model
Some context
❖ I won’t be talking about training machine learning models
❖ I won’t be talking about which models to choose
❖ I work primarily in deep learning & NLP
❖ I am a one-person ML team working in a startup context
❖ I work in a normal world where data is scarce and we need to collect more
The Problem

I want to meet... someone to look after my cat
I can provide... pet sitting / cat breeding / software development / chef lessons

The Model decides: Yes, they should meet / No, they shouldn’t

The Model: Embedding + LSTM + Downstream NN, i.e. a Language Model + Downstream Task (a rough sketch follows below)
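To make the setup concrete, below is a minimal PyTorch sketch of an "Embedding + LSTM + Downstream NN" matcher. It illustrates the architecture named above, not the production model; the vocabulary size, dimensions and random toy inputs are all assumptions.

```python
import torch
import torch.nn as nn

class Matcher(nn.Module):
    """Encode an "I want to meet..." text and an "I can provide..." text,
    then predict should-meet yes/no with a small downstream network."""

    def __init__(self, vocab_size=10_000, emb_dim=100, hidden=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Sequential(                    # downstream NN
            nn.Linear(2 * hidden, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def encode(self, token_ids):
        _, (h, _) = self.lstm(self.emb(token_ids))    # final hidden state
        return h[-1]

    def forward(self, ask_ids, offer_ids):
        joint = torch.cat([self.encode(ask_ids), self.encode(offer_ids)], dim=-1)
        return torch.sigmoid(self.head(joint))        # P(they should meet)

# toy usage with random token ids standing in for tokenised sentences
model = Matcher()
ask = torch.randint(0, 10_000, (1, 12))
offer = torch.randint(0, 10_000, (1, 8))
print(model(ask, offer))
```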
Surprises

Surprises trying to
deploy the model
Expectations

train & evaluate model
model API
CI/CD
unit tests
user testing
Surprise #1

Is the model good enough?

75% Accuracy
Performance Metrics
❖ Business needs to understand it
❖ Active discussion about pros & cons
❖ Get sign off
❖ Threshold selection strategy
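As one possible threshold selection strategy, agree a minimum precision with the business and pick the lowest score threshold that still meets it. A hedged sketch using scikit-learn's precision_recall_curve; the labels, scores and the 0.90 target are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])                          # toy labels
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.45, 0.6, 0.3])   # toy model scores

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
target_precision = 0.90                      # assumed target agreed with the business

# precision[i] / recall[i] describe the classifier at thresholds[i];
# picking the lowest threshold that meets the precision target keeps recall high.
meets_target = precision[:-1] >= target_precision
chosen = thresholds[meets_target][0] if meets_target.any() else None
print("chosen threshold:", chosen)
```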
Surprise #2

Can we trust it?

Skin Cancer Detection [1]    Husky/Dog Classifier [2]

1. https://visualsonline.cancer.gov/details.cfm?imageid=9288
2. https://arxiv.org/pdf/1602.04938.pdf
Explanations

https://github.com/marcotcr/lime
https://pair-code.github.io/what-if-tool/
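A hedged sketch of explaining a single text prediction with LIME (linked above). The tiny TF-IDF + logistic regression pipeline and the example sentences are stand-ins, not the matcher from the talk.

```python
from lime.lime_text import LimeTextExplainer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy stand-in classifier: does a "provide" sentence match the doctor request?
texts = ["I can provide general practitioner services",
         "I can provide pet sitting",
         "I can provide medicine",
         "I can provide chef lessons"]
labels = [1, 0, 1, 0]
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)

explainer = LimeTextExplainer(class_names=["no match", "match"])
explanation = explainer.explain_instance(
    "I can provide medicine and general services",
    clf.predict_proba,          # LIME perturbs the text and queries this function
    num_features=4,
)
print(explanation.as_list())    # words with their weight for or against "match"
```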
Surprise #3

Will this model harm users?
“Racial bias in a medical algorithm favors white patients over sicker black patients”
Washington Post
“Racist robots, as I invoke them here, represent a much broader process: social bias embedded in technical artifacts, the allure of objectivity without public accountability”

~ Ruha Benjamin @ruha9


“What are the unintended consequences of designing systems at scale on the basis of existing patterns of society?”

~ M. C. Elish & Danah Boyd, Don’t Believe Every AI You See
@m_c_elish @zephoria
❖ Word2Vec has known gender and race biases
❖ It’s in English
❖ Is it robust to spelling errors?
❖ How does it perform with malicious data?

Make it measurable!

https://pair-code.github.io
http://aif360.mybluemix.net
https://github.com/fairlearn/fairlearn
https://github.com/jphall663/awesome-machine-learning-interpretability
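One way to make it measurable, even before adopting the tools above, is to compute the same metrics per group and look at the gaps. A toy sketch, assuming a hypothetical DataFrame of predictions with group, label and pred columns:

```python
import pandas as pd

df = pd.DataFrame({                      # made-up predictions for illustration
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "label": [1, 0, 1, 0, 1, 0, 0, 1],
    "pred":  [1, 0, 1, 1, 0, 0, 1, 1],
})

def group_metrics(g: pd.DataFrame) -> pd.Series:
    negatives = g[g.label == 0]
    return pd.Series({
        "accuracy": (g.pred == g.label).mean(),
        # false positive rate computed over the group's true negatives
        "false_positive_rate": (negatives.pred == 1).mean() if len(negatives) else float("nan"),
    })

per_group = df.groupby("group").apply(group_metrics)
print(per_group)
print("accuracy gap between groups:", per_group.accuracy.max() - per_group.accuracy.min())
```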
Expectations

train & evaluate model
model API
CI/CD
unit tests
user testing
Reality

choose a useful metric
evaluate model
choose threshold
model API
explain predictions
fairness framework
unit tests
user testing
Surprises

Surprises after deploying the model
Expectations (the agile cycle)

user testing → user drop off, logs a bug or submits a complaint → bug tracking tool → bug triage → reproduce, debug, fix, release → back to user testing
Surprise #5

I want to meet a doctor
I can provide marijuana and other drugs which improves health
Surprise #5

The model has some “bugs”


Surprise #5 continued...

❖ What is a model “bug”?
❖ How to fix the “bug”?
❖ When is the “bug” fixed?
❖ How do I guard against regressions?
❖ “Bug” priority?
Describing the “bugs” (add to your test set)
I can provide...                                | I want to meet...        | Prediction | Target | Outcome
marijuana and other drugs which improves health | a doctor                 | YES        | NO     | False Positive
marijuana                                       | a doctor                 | NO         | NO     | True Negative
drugs for cancer patients                       | a doctor                 | YES        | NO     | False Positive
general practitioner services                   | a doctor                 | NO         | YES    | False Negative
medicine                                        | a drug addiction sponsor | YES        | YES    | True Positive
medicine                                        | a pharmacist             | YES        | YES    | True Positive
illegal drugs                                   | a drug dealer            | YES        | NO     | False Positive
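Describing a “bug” this way also answers the regression question: each pattern becomes a named test set that every candidate model must keep passing. A minimal sketch, where predict is a hypothetical stand-in for the deployed matcher and the 0.9 minimum is an assumed, agreed threshold:

```python
# (offer, ask, expected should-meet) taken from the table above
DRUGS_DOCTORS_FALSE_POS = [
    ("I can provide marijuana and other drugs which improves health",
     "I want to meet a doctor", False),
    ("I can provide illegal drugs", "I want to meet a drug dealer", False),
    ("I can provide general practitioner services", "I want to meet a doctor", True),
]

def pattern_accuracy(predict, cases):
    """Fraction of the pattern's cases the model gets right."""
    return sum(predict(offer, ask) == expected
               for offer, ask, expected in cases) / len(cases)

def assert_no_regression(predict, minimum=0.9):
    # fail the release if this "bug" pattern slips below the agreed minimum
    assert pattern_accuracy(predict, DRUGS_DOCTORS_FALSE_POS) >= minimum

# usage with a deliberately naive stand-in model
assert_no_regression(lambda offer, ask: "drugs" not in offer)
```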
Is my “bug” fixed?
[Chart: classification error per candidate model over time, one line per “bug” pattern: politicians-false-neg, designers-too-general, drugs-doctors-false-pos, tech-too-general]

How do we triage these “bugs”?
How do we triage these “bugs”?

% Users Affected × Normalized Error × Harm
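A toy sketch of that triage score. The problem names echo the slides, but the percentages, error values and harm weights are invented for illustration:

```python
problems = [
    # (name, % users affected, normalized error, harm weight)
    ("drugs-doctors-false-pos", 0.05, 0.80, 1.00),
    ("designers-too-general",   0.20, 0.40, 0.20),
    ("politicians-false-neg",   0.02, 0.60, 0.50),
]

# priority = % users affected x normalized error x harm, highest first
scored = sorted(
    ((name, users * error * harm) for name, users, error, harm in problems),
    key=lambda item: item[1],
    reverse=True,
)
for name, priority in scored:
    print(f"{name}: {priority:.4f}")
```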
How do we triage these “bugs”?

Problem                      | Impact Error
the-arts-too-general         | 2.931529
health-more-specific         | 1.53985
brand-marketing-social-media | 1.285735
developer                    | 1.054248
1-services                   | 0.960129
Surprise #6

Is this new model better than my old model?
Alice replied, rather
shyly, “I—I hardly
know, sir, just at
present—at least I know
who I was when I got up
this morning, but I think
I must have changed
several times since
then.”
Why is model comparison hard?

The test set is living: a 0.8 for the old model and a 0.75 for the new one may have been measured on different versions of the test set, so they are not comparable.

Re-evaluate ALL models on the same, current test set (e.g. 0.72 vs 0.75) before declaring a winner.
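A minimal sketch of the re-evaluate-everything rule, where load_model and evaluate are hypothetical helpers; the point is only that every candidate is scored against the same, current test set:

```python
def pick_best(candidate_names, current_test_set, load_model, evaluate):
    # never reuse scores computed against an older version of the test set
    scores = {name: evaluate(load_model(name), current_test_set)
              for name in candidate_names}
    best = max(scores, key=scores.get)
    return best, scores
```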
Surprise #7

I demoed the model yesterday and it went off-script!

What changed?
Surprise #7

Why is the model doing something differently today?
What changed?

❖ My data?
❖ My model?
❖ My preprocessing?
How to figure out what changed? Experiment results and a metadata store.

Each experiment record ties a model to the exact data and code that produced it:

experiment: 3
model: model-3
data: ea2541df
code: da1341bb
desc: “Added feature to training pipeline”
run_on: 10-10-2019
completed_on: 11-10-2019
results: 3

Linked components: Model Repository (model-3), Data Repository (ea2541df), Code Repository (da1341bb), CI/CD.
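A hedged sketch, not the speaker's actual tooling, of writing one such metadata record so that any model can be traced back to its data hash and code commit. The field names mirror the example above; the output path and helper function are assumptions:

```python
import json
import subprocess
from datetime import date
from pathlib import Path

def record_experiment(experiment_id, model_name, data_hash, description, results_id):
    metadata = {
        "experiment": experiment_id,
        "model": model_name,
        "data": data_hash,                 # e.g. a content hash of the dataset version
        "code": subprocess.check_output(   # current git commit of the training code
            ["git", "rev-parse", "--short", "HEAD"]).decode().strip(),
        "desc": description,
        "run_on": date.today().isoformat(),
        "results": results_id,
    }
    out = Path("experiments") / f"{experiment_id}.json"
    out.parent.mkdir(exist_ok=True)
    out.write_text(json.dumps(metadata, indent=2))
    return metadata

record_experiment(3, "model-3", "ea2541df", "Added feature to training pipeline", 3)
```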


Expectations (the agile cycle)

user testing → user drop off, logs a bug or submits a complaint → bug tracking tool → prioritization → reproduce, debug, fix → back to user testing

Actual

user reports bug → identify problem patterns → describe problem with a test → add to model bug tracking tool → calculate priority → triage

“Agile Sprint”

Pick problem → gather more data for the problem / change the model / create features → retrain → evaluate the model against other models, evaluate individual problems, select a model
Surprises

Surprises maintaining
and improving the
model over time
Expectation

Pick an issue → generate/select unlabelled patterns → get them labelled → add to data set → retrain
Surprise #8

User behaviour
drifts
Now what?

● Regularly sample data from production for training
● Regularly refresh your test set
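A toy sketch of such a refresh, assuming a pandas DataFrame of production traffic with a timestamp column; the window size, sample size and test fraction are arbitrary:

```python
import pandas as pd

def sample_for_refresh(production: pd.DataFrame, days=30, n=500, test_fraction=0.2, seed=0):
    # take a random sample of recent production traffic for labelling
    cutoff = production["timestamp"].max() - pd.Timedelta(days=days)
    recent = production[production["timestamp"] >= cutoff]
    sample = recent.sample(min(n, len(recent)), random_state=seed)

    # rotate a slice into the test set so it keeps tracking current user behaviour
    new_test = sample.sample(frac=test_fraction, random_state=seed)
    new_train = sample.drop(new_test.index)
    return new_train, new_test
```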
Surprise #9

Data labellers
are rarely
experts
Surprise #10

The model is not robust

Surprise #10

The model knows when it’s uncertain
Techniques for detecting robustness & uncertainty

❖ Softmax predictions that are uncertain
❖ Dropout at inference
❖ Add noise to the data and see how much the output changes
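A minimal PyTorch sketch of the dropout-at-inference idea (Monte Carlo dropout): keep dropout stochastic at prediction time and treat the spread of the softmax outputs as an uncertainty signal. The toy network and input are assumptions:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                       # hypothetical toy classifier
    nn.Linear(16, 32), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(32, 2),
)

def mc_dropout_predict(model, x, n_samples=30):
    model.train()                            # keep dropout layers stochastic at inference
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    return probs.mean(dim=0), probs.std(dim=0)   # mean prediction + spread

x = torch.randn(1, 16)                       # stand-in for a real feature vector
mean, spread = mc_dropout_predict(model, x)
print(mean, spread)                          # a large spread means the model is uncertain
```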
Surprise #11

Changing and
updating the data so
often gets messy
Needed to check the following

● Data Leakage
● Duplicates
● Distributions
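A hedged sketch of those checks, assuming pandas DataFrames train and test with text and label columns. A real pipeline would be stricter (near-duplicate detection, statistical distribution tests); this only shows the shape of the idea:

```python
import pandas as pd

def check_data(train: pd.DataFrame, test: pd.DataFrame) -> None:
    # duplicates inside the training set
    print("duplicate training rows:", train.duplicated(subset=["text"]).sum())

    # data leakage: identical examples appearing in both train and test
    leaked = set(train["text"]) & set(test["text"])
    print("examples leaked into the test set:", len(leaked))

    # label distributions: a large shift after a data refresh is a warning sign
    print("train labels:\n", train["label"].value_counts(normalize=True))
    print("test labels:\n", test["label"].value_counts(normalize=True))
```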
Expectation

Pick an issue → generate/select unlabelled patterns → get them labelled → add to data set → retrain
Actual

Pick problem → generate/select unlabelled data (the model tells you which patterns it’s uncertain about) → get the data labelled on a crowdsourced platform → review a sample from each data labeller → approve or reject; escalate conflicting data labels to an expert data-label platform → add to a branch of the dataset (Data Version Control) → CI/CD runs tests on the data → merge into the dataset → new data!
The Checklist
First Release

Careful metric selection

Threshold selection strategy

Explain Predictions

Fairness Framework
The Checklist
After First Release
ML Problem Tracker

Problem Triage Strategy

Reproducible Training

Comparable Results

Result Management

Be able to answer why


The Checklist
Long term improvements & maintenance

Data refresh strategy

Data Version Control

CI/CD or Metrics for Data

Data Labeller Platform + Strategy

Robustness & Uncertainty


Things I didn’t cover

Pipelines & Orchestration: Kubeflow, MLflow
End-to-end Products: TFX, SageMaker, Azure ML
Unit Testing ML systems: “Testing your ML pipelines” by Kristina Georgieva
Debugging ML models: “A field guide to fixing your neural network model” by Josh Tobin
Privacy: Google’s Federated Learning
Hyperparameter optimization: So many!

The End
@alienelf
[email protected]
https://retrorabbit.co
https://kalido.me
https://masakhane.io
