Overview
1966: ALPAC report cut off government funding for MT, first AI winter
Contributions:
• Lisp, garbage collection, time-sharing (John McCarthy)
• Key paradigm: separate modeling and inference
Problems:
• Knowledge is not deterministic rules, need to model uncertainty
• Requires considerable manual effort to create rules; hard to maintain
A brief history
Two views
Course overview
Course logistics
Optimization
Predicting poverty
[Sharif+ 2016]
[Figure: the modeling-inference-learning paradigm, connecting the real world to a model and its predictions]
Modeling
[Figure: Model]
Inference
[Figure: Predictions]
Learning
[Figure: a model without parameters, plus data, yields a model with parameters]
Machine learning
[Figure: data → model]
[Figure: course overview, from "low-level intelligence" to "high-level intelligence": reflex and state-based models (search problems, Markov decision processes, adversarial games), plus machine learning]
[Figure: chess position, white to move]
Applications:
• Games: Chess, Go, Pac-Man, Starcraft, etc.
• Robotics: motion planning
• Natural language generation: machine translation, image captioning
[demo]
[Figure: course overview, extended with constraint satisfaction problems and Bayesian networks alongside search problems, Markov decision processes, adversarial games, and machine learning]
Goal: put digits in the blank squares so that each row, column, and 3x3 sub-block has the digits 1–9
[Figure: Bayesian network examples with variables X1–X4, H1–H5, and E1–E5]
Course logistics
Late days: 7 total late days, at most two per assignment (late days cannot be used for the final project report or poster)
Piazza: ask questions on Piazza; do not email us directly except for OAE letters
Optimization
$\min_{p \in \text{Paths}} \text{Cost}(p)$
$\min_{w \in \mathbb{R}^d} \text{TrainingError}(w)$
Problem: compute the edit distance (minimum number of character insertions, deletions, and substitutions) between two strings.
Examples:
"cat", "cat" ⇒ 0
"cat", "dog" ⇒ 3
"cat", "at" ⇒ 1
"cat", "cats" ⇒ 1
"a cat!", "the cats!" ⇒ 4
[semi-live solution]
• Once you have the recurrence, you can code it up. The straightforward implementation will take exponential
time, but you can memoize the results to make it quadratic time (in this case, O(nm)). The end result
is the dynamic programming solution: recurrence + memoization.
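• As a sanity check of the recurrence-plus-memoization idea, here is a minimal Python sketch (not from the lecture); the function name edit_distance and the use of functools.lru_cache for memoization are illustrative choices.

from functools import lru_cache

def edit_distance(s, t):
    """Minimum number of insertions, deletions, and substitutions to turn s into t."""

    @lru_cache(maxsize=None)
    def d(i, j):
        # d(i, j) = edit distance between the suffixes s[i:] and t[j:]
        if i == len(s):
            return len(t) - j              # insert the rest of t
        if j == len(t):
            return len(s) - i              # delete the rest of s
        if s[i] == t[j]:
            return d(i + 1, j + 1)         # characters match: no cost
        return 1 + min(d(i + 1, j + 1),    # substitute
                       d(i, j + 1),        # insert
                       d(i + 1, j))        # delete

    return d(0, 0)

# Matches the examples above: 0, 3, 1, 1, 4
print(edit_distance("cat", "cat"), edit_distance("cat", "dog"),
      edit_distance("cat", "at"), edit_distance("cat", "cats"),
      edit_distance("a cat!", "the cats!"))

• Memoizing on the pair (i, j) is what brings the running time down from exponential to O(nm).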
Problem: finding the least squares line
Examples:
{(2, 4)} ⇒ 2
{(2, 4), (4, 2)} ⇒ ?
[semi-live solution]
• The formal task is this: given a set of n two-dimensional points (x_i, y_i), which define $F(w) = \sum_{i=1}^{n} (x_i w - y_i)^2$, compute the w that minimizes F(w).
• Linear regression is an important problem in machine learning, which we will come to later. Here's a motivation for the problem: suppose you're trying to understand how your exam score (y) depends on the number of hours you study (x). Let's posit a linear relationship y = wx (not exactly true in practice, but maybe good enough). Now we get a set of training examples, each of which is an (x_i, y_i) pair. The goal is to find the slope w that best fits the data.
• Back to algorithms for this formal task. We would like an algorithm for optimizing general types of F(w), so let's abstract away from the details. Start at a guess of w (say w = 0), and then iteratively update w based on the derivative (gradient if w is a vector) of F(w). The algorithm we will use is called gradient descent.
• If the derivative F'(w) < 0, then increase w; if F'(w) > 0, decrease w; otherwise, keep w still. This motivates the following update rule, which we perform over and over again: w ← w − ηF'(w), where η > 0 is a step size that controls how aggressively we change w.
• If η is too big, then w might bounce around and not converge. If η is too small, then w might move toward the optimum only very slowly. Choosing the right value of η can be rather tricky. Theory can give rough guidance, but this is outside the scope of this class. Empirically, we will just try a few values and see which one works best; this will help us develop some intuition in the process.
• Now to specialize to our function, we just need to compute the derivative, which is an elementary calculus exercise: $F'(w) = \sum_{i=1}^{n} 2(x_i w - y_i)\, x_i$.
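• To make the update rule concrete, here is a minimal Python sketch of gradient descent on this least squares objective (not from the lecture); the step size eta = 0.01 and the iteration count are illustrative choices, and the gradient is computed from the formula above.

def gradient_descent(points, eta=0.01, num_iters=1000):
    """Minimize F(w) = sum_i (x_i * w - y_i)^2 over a scalar w."""
    w = 0.0  # initial guess
    for _ in range(num_iters):
        # F'(w) = sum_i 2 * (x_i * w - y_i) * x_i
        gradient = sum(2 * (x * w - y) * x for x, y in points)
        w = w - eta * gradient  # update rule: w <- w - eta * F'(w)
    return w

print(gradient_descent([(2, 4)]))          # converges to w = 2, as in the example above
print(gradient_descent([(2, 4), (4, 2)]))  # converges to the least squares slope for these two points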
Summary
• History: roots from logic, neuroscience, statistics—melting pot!