AWS Scholarship
Terminology
Nearly all tasks solved with machine learning involve three primary
components:
A machine learning model
A model training algorithm
A model inference algorithm
Example 1
Imagine you own a snow cone cart, and you have some data about the
average number of snow cones sold per day based on the high temperature.
You want to better understand this relationship to make sure you have
enough inventory on hand for those high sales days.
In the graph above, you can see one example of a model, a linear regression
model (indicated by the solid line). You can see that, based on the data
provided, the model predicts that as the high temperature for the day
increases, so does the average number of snow cones sold. Sweet!
Example 2
Let's look at a different example that uses the same linear regression model,
but with different data and a completely different question to answer.
Imagine that you work in higher education and you want to better
understand the relationship between the cost of enrollment and the
number of students attending college. In this example, our model predicts
that as the cost of tuition increases, the number of people attending college
is likely to decrease.
Using the same linear regression model (indicated by the solid line), you can
see that the number of people attending college does go down as the cost
increases.
Both examples showcase that a model is a generic program made specific by
the data used to train it.
Let's revisit our clay teapot analogy. We've gotten our piece of clay, and now
we want to make a teapot. Let's look at the algorithm for molding clay and
how it resembles a machine learning algorithm:
Think about the changes that need to be made. The first thing you
would do is inspect the raw clay and think about what changes can be
made to make it look more like a teapot. Similarly, a model training
algorithm uses the model to process data and then compares the
results against some end goal, such as our clay teapot.
Make those changes. Now, you mold the clay to make it look more
like a teapot. Similarly, a model training algorithm gently nudges
specific parts of the model in a direction that brings the model closer
to achieving the goal.
Repeat. By iterating these steps over and over, you get closer and
closer to what you want, until you determine that you’re close enough
and then you can stop.
After finishing this lesson, you will have been introduced to the key terms
and the most common techniques used to solve problems in machine
learning.
In the preceding diagram, you can see an outline of the major steps of the
machine learning process. Regardless of the specific model or training
algorithm used, machine learning practitioners follow a common workflow
to accomplish machine learning tasks.
These steps are iterative. In practice, that means that at each step along the
process, you review how the process is going. Are things operating as you
expected? If not, go back and revisit your current step or previous steps to
try and identify the breakdown.
The rest of the course is designed around these very important steps.
Think back to the snow cone sales example. Now imagine that you own a
frozen treats store and you sell snow cones along with many other products.
You wonder, "How do I increase sales?" It's a valid question, but it's
the opposite of a very specific task. The following examples demonstrate
how a machine learning practitioner might attempt to answer that question.
Step 2: Identify the machine learning task we might use to solve this
problem
This helps you better understand the data you need for a project.
All model training algorithms, and the models themselves, take data as their
input. Their outputs can be very different and are classified into a few
different groups, based on the task they are designed to solve.
Often, we use the kind of data required to train a model as part of defining a
machine learning task.
In this lesson, we will focus on two common machine learning tasks:
Supervised learning
Unsupervised learning
A task is supervised if you are using labeled data. We use the term labeled to
refer to data that already contains the solutions, called labels.
For example, predicting the number of snow cones sold based on the
average temperature outside is an example of supervised learning.
In the preceding graph, the data contains both a temperature and the
number of snow cones sold. Both components are used to generate the
linear regression shown on the graph. Our goal is to predict the number of
snow cones sold, and the dataset already contains that value for each day.
Because we are providing the model with labeled data, we are performing a
supervised machine learning task.
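To make this concrete, here is a minimal sketch (not part of the original
lesson) of fitting a linear regression model to labeled data with
scikit-learn; the temperatures and sales figures are made up for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical labeled data: each row pairs a feature (the day's high
    # temperature) with its label (the number of snow cones sold that day).
    temperatures = np.array([[60], [70], [80], [90], [100]])
    snow_cones_sold = np.array([115, 180, 250, 320, 390])

    # Training fits the line that best maps temperature to sales.
    model = LinearRegression()
    model.fit(temperatures, snow_cones_sold)

    # Model inference: predict sales for a new high temperature.
    print(model.predict([[85]]))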
Unsupervised tasks
Take a look at the preceding picture. Did you notice the tree in the
picture? What you just did, when you noticed the object in the picture
and identified it as a tree, is called labeling the picture. Unlike you, a
computer just sees that image as a matrix of pixels of varying intensity.
Since this image does not have the labeling in its original data, it is
considered unlabeled.
Unsupervised learning involves using data that doesn't have a label. One
common task is called clustering. Clustering helps to determine if there are
any naturally occurring groupings in the data.
Imagine that you work for a company that recommends books to readers.
You are fairly confident that micro-genres exist, and that there is one
called Teen Vampire Romance. However, you don’t know which micro-genres
exist specifically, so you can't use supervised learning techniques.
In supervised learning, there are two main identifiers that you will see in
machine learning:
A categorical label has a discrete set of possible values, such as the
species of a flower.
A continuous (regression) label does not have a discrete set of possible
values, which often means you are working with numerical data, such as the
number of snow cones sold.
You can take an entire class just on working with, understanding, and
processing data for machine learning applications. Good, high-quality data is
essential for any kind of machine learning project. Let's explore some of the
common aspects of working with data.
Data collection
Data inspection
The quality of your data will ultimately be the largest factor that affects how
well you can expect your model to perform. As you inspect your data, look
for:
Outliers
Missing or incomplete values
Data that needs to be transformed or preprocessed so it's in the
correct format to be used by your model
Summary statistics
Data visualization
You can use data visualization to see outliers and trends in your data and to
help stakeholders understand your data.
Look at the following two graphs. In the first graph, some data seems to
have clustered into different groups. In the second graph, some data points
might be outliers.
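As an illustrative sketch (not part of the original lesson), here is how you
might compute summary statistics and a quick visualization with pandas and
matplotlib; the file name and column names are hypothetical.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical dataset (the CSV file and column names are made up).
    df = pd.read_csv("snow_cone_sales.csv")

    # Summary statistics describe the scope, scale, and shape of the data;
    # also count missing or incomplete values.
    print(df.describe())
    print(df.isnull().sum())

    # A quick visualization helps you spot outliers, trends, and groupings.
    df.plot.scatter(x="high_temperature", y="snow_cones_sold")
    plt.show()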
Wrap-up
Nice work completing this chapter! There is a lot of information in this
section. Let’s review a couple of key parts:
One
You learned that having good data is key to successfully answering the
question you defined in your machine learning problem.
Two
To build a good dataset, there are four key aspects to be considered when
working with your data. First, you need to collect the data. Second, you
should inspect your data to check for outliers, missing or incomplete values,
and to see if any kind of data reformatting is required. Third, you should use
summary statistics to understand the scope, scale, and shape of the dataset.
Finally, you should use data visualizations to check for outliers, and to see
trends in your data.
Additional reading
In machine learning, you use several statistical-based tools to better
understand your data. The sklearn library has many examples and
tutorials, such as this example that demonstrates outlier detection on
a real dataset.
Model training
The end-to-end training process is iterative: you feed the training data into
the model, compute the loss function on the results, and update the model
parameters in a direction that reduces loss.
You continue to cycle through these steps until you reach a predefined stop
condition. This might be based on training time, the number of training
cycles, or an even more intelligent or application-aware mechanism.
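To make the training loop and its stop conditions concrete, here is a minimal
sketch (not from the original lesson) that trains a one-weight-plus-bias
linear model on made-up data; the learning rate and thresholds are arbitrary
illustration values.

    import numpy as np

    # Made-up labeled data: inputs x and their labels y.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

    weight, bias = 0.0, 0.0   # model parameters, starting from scratch
    learning_rate = 0.01      # how far each "nudge" moves the parameters
    max_cycles = 10_000       # stop condition: number of training cycles
    loss_tolerance = 0.05     # stop condition: loss is "close enough"

    for cycle in range(max_cycles):
        predictions = weight * x + bias   # feed the training data into the model
        errors = predictions - y
        loss = np.mean(errors ** 2)       # compute the loss function
        if loss < loss_tolerance:         # application-aware stop condition
            break
        # Nudge the parameters in a direction that reduces the loss.
        weight -= learning_rate * 2 * np.mean(errors * x)
        bias -= learning_rate * 2 * np.mean(errors)

    print(cycle, loss, weight, bias)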
Remember the following advice when training your model.
Linear models
Tree-based models
Tree-based models are probably the second most common model type
covered in introductory coursework. They learn to categorize or regress by
building an extremely large structure of nested if/else blocks, splitting the
world into different regions at each if/else block. Training determines exactly
where these splits happen and what value is assigned at each leaf region.
For example, if you’re trying to determine if a light sensor is in sunlight or
shadow, you might train a tree of depth 1, with the final learned configuration
being something like if (sensor_value > 0.698), then return 1; else return 0.
The tree-based model XGBoost is commonly used as an off-the-shelf
implementation for this kind of model and includes enhancements beyond
what is discussed here. Try tree-based models to quickly get a baseline
before moving on to more complex models.
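As a rough, hedged illustration (not from the original lesson), here is a
minimal depth-1 decision tree built with scikit-learn on made-up light-sensor
readings; an off-the-shelf XGBoost model would follow a similar fit/predict
pattern.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical light-sensor readings and labels: 1 = sunlight, 0 = shadow.
    sensor_values = np.array([[0.10], [0.20], [0.30], [0.75], [0.80], [0.95]])
    labels = np.array([0, 0, 0, 1, 1, 1])

    # A tree of depth 1 learns a single if/else split on sensor_value.
    tree = DecisionTreeClassifier(max_depth=1)
    tree.fit(sensor_values, labels)

    # Inspect the learned split and classify a new reading.
    print(export_text(tree, feature_names=["sensor_value"]))
    print(tree.predict([[0.70]]))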
Wrap-up
Nice work completing this chapter! Let’s review a couple of key parts
from this lesson:
One
The model training algorithm iteratively updates a model's parameters to
minimize some loss function.
Two
During model training, the training data is fed into the model, and then the
loss function is computed based on the results. The model parameters are
then updated in a direction that reduces loss. You will continue to cycle
through these steps until you reach a predefined stop condition.
After you have collected your data and trained a model, you can start to
evaluate how well your model is performing. The metrics used for
evaluation are likely to be very specific to the problem you defined. As you
grow in your understanding of machine learning, you will be able to explore
a wide variety of metrics that can enable you to evaluate effectively.
Model accuracy is a fairly common evaluation metric. Accuracy is the
fraction of predictions a model gets right.
Here's an example:
Imagine that you build a model to identify a flower as one of two common
species based on measurable details like petal length. You want to know
how often your model predicts the correct species. This would require you
to look at your model's accuracy.
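As a small illustration (not from the original lesson), accuracy is simply the
number of correct predictions divided by the total number of predictions; the
species labels below are made up.

    from sklearn.metrics import accuracy_score

    # Hypothetical true species and the model's predictions for six flowers.
    true_species = ["setosa", "versicolor", "setosa",
                    "versicolor", "setosa", "versicolor"]
    predictions = ["setosa", "versicolor", "versicolor",
                   "versicolor", "setosa", "setosa"]

    # Accuracy = fraction of predictions the model got right (here 4 of 6).
    print(accuracy_score(true_species, predictions))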
Every step we have gone through is highly iterative and can be changed or
rescoped during the course of a project. At each step, you might find that
you need to go back and reevaluate some assumptions you had in previous
steps. Don't worry! This ambiguity is normal.
Model inference
Congratulations! You're ready to deploy your model. Once you have trained
your model, have evaluated its effectiveness, and are satisfied with the
results, you're ready to generate predictions on real-world problems using
unseen data in the field. In machine learning, this process is often
called inference.
Even after you deploy your model, you're always monitoring to make sure
your model is producing the kinds of results that you expect. There may be
times where you reinvestigate the data, modify some of the parameters in
your model training algorithm, or even change the model type used for
training.
Great job getting through all the steps in the machine learning process. Let’s
review some key takeaways from these lessons.
One
Solving problems using machine learning is an evolving and iterative process.
Two
To solve a problem successfully in machine learning, finding high-quality data
is essential.
Three
To evaluate models, you often use statistical metrics. The metrics you choose
are tailored to a specific use case.
Summary of examples
Through the remainder of the lesson, we will walk through three different case studies. In each
example, you will see how machine learning can be used to solve real-world problems.
Supervised learning
Using machine learning to predict housing prices in a neighborhood,
based on lot size and the number of bedrooms.
Unsupervised learning
Using machine learning to isolate micro-genres of books by analyzing the
text on the back of the book.
Deep neural network
While this type of task is beyond the scope of this lesson, we wanted
to show you the power and versatility of modern machine learning.
You will see how it can be used to analyze raw images from lab video
footage from security cameras, trying to detect chemical spills.
You notice that home prices seem to be related to features like lot size and
the number of bedrooms, and you believe that you could use machine
learning to predict home prices.
In the next sections, you will go through the 5 major steps in machine
learning in the context of this example.
Problem
Can we estimate the price of a house based on lot size or the number of
bedrooms?
You access the sale prices for recently sold homes or have them appraised.
Since you have this data, this is a supervised learning task. You want to
predict a continuous numeric value, so this task is also a regression task.
You can plot home values against each of your input variables to look for
trends in your data. In the following chart, you see that when lot size
increases, house value increases.
Prior to actually training your model, you need to split your data. The
standard practice is to put 80% of your dataset into a training dataset and
20% into a test dataset.
The Python scikit-learn library has tools that can handle the implementation
of the model training algorithm for you.
In the following chart, you can see where the data points are in relation to
the blue line. You want the data points to be as close to the "average" line
as possible, which would mean less net error.
You compute the root mean square error (RMSE) between your model’s prediction
for a data point in your test dataset and the true value from your data. The
actual calculation is beyond the scope of this lesson, but it's good to
understand the process at a high level.
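For readers who want to see these steps concretely, here is a minimal sketch
(not part of the original lesson) of the split, train, and evaluate workflow
with scikit-learn; the CSV file and column names are hypothetical.

    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Hypothetical dataset of recently sold homes.
    df = pd.read_csv("home_sales.csv")
    features = df[["lot_size", "bedrooms"]]
    prices = df["sale_price"]

    # Standard practice: 80% of the data for training, 20% held out for testing.
    X_train, X_test, y_train, y_test = train_test_split(
        features, prices, test_size=0.2, random_state=42)

    # scikit-learn handles the model training algorithm for you.
    model = LinearRegression()
    model.fit(X_train, y_train)

    # Root mean square error between predictions and true test-set prices.
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(mse ** 0.5)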
Interpreting Results
In general, as your model improves, the RMSE decreases. You may still not be
confident about whether the specific value you’ve computed is good or bad.
Model inference
Now you are ready to put your model into action. As you can see in the
following image, this means seeing how well it predicts with new data not
seen during model training.
In this example, you saw how you can use machine learning to help predict
home prices.
One
Solving problems using machine learning is an evolving and iterative
process.
Two
To solve a problem successfully in machine learning, finding high quality
data is essential.
Three
To evaluate models, you often use statistical metrics. The metrics you
choose are tailored to a specific use case.
Terminology
In the lesson about building your dataset, you learned about how
sometimes it is necessary to change the format of the data that you want to
use. In this case study, we need to use a process called vectorization.
Vectorization is a process whereby words are converted into numbers.
For this project, you believe capitalization and verb tense will not matter,
and therefore you remove capitals and convert all verbs to the same tense
using a Python library built for processing human language. You also remove
punctuation and words you don’t think have useful meaning,
like 'a' and 'the'. The machine learning community refers to these words
as stop words.
Data preprocessing
Before you can train the model, you need to do a type of data preprocessing
called data vectorization, which is used to convert text into numbers.
As shown in the following image, you transform this book description text
into what is called a bag of words representation, so that it is
understandable by machine learning models.
How the bag of words representation works is beyond the scope of this
lesson. If you are interested in learning more, see the What's next section at
the end of this chapter.
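As a hedged sketch (not part of the original lesson), here is how a
bag-of-words representation might be built with scikit-learn's
CountVectorizer; the book descriptions are invented.

    from sklearn.feature_extraction.text import CountVectorizer

    # Hypothetical back-of-book descriptions.
    descriptions = [
        "A teen vampire falls in love in a small town",
        "Two pen pals fall in love across the ocean",
    ]

    # Bag of words: each description becomes a vector of word counts, with
    # English stop words (like 'a' and 'the') removed during vectorization.
    vectorizer = CountVectorizer(stop_words="english")
    vectors = vectorizer.fit_transform(descriptions)

    print(vectorizer.get_feature_names_out())
    print(vectors.toarray())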
You pick a common cluster-finding model called k-means. In this model, you
can change a model parameter, k, to be equal to how many clusters the
model will try to find in your dataset.
Your data is unlabeled and you don't know how many micro-genres might exist.
So, you train your model multiple times using different values for k each
time.
What does this even mean? In the following graphs, you can see examples
of when k=2 and when k=3.
During the model evaluation phase, you plan on using a metric to find which
value for k is the most appropriate.
Model evaluation
In machine learning, numerous statistical metrics or methods are available
to evaluate a model. In this use case, the silhouette coefficient is a good
choice. This metric describes how well your data was clustered by the
model. To find the optimal number of clusters, you plot the silhouette
coefficient as shown in the following image. You find that the optimal
value is when k=19.
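Here is a minimal sketch (not from the original lesson) of training k-means
for several values of k and comparing them with the silhouette coefficient;
random vectors stand in for the real bag-of-words matrix.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Stand-in for the bag-of-words matrix: 60 hypothetical book descriptions,
    # each represented as a 5-dimensional vector.
    rng = np.random.default_rng(42)
    vectors = rng.random((60, 5))

    # Train the model once per candidate value of k and keep the best score.
    best_k, best_score = None, -1.0
    for k in range(2, 10):
        model = KMeans(n_clusters=k, n_init=10, random_state=42)
        cluster_labels = model.fit_predict(vectors)
        score = silhouette_score(vectors, cluster_labels)
        if score > best_score:
            best_k, best_score = k, score

    print(best_k, best_score)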
Often, machine learning practitioners do a manual evaluation of the model's
findings.
You find one cluster that contains a large collection of books that you can
categorize as "paranormal teen romance." This trend is known in your
industry, and therefore you feel somewhat confident in your machine
learning approach. You don’t know if every cluster is going to be as cohesive
as this, but you decide to use this model to see if you can find anything
interesting about which to write an article.
As you inspect the different clusters found when k=19, you find a
surprisingly large cluster of books. Here's an example from fictionalized
cluster #7.
As you inspect the preceding table, you can see that most of these text
snippets indicate that the characters are in some kind of long-distance
relationship. You see a few other self-consistent clusters and feel you now
have enough useful data to begin writing an article on unexpected modern
romance micro-genres.
In this example, you saw how you can use machine learning to help find
micro-genres in books by using the text found on the back of the book. Here
is a summary of key moments from the lesson you just finished.
One
For some applications of machine learning, you need to not only clean and
preprocess the data but also convert the data into a format that is machine
readable. In this example, the words were converted into numbers through
a process called data vectorization.
Two
Solving problems in machine learning requires iteration. In this example, you
saw how it was necessary to train the model multiple times for different
values of k. After training your model over multiple iterations, you saw how
the silhouette coefficient could be used to determine the optimal value for k.
Three
During model inference, you continued to inspect the clusters for accuracy to
ensure that your model was generating useful predictions.
Terminology
As shown in the image above, your goal will be to predict if each image
belongs to one of the following classes:
Contains spill
Does not contain spill
Building a dataset
Collecting
Split your image data into a training dataset and a test dataset.
Model Training
Traditionally, solving this problem would require hand-engineering features
on top of the underlying pixels (for example, locations of prominent edges
and corners in the image), and then training a model on these features.
Today, deep neural networks are the most common tool used for solving
this kind of problem. Many deep neural network models are structured to
learn the features on top of the underlying pixels automatically, so you don’t
have to engineer them by hand. You’ll have a chance to take a deeper look at
this in the next lesson,
so we’ll keep things high-level for now.
Model evaluation
As you saw in the last example, there are many different statistical metrics
that you can use to evaluate your model. As you gain more experience in
machine learning, you will learn how to research which metrics can help you
evaluate your model most effectively. Here's a list of common metrics:
Accuracy
Confusion matrix
F1 score
False positive rate
False negative rate
Log loss
Negative predictive value
Precision
Recall
ROC Curve
Specificity
In cases such as this, accuracy might not be the best evaluation mechanism.
Why not? The model will see the 'does not contain spill' class almost all the
time, so any model that just predicts 'no spill' most of the time will seem
pretty accurate.
What you really care about is an evaluation tool that rarely misses a real
spill.
After doing some internet sleuthing, you realize this is a common problem
and that precision and recall will be effective. Think of precision as
answering the question, "Of all predictions of a spill, how many were right?"
and recall as answering the question, "Of all actual spills, how many did we
detect?"
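To make this concrete, here is a small made-up example (not from the original
lesson) of computing precision and recall with scikit-learn for an imbalanced
spill-detection dataset.

    from sklearn.metrics import precision_score, recall_score

    # Hypothetical labels for ten images: 1 = contains spill, 0 = does not.
    true_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
    predicted = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

    # Precision: of all predicted spills, how many were right? (2 of 3 here)
    print(precision_score(true_labels, predicted))
    # Recall: of all actual spills, how many did we detect? (2 of 3 here)
    print(recall_score(true_labels, predicted))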
Manual evaluation plays an important role. If you are unsure whether your
staged spills are sufficiently realistic compared to actual spills, you can get
a better sense of how well your model performs with actual spills by finding
additional examples from historical records. This allows you to confirm that
your model is performing satisfactorily.
Model inference
The model can be deployed on a system that enables you to run machine
learning workloads such as AWS Panorama.
Thankfully, most of the time, the results will be from the 'does not contain
spill' class.
But when the 'contains spill' class is detected, a simple paging system could
alert the team to respond.
In this example, you saw how you can use machine learning to help detect
spills in a work environment. This example also used a modern machine
learning technique called a convolutional neural network (CNN).
Here is a summary of key moments from the lesson that you just finished.
One
For some applications of machine learning, you need to use more
complicated techniques to solve the problem. While modern neural networks
are a powerful tool, don’t forget that they come at a cost: they are not easily
explained.
Two
High-quality data was once again very important to the success of this
application, to the point where even staging some fake data was required.
Once again, data vectorization was also needed: the images had to be
converted into numbers so that they could be used by the neural network.
Three
During model inference you continued to inspect the predictions for
accuracy. It is especially important in this case because you created some
fake data to use when training your model.
Terminology