Machine learning algorithms learn from data to solve problems that are too complex to handle with conventional, rule-based programming.
Machine learning defined
Machine learning is a branch of artificial intelligence that includes methods, or algorithms, for automatically creating models from data. Unlike a system that performs a task by following explicit rules, a machine learning system learns from experience. Whereas a rule-based system will perform a task the same way every time (for better or worse), the performance of a machine learning system can be improved through training, by exposing the algorithm to more data.
Machine learning algorithms are often divided into supervised (the training data are tagged with the answers) and unsupervised (any labels that may exist are not shown to the training algorithm). Supervised machine learning problems are further divided into classification (predicting categorical answers, such as whether a mortgage payment will be missed) and regression (predicting numeric answers, such as the number of widgets that will sell next month in your Manhattan store).
Unsupervised learning is further divided into clustering (finding groups of similar objects, such as running shoes, walking shoes, and dress shoes), association (finding common sequences of objects, such as coffee and cream), and dimensionality reduction (projection, feature selection, and feature extraction).
Applications of machine learning
We hear about applications of machine learning on a daily basis, although not all of them are unalloyed successes. Self-driving cars are a good example, where tasks range from simple and successful (parking assist and highway lane following) to complex and iffy (full vehicle control in urban settings, which has led to several deaths).
Game-playing machine learning is strongly successful for checkers, chess, shogi, and Go, having beaten human world champions. Automatic language translation has been largely successful, although some language pairs work better than others, and many automatic translations can still be improved by human translators.
Automatic speech to text works fairly well for people with mainstream accents, but not so well for people with some strong regional or national accents; performance depends on the training sets used by the vendors. Automatic sentiment analysis of social media has a reasonably good success rate, probably because the training sets (e.g. Amazon product ratings, which couple a comment with a numerical score) are large and easy to access.
Automatic screening of résumés is a controversial area. Amazon had to withdraw its internal system because of training sample biases that caused it to downgrade all job applications from women.
Other résumé screening systems currently in use may have training biases that cause them to upgrade candidates who are “like” current employees in ways that legally aren’t supposed to matter (e.g. young, white, male candidates from upscale English-speaking neighborhoods who played team sports are more likely to pass the screening). Research efforts by Microsoft and others focus on eliminating implicit biases in machine learning.
Automatic classification of pathology and radiology images has advanced to the point where it can assist (but not replace) pathologists and radiologists for the detection of certain kinds of abnormalities. Meanwhile, facial identification systems are both controversial when they work well (because of privacy considerations) and tend not to be as accurate for women and people of color as they are for white males (because of biases in the training population).
Machine learning algorithms
Machine learning depends on a number of algorithms for turning a data set into a model. Which algorithm works best depends on the kind of problem you’re solving, the computing resources available, and the nature of the data. No matter what algorithm or algorithms you use, you’ll first need to clean and condition the data.
Let’s discuss the most common algorithms for each kind of problem.
Classification algorithms
A classification problem is a supervised learning problem that asks for a choice between two or more classes, usually providing probabilities for each class. Leaving out neural networks and deep learning, which require a much higher level of computing resources, the most common algorithms are Naive Bayes, Decision Tree, Logistic Regression, K-Nearest Neighbors, and Support Vector Machine (SVM). You can also use ensemble methods (combinations of models), such as Random Forest, other Bagging methods, and boosting methods such as AdaBoost and XGBoost.
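For illustration, here is a minimal scikit-learn sketch that fits two of these classifiers on a synthetic data set; the data set, parameter values, and model choices are placeholders rather than recommendations.

```python
# Minimal sketch: two common classifiers on a synthetic data set (illustrative values only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=100, random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # accuracy on held-out data
    print(model.predict_proba(X_test[:3]))                    # per-class probabilities
```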
Regression algorithms
A regression problem is a supervised learning problem that asks the model to predict a number. The simplest and fastest algorithm is linear (least squares) regression, but you shouldn’t stop there, because it often gives you a mediocre result. Other common machine learning regression algorithms (short of neural networks) include Decision Tree, K-Nearest Neighbors, Support Vector Regression (SVR), LARS Lasso, Elastic Net, Random Forest, AdaBoost, and XGBoost. You’ll notice that there is some overlap between machine learning algorithms for regression and classification.
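As a sketch, the same scikit-learn pattern works for regression; here plain linear regression is compared with a Random Forest on synthetic data, with all values illustrative.

```python
# Minimal sketch: linear regression vs. a tree ensemble on synthetic data (illustrative values only).
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for model in (LinearRegression(), RandomForestRegressor(n_estimators=100, random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))  # R^2 score on held-out data
```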
Clustering algorithms
A clustering problem is an unsupervised learning problem that asks the model to find groups of similar data points. The most popular algorithm is K-Means Clustering; others include Mean-Shift Clustering, DBSCAN (Density-Based Spatial Clustering of Applications with Noise), GMM (Gaussian Mixture Models), and HAC (Hierarchical Agglomerative Clustering).
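A minimal K-Means sketch in scikit-learn, assuming three clusters in a synthetic data set; in practice you would have to choose or tune the number of clusters.

```python
# Minimal sketch: K-Means on synthetic blobs; the choice of three clusters is an assumption.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
print(kmeans.cluster_centers_)   # coordinates of the three cluster centers
print(kmeans.labels_[:10])       # cluster assignment for the first ten points
```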
Dimensionality reduction algorithms
Dimensionality reduction is an unsupervised learning problem that asks the model to drop or combine variables that have little or no effect on the result. It is often used in combination with classification or regression. Dimensionality reduction techniques include removing variables with many missing values, removing variables with low variance, removing or combining variables with high correlation, selecting features by their importance in a Decision Tree or Random Forest model, Backward Feature Elimination, Forward Feature Selection, Factor Analysis, and PCA (Principal Component Analysis).
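As a sketch, here is PCA in scikit-learn, projecting the 30 numeric features of a sample data set down to five principal components; the number of components kept is an arbitrary choice for illustration.

```python
# Minimal sketch: PCA after standardization; keeping five components is illustrative.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=5).fit(X_scaled)
print(pca.explained_variance_ratio_)           # variance captured by each component
X_reduced = pca.transform(X_scaled)            # 5 columns instead of 30
```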
Optimization methods
Training and evaluation turn supervised learning algorithms into models by optimizing their parameter weights to find the set of values that best matches the ground truth of your data. The algorithms often rely on variants of steepest descent for their optimizers, for example stochastic gradient descent (SGD), which is essentially steepest descent performed on small, randomly chosen batches of the training data rather than on the entire data set at once.
Common refinements on SGD add factors that correct the direction of the gradient based on momentum, or adjust the learning rate based on progress from one pass through the data (called an epoch) to the next.
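A minimal PyTorch sketch of those two refinements, using SGD with momentum plus a step learning-rate schedule to fit a trivial linear model to random data; the model, data, and schedule are placeholders.

```python
# Minimal sketch: SGD with momentum and a per-epoch learning-rate schedule (illustrative model and data).
import torch

X = torch.randn(256, 4)
y = X @ torch.tensor([1.0, -2.0, 0.5, 3.0]) + 0.1 * torch.randn(256)

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)            # momentum term
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # shrink lr every 10 epochs
loss_fn = torch.nn.MSELoss()

for epoch in range(50):
    optimizer.zero_grad()
    loss = loss_fn(model(X).squeeze(), y)
    loss.backward()       # compute gradients
    optimizer.step()      # one gradient-descent update
    scheduler.step()      # adjust the learning rate on the schedule
```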
Neural networks and deep learning
Neural networks were inspired by the architecture of the biological visual cortex. Deep learning is a set of techniques for learning in neural networks that involves a large number of “hidden” layers to identify features. Hidden layers come between the input and output layers. Each layer is made up of artificial neurons, often with sigmoid or ReLU (Rectified Linear Unit) activation functions.
In a feed-forward network, the neurons are organized into distinct layers: one input layer, any number of hidden processing layers, and one output layer, and the outputs from each layer go only to the next layer.
In a feed-forward network with shortcut connections, some connections can jump over one or more intermediate layers. In recurrent neural networks, neurons can influence themselves, either directly, or indirectly through the next layer.
Supervised learning of a neural network is done just like any other machine learning: You present the network with groups of training data, compare the network output with the desired output, generate an error vector, and apply corrections to the network based on the error vector, usually using a backpropagation algorithm. Groups of training examples that are run together before the corrections are applied are called batches; one full pass through the entire training set is called an epoch.
As with all machine learning, you need to check the predictions of the neural network against a separate test data set. Without doing that you risk creating neural networks that only memorize their inputs instead of learning to be generalized predictors.
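To make that concrete, here is a minimal tf.keras sketch of a feed-forward network with one hidden layer, trained by backpropagation and then checked against a held-out test set; the layer sizes, optimizer, and epoch count are illustrative.

```python
# Minimal sketch: a small feed-forward network trained and then evaluated on held-out data.
import tensorflow as tf
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer of artificial neurons
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1)
print(model.evaluate(X_test, y_test))  # check generalization on the test set
```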
The breakthrough in the neural network field for vision was Yann LeCun’s 1998 LeNet-5, a seven-level convolutional neural network (CNN) for recognition of handwritten digits digitized in 32×32 pixel images. To analyze higher-resolution images, the network would need more neurons and more layers.
Convolutional neural networks typically use convolutional, pooling, ReLU, fully connected, and loss layers to simulate a visual cortex. The convolutional layer basically takes the integrals of many small overlapping regions. The pooling layer performs a form of non-linear down-sampling. ReLU layers, which I mentioned earlier, apply the non-saturating activation function f(x) = max(0, x).
In a fully connected layer, the neurons have full connections to all activations in the previous layer. A loss layer computes how the network training penalizes the deviation between the predicted and true labels, using a Softmax or cross-entropy loss for classification or a Euclidean loss for regression.
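Put together, a small LeNet-style network in tf.keras might look like the sketch below, with convolutional, pooling, ReLU, fully connected, and softmax/cross-entropy loss layers; the exact sizes are illustrative rather than a reproduction of LeNet-5.

```python
# Minimal sketch: a tiny LeNet-style CNN showing the layer types described above (illustrative sizes).
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),                            # 32x32 grayscale images
    tf.keras.layers.Conv2D(6, kernel_size=5, activation="relu"),  # convolution + ReLU
    tf.keras.layers.MaxPooling2D(pool_size=2),                    # pooling: non-linear down-sampling
    tf.keras.layers.Conv2D(16, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation="relu"),                # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),              # 10 digit classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```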
Natural language processing (NLP) is another major application area for deep learning. In addition to the machine translation problem addressed by Google Translate, major NLP tasks include automatic summarization, co-reference resolution, discourse analysis, morphological segmentation, named entity recognition, natural language generation, natural language understanding, part-of-speech tagging, sentiment analysis, and speech recognition.
In addition to CNNs, NLP tasks are often addressed with recurrent neural networks (RNNs), which include the Long Short-Term Memory (LSTM) model.
The more layers there are in a deep neural network, the more computation it takes to train the model on a CPU. Hardware accelerators for neural networks include GPUs, TPUs, and FPGAs.
Reinforcement learning
Reinforcement learning trains an actor or agent to respond to an environment in a way that maximizes some value, usually by trial and error. That’s different from supervised and unsupervised learning, but is often combined with them.
For example, DeepMind’s AlphaGo, in order to learn to play (the action) the game of Go (the environment), first learned to mimic human Go players from a large data set of historical games (apprentice learning). It then improved its play by trial and error (reinforcement learning), by playing large numbers of Go games against independent instances of itself.
Robotic control is another problem that has been attacked with deep reinforcement learning methods, meaning reinforcement learning plus deep neural networks, the deep neural networks often being CNNs trained to extract features from video frames.
How to use machine learning
How does one go about creating a machine learning model? You start by cleaning and conditioning the data, continue with feature engineering, and then try every machine learning algorithm that makes sense. For certain classes of problem, such as vision and natural language processing, the algorithms that are likely to work involve deep learning.
Data cleaning for machine learning
There is no such thing as clean data in the wild. To be useful for machine learning, data must be aggressively filtered. For example, you’ll want to:
- Look at the data and exclude any columns that have a lot of missing data.
- Look at the data again and pick the columns you want to use (feature selection) for your prediction. This is something you may want to vary when you iterate.
- Exclude any rows that still have missing data in the remaining columns.
- Correct obvious typos and merge equivalent answers. For example, U.S., US, USA, and America should be merged into a single category.
- Exclude rows that have data that is out of range. For example, if you’re analyzing taxi trips within New York City, you’ll want to filter out rows with pickup or drop-off latitudes and longitudes that are outside the bounding box of the metropolitan area.
There is a lot more you can do, but it will depend on the data collected. This can be tedious, but if you set up a data-cleaning step in your machine learning pipeline you can modify and repeat it at will.
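As a sketch of what such a pipeline step might look like, here is a pandas version of a few of the bullets above applied to a hypothetical taxi-trip file; the file name, column names, and bounding-box coordinates are all placeholders.

```python
# Minimal sketch: a repeatable cleaning step for a hypothetical taxi-trip data set.
import pandas as pd

df = pd.read_csv("taxi_trips.csv")  # hypothetical input file

df = df.dropna(axis="columns", thresh=int(0.7 * len(df)))  # drop columns that are mostly missing
df = df[["pickup_latitude", "pickup_longitude", "payment_type", "fare_amount"]]  # feature selection
df = df.dropna()                                           # drop rows still missing values

df["payment_type"] = df["payment_type"].replace({"CSH": "Cash", "CASH": "Cash"})  # merge equivalent answers

# keep only pickups inside an (illustrative) New York City bounding box
df = df[df["pickup_latitude"].between(40.4, 41.0) &
        df["pickup_longitude"].between(-74.3, -73.6)]
```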
Data encoding and normalization for machine learning
To use categorical data for machine classification, you need to encode the text labels into another form. There are two common encodings.
One is label encoding, which means that each text label value is replaced with a number. The other is one-hot encoding, which means that each text label value is turned into a column with a binary value (1 or 0). Most machine learning frameworks have functions that do the conversion for you. In general, one-hot encoding is preferred, as label encoding can sometimes confuse the machine learning algorithm into thinking that the encoded column is ordered.
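A quick sketch of the two encodings, using scikit-learn's LabelEncoder and pandas get_dummies on a toy column:

```python
# Minimal sketch: label encoding vs. one-hot encoding of a text column (toy data).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

labels = LabelEncoder().fit_transform(colors["color"])  # label encoding: [2, 1, 0, 1], implies an order
onehot = pd.get_dummies(colors["color"])                # one-hot encoding: one binary column per value
print(labels)
print(onehot)
```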
To use numeric data for machine regression, you usually need to normalize the data. Otherwise, the numbers with larger ranges may dominate the Euclidean distance between feature vectors, their effects can be magnified at the expense of the other fields, and the steepest descent optimization may have difficulty converging. There are a number of ways to normalize and standardize data for machine learning, including min-max normalization, mean normalization, standardization, and scaling to unit length. This process is often called feature scaling.
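Here's a minimal scikit-learn sketch of two of those scaling options applied to the same toy matrix:

```python
# Minimal sketch: min-max normalization vs. standardization (toy data).
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])  # columns with very different ranges

print(MinMaxScaler().fit_transform(X))    # min-max normalization: each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # standardization: zero mean, unit variance per column
```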
Feature engineering for machine learning
A feature is an individual measurable property or characteristic of a phenomenon being observed. The concept of a “feature” is related to that of an explanatory variable, which is used in statistical techniques such as linear regression. Feature vectors combine all the features for a single row into a numerical vector.
Part of the art of choosing features is to pick a minimum set of independent variables that explain the problem. If two variables are highly correlated, either they need to be combined into a single feature, or one should be dropped. Sometimes people perform principal component analysis to convert correlated variables into a set of linearly uncorrelated variables.
Some of the transformations that people use to construct new features or reduce the dimensionality of feature vectors are simple. For example, subtract Year of Birth from Year of Death and you construct Age at Death, which is a prime independent variable for lifetime and mortality analysis. In other cases, feature construction may not be so obvious.
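In pandas, the simple case above is a one-line column operation; the column names here are hypothetical.

```python
# Minimal sketch: constructing the Age at Death feature from two existing columns.
import pandas as pd

people = pd.DataFrame({"year_of_birth": [1900, 1885, 1920],
                       "year_of_death": [1975, 1950, 2001]})
people["age_at_death"] = people["year_of_death"] - people["year_of_birth"]  # new derived feature
```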
Splitting data for machine learning
The usual practice for supervised machine learning is to split the data set into subsets for training, validation, and test. One way of working is to assign 80% of the data to the training data set, and 10% each to the validation and test data sets. (The exact split is a matter of preference.) The bulk of the training is done against the training data set, and prediction is done against the validation data set at the end of every epoch.
The errors in the validation data set can be used to identify stopping criteria, or to drive hyperparameter tuning. Most importantly, the errors in the validation data set can help you find out whether the model has overfit the training data.
Prediction against the test data set is typically done on the final model. If the test data set was never used for training, it is sometimes called the holdout data set.
There are several other schemes for splitting the data. One common technique, cross-validation, involves repeatedly splitting the full data set into a training data set and a validation data set; in k-fold cross-validation, for example, the data is divided into k folds and each fold takes a turn as the validation set, so that every data point is used for validation once.
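For illustration, here is a scikit-learn sketch of an 80/10/10 split and of 5-fold cross-validation as an alternative; the data set and model are placeholders.

```python
# Minimal sketch: an 80/10/10 train/validation/test split, plus k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)

# 80/10/10 split: carve off 20%, then split that 20% in half
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# cross-validation: repeatedly re-split the training data instead of using one fixed validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X_train, y_train, cv=5)
print(scores.mean())
```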
AutoML and hyperparameter optimization
AutoML and hyperparameter optimization are ways of getting the computer to try many models and identify the best one. With AutoML (as usually defined) the computer tries all of the appropriate machine learning models, and may also try all of the appropriate feature engineering and feature scaling techniques. With hyperparameter optimization, you typically define which hyperparameters you would like to sweep for a specific model—such as the number of hidden layers, the learning rate, and the dropout rate—and the range you would like to sweep for each.
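A minimal hyperparameter sweep in scikit-learn might look like the sketch below, using a grid search over two Random Forest hyperparameters; the parameter ranges are illustrative, and dedicated AutoML and sweep services offer far more sophisticated search strategies.

```python
# Minimal sketch: a grid-search hyperparameter sweep over a Random Forest (illustrative ranges).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)  # best hyperparameter combination found
```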
Google has a different definition for Google Cloud AutoML. Instead of trying every appropriate machine learning model, it attempts to customize a relevant deep learning model (vision, translation, or natural language) using deep transfer learning. Azure Machine Learning Service offers similar transfer learning services by different names: custom vision, customizable speech and translation, and custom search.
Machine learning in the cloud
You can run machine learning and deep learning on your own machines or in the cloud. AWS, Azure, and Google Cloud all offer machine learning services that you can use on demand, and they offer hardware accelerators on demand as well.
While there are free tiers on all three services, you may eventually run up monthly bills, especially if you use large instances with GPUs, TPUs, or FPGAs. You need to balance this operating cost against the capital cost of buying your own workstation-class computers and GPUs. If you need to train a lot of models on a consistent basis, then buying at least one GPU for your own use makes sense.
The big advantage of using the cloud for machine learning and deep learning is that you can spin up significant resources in a matter of minutes, run your training quickly, and then release the cloud resources. Also, all three major clouds offer machine learning and deep learning services that don’t require a Ph.D. in data science to run. You have the option of using their pre-trained models, customizing their models for your own use, or creating your own models with any of the major machine learning and deep learning frameworks, such as Scikit-learn, PyTorch, and TensorFlow.
There are also free options for running machine learning and deep learning Jupyter notebooks: Google Colab and Kaggle (recently acquired by Google). Colab offers a choice of CPU, GPU, and TPU instances. Kaggle offers CPU and GPU instances, along with competitions, data sets, and shared kernels.
Machine learning in more depth
You can learn a lot about machine learning and deep learning simply by installing one of the deep learning packages, trying out its samples, and reading its tutorials. For more depth, consider one or more of the following resources.
- Neural Networks and Deep Learning by Michael Nielsen
- A Brief Introduction to Neural Networks by David Kriesel
- Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
- A Course in Machine Learning by Hal Daumé III
- TensorFlow Playground by Daniel Smilkov and Shan Carter
- Stanford Computer Science CS231n: Convolutional Neural Networks for Visual Recognition