Data Science Techniques: Classification, Regression and Clustering
Now, let's look closer at the various data science techniques and methods that are
available to perform the analysis.
Classification techniques
The primary question data scientists are looking to answer in classification
problems is, "What category does this data belong to?" There are many reasons for
classifying data into categories. Perhaps the data is an image of handwriting and
you want to know what letter or number the image represents. Or perhaps the data
represents a loan application and you want to know whether it belongs in the
"approved" or "declined" category. Other classifications could focus on determining
patient treatments or whether an email message is spam.
The algorithms and methods that data scientists use to sort data into categories
include the following, among others:
Decision trees. A decision tree is a branching logic structure that uses
machine-generated trees of parameters and values to classify data into
defined categories.
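To make the branching idea concrete, here is a minimal sketch of a decision tree built in plain Python. It is an illustration only, not a production implementation: the loan figures are made up, the function names (gini, best_split, build_tree, predict) are hypothetical, and the choice of Gini impurity as the splitting criterion is an assumption, since other criteria are also in common use.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must put data on both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursively grow the tree; a leaf is simply the majority category."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the branches until a leaf (a category label) is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Made-up loan applications: (annual income, existing debt), both in $1,000s
applications = [(85, 10), (30, 25), (60, 5), (25, 30), (95, 40), (40, 35)]
decisions = ["approved", "declined", "approved", "declined", "approved", "declined"]

tree = build_tree(applications, decisions)
print(predict(tree, (70, 15)))
print(predict(tree, (28, 20)))
```

On this toy data, the tree needs only a single split on income to separate the two categories. Real data sets typically produce deeper trees and call for safeguards against overfitting, such as the depth limit used here.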
Some classification techniques, such as decision trees, SVMs and neural networks,
can also be used for regression. In addition, the regression techniques available
to data scientists include the following:
Linear regression. One of the most widely used data science methods,
this approach tries to find the line that best fits the data being analyzed
based on the correlation between two variables.
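As a rough illustration of finding the best-fitting line, the sketch below uses ordinary least squares for two variables in plain Python. The data values and the fit_line helper are made up for the example, and least squares is assumed here as the definition of "best fit".

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: advertising spend vs. sales, in arbitrary units
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(spend, sales)
predicted = slope * 6.0 + intercept  # estimate sales at a spend of 6.0
print(round(slope, 2), round(intercept, 2), round(predicted, 2))
```

The slope describes how much the second variable changes, on average, for each unit of change in the first. In practice, most teams would use a library routine rather than hand-written arithmetic, but the calculation underneath is the same.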
Association analysis is a related, but separate, technique. The main idea behind it is
to find association rules that describe the commonality between different data
points. As with clustering, we're looking to find groups that data belongs to.
However, in this case, we're trying to determine when data points will occur
together, rather than just identify clusters of them. In clustering, the goal is to
segregate a large data set into identifiable groups, whereas with association
analysis, we're measuring the degree of association between data points.
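One common way to quantify how strongly items are associated is with support, confidence and lift. The sketch below computes all three for a single rule in plain Python. The shopping baskets are made up, and the helper functions are hypothetical names chosen for the example rather than part of any particular library.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is overall (above 1 suggests a positive association)."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Made-up shopping baskets
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Evaluate the rule "bread -> butter"
rule_support = support(baskets, {"bread", "butter"})
rule_confidence = confidence(baskets, {"bread"}, {"butter"})
rule_lift = lift(baskets, {"bread"}, {"butter"})
print(rule_support, rule_confidence, rule_lift)
```

Here, a lift above 1 indicates that bread and butter appear together more often than would be expected if they were purchased independently, which is the kind of co-occurrence that association analysis is designed to surface.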
Using these techniques, data scientists can tackle a wide range of applications,
many of which are commonly seen across different types of industries and
organizations. Here are a few examples.
Anomaly detection. If you can find the pattern for expected or "normal" data, then
you can also find those data points that don't fit the pattern. Companies in
industries as diverse as financial services, healthcare, retail and manufacturing
regularly employ a variety of data science methods to identify anomalies in their
data for uses such as fraud detection, customer analytics, cybersecurity and IT
systems monitoring. Anomaly detection can also be used to eliminate outlier values
from data sets for better analytics accuracy.
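As a simple illustration of flagging data points that don't fit the expected pattern, the sketch below uses z-scores, which measure how many standard deviations a value lies from the mean. This is only one of many possible approaches. The transaction amounts are made up, and both the find_anomalies helper and the threshold of 2.5 are assumptions chosen for the example.

```python
from statistics import mean, pstdev

def find_anomalies(values, threshold=2.5):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Made-up daily transaction amounts with one unusual value
amounts = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101]

anomalies = find_anomalies(amounts)
# Drop the flagged outliers before further analysis
cleaned = [v for v in amounts if v not in anomalies]
print(anomalies, len(cleaned))
```

One caveat: extreme values inflate the mean and standard deviation themselves, so a plain z-score can miss anomalies in small or heavily skewed data sets. Measures based on the median are often more robust in those cases.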