Data Science Techniques Classification Regression and Clustering



Data Science Techniques - Classification, Regression and Clustering
Data Science Honors course (Savitribai Phule Pune University)

Studocu is not sponsored or endorsed by any college or university


Downloaded by Nirnay Patil ([email protected])

Now, let's look closer at the various data science techniques and methods that are
available to perform the analysis.

Classification techniques
The primary question data scientists are looking to answer in classification
problems is, "What category does this data belong to?" There are many reasons for
classifying data into categories. Perhaps the data is an image of handwriting and
you want to know what letter or number the image represents. Or perhaps the data
represents loan applications and you want to know if it should be in the "approved"
or "declined" category. Other classifications could be focused on determining
patient treatments or whether an email message is spam.

The algorithms and methods that data scientists use to filter data into categories
include the following, among others:

• Decision trees. These are branching logic structures that use
machine-generated trees of parameters and values to classify data into defined
categories.

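• Naïve Bayes classifiers. These classifiers apply Bayes' theorem, under the
"naïve" assumption that the features of the data are independent of one another
within each category, to estimate the most probable category for a data point.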

• Support vector machines. SVMs aim to draw a line or plane with a wide margin
to separate data into different categories.

• K-nearest neighbor. This technique uses a simple "lazy learning" method to
identify what category a data point should belong to based on the categories of
its nearest neighbors in a data set.

• Logistic regression. A classification technique despite its name, it fits an
S-shaped (sigmoid) curve to the data to estimate the probability that a data
point belongs to a category. A threshold on that probability then assigns the
point to one category or another, rather than allowing more fluid correlations.

• Neural networks. This approach uses trained artificial neural networks,
especially deep learning ones with multiple hidden layers. Neural nets have
shown profound capabilities for classification with extremely large sets of
training data.
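
To make one of these methods concrete, here is a minimal sketch of the k-nearest neighbor idea using only the Python standard library. The function name `knn_predict`, the choice of Euclidean distance and the toy loan data are assumptions made for illustration; they do not come from any particular library or data set.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs, where features is a tuple
    of numbers with the same length as `query`.
    """
    # "Lazy" learning: nothing is fitted in advance, all work happens at query time.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: (income, debt) in thousands -> loan decision
train = [
    ((80, 5), "approved"),
    ((95, 10), "approved"),
    ((70, 8), "approved"),
    ((30, 40), "declined"),
    ((25, 35), "declined"),
    ((40, 50), "declined"),
]

print(knn_predict(train, (85, 7)))   # neighbors are all "approved" examples
print(knn_predict(train, (28, 42)))  # neighbors are all "declined" examples
```

In practice the features would usually be scaled first, because a distance-based vote is dominated by whichever feature has the largest numeric range.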

Regression techniques
What if, instead of trying to find out which category the data falls into, you'd
like to know the relationship between different variables? The main idea of
regression is to answer the question, "What is the predicted value for this data?"
The concept takes its name from the statistical idea of "regression to the mean."
It can be either a straightforward regression between one independent and one
dependent variable or a multidimensional one that tries to find the relationship
between multiple variables.

Some classification techniques, such as decision trees, SVMs and neural networks,
can also be used to do regressions. In addition, the regression techniques available
to data scientists include the following:

• Linear regression. One of the most widely used data science methods, this
approach tries to find the line that best fits the data being analyzed, typically
by minimizing the squared differences between the line and the observed values of
two correlated variables.

• Lasso regression. Lasso, short for "least absolute shrinkage and selection
operator," is a technique that improves upon the prediction accuracy of linear
regression models by shrinking coefficients and keeping only a subset of the
variables in the final model.

• Multivariate regression. This involves different ways to find lines or planes
that fit multiple dimensions of data potentially containing many variables.
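
As a small illustration of the simplest case above, the following sketch fits a straight line to one independent and one dependent variable with ordinary least squares. The function names `fit_line` and `predict` and the sample numbers are hypothetical, chosen only so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Answer "what is the predicted value for this data?" for a new x."""
    return slope * x + intercept


# Hypothetical data that lies exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(predict(6, slope, intercept))
```

With real, noisy data the points will not sit exactly on the line; the fitted line is then the one with the smallest total squared vertical distance to the points.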

Clustering and association analysis techniques
Another set of data science techniques focuses on answering the question, "How
does this data form into groups, and which groups do different data points belong
to?" Data scientists can discover clusters of related data points that share various
characteristics in common, which can yield useful information in analytics
applications.

The methods available for clustering include the following:

• K-means clustering. A k-means algorithm partitions a data set into a specified
number of clusters (k) and finds the "centroids" that identify where different
clusters are located, with data points assigned to the closest one.

• Mean-shift clustering. Another centroid-based clustering technique, it can be
used separately or to improve on k-means clustering by shifting the designated
centroids toward the densest regions of the data.

• DBSCAN. Short for "Density-Based Spatial Clustering of Applications with
Noise," DBSCAN is another technique for discovering clusters. It groups together
points that lie in densely populated regions and treats points in sparse regions
as noise.

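• Gaussian mixture models. GMMs help find clusters by modeling the data as a
mixture of several Gaussian distributions, so each data point gets a probability
of belonging to each cluster rather than being treated as a singular point with
one fixed assignment.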

• Hierarchical clustering. Similar to a decision tree, this technique uses a
hierarchical, branching approach to find clusters.
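
The k-means procedure described above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not a production implementation: the function name `kmeans` is hypothetical, and using the first k points as starting centroids is an assumption made here to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=100):
    """Simplified k-means; returns (centroids, assignments)."""
    # Assumption: start from the first k points so the sketch is deterministic.
    centroids = [tuple(p) for p in points[:k]]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its closest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # centroids stopped moving, so the clustering has converged
        centroids = new_centroids
    return centroids, assignments


# Hypothetical 2-D points forming two visibly separate groups
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
print(centroids)
```

Note that the number of clusters is an input here, which is one reason density-based methods such as DBSCAN are attractive when that number is not known in advance.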

Association analysis is a related, but separate, technique. The main idea behind it is
to find association rules that describe the commonality between different data
points. Similar to clustering, we're looking to find groups that data belongs to.
However, in this case, we're trying to determine when data points will occur
together, rather than just identify clusters of them. In clustering, the goal is to
segregate a large data set into identifiable groups, whereas with association
analysis, we're measuring the degree of association between data points.
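
One common way to measure that degree of association is with the support, confidence and lift of a rule such as "bread -> butter". The text above does not name these metrics, so treat the sketch below as one possible illustration; the function names and the shopping-basket data are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the share that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence relative to how common `consequent` is overall (above 1 means positive association)."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)


# Hypothetical shopping baskets, one set of items per transaction
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

print(support(transactions, {"bread", "butter"}))
print(confidence(transactions, {"bread"}, {"butter"}))
print(lift(transactions, {"bread"}, {"butter"}))
```

Here bread and butter appear together in three of five baskets, and three of the four baskets containing bread also contain butter, which is the kind of "occurs together" pattern association analysis looks for.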

Data science application examples


The above methods and techniques in the data science tool belt need to be applied
appropriately to specific analytics problems or questions and the data that's
available to address them. Good data scientists must be able to understand the
nature of the problem at hand -- is it clustering, classification or regression? -- and
the best algorithmic approach that can yield the desired answers given the
characteristics of the data. This is why data science is, in fact, a scientific process,
rather than one that has hard and fast rules and allows you to just program your
way to a solution.

Using these techniques, data scientists can tackle a wide range of applications,
many of which are commonly seen across different types of industries and
organizations. Here are a few examples.

Anomaly detection. If you can find the pattern for expected or "normal" data, then
you can also find those data points that don't fit the pattern. Companies in
industries as diverse as financial services, healthcare, retail and manufacturing
regularly employ a variety of data science methods to identify anomalies in their
data for uses such as fraud detection, customer analytics, cybersecurity and IT
systems monitoring. Anomaly detection can also be used to eliminate outlier values
from data sets for better analytics accuracy.
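
As a rough sketch of the idea, the example below treats the mean and standard deviation of a data set as the "normal" pattern and flags values that sit far from it. The z-score rule, the threshold of 2.0, the function name `find_anomalies` and the transaction amounts are all assumptions chosen for illustration; real anomaly detection systems often use more robust or model-based methods.

```python
import statistics


def find_anomalies(values, threshold=2.0):
    """Return values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical transaction amounts with one suspicious value
amounts = [52, 48, 50, 47, 53, 49, 51, 50, 500]

anomalies = find_anomalies(amounts)
cleaned = [v for v in amounts if v not in anomalies]  # outliers removed for later analysis
print(anomalies)
print(cleaned)
```

A caveat of this simple rule is that a large outlier inflates the standard deviation it is measured against, so on small data sets a median-based measure can be a safer choice.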

Binary and multiclass classification. One primary application of classification
techniques is to determine if something is or is not in a particular category. This is
known as binary classification, because we could ask something like, "Is there a cat
in the picture, or not?" A practical business application is to identify contracts or
invoices among piles of documents using image recognition. In multiclass
classification, we have many different categories in a data set and we're trying to
find the best fit for data points. For example, the U.S. Bureau of Labor Statistics
does automated classification of workplace injuries.

Personalization. Organizations looking to personalize interactions with people or
recommend products and services to customers first need to group them into data
buckets with shared characteristics. Effective data science work enables websites,
marketing offers and more to be tailored to the specific needs and preferences of
individuals, using technologies such as recommendation engines and hyper-
personalization systems that are driven by matching the data in detailed profiles of
people.

That's just a sample of useful data science applications. By understanding the
various techniques, methods, tools and analytical approaches, data scientists can
help the organizations that employ them achieve the strategic and competitive
benefits that many business rivals are already enjoying.
