Data Science Techniques Classification Regression and Clustering

Data Science Honors course (Savitribai Phule Pune University)
Now, let's look closer at the various data science techniques and methods that are
available to perform the analysis.
Classification techniques
The primary question data scientists are looking to answer in classification
problems is, "What category does this data belong to?" There are many reasons for
classifying data into categories. Perhaps the data is an image of handwriting and
you want to know what letter or number the image represents. Or perhaps the data
represents loan applications and you want to know whether each one belongs in the
"approved" or "declined" category. Other classifications could be focused on
determining patient treatments or whether an email message is spam.
The algorithms and methods that data scientists use to filter data into categories
include the following, among others:
• Decision trees. These branching logic structures use machine-generated
trees of parameters and values to classify data into defined categories.
• Naïve Bayes classifiers. These apply Bayes' theorem, under the "naïve"
assumption that features are independent of one another within each
category, to estimate the most probable category for a data point.
• Support vector machines. SVMs aim to draw a line or plane with a wide
margin to separate data into different categories.
• K-nearest neighbor. This technique uses a simple "lazy learning" method:
it builds no model in advance and instead identifies what category a data
point should belong to based on the categories of its nearest neighbors in
a data set.
• Logistic regression. A classification technique despite its name, it fits
an S-shaped (sigmoid) curve rather than a straight line to the data. The
curve turns inputs into a probability between 0 and 1, so each data point is
pushed toward one category or the other rather than allowing more fluid
correlations.
• Neural networks. This approach uses trained artificial neural networks,
especially deep learning ones with multiple hidden layers. Neural nets have
shown profound capabilities for classification with extremely large sets of
training data.
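To make the k-nearest neighbor idea concrete, here is a minimal sketch in plain Python. The function name knn_predict, the loan figures and the choice of k=3 are all assumptions made for illustration; they are not taken from any particular library or data set.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (features, label) pairs; features are numeric tuples.
    """
    # Sort the training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda pair: math.dist(pair[0], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical loan applications: (income, existing debt) in thousands.
training_data = [
    ((80, 10), "approved"),
    ((95, 20), "approved"),
    ((70, 5), "approved"),
    ((30, 40), "declined"),
    ((25, 35), "declined"),
    ((40, 50), "declined"),
]

print(knn_predict(training_data, (75, 15)))  # approved
print(knn_predict(training_data, (35, 45)))  # declined
```

Note that no training step takes place: all the work happens when a new point is classified, which is why the method is described as "lazy."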
Regression techniques
What if, instead of trying to find out which category the data falls into, you'd
like to know the relationship between different variables? The main idea of
regression is to answer the question, "What is the predicted value for this data?"
The name comes from the statistical idea of "regression to the mean." A regression
can either be a straightforward one between one independent and one dependent
variable or a multidimensional one that tries to find the relationship between
multiple variables.
Some classification techniques, such as decision trees, SVMs and neural networks,
can also be used to do regressions. In addition, the regression techniques available
to data scientists include the following:
• Linear regression. One of the most widely used data science methods, this
approach tries to find the line that best fits the data being analyzed
based on the relationship between two variables, typically by minimizing
the squared differences between the line and the observed values.
• Lasso regression. Lasso, short for "least absolute shrinkage and selection
operator," is a technique that can improve the prediction accuracy of linear
regression models by penalizing the absolute size of the coefficients. This
shrinks some coefficients to exactly zero, so only a subset of the variables
is used in the final model.
• Multivariate regression. This involves different ways to find lines or
planes that fit multiple dimensions of data potentially containing many
variables.
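As an illustration of the simplest case, one independent and one dependent variable, the following sketch fits a least-squares line in plain Python. The function name fit_line and the sample values are assumptions chosen so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the co-variation of x and y divided by the variation of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)       # 2.0 1.0

# Answering "what is the predicted value for this data?" for x = 6.
predicted = slope * 6 + intercept
print(predicted)              # 13.0
```

Real data rarely falls exactly on a line, in which case the same formulas return the line with the smallest total squared error.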
Clustering and association analysis techniques
Another set of data science techniques focuses on answering the question, "How
does this data form into groups, and which groups do different data points belong
to?" Data scientists can discover clusters of related data points that share various
characteristics in common, which can yield useful information in analytics
applications.
The methods available for clustering include the following:
• K-means clustering. A k-means algorithm partitions a data set into a
specified number of clusters (k) and finds the "centroids" that identify
where different clusters are located, with data points assigned to the
closest one.
• Mean-shift clustering. Another centroid-based clustering technique, it can
be used separately or to improve on k-means clustering by shifting the
designated centroids toward denser regions of the data.
• DBSCAN. Short for "Density-Based Spatial Clustering of Applications with
Noise," DBSCAN is another technique for discovering clusters. It groups
points that lie in densely populated regions and labels points in sparse
regions as noise.
• Gaussian mixture models. GMMs help find clusters by modeling the data as a
mix of Gaussian distributions, so each data point gets a probability of
belonging to each cluster rather than a single hard assignment.
• Hierarchical clustering. Similar to a decision tree, this technique uses a
hierarchical, branching approach to find clusters.
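The centroid idea behind k-means can be sketched in a few lines of plain Python. This is a simplified version under stated assumptions: the function name k_means is illustrative, the first k points serve as starting centroids (practical implementations usually choose starting points more carefully), and the sample points are invented.

```python
import math

def k_means(points, k, iterations=10):
    """Partition `points` into k clusters; return (centroids, clusters)."""
    # Simplifying assumption: start from the first k points.
    centroids = [points[i] for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters

# Hypothetical two-dimensional data with two visible groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2)

print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(9, 9), (8, 9), (9, 8)]
```

The number of clusters is an input here, not something the algorithm discovers, so choosing k is part of the analyst's job.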
Association analysis is a related, but separate, technique. The main idea behind it is
to find association rules that describe the commonality between different data
points. Similar to clustering, we're looking to find groups that data belongs to.
However, in this case, we're trying to determine when data points will occur
together, rather than just identify clusters of them. In clustering, the goal is to
segregate a large data set into identifiable groups, whereas with association
analysis, we're measuring the degree of association between data points.
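Two common ways to measure the degree of association are support (how often items occur together) and confidence (how often one item appears given that another is present). The sketch below computes both in plain Python; the function names and the shopping baskets are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Share of transactions with `antecedent` that also hold `consequent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread", "milk"},
    {"milk"},
]

print(support(baskets, {"bread", "butter"}))       # 0.5
print(confidence(baskets, {"bread"}, {"butter"}))  # about 0.667
```

Here the rule "bread implies butter" holds in two of the three baskets that contain bread, which is the kind of co-occurrence pattern association analysis looks for.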
Data science application examples

The above methods and techniques in the data science tool belt need to be applied
appropriately to specific analytics problems or questions and the data that's
available to address them. Good data scientists must be able to understand the
nature of the problem at hand -- is it clustering, classification or regression? -- and
the best algorithmic approach that can yield the desired answers given the
characteristics of the data. This is why data science is, in fact, a scientific process,
rather than one that has hard and fast rules and allows you to just program your
way to a solution.

Using these techniques, data scientists can tackle a wide range of applications,
many of which are commonly seen across different types of industries and
organizations. Here are a few examples.
Anomaly detection. If you can find the pattern for expected or "normal" data, then
you can also find those data points that don't fit the pattern. Companies in
industries as diverse as financial services, healthcare, retail and manufacturing
regularly employ a variety of data science methods to identify anomalies in their
data for uses such as fraud detection, customer analytics, cybersecurity and IT
systems monitoring. Anomaly detection can also be used to eliminate outlier values
from data sets for better analytics accuracy.
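One simple way to find points that don't fit the pattern is to flag values that sit far from the mean, measured in standard deviations. The sketch below is only one of many possible approaches; the function name find_anomalies, the threshold of 2.0 and the transaction amounts are assumptions for illustration.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations
    from the mean (a basic z-score test)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical transaction amounts with one suspicious entry.
amounts = [20, 22, 19, 21, 20, 23, 18, 500]

print(find_anomalies(amounts))  # [500]
```

Because a single extreme value also inflates the mean and standard deviation, this test works best on larger samples; the threshold usually needs tuning for the data at hand.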
Binary and multiclass classification. One primary application of classification
techniques is to determine if something is or is not in a particular category. This is
known as binary classification, because we could ask something like, "Is there a cat
in the picture, or not?" A practical business application is to identify contracts or
invoices among piles of documents using image recognition. In multiclass
classification, we have many different categories in a data set and we're trying to
find the best fit for data points. For example, the U.S. Bureau of Labor Statistics
does automated classification of workplace injuries.
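The difference between the two cases shows up in the final decision step. In the sketch below, the scores are assumed to come from some already-trained model; the function names, the 0.5 threshold and the injury categories are hypothetical.

```python
def classify_binary(probability, threshold=0.5):
    """Binary case: one score, one yes/no decision."""
    return probability >= threshold

def classify_multiclass(scores):
    """Multiclass case: pick the category with the highest score."""
    return max(scores, key=scores.get)

# Hypothetical model outputs.
cat_probability = 0.83
injury_scores = {"sprain": 0.2, "fracture": 0.7, "burn": 0.1}

print(classify_binary(cat_probability))    # True
print(classify_multiclass(injury_scores))  # fracture
```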
Personalization. Organizations looking to personalize interactions with people or
recommend products and services to customers first need to group them into data
buckets with shared characteristics. Effective data science work enables websites,
marketing offers and more to be tailored to the specific needs and preferences of
individuals, using technologies such as recommendation engines and
hyper-personalization systems that are driven by matching the data in detailed
profiles of people.
That's just a sample of useful data science applications. By understanding the
various techniques, methods, tools and analytical approaches, data scientists can
help the organizations that employ them achieve the strategic and competitive
benefits that many business rivals are already enjoying.
