PTDLKT
Data Analytics:
The process of evaluating data with the purpose of drawing conclusions to
address business questions.
- Provides a way to search through large sets of structured and unstructured
data to discover unknown patterns or relationships.
- Gives organizations the information they need to make sound and timely
business decisions.
- Aims to transform raw data into knowledge to create value.
Big Data
+ Refers to datasets that are too large and complex to be analyzed
traditionally.
+ Too large and complex for businesses' existing systems to handle using their
traditional capabilities to capture, store, manage, and analyze these datasets.
+ Characterized by the 3 Vs: volume (the sheer size of the dataset), velocity
(the speed of data processing), and variety (the number of types of data).
IMPACT cycle
Identify the questions.
Good questions are SMART: specific, measurable, achievable, relevant, and timely.
Specific: needs to be direct and focused to produce a meaningful answer.
Measurable: must be amenable to data analysis, so the inputs to answering the
question must be measurable with data.
Achievable: should be able to be answered, and the answer should cause a
decision maker to take an action.
Relevant: should relate to the objectives of the organization or the situation
under consideration.
Timely: must have a defined time horizon for answering.
Master the data.
• Know what data are available and how they relate to the problem.
• Internal ERP systems.
• External networks and data warehouses.
• Data dictionaries.
• Extraction, transformation, and loading.
• Data validation and completeness.
• Data normalization.
• Data preparation and scrubbing
Track outcomes.
• Follow up on the results of the analysis.
• How frequently should the analysis be performed?
• Have the analytics changed?
• What are the trends?
Chap 2
Flat file: a single table that maintains all of the data you need in one place
(for example: Excel, CSV).
The ETL (extraction, transformation, and loading) process begins with
identifying which data you need and is complete when the clean data are loaded
in the appropriate format into the tool to be used for analysis. Requesting
data is an iterative practice involving 5 steps.
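The ETL flow described above can be sketched in a few lines. This is only a minimal illustration using the Python standard library: the column names (customer_id, amount), the sample rows, and the function names are all invented for the example, and a real project would load into whatever analysis tool is being used.

```python
import csv
import io

# Hypothetical raw extract; the column names and values are invented for illustration.
RAW = """customer_id,amount
 101 ,250.00
102,
103,75.5
"""

def extract(text):
    """Extract: read delimited text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, drop rows with a missing amount, convert types."""
    clean = []
    for row in rows:
        customer_id = row["customer_id"].strip()
        amount = row["amount"].strip()
        if not amount:  # validation: skip incomplete records
            continue
        clean.append({"customer_id": int(customer_id), "amount": float(amount)})
    return clean

def load(rows):
    """Load: write the clean rows as CSV text, ready for the analysis tool."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["customer_id", "amount"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

clean_rows = transform(extract(RAW))
loaded = load(clean_rows)
```

Here the record with the missing amount is dropped during transformation, so only two clean rows are loaded.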
File Types
Proprietary (ex: Excel)
- Pro: easily read by Excel
- Con: not easily read by other programs
Delimited (ex: CSV)
- Pro: easily read by many programs, no size limits
- Con: delimiters can cause problems
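One way delimiters "cause problems" is when a field value itself contains the delimiter. The sketch below uses an invented vendor record to show the issue and how Python's standard csv module avoids it by quoting the field.

```python
import csv
import io

# Hypothetical record: the vendor name itself contains a comma.
row = ["1001", "Smith, Jones & Co.", "250.00"]

# Naive joining and splitting breaks the record into four fields.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")

# The csv module quotes the field that contains the delimiter,
# so the record survives a round trip with three fields.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
quoted_line = buffer.getvalue().strip()
parsed_fields = next(csv.reader(io.StringIO(quoted_line)))
```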
Chap 3
4 main categories of data analytics:
Descriptive analytics are procedures that summarize existing data to determine
what has happened in the past. ex: summary statistics (count, min, max,
average, median), distributions, and proportions.
Diagnostic analytics are procedures that explore the current data to determine
why something has happened the way it has, typically comparing the data to a
benchmark. ex: drilling down into the data to see how it compares to a budget,
a competitor, or a trend.
Predictive analytics are procedures used to generate a model that can be used
to determine what is likely to happen in the future. ex: regression analysis,
forecasting, classification, and other predictive modeling.
Prescriptive analytics are procedures that model data to enable
recommendations for what should be done in the future.
Supervised approach: used when you are trying to predict a future outcome based
on historical data.
- Summary statistics
- Co-occurrence grouping
- Similarity matching
- Regression
- Classification
- Link prediction
Descriptive analytics
Approaches: summary statistics, data reduction or filtering
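As a small illustration of the descriptive approaches, the sketch below computes summary statistics and applies a simple filter. The invoice amounts and the 150.0 threshold are invented for the example.

```python
import statistics

# Hypothetical invoice amounts, invented for illustration.
amounts = [120.0, 80.0, 200.0, 80.0, 520.0]

summary = {
    "count": len(amounts),
    "min": min(amounts),
    "max": max(amounts),
    "average": statistics.mean(amounts),
    "median": statistics.median(amounts),
}

# Data reduction / filtering: keep only the invoices above a chosen threshold.
large_invoices = [a for a in amounts if a > 150.0]
# Proportion of invoices that are "large".
proportion_large = len(large_invoices) / len(amounts)
```

Note how the average (200.0) sits well above the median (120.0) because of the single large invoice.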
Diagnostic analytics
Approaches: profiling, clustering, similarity matching, co-occurrence grouping
Profiling
Identifies typical behavior in the data in order to spot outliers.
- Compares an individual to the population
- A diagnostic analytic
- Unsupervised
- Done primarily using structured data
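One common way to compare an individual to the population is a z-score. The sketch below is only an illustration: the expense claims are invented, and the cutoff of 2 standard deviations is an assumption, not a rule from these notes.

```python
import statistics

# Hypothetical daily expense claims per employee, invented for illustration.
claims = {"ana": 52.0, "binh": 48.0, "chi": 50.0, "dung": 49.0, "em": 51.0, "giang": 140.0}

values = list(claims.values())
mean = statistics.mean(values)
stdev = statistics.pstdev(values)  # population standard deviation

# z-score: how many standard deviations an individual is from the population mean.
z_scores = {name: (value - mean) / stdev for name, value in claims.items()}

# Assumed cutoff of 2 standard deviations; the right threshold depends on the data.
outliers = [name for name, z in z_scores.items() if abs(z) > 2]
```

With this data only the 140.0 claim is flagged; everyone else is within about half a standard deviation of the mean.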
Clustering
Helps identify groups (or clusters) of individuals (such as customers) that
share common underlying characteristics. In other words, it identifies groups
of similar data elements and the underlying drivers of those groups.
- Finds undiscovered natural groupings in the data
- Good for anomaly testing; anomalies may indicate risk or fraud
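The notes do not name a specific clustering algorithm. As one possible illustration, the sketch below runs a very small k-means on one-dimensional data; the customer spend values and the starting centroids are invented for the example.

```python
def k_means_1d(values, centroids, iterations=10):
    """Very small k-means for one-dimensional data with given starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # stop once nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical customer annual spend, invented for illustration.
spend = [100, 120, 110, 900, 950, 1000, 105]
centroids, clusters = k_means_1d(spend, centroids=[0, 500])
```

The two groups (low spenders and high spenders) are found without any labels being supplied, which is what makes the approach unsupervised.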
t-test: compares mean values of a continuous variable between 2
categories/groups.
Null hypothesis: the hypothesis that there is no significant difference between
specified populations, any observed difference being due to sampling or
experimental error.
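The notes do not say which variant of the t-test is meant. The sketch below assumes Welch's version (which does not require equal variances) and computes only the t statistic; turning it into a p-value needs the t distribution from a statistics package, which is not shown. The two groups of order values are invented.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two group means."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical order values for two customer groups, invented for illustration.
group_a = [10.0, 12.0, 11.0, 13.0, 14.0]
group_b = [20.0, 22.0, 21.0, 23.0, 24.0]
t_stat = welch_t(group_a, group_b)
```

A t statistic far from zero, as here, is evidence against the null hypothesis that the two group means are the same; a value near zero is consistent with it.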
Predictive analytics
Approaches: regression, classification, link prediction
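As an illustration of the regression approach, the sketch below fits a straight line by ordinary least squares and uses it to forecast one new value. The advertising and sales figures are invented, and a single-variable line is the simplest possible case.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: advertising spend (x) and sales (y), invented for illustration.
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_line(ad_spend, sales)
forecast = intercept + slope * 5.0  # predicted sales at a spend of 5
```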
Classification
Predicts a class or category for a new observation based on the manual
identification of classes from previous observations.
6 steps of classification
1. Identify the classes you wish to predict
2. Manually classify an existing set of records
3. Select a set of classification models
4. Divide your data into training and testing sets
5. Generate your model
6. Interpret the results and select the "best" model
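The six steps above can be walked through with a toy example. Everything here is an assumption made for illustration: the records, the two risk classes, the fixed train/test split, and the choice of a nearest-centroid classifier as the single model (a real project would compare several models in step 3).

```python
# Steps 1-2: classes to predict and manually labeled records (hypothetical data).
# Each record is (days_overdue, label).
records = [
    (2, "low_risk"), (5, "low_risk"), (3, "low_risk"), (4, "low_risk"),
    (40, "high_risk"), (55, "high_risk"), (60, "high_risk"), (45, "high_risk"),
]

# Step 4: divide into training and testing sets (a fixed split keeps this reproducible).
train = records[0:3] + records[4:7]
test = [records[3], records[7]]

# Steps 3 and 5: the chosen model is a nearest-centroid classifier,
# which is "generated" by averaging the training values of each class.
def fit(train_rows):
    totals = {}
    for value, label in train_rows:
        total, count = totals.get(label, (0.0, 0))
        totals[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in totals.items()}

def predict(centroids, value):
    """Return the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

model = fit(train)

# Step 6: interpret the results by measuring accuracy on the unseen test set.
correct = sum(predict(model, value) == label for value, label in test)
accuracy = correct / len(test)
```

Accuracy is measured on records the model never saw during training, which is the point of step 4.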
Pruning
- Removes branches from a decision tree to avoid overfitting the model
- Reduces the number of times we split the groups of data into smaller groups
Support vector machine
- A discriminating classifier
- Defined by a separating hyperplane: it first finds the widest margin (or
biggest pipe) between the classes and then finds the middle line of that margin
Overfitting
- Classifiers that are "too" accurate on the data they were built from
- You want a good amount of accuracy without the model being too perfect
Linear classifiers
- Useful for ranking items rather than simply predicting class probability
- Used for determining really important values
Prescriptive analytics
Approaches: decision support systems, machine learning and AI
Chap 4
Statistics involves the collection, analysis, interpretation, presentation, and
organization of data to help us understand patterns, trends, and relationships
within the data set.
Data Visualization is the graphical representation of data or information in a
way that is easy to understand and interpret.
Using the right type of chart will help you in many ways:
• Charts make your data easy to understand
• Charts help people quickly see the key trends and the orders of magnitude.
• Charts are easier to remember than a large list of numbers
• Charts allow you to highlight the key insights your audience
should focus on to extract actionable information
Qualitative data are categorical data (for example: count, group, rank).
Categorical data: data that are divided into categories or groups.
A. Nominal data are simple categories with no order. (for example: hair color)
B. Ordinal data can be ranked. (for example: gold, silver, bronze)
C. Proportion shows the makeup of each category.
(for example: 55% cats, 45% dogs)
Qualitative data charts include: bar, pie, stacked bar, tree maps, heat maps,
symbol maps, and word clouds.
Comparison: bar charts, pie charts, stacked bar charts, tree maps, heat maps
Geographic data: symbol maps
Text data: word clouds
Quantitative data are numerical (for example: age, height, dollar amount).
These are data that have a meaningful difference between data points.
A. Ratio data define 0 as "absence of" something. (for example: cash)
B. Interval data treat 0 as just another number. (for example: temperature)
C. Discrete data show only whole numbers. (for example: points in a basketball
game)
D. Continuous data show numbers with decimals. (for example: height)
E. Distributions describe the mean, median, and standard deviation of the data.
Quantitative data charts include: line charts, box and whisker plots, scatter
plots, and filled geographic maps.
Trend over time: line charts
Outlier detection: box and whisker plots
Relationship between two variables: scatter plots
Geographic data: filled maps
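A box and whisker plot detects outliers through the quartiles of the data. The sketch below reproduces that calculation without drawing the chart. The transaction amounts are invented, and the 1.5 x IQR whisker rule is a common convention assumed here, not something stated in these notes.

```python
import statistics

# Hypothetical transaction amounts, invented for illustration.
amounts = [10, 12, 11, 13, 12, 14, 11, 95]

# Quartiles; method="inclusive" treats the data as the whole population.
q1, q2, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the "box"

# Common 1.5 * IQR whisker rule; other multipliers are also used.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in amounts if a < lower_fence or a > upper_fence]
```

Points outside the fences are the ones a box and whisker plot would draw as separate dots beyond the whiskers.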
Exploratory visualizations are used to gain insights while you are interacting
with data (for example: identifying good customers). Tableau is more useful
than Excel if your data analysis project is more
A heat map is most useful for showing the relative size of a value by using a
color scale.
Line Chart
A line chart is useful for showing the change in stock price over time.
Identifying trends over time is best visualized in a line chart.
when
You have a continuous dataset that changes over time.
Your dataset is too big for a bar chart.
You want to display multiple series for the same timeline.
You want to visualize trends instead of exact values.
when not
Line charts work better with bigger datasets, so if you have a small one, use a
bar chart instead.
Bar Chart
when
You are comparing parts of a bigger set of data, highlighting different
categories, or showing change over time.
You have long category labels, because a bar chart offers more space for them.
You want to illustrate both positive and negative values in the dataset.
when not
You have many categories; avoid overloading your graph.
Combined Chart
when
You want to compare values with different measurements.
The values are different in range.
when not
You want to display more than 2 or 3 types of graphs. It is better to have
separate graphs to make them easier to read and understand.
Pie Chart
when
• You show relative proportions and percentages of a whole dataset.
• Best used with small datasets.
• Comparing the effect of ONE factor on different categories.
• You have up to 6 categories.
• Your data are nominal and not ordinal.
when not
• You have a big dataset.
• You want to make a precise or absolute comparison between values.