PTDLKT


Chap 1

Data Analytics :
The process of evaluating data with the purpose of drawing conclusions to
address business questions.
-Provides a way to search through large structured and unstructured data to
discover unknown patterns or relationships.
-Gives organizations the information they need to make sound and timely
business decisions.
-Aims to transform raw data into knowledge to create value.

Big Data
+ refers to datasets that are too large and complex to be analyzed with traditional methods.
+ too large and complex for businesses' existing systems to capture, store, manage, and analyze using their traditional capabilities.
+ characterized by the 3 Vs: volume (the sheer size of the dataset), velocity (the speed of data processing), and variety (the number of types of data).

Data volume refers to the amount of data created and stored by an organization.
Data velocity refers to the pace at which data is created and stored.
Data variety refers to the different forms data can take.
Data veracity refers to the quality or trustworthiness of data.

IMPACT cycle
Identify the questions.
Good questions are SMART: specific, measurable, achievable, relevant, and timely.
Specific: needs to be direct and focused to produce a meaningful answer.
Measurable: must be amenable to data analysis and thus the inputs to
answering the question must be measurable with data.
Achievable: should be able to be answered, and the answer should cause a decision maker to take an action.
Relevant: should relate to the objectives of the organization or the situation
under consideration.
Timely: must have a defined time horizon for answering.
Master the data.
• Know what data are available and how they relate to the problem.
• Internal ERP systems.
• External networks and data warehouses.
• Data dictionaries.
• Extraction, transformation, and loading.
• Data validation and completeness.
• Data normalization.
• Data preparation and scrubbing

• Perform the test plan.


Classification: a data approach that attempts to assign each unit in a population to the small set of classes to which the unit belongs.
Similarity matching: a data approach that attempts to identify similar individuals based on data known about them.
Link prediction: a data approach that attempts to predict a relationship between two data items.
Data reduction: a data approach that attempts to reduce the amount of information that needs to be considered to focus on the most critical items (i.e., highest cost, highest risk, largest impact, etc.).
Co-occurrence grouping: a data approach that attempts to discover associations between individuals based on transactions involving them.
Clustering: a data approach that attempts to divide individuals (like customers) into groups (or clusters) in a useful or meaningful way.
Regression: a data approach that attempts to estimate or predict, for each unit, the numerical value of some variable using some type of statistical model; might be used, for example, to assess the appropriate level of the allowance for doubtful accounts.
Profiling: a data approach that attempts to characterize the "typical" behavior of an individual, group, or population by generating summary statistics about the data (including mean, standard deviations, etc.).

Address and refine results.


• Identify issues with the analyses and refine the model.
• Ask further questions.
• Explore the data.
• Rerun the analysis.
Communicate insights.
• Communicate effectively using clear language and visualizations:
▪ Dashboards.
▪ Static reports.
▪ Summaries

Track outcomes.
• Follow up on the results of the analysis.
• How frequently should the analysis be performed?
• Have the analytics changed?
• What are the trends?

Chap 2

Enterprise Resource Planning (ERP) is a category of business management software that integrates applications from throughout the business (such as manufacturing, accounting, finance, human resources, etc.) into one system.

Flat file: a single table, maintaining all of the data you need in one place (e.g., Excel, CSV).

Relational database: a collection of data that is organized so that it can be easily accessed, managed, and updated with minimum data redundancy (e.g., Access, SQLite).

Relational databases ensure that data:
- are complete
- aren't redundant
- follow business rules and internal controls
- aid communication

Four Types of Attributes


- Primary Keys: unique identifiers
- Foreign Keys: point to primary key in another table
- Composite Keys: combination of two foreign keys used for line items
- Descriptive Attributes: everything else

Importance of Primary Keys


- uniquely identify a row in a table
- ensure no duplicates will be added
- speed up queries, searches, and sort requests
- have no embedded spaces, special characters, or differential capitalization
- cannot be null

Importance of Foreign Keys


- link tables
- enforce referential integrity (similar to how Excel links worksheets)
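
A minimal sketch of how primary and foreign keys work in a relational database, using Python's built-in sqlite3 module; the Customer and Sales_Order tables and their columns are hypothetical examples.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces referential integrity when this is on

conn.execute("""
    CREATE TABLE Customer (
        customer_id INTEGER PRIMARY KEY,  -- primary key: unique, cannot be null
        name        TEXT NOT NULL
    )""")
conn.execute("""
    CREATE TABLE Sales_Order (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,     -- foreign key: points to Customer's primary key
        amount      REAL,
        FOREIGN KEY (customer_id) REFERENCES Customer (customer_id)
    )""")

conn.execute("INSERT INTO Customer VALUES (1, 'Alpha Co')")
conn.execute("INSERT INTO Sales_Order VALUES (100, 1, 250.00)")      # OK: customer 1 exists
try:
    conn.execute("INSERT INTO Sales_Order VALUES (101, 99, 75.00)")  # fails: no customer 99
except sqlite3.IntegrityError as err:
    print("Referential integrity enforced:", err)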

A data dictionary is a central repository of descriptions for all of the data attributes of the dataset; it is the metadata that describes each attribute in a database.

The ETL process begins with identifying which data you need and is complete when the clean data are loaded in the appropriate format into the tool to be used for analysis. Requesting data is an iterative process involving five steps.

ETL includes Extract - Transform - Load.

Extract
Step 1: Determine the purpose and scope of the data request.
Step 2: Obtain the data.
Transform
Step 3: Validate the data for completeness and integrity.
Step 4: Clean the data.
Load
Step 5: Load the data for data analysis.
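
A minimal sketch of the extract, validate, and load steps using pandas and sqlite3; the file name sales_export.csv, the column names, and the analysis.db database are hypothetical.

import sqlite3
import pandas as pd

# Step 2: obtain (extract) the data from the source file
df = pd.read_csv("sales_export.csv")

# Step 3: validate the data for completeness and integrity
print("rows extracted:", len(df))
print("missing values per column:")
print(df.isna().sum())
assert df["order_id"].is_unique, "duplicate order IDs found"

# Step 4: clean the data (see the cleaning checklist below)
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Step 5: load the clean data into the tool used for analysis (here, a SQLite table)
with sqlite3.connect("analysis.db") as conn:
    df.to_sql("sales", conn, if_exists="replace", index=False)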

File Types
Proprietary: (ex: Excel)
- Pro: easily read by excel
- Con: not easily read by other programs
Delimited: (ex: csv)
- Pro: easily read by many programs, no size limits
- Con: delimiters can cause problems

How to Clean the Data


- remove headings and subtotals
- clean leading zeroes and non-printable characters
- format negative numbers
- correct inconsistencies (fill missing values, resolve international differences, remove duplicates, use a consistent date format); a minimal cleaning sketch follows below
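
A minimal pandas sketch of these cleaning steps; the DataFrame and its columns are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "account":  ["00123", "00123", "00456", "00789"],
    "customer": [" alpha co", " alpha co", "Beta Ltd", None],
    "amount":   ["1,200", "1,200", "(300)", "450"],       # (300) = accounting-style negative
    "date":     ["01/31/2024", "01/31/2024", "02/15/2024", "02/20/2024"],
})

df["account"] = df["account"].str.lstrip("0")                               # clean leading zeroes
df["customer"] = df["customer"].fillna("Unknown").str.strip().str.title()  # fill missing values, fix case and whitespace
df["amount"] = (df["amount"].str.replace(",", "", regex=False)
                            .str.replace("(", "-", regex=False)            # format negative numbers
                            .str.replace(")", "", regex=False)
                            .astype(float))
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")                 # consistent date format
df = df.drop_duplicates()                                                   # remove duplicate rows
print(df)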

Chap 3
4 main categories of data analytics :
Descriptive analytics are procedures that summarize existing data to determine
what has happened in the past. ex : summary statistics (count, min, max,
average, median), distributions, and proportion.
Diagnostic analytics are procedures that explore the current data to determine
why something has happened the way it has, typically comparing the data to a
benchmark. ex : drill down in the data and see how it compares to a budget, a
competitor, or trend
Predictive analytics are procedures used to generate a model that can be used
to determine what is likely to happen in the future. ex : regression analysis,
forecasting, classification, and other predictive modeling
Prescriptive analytics are procedures that model data to enable
recommendations for what should be done in the future.

Unsupervised approach : when you don't have a specific question


- clustering
- profiling
- co-occurrence grouping
- data reduction

Supervised approach : when you are trying to predict a future outcome based
on historical data
- similarity matching
- regression
- classification
- link prediction

Descriptive analytics
Approaches : summary statistics, data reduction or filtering

Summary statistics : describe a set of data in terms of their location, range, shape, and size (count, max, min, average, median), and the dependence of a set of observations.
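
A minimal sketch of summary statistics with pandas; the invoice amounts are hypothetical.

import pandas as pd

amounts = pd.Series([120, 450, 450, 980, 75, 2_300, 610])  # e.g., invoice amounts

print("count :", amounts.count())
print("min   :", amounts.min())
print("max   :", amounts.max())
print("mean  :", amounts.mean())
print("median:", amounts.median())
print("std   :", amounts.std())
print(amounts.describe())  # location, range, shape, and size in one call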

Data reduction or filtering


filter or group the data to simplify the analysis or to eliminate groups from
further analysis - unsupervised
Fuzzy matching locates approximate matches - useful for identifying
relationships in imperfect data
Data reduction steps :
1. Identify the attribute you would like to reduce or focus on
2. Filter the results
3. Interpret the results
4. Follow up on results
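
A minimal sketch of filtering (data reduction) and fuzzy matching; the vendor data, the 50,000 threshold, and the 0.6 cutoff are hypothetical, and the fuzzy match uses the standard-library difflib module.

import difflib
import pandas as pd

df = pd.DataFrame({
    "vendor": ["Acme Corp", "ACME Corporation", "Globex", "Initech", "Acme Co."],
    "amount": [125_000, 4_200, 89_500, 310, 97_000],
})

# Steps 1-2: focus on the attribute of interest (amount) and filter to the highest-value items
high_value = df[df["amount"] > 50_000]

# Fuzzy matching: locate approximate matches for a vendor name in imperfect data
similar = difflib.get_close_matches("Acme Corp.", df["vendor"].tolist(), n=3, cutoff=0.6)

# Steps 3-4: interpret and follow up on the reduced set
print(high_value)
print("possible duplicate vendors:", similar)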

Diagnostic analytics
Approaches : profiling, clustering, similarity matching, co-occurrence grouping

Profiling
identify typical behavior in the data to identify outliers
- compares an individual to the population
- a diagnostic analytic
- unsupervised
- done primarily using structured data
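
A minimal profiling sketch: generate summary statistics for the population and flag individuals who fall far from typical behavior; the payment data and the 2-standard-deviation cutoff are hypothetical.

import pandas as pd

payments = pd.Series([210, 250, 195, 240, 230, 225, 4_800])  # one suspicious payment

mean, std = payments.mean(), payments.std()
z_scores = (payments - mean) / std        # how far each payment is from typical behavior
outliers = payments[z_scores.abs() > 2]   # flag anything more than 2 standard deviations away

print(f"typical payment: mean = {mean:.0f}, std = {std:.0f}")
print("outliers:")
print(outliers)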

Clustering
helps identify groups (or clusters) of individuals (such as customers) that share
common underlying characteristics—in other words, identifying groups of
similar data elements and the underlying drivers of those groups.
- find undiscovered natural groupings in the data
- good for anomaly testing; can indicate risk or fraud
t-test : Compares mean values of a continuous variable between 2
categories/groups.
null hypothesis : the hypothesis that there is no significant difference between
specified populations, any observed difference being due to sampling or
experimental error.
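
A minimal sketch of clustering with scikit-learn's KMeans plus a two-sample t-test with SciPy; the customer data, the choice of two clusters, and the 0.05 significance level are hypothetical.

import numpy as np
from scipy import stats
from sklearn.cluster import KMeans

# each row: [annual purchases, average days to pay]
customers = np.array([
    [12_000, 15], [11_500, 20], [13_000, 18],   # buy a lot, pay quickly
    [2_000, 75], [1_500, 80], [2_500, 90],      # buy little, pay slowly
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
print("cluster assignments:", labels)

# t-test: do the two clusters differ in average days to pay?
days_a = customers[labels == 0, 1]
days_b = customers[labels == 1, 1]
t_stat, p_value = stats.ttest_ind(days_a, days_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")   # p < 0.05 -> reject the null hypothesis of equal means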

Similarity matching is a grouping technique used to identify similar individuals based on data known about them.

Co-occurrence grouping discovers associations between individuals based on common events, such as transactions they are involved in.

Predictive analytics
Approaches : regression, classification, link prediction

Regression estimates or predicts the numerical value of a dependent variable based on the slope and intercept of a line and the value of an independent variable.
Examples of regression : predicting employee turnover (managerial accounting), assessing the appropriateness of allowance accounts.
Three steps of a regression
1. Identify the variable (that might predict an outcome)
2. Determine the functional form (of the relationship)
3. Identify the parameters (of the model)
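
A minimal sketch of the three regression steps using scikit-learn's LinearRegression; the units-produced and overtime-cost figures are hypothetical.

import numpy as np
from sklearn.linear_model import LinearRegression

# Step 1: identify the variable that might predict the outcome (units produced -> overtime cost)
units = np.array([[100], [150], [200], [250], [300]])
overtime_cost = np.array([1_100, 1_600, 2_050, 2_600, 3_050])

# Step 2: determine the functional form: cost = intercept + slope * units
model = LinearRegression().fit(units, overtime_cost)

# Step 3: identify the parameters of the model and use them to predict
print("intercept:", model.intercept_, "slope:", model.coef_[0])
print("predicted cost at 275 units:", model.predict([[275]])[0])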

Classification
predicts a class or category for a new observation based on the manual
identification of classes from previous observations

6 steps of classification
1. Identify the classes you wish to predict
2. Manually classify an existing set of records
3. Select a set of classification models
4. Divide your data into training and testing sets
5. Generate your model
6. Interpret the results and select the "best" model
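
A minimal sketch of the six classification steps using a decision tree in scikit-learn; the invoice features and the "late payer" labels are hypothetical, and max_depth stands in for pruning.

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Steps 1-2: classes to predict (1 = late payer, 0 = on time) and manually labeled records
X = np.array([[5, 200], [40, 900], [3, 150], [55, 1_200], [10, 300],
              [60, 1_500], [8, 250], [45, 1_000], [4, 180], [50, 1_100]])   # [days past due, balance]
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Step 4: divide the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Steps 3 and 5: select a classification model and generate it (max_depth limits splits, like pruning)
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

# Step 6: interpret the results and judge whether the model is over- or under-fit
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))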
pruning
- removes branches from a decision tree to avoid overfitting the model
- reduces the number of times we split the groups of data into smaller subgroups
support vector machine
- a discriminating classifier
- defined by a separating hyperplane that works first to find the widest margin (or biggest pipe) and then works to find the middle line
overfitting
- occurs when a classifier is "too" accurate, fitting the training data too closely
- you want a good amount of accuracy without the model being too perfect
linear classifiers
- useful for ranking items rather than simply predicting class probability
- used for determining really important values

Link prediction predicts a relationship between two data items, such as members of a social media platform.
ex. friends you might know on FB
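
A minimal link-prediction sketch that scores unconnected pairs by their number of common neighbors; the friendship graph is hypothetical and plain Python is used so no graph library is required.

friends = {
    "Ana":   {"Bob", "Cara", "Dan"},
    "Bob":   {"Ana", "Cara"},
    "Cara":  {"Ana", "Bob", "Dan"},
    "Dan":   {"Ana", "Cara", "Elena"},
    "Elena": {"Dan"},
}

def common_neighbors(a, b):
    # more shared friends -> higher chance that a and b will connect
    return len(friends[a] & friends[b])

# score every pair that is not already connected
pairs = [(a, b) for a in friends for b in friends if a < b and b not in friends[a]]
for a, b in sorted(pairs, key=lambda p: common_neighbors(*p), reverse=True):
    print(f"{a} - {b}: {common_neighbors(a, b)} shared friends")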

Prescriptive analytics
Approaches : decision support systems , machine learning and AI

Decision support systems are rule-based systems that gather data and recommend actions based on the input.

Machine learning and artificial intelligence are learning models or intelligent agents that adapt to new external data to recommend a course of action.

Chap 4
Statistics involves the collection, analysis, interpretation, presentation, and
organization of data to help us understand patterns, trends, and relationships
within the data set.
Data Visualization is the graphical representation of data or information in a
way that is easy to understand and interpret.

Visualizations are preferred over text.


• People prefer visuals.
• The brain can process visuals faster.
• Visuals can summarize complex information

Using the right type of chart will help you in many ways:
• Charts make your data easy to understand
• Charts help people quickly see the key trends and the orders of magnitude.
• Charts are easier to remember than large list of numbers
• Charts allow you to highlight the key insights your audience
should focus on to extract actionable information

Qualitative data are categorical data (for example: count, group, rank).
Categorical data : data that is divided into categories or groups
A. Nominal data is simple. (for example: hair color)
B. Ordinal data can be ranked. (for example: gold, silver, bronze)
C. Proportion shows the makeup of each category. (for example: 55% cats, 45% dogs)
Qualitative Data Charts include : bar, pie, stacked bar, tree maps, heat maps,
symbol maps, and word clouds.
Comparison: bar charts, pie charts
Proportion: stacked bar charts, tree maps, heat maps
Geographic data: symbol maps
Text data: word clouds

Quantitative data are numerical (for example: age, height, dollar amount).
Data that has a meaningful difference between data points.
A. Ratio data defines 0 as “absence of” something. (for example: cash)
B. Interval data where 0 is just another number. (for example: temperature)
C. Discrete data show only whole numbers. (for example: points in a basketball
game)
D. Continuous data show numbers with decimals. (for example: height)
E. Distributions describe the mean, median, and standard deviation of the data.

Quantitative Data Charts include : line charts, box and whisker plots, scatter
plots, and filled geographic maps.
Trend Over time: Line charts
Outlier detection: Box and whisker plots
Relationship between two variables: Scatter plots
Geographic data: Filled map
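
A minimal matplotlib sketch of two of these chart types: a line chart for a trend over time and a box-and-whisker plot for outlier detection; the monthly revenue figures are hypothetical.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 128, 160, 155, 310]   # June looks like an outlier

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(months, revenue, marker="o")      # trend over time -> line chart
ax1.set_title("Revenue trend")
ax2.boxplot(revenue)                       # outlier detection -> box and whisker plot
ax2.set_title("Revenue distribution")
plt.tight_layout()
plt.show()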

Declarative visualizations are used to present findings (for example: financial results). Excel is more useful than Tableau if your data analysis project is more declarative.

Exploratory visualizations are used to gain insights while you are interacting with data (for example: identifying good customers). Tableau is more useful than Excel if your data analysis project is more exploratory.

Heat map is most useful for showing the relative size of a value by using a
color scale.

Line Chart
Line chart is useful for showing the change in stock price over time.
Identifying trends over time is best visualized in a Line chart

when
You have a continuous dataset that changes over time.
Your dataset is too big for a bar chart.
You want to display multiple series for the same timeline.
You want to visualize trends instead of exact values.
when not ?
Line charts work better with bigger datasets, so if you have a small one, use a bar chart instead.
Bar Chart
when
Comparing parts of a bigger set of data, highlighting different categories, or
showing change over time.
You have long category labels; a bar chart offers more space for them.
You want to illustrate both positive and negative values in the dataset.
when not
You have many categories; avoid overloading your graph.

Combined Chart
when
You want to compare values with different measurements.
The values are different in range.
when not
You want to display more than 2~3 types of graphs. It’s better to have separate
graphs to make it easier to read and understand.

Pie Chart
when
You want to show relative proportions and percentages of a whole dataset.
• Best used with small datasets.
• Comparing the effect of ONE factor on different categories.
• You have up to 6 categories.
• Your data is nominal and not ordinal.
when not
• You have a big dataset.
• You want to make a precise or absolute
comparison between values.
