DA Unit-1
Big data is a field that treats ways to analyze, systematically extract information from, or
otherwise deal with data sets that are too large or complex to be dealt with by traditional
data-processing application software.
Big Data is commonly characterized by four V's:
• Volume
• Variety
• Velocity
• Veracity
The volume of data refers to the size of the data sets that need to be analyzed and processed,
which are now frequently larger than terabytes and petabytes. The sheer volume of the data
requires processing technologies distinct from traditional storage and processing
capabilities. In other words, the data sets in Big Data are too large to process
with a regular laptop or desktop processor. An example of a high-volume data set would be
all credit card transactions on a day within Europe.
Velocity refers to the speed with which data is generated. High-velocity data is generated
at such a pace that it requires distinct (distributed) processing techniques. An example of
data generated with high velocity would be Twitter messages or Facebook posts.
Variety makes Big Data really big. Big Data comes from a great variety of sources and
generally falls into one of three types: structured, semi-structured and unstructured data. The
variety in data types frequently requires distinct processing capabilities and specialist
algorithms. An example of high variety data sets would be the CCTV audio and video files
that are generated at various locations in a city.
Veracity refers to the quality of the data that is being analyzed. High veracity data has many
records that are valuable to analyze and that contribute in a meaningful way to the overall
results. Low veracity data, on the other hand, contains a high percentage of meaningless data.
The non-valuable portion of these data sets is referred to as noise. An example of a high-veracity data
set would be data from a medical experiment or trial.
Data that is high volume, high velocity and high variety must be processed with advanced
tools (analytics and algorithms) to reveal meaningful information. Because of these
characteristics of the data, the knowledge domain that deals with the storage, processing, and
analysis of these data sets has been labeled Big Data.
FORMS OF DATA
– STRUCTURED FORM
– UNSTRUCTURED FORM
• Any form of data that follows a predefined structure (for example, rows and columns
in a relational database table) is represented as the structured form of data.
• Any form of data that does not have a predefined structure is represented as the
unstructured form of data. Eg: video, images, comments, posts, and a few
websites such as blogs and Wikipedia.
SOURCES OF DATA
DATA ANALYSIS
Data analysis is a process of inspecting, cleansing, transforming and modeling data with
the goal of discovering useful information, informing conclusions and supporting decision-
making.
DATA ANALYTICS
• Data analytics is the science of analyzing raw data in order to draw conclusions from
that information. This information can then be used to optimize processes and increase
the overall efficiency of a business or system.
Types:
– Descriptive analytics: in descriptive statistics the result always leads to a probability
among ‘n’ options, where each option has an equal chance. (A short pandas sketch
follows this list of types.)
– Predictive analytics Eg: healthcare, sports, weather, insurance, social media analysis.
This type of analytics uses past data to predict future outcomes and make decisions
based on certain algorithms. In the case of a doctor, the doctor questions the patient
about the past and treats the illness through already existing procedures.
– Prescriptive analytics: works with predictive analytics, which uses data to determine
near-term outcomes. Prescriptive analytics makes use of machine learning to help
businesses decide a course of action based on a computer program's predictions.
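To make the distinction concrete, the following is a minimal sketch of descriptive analytics in Python using pandas. The sales figures and column names are purely hypothetical assumptions used only to show how historical data is summarized.

```python
# Minimal illustration of descriptive analytics: summarizing historical data.
import pandas as pd

# Hypothetical daily sales records (assumed values for illustration only).
sales = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South", "East"],
    "units":  [120, 95, 130, 80, 110, 85],
})

# Descriptive analytics reports what has already happened in the data.
print(sales["units"].describe())                 # count, mean, std, min, quartiles, max
print(sales.groupby("region")["units"].mean())   # average units sold per region
```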
Fig 0.1 Relation between Social Media, Data Analysis and Big Data
Social media data are used in a number of domains such as health and political trending
and forecasting, hobbies, e-business, cyber-crime, counter-terrorism, time-evolving opinion
mining, social network analysis, and human-machine interactions.
Summarizing the above concepts, the processing of social media data can be
categorized into 3 parts as shown in figure 0.1. The first part consists of the social media
websites, the second part consists of data analysis, and the third part consists of the big data
management layer, which schedules the jobs across the cluster.
MACHINE LEARNING
Analytics
In general, data is passed to a machine learning tool, which performs descriptive data analytics
through the set of algorithms built into it. Here both data analysis and data analytics are done by the
tool automatically. Hence we can say that data analysis is a sub-component of data analytics,
and data analytics is a sub-component of the machine learning tool. All of these are described in
figure 0.2. The machine learning tool generates a model as output, and from this model
predictive analytics and prescriptive analytics can be performed, because the model feeds its
output back as data to the machine learning tool. This cycle continues until we get an efficient output.
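A minimal sketch of this cycle follows, assuming synthetic data and scikit-learn's LinearRegression as the "machine learning tool"; the numbers and the choice of model are illustrative assumptions, not part of the original text.

```python
# Sketch of the data -> analytics -> model -> prediction cycle described above.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))           # raw data passed to the tool
y = 3.0 * X.ravel() + rng.normal(0, 1.0, 100)   # observed outcomes with noise

model = LinearRegression().fit(X, y)            # the tool builds a model from the data
prediction = model.predict([[12.0]])            # predictive analytics from that model
print("predicted outcome:", prediction[0])

# In the cycle above, predictions would be compared with new observations and the
# model retrained, repeating until the output is considered efficient.
```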
UNIT - I
1.1 DESIGN DATA ARCHITECTURE AND MANAGE THE DATA FOR ANALYSIS
Data architecture is composed of models, policies, rules or standards that govern which
data is collected, and how it is stored, arranged, integrated, and put to use in data systems
and in organizations. Data is usually one of several architecture domains that form the
pillars of an enterprise architecture or solution architecture.
Various constraints and influences will have an effect on data architecture design. These
include enterprise requirements, technology drivers, economics, business policies and data
processing needs.
• Enterprise requirements
These will generally include such elements as economical and effective system
expansion, acceptable performance levels (especially system access speed), transaction
reliability, and transparent data management. In addition, the conversion of raw data such as
transaction records and image files into more useful information forms through such
features as data warehouses is also a common organizational requirement, since this enables
managerial decision making and other organizational processes. One of the architecture
techniques is the split between managing transaction data and (master) reference data.
Another one is splitting data capture systems from data retrieval systems (as done in a
data warehouse).
• Technology drivers
These are usually suggested by the completed data architecture and database
architecture designs. In addition, some technology drivers will derive from existing
organizational integration frameworks and standards, organizational economics, and
existing site resources (e.g. previously purchased software licensing).
• Economics
These are also important factors that must be considered during the data architecture phase.
It is possible that some solutions, while optimal in principle, may not be potential
candidates due to their cost. External factors such as the business cycle, interest rates,
market conditions, and legal considerations could all have an effect on decisions relevant to
data architecture.
• Business policies
Business policies that also drive data architecture design include internal organizational
policies, rules of regulatory bodies, professional standards, and applicable governmental
laws that can vary by applicable agency. These policies and rules will help describe the
manner in which the enterprise wishes to process its data.
• Data processing needs
These include accurate and reproducible transactions performed in high volumes, data
warehousing for the support of management information systems (and potential data
mining), repetitive periodic reporting, ad hoc reporting, and support of various
organizational initiatives as required (e.g. annual budgets, new product development).
The logical view, or user's view, of a data analytics system represents data in a format that is
meaningful to a user and to the programs that process those data. That is, the logical
view tells the user, in user terms, what is in the database. The logical level consists of data
requirements and process models, which are processed using data modelling techniques to
produce a logical data model.
The physical level is created when we translate the top-level design into physical tables in
the database. This model is created by the database architect, software architects, software
developers or the database administrator. The input to this level comes from the logical level, and
various data modelling techniques are used here with input from software developers or the
database administrator. These data modelling techniques are various formats for representing data,
such as the relational data model, network model, hierarchical model, object-oriented model and
entity-relationship model.
The implementation level contains details about the modification and presentation of data through
the use of various data mining tools such as R-Studio, WEKA, Orange, etc. Each tool
has specific features governing how it works and a different way of representing and viewing the same data.
These tools are very helpful to the user since they are user-friendly and do not require much
programming knowledge from the user.
Observation Method:
We need to clearly differentiate our own observations from the observations provided to us by
other people. The range of data storage genres found in archives and collections is suitable
for documenting observations, e.g. audio, visual, textual and digital, including the sub-genres
of note taking, audio recording and video recording.
There exist various observation practices, and our role as an observer may vary
according to the research approach. We make observations from either the outsider or insider
point of view in relation to the researched phenomenon and the observation technique can be
structured or unstructured. The degree of the outsider or insider points of view can be seen as
a movable point in a continuum between the extremes of outsider and insider. If you decide
to take the insider point of view, you will be a participant observer in situ and actively
participate in the observed situation or community. The activity of a Participant observer in
situ is called field work. This observation technique has traditionally belonged to the data
collection methods of ethnology and anthropology. If you decide to take the outsider point of
view, you try to distance yourself from your own cultural ties and observe the
researched community as an outsider observer. These details are seen in figure 1.2.
Experimental Designs
There are a number of experimental designs that are used in carrying out an
experiment. However, market researchers have used 4 experimental designs most frequently.
These are –
A completely randomized design (CRD) is one where the treatments are assigned
completely at random so that each experimental unit has the same chance of receiving any
one treatment. For the CRD, any difference among experimental units receiving the same
treatment is considered as experimental error. Hence, CRD is appropriate only for
experiments with homogeneous experimental units, such as laboratory experiments, where
environmental effects are relatively easy to control. For field experiments, where there is
generally large variation among experimental plots in such environmental factors as soil, the
CRD is rarely used. The CRD is mainly used in the agricultural field.
Step 1. Determine the total number of experimental plots (n) as the product of the number of
treatments (t) and the number of replications (r); that is, n = rt. For our example, n = 5 x 4 =
20. Here, one pot with a single plant in it may be called a plot. In case the number of
replications is not the same for all the treatments, the total number of experimental pots is to
be obtained as the sum of the replications for each treatment, i.e.,
n = Σ r_i
Step 2. Assign a plot number to each experimental plot in any convenient manner; for
example, consecutively from 1 to n.
Step 3. Assign the treatments to the experimental plots randomly using a table of random
numbers.
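The three steps above can be sketched in Python as follows, using the example of t = 5 treatments and r = 4 replications (n = 20). The treatment labels T1 to T5 are assumed purely for illustration.

```python
# Sketch of the CRD randomization steps with t = 5 treatments and r = 4 replications.
import random

t, r = 5, 4
n = t * r                                         # Step 1: total number of plots (20)
plots = list(range(1, n + 1))                     # Step 2: plot numbers 1..n
treatments = [f"T{i + 1}" for i in range(t)] * r  # each treatment replicated r times

random.seed(42)
random.shuffle(treatments)                        # Step 3: random assignment to plots
for plot, treatment in zip(plots, treatments):
    print(f"plot {plot:2d} -> treatment {treatment}")
```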
Example 1: Assume that a farmer wishes to perform an experiment to determine which of
his 3 fertilizers to use on 2800 trees. Assume that the farmer has a farm divided into 3 terraces,
across which those 2800 trees can be divided in the format below.
Solution
Scenario 1
First we divide the 2800 trees at random into 3 almost equal parts:
Random Assignment1: 933 trees
Random Assignment2: 933 trees
Random Assignment3: 934 trees
So, for example, we can assign fertilizer 1 to random assignment 1, fertilizer 2 to random
assignment 2, and fertilizer 3 to random assignment 3.
Scenario 2
Thus the farmer will be able to analyze and compare the performance of the various fertilizers
on the different terraces.
Example 2:
A company wishes to test 4 different types of tyre. The tyre lifetimes, as determined
from their treads, are given, where each type of tyre has been tried on 6 similar automobiles
assigned at random to the tyres. Determine whether there is a significant difference between
the tyres at the 0.05 level.
Solution:
Null Hypothesis: There is no difference between the tyres in their life time.
We choose a value close to the average of all values in the table, for example 35, and
subtract it from each tyre's lifetime for every automobile.
Now, by using the ANOVA (one-way classification) table, we calculate the F-ratio.
F-Ratio:
The F-ratio is the ratio of two mean square values. If the null hypothesis is true, you
expect F to have a value close to 1.0 most of the time. A large F-ratio means that the variation
among group means is more than you would expect to see by chance.
If the value of the F-ratio is close to 1, it means the null hypothesis is likely true. If the F-ratio is
much greater than 1 and exceeds the critical value, we conclude that the null hypothesis is false.
In this scenario the value of the F-ratio is greater than 1, which indicates that there is variation
between the samples, so the assumed null hypothesis is false.
Level of significance = 0.05 (given in question)
Degrees of Freedom = (3, 20)
Critical value = 3.10 (from the 5 per cent F-table)
F-Ratio > critical value of 3.10
Hence the assumed null hypothesis is false. This indicates there is a lifetime difference between
the tyres.
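Because the lifetime table is not reproduced above, the following sketch uses hypothetical lifetime values for the four tyres (6 automobiles each) and scipy.stats.f_oneway to compute the F-ratio; the data values are assumptions for illustration only.

```python
# One-way ANOVA sketch for the tyre example, using hypothetical lifetime values.
from scipy import stats

tyre_A = [33, 35, 34, 36, 35, 34]
tyre_B = [38, 37, 39, 38, 40, 37]
tyre_C = [31, 32, 30, 33, 31, 32]
tyre_D = [36, 35, 37, 36, 38, 35]

f_ratio, p_value = stats.f_oneway(tyre_A, tyre_B, tyre_C, tyre_D)

# Critical value for alpha = 0.05 with (3, 20) degrees of freedom.
critical = stats.f.ppf(0.95, dfn=3, dfd=20)
print(f"F-ratio = {f_ratio:.3f}, critical value = {critical:.2f}, p = {p_value:.4f}")
if f_ratio > critical:
    print("Reject the null hypothesis: tyre lifetimes differ.")
else:
    print("Fail to reject the null hypothesis.")
```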
In a randomized block design, the experimenter divides subjects into subgroups called
blocks, such that the variability within blocks is less than the variability between blocks.
Then, subjects within each block are randomly assigned to treatment conditions. Compared to
a completely randomized design, this design reduces variability within treatment conditions
and potential confounding, producing a better estimate of treatment effects.
The table below shows a randomized block design for a hypothetical medical experiment.
              Treatment
Gender        Placebo     Vaccine
Male          250         250
Female        250         250
Subjects are assigned to blocks, based on gender. Then, within each block, subjects are
randomly assigned to treatments (either a placebo or a cold vaccine). For this design, 250
men get the placebo, 250 men get the vaccine, 250 women get the placebo, and 250 women
get the vaccine.
It is known that men and women are physiologically different and react differently to
medication. This design ensures that each treatment condition has an equal proportion of men
and women. As a result, differences between treatment conditions cannot be attributed to
gender. This randomized block design removes gender as a potential source of variability and
as a potential confounding variable.
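The assignment in the table above can be sketched as follows; the subject identifiers are invented for illustration and the random seed is arbitrary.

```python
# Randomized block design: within each gender block, subjects are randomly
# split into placebo and vaccine groups of 250 each, as in the table above.
import random

random.seed(1)
design = {}
for gender in ["Male", "Female"]:
    subjects = [f"{gender}-{i}" for i in range(1, 501)]   # 500 subjects per block
    random.shuffle(subjects)
    design[gender] = {"Placebo": subjects[:250], "Vaccine": subjects[250:]}

for gender, groups in design.items():
    for treatment, members in groups.items():
        print(gender, treatment, len(members))            # 250 per cell
```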
LSD - Latin Square Design - A Latin square is one of the experimental designs which has a
balanced two-way classification scheme say for example - 4 X 4 arrangement. In this scheme
each letter from A to D occurs only once in each row and also only once in each column. It may
be noted that this balanced arrangement will not be disturbed if any row is interchanged with
another.
A B C D
B C D A
C D A B
D A B C
The balanced arrangement achieved in a Latin square is its main strength. In this design, the
comparisons among treatments will be free from both row and column differences. Thus the
magnitude of error will be smaller than in any other design.
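The 4 x 4 arrangement shown above can be generated by cyclic shifts; this small sketch only reproduces the layout and is not a full analysis of a Latin square experiment.

```python
# Generate the 4 x 4 Latin square shown above by cyclically shifting each row:
# every letter appears exactly once in each row and each column.
letters = ["A", "B", "C", "D"]
square = [[letters[(row + col) % 4] for col in range(4)] for row in range(4)]
for row in square:
    print(" ".join(row))
```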
FD - Factorial Designs - This design allows the experimenter to test two or more variables
simultaneously. It also measures interaction effects of the variables and analyzes the impacts
of each of the variables.
In a true experiment, randomization is essential so that the experimenter can infer cause and
effect without any bias.
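A minimal sketch of laying out a full factorial design follows. The two variables (price and advertising) and their levels are hypothetical; the point is only that every combination of levels is run, which is what allows interaction effects to be measured.

```python
# Full factorial design: enumerate every combination of the variables' levels.
from itertools import product

price_levels = ["low", "high"]             # hypothetical variable 1
advert_levels = ["online", "print", "tv"]  # hypothetical variable 2

runs = list(product(price_levels, advert_levels))   # 2 x 3 = 6 treatment combinations
for i, (price, advert) in enumerate(runs, start=1):
    print(f"run {i}: price={price}, advertising={advert}")
```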
Internal sources
If available, internal secondary data may be obtained with less time, effort and money
than the external secondary data. In addition, they may also be more pertinent to the situation
at hand since they are from within the organization. The internal sources include
Accounting resources- These give a great deal of information which can be used by the marketing
researcher. They give information about internal factors.
Sales Force Reports- These give information about the sale of a product. The information
provided is from outside the organization.
Internal Experts- These are people who head the various departments. They can give
an idea of how a particular thing is working.
Miscellaneous Reports- This is the information obtained from operational
reports. If the data available within the organization are unsuitable or inadequate, the marketer
should extend the search to external secondary data sources.
Government Publications- Government sources provide an extremely rich pool of data for
the researchers. In addition, many of these data are available free of cost on internet websites.
There are a number of government agencies generating data. These are:
Director General of Commercial Intelligence- This office operates from Kolkata. It gives
information about foreign trade i.e. import and export. These figures are provided region-
wise and country-wise.
Ministry of Commerce and Industries- This ministry through the office of economic
advisor provides information on wholesale price index. These indices may be related to a
number of sectors like food, fuel, power, food grains etc. It also generates All India
Consumer Price Index numbers for industrial workers, urban non-manual employees and
agricultural labourers.
Reserve Bank of India- This provides information on banking, savings and investment. The RBI
also prepares currency and finance reports.
Labour Bureau- It provides information on skilled, unskilled, white-collar jobs etc.
National Sample Survey- This is done by the Ministry of Planning and it provides social,
economic, demographic, industrial and agricultural statistics.
State Statistical Abstract- This gives information on various types of activities related to the
state like - commercial activities, education, occupation etc.
The Bombay Stock Exchange (it publishes a directory containing financial accounts, key
profitability and other relevant matter)
Various Associations of Press Media. Export Promotion Council.
Syndicate Services- These services are provided by certain organizations which collect and
tabulate the marketing information on a regular basis for a number of clients who are the
subscribers to these services. So the services are designed in such a way that the information
suits the subscriber. These services are useful in television viewing, movement of consumer
goods etc. These syndicate services provide data from both households as well as
institutions.
In collecting data from households they use three approaches. Survey- they conduct surveys
regarding lifestyle, sociographics and general topics. Mail Diary Panel- it may be related to 2
fields, purchase and media.
Various syndicate services are Operations Research Group (ORG) and The Indian
Marketing Research Bureau (IMRB).
Importance of Syndicate Services
Syndicate services are becoming popular since the constraints of decision making are
changing and we need more of specific decision-making in the light of changing
environment. Also Syndicate services are able to provide information to the industries at a
low unit cost.
Disadvantages of Syndicate Services
The information provided is not exclusive. A number of research agencies provide
customized services which suit the requirements of each individual organization.
International Organizations- These include:
The International Labour Organization (ILO)- It publishes data on the total and active
population, employment, unemployment, wages and consumer prices
The Organization for Economic Co-operation and Development (OECD)- It publishes data
on foreign trade, industry, food, transport, and science and technology.
Based on various features (cost, data, process, source time etc.) various sources of
data can be compared as per table 1.
Sensor data is the output of a device that detects and responds to some type of input
from the physical environment. The output may be used to provide information or input to
another system or to guide a process. Examples are as follows
A photosensor detects the presence of visible light, infrared transmission (IR) and/or
ultraviolet (UV) energy.
Lidar, a laser-based method of detection, range finding and mapping, typically uses a
low-power, eye-safe pulsing laser working in conjunction with a camera.
A charge-coupled device (CCD) stores and displays the data for an image in such a way
that each pixel is converted into an electrical charge, the intensity of which is related to a
color in the color spectrum.
Smart grid sensors can provide real-time data about grid conditions, detecting outages,
faults and load and triggering alarms.
Wireless sensor networks combine specialized transducers with a communications
infrastructure for monitoring and recording conditions at diverse locations. Commonly
monitored parameters include temperature, humidity, pressure, wind direction and speed,
illumination intensity, vibration intensity, sound intensity, powerline voltage, chemical
concentrations, pollutant levels and vital body functions.
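As a rough sketch of how such sensor readings might be turned into alarms, the snippet below checks simulated values against assumed thresholds. It is not a real sensor API; the sensor names, units and limits are invented for illustration.

```python
# Illustrative only: compare simulated sensor readings with assumed alarm thresholds.
readings = [
    {"sensor": "temperature", "value": 72.5},   # degrees Celsius (assumed)
    {"sensor": "voltage", "value": 251.0},      # volts (assumed)
    {"sensor": "vibration", "value": 0.3},      # g (assumed)
]
thresholds = {"temperature": 80.0, "voltage": 245.0, "vibration": 1.0}

for reading in readings:
    limit = thresholds[reading["sensor"]]
    status = "ALARM" if reading["value"] > limit else "ok"
    print(f'{reading["sensor"]}: {reading["value"]} (limit {limit}) -> {status}')
```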
The simplest form of signal is a direct current (DC) that is switched on and off; this is
the principle by which the early telegraph worked. More complex signals consist of an
alternating-current (AC) or electromagnetic carrier that contains one or more data streams.
Data must be transformed into electromagnetic signals prior to transmission across a
network. Data and signals can be either analog or digital. A signal is periodic if it consists
of a continuously repeating pattern.
The Global Positioning System (GPS) is a space based navigation system that
provides location and time information in all weather conditions, anywhere on or near the
Earth where there is an unobstructed line of sight to four or more GPS satellites. The system
provides critical capabilities to military, civil, and commercial users around the world. The
United States government created the system, maintains it, and makes it freely accessible to
anyone with a GPS receiver.
Accuracy and Precision: This characteristic refers to the exactness of the data. It cannot
have any erroneous elements and must convey the correct message without being misleading.
This accuracy and precision have a component that relates to its intended use. Without
understanding how the data will be consumed, ensuring accuracy and precision could be off-
target or more costly than necessary. For example, accuracy in healthcare might be more
important than in another industry (which is to say, inaccurate data in healthcare could have
more serious consequences) and, therefore, justifiably worth higher levels of investment.
Legitimacy and Validity: Requirements governing data set the boundaries of this
characteristic. For example, on surveys, items such as gender, ethnicity, and nationality
are typically limited to a set of options and open answers are not permitted. Any answers
other than these would not be considered valid or legitimate based on the survey’s
requirement. This is the case for most data and must be carefully considered when
determining its quality. The people in each department in an organization understand what
data is valid or not to them, so the requirements must be leveraged when evaluating data
quality.
Reliability and Consistency: Many systems in today’s environments use and/or collect the
same source data. Regardless of what source collected the data or where it resides, it cannot
contradict a value residing in a different source or collected by a different system. There must
be a stable and steady mechanism that collects and stores the data without contradiction or
unwarranted variance.
Timeliness and Relevance: There must be a valid reason to collect the data to justify the
effort required, which also means it has to be collected at the right moment in time. Data
collected too soon or too late could misrepresent a situation and drive inaccurate
decisions.
Availability and Accessibility: This characteristic can be tricky at times due to legal and
regulatory constraints. Regardless of the challenge, though, individuals need the right level of
access to the data in order to perform their jobs. This presumes that the data exists and is
available for access to be granted.
Granularity and Uniqueness: The level of detail at which data is collected is important,
because confusion and inaccurate decisions can otherwise occur. Aggregated, summarized
and manipulated collections of data could offer a different meaning than the data
implied at a lower level. An appropriate level of granularity must be defined to provide
sufficient uniqueness and distinctive properties to become visible. This is a requirement for
operations to function effectively.
Noisy data is meaningless data. The term has often been used as a synonym for
corrupt data. However, its meaning has expanded to include any data that cannot be
understood and interpreted correctly by machines, such as unstructured text.
Noisy data
Examples: distortion of a person’s voice when talking on a poor phone and “snow” on
television screen
We can talk about the signal-to-noise ratio (SNR).
The left image of two clean sine waves has little or no noise (a high SNR); the right image
shows the two waves combined with noise and has a low SNR.
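A small NumPy sketch of this idea follows, taking the SNR in its usual form as the ratio of signal power to noise power, expressed in decibels; the 5 Hz sine wave and the noise level are assumptions for illustration.

```python
# Signal-to-noise ratio (SNR) sketch: a clean sine wave plus added noise.
import numpy as np

t = np.linspace(0, 1, 1000)
signal = np.sin(2 * np.pi * 5 * t)                    # clean 5 Hz sine wave
noise = np.random.default_rng(0).normal(0, 0.5, t.size)
noisy = signal + noise                                # the "noisy data"

snr_db = 10 * np.log10(np.mean(signal**2) / np.mean(noise**2))
print(f"SNR of the noisy signal: {snr_db:.1f} dB")    # lower SNR = noisier data
```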
Origins of noise
Note, however, that missing (null) values may have significance in themselves (e.g. a missing
test in a medical examination, or a missing death date meaning the person is still alive!).
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another.
Data Cleaning: Data is cleansed through processes such as filling in missing values,
smoothing the noisy data, or resolving the inconsistencies in the data.
Data Integration: Data with different representations are put together and conflicts
within the data are resolved.
Data Transformation: Data is normalized, aggregated and generalized.
Data Reduction: This step aims to present a reduced representation of the data in a
data warehouse.
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data cleaning is
done. It involves handling of missing data, noisy data etc.
1. Binning Method:
This method works on sorted data in order to smooth it. The whole data set is
divided into segments of equal size and then various methods are performed to
complete the task. Each segment is handled separately. One can replace all
data in a segment by its mean, or the segment's boundary values can be used to
complete the task (see the sketch after this list).
2. Regression:
Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).
3. Clustering:
This approach groups similar data into clusters. Outliers may go undetected,
or they may fall outside the clusters.
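The sketch below illustrates the binning method from item 1: the sorted values (an assumed sample) are split into equal-sized bins and smoothed either by bin means or by bin boundaries.

```python
# Smoothing noisy data by binning (assumed sample values, already sorted).
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bin_size = 3
bins = [prices[i:i + bin_size] for i in range(0, len(prices), bin_size)]

# Smoothing by bin means: every value in a bin is replaced by the bin mean.
by_means = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the nearer boundary.
by_boundaries = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print("bins:          ", bins)
print("by means:      ", by_means)
print("by boundaries: ", by_boundaries)
```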
2. Data Transformation:
This step is taken in order to transform the data in appropriate forms suitable for mining
process. This involves following ways:
1. Normalization:
It is done in order to scale the data values into a specified range (-1.0 to 1.0 or 0.0 to
1.0). (A short sketch follows this list.)
2. Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to help
the mining process.
3. Discretization:
This is done to replace the raw values of a numeric attribute by interval levels or
conceptual levels.
4. Concept Hierarchy Generation:
Here attributes are converted from a lower level to a higher level in the hierarchy. For example,
the attribute “city” can be converted to “country”.
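A short sketch of normalization and discretization follows; the age values, the min-max scaling to the range 0.0 to 1.0, and the interval labels are assumptions chosen only to illustrate the two transformations.

```python
# Min-max normalization to 0.0 - 1.0, and discretization into conceptual levels.
ages = [18, 25, 33, 47, 52, 66]          # assumed raw values

lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]

def discretize(age):
    # Assumed interval labels for illustration only.
    return "young" if age < 30 else "middle-aged" if age < 60 else "senior"

levels = [discretize(a) for a in ages]
print(normalized)
print(levels)
```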
3. Data Reduction:
Data mining is a technique that is used to handle huge amounts of data. While working
with huge volumes of data, analysis becomes harder. In order to deal with this, we
use data reduction techniques, which aim to increase storage efficiency and reduce data
storage and analysis costs.
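As a closing sketch, two common ways of reducing a data set are shown below: random sampling of records (numerosity reduction) and projecting onto fewer attributes with PCA (dimensionality reduction). The synthetic data and the use of scikit-learn are assumptions for illustration.

```python
# Data reduction sketch: row sampling and PCA on synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 20))        # 10,000 records with 20 attributes

sample = data[rng.choice(len(data), size=1_000, replace=False)]  # fewer records
reduced = PCA(n_components=5).fit_transform(data)                # fewer attributes

print(sample.shape)    # (1000, 20)
print(reduced.shape)   # (10000, 5)
```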