Recent Researches in Automatic Control

Data analysis: tools and methods


PROKOPOVA ZDENKA, SILHAVY PETR, SILHAVY RADEK
Department of Computer and Communication Systems
Faculty of Applied Informatics
Tomas Bata University in Zlin
nám. T. G. Masaryka 5555, 760 01 Zlín
CZECH REPUBLIC
[email protected], [email protected], [email protected] http://web.fai.utb.cz

Abstract: - The paper gives an overview of the contemporary state of the art and of trends in the field of data
analysis. Collecting, storing, merging and sorting enormous amounts of data has become a major challenge for
software and hardware alike. A growing number of companies and institutions develop and deploy tools for
saving and storing tables, documents or multimedia data. Database structures are the main instrument of most of
these applications, and they receive thousands or millions of new entries every day. The objective of analytical
tools is to obtain necessary and useful information from the collected data and subsequently to use it for active
control and decision making. The main aim of this contribution is to present some possibilities and tools of data
analysis with regard to their availability to end users.

Key-Words: - Data Analysis, Business Intelligence, Data Mining, OLAP, Data Warehouse

1 Introduction

The origin of Business Intelligence principles is connected with the name of Hans Peter Luhn, a researcher at IBM. In 1958 he published an article in the IBM Journal entitled "A Business Intelligence System", in which the main principles and ideas were formulated [1]. Its central idea is that the commercial aims of a company should be defined on the basis of an evaluation of present facts. These assumptions led to various software implementations intended for the administration of management information. In 1989 the term Business Intelligence, as defined by the analyst Howard J. Dresner, was introduced to a wider public. He described it as a set of concepts and methods intended to improve the quality of analytical and decision-making processes in organizations. He stressed the importance of data analysis, reporting and query tools, which give the user access to large amounts of data and help with the synthesis of valuable and useful information [2], [3].

The first information systems in large companies and banks were operated in the 1960s. Despite the title Management Information Systems, they covered only routine agendas specialized in the processing of accounting data. Systems intended not only for everyday operational control but especially for strategic management began to form a new discipline in the 1970s. These types of applications are known as Decision Support Systems. Their basic purpose was to provide information and tools for modeling and evaluating various business alternatives and strategies. The development of decision-support systems was also stimulated by progress in hardware and software. Two points can be seen as key factors in this development. The first is the increase in data access speed. The second is the revolutionary proposal of the relational data model introduced by E. F. Codd, which is based on mathematical set theory. With the arrival of graphical user interfaces, a third wave of tools supporting management processes appeared: the so-called Executive Information Systems (or Executive Support Systems), which offer top managers on-line access to current information about the state of the organization they control. The first applications of this type worked directly over the acquired operational data. However, this placed a heavy load on the primary systems, which eventually led to the separation of operational data from data kept for analyses [4], [5].


The in-depth analysis of data in business information systems and its subsequent utilization in company control can be labeled by the common term Business Intelligence. The analytical and planning character of Business Intelligence applications differs from ordinary operational systems in the user's view of the data. While operational systems work with detailed information, analytical tasks work with aggregated data. The analytical view of the data therefore required a change of the data access technology: operational systems work with transactional entity-relational databases, whereas analytical systems work with data warehouses and multidimensional databases.


2 Tools of analytical systems

2.1 Data transformations – data pumps
Data intended for further analysis have to be extracted from the operational systems and put into a data store. After that, analyses can be performed with the help of OLAP technology, data mining technology, or reporting services used to create reports. This step is the most important as well as the most demanding part of building data stores. It is necessary to analyze the content of technologically heterogeneous data sources, to choose the relevant data, and to centralize, integrate and aggregate them. Data pumps serve for the collection and transmission of data from the source systems to data stores and staging areas. They include:
- ETL systems for the extraction, transformation and transmission of data,
- EAI systems for application integration (which, in contrast to ETL tools, work in real time).

2.1.1 ETL – Extract, Transform and Load
Filling a data store (the ETL process) starts with data extraction from the primary sources (Extraction). During this phase various data inconsistencies are sought out and removed. Before their transformation into the target data schema, the extracted data can be loaded into a temporary staging area. A Data Staging Area (DSA) is most frequently part of those data store solutions whose sources are heavily loaded transactional systems. Using a DSA reduces the load that the ETL process places on the transactional systems, so they remain available for serving business processes. A DSA can also be used when it is necessary to transfer data, for example from a text file, into the required database format. The extraction is followed by the data transformation (Transformation), which converts the data obtained from the individual sources into a unified data model; this model makes it possible to create aggregations and clusters. The final phase of ETL is the transmission of the data from the source systems or the temporary staging area into the database tables of the data store (Load). At the initial filling this can be a gigantic quantity of data. Because ETL works in batch mode, each subsequent regular update brings only the amount of data corresponding to the chosen time period (day, week, month, year).
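As a minimal illustration of the Extract-Transform-Load sequence described above, the following Python sketch moves rows from a hypothetical operational SQLite database into a data-store table. The table and column names (orders, fact_orders and so on) are invented for the example; a real ETL tool would add scheduling, logging and error handling.

```python
# Minimal ETL sketch (illustrative only): extract rows from an operational
# database, transform them into a unified schema, and load them into a data
# store table. Table and column names are hypothetical.
import sqlite3

def etl(source_path: str, warehouse_path: str) -> None:
    src = sqlite3.connect(source_path)
    dwh = sqlite3.connect(warehouse_path)

    # Extraction: read raw order records from the source system.
    rows = src.execute(
        "SELECT order_id, customer, amount, order_date FROM orders"
    ).fetchall()

    # Transformation: remove inconsistent rows and unify the formats.
    cleaned = [
        (oid, customer.strip().upper(), float(amount), order_date[:10])
        for oid, customer, amount, order_date in rows
        if amount is not None and customer
    ]

    # Load: append the cleaned rows into the data-store table.
    dwh.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER, customer TEXT, amount REAL, order_day TEXT)"
    )
    dwh.executemany("INSERT INTO fact_orders VALUES (?, ?, ?, ?)", cleaned)
    dwh.commit()
    src.close()
    dwh.close()
```

In the batch setting mentioned above, such a routine would typically be run once per day, week or month and would load only the rows added since the previous run.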
2.1.2 EAI – Enterprise Application Integration
EAI tools are used in the source system layer. Their aim is the integration of the primary business systems and the reduction of the number of their mutual interfaces. These tools work on two levels:
- at the level of data integration, where they are used for the integration and distribution of data,
- at the level of application integration, where they are used for sharing selected functions of the information systems.

2.2 Database components – data warehouse
The philosophy of data warehouses (stores) was first published by Bill Inmon in the book Building the Data Warehouse in 1991. The genuine reason for the emergence of data warehouses was connected especially with the massive deployment of server business systems, conceived as separate and independent applications, at the end of the 1980s. Data warehouses were established as independent information systems built above the business data. While data warehouses are subject-oriented (data are separated according to their type), data marts are problem-oriented. For the purpose of data storage a new multidimensional database model was introduced, which makes it possible to create various views of the data easily and quickly with the help of special cuts of the data cube. This technology is the basis of today's analytical tools of Business Intelligence. By connecting BI with tools for business planning, a new type of application called Corporate Performance Management (CPM) was created.
Data warehouses are special types of business databases which contain consolidated data from all accessible operational systems. They are not optimized for quick transaction processing but for the quick delivery of analytical information obtained from large amounts of data. Data warehouses ensure the processes of storing, updating and administering the data. There are two basic types of data stores and two types of auxiliary stores.

2.2.1 Basic data stores
- Data Warehouse (DWH)
A data warehouse is a wide (extensive) central business database in which transformed data coming from various operational systems and external databases are saved. These data are intended for subsequent analyses.


- Data Marts (DMA)
The principle of data marts is similar to the principle of data warehouses. The difference lies in only one point of view: data marts are decentralized and thematically oriented. The analytical information they provide is aimed at a specific user group (marketing, sales etc.).

2.2.2 Auxiliary data stores
- Operational Data Store (ODS)
- Data Staging Area (DSA)

2.2.3 Schemes for data stores
The data models of operational systems used to be very complicated because they contain a lot of tables and relations. This led to an effort to simplify the ERD diagrams and to adapt them to the requirements of data stores. Two types of dimensional models were created for this data structure. We can distinguish them according to the connection between the dimension tables and the fact table (or tables); a short code sketch after Fig. 2 makes the difference concrete:
- Star schema – in this schema the data of each dimension are inserted into one denormalized table. The hierarchies of the dimensions are represented only by levels whose items are in one table. This makes the ETL process more complicated but, on the other hand, offers high query performance.

Fig. 1. Star schema

- Snowflake schema – in this schema the data of a dimension are spread over several related tables with cardinality 1:N. The tables are usually in third normal form. This reduces data redundancy, but because of the larger number of joins between tables the query performance decreases.

Fig. 2. Snowflake schema
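The following Python/SQLite sketch builds a tiny star schema with one fact table and one denormalized dimension table and runs a typical aggregation query over it. All table names, column names and values are chosen for illustration only; the comments indicate which column would move into its own table in a snowflake schema.

```python
# Illustrative star schema: one fact table referencing a denormalized
# dimension table, queried with a single join. Names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (       -- denormalized dimension (star schema)
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT                -- in a snowflake schema this column would
    );                               -- move to a separate dim_category table
    CREATE TABLE fact_sales (        -- fact table with numeric measures
        product_id INTEGER REFERENCES dim_product(product_id),
        sale_day TEXT,
        amount REAL
    );
""")
con.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Bolt", "Hardware"), (2, "Sensor", "Electronics")])
con.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, "2011-01-10", 120.0), (2, "2011-01-10", 80.0),
                 (1, "2011-01-11", 60.0)])

# Typical analytical query: aggregate the fact table over one dimension level.
for row in con.execute("""
        SELECT d.category, SUM(f.amount)
        FROM fact_sales f JOIN dim_product d USING (product_id)
        GROUP BY d.category"""):
    print(row)
```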
2.3 Analytical components

2.3.1 Analysis of multidimensional data - OLAP
The data in a data warehouse are cleaned and integrated, but often very voluminous. Special data structures and a technology known as OLAP (On-line Analytical Processing) are used for their analysis. OLAP tools are simple, readily available and very popular means of creating multidimensional analyses; pivot tables in MS Excel are one example. Twelve rules for OLAP were defined by Dr. Codd in 1993:
- Multidimensional conceptual view - the system should offer a multidimensional model corresponding to the individual needs of the business and enable intuitive manipulation and analysis of the obtained data.
- Transparency - the system should be connectable to front-end systems.
- Availability - the system should offer only the data needed for the analysis; users are not interested in the way the system accesses heterogeneous sources.
- Consistent performance - the performance of the system must not depend on the number of dimensions.
- Client-server architecture - an OLAP system has to be of the client-server type.
- Generic dimensionality - each dimension of the data has to be equivalent in structure and operational abilities.
- Dynamic treatment of sparse matrices - the system should be able to adapt its physical scheme to the analytical model, optimizing the treatment of sparse matrices.
- Multi-user support - the system should support team work of users and parallel data processing.
- Unrestricted cross-dimensional operations - the system has to recognize dimensional hierarchies and automatically execute the associated calculations.
- Intuitive manipulation with data - the user interface should be intuitive.
- Flexible reporting - the system should allow changes in the arrangement of rows and columns (according to the needs of the analysis).
- Unlimited number of dimensions and aggregation levels - an OLAP system should not impose any artificial restriction on the number of dimensions or aggregation levels.

2.3.2 Description of the OLAP technology


The OLAP technology works with so-called multidimensional data. In contrast to the two-dimensional data storage of relational databases (columns and rows), an n-dimensional data cube is used here. The data cube can be considered an n-dimensional hypercube known from analytic geometry. A multidimensional database is not normalized; it is formed from dimension tables and fact tables organized into a schema. Every dimension represents a different view of the data. The data can be organized not only logically but also hierarchically. The numerical data coming from the process are stored in the fact table.

Fig. 3. Structures of relational and multidimensional databases
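To make the idea of slicing an n-dimensional cube more tangible, the following short Python sketch (pandas is assumed to be available) aggregates a flat fact table along three dimensions and then takes one cut of the resulting cube. The data and column names are invented for the example.

```python
# Illustrative data cube built from a flat fact table with pandas.
# Dimension values and measures are hypothetical.
import pandas as pd

facts = pd.DataFrame({
    "product": ["Bolt", "Bolt", "Sensor", "Sensor", "Bolt"],
    "region":  ["West", "East", "West",  "East",   "West"],
    "month":   ["Jan",  "Jan",  "Jan",   "Feb",    "Feb"],
    "amount":  [120.0,  80.0,   60.0,    90.0,     40.0],
})

# Aggregate the measure over all three dimensions: a small 3-D cube.
cube = facts.groupby(["product", "region", "month"])["amount"].sum()

# One "slice" of the cube: fix the month dimension to January.
print(cube.xs("Jan", level="month"))
```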
2.3.3 Physical realization of the multidimensional data model
- MOLAP – Multidimensional OLAP
It needs a special multidimensional database, which is periodically updated with data from the data warehouse. MOLAP is useful for small and medium-sized data quantities.
- ROLAP – Relational OLAP
It works above the relational database of the data warehouse or data mart. Multidimensional queries are automatically translated into the corresponding SQL queries (SELECT); see the sketch after this list. ROLAP is useful for large data quantities.
- HOLAP – Hybrid OLAP
It is a specific combination of both approaches. The data analysis works with relational databases, but the aggregations are stored in a multidimensional structure (in the data warehouse).
- DOLAP – Dynamic OLAP
This is a special type of OLAP in which the multidimensional data cube is constructed virtually in RAM. The basic advantage of this solution is unlimited flexibility; its disadvantage is the significant demand on RAM.
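A ROLAP engine, as described above, rewrites a multidimensional request into ordinary SQL over the relational store. The small Python sketch below imitates that translation step by hand: a request given as measures, dimensions and slice conditions is turned into a SELECT statement. The helper function and the table and column names are invented; a real engine would additionally handle joins to dimension tables and proper query parameterization.

```python
# Sketch of the ROLAP idea: a multidimensional request (measure, dimensions,
# slice conditions) is translated into an ordinary SQL SELECT statement.
# Table and column names are hypothetical.
def to_sql(fact: str, measure: str, dimensions: list[str],
           slices: dict[str, str]) -> str:
    group_by = ", ".join(dimensions)
    where = " AND ".join(f"{col} = '{val}'" for col, val in slices.items())
    return (f"SELECT {group_by}, SUM({measure}) "
            f"FROM {fact} "
            f"WHERE {where} "
            f"GROUP BY {group_by}")

# "Total amount by product and region, for January only" becomes plain SQL:
print(to_sql("fact_sales", "amount",
             ["product", "region"], {"month": "Jan"}))
```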
2.3.4 Knowledge mining from data
Data mining is the process of looking for information and hidden or unknown relations in large masses of data. The development of this analytical method is connected with the enormous growth of data in company databases. Not only the data but also the number of errors in the data keep increasing. Data mining works on the principle that possible hypotheses are created on the basis of the real data. These hypotheses then need to be verified and, accordingly, adopted or rejected.
Data mining arose from the connection of the database and statistical disciplines. It utilizes various sophisticated algorithms which make it possible to predict future development or to segment (cluster) related data. From the point of view of mathematical and statistical theory it is based on searching for correlations and on hypothesis testing.
The quality of the input data is very important for data mining. If the data do not contain some important fact, the resulting analysis cannot be correct. For this reason the preparation of the data intended for the analysis is essential. Usually one table containing preprocessed and cleaned data is created from the data warehouse.

Objective setting
Ordinarily there is some real problem which is the impulse to start the data mining process. At the end of this process there should be an amount of information suitable for solving the defined problem. Marketing is perhaps the area where data mining is used most.

Data selection
In this phase it is necessary to choose the data for data mining not only according to the point of view of the analysis (demographic, behavioral, psychological etc.) but also according to the source databases. The data are usually extracted from the source systems to a special server.

Data preprocessing
Data preparation is the most demanding and most critical phase of the process. It is necessary to choose the relevant information from the voluminous databases and save it into a simple table. Data preprocessing consists of the following steps (a short code sketch follows the list):
- Data cleaning – solving the problem of missing or inconsistent data,
- Data integration – various sources cause problems with data redundancy and nomenclature,
- Data transformation – the data have to be transformed into a format suitable for data mining,
- Data reduction – erasing of unneeded data and attributes, data compression etc.
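The preprocessing steps listed above can be sketched in a few lines of Python with pandas (assumed to be installed). The example cleans missing values, merges two hypothetical sources, derives a transformed attribute and keeps only the attributes needed for mining; the data are invented.

```python
# Illustrative data preprocessing: cleaning, integration, transformation,
# reduction. The data and column names are invented for the example.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [120.0, None, 80.0, 60.0],        # missing value to clean
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "wholesale", "retail"],
})

# Data cleaning: drop rows with missing measurements.
orders = orders.dropna(subset=["amount"])

# Data integration: merge the two sources on a common key.
table = orders.merge(customers, on="customer_id", how="left")

# Data transformation: derive an attribute in a mining-friendly format.
table["is_large"] = table["amount"] > 100

# Data reduction: keep only the attributes needed for the analysis.
mining_table = table[["segment", "amount", "is_large"]]
print(mining_table)
```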


Data mining models
The prepared data can be processed by special algorithms in order to obtain mathematical models.
- Data exploration analysis – independent searching in the data without previous knowledge.
- Description – describes the full data set; groups are created according to the behavior they demonstrate.
- Prediction – tries to predict an unknown value from the knowledge of the other values.
- Retrieval according to a template – the analyst's aim is to find data corresponding to given templates.

Data mining methods
The main groups of methods are the following (a small code illustration follows the list):
- Regression methods – linear regression analysis, nonlinear regression analysis, neural networks,
- Classification – logistic regression analysis, decision trees,
- Segmentation (clustering) – cluster analysis, genetic algorithms, neural clustering,
- Time series prediction – Box-Jenkins method, neural networks,
- Deviation detection.
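As a small, self-contained illustration of the segmentation and classification methods named above, the following Python sketch (scikit-learn is assumed to be installed) clusters a toy data set with k-means and fits a decision tree on labeled examples. The numbers and labels carry no real meaning.

```python
# Toy illustration of two data mining methods mentioned above:
# segmentation (k-means clustering) and classification (decision tree).
# The data are synthetic and carry no real meaning.
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

# Each row: [monthly purchases, average order amount]
X = [[2, 30.0], [3, 25.0], [40, 310.0], [38, 290.0], [5, 45.0], [42, 330.0]]

# Segmentation: split customers into two clusters without any labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("cluster labels:", clusters)

# Classification: learn a decision tree from labeled examples
# (1 = wholesale customer, 0 = retail customer) and classify a new one.
y = [0, 0, 1, 1, 0, 1]
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print("predicted class:", tree.predict([[36, 300.0]]))
```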
2.4 Tools for end users

2.4.1 Analytical tools of MS SQL Server 2008
From the beginnings of OLAP, Microsoft has made an effort to create a model of self-service analytical tools. In the version MS SQL Server 2005 all analytical layers were joined into the Unified Dimensional Model. In the version MS SQL Server 2008 the focal point lies in the analysis services, which comprise OLAP, Data Mining, Reporting Services and Integration Services.
- Integration Services
SQL Server Integration Services (SSIS) works as an ETL data pump. It allows creating applications for data administration, manipulation with files in directories, data import and data export.
- Reporting Services
SQL Server Reporting Services (SSRS) provides a flexible platform for the creation and distribution of reports. It cooperates with the client tool MS SQL Server Report Builder, which is completely free for end users.
- Analysis Services
SQL Server Analysis Services (SSAS) is the key component of data analysis. It consists of two components:
  - an OLAP module for multidimensional data analysis, enabling the loading, querying and administration of data cubes created with Business Intelligence Development Studio (BIDS),
  - a Data Mining module, which extends the possibilities of business analyses.
2.4.2 Data analysis user tools - MS Excel
The simplest and most readily available way of analysing business data is offered by MS Excel. It is certainly also the cheapest way, because there is hardly a manager or chief executive without this program installed on their notebook or PC, so no licence for specialized software has to be bought. Users can create analytical reports and graphs immediately. Data analyses created in MS Excel are very dynamic and effective; they enable many different views and graphical representations. Data can be brought into MS Excel in several ways. The most common is manual filling of the table from business reports. The second way is easier: data import from the business information system. The third way is a direct connection to the database of the business information system; this way is the most flexible.

Data analysis by pivot tables and graphs
Pivot tables are one of the most powerful tools of MS Excel. They enable data summarization, filtering and ordering. It is possible to create many different views, reports and graphs from one data source. A created pivot table is easily modifiable - we can add or delete data, columns and rows or change the summaries without influencing the data source. Pivot tables are also very often used as a user tool for working with data cubes served by MS SQL Server.
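Outside of Excel, the same pivot-table idea is available in many environments. As a rough programmatic equivalent, the Python sketch below summarizes a small, invented sales table with pandas.pivot_table (pandas is assumed to be installed).

```python
# Rough programmatic equivalent of an Excel pivot table, using pandas.
# The sales figures and category names are invented for the example.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["West", "West", "East", "East", "West"],
    "product": ["Bolt", "Sensor", "Bolt", "Sensor", "Bolt"],
    "amount":  [120.0, 80.0, 60.0, 90.0, 40.0],
})

# Rows = region, columns = product, values = summed amount (with totals).
pivot = pd.pivot_table(sales, index="region", columns="product",
                       values="amount", aggfunc="sum",
                       margins=True, margins_name="Total")
print(pivot)
```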


3 Example
From the point of view of manufacturing processes, an interesting use of data mining or OLAP is the analysis of the stages of a technological process, the prediction and diagnosis of abnormal stages, and the search for technological relations in the historical data arising as a secondary product of monitoring.
As an example, the use of SQL Server Analysis Services as the key component for data analysis is described. For multidimensional data analysis enabling the loading, querying and administration of data cubes, we used the OLAP module created with Business Intelligence Development Studio (BIDS). To create a new project we must choose an Analysis Services project.

Fig. 4. Project creation in BIDS

The second step is the creation of the Data Source connection.

Fig. 5. Definition of the Data Source connection

The third step is the Data Source View setting.

Fig. 6. Setting of the Data Source View

The last step is the composition of the Data Cube.

Fig. 7. Data Cube definition

The completed Data Cube can be viewed in the browser environment, or it can be drawn as a three-dimensional cube for better understanding.

Fig. 8. Data analysis with the help of the data cube


4 Conclusion

High-quality data analysis and the level of the information gained from it stand behind all correct managerial decisions. Good managers are able to use them to improve efficiency and the competitive advantage of the company by predicting trends and future development tendencies.
Acknowledgments
This work was supported by the Ministry of
Education, Youth and Sports of the Czech Republic
within the framework of the project No. MSM 7088352102.

References:
[1] H. P. Luhn, A Business Intelligence System. IBM Journal of Research and Development, 1958, pp. 314-319.
[2] M. Berthold, D. Hand, Intelligent Data Analysis. Springer, Berlin, 2009.
[3] P. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining. 2005. ISBN 0-321-32136-7.
[4] G. Shmueli, N. R. Patel, P. C. Bruce, Data Mining for Business Intelligence. 2006. ISBN 0-470-08485-5.
[5] D. Pokorná, Business Data Analyses Possibilities. Diploma thesis. Faculty of Applied Informatics, Tomas Bata University in Zlín, 2010.
[6] D. Power, A Brief History of Decision Support Systems. DSSResources.com [online], 2007 [cit. 2010-06-07]. Available from: <http://dssresources.com/history/dsshistory.html>.
