Data Analysis: Tools and Methods: January 2011
Radek Silhavy et al., Tomas Bata University in Zlín
Abstract: - The paper outlines the contemporary state of the art and trends in the field of data analysis. Collecting, storing, merging and sorting enormous amounts of data has been a major challenge for software and hardware facilities. A growing number of companies and institutions have developed tools for saving and storing tables, documents and multimedia data. Database structures are a major instrument in prevailing applications, and these structures receive thousands or millions of entries every day. The objective of analytical tools is to obtain necessary and useful information from the collected data and subsequently use it for active control and decision making. The main aim of this contribution is to present some possibilities and tools of data analysis with regard to their availability to end users.
with aggregate data. So the analytical view of the data imposed the necessity of changing the data access technology: operational systems work with transactional entity-relational databases, while analytical systems work with data warehouses and multidimensional databases.

2 Tools of analytical systems

2.1 Data transformations – data pumps
Data intended for further analysis have to be extracted from the operational systems and put into a data store. After that, analyses can be performed with the help of OLAP technology, data mining technology, or reporting services that create reports. This step is the most important, as well as the most demanding, part of building a data store. It is necessary to analyse the content and technology of heterogeneous data sources and then to choose the relevant data and centralize, integrate and aggregate them. Data pumps serve for the collection and transmission of data from source systems to data stores and staging areas. They include:
- ETL systems for the extraction, transformation and transmission of data,
- EAI systems for application integration (which, in contrast to ETL tools, work in real time).

2.1.1 ETL – Extract, Transform and Load
Filling a data store (the ETL process) starts with data extraction from primary sources (Extraction). During this phase various data inconsistencies are sought out and removed. Before their transformation into the target data schema, the extracted data can be loaded into a temporary staging area. This component (Data Staging Area – DSA) is most frequently part of those data store solutions whose sources are heavily loaded transactional systems. Using a DSA reduces the load that the ETL process places on the transactional systems, so they remain available to serve business processes. A DSA can also be used when it is necessary to transfer data from, for example, a text file into the required database format. Extraction is followed by data transformation (Transformation), which converts the data obtained from the individual data sources into a unified data model. This model makes it possible to create aggregations and clusters. The final phase of ETL is the transmission of data (Load) from the source data stores or the staging area into the database tables of the data store. At the initial filling this can be a gigantic quantity of data. Because ETL works in batch mode, each subsequent regular update brings only the amount of data corresponding to the chosen time period (day, week, month, year).

2.1.2 EAI – Enterprise Application Integration
EAI tools are used in the source system layer. Their aim is the integration of primary business systems and a reduction of the number of their mutual interfaces. These tools work on two levels:
- at the level of data integration, where they are used for the integration and distribution of data,
- at the level of application integration, where they are used for sharing selected functions of information systems.

2.2 Database components – data warehouse
The philosophy of the data warehouse was first published by Bill Inmon in the book Building the Data Warehouse in 1991. The genuine reason for the emergence of data warehouses was connected above all with the massive deployment of server business systems, conceived as separate and independent applications, at the end of the 1980s. Data warehouses were established as independent information systems built above business data. While data warehouses are subject-oriented (data are separated according to their types), data marts are problem-oriented. For the purpose of data storage a new multidimensional database model was introduced, which made it possible to create various views of the data easily and quickly with the help of special cuts of the data cube. This technology is the basis of today's analytical tools of Business Intelligence. By connecting BI with tools for business planning, a new type of application called Corporate Performance Management (CPM) was created.
Data warehouses are special types of business databases which contain consolidated data from all accessible operational systems. They are optimized not for quick transaction processing but for the quick delivery of analytical information obtained from large amounts of data. Data warehouses ensure the processes of storing, updating and administering data. There are two basic types of data stores and two types of auxiliary stores.

2.2.1 Basic data stores
Data Warehouse (DWH)
A data warehouse is an extensive central business database in which transformed data coming from various operational systems and external databases are saved. These data are intended for subsequent analyses.
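The ETL flow described above (extract from an operational source, transform into a unified model, load one period per batch) can be sketched as follows. This is a minimal illustration, not the paper's implementation; the table and column names and the daily period are invented assumptions.

```python
import sqlite3

# Minimal ETL sketch on an in-memory database (schema is illustrative).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders_src (id INTEGER, day TEXT, region TEXT, amount REAL);
    CREATE TABLE dwh_sales (day TEXT, region TEXT, total REAL);  -- unified model
    INSERT INTO orders_src VALUES
        (1, '2011-01-10', 'CZ', 100.0),
        (2, '2011-01-10', 'CZ', 50.0),
        (3, '2011-01-11', 'SK', 70.0);
""")

def etl_batch(con, day):
    """Load one period (here: one day) into the warehouse, as in batch-mode ETL."""
    # Extraction: read only the rows of the given period from the source system.
    rows = con.execute(
        "SELECT day, region, amount FROM orders_src WHERE day = ?", (day,)
    ).fetchall()
    # Transformation: aggregate into the unified model (sum per region).
    totals = {}
    for d, region, amount in rows:
        totals[(d, region)] = totals.get((d, region), 0.0) + amount
    # Load: write the transformed rows into the data-store table.
    con.executemany("INSERT INTO dwh_sales VALUES (?, ?, ?)",
                    [(d, r, t) for (d, r), t in totals.items()])

etl_batch(con, "2011-01-10")
print(con.execute("SELECT * FROM dwh_sales").fetchall())
# [('2011-01-10', 'CZ', 150.0)]
```

Each regular run would call `etl_batch` with the next period only, which keeps the batch small compared to the initial filling.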
(columns and rows), here an n-dimensional data cube is used. The data cube can be considered an n-dimensional hypercube known from analytic geometry.
A multidimensional database is not normalized. It is formed from tables of dimensions and facts organized into a schema. Every dimension represents a different view of the data. Data can be organized not only logically but also hierarchically. Numerical data coming from the process are stored in the fact table.

Fig. 3. Structures of relational and multidimensional databases

2.3.3 Physical realization of the multidimensional data model
MOLAP – Multidimensional OLAP
It requires a special multidimensional database which is periodically updated with data from the data warehouse. MOLAP is useful for small and medium-sized data quantities.
ROLAP – Relational OLAP
It works above the relational database of a data warehouse or data mart. Multidimensional queries are automatically translated into corresponding SQL queries (SELECT). ROLAP is useful for extensive data quantities.
HOLAP – Hybrid OLAP
It is a specific combination of both approaches: data analysis works with relational databases, but aggregations are stored in a multidimensional structure (in the data warehouse).
DOLAP – Dynamic OLAP
This is a special type of OLAP in which the multidimensional data cube is constructed virtually in RAM. The basic advantage of this solution is unlimited flexibility; its disadvantage is significant demands on RAM.

2.3.4 Knowledge mining from data
Data mining is the process of looking for information and hidden or unknown relations in large masses of data. The development of this analytical method is connected with the enormous growth of data in company databases. Not only the data but also the number of errors in the data keeps increasing. Data mining works on the intuitive principle that possible hypotheses are created on the basis of real data. These hypotheses then need to be verified and, according to the results, adopted or rejected.
Data mining arose from the connection of the database and statistical disciplines. It utilizes various sophisticated algorithms with which it is possible to predict development or to segment (cluster) related data. From the mathematical and statistical point of view it is based on searching for correlations and testing hypotheses.
The quality of the input data is very important for data mining. If the data do not contain some important facts, the analysis cannot be correct. For this reason the preparation of the data intended for analysis is very important. Usually one table containing preprocessed and cleaned data is created from the data warehouse.
Objective setting
Ordinarily, a concrete real problem is the impulse to start the data mining process. At the end of this process there should be an amount of information suitable for solving the defined problem. Marketing is perhaps the area of largest use of data mining.
Data selection
In this phase it is necessary to choose the data for data mining not only according to the point of view of the analysis (demographic, behavioral, psychological etc.) but also according to the source databases. Data are usually extracted from the source systems to a special server.
Data preprocessing
Data preparation is the most demanding and most critical phase of the process. It is necessary to choose the corresponding information from voluminous databases and save it into a simple table. Data preprocessing consists of the following steps:
- Data cleaning – solving the problem of missing or inconsistent data,
- Data integration – various sources cause problems with data redundancy and nomenclature,
- Data transformation – data have to be transformed into a format suitable for data mining,
- Data reduction – erasing unneeded data and attributes, data compression etc.
Data mining models
Previously prepared data can be processed by special algorithms to obtain mathematical models:
- Data exploration analysis – independent searching of the data without previous knowledge.
- Description – describes the full data set; groups are created according to demonstrated behavior.
- Prediction – tries to predict an unknown value from knowledge of the others.
- Retrieval according to a template – the analyst's aim is to find data corresponding to templates.
Data mining methods
- Regression methods – linear regression analysis, nonlinear regression analysis, neural networks,
- Classification – logistic regression analysis, decision trees,
- Segmentation (clustering) – cluster analysis, genetic algorithms, neural clustering,
- Time series prediction – Box-Jenkins method, neural networks,
- Deviation detection.

2.4 Tools for end users
2.4.1 Analytical tools of MS SQL Server 2008
From the beginning of OLAP, Microsoft has made an effort to create a model of self-service analytical tools. In MS SQL Server 2005 all analytical levels were joined into the Unified Dimension Model. In MS SQL Server 2008 the focal point is Analysis Services, which contains OLAP, Data Mining, Reporting Services and Integration Services.
Integration Services
SQL Server Integration Services (SSIS) works as an ETL data pump. It allows creating applications for data administration, manipulation with files in directories, and data import and export.
Reporting Services
SQL Server Reporting Services (SSRS) provides a flexible platform for report creation and distribution. It cooperates with the client tool MS SQL Server Report Builder, which is completely free for end users.
Analysis Services
SQL Server Analysis Services (SSAS) is a key component of data analysis. It consists of two components:
- an OLAP module for multidimensional data analysis enabling the loading, querying and administration of data cubes created with Business Intelligence Development Studio (BIDS),
- a Data Mining module which extends the possibilities of business analyses.

2.4.2 Data analysis user tools – MS Excel
The simplest and most accessible way of analysing business data is offered by MS Excel. It is certainly also the cheapest way, because there is hardly a manager or chief executive without this program installed on their notebook or PC, so it is not necessary to buy a license for specialized software. Users can create analytical reports and graphs immediately. Data analyses created in MS Excel are very dynamic and effective; they enable many different views and graphical representations. Data can be brought into MS Excel in several ways. The most common is manual table filling from business reports. The second way is easier: data import from the business information system. The third way is a direct connection to the database of the business information system; this way is the most flexible.
Data analysis by pivot tables and graphs
Pivot tables are one of the most powerful tools of MS Excel. They enable data summarization, filtering and ordering. It is possible to create many different views, reports and graphs from one data source. A created pivot table is easily variable: we can add or delete data, columns or rows, or change summaries, without influencing the data source. Pivot tables are very often used as a user tool for working with a data cube served by MS SQL Server.

3 Example
From the point of view of manufacturing processes, an interesting utilization of data mining or OLAP is the analysis of the stages of a technological process, the prediction and diagnosis of abnormal stages, and the search for technological connections in the historical data arising as a secondary product of monitoring.
As an example, the utilization of SQL Server Analysis Services as a key component for data analysis is presented. For multidimensional data analysis enabling the loading, querying and administration of data cubes we used the OLAP module created with Business Intelligence Development Studio (BIDS). If we want to create a new project, we must choose an Analysis Services project.
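The kind of summarization that pivot tables and cube slices provide can be imitated with a short script. This is only an illustrative sketch: the fact rows and the region/product dimensions are invented, not taken from the paper's example.

```python
from collections import defaultdict

# Illustrative fact rows: (region, product, amount) -- invented sample data.
facts = [
    ("CZ", "steel", 100), ("CZ", "glass", 40),
    ("SK", "steel", 70),  ("CZ", "steel", 30),
]

def pivot(rows, row_dim, col_dim, value):
    """Summarize fact rows into a pivot table: row dim x column dim -> summed value."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_dim]][r[col_dim]] += r[value]
    return {k: dict(v) for k, v in table.items()}

# Pivot by region (rows) and product (columns), summing amounts --
# analogous to dragging two dimensions into an Excel pivot table.
result = pivot(facts, row_dim=0, col_dim=1, value=2)
print(result)   # {'CZ': {'steel': 130, 'glass': 40}, 'SK': {'steel': 70}}
```

Choosing different `row_dim`/`col_dim` indices corresponds to taking a different slice of the same data cube from one data source, without touching the underlying facts.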
Fig. 6. Setting of the Data Source View

The last step is the composition of the data cube.

Fig. 7. Data Cube definition
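As a closing illustration of the segmentation (clustering) methods listed among the data mining methods above, the following is a minimal k-means sketch. The one-dimensional points, the two-cluster choice and the initial centers are assumptions made for the example, not data from the paper.

```python
# Minimal k-means sketch for segmentation (clustering); data are invented.
points = [1.0, 1.2, 0.8, 8.0, 8.5, 7.9]   # two obvious groups
centers = [0.0, 10.0]                      # assumed initial centers

for _ in range(10):                        # a few refinement iterations
    # Assignment step: attach each point to its nearest center.
    clusters = [[], []]
    for p in points:
        i = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
        clusters[i].append(p)
    # Update step: move each center to the mean of its cluster.
    centers = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]

print(sorted(round(c, 2) for c in centers))   # [1.0, 8.13]
```

The same assignment/update idea underlies the cluster analysis and neural clustering methods named above; real tools differ mainly in how the centers and distances are defined.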