DB2
Front cover
Whei-Jen Chen
Scott Andrus
Bhuvana Balaji
Enzo Cialini
Michael Kwok
Roman B. Melnyk
Jessica Rockwood
ibm.com/redbooks
SG24-8157-00
Note: Before using this information and the product it supports, read the information in
Notices on page ix.
Contents
Notices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Trademarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Authors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiv
Now you can become a published author, too! . . . . . . . . . . . . . . . . . . . . . . . . xv
Comments welcome. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Stay connected to IBM Redbooks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Part 1. Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 1. Gaining business insight with IBM DB2 . . . . . . . . . . . . . . . . . . . 3
1.1 Current business challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Big data and the data warehouse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.1 Data warehouse infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 High performance warehouse with DB2 for Linux, UNIX, and Windows . . . 7
Chapter 2. Technical overview of IBM DB2 Warehouse. . . . . . . . . . . . . . . . 9
2.1 DB2 Warehouse solutions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 The component groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.2 Editions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Expert systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 PureData for operational analytics . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 DB2 Warehouse components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 Database management system . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3 DB2 Warehouse topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Chapter 3. Warehouse development lifecycle . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Defining business requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Building the data model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.1 Defining the physical data model . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.2 Creating the physical data model . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.2.3 Working with diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.4 Using the Diagram Editor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.5 Optimal attributes of the physical data model . . . . . . . . . . . . . . . . . . 39
3.2.6 Model analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.7 Deploying the data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.2.8 Maintaining model accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.1.2 Ingest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.1.3 Continuous data ingest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.2 DB2 Warehouse SQL Warehousing Tool . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.2 Development environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.3 Moving from development to production . . . . . . . . . . . . . . . . . . . . . 101
6.2.4 Runtime environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Chapter 7. Monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.1 Understanding monitor elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
7.1.1 Monitor element collection levels . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7.1.2 New monitoring elements for column-organized tables . . . . . . . . . 110
7.2 Using DB2 table functions to monitor your database in real time . . . . . . 111
7.2.1 Monitoring requests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.2 Monitoring activities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2.3 Monitoring data objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.2.4 Monitoring locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.5 Monitoring system memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.2.6 Monitoring routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3 Using event monitors to capture information about database events . . . 114
7.4 Monitoring your DB2 system performance with IBM InfoSphere Optim
Performance Manager . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4.1 Information dashboards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.4.2 OPM support for DB2 10.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Chapter 8. High availability. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
8.1 IBM PureData System for Operational Analytics. . . . . . . . . . . . . . . . . . . 120
8.2 Core warehouse availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8.2.1 Roving HA group configuration for the administration hosts . . . . . . 124
8.2.2 Roving HA group configuration for the data hosts . . . . . . . . . . . . . 125
8.2.3 Core warehouse HA events monitored by the system console. . . . 127
8.3 Management host availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.3.1 Management host failover events that are monitored by the system
console . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
8.4 High availability management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8.4.1 High availability toolkit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
8.4.2 Starting and stopping resources with the HA toolkit . . . . . . . . . . . . 133
8.4.3 Monitoring the status of the core warehouse HA configuration. . . . 134
8.4.4 Monitoring the status of the management host HA configuration . . 137
8.4.5 Moving database partition resources to the standby node as a planned
failover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
8.4.6 Moving resources to the standby management host as a planned
failover . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Notices
This information was developed for products and services offered in the U.S.A.
IBM may not offer the products, services, or features discussed in this document in other countries. Consult your
local IBM representative for information on the products and services currently available in your area. Any
reference to an IBM product, program, or service is not intended to state or imply that only that IBM product,
program, or service may be used. Any functionally equivalent product, program, or service that does not infringe
any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and
verify the operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter described in this document. The
furnishing of this document does not grant you any license to these patents. You can send license inquiries, in
writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such
provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION
PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR
IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of
express or implied warranties in certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors. Changes are periodically made to the
information herein; these changes will be incorporated in new editions of the publication. IBM may make
improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time
without notice.
Any references in this information to non-IBM websites are provided for convenience only and do not in any
manner serve as an endorsement of those websites. The materials at those websites are not part of the materials
for this IBM product and use of those websites is at your own risk.
IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any
obligation to you.
Any performance data contained herein was determined in a controlled environment. Therefore, the results
obtained in other operating environments may vary significantly. Some measurements may have been made on
development-level systems and there is no guarantee that these measurements will be the same on generally
available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual
results may vary. Users of this document should verify the applicable data for their specific environment.
Information concerning non-IBM products was obtained from the suppliers of those products, their published
announcements or other publicly available sources. IBM has not tested those products and cannot confirm the
accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the
capabilities of non-IBM products should be addressed to the suppliers of those products.
This information contains examples of data and reports used in daily business operations. To illustrate them as
completely as possible, the examples include the names of individuals, companies, brands, and products. All of
these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is
entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which illustrate programming
techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any
form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs
conforming to the application programming interface for the operating platform for which the sample programs are
written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or
imply reliability, serviceability, or function of these programs. You may copy, modify, and distribute these sample
programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing
application programs conforming to IBM's application programming interfaces.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business
Machines Corporation in the United States, other countries, or both. These and other IBM trademarked
terms are marked on their first occurrence in this information with the appropriate symbol (® or ™),
indicating US registered or common law trademarks owned by IBM at the time this information was
published. Such trademarks may also be registered or common law trademarks in other countries. A current
list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml
The following terms are trademarks of the International Business Machines Corporation in the United States,
other countries, or both:
AIX
Cognos
DataStage
DB2
DB2 Connect
DB2 Extenders
GPFS
IBM
IBM PureData
Informix
InfoSphere
Intelligent Miner
Optim
POWER
PureData
pureQuery
pureScale
PureSystems
pureXML
Rational
Redbooks
Redbooks (logo)
solidDB
SPSS
System z
Tivoli
WebSphere
z/OS
Preface
Building on the business intelligence (BI) framework and capabilities that are
outlined in InfoSphere Warehouse: A Robust Infrastructure for Business
Intelligence, SG24-7813, this IBM Redbooks publication focuses on the new
business insight challenges that have arisen in the last few years and the new
technologies in IBM DB2 10 for Linux, UNIX, and Windows that provide
powerful analytic capabilities to meet those challenges.
This book is organized into two parts. The first part provides an overview of data
warehouse infrastructure and DB2 Warehouse, and outlines the planning and
design process for building your data warehouse. The second part covers the
major technologies that are available in DB2 10 for Linux, UNIX, and Windows.
We focus on functions that help you get the most value and performance from
your data warehouse. These technologies include database partitioning,
intrapartition parallelism, compression, multidimensional clustering, range (table)
partitioning, data movement utilities, database monitoring interfaces,
infrastructures for high availability, DB2 workload management, data mining, and
relational OLAP capabilities. A chapter on BLU Acceleration gives you all of the
details about this exciting DB2 10.5 innovation that simplifies and speeds up
reporting and analytics. Easy to set up and self-optimizing, BLU Acceleration
eliminates the need for indexes, aggregates, or time-consuming database tuning
to achieve top performance and storage efficiency. No SQL or schema changes
are required to take advantage of this breakthrough technology.
This book is primarily intended for use by IBM employees, IBM clients, and IBM
Business Partners.
Authors
This book was produced by a team of specialists from around the world working
at the International Technical Support Organization, San Jose Center.
Whei-Jen Chen is a Project Leader at the International Technical Support
Organization, San Jose Center. She has extensive experience in application
development, database design and modeling, and IBM DB2 system
administration. Whei-Jen is an IBM Certified Solutions Expert in Database
Administration and Application Development, and an IBM Certified IT Specialist.
Acknowledgements
Thanks to the following people for their contributions to this project:
Chris Eaton, Sam Lightstone, Berni Schiefer, Jason Shayer, Les King
IBM Toronto Laboratory, Canada
Thanks to the authors of the previous editions of this book.
Authors of InfoSphere Warehouse: A Robust Infrastructure for Business
Intelligence, SG24-7813, published in June 2010, were:
Chuck Ballard
Nicole Harris
Andrew Lawrence
Meridee Lowry
Andy Perkins
Sundari Voruganti
Comments welcome
Your comments are important to us!
We want our books to be as helpful as possible. Send us your comments about
this book or other IBM Redbooks publications in one of the following ways:
Use the online Contact us review Redbooks form found at:
ibm.com/redbooks
Send your comments in an email to:
[email protected]
Mail your comments to:
IBM Corporation, International Technical Support Organization
Dept. HYTD Mail Station P099
2455 South Road
Poughkeepsie, NY 12601-5400
Part 1. Overview
The increasing focus on speed of thought analytics as a critical success factor
for organizations underscores the urgent need for a cost-effective business
intelligence framework that is easy to implement, easy to manage, and provides
outstanding performance. The IBM DB2 10 for Linux, UNIX, and Windows data
warehouse offerings provide such a framework with all of these characteristics
and more.
The first phase of developing a data warehouse solution is to understand your
requirements and how they can be addressed by the warehouse offering. The
chapters in this section provide an overview of data warehouse infrastructure,
DB2 Warehouse, and the planning and design process.
Chapter 1, Gaining business insight with IBM DB2 on page 3 covers the
warehouse capabilities that you want when you build a business intelligence (BI)
architectural framework to meet the business challenges of today, including the
great opportunities around big data.
Chapter 2, Technical overview of IBM DB2 Warehouse on page 9 provides an
overview of the IBM DB2 for Linux, UNIX, and Windows data warehouse
offerings, including solution components, architecture, and product licensing.
Chapter 1. Gaining business insight with IBM DB2
From http://www.ibm.com/software/data/bigdata
A big data platform must support both traditional analytics (on structured data from
traditional sources) and a new style of exploratory analytics on unstructured
data. Data warehouses provide support for traditional analytics, running deep
analytic queries on huge volumes of structured data with massive parallel
processing.
The chapters that follow focus on two primary areas: embedded analytics and
performance optimization. In embedded analytics, this past year brought the
addition of IBM Cognos Business Intelligence Dynamic Cubes to the
architecture. Chapter 11, Providing the analytics on page 177 focuses on how
to take advantage of the in-memory acceleration that is provided by Dynamic
Cubes to provide speed of thought analytics over terabytes of enterprise data.
In performance optimization, there are two main areas of discussion. The first is
the introduction of BLU Acceleration to DB2. This new technology, introduced in
DB2 10.5 for Linux, UNIX, and Windows, is in-memory optimized columnar
processing that optimizes memory, processor, and I/O for analytic queries. The
second area of focus is the performance advancements for both scale up and
scale out options for standard row-based tables.
Data movement and transformation, modeling and design, and administration
and control are reviewed, but primarily from a focus of what is new with DB2 10.
For a complete examination of these components, see InfoSphere Warehouse: A
Robust Infrastructure for Business Intelligence, SG24-7813.
Chapter 2. Technical overview of IBM DB2 Warehouse
Note: Not all of the functions, features, and operators of IBM DB2 Advanced
Enterprise Server Edition on the Linux, UNIX, and Windows platforms are the
same as on the IBM System z platform. We try to point out the
differences where appropriate in each chapter.
Figure 2-1 Logical component groups in IBM DB2 Advanced Enterprise Server Edition
The highly scalable and robust DB2 database servers are the foundation of DB2
Warehouse. All the design and development is performed by using Design
Studio, which is an Eclipse-based development environment. The IBM
WebSphere Application Server is used as the runtime environment for
deployment and management of all DB2 Warehouse based applications.
2.1.2 Editions
IBM DB2 for Linux, UNIX, and Windows 10.5 offers various editions and features
that provide customers with choices based on business need. For data
warehouse usage, you can choose from Advanced Enterprise Server Edition,
Advanced Workgroup Server Edition, and Developer Edition.
Change Data Capture (CDC) replication between a DB2 for Linux, UNIX, and
Windows source and up to two DB2 for Linux, UNIX, and Windows 10.5
targets for HADR
IBM solidDB and solidDB Universal Cache
Cognos Business Intelligence
Warehouse model packs
A rich set of tools:
The six tools that are listed above can be used with any supported version of
DB2 Advanced Enterprise Server Edition. Similarly, any of these tools that were
included with prior versions of DB2 Advanced Enterprise Server Edition can be
used with DB2 Advanced Enterprise Server Edition 10.5, if supported.
DB2 Advanced Enterprise Server Edition can be deployed on Linux, UNIX, or
Windows servers of any size, from one processor to hundreds of processors, and
on both physical and virtual servers.
DB2 10.5 Advanced Enterprise Server Edition is available on a Processor Value
Unit (PVU), per Authorized User Single Install (AUSI), or per Terabyte charge
metric. Under the AUSI charge metric, you must acquire a minimum of 25 AUSI
licenses per 100 PVUs. Under the per Terabyte charge metric, you must meet
the following requirements:
Use the database for a data warehouse and run a data warehouse workload.
A data warehouse is a subject-oriented, integrated, and time-variant
collection of data that integrates data from multiple data sources for historical
ad hoc data reporting and querying. A data warehouse workload is a
workload that typically scans thousands or millions of rows in a single query
to support the analysis of data in a data warehouse.
Have user data from a single database spread across two or more active data
partitions or at least 75% of the user data in BLU Acceleration
column-organized tables, with most of the workload accessing the BLU
Acceleration column-organized tables.
Not use the DB2 for Linux, UNIX, and Windows built-in high availability
disaster recovery (HADR) or pureScale capabilities.
You can upgrade from DB2 Enterprise Server Edition to DB2 Advanced
Enterprise Server Edition by using a Processor Value Unit (PVU) based upgrade
part number.
DB2 10.5 Workgroup Server Edition is available on a per Limited Use Socket,
PVU, or Authorized User Single Install (AUSI) charge metric. Under the AUSI
metric, you must acquire a minimum of five AUSI licenses per installation. Under
all metrics, you are restricted to 16 processor cores and 128 GB of instance
memory and 15 TB of user data per database. In addition, under the Socket
metric, you are also restricted to no more than four processor sockets. These
restrictions are per physical or, where partitioned, virtual server.
IBM DB2 Express Server Edition for Linux, UNIX, and Windows
10.5
DB2 Express Server Edition is a full-function transactional data server, which
provides attractive entry-level pricing for the small and medium business (SMB)
market. It comes with simplified packaging and is easy to transparently install
within an application. With DB2 Express Server Edition, it is easy to then
upgrade to the other editions of DB2 10.5 because DB2 Express Server Edition
includes most of the same features, including security and HADR, as the more
scalable editions. DB2 Express Server Edition can be deployed in x64 server
environments and is restricted to eight processor cores and 64 GB of memory
per physical or, where partitioned, virtual server. If using multiple virtual servers
on a physical server, there is no limit on the cores or memory that are available to
the physical server if the processor and memory restrictions are observed by the
virtual servers running DB2. This makes DB2 Express Server Edition ideal for
consolidating multiple workloads onto a large physical server running DB2
Express Server Edition in multiple virtual servers.
DB2 10.5 Express Server Edition is available on a per Authorized User Single
Install (AUSI), PVU, Limited Use Virtual Server, or 12-month Fix Term licensing
model. If licensed under the AUSI metric, you must acquire a minimum of five
AUSI licenses per installation. DB2 Express Server Edition can also be licensed
on a yearly Fixed Term License pricing model. Under all charge metrics, you are
restricted to 15 TB of user data per database.
DB2 10.5 Express-C can be used for development and deployment at no charge.
It can be installed on x64-based physical or virtual systems and may use up to a
maximum of two processor cores and 16 GB of memory with no more than 15 TB
of user data per database. DB2 Express-C comes with online community-based
assistance. Users requiring more formal support, access to fix packs, or more
capabilities such as high availability, Homogeneous Federation, and replication,
can purchase an optional yearly subscription for DB2 Express Server Edition
(Fixed Term License) or upgrade to other DB2 editions. In addition, IBM Data
Studio is also available to facilitate solutions deployment and management.
IBM DB2 Developer Edition for Linux, UNIX, and Windows 10.5
IBM DB2 Developer Edition offers a package for a single application developer to
design, build, test, and prototype applications for deployment on any of the IBM
DB2 client or server platforms. This comprehensive developer offering includes
DB2 Workgroup Server Edition, DB2 Advanced Workgroup Server Edition, DB2
Enterprise Server Edition, DB2 Advanced Enterprise Server Edition, IBM DB2
Connect Enterprise Edition, and all of the built-in DB2 10.5 capabilities,
allowing you to build solutions that use the latest data server technologies.
The software in this package cannot be used for production systems. You must
acquire a separate Authorized User license for each unique person who is given
access to the program.
The categories and components of the architecture are shown in Figure 2-3, and
described in the remainder of this section.
Figure 2-3 IBM DB2 Advanced Enterprise Server Edition functional component
architecture
Common configuration
SQL warehousing
Cubing Services
Mining
Common configuration
This component allows you to create and manage database and system
resources, including driver definitions, log files, and notifications.
SQL warehousing
This component allows you to run and monitor data warehousing applications,
view deployment histories, and run statistics.
Cubing Services
Cubing Services works with business intelligence tools to provide OLAP access
to data directly from DB2 Warehouse. Cubing Services includes tools for
multidimensional modeling to design OLAP metadata (cubes), an optimization
advisor for recommending materialized query tables (MQTs) in DB2, and a cube
server for providing multidimensional access to the data. Each of these Cubing
Services components is integrated with the DB2 Warehouse user interfaces for
design (Design Studio) and administration and maintenance (Administration
Console).
The Cubing Services cube server processes multidimensional queries that are
expressed in the MDX query language and produces multidimensional results.
The cube servers fetch data from DB2 through SQL queries as needed to
respond to the MDX queries. The MQTs that are recommended by the
optimization advisor are used by the DB2 optimizer, which rewrites incoming
SQL queries and routes eligible queries to the appropriate MQT for faster query
performance. In addition to these performance enhancements, the cube server
includes multiple caching layers for further optimizing the performance of MDX
queries.
DB2 Warehouse provides a Cubing Services Client ODBO Provider to allow
access to cube data in Microsoft Excel.
Mining
The mining component allows you to run and monitor data mining applications.
DB2 Warehouse includes the following mining features:
companies that want to consolidate data marts, information silos, and Business
Analytics to deliver a single version of the truth to all users. It allows easy
creation of reports and quick analysis of data from the data warehouse.
Text Analysis
With DB2 Warehouse, you can create business insight from unstructured
information. You can extract information from text columns in your data
warehouse and then use the extracted information in reports, multidimensional
analysis, or as input for data mining.
Performance optimization
The DB2 Warehouse has several features that can be used to enhance the
performance of analytical applications by using data from the data warehouse.
You can use the db2batch Benchmark Tool in DB2 for Linux, UNIX, and
Windows to compare your query performance results after you create the
recommended summary tables against the benchmarked query performance.
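For example, a benchmark run might look like the following sketch; the database name and the query file are placeholders, and only the basic db2batch options are shown:

db2batch -d mydwh -f warehouse_queries.sql -r results.txt

Here, -d names the database, -f names the file that contains the SQL statements to time, and -r names the file that receives the timing results.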
In a data warehouse environment, where a huge amount of data exists, creating
cubes, materialized query tables (MQTs), and multidimensional clusters
(MDCs) can increase query performance.
For more information about these features, see the following websites:
For DB2 for Linux, UNIX, and Windows:
http://www-01.ibm.com/software/data/db2/linux-unix-windows/
For System z:
http://www-01.ibm.com/software/data/db2/zos/family/
Database servers
This category includes the DB2 database server with DPF support plus the
database functions to support Intelligent Miner, Cubing Services, and the
workload manager. The database server is supported on IBM AIX, various
Linux platforms, Windows Server 2003, and System z.
Application servers
The application server category includes the WebSphere Application Server,
Cognos Business Intelligence server components, and InfoSphere
Warehouse Administration Console server components.
Documentation
This category includes the PDF and online versions of the manuals and can
be installed with any of the other categories.
For more information about DB2 Warehouse Installation, see the DB2
information center that is found at the following website:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.qb
.server.doc/doc/r0025127.html
The DB2 Warehouse components can be installed on multiple machines in a
number of topologies. Three common topologies are shown in Figure 2-5.
One tier
In this topology, the InfoSphere Warehouse Client, InfoSphere Warehouse
Database Server, and the InfoSphere Warehouse Application Server are all
on one system. This topology is used only for development and testing
purposes and only on a Windows platform.
Two tier
In this topology, the InfoSphere Warehouse Database Server and the
InfoSphere Warehouse Application Server are on one system with
InfoSphere Warehouse Clients on separate systems. This topology can
suffice for a test system or for smaller installations. The database and
applications can be any of the supported Windows, Linux, or AIX platforms.
Three tier
In any large installation, the InfoSphere Warehouse Database Server, the
InfoSphere Warehouse Application Server, and the InfoSphere Warehouse
Clients should all be installed on separate servers. A DB2 client, at a
minimum, is required to connect to database servers. It is a preferred practice
that a DB2 server be installed for local access to the runtime metadata
databases. The application server is supported on AIX, Linux, and Windows
Server 2003.
Chapter 3. Warehouse development lifecycle
This chapter describes the planning and design process of building the data
warehouse.
The InfoSphere Warehouse Design Studio (Design Studio) provides a platform
and a set of integrated tools for developing your data warehouse. You can use
these tools to build, populate, and maintain tables and other structures for data
mining and Online Analytical Processing (OLAP) analysis.
Design Studio includes the following tools and features:
Integrated physical data modeling, which is based on InfoSphere Data
Architect
SQL Warehousing Tool (SQW) for data flow and control flow design
Data mining, exploration, and visualization tools
Tools for designing OLAP metadata, MQTs, and cube models
Integration points with IBM InfoSphere DataStage ETL systems
By integrating these tools, Design Studio offers a fast time-to-value and
managed cost for warehouse-based analytics. For an introduction to Design
Studio, see Chapter 3, InfoSphere Warehouse Design Studio, in InfoSphere
Warehouse: A Robust Infrastructure for Business Intelligence, SG24-7813.
Using Design Studio, in your physical data models, you can define the following
data elements for the target database:
The physical data models that you create and update in Design Studio are
implemented as entity relationship diagrams. The models are visually
represented by using either Information Engineering (IE) notation or Unified
Modeling Language (UML) notation. Before you begin your modeling work,
configure Design Studio to use the notation that you prefer. Configure Design
Studio by clicking Data → Diagram, which takes you to the configuration window
where you can select the notation that you want.
After you select the schemas that you want to use in the model, click Next to
be presented with a list of the database elements that you want to include.
7. Select all the database object types of interest and click Next.
The final window of the wizard provides options to create an overview
diagram and to infer implicit relationships. If you do not select these options,
the components may be added to your model later.
Figure 3-4 Dragging from the Database Explorer to create a physical data model
Creating a diagram
Design Studio provides several ways to add entity relationship diagrams to
projects.
If you choose to reverse engineer the data model, you can have the wizard
create an overview diagram for you. The wizard provides prompts that allow you
to specify what elements you want to include in the diagram.
You can still use diagrams in the data projects, even if you do not create the data
models through the reverse engineering approach. New diagrams can be
created from any Diagrams folder in the Project Explorer. Select the Diagrams
folder, right-click, and select New Overview Diagram. You are prompted to
select which elements from the current schema you want to include in the
diagram.
You can also create a blank diagram, rather than including existing schema
elements. To create a blank diagram, right-click the Diagrams folder in the
Project Explorer and select New Blank Diagram.
The shapes representing tables and elements can be moved to make them more
readable by highlighting either a rectangle or a line and dragging it to a new
position.
Items from the palette can be placed on the drawing area by clicking an
element in the palette to select it, moving the mouse to the drawing area, and clicking again.
Elements that are added to the diagram in this manner are provided with a
default name, but you can change the name to something more meaningful, as
shown in Figure 3-6 on page 39. When elements are added to the canvas, as in
Figure 3-6 on page 39, they are also added to the Data Project Explorer.
When the diagram contains tables, columns can be added directly from the visual
diagram. When you select a model element, action bars appear, providing
context-aware dialog boxes. The options that are available through the action bar
include the ability to add a key, column, index, or trigger. The palette can be used
to establish identifying and non-identifying relationships between the various
tables that make up each diagram. As you are using this approach, you can also
use the Properties view to further define the objects that you are creating.
3. Select the set of rules that are of interest and click Finish to start the analysis.
The results of the Model Analysis utility are shown in the Problems view. An
example of the output is shown in Figure 3-8. Double-click an item in the
Problems view to navigate to the relevant model objects in the Project Explorer.
Take corrective action and rerun the model analysis to verify that all problems
were corrected.
Correct any issues that the model analysis identifies until all errors are cleared.
After the model is validated, you can deploy it.
After you select the objects that you want to deploy, you are presented with a
summary of the DDL script, with options to save the DDL file or run the DDL
script on the server. These options are shown in Figure 3-10. If you choose to run
the DDL on the server, you are prompted to select a database connection or
create a database connection. The DDL script is saved in the SQL Scripts logical
folder in your data design project.
The DDL scripts that are in the SQL Scripts folder can be saved for later
execution, and they can also be modified before being run. To edit the DDL,
select the file in the SQL Scripts folder, right-click, and select Open With → SQL
Editor. This causes the file to be opened within a text editor. When you are
ready to run the DDL, select the file, right-click, and select Run SQL. You can
review the results in the Data Output view, which includes status information,
messages, parameters, and results.
As you work down the tree in the Structural Compare, differences are highlighted
in the lower view, which is called the Property Compare. This shows the
Properties view. An example of the output from the Compare Editor is shown in
Figure 3-11.
Synchronization of differences
As you view the differences between the objects you have compared, you can
use the Copy From buttons, which are on the right side of the toolbar that
separates the Structural Compare and Property Compare views. These buttons
allow you to implement changes from one model or object to another. You can
copy changes from left to right, or from right to left. As you use the Copy From
buttons, the information that displayed in the Structural Compare is updated to
reflect the change.
Another way to implement changes to bring your models in synchronization with
each other is to edit the values that are displayed in the Property Compare view.
Any changes that are made can be undone or redone by clicking Edit → Undo
or Edit → Redo.
After you have reviewed all of the differences and decided what must be done to bring
your models into sync with each other, you are ready to generate the delta DDL. This
DDL can be saved for later execution or run directly from Design Studio.
Impact analysis
Design Studio also provides a mechanism for performing impact analysis. It is
beneficial to understand the implications of changes to models before they are
implemented. The impact analysis utility shows all of the dependencies for the
selected object. The results are visually displayed, with a dependency diagram
and a Model Report view added to the Output pane of Design Studio.
The impact analysis discovery can be run selectively. To start the utility, highlight
an object, right-click, and select Analyze Impact. Select the appropriate options,
as shown in Figure 3-12.
It is a best-of-both-worlds architecture that can not only handle embedded analytics
of structured data, but also generate and leverage knowledge from unstructured
information. It is the business requirements that drive the decision to scale up
within a single system (using either BLU Acceleration or standard intrapartition
parallelism) versus scale out with a shared-nothing architecture across multiple
logical or physical nodes.
Part 2. Technologies
The next phase of developing the data warehouse is choosing the optimal
technology for the business requirements and data model. The chapters in this
section describe the strengths of the major technologies that are available in DB2
10 for Linux, UNIX, and Windows that you can leverage to optimize the value and
performance of your data warehouse.
Chapter 4, Column-organized data store with BLU Acceleration on page 53
shows how this remarkable innovation provides large order of magnitude
improvements in analytic workload performance, significant storage savings, and
reduced time to value.
Traditional row-based DB2 warehouse solutions are the subject of Chapter 5,
Row-based data store on page 71, which covers database partitioning,
intrapartition parallelism, and other technologies that you can use to build a
high-performance data warehouse environment, including compression,
multidimensional clustering, and range (table) partitioning.
Chapter 6, Data movement and transformation on page 91 gives you an
overview of some of the important DB2 data movement utilities and tools, and
describes the data movement and transformation capabilities in the DB2
Warehouse SQL Warehousing Tool (SQW).
Database monitoring, which includes all of the processes and tools that you use
to track the operational status of your database and the overall health of your
database management system, is the subject of Chapter 7, Monitoring on
page 105, which provides an overview of the DB2 monitoring interfaces and IBM
InfoSphere Optim Performance Manager Version 5.3.
Chapter 8, High availability on page 119 describes the high availability
characteristics of IBM PureData System for Operational Analytics, which
provides redundant hardware components and availability automation to
minimize or eliminate unplanned downtime.
DB2 workload management is the focus of Chapter 9, Workload management
on page 145, which gives you an overview of this important technology whose
purpose is to ensure that limited system resources are prioritized according to
the needs of your business. In a data warehouse environment, business
priorities and workload characteristics are continually changing, and resource
management must keep pace.
The data mining process and unstructured text analytics in the DB2 Warehouse
are covered in Chapter 10, Mining and unstructured text analytics on page 163,
which describes embedded data mining for the deployment of mining-based
business intelligence (BI) solutions.
Chapter 11, Providing the analytics on page 177 focuses on the relational
OLAP capabilities of the Dynamic Cubes feature of IBM Cognos Business
Intelligence V10.2, which enables high-speed query performance, interactive
analysis, and reporting over terabytes of data for many users.
Chapter 4. Column-organized data store with BLU Acceleration
This chapter introduces BLU Acceleration, which provides significant order of
magnitude benefits for analytic workload performance, storage savings, and time
to value.
BLU Acceleration is a new technology in DB2 for analytic queries. This set of
technologies encompasses CPU-, memory-, and I/O-optimization with unique
runtime handling and memory management and unique encoding for speed and
compression.
Figure 4-1 The relationship between TSN (logical row representation) and data pages in
BLU Acceleration
2. Determine the percentage of data that is accessed frequently (that is, by more
than 90% of the queries). For example, suppose that you have a table of
sales data that contains data that is collected 2000 - 2013, but only the most
recent three years are accessed by most queries. In this case, only data 2010
- 2013 is considered active, and all other data in that table is considered
inactive and is excluded from sizing calculations.
3. Determine the percentage of columns that are frequently accessed. With BLU
Acceleration, only the columns that are accessed in the query are considered
active data.
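As a hypothetical worked example (the figures are illustrative, not from the sizing guidelines themselves): if a warehouse holds 10 TB of raw data, roughly 30% of the rows fall in the actively queried date range, and typical queries touch about 25% of the columns, the raw active data estimate is approximately 10 TB x 0.30 x 0.25 = 750 GB.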
After you determine an estimate of the raw active data size, you can then
consider the system resources that are required regarding concurrency and
complexity of the workload, the amount of active data to be kept in memory in
buffer pools, and more. The minimum system requirements for BLU Acceleration
are eight cores and 64 GB of memory. As a preferred practice, maintain a
minimum ratio of 8 GB of memory per core.
For more information about sizing guidelines, see Best Practices: Optimizing
analytic workloads using DB2 10.5 with BLU Acceleration, found at:
https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/w
iki/0fc2f498-7b3e-4285-8881-2b6c0490ceb9/page/ecbdd0a5-58d2-4166-8cd5-5
f186c82c222/attachment/e66e86bf-4012-4554-9639-e3d406ac1ec9/media/DB2BP
_BLU_Acceleration_0913.pdf
db2set DB2_WORKLOAD=ANALYTICS
db2start
db2 create db mydb autoconfigure using mem_percent 100 apply db and dbm
Note: Do not specify CREATE DATABASE AUTOCONFIGURE NONE, which disables
the ability of the registry variable to influence the automatic configuration of
the database for analytic workloads.
The default setting for the AUTOCONFIGURE command is USING MEM_PERCENT
25. Specify the USING MEM_PERCENT option to allocate more memory to the
database and instance.
When DB2_WORKLOAD=ANALYTICS, the auto-configuration of the database sets the
default table organization method (dft_table_org) to COLUMN (that is, BLU
Acceleration) and enables intra-query parallelism. Database memory
configuration parameters that cannot use self-tuning memory manager and need
an explicit value (that is, sort-related parameters) are set to a value higher than
default and optimized for the hardware you are running on. Automatic space
reclamation is enabled. Finally, workload management is enabled by default to
ensure maximum efficiency and usage of the server. For more information, see
the Column-organized tables topic in the DB2 10.5 information center at the
following website:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/index.jsp
Note: The registry variable DB2_WORKLOAD is set at the instance level. It
influences all databases in that instance. If the instance is going to have
multiple databases, not all of which are BLU Accelerated, then set the registry
variable before creating the databases you want to optimize for analytic
workloads and then unset the registry variable before creating the other
databases.
For an existing database, you can apply the same analytics-oriented configuration by setting the registry variable and then running the AUTOCONFIGURE command:
db2set DB2_WORKLOAD=ANALYTICS
db2start
db2 connect to mydb
db2 autoconfigure apply db only
Table 4-1 provides an example of memory allocation for low and high
concurrency workloads where the WLM configuration allows 20 concurrent
high-cost queries.
Table 4-1 Example of memory distribution for a BLU-accelerated database based on
workload concurrency
Low concurrency (< 20 concurrent workloads):    40% SHEAPTHRES_SHR; SORTHEAP = SHEAPTHRES_SHR/5
High concurrency (>= 20 concurrent workloads):  50% SHEAPTHRES_SHR; SORTHEAP = SHEAPTHRES_SHR/20
TABLEORG COMPRESSION
-------- -----------
C
R        N

2 record(s) selected.
Note: The COMPRESSION column for column-organized tables is blank
because the data in such tables is always compressed.
Constraints
Primary key, informational foreign key, and any informational check constraints
should be defined after your initial data is loaded.
DB2 10.5 introduces an option for specifying not enforced (informational) primary
key constraints or unique constraints. The NOT ENFORCED clause does not ensure
uniqueness, so this option should be used only when you are certain that the
data is unique. Informational primary key constraints or unique constraints
require less storage and less time to create because no internal B-tree structure
is involved. However, informational primary key, foreign key, and check constraints
can be beneficial to the query optimizer. If your data is cleansed as part of a
rigorous ETL process, then consider the use of informational constraints.
Note: An enforced primary key constraint uses extra space because a unique
index is required to enforce the constraint. In addition, like any index, this
unique index increases the cost of insert, update, or delete operations. An
unenforced primary key constraint does not have these additional
considerations.
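As an illustration (the table and column names here are hypothetical), an informational primary key can be added after the initial data is loaded:

ALTER TABLE sales_col
  ADD CONSTRAINT pk_sales PRIMARY KEY (sale_id) NOT ENFORCED;

Because the constraint is not enforced, no unique index is built; the optimizer can still use the declared key to improve access plans, so this approach is appropriate only when the ETL process guarantees that sale_id values are unique.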
Figure 4-2 Sample of a synopsis table showing the relationship to its user table
A synopsis table is approximately 0.1% of the size of its user table with one row
in the synopsis table for every 1024 rows in the user table. To determine the
name of the synopsis table for a particular user table, query the
SYSCAT.TABLES catalog view, as shown in Example 4-7 on page 65.
Example 4-7 Querying the system catalog to identify the synopsis table for a user table
TABNAME                          TABLEORG
-------------------------------- --------
SALES_COL                        C
SYN130330165216275152_SALES_COL  C

2 record(s) selected.
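Output such as the above would come from a catalog query along the following lines (the table name is from the example; adjust the predicates for your own schema):

SELECT TABNAME, TABLEORG
FROM SYSCAT.TABLES
WHERE TABNAME = 'SALES_COL'
   OR TABNAME LIKE 'SYN%SALES_COL'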
To calculate the total size of a table, use the SQL statement that is shown in
Example 4-8.
Example 4-8 SQL statement to calculate table size
SELECT
SUBSTR(TABNAME,1,20) AS tabname,
COL_OBJECT_P_SIZE,
DATA_OBJECT_P_SIZE,
INDEX_OBJECT_P_SIZE,
(COL_OBJECT_P_SIZE + DATA_OBJECT_P_SIZE +
INDEX_OBJECT_P_SIZE) AS total_size
FROM SYSIBMADM.ADMINTABINFO
WHERE tabname LIKE 'mytable%'
WITH UR
The COL_OBJECT_P_SIZE element represents the user table size. The
DATA_OBJECT_P_SIZE element represents the size of the metadata, including
column compression dictionaries and synopsis tables. Both of these values must
be combined with the value of the INDEX_OBJECT_P_SIZE element to determine
the total size of the table.
To determine how much a table is compressed, look at the PCTPAGESSAVED value
in SYSCAT.TABLES. For a column-organized table, the PCTPAGESSAVED value is
based on an estimate of the number of data pages that are needed to store the
table in decompressed row organization, so the PCTPAGESSAVED value can be
used to compare compression ratios between row-organized and
column-organized tables.
The unique aspect to understanding the compression factor for
column-organized tables is the degree to which a particular column's values were
encoded as part of approximate Huffman encoding. The new PCTENCODED
column in the SYSCAT.COLUMNS catalog view represents the percentage of
values that are encoded as a result of compression for a column in a
column-organized table.
If the overall compression ratio for your column-organized table is too low, check
this statistic to see whether values in specific columns were left decompressed.
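A quick way to inspect the statistic is a catalog query such as the following (the table name is a placeholder):

SELECT COLNAME, PCTENCODED
FROM SYSCAT.COLUMNS
WHERE TABNAME = 'MYTABLE'
ORDER BY PCTENCODED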
Note: If you see many columns with a very low value (or even 0) for
PCTENCODED, the utility heap might have been too small when the column
compression dictionaries were created.
You might also see very low values for columns that were incrementally loaded
with data that was outside of the scope of the column compression dictionaries.
Those dictionaries are created during the initial load operation. Additional
page-level dictionaries might be created to take advantage of local data
clustering at the page level and to further compress the data.
Example 4-9 illustrates the steps to reduce the load time for a large table. The full
set of data is in all_data.csv, when a representative sample of data is in
subset_data.csv. The first command runs the ANALYZE phase only and
generates the column compression dictionary. The second command loads the
data and skips the ANALYZE phase because the dictionary is created.
Example 4-9 LOAD commands to generate a dictionary on a subset of data to reduce
load time
LOAD FROM subset_data.csv OF DEL REPLACE RESETDICTIONARYONLY INTO mytable;
LOAD FROM all_data.csv OF DEL INSERT INTO mytable;
The subset of data can be created by either copying a subset of the full file, or by
generating a sample of the original source to get a subset. To illustrate a
sampling method, Example 4-10 shows the commands to load a
column-organized table, building the column compression dictionary by sampling
an existing table.
Example 4-10 Loading a column-organized table with a column-compression dictionary
built from a sampling of cursor data
DECLARE c1 CURSOR FOR SELECT * from my_row_table TABLESAMPLE BERNOULLI (10);
DECLARE c2 CURSOR FOR SELECT * from my_row_table;
LOAD FROM c1 OF CURSOR REPLACE RESETDICTIONARYONLY INTO my_col_table;
LOAD FROM c2 OF CURSOR INSERT INTO my_col_table;
A new CTQ operator for query execution plans indicates the transition between
column-organized data processing (BLU Acceleration) and row-organized
processing. All operators that appear below the CTQ operator in an execution
plan are optimized for column-organized tables.
A good execution plan for column-organized tables has the majority of operators
below the CTQ operator and only a few rows flowing through the CTQ operator.
With this type of plan, as shown in Figure 4-3, DB2 is working directly on
encoded data and optimizing with BLU Acceleration.
Figure 4-3 An example of a good query execution plan for a column-organized table
Access methods and join enumeration are simplified with BLU Acceleration, but
join order is still dictated by cost, so cardinality estimation is still important.
69
Chapter 5. Row-based data store
5.1.1 Partition
A database partition is part of a database that consists of its own data, indexes,
configuration files, and transaction logs. A database partition is sometimes called
a node or a database node. It can be either logical or physical. Logical partitions
are on the same physical server and can take advantage of the symmetric
multiprocessor (SMP) architecture. These partitions use common memory,
processors, disk controllers, and disks. Physical partitions consist of two or more
physical servers, and the database is partitioned across these servers. Each
partition has its own memory, processors, disk controllers, and disks.
User interaction occurs through one database partition, which is known as the
coordinator node for that user. The coordinator runs on the database partition to
which the application is connected. Any database partition can be used as a
coordinator node.
create table mytable (col1 int, col2 int, col3 int, col4 char(10))
in mytbls1
distribute by hash (col1, col2, col3)
Figure 5-1 on page 75 shows another example of a DPF environment with four
partitions (0, 1, 2, and 3). Two tables, customer and store_sales, are created.
Both have a partitioning key named cust_id. The value in cust_id is hashed to
generate a partition number, and the corresponding row is stored in the relevant
partition.
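The following DDL sketch shows how such an environment could be defined; the partition group, table space, and column definitions are illustrative assumptions rather than the book's exact statements:

CREATE DATABASE PARTITION GROUP pg_sales ON DBPARTITIONNUMS (0 TO 3);

CREATE TABLESPACE ts_sales IN DATABASE PARTITION GROUP pg_sales
  MANAGED BY AUTOMATIC STORAGE;

-- Both tables hash on cust_id, so rows for the same customer land on the
-- same partition, which enables collocated joins between them.
CREATE TABLE customer (
  cust_id   INT NOT NULL,
  cust_name VARCHAR(40)
) IN ts_sales DISTRIBUTE BY HASH (cust_id);

CREATE TABLE store_sales (
  sale_id BIGINT NOT NULL,
  cust_id INT NOT NULL,
  amount  DECIMAL(12,2)
) IN ts_sales DISTRIBUTE BY HASH (cust_id);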
74
75
76
77
Partitioning keys should not include columns that are updated frequently.
Whenever a partitioning key value is updated, DB2 might need to drop the
row and reinsert it into a different partition, as determined by hashing the new
partitioning key value.
Unless a table is not critical or you have no idea what a good partitioning key
choice is, you should not let the partitioning key be chosen by default. For the
record, the default partitioning key is the first column of the primary key, and if
there is none, it is the first column that has an eligible data type.
Ensure that you understand collocation and the different join types. For more
information, see the DB2 Partitioning and Clustering Guide, SC27-2453.
Collocated tables must fulfill the following requirements:
Be in the same database partition group (one that is not being
redistributed). (During redistribution, tables in the database partition group
might be using different partitioning maps; they are not collocated.)
Have partitioning keys with the same number of columns.
Have the corresponding columns of the partitioning key be partition
compatible.
Be in a single partition database partition group, if not in the same
database partition group, that is defined on the same partition.
5.3.1 Compression
The DB2 Storage Optimization Feature enables you to transparently compress
data on disk to decrease disk space and storage infrastructure requirements.
Because disk storage systems are often the most expensive components of a
database solution, even a small reduction in the storage subsystem can result in
substantial cost savings for the entire database solution. This is especially
important for a data warehouse solution, which typically has a huge volume of
data.
Row compression
DB2 uses a variant of the Lempel-Ziv algorithm to apply compression to each
row of a table. Log records are also compressed. Savings are extended to
backup disk space, racks, cables, floor space, and other disk subsystem
peripherals.
Because compressed rows are smaller, not only do you need fewer disks, but
your overall system performance might be improved. By storing compressed
data on disk, fewer I/O operations need to be performed to retrieve or store the
same amount of data. Therefore, for disk I/O-bound workloads, the query
processing time can be noticeably improved.
DB2 stores the compressed data on both disk and memory, reducing the amount
of memory that is consumed and freeing it up for other database or system
operations.
To enable row compression in DB2, when tables are created, you can specify the
COMPRESS YES STATIC option. It can also be enabled for an existing table by using
the ALTER TABLE command. If you alter the existing table to enable compression,
a REORG must be run to build the dictionary and compress the existing data in the
table. For more details, see the DB2 10.5 information center at the following
website:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.we
lcome.doc/doc/welcome.html
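For example (a sketch; the table names and columns are illustrative), static row
compression can be enabled for a new table at creation time, or for an existing table
followed by a dictionary-building REORG:

create table sales_hist (
   sale_date  date,
   cust_id    integer,
   amount     decimal(12,2)
) compress yes static

-- Enable compression on an existing table, then build the dictionary
-- and compress the rows that are already in the table:
alter table sales_fact compress yes static
reorg table sales_fact resetdictionary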
XML compression
The verbose nature of XML implies that XML fragments and documents typically
use much disk space. DB2 stores XML data in a parsed hierarchical format,
replacing tag names (for example, employee) with integer shorthand. Repeated
occurrences of the same tags are assigned the same shorthand. Storing text-rich
tags using integer shorthand reduces space consumption and assists with higher
performance when querying data. Moreover, like data row compression, this XML tag
compression is transparent to users and applications.
Index compression
Indexes, including indexes on declared or created temporary tables, can be
compressed in DB2 to reduce storage costs. This is especially useful for large
data warehouse environments.
By default, index compression is enabled for compressed tables, and disabled
for uncompressed tables. You can override this default behavior by using the
COMPRESS YES option of the CREATE INDEX statement. When working with existing
indexes, use the ALTER INDEX statement to enable or disable index compression;
you must then perform an index reorganization to rebuild the index.
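For example (a sketch with illustrative table and index names), index compression can
be requested explicitly for a new index, or enabled on an existing index and applied
with an index reorganization:

create index ix_sales_date on sales_fact (sale_date) compress yes

alter index ix_sales_cust compress yes
reorg indexes all for table sales_fact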
Processor usage might increase slightly as a result of the processing that is
required for index compression or decompression. If this is not acceptable, you
can disable index compression for new or existing indexes.
Each block contains only rows that have the same unique combination of
dimension values. The set of blocks that have the same unique combination of
dimension values is called a cell. A cell might consist of one or more blocks in
the MDC table. As shown in Figure 5-5, a cell can be described as a combination
of unique values of year, nation, and color, such as 1997, Canada, and Blue. All
records that have 1997 as year, Canada as nation, and blue as the color are
stored in the same extents of that table.
With MDC tables, clustering is ensured. If an existing block satisfies the unique
combination of dimension values, the row is inserted into that block, assuming
there is sufficient space. If there is insufficient space in the existing blocks, or if
no block exists with the unique combination of dimension values, a new block is
created.
Example 5-4 shows the DDL for creating an MDC table. The ORGANIZE keyword is
used to define an MDC table.
Example 5-4 MDC example
Starting with DB2 10.5, you must specify the ORGANIZE BY ROW USING DIMENSION
keywords.
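As a sketch (the table and column definitions are illustrative, and the ORGANIZE BY
DIMENSIONS form used before DB2 10.5 is shown; see the note above for the DB2 10.5
keywords), an MDC table clustered on the year, nation, and color dimensions from
Figure 5-5 might be defined as follows:

create table sales_mdc (
   year    integer,
   nation  varchar(25),
   color   varchar(10),
   amount  decimal(12,2)
) in mytbls1
organize by dimensions (year, nation, color)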
MDC introduced a new type of index that is called a block index. When you
create an MDC table, the following two kinds of block indexes are created
automatically:
Dimension block index
A dimension block index per dimension contains pointers to each occupied
block for that dimension.
Composite block index
A composite block index contains all columns that are involved in all
dimensions that are specified for the table, as shown in Figure 5-6. The
composite block index is used to maintain clustering during insert and update
activities. It can also be used for query processing.
With MDC, data is organized on the disk based on dimensions. Queries can skip
parts of the table space that the optimizer has determined do not apply. When
data is inserted, it is automatically put in the correct place so that you no longer
need to reorganize the data. In addition, because one index entry represents the
entire data page (versus having one index entry per row with traditional indexes),
MDCs reduce the overall size of the space that is required for the indexes. This
reduces disk requirements and produces faster queries because of the reduced
amount of I/O needed for a query. MDCs also improve delete performance
because DB2 now has to drop only a few data pages. Inserts are also faster
because DB2 rarely has to update an index page (only the data page).
Performance consideration
The performance of an MDC table depends upon the correct choice of
dimensions and the block (extent) size of the table space for the data and
application workload. A poor choice of dimensions and extent size can result in
low disk storage use and poor query access and load utility performance.
In choosing dimensions, you must identify the queries in the existing or planned
workloads that can benefit from multidimensional clustering. For existing
applications, you can use the DB2 Design Advisor to analyze the workload and
recommend dimensions for the MDC tables. For a new table or database, you
need a good understanding of the expected workload. Typical dimension
candidates include columns that are used in range, equality, or IN-list predicates;
columns that are used to roll data in or out of the table; and columns that are
referenced in GROUP BY or ORDER BY clauses.
Extent size is related to the concept of cell density, which is the percentage of
space that is occupied by rows in a cell. Because an extent contains rows only
with the same unique combination of dimension values, significant disk space
can be wasted if dimension cardinalities are high (for example, when there is a
dimension with unique values that would result in an extent per row).
Defining small extent sizes can increase cell density, but increasing the number
of extents per cell can result in more I/O operations, and potentially poorer
performance when retrieving rows from this cell. However, unless the number of
extents per cell is excessive, performance should be acceptable. If every cell
occupies more than one extent, it can be considered excessive.
At times, because of data skew, some cells occupy many extents while others occupy
only a small percentage of an extent. Such cases signal a need for a better choice of
dimension keys. Currently, the only way to determine the number of extents per
cell is for the DBA to run appropriate SQL queries or the db2dart command.
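As a sketch of such a query (assuming the sales_mdc table from the earlier example),
counting the rows in each unique combination of dimension values shows the smallest
cells; comparing these counts with the number of rows that fit into one extent gives a
rough estimate of wasted space and of extents per cell:

select year, nation, color, count(*) as rows_in_cell
from sales_mdc
group by year, nation, color
order by rows_in_cell asc
fetch first 20 rows only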
Performance might be improved if the number of blocks can be reduced by
consolidation. Unless the number of extents per cell is excessive, this situation is
not considered an issue.
Multidimensional clustering is a unique capability that is targeted for large
database environments, providing an elegant method for flexible, continuous,
and automatic clustering of data along multiple dimensions. The result is
improvement in the performance of queries, and a reduction in the impact of data
maintenance operations (such as reorganization) and index maintenance
operations during insert, update, and delete operations.
Fast roll-in/roll-out
DB2 allows data partitions to be easily added or removed from the table
without having to take the database offline. This ability can be useful in a data
warehouse environment where there is the need to load or delete data to run
decision-support queries. For example, a typical insurance data warehouse
might have three years of claims history. As each month is loaded and rolled
in to the data warehouse, the oldest month can be archived and removed
(rolled out) from the active table. This method of rolling out data partitions is
also more efficient, as it does not need to log delete operations, which is the
case when deleting specific data ranges.
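A sketch of this roll-in and roll-out pattern for a range-partitioned claims table
follows (the table, partition, and boundary values are illustrative):

-- Roll in: attach a loaded staging table as a new data partition,
-- then validate the attached data with SET INTEGRITY.
alter table claims
   attach partition m2013_12
   starting from ('2013-12-01') ending at ('2013-12-31')
   from claims_stage_dec2013
set integrity for claims allow write access immediate checked

-- Roll out: detach the oldest partition into an archive table;
-- the rows are not deleted one by one, so no delete operations are logged.
alter table claims
   detach partition m2010_12 into claims_archive_dec2010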
Chapter 6.
6.1 Introduction
Moving data across different databases is a common task in IT departments.
Moving data mainly refers to the task of copying data from a source system to a
target system, not necessarily removing it at the source location. You can process
the data remotely (for example, export it to a file), which is a valid solution for
infrequent processing of reasonable amounts of data, or you can move the data to the
most suitable place for the overall processing (for example, from one system to
another). For example, it is common to copy
production data (or at least a subset of it) to a test system.
Typical DB2 data movement tasks involve three steps:
1. Exporting the data from the source database into a temporary data exchange
file in binary or text format
2. Moving the generated file between systems
3. Importing or loading the data from the file into the target database.
This section describes some of the most common methods and how they are
used in moving data into a table and database.
6.1.1 Load
Using the load utility to insert data is the most common practice in data
warehouse environments. The load utility is used to insert the data into a table of
a database after extracting that data from the same or a different database by
using other utilities, such as export. The export and import utilities use SQL to
extract and add data. The load utility is often faster than the import utility because
it bypasses the DB2 SQL engine and builds physical database pages and puts
them directly into the table space. These utilities can be used to move data
between databases on any operating systems and versions. You can also use
them to add data in files to a database.
Although the LOAD command is often much faster than the import utility and can
load much data to a table quickly, there are some aspects of the LOAD command
that you must consider.
If a table that you load has referential constraints to other tables, then you must
run the SET INTEGRITY command after loading your tables. This command
verifies that the referential and all other constraints are valid. If there are
constraints and you do not run the SET INTEGRITY command, then the loaded
table and its dependent (child) tables might not be accessible.
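For example (a sketch; the file, message file, and table names are illustrative), a
delimited file can be loaded and the referential constraints revalidated afterward:

load from /staging/claims.del of del
   messages /tmp/claims_load.msg
   insert into claims_fact

set integrity for claims_fact immediate checked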
You should also understand the COPY YES/NO parameters of the LOAD command.
Depending on which you choose, the table space that contains the table that you
load can be placed into backup pending status, and no access to any of the
tables in that table space is allowed until a backup is done. This utility allows you
to copy data between different versions of DB2 and databases on different
operating systems.
When you specify the FROM CURSOR option, the load utility directly references the
result set of a SQL query as the source of a data load operation, thus bypassing
the need to produce a temporary data exchange file. This way, LOAD FROM CURSOR
is a fast and easy way to move data between different table spaces or different
databases. LOAD FROM CURSOR operations can be run on the command line and
from within an application or a stored procedure by using the ADMIN_CMD stored
procedure.
For examples of LOAD FROM CURSOR, see Fast and easy data movement using
DB2's LOAD FROM CURSOR feature, found at:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0901fechn
er/
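A minimal sketch of this technique, run from the DB2 command line processor while
connected to the target database (the cursor, column, and table names are illustrative):

declare src_curs cursor for
   select cust_id, sale_date, amount from prod_schema.sales_fact
load from src_curs of cursor insert into test_schema.sales_fact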
6.1.2 Ingest
The ingest utility, which was introduced in DB2 10.1, incorporates the best
features of both the load and import utilities and adds data import features. The
ingest utility meets the demanding requirements of the extract, load, and
transform (ELT) operations of data warehouse maintenance. It supports the need
to have data that is current and also available 24x7 to make mission-critical
decisions in the business environment of today.
Like the load utility, the ingest utility is fast, although not as fast as the load utility.
Like the import utility, the ingest utility checks constraints and fires triggers so
there is no need to use the SET INTEGRITY facility after running the ingest utility. In
addition, the table being loaded is available during loading. The ingest utility also
can do continuous loading of data, so if you have data arriving constantly, you
can set up the utility to continuously add data to your table, as described in 6.1.3,
Continuous data ingest on page 96.
Furthermore, you can use ingest to update rows, merge data, and delete rows
that match records in the input file.
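A minimal sketch of an INGEST command that inserts a delimited file into a table (the
file and table names are illustrative); variations of the same command perform
UPDATE, MERGE, or DELETE operations against matching rows:

ingest from file /staging/sales_feed.del
   format delimited
   insert into sales_fact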
With its rich set of functions, the ingest utility can be a key component of your
ETL system:
In near real-time data warehouses
In databases, specifically data marts and data warehouses, where data must
be updated on a near real-time basis, users cannot wait for the next load
window to get fresh new data. Additionally, they cannot afford to have offline
LOAD jobs running during business hours.
The INGEST command can load data continuously when it is called from a
scheduled script. Because the Ingest utility issues SQL statements just like
any other DB2 application, changes are logged and its statements can be
recovered if necessary.
Furthermore, the ingest utility commits data frequently, and the commit
interval can be configured by time or number of rows. Rejected rows can be
discarded or placed into a file or table.
Ingest architecture
Internally, the ingest utility runs as a multi-threaded process. There are
three types of threads (transporters, formatters, and flushers) that are
responsible for each phase of the ingest operation, as shown in Figure 6-1.
6.2.1 Architecture
SQW architecture separates the tools that are needed for the development
environment and the runtime environment, as shown in Figure 6-2.
Figure 6-2 SQW architecture: the Design Studio development environment and the production runtime environment (web-based Administration Console, SQW, DB2, and DataStage Server)
The development tool is the DB2 Warehouse Design Studio, where the data
movement and transformation routines are developed and tested.
After testing in the development environment against development databases,
the routines are deployed to the runtime environment, where they can be
scheduled for execution in the production databases. The web-based
administration console is used to manage the runtime environment, including
deployment, scheduling, execution, and monitoring functions.
Support was added for the Liberty profile for WebSphere Application Server
Network Deployment component in InfoSphere Warehouse Version 10.1 Fix
Pack 2 for the runtime environment (Administration Console). The Liberty profile
is embedded in InfoSphere Warehouse. At installation time, users can choose either a
stand-alone instance of WebSphere Application Server or the embedded version.
As of SQW for DB2 Warehouse 10.5, SQW has support for BLU Acceleration
(load, import, and ingest into column organized tables) and DB2 Oracle
Compatible mode.
Chapter 7.
Monitoring
As described in Chapter 9, Workload management on page 145, DB2 workload
management has four clearly defined stages, the last of which is monitoring to
ensure that your data server is being used efficiently. Database monitoring
includes all of the processes and tools that you use to examine the operational
status of your database, which is a key activity to ensure the continued health
and excellent performance of your database management system. The DB2
monitoring infrastructure collects information from the database manager, active
databases, and connected applications to help you analyze the performance of
specific applications, SQL queries, and indexes. You can then use that data to
troubleshoot poor system performance or to gauge the efficacy of your tuning
efforts.
Monitoring with table functions: These table functions return data from monitor elements that
report on most database operations at a specific point. The monitoring table
functions use a high-speed monitoring infrastructure that was introduced in
DB2 for Linux, UNIX, and Windows 9.7. Before Version 9.7, the DB2
monitoring infrastructure included snapshot monitoring routines. Although
these routines are still available, this technology is no longer being enhanced
in the DB2 product, and you are encouraged to use the monitoring table
functions wherever possible (a sample query follows this list).
Monitoring with event monitors:
You can configure event monitors to capture information about specific
database events (such as deadlocks) over time. Event monitors generate
output in different formats, but all of them can write event data to regular
tables.
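As a sketch of the table-function approach (the choice of metrics is illustrative), the
following query uses the MON_GET_WORKLOAD table function to show where processing
activity is concentrated; the -2 argument requests data from all members:

SELECT VARCHAR(WORKLOAD_NAME, 30) AS WORKLOAD_NAME,
       SUM(TOTAL_CPU_TIME) AS TOTAL_CPU_TIME,
       SUM(ROWS_READ)      AS ROWS_READ,
       SUM(ROWS_RETURNED)  AS ROWS_RETURNED
FROM TABLE(MON_GET_WORKLOAD(NULL, -2)) AS T
GROUP BY WORKLOAD_NAME
ORDER BY TOTAL_CPU_TIME DESC;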
This chapter provides an overview of the DB2 monitoring interfaces. It also
introduces you to IBM InfoSphere Optim Performance Manager Version 5.3,
which provides a convenient web interface to help you identify and analyze
database performance problems.
Text elements track text values. For example, the stmt_text monitor element
contains the text of an SQL statement.
Timestamp elements track the time at which an event occurred. For example,
the conn_time monitor element tracks the time at which a connection was
made to the database.
Monitor elements can be further classified by the type of work that they monitor:
Request monitor elements measure the amount of work that is needed to
process different types of application requests. A request is a directive to a
database agent to perform some work that consumes database resources.
For example, an application request is a directive that is issued directly by an
external application.
Activity monitor elements are a subset of request monitor elements. Activity
monitor elements measure the amount of work that is needed to run SQL
statement sections, including locking, sorting, and the processing of
row-organized or column-organized data.
Data object monitor elements return information about operations that are
performed on specific data objects, including buffer pools, containers,
indexes, tables, and table spaces.
Time-spent monitor elements track how time is spent in the system.
Wait-time monitor elements track the amount of time during which the
database manager waits before it continues processing. For example, the
database manager might spend time waiting for locks on objects to be
released; this time is tracked by the lock_wait_time monitor element.
Component processing time monitor elements track the amount of time
that is spent processing data within a specific logical component of the
database. For example, the total_commit_proc_time monitor element
tracks the amount of time that is spent committing transactions.
Component elapsed time monitor elements track the total amount of
elapsed time that is spent within a specific logical component of the
database. This time includes both processing time and wait time. For
example, the total_commit_time monitor element tracks the total amount
of time that is spent performing commit processing on the database
server.
For more information about the DB2 monitor elements, see the DB2 for Linux,
UNIX, and Windows 10.5 information center at the following web page:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/index.jsp
In this example, request metrics are collected for all agents that run in the
SYSDEFAULTUSERCLASS, but not for agents that run outside of the
SYSDEFAULTUSERCLASS.
Now suppose that you specify a collection level of EXTENDED for activity
monitor elements at the database level, but you do not want to collect activity
metrics for the default user workload. You can achieve this by running the
following command and SQL statement:
UPDATE DB CFG FOR SAMPLE USING MON_ACT_METRICS EXTENDED;
ALTER WORKLOAD SYSDEFAULTUSERWORKLOAD COLLECT ACTIVITY METRICS NONE;
In this example, activity metrics are collected for all agents that run in the
database, including those that run as part of the
SYSDEFAULTUSERWORKLOAD. The effective collection level is determined
by the broader collection level (EXTENDED) that was specified at the
database level by setting the mon_act_metrics configuration parameter.
Application requests
Activities
Operations on data objects
Locks
System memory use
Routines
For more information about the DB2 table functions that return data from monitor
elements, see the DB2 10.5 information center at the following web page:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/index.jsp
MON_GET_TABLE
MON_GET_TABLESPACE
For a database that is created in DB2 10.5, these table functions collect data
object monitoring information by default.
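For example (a sketch; the choice of columns is illustrative), MON_GET_TABLE can be
queried to find the most heavily read tables across all members:

SELECT VARCHAR(TABSCHEMA, 12) AS TABSCHEMA,
       VARCHAR(TABNAME, 24)   AS TABNAME,
       SUM(ROWS_READ)         AS ROWS_READ,
       SUM(ROWS_INSERTED)     AS ROWS_INSERTED
FROM TABLE(MON_GET_TABLE('', '', -2)) AS T
GROUP BY TABSCHEMA, TABNAME
ORDER BY ROWS_READ DESC
FETCH FIRST 10 ROWS ONLY;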
Tip: After you run the application for which you want to collect data, it is a
good idea to deactivate the event monitor to avoid the collection of unneeded
data.
If the data is written to a relational table, you can use SQL to access the data, but
if the data is written to an unformatted event table, you must run the db2evmonfmt
command or call the EVMON_FORMAT_UE_TO_TABLES procedure before you can
examine the event data.
Monitor elements that complement one another are grouped in useful sets called
logical data groups. For example, the event_activity logical data group includes
monitor elements like appl_id (application ID) and uow_id (unit of work ID), which
are often returned together, as, for example, in the following query:
SELECT VARCHAR(A.APPL_NAME, 15) AS APPL_NAME,
VARCHAR(A.TPMON_CLIENT_APP, 20) AS CLIENT_APP_NAME,
VARCHAR(A.APPL_ID, 30) AS APPL_ID,
A.ACTIVITY_ID,
A.UOW_ID,
VARCHAR(S.STMT_TEXT, 300) AS STMT_TEXT
FROM ACTIVITY_DB2ACTIVITIES AS A,
ACTIVITYSTMT_DB2ACTIVITIES AS S
WHERE A.APPL_ID = S.APPL_ID AND
A.ACTIVITY_ID = S.ACTIVITY_ID AND
A.UOW_ID = S.UOW_ID;
Important: DB2ACTIVITIES is the name of the event monitor that you created
earlier in this section. The standard table reference
ACTIVITY_DB2ACTIVITIES or ACTIVITYSTMT_DB2ACTIVITIES consists of
the name of the logical data group concatenated with the underscore (_)
character concatenated with the name of the event monitor.
The DB2 data server associates a default set of logical data groups with each
event monitor, and the monitor therefore collects a useful set of elements for you
automatically.
For a list of the logical data groups and the monitor elements that they can return
during event monitoring, see the DB2 10.5 information center at the following
web page:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/index.jsp?topic=%2Fcom.i
bm.db2.luw.admin.mon.doc%2Fdoc%2Fr0007595.html
Tip: Prune data that you no longer need from event monitor tables for
frequently used event monitors. If you must prune event monitor output
regularly, consider using an unformatted event table to record event monitor
output because unformatted event tables can be pruned automatically after
data is transferred to relational tables.
Overview
Buffer Pool and I/O
Connection
SQL Statements
Extended Insight
You can use these performance metrics, which include table and column access
statistics, I/O statistics, and extended insight statistics, to verify that your
workloads with column-organized tables behave as expected.
IBM InfoSphere Optim Query Workload Tuner (OQWT) Version 4.1 can help you
tune individual SQL statements or a complete SQL workload. OQWT includes
the Workload Table Organization Advisor, which examines the tables that are
referenced by the statements in a query workload. The advisor makes
recommendations about which tables are good candidates for conversion from
row to column organization. The advisor also estimates the performance gain
that can be realized if the recommended tables are converted from row to
column organization. For more information about the Workload Table
Organization Advisor, see the following web page:
http://pic.dhe.ibm.com/infocenter/dstudio/v4r1/index.jsp?topic=%2Fcom.i
bm.datatools.qrytune.workloadtunedb2luw.doc%2Ftopics%2Fgenrecswtoa.html
For a summary of the new features and enhancements in Version 5.x of IBM
InfoSphere Optim Performance Manager for DB2 for Linux, UNIX, and Windows
and its modifications and fix packs, see the following web page:
http://www-01.ibm.com/support/docview.wss?uid=swg27023197
For more information about IBM InfoSphere Optim Performance Manager, see
the Version 5.3 information center at the following web page:
http://pic.dhe.ibm.com/infocenter/perfmgmt/v5r2/index.jsp
Chapter 8.
High availability
IBM PureData System for Operational Analytics provides redundant hardware
components and availability automation to minimize or eliminate the impact of
hardware and software failures.
This chapter describes the high availability characteristics present in the IBM
PureData System for Operational Analytics.
In addition to the numerous redundancies that are part of its design, IBM
PureData System for Operational Analytics also uses IBM Tivoli System
Automation for Multiplatforms (SA MP) cluster management software to provide
high availability capabilities at the software and hardware level. Tivoli SA MP
integration provides the capability for automated recovery to take specific actions
when a detectable resource failure occurs. This action can be as simple as
restarting a failed component (for example, a software process) in place, or it can
automatically fail over components from the active host to the standby host. A resource failure
can include:
Network failure
Host failure
DB2 instance or partition failure
The IBM PureData System for Operational Analytics HA design protects the
Management Host and the core warehouse, which consists of the administration
host on the foundation node, and data hosts on the data nodes. The HA
configuration consists of two peer domains: one for the management host and
one for the core warehouse. This configuration allows the HA resources on the
management host to be managed independently of the HA resources for the core
warehouse.
Furthermore, if a subsequent active node fails, its processing is failed over to the
standby node in the HA group, as shown in Figure 8-3 on page 123.
The HA resources that support the five database partitions run on the active
administration host. If a detectable failure occurs, the resources fail over to the
standby administration host and the standby administration host continues to
process the workload. When the standby administration host takes over the
workload processing, it assumes the role of the active administration host. When
the failed administration host is brought back online, it assumes the role of the
standby administration host.
Figure 8-5 shows the administration hosts after the resources are failed over and
the failed administration host is brought back online as the standby
administration host.
Figure 8-5 Roving HA group for the administration hosts after a failover
Figure 8-6 shows the initial roving HA group configuration, which contains three
active data hosts and one standby data host.
If a detectable failure occurs on any one active data host, the resources of the
failed data host fail over to the standby data host, and the standby data host
assumes the role of the active data host in the roving HA group and starts
processing new workloads. When the failed data host is brought back online, it
integrates and assumes the role of the standby data host for the HA group. The
resources do not need to be manually failed back to the original data host.
Figure 8-7 on page 127 shows the roving HA group after the resources on a data
host have failed over and the failed data host assumes the role of the standby
data host.
Figure 8-7 Roving HA group for the data hosts after a failover
The resource model for the management host includes separate resource
groups for the database performance monitor and the warehouse tools. This
separation of resource groups allows for independent availability management if
there is a failure, such that only the affected resources are failed over as
required.
If a detectable failure occurs on the management host and affects both the
database performance monitor and the warehouse tools, the resources fail over
to the standby management host, which allows the database performance
monitor to continue to monitor the performance of the core warehouse database
and allows the warehouse tools to process new jobs and requests. However, the
system console event monitoring capability is not available on the standby
management host.
Figure 8-9 shows the HA configuration of management host and the standby
management host after a failure of the management host.
When the failed management host is repaired and brought back online, the
resources must be manually moved back (that is, fall back) from the standby
management host to the management host at the next available maintenance
window or when a short planned service outage is acceptable. Until the
resources are failed back, the system console event monitoring capability is not
available and the management host is running unprotected against another
failure event.
If a failure of a resource that affects only the warehouse tools occurs, the HA
resources for the warehouse tools fail over to the standby management host.
The HA resources for the performance monitor continue to run, unaffected, on
the management host, as shown in Figure 8-10.
After either such event, manually fall back the resources from the standby
management host to the management host at the next scheduled maintenance
window or when a short planned service outage is acceptable.
The sample output that is provided in this section is based on a core warehouse
with the following two roving high availability (HA) groups:
The first roving HA group includes the administration host (bcu001) and the
standby administration host (bcu002).
The second roving HA group includes two data hosts (bcu003 and bcu004)
and one standby data host (bcu005).
To monitor the status of the core warehouse HA configuration, complete the
following steps:
1. Log in to the system console, and determine whether there are any alert
events with the type Resource Failover and a severity of Critical. Events
with these values indicate that a failover of the core warehouse resources
was attempted and did not complete. You must identify the cause of the
failover and correct the problem immediately to restore the operation of the
core warehouse. If there are critical resource failover events, complete the
following steps:
a. As the root user or core warehouse instance owner, run the lssam
command on any core warehouse host and identify the resources with
Failed Offline states.
b. Determine the cause of the Failed Offline state for each resource and
correct the problem.
c. To clear the Failed Offline states, run the hareset command as the root
user on any core warehouse host.
2. If the system console displays events with the type Resource Failover and a
severity of Informational, this indicates that a failover occurred in the core
warehouse. The core warehouse remains operational after the failover, but to
restore the HA configuration to a good state, you must address any Failed
Offline resource states on the failed host so that it can be reintegrated into
the system as a standby host. While the failed host remains offline, if there is
a subsequent failure in the same roving HA group, there is an outage of the
core warehouse. Complete the following steps:
a. As the root user or core warehouse instance owner, run the lssam
command on any core warehouse host and identify the resources with
Failed Offline states.
b. Determine the cause of the Failed Offline state for each resource and
correct the problem.
c. To clear the Failed Offline states, run the hareset command as the root
user on any core warehouse host.
3. If there are no Resource Failover events, you can use the hals command to
determine the status of the HA configuration for the core warehouse. On any
core warehouse host, run the hals -core command.
If all resources display a Normal HA status and an Online operational
state (OPSTATE) similar to the sample output that is shown in Figure 8-11, it
indicates that your core warehouse HA configuration is running in a
good state.
To move the database partition resources to the standby host for maintenance,
complete the following steps:
1. Verify that there are no DB2 jobs running.
2. As the core warehouse instance owner (bcuaix), connect to the core
warehouse database and terminate all connections to the database by
running the following command:
db2 force applications all
3. To verify that there are no resources on the target host that are in a Failed or
Stuck state, run the lssam command on any host in the core warehouse peer
domain.
If a resource on the target host is in one of these states, the failover
procedure fails. To reset the Failed or Stuck resources, run the hareset
command on any host.
4. Fail the active host over to the standby host by running the hafailover
command.
For example, to fail over the resources from the bcu004 host to the standby
host in its HA group, run the following command:
hafailover bcu004
Figure 8-17 shows the hals output that shows that the resources for database
partitions 5 - 12 that were previously running on the bcu004 host are now
running on the bcu003 host.
Note: When you complete the maintenance on the node and reintegrate it
into the system, you do not have to fail back the database partition
resources. The node is reintegrated as the standby node.
online, and the database performance monitor and the warehouse tools continue
to run.
To move resources to the standby management host in a planned fashion,
complete the following steps:
1. Determine the host name of the host where the database performance
monitor (DPM) is running. Run the following command:
hals
Identify the row in the output for the DPM component. The host name where
the database performance monitor is running appears in the CURRENT
column. In the sample hals output that is shown in Figure 8-18, the database
performance monitor is running on the bcu01 host.
2. Fail over the resources for the database performance monitor by running the
following command:
hafailover management_host DPM
management_host represents the host name of the host where the database
performance monitor is running.
3. Determine the host name of the host where the warehouse tools are running.
Run the following command:
hals
Identify the row in the output for the WASAPP component. The host name
where the warehouse tools is running appears in the CURRENT column. In
the sample hals output that is shown in Figure 8-19 on page 143, the
warehouse tools are running on the bcu01 host.
4. Fail over the resources for the warehouse tools by running the following
commands.
a. If there are cube servers that are not managed as highly available
resources, stop the cube servers.
b. Fail over the resources for the warehouse tools by running the following
command:
hafailover management_host APP
management_host represents the host name of the host where the
warehouse tools are running.
c. If there are cube servers that are not managed as highly available
resources, start the cube servers.
Chapter 9.
Workload management
The purpose of workload management is to ensure that limited system resources
are prioritized according to the needs of the business. Work that is designated as
being of greatest importance to the business is given the highest priority access
to system resources. In a data warehouse environment, this means that users
running online queries can expect predictable and acceptable response times at
the possible expense of long running batch jobs.
It is possible to monitor and manage the work at every stage of processing, from
the application, the network, the database management system (DBMS), the
operating system, and the storage subsystem. This chapter provides an
overview of the workload management capabilities of the database.
Workload management is a continuous process of design, deployment,
monitoring, and refinement. In a data warehouse environment, business
priorities and the volume and mix of work are constantly changing. These
changes must also be reflected in the management of resources.
DB2 workload management (WLM) is the primary tool that is available for
managing an IBM InfoSphere Warehouse database workload. WLM was
introduced with DB2 for Linux, UNIX, and Windows 9.5 and enhanced in every
subsequent release of DB2 for Linux, UNIX, and Windows.
A workload
A service class
A threshold
A work action set
DB2 workload management has four clearly defined stages, which are outlined in
the following sections.
Table functions also provide information about the work that is running on the
system, for example, at the service class and workload level. The following sample
output shows a single workload occurrence running in the default user service class:
SUPERCLASS_NAME     SUBCLASS_NAME      MEMB COORDMEMB APPHNDL WORKLOAD_NAME          WLO_ID
------------------- ------------------ ---- --------- ------- ---------------------- ------
SYSDEFAULTUSERCLASS SYSDEFAULTSUBCLASS    0         0     383 SYSDEFAULTUSERWORKLOAD      1

  1 record(s) selected.
Event monitors capture detailed activity information (per workload, work class, or
service class, or when a threshold is exceeded) and aggregate activity statistics
for historical analysis.
DB2 WLM uses the following event monitors:
Activity event monitors capture information about individual activities in a
workload, work class, or service class that violated a threshold.
Threshold violations event monitors capture information when a threshold is
exceeded. They identify the threshold, the activity that was the source of the
exception, and what action was taken in response to the violation.
Statistics event monitors serve as a low-impact alternative to capturing
detailed activity information by collecting aggregate data (for example, the
number of activities that are completed, or the average execution time).
Note: Unused WLM event monitors have no performance impact. Create event
monitors in advance so that individual workloads, service classes, and work actions
can be altered to capture events when needed.
To store the events in DB2 tables, see the sample DDL in
~/sqllib/misc/wlmevmon.ddl.
You can find examples of Perl scripts for the analysis of workload
management in the following files:
~/sqllib/samples/perl/wlmhist.pl
wlmhistrep.pl
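As a sketch (the event monitor name is illustrative), an activities event monitor that
writes to tables can be created in advance and enabled only while its data is needed:

CREATE EVENT MONITOR wlm_act_evmon FOR ACTIVITIES WRITE TO TABLE;
SET EVENT MONITOR wlm_act_evmon STATE 1;
ALTER WORKLOAD SYSDEFAULTUSERWORKLOAD COLLECT ACTIVITY DATA WITH DETAILS;
-- ... run the workload that is to be analyzed ...
ALTER WORKLOAD SYSDEFAULTUSERWORKLOAD COLLECT ACTIVITY DATA NONE;
SET EVENT MONITOR wlm_act_evmon STATE 0;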
Histogram templates
Service classes
Thresholds
Work action sets
Work class sets
Workloads
9.3.1 Workloads
A workload is an object that is used to identify incoming work based on its source
so that the work can be managed correctly. The workload attributes are assigned
when the workload establishes a database connection. Examples of workload
identifiers include the application name, system authorization ID, client user ID,
and connection address.
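As a sketch (the application, workload, and service class names are illustrative), work
from a particular reporting application can be identified and mapped to its own
service class as follows:

CREATE SERVICE CLASS reports_sc;

CREATE WORKLOAD reports_wl
   APPLNAME ('cognosrs.exe')
   SERVICE CLASS reports_sc;

GRANT USAGE ON WORKLOAD reports_wl TO PUBLIC;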
9.3.4 Thresholds
Thresholds help you maintain stability in the system. You create threshold
objects to trap work that behaves abnormally. Abnormal behavior can be
identified predictively (before the work begins running, based on the projected
impact) or reactively (as the work runs and uses resources). An example of work
that can be controlled with thresholds is a query that consumes large amounts of
processor time at the expense of all other work running on the system. Such a
query can be controlled either before it begins running, based on estimated cost,
or after it has begun running and while it is using more than the permitted amount
of resources. When a threshold is violated, one or more of the following actions can
be taken (an example threshold definition follows this list):
Collect data.
Stop execution.
Continue execution.
Queue activities.
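For example (a sketch; the threshold name and the 30-minute limit are illustrative), a
reactive threshold can collect details about, and then stop, any activity that runs
longer than 30 minutes:

CREATE THRESHOLD stop_long_queries
   FOR DATABASE ACTIVITIES
   ENFORCEMENT DATABASE
   WHEN ACTIVITYTOTALTIME > 30 MINUTES
   COLLECT ACTIVITY DATA WITH DETAILS
   STOP EXECUTION;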
You can optionally define a histogram template (by using the CREATE HISTOGRAM
TEMPLATE statement) and specify a high bin value. All other bin values are
automatically defined as exponentially increasing values that approach the high
bin value. A measurement unit, which depends on the context in which the
histogram template is used, is assigned to the histogram when a service
subclass, workload, or work action is created or altered. The new histogram
template overrides the default histogram template with a new high bin value.
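A sketch of this technique (the template name, bin value, and service class names are
illustrative; for an activity lifetime histogram, the unit is milliseconds):

CREATE HISTOGRAM TEMPLATE long_lifetime HIGH BIN VALUE 3000000;

ALTER SERVICE CLASS reports_sub UNDER reports_sc
   ACTIVITY LIFETIME HISTOGRAM TEMPLATE long_lifetime;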
For a complete description of these stages, download a copy of this paper from
the following web page:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/
Wc9a068d7f6a6_4434_aece_0d297ea80ab1/page/Implementing%20DB2%20workload
%20management%20in%20a%20data%20warehouse
If your system includes a data warehouse, implementing a Stage 2 (tuned)
workload management configuration at a minimum is recommended. Consider
evolving to a Stage 3 implementation when specific performance or strategic
requirements must be addressed.
By default, the dispatcher can manage CPU resources only by way of CPU limit
settings. To enable the dispatcher to manage CPU resources by using both CPU
shares and CPU limits, set the wlm_disp_cpu_shares database manager
configuration parameter to YES. You can set and adjust CPU shares and CPU
limits by using the CREATE SERVICE CLASS and ALTER SERVICE CLASS statements.
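As a sketch (the service class names and the share and limit values are illustrative),
enabling the dispatcher with CPU shares and assigning entitlements might look like the
following commands and statements:

UPDATE DBM CFG USING WLM_DISPATCHER YES;
UPDATE DBM CFG USING WLM_DISP_CPU_SHARES YES;

ALTER SERVICE CLASS reports_sc SOFT CPU SHARES 6000;
ALTER SERVICE CLASS etl_sc HARD CPU SHARES 2000;
ALTER SERVICE CLASS etl_sc CPU LIMIT 40;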
A service class with hard CPU shares that are assigned cannot exceed its CPU
resource entitlement to consume any unused CPU resources that become
available on the host or logical partition (LPAR) if work is still running in
competing service superclasses or running in competing service subclasses
within the same service superclass. If competing workloads are not present,
service classes with hard CPU shares are able to claim unused CPU resources.
The hard CPU shares setting is most effective when you want to prevent work
running in a service class from interrupting more important work running on the
host or LPAR. Assign hard CPU shares to service classes running complex or
intensive queries that might otherwise degrade the performance of higher priority
work because of contention for limited resources.
A service class with soft CPU shares that are assigned can exceed its CPU
resource entitlement to consume any unused CPU resources that become
available on the host or LPAR. If two or more service classes have soft shares,
unused CPU resources become available, and there is enough CPU resource
demand from each service class to consume the spare capacity, allocation of the
CPU resources is done proportionally according to the relative share of each
active service class.
The soft CPU shares setting is most effective for high-priority work that should be
able to temporarily claim any spare CPU resource that becomes available, or for
workloads that are expected to consume little resource beyond their immediate
CPU requirements.
Configure a CPU limit to enforce a fixed maximum CPU resource entitlement for
work in a service class. If a CPU limit is set for all service classes, you can
reserve a portion of the CPU resource to perform work regardless of any other
work running on the instance. The CPU resource allocation of any service class
is computed from the shares of that service class relative to the shares of all
other service classes within the instance.
Although CPU limits can be configured at either the service superclass level or
the subclass level, by applying CPU limits to your superclasses and CPU shares
to your subclasses, you can use the CPU limits to control the absolute CPU
resource entitlement of each superclass, and the CPU shares to control the
relative CPU resource entitlements of service subclasses running within those
superclasses.
The workload management dispatcher always uses the most restrictive CPU
limits or CPU shares assignments when allocating CPU resources to service
classes. For example, if a service class reaches its CPU limit before it fully uses
its shares-based CPU resource entitlement, the dispatcher uses the CPU limit.
Before enabling the workload management dispatcher for the first time, monitor
your workloads to determine the relative CPU resources that they consume. This
information can help you to make decisions about service class creation, CPU
shares assignment, and whether to use CPU limits.
Table functions and monitor elements are provided to help you monitor the
performance of the workload management dispatcher. After analyzing the
collected data, you can adjust the dispatcher concurrency level or redistribute
CPU entitlements by adjusting service class CPU shares and CPU limits to tune
the dispatcher performance.
For a complete description of the DB2 workload management dispatcher, see
the DB2 for Linux, UNIX, and Windows 10.1 information center at the following
web page:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r1/index.jsp
For recommendations about tuning the default concurrency threshold value and
the default work class timeron range when the system appears to be
underutilized or overutilized, see the DB2 for Linux, UNIX, and Windows 10.1
information center. For a comprehensive set of recommendations that apply to
monitoring both system utilization and workload characteristics, download a copy
of Implementing DB2 Workload Management in a Data Warehouse from the
following web page:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/
Wc9a068d7f6a6_4434_aece_0d297ea80ab1/page/Implementing%20DB2%20workload
%20management%20in%20a%20data%20warehouse
Chapter 10.
10.1 Overview
Essentially, data mining discovers patterns and relationships that are hidden in
your data. It is part of a larger process that is called knowledge discovery,
specifically, the step in which advanced statistical analysis and modeling
techniques are applied to the data to discover useful patterns and relationships.
The knowledge discovery process as a whole is essential for successful data
mining because it describes the steps you must take to ensure meaningful
results.
A pattern that is interesting (according to a user-imposed interest measure) and
certain enough (again according to the user's criteria) is called knowledge. The
output of a program that monitors the set of facts in a database and produces
patterns in this sense is discovered knowledge.
Thus, the goal of data mining and analytics is to solve business problems by
using the technology (mining algorithms) to support problem solving. Use the
mining algorithms to detect and model potentially valuable but hidden patterns in
your data. These techniques characterize relationships that are not expressed
explicitly.
You can try to solve this problem using Text Analytics by taking the following
actions:
Find the top issues in the (unstructured) call center log and perform causal
analysis by relating these issues to structured data that is associated with the
log entry.
Identify and address the top types of problems that are encountered by the
most profitable customers to reduce loyal customer churn.
Analyze customer ratings and customer surveys about your products and your
competitors' products.
Analyze call center notes to find interest in other (cross-sell) or new products
(growth).
Detect early warning signals in problem reports to avoid costly recalls and
lawsuits.
Analyze claims data and patient records to identify insurance fraud (health
insurance).
So, you might end up taking some of the following actions:
Create simple reports: To identify the top 10 customer satisfaction problems
that are recorded in a text field of a customer survey.
Create multidimensional reports: In retail, you might want to derive the new
Return reasons OLAP dimension from text analysis and combine it with
existing dimensions, such as time, geography, or product.
Generate additional input fields for data mining to improve the predictive
power of data mining models: You might want to use symptoms that are
included in patient records to improve the predictive power of a data mining
model. That model can predict the patients who need treatment based only
on available structured data, such as patient age or blood pressure and
related conditions.
Discovery methods are data mining techniques that find patterns that exist in the
data, but without any prior knowledge of what those patterns might be. The
objective is to discover the relationships that are inherent in the data.
There are three discovery methods:
Clustering involves grouping data records into segments by how similar they
are based on the attributes of interest. Clustering can be used, for example,
to find distinct profiles of clients with similar behavioral and demographic
attributes to create a client segmentation model.
Associations are a type of link analysis that finds links or associations among
the data records of single transactions. A common usage of the Associations
method is for market basket analysis, which finds what items tend to be
purchased together in a single market basket, such as chips and soda.
Sequences are a type of link analysis (sequential patterns) that finds
associations among data records across sequential transactions. A store can
use sequential patterns to analyze purchases over time and, at checkout
time, use that model to print customized discount coupons for the customers
to use on their next visit.
Predictive methods are data mining techniques that can help predict categorical
or numeric values.
There are three predictive methods:
Classification is used to predict values that fall into predefined buckets or
categories. For example, a classification model can predict whether a particular
treatment cures, harms, or has no effect on a particular patient.
Regression is used to predict a numerical value on a continuous scale, for
example, predicting how much each customer spends in a year. If the range
of values is 0 - 1, this becomes a probability of an event happening, such as
the likelihood of a customer leaving.
Time series forecasting predicts the values of a numerical data field for a
future period. You can use a time series model to predict future events based
on known past events. For example, forecasting can be used to create stock
level forecasts to reduce warehouse costs.
For more information about the discovery and predictive mining methods that are
available in DB2 Warehouse, see Dynamic Warehousing: Data Mining Made
Easy, SG24-7418.
3. Perform the data mining process: Build models to run the wanted technique
with the appropriate parameters. Modeling is an interactive and iterative
process as the initial results are reviewed, model parameters are adjusted to
produce a better model, and any additional data preparation is performed.
4. Interpret and evaluate results: Visualize the model to help interpret the result,
and assess the model quality and determine whether the model fulfills its
business purpose. Improvements to the input data, model parameters, and
modeling technique can be made to obtain a model that meets the objective.
5. Deploy the solution: This is the final and most important step in the data mining
process because how and where the results are deployed is crucial to
realizing the maximum value from the data mining. This step leans heavily on
the usage of scoring techniques and scoring results (apply a data mining
model to generate a prediction for each record, depending on the type of
model).
Figure 10-2 shows the shared and integrated environment, which makes data mining
accessible to many users with various needs and skill sets, not just a small group of
experts, and eliminates the need to move data to a separate analytic environment.
Figure 10-2 Data mining and text analytics in a DB2 Warehouse environment
Modeling
The modeling component is a DB2 SQL application programming interface (API)
that is implemented as a DB2 extender. Modeling is accessed graphically
through the Design Studio to build data mining models from information in DB2
databases.
All six discovery and predictive methods of data mining that are described in 10.4.1,
Data mining techniques on page 168 have multiple algorithms to provide more
flexibility in model development.
These IBM DB2 Extenders provide a set of SQL stored procedures and
user-defined functions to build a model and store it in a DB2 table. These
procedures and functions are collectively referred to as easy mining procedures.
As the model is set up through graphical wizards in Design Studio, the easy
mining procedures use the wizard inputs to automatically create mining tasks
that specify the type of model to be built, the parameter settings, the data
location, and data settings (for example, which columns to use in the model), and
to call the appropriate mining kernel in DB2.
A DB2 table containing columns representing the behaviors and other attributes
of the records (such as clients, stores, accounts, and machines), including the
response or outcome, if any, is used as the data source for building (training) the
model and, for predictive mining, validating (testing) the model.
The new model is stored in Predictive Model Markup Language (PMML) format
in a DB2 table where it is accessible for deployment.
Visualization
The visualization component in DB2 Warehouse is a Java application that uses
SQL to call and graphically display PMML models, enabling the analyst to assess
a model's quality, decide how to improve the model by adjusting model content or
parameters, and interpret the final model results for business value.
Visualization has visualizers for all six mining methods in the modeling. The
visualizers are tailored to each mining method and provide various graphical and
tabular information for model quality assessment and interpretation in light of the
business problem.
Visualization can also display PMML models that are generated by other tools if
the models contain appropriate visualization extensions, such as the quality
information or distribution statistics that are produced by modeling. However,
models from other tools typically do not contain much of this extended information
and do not present as well in visualization.
Scoring
Like modeling, the scoring component is implemented as a DB2 extender. It
enables application programs by using the SQL API to apply PMML models to
large databases, subsets of databases, or single records. Because the focus of
the PMML standard is interoperability for scoring, scoring supports all the model
types that are created by modeling, and selected model types that are generated
by other applications (for example, SAS and IBM SPSS) that support PMML
models.
Scoring also supports the radial basis function (RBF) prediction technique, which
is not yet part of PMML. RBF models can be expressed in XML format, enabling
them to be used with scoring.
Scoring includes scoring JavaBeans, which enables the scoring of a single data
record in a Java application. This capability can be used to integrate scoring into
client-facing or e-business applications, for example.
A PMML model is either created and automatically stored by modeling or created
by another application and imported into DB2 by using the scoring component's
SQL import function. Accessed through Design Studio, scoring applies the
PMML model to new data. The score that is assigned to each record depends on
the type of mining model. For example, with a clustering model, the score is the
best-fit cluster for a given record. The results are written to a new DB2 table or
view, where they are available to other applications for reporting or further
analysis, such as OLAP.
The middle part shows the Text Analytics transformation run time, which is
embedded in the database server and runs SQL-based transformations in the
DB2 database.
Before you can run the transformation flows, you must design them. The design
time components on the left part of Figure 10-4 consist of two parts.
A workbench to configure the text analysis engines (or in UIMA terminology,
annotators). For example, if you have a rule-based annotator, you must
specify the rules, depending on your business problem and text corpus. If you
have a list- or dictionary-based annotator, you must be able to specify the list
of words to be used. UIMA Annotators or Analysis Engines are UIMA-based
components that are used to extract entities such as names, sentiments, or
relationships.
The second part, after configuring the analysis engine, is used to define the
transformation flows themselves. Specify the input table to be analyzed and
the configured analysis engine to be used, and map the analysis results to
columns and tables in the database.
After you convert the text into structured data, you can use existing and well-known
reporting and analysis tools from IBM (for example, Cognos).
UIMA text analytics can be considered a form of ELT for text: it extracts structured
information from the text. The extracted information is stored in a relational
database (DB2 Data Warehouse) and can be used to describe OLAP metadata
(Cubing Services). Reporting or OLAP tools, such as Cognos, can use that
metadata to create reports on the combination of the pre-existing structured data
and the structured data that is obtained from the text analysis.
Combining the results of text analysis on unstructured textual information with
structured data in data mining allows you to analyze structured and unstructured
information together to create new insight. Data mining algorithms can
also, by using the text analysis results, improve the predictive power of the data
mining models.
Step-by-step details to create data mining and text analysis flows using Design
Studio in DB2 Warehouse are described in InfoSphere Warehouse: A Robust
Infrastructure for Business Intelligence, SG24-7813.
Chapter 11.
In addition to the relationship between the fact table and the dimensions,
there are data relationships that are present within a dimension. Attributes in
a dimension represent how data is summarized or aggregated, and they are
organized in a hierarchy.
Figure 11-2 shows a typical time hierarchy. Sales Amount in a specific store, in
a specific region, can be aggregated at the Year, Month, and Day levels.
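A minimal SQL sketch of that roll-up, using hypothetical mart.sales_fact and mart.time_dim tables, aggregates Sales Amount along the Year, Month, and Day hierarchy in a single statement.

-- GROUP BY ROLLUP produces subtotals at the Day, Month, and Year levels,
-- plus a grand total, in one pass over the fact table.
SELECT d.year, d.month, d.day,
       SUM(f.sales_amount) AS sales_amount
FROM mart.sales_fact AS f
JOIN mart.time_dim AS d
     ON d.time_key = f.time_key
GROUP BY ROLLUP (d.year, d.month, d.day)
ORDER BY d.year, d.month, d.day;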
Figure 11-3 shows how Cognos Dynamic Cubes is tightly integrated into the
Cognos BI stack, and how its data can be displayed through any of the Cognos
interfaces. This approach allows existing customers to integrate this technology
into their application environment without affecting existing users, who are
already familiar with interfaces such as Report Studio, Business Workspace, and
Business Workspace Advanced.
Cognos Dynamic Cubes uses the database and data cache for scalability, and
also uses a combination of caching, optimized aggregates (in-memory and
in-database), and optimized SQL to achieve performance. The Cognos Dynamic
Cubes solution uses multi-pass SQL that is optimized for the relational database,
minimizing the movement of data between the relational database and the
Cognos Dynamic Cubes engine. It is aggregate-aware, and able to identify and
use both in-memory and in-database aggregates to achieve optimal
performance. It optimizes aggregates (in-memory and in-database) by using
workload-specific analysis.
This solution can achieve low latency over large data volumes, such as billions of
rows or more of fact data and millions of members in a dimension.
Two modeling tools are available in IBM Cognos BI V10.2: Framework Manager and
Cube Designer. Framework Manager is a metadata modeling tool that is used for
designing dimensionally modeled relational (DMR) metadata. Dynamic cubes can be
modeled by using the Cube Designer only.
You can import cube metadata from an IBM InfoSphere Warehouse Cubing
Services model. IBM Cognos Cube Designer creates a project with a separate
dynamic cube for each cube that is contained in the imported model.
IBM Cognos Cube Designer provides dynamic cube design and modeling
capability. The Administration Console is used to deploy and manage the cube
data. The IBM Cognos Dynamic Query Mode (DQM) server maintains the cube
data, and the studio applications use that data in reporting environments. In
addition, tools such as Dynamic Query Analyzer and its Aggregate Advisor are
used to analyze and optimize the cubes as necessary.
Cube Designer supports the following tasks:
Importing relational metadata to use as the basis for dynamic cube design
Designing dynamic, aggregate, and virtual cubes
Setting cube-level security for hierarchies and measures
Publishing the dynamic cube
11.3.4 Optimization
If the performance of the reports does not meet your expectations, you can
perform optimization to achieve the desired results. By using Cognos
Administration, you can adjust various performance parameters.
Aggregate Advisor
Aggregate Advisor is a tool that is available with IBM Cognos Dynamic Query
Analyzer that can analyze the underlying model in a dynamic cube data source
and recommend which aggregates to create. These aggregates can be created
both in-database and in-memory.
Aggregate Advisor can also reference a workload log file, which helps it suggest
aggregate tables (in-database or in-memory) that correspond directly to the
reports that are contained in the log file. Before you run Aggregate Advisor,
the dynamic cube must be published to the content store and start successfully,
and reports and analyses must run and return correct results.
Using aggregates can improve the performance of queries by providing data that
is aggregated at levels higher than the grain of the fact table.
Database aggregates
Database aggregates are tables of pre-computed data that can improve query
performance by reducing the number of database rows that are processed for a
query.
Cognos Dynamic Cubes brings the aggregate routing logic up into its query
engine, where the multidimensional OLAP context is preserved and Cognos
Dynamic Cubes can better determine whether to route to aggregates.
Cognos Dynamic Cubes can also use aggregate tables that the database
optimizer does not know about (for example, MQTs that are not enabled for
queries, or regular tables that have aggregated data) or the database optimizer
might not route to because of complex OLAP-style SQL. After a database
aggregate table is modeled as an aggregate cube, Cognos Dynamic Cubes can
select data directly from that database aggregate table.
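For example (hypothetical schema), an in-database aggregate can be implemented as a DB2 materialized query table. Even if it is created with DISABLE QUERY OPTIMIZATION, so that the DB2 optimizer never routes queries to it, it can still be modeled in Cube Designer as an aggregate cube and used directly by Cognos Dynamic Cubes.

-- Monthly sales by store, precomputed as a refresh-deferred MQT.
CREATE TABLE mart.sales_by_month_store AS (
    SELECT d.year, d.month, s.store_id,
           SUM(f.sales_amount) AS sales_amount,
           COUNT(*) AS tx_count
    FROM mart.sales_fact AS f
    JOIN mart.time_dim  AS d ON d.time_key  = f.time_key
    JOIN mart.store_dim AS s ON s.store_key = f.store_key
    GROUP BY d.year, d.month, s.store_id
)
DATA INITIALLY DEFERRED
REFRESH DEFERRED
DISABLE QUERY OPTIMIZATION;

-- Populate (or later refresh) the aggregate during the ETL window.
REFRESH TABLE mart.sales_by_month_store;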
In-memory aggregates
In addition to ensured routing to aggregates in the database, a major
performance feature of Cognos Dynamic Cubes is its support of aggregates that
are in memory. In-memory aggregates provide precomputed data and can
improve query performance because the aggregated data is stored in memory in
the aggregate cache of the dynamic cube. This situation avoids the impact of
transferring data from the data warehouse to the BI server.
In-memory aggregates are aggregate tables that can be created in memory by
the IBM Cognos Business Intelligence server every time the cube is started or
the data cache is refreshed. The definition of these aggregates is stored in the
content store.
In-database aggregates are built during a predefined ETL processing window.
In-memory aggregates are built during cube-start. The Aggregate Advisor is
used to recommend additional in-database aggregate tables and a set of
in-memory aggregates.
Cognos Dynamic Cubes uses five caches, each with a separate purpose.
Data cache
The data cache contains the result of queries that are posed by the MDX engine
to a dynamic cube for data. When users run the system and run queries,
individual data items are loaded in memory. Because security is applied on top of
the dynamic cube, the reuse of data is maximized, allowing as many users and
reports as possible to benefit from previous queries.
Member cache
Members of each hierarchy are retrieved by running an SQL statement that
retrieves all of the attributes for all levels within the hierarchy, including
multilingual property values, and are stored in memory. The parent-child and
sibling relationships that are inherent in the data are used to construct the
hierarchical structure in memory. When you start a cube, all members are loaded
into memory. This approach ensures a fast experience as users explore the
metadata tree, and it also helps the server maximize its ability to run the
most efficient queries possible.
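The exact SQL that the Cognos BI server generates to load the member cache is internal to the product; the following is only a rough sketch, against a hypothetical mart.time_dim table, of the kind of single-pass retrieval of level keys and captions from which the members of a hierarchy are built in memory.

-- Retrieve every level of the time hierarchy (keys and captions) in one pass;
-- parent-child relationships are then derived from these values in memory.
SELECT DISTINCT
       d.year,          -- Year level
       d.month,         -- Month level key
       d.month_name,    -- Month caption (could carry multilingual properties)
       d.day            -- Day level
FROM mart.time_dim AS d
ORDER BY d.year, d.month, d.day;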
Expression cache
To accelerate query planning, the engine saves expressions that are generated
when queries are run. This way, query planning can be accelerated for future
queries when expressions can be reused.
As with the result set cache, the intermediate results that are stored in the
expression cache are security-aware and are flushed when the data cache is
refreshed.
Because the expression cache is stored in memory and is so closely associated
with the data cache, the expression cache is stored within the space that is
allotted to the data cache.
Aggregate cache
In-memory aggregates contain measure values that are aggregated by the
members at the level of one or more hierarchies within the cube. These values
can be used to provide values at the same level of aggregation.
The aggregates that are recommended by Aggregate Advisor, and their size, are
estimated at the time of the advisor run. This estimate grows as the size of the
data warehouse grows, at a scale relative to the size of the member cache. The
amount of memory that is allocated to this cache is determined by the Maximum
amount of memory to use for aggregate cache cube property (set from IBM Cognos
Administration). This property determines the maximum size that is allocated to
this cache. An aggregate that cannot fit into the cache is discarded.
Any action that refreshes the member cache or the data cache, or a cube start or
restart, initiates the loading of the aggregate cache. Cube metrics that are
available in Cognos Administration can be used to monitor when aggregates
complete loading and to monitor the aggregate cache hit rate (along with the hit
rates of the result set cache, data cache, and database aggregate tables).
A request for data from the MDX engine can be satisfied by data that exists in the
data or aggregate cache. However, if data is not present in either cache to satisfy
the request for data, or if only part of it can be retrieved from the cache, dynamic
cubes obtain the data either by aggregating data in the aggregate cache or by
retrieving data from the underlying relational database. In either case, the data
obtained is stored in the data cache as a cubelet, which is a multidimensional
container of data. Thus, the data cache is a collection of such cubelets that are
continuously ordered in such a manner as to reduce the time that is required to
search for data in the cache.
All the caches are built at the time of cube start and are flushed out when the
cube stops/restarts.
Thus, DB2 with BLU Acceleration and IBM Cognos Business Intelligence work
together to deliver blazing-fast business analytics. You can analyze key facts
and freely explore information from multiple angles and perspectives to make
more informed decisions about enterprise-scale volumes of data at breakthrough
speeds.
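As a minimal sketch (assuming a DB2 10.5 database that is configured for analytics, for example with the DB2_WORKLOAD=ANALYTICS registry setting, and hypothetical table names), a fact table can be created as column-organized so that the SQL that Cognos Dynamic Cubes pushes down to the database benefits from BLU Acceleration.

-- Column-organized (BLU Acceleration) fact table; typical analytic scans
-- against it usually need no secondary indexes or additional tuning objects.
CREATE TABLE mart.sales_fact (
    time_key     INTEGER        NOT NULL,
    store_key    INTEGER        NOT NULL,
    product_id   INTEGER        NOT NULL,
    sales_amount DECIMAL(12, 2) NOT NULL
)
ORGANIZE BY COLUMN;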
11.4 Resources
For more information about many of the concepts that are described in this
chapter, see the following resources:
IBM Cognos dynamics FAQ:
http://www-01.ibm.com/support/docview.wss?uid=swg27036155
IBM Cognos Dynamic Cubes Version 10.2.1.1 User Guide:
http://public.dhe.ibm.com/software/data/cognos/documentation/docs/en/10.2.1/ug_cog_rlp.pdf
IBM Cognos Business Intelligence 10.2.0 information center:
http://pic.dhe.ibm.com/infocenter/cbi/v10r2m0/index.jsp
IBM Cognos Dynamic Cubes, SG24-8064:
http://www.redbooks.ibm.com/abstracts/sg248064.html
Related publications
The publications that are listed in this section are considered suitable for a more
detailed discussion of the topics that are covered in this book.
IBM Redbooks
The following IBM Redbooks publications provide additional information about
the topic in this document. Some publications referenced in this list might be
available in softcopy only.
IBM Cognos Dynamic Cubes, SG24-8064
InfoSphere Warehouse: A Robust Infrastructure for Business Intelligence,
SG24-7813
You can search for, view, download, or order these documents and other
Redbooks, Redpapers, Web Docs, draft and additional materials, at the following
website:
ibm.com/redbooks
Other publications
These publications are also relevant as further information sources:
Analytics: The new path to value, found at:
http://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=PM&subtype=XB&appname=GBSE_GB_TI_USEN&htmlfid=GBE03371USEN&attachment=GBE03371USEN.PDF
Best Practices: Optimizing analytic workloads using DB2 10.5 with BLU
Acceleration, found at:
https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/0fc2f498-7b3e-4285-8881-2b6c0490ceb9/page/ecbdd0a5-58d2-4166-8cd5-5f186c82c222/attachment/e66e86bf-4012-4554-9639-e3d406ac1ec9/media/DB2BP_BLU_Acceleration_0913.pdf
DB2 Partitioning and Clustering Guide, SC27-2453
Eaton and Cialini, High Availability Guide for DB2, IBM Press, 2004, ISBN
0131448307
IBM Cognos Dynamic Cubes Version 10.2.1.1 User Guide, found at:
http://public.dhe.ibm.com/software/data/cognos/documentation/docs/en/10.2.1/ug_cog_rlp.pdf
Implementing DB2 Workload Management in a Data Warehouse, found at:
https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Wc9a068d7f6a6_4434_aece_0d297ea80ab1/page/Implementing%20DB2%20workload%20management%20in%20a%20data%20warehouse
Physical database design for data warehouse environments, found at:
https://ibm.biz/Bdx2nr
Online resources
These websites are also relevant as further information sources:
IBM Cognos Business Intelligence 10.2.0 information center:
http://pic.dhe.ibm.com/infocenter/cbi/v10r2m0/index.jsp
IBM Cognos dynamics FAQ:
http://www-01.ibm.com/support/docview.wss?uid=swg27036155
IBM DB2 Version 10.5 information center:
http://pic.dhe.ibm.com/infocenter/db2luw/v10r5/topic/com.ibm.db2.luw.welcome.doc/doc/welcome.html
IBM Big Data:
http://www-01.ibm.com/software/data/bigdata
Back cover
SG24-8157-00
ISBN 0738438979
INTERNATIONAL TECHNICAL SUPPORT ORGANIZATION
BUILDING TECHNICAL INFORMATION BASED ON PRACTICAL EXPERIENCE
IBM Redbooks are developed by the IBM International Technical Support Organization. Experts from IBM, Customers and Partners from around the world create timely technical information based on realistic scenarios. Specific recommendations are provided to help you implement IT solutions more effectively in your environment.