Overview - Teradata For Oracle Developers
Table of Contents
Table of Figures
Overview
Assumptions
Architectures
    Section introduction
    EDW
    Data mart architectures
    Oracle overview
    Teradata overview
Teradata v. Oracle
    Oracle-to-Teradata data type conversion
    SQL functions
        Oracle-to-Teradata function conversion
    Data Consistency
        Levels of isolation
        Locking
Teradata utilities
Born to be parallel
    Introduction
    Data placement to support parallel processing
        Indexes
        Strengths and weaknesses of various index types
        Data placement
    Intelligent internodal communication
    Parallel-aware query optimization
    Request parallelism
        Query (request) parallelism
        Within-step parallelism
        Multistep parallelism
        Multistatement request parallelism
    Synchronization of parallel operations
        Synchronized table scans
        Spool file reuse
        Synchronized BYNET operations
    For additional details
Teradata data load & unload
    BTEQ
        BTEQ example
    FastExport
        FastExport example
    FastLoad
        FastLoad example
    MultiLoad
        MultiLoad example
    TPump
        TPump example
    TPT
        Major TPT features
        TPT example
ETL design
    ETL APIs
    Overview of the Teradata PT API
    Informatica's PowerExchange for Teradata Parallel Transporter
Sequences
    Oracle sequences
    Teradata identity columns
        Process for generating identity column numbers
    Oracle sequence v. Teradata identity column
Data compression
    Teradata compression
        Column compression
        How Teradata database compresses data
        Default settings for NULL and COMPRESS
        Meaning of the COMPRESS attribute in table DDL
Interesting SQL differences
Appendices
    Teradata utilities
Table of Figures
Figure 1  EDW architecture
Figure 2  The Oracle instance
Figure 3  The Teradata building block
Figure 4  Oracle-to-Teradata data type conversion
Figure 5  Oracle-to-Teradata function conversion
Figure 6  ANSI SQL isolation phenomena
Figure 7  Teradata locking levels
Figure 8  Teradata locking severities
Figure 9  Teradata locking severity compatibilities
Figure 10 Teradata lock concurrency
Figure 11 Teradata lock isolation levels
Figure 12 Strengths and weaknesses of various index types
Figure 13 Row assignment to AMPs
Figure 14 Ensuring AMP-local joins
Figure 15 AMP data ownership & control
Figure 16 BYNET parallelism
Figure 17 Potential JOIN complexities
Figure 18 Pipeline parallelism
Figure 19 Multistep parallelism
Figure 20 Synchronized table scans
Figure 21 Dynamic BYNET groups
Figure 22 BTEQ example report
Figure 23 FastExport example
Figure 24 FastLoad example
Figure 25 MultiLoad phases
Figure 26 MultiLoad example
Figure 27 TPump example
Figure 28 TPT example
Figure 29 Legacy utility integration architecture for 3rd-party applications
Figure 30 API integration architecture for 3rd-party applications
Figure 31 Generating identity using ALWAYS & BY DEFAULT
Figure 32 Generating identity using ALWAYS & BY DEFAULT (cont)
Figure 33 Generating identity column numbers
Figure 34 Table DDL COMPRESS Attribute
Figure 35 Teradata Release 12.0 documentation
Overview
This paper is an attempt to consolidate in a single document the many topics with which a new Teradata project person must quickly become familiar. Note: This paper provides a high-level overview of the material being presented. Detailed information may be obtained in the corresponding paper, Teradata for Oracle Developers Research Paper.
Assumptions
The primary assumption is that a reader will be familiar with Oracle but will have had little if any experience with Teradata. As such, the following topics will be addressed:
- Review of the importance of architectures
- Review and comparison of Oracle and Teradata architectures
- Review and comparison of Oracle and Teradata data types
- Brief discussion of Oracle and Teradata SQL differences
- Introduction to Teradata data manipulation utilities
- Using ETL tools with Teradata
Architectures
Section introduction
An information system is the organized collection, processing, transmission, and dissemination of information in accordance with defined procedures, whether automated or manual. [1] Architectures are the blueprints describing the environments in which information systems are built. Attempting to create an enterprise data warehouse without laying a proper architectural foundation is just as foolish as trying to build a modern skyscraper without first investing the time to produce a thorough set of blueprints. Both cases risk the loss of large amounts of money and extreme disappointment, if not disaster. It is thus worthwhile to briefly discuss architectural implications and related data mart strategies. This section discusses the following topics:
- Overview of a typical EDW architecture
- Industry state-of-the-art reference architectures
- Data mart architectural alternatives
EDW
EDW initiatives are designed to create analytical and operational decision support capabilities while building an infrastructure for future growth. In addition to targeted technological efficiencies, integrated information strategies enable companies to more effectively achieve and align to the business strategy. An EDW is a true information system. As such, it must be supported by a true architecture. A typical EDW conceptual architecture is shown in Figure 1.
1 www.gao.gov/policy/itguide/glossary.htm
Of course, an alternative EDW architectural approach is available: a reference architecture. A reference architecture simply defines an ideal target architecture. An organization may or may not actually intend to implement it per se. Reference architectures are frequently used as guidelines to provide direction and overall guidance to help ensure that key architectural components are not omitted. The actual architecture that is implemented, however, is unique to the firm's requirements and constructed to meet its needs. Conceptually and functionally, though, it will map to the reference architecture. There are three main industry state-of-the-art reference architecture types:
- Topical architectures. A Topical Architecture is dedicated to explaining the environment supporting a very specific topic or highly vertical application. One example is a Meta Data Topical Architecture.
- End-to-end architectures. An End-to-End Architecture is an overview of all major components in an architecture.
- Data provisioning frameworks. A Data Provisioning Framework is a hybrid Target and End-to-End Architecture. Its purpose is to describe an architecture that provides for near real-time data provisioning that incorporates:
  - Historical persistence;
  - Near real-time access to consolidated transactional data;
  - Mechanisms to publish information;
  - BI enablement;
  - Federated access to non-persistent data; [2]
  - Intelligent query services.

Data mart architectures
This section discusses various approaches and recommendations for implementing data marts (DM).
Independent data marts
Independent data marts are built directly from operational systems rather than from a common data repository such as a data warehouse (DW). They are usually built independently by separate teams rather than in a coordinated fashion. Thus the tool sets are often different, business rules vary and frequently contradict each other, costs escalate, and the overall lack of coordination results in an inconsistent reporting environment.
Independent data marts have issues! Five main problems exist with independent data marts:
1. Redundant data - Costs escalate storing the same thing many times;
2. Redundant processing - As the number of DMs increases, so does the stress on the operational systems;
3. Difficulty in scaling - This is directly related to problem #2;
4. Lack of integration - No single version of the truth;
5. Difficulty integrating reporting.
The industry has proven that independent data marts are rarely beneficial in today's enterprise-centric IT environments.
Dependent data marts
A dependent data mart is one whose source is a DW. Many reasons have proven to make this approach far more viable than that of the independent DM. [3] Several of them are:
- More standardized and accurate data - One of the criteria for DW data is that it be integrated, consolidated, and harmonized. This means the same data looks alike, has the same business rules applied, is quality certified, and so forth. When this data is then used to populate data marts, it will, by definition, retain these characteristics.
- Long-term cost effectiveness via easier maintenance - Maintenance takes place in a single location and is propagated to the appropriate DMs. Each DM is not responsible for applying possibly complex update scenarios or tracing mountains of data errors.
- A true enterprise-wide view of the firm's data and organizational performance - Integration with other DM data is no longer an issue.
2 Federated refers to being in multiple, possibly heterogeneous (diverse or dissimilar), locations.
3 http://www.bettermanagement.com/library/library.aspx?l=347
The industry has shown that the dependent data mart approach is the approach to use unless a very compelling reason proves otherwise.
Virtual data marts
A Virtual Data Mart (VDM), a type of dependent data mart, is a set of views defined on the enterprise warehouse that appear to the user as a separate, self-contained database organized for one specific purpose. [4]
When sufficient performance can be obtained with a virtual data mart, this approach may be advantageous:
- It eliminates the need for an additional copy of the data (no additional disk storage).
- It eliminates the need for daily extracts and downloads (no additional network traffic).
- It eliminates the need to acquire and operate a separate system.
- It provides the opportunity to share the processing load across multiple virtual data marts.
- It is much easier to handle many types of changes to the virtual data mart, as long as the enterprise warehouse is able to deliver the necessary performance and scalability. As an example, providing access to an additional table is simply a matter of defining a new view; no changes to the extract process, download process, or network workload are necessary.
- It reduces backup and recovery tasks.
- It offers an efficient window into the world of enterprise data. When a new item is needed, it already exists and can simply be added to the views making up the VDM.
Virtual data marts can, though, have a downside:
- Performance can become a problem unless a high-end database platform (both hardware and software) is used.
- The number of concurrent users must be controlled.
- All network traffic becomes centralized, causing potential bottlenecks and traffic concerns.
- Complex views can be difficult or impossible to represent in SQL. Complex data models must be translated into SQL, which is sometimes an impossible task.
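To make the virtual data mart idea concrete, the following sketch builds a VDM purely from views. The database, table, and role names (finance_mart, edw.sales_fact, edw.store_dim, finance_users) are hypothetical and used only for illustration.

-- A virtual data mart is just a database of views defined over EDW tables;
-- PERM = 0 because views require no permanent storage of their own.
CREATE DATABASE finance_mart AS PERM = 0;

CREATE VIEW finance_mart.store_sales AS
SELECT s.store_id,
       d.region_name,
       s.sale_date,
       SUM(s.sale_amt) AS total_sales
FROM   edw.sales_fact s
JOIN   edw.store_dim  d
ON     s.store_id = d.store_id
GROUP  BY 1, 2, 3;

-- Expose the mart to its user community without copying any data.
GRANT SELECT ON finance_mart TO finance_users;

Adding a new item to such a mart is then simply a matter of adding or replacing a view.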
So how do we determine which data mart approach to use?
- Ignore independent data marts. They have been proven to cause far too many problems.
- Develop a strategy that determines the requirements of the interested parties.
- Implement only those dependent data marts that are absolutely required (usually mandated by performance considerations or specialized analytical needs).
- Follow the rigor of a virtual data mart strategy. (See Teradata for Oracle Developers Research Paper for a discussion of such a strategy.)
4 http://www.wintercorp.com/rwintercolumns/pitter_patter.html
Oracle overview
Oracle uses what is commonly referred to as a shared memory architecture. An Oracle database system consists of a database and an Oracle instance:
- A database consists of a set of disk files that store user data and meta data. Meta data, or "data about the data," consists of structural, configuration, and control information about the database.
- An Oracle instance (also known as a database instance) contains the set of Oracle Database background processes that operate on the stored data and the shared allocated memory that those processes use to do their work. [emphasis mine] [5]
Oracle uses shared memory for several purposes, including caching of data and indexes as well as storing shared program code. This shared memory is broken into various pieces, or memory structures. The basic memory structures associated with Oracle are the System Global Area (SGA) and the Program Global Area (PGA). [6] Figure 2 illustrates an Oracle instance.
5 Oracle Database 2 Day DBA 11g Release 1 (11.1), Part Number B28301-03
6 http://oracle.basisconsultant.com/oracle_architecture.htm
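As a quick illustration of these structures on a running Oracle instance, the hedged sketch below queries two standard dynamic performance views; it assumes a session with SELECT privileges on V$SGA and V$PGASTAT.

-- Component sizes of the System Global Area (shared memory).
SELECT * FROM V$SGA;

-- Instance-wide Program Global Area statistics.
SELECT name, value
FROM   V$PGASTAT
WHERE  name IN ('aggregate PGA target parameter', 'total PGA allocated');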
Teradata overview
Teradata uses a shared-nothing (SN) architecture. SN architectures are distributed computing architectures in which each node is independent and self-sufficient, and there is no single point of contention across the system. [7] This allows for a great degree of processing parallelism, system scalability, reliability, availability, serviceability, usability, and installability. Figure 3 illustrates the basic building block of the Teradata architecture.
7 http://en.wikipedia.org/wiki/Shared_nothing_architecture
The next sections briefly describe the primary components of the Teradata architecture.
Primary components
The hardware that supports Teradata Database software is based on Symmetric Multi-Processing (SMP) technology. The hardware can be combined with a communications network that connects the SMP systems to form Massively Parallel Processing (MPP) systems.
Processor Node
A processor node is a hardware assembly containing several, tightly coupled, central processing units (CPUs) in an SMP configuration. An SMP node is connected to one or more disk arrays. An MPP configuration is a configuration of two or more loosely coupled SMP nodes. The processor node is the platform upon which the database software operates.
BYNET
A BYNET is the hardware interprocessor network that links the nodes of an MPP system. (Single-node SMP systems use a software-configured virtual BYNET driver to implement BYNET services.) The BYNET implements broadcast, multicast, or point-to-point communication, along with merge functions, between processors. A multinode system has at least two BYNETs.
Disk Arrays
Teradata Database employs Redundant Array of Independent Disks (RAID) storage technology to provide data protection at the disk level.
Virtual Processors
The versatility of Teradata Database is based on virtual processors (vprocs) that eliminate dependency on specialized physical processors. Vprocs are a set of software processes that run on a node under Teradata Parallel Database Extensions (PDE) within the multitasking environment of the operating system. There are two types of vprocs:
- Parsing Engine (PE) - The PE is the vproc that communicates with the client system on one side and with the AMPs (via the BYNET) on the other side. Each PE executes the database software that manages sessions, decomposes SQL statements into steps (possibly in parallel), and returns the answer rows to the requesting client.
- Access Module Processor (AMP) - The AMP is the heart of Teradata Database. The AMP is a vproc that controls the management of the Teradata Database and the disk subsystem, with each AMP being assigned to a vdisk. Each AMP manages a portion of the physical disk space, storing its portion of each database table within that space. AMPs are grouped into logical clusters to enhance the fault-tolerant capabilities of the Teradata Database.
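Because each AMP owns the rows that hash to it, data placement can be inspected directly with SQL. The sketch below uses Teradata's hash functions against a hypothetical customer table whose primary index is cust_id (the names are illustrative only):

-- How many rows of "customer" does each AMP own under the current primary index?
SELECT HASHAMP(HASHBUCKET(HASHROW(cust_id))) AS amp_no,
       COUNT(*)                              AS row_count
FROM   customer
GROUP  BY 1
ORDER  BY 2 DESC;

A heavily skewed result, with a few AMPs holding most of the rows, usually indicates a poor primary index choice.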
What is virtualization? Virtualization is a broad term that refers to the abstraction of computer resources. Virtualization hides the physical characteristics of computing resources from their users, be they applications or end users. This includes making a single physical resource (such as a server, an operating system, an application, or a storage device) appear to function as multiple virtual resources; it can also include making multiple physical resources (such as storage devices or servers) appear as a single virtual resource.
Teradata v. Oracle
Oracle-to-Teradata data type conversion
This section describes the Teradata data types that correspond to those used by Oracle. For a detailed discussion of the data types for both databases please see Teradata for Oracle Developers Research Paper.
Figure 4 summarizes the conversions. Each entry below shows the Oracle data type, the corresponding Teradata data type, and any conversion notes.

- BFILE -> no Teradata equivalent.
- BINARY_DOUBLE -> FLOAT. The Oracle data type requires 9 bytes, including the length byte. Teradata uses eight bytes to hold a floating point value; no length byte is required.
- BINARY_FLOAT -> FLOAT. The Oracle data type requires 5 bytes, including the length byte. Teradata uses eight bytes to hold a floating point value; no length byte is required.
- BLOB -> BLOB[n]. The Oracle maximum size is (4 gigabytes - 1) * (database block size), up to 128 terabytes. The Teradata maximum number of bytes is 2097088000, which is the default if n is not specified.
- CHAR[(n)] -> VARCHAR(n).
- CLOB -> CLOB. The Oracle maximum size is (4 gigabytes - 1) * (database block size), up to 128 terabytes. The Teradata maximum number of bytes is 2097088000, which is the default if n is not specified.
- DATE -> TIMESTAMP. Oracle: valid date range from January 1, 4712 BC, to December 31, 9999 AD; the size is fixed at 7 bytes; the type contains the datetime fields YEAR, MONTH, DAY, HOUR, MINUTE, and SECOND and has neither fractional seconds nor a time zone. Teradata: TIMESTAMP is treated as a record composed of six fields (YEAR, MONTH, DAY, HOUR, MINUTE, and SECOND) defined appropriately for the Gregorian calendar; although the record is composed of numeric fields, it is not treated internally as a numeric value. The length of the internal stored form is 10 bytes, and it does have fractional seconds.
- FLOAT [(p)] -> FLOAT. For Oracle, the precision p can range from 1 to 126 binary digits, and a FLOAT value requires from 1 to 22 bytes. Teradata FLOAT values are represented in sign/magnitude form ranging from 2.226 x 10^-308 to 1.797 x 10^+308; eight bytes are used to hold a floating point value. Note: precision may be lost when converting from Oracle to Teradata.
- INTERVAL DAY TO SECOND -> INTERVAL DAY(n) TO SECOND(m). Oracle: day_precision accepts 0 to 9, default 2; fractional_seconds_precision accepts 0 to 9, default 6. Teradata: n spans 1 to 4, default 2; m spans zero to six, default six.
- INTERVAL YEAR TO MONTH -> INTERVAL YEAR(n) TO MONTH. Oracle: year_precision accepts 0 to 9, default 2. Teradata: n spans 1 to 4, default 2.
- LONG -> CLOB. For Oracle, character data of variable length up to 2 gigabytes (2147483647 bytes). The Teradata maximum number of bytes is 2097088000. Note: data may be lost when converting from Oracle to Teradata.
- LONG RAW -> BLOB. For Oracle, raw binary data of variable length up to 2 gigabytes (2147483647 bytes). The Teradata maximum number of bytes is 2097088000. Note: data may be lost when converting from Oracle to Teradata.
- NCHAR[(size)] -> CHAR[(size)]. Oracle: maximum size is determined by the national character set definition, with an upper limit of 2000 bytes. Teradata: the maximum number of characters or bytes allotted to the column depends on the server character set; for LATIN, UNICODE, and GRAPHIC the CHAR type is defined in terms of characters, and for KANJI1 and KANJISJIS it is defined in terms of bytes.
- NCLOB -> CLOB [UNICODE]. The Oracle maximum size is (4 gigabytes - 1) * (database block size). The Teradata maximum number of bytes is 2097088000, which is the default if n is not specified; if UNICODE is specified, the maximum number of bytes is 1048544000. Note: data may be lost when converting from Oracle to Teradata.
- NUMBER [(p[,s])] -> DECIMAL/NUMERIC [(n[,m])]. In Oracle, the precision p can range from 1 to 38 and the scale s from -84 to 127. For Teradata, this represents a decimal number of n digits, with m of those n digits to the right of the decimal point; 1<=n<=38 with m<=n. Special cases: if p<=3, use BYTEINT; if p=4 or 5, use SMALLINT; if 6<=p<=10, use INT[EGER]; if p>=11, use NUMERIC/DECIMAL.
- NVARCHAR2(size) -> VARCHAR(size). Oracle: the maximum size is determined by the national character set definition, with an upper limit of 4000 bytes. Teradata: the maximum number of characters or bytes allotted to the column depends on the server character set; for LATIN, UNICODE, and GRAPHIC the VARCHAR type is defined in terms of characters, and for KANJI1 and KANJISJIS it is defined in terms of bytes.
- RAW(size) -> BYTE/VARBYTE. VARBYTE is recommended if size >= 6.
- ROWID -> not required in a properly modeled Teradata database.
- TIMESTAMP [(fractional_seconds_precision)] -> TIMESTAMP [(fractional_seconds_precision)]. Oracle: accepted values of fractional_seconds_precision are 0 to 9, default 6. Teradata: values for fractional_seconds_precision range from zero through six inclusive, default six. Note: fractional seconds precision may be lost when converting from Oracle to Teradata.
- TIMESTAMP [(fractional_seconds_precision)] WITH LOCAL TIME ZONE -> TIMESTAMP [(fractional_seconds_precision)]. Oracle: accepted values of fractional_seconds_precision are 0 to 9, default 6. Teradata: values for fractional_seconds_precision range from zero through six inclusive, default six. Specifications regarding LOCAL TIME ZONE are handled using other techniques (see Teradata SQL Reference: Fundamentals; Teradata SQL Reference: Functions and Operators; and Teradata SQL/Data Dictionary Quick Reference). Note: fractional seconds precision may be lost when converting from Oracle to Teradata.
- TIMESTAMP [(fractional_seconds_precision)] WITH TIME ZONE -> TIMESTAMP [(fractional_seconds_precision)] WITH TIME ZONE. Oracle: accepted values of fractional_seconds_precision are 0 to 9, default 6. Teradata: values for fractional_seconds_precision range from zero through six inclusive, default six. Note: fractional seconds precision may be lost when converting from Oracle to Teradata.
- UROWID -> n/a; not required in a properly modeled Teradata database.
- VARCHAR2(size) -> VARCHAR(size).
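As an illustration of these mappings, the sketch below converts a small, hypothetical Oracle table to a Teradata equivalent; the table and column names are invented for the example, and the NUMBER columns follow the special-case rules above.

-- Oracle definition
CREATE TABLE orders (
  order_id    NUMBER(10),       -- 6 <= p <= 10, so INTEGER on Teradata
  customer_nm VARCHAR2(100),
  order_amt   NUMBER(12,2),     -- p >= 11, so DECIMAL on Teradata
  created_dt  DATE
);

-- Teradata equivalent
CREATE TABLE orders (
  order_id    INTEGER,
  customer_nm VARCHAR(100),
  order_amt   DECIMAL(12,2),
  created_dt  TIMESTAMP(0)      -- Oracle DATE carries no fractional seconds
) PRIMARY INDEX (order_id);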
SQL functions
This section describes the Teradata SQL functions that correspond to those used by Oracle. For a detailed discussion of the functions for both databases, please see Teradata for Oracle Developers Research Paper.
Note: Only those functions that are considered within the scope of this document are included.

Oracle-to-Teradata function conversion
Note 1: The Oracle functions which are or may be used as analytic functions are marked with an *.
Note 2: Teradata has developed some User-Defined Functions (UDFs) that simulate Oracle functions and operators that are not explicitly a part of Teradata SQL. As of 26 July 2009, though well documented and tested, these are considered unsupported field-developed UDFs. However, they are used by many Teradata customers and should be considered (after thorough testing, of course) for inclusion in a production EDW environment. The UDFs are identified via the syntax <UDFname>.
Figure 5 summarizes the function conversions. The following Oracle functions have Teradata equivalents of the same name: ABS, ACOS, ASIN, ATAN, ATAN2, AVG *, CORR *, COS, COSH, COUNT *, COVAR_POP *, COVAR_SAMP *, EXP, LN, LOG, MAX *, MIN *, PERCENT_RANK *, RANK *, REGR_AVGX *, REGR_COUNT *, REGR_INTERCEPT *, REGR_R2 *, REGR_SLOPE *, ROW_NUMBER *, SIN, SINH, SQRT, STDDEV_POP *, STDDEV_SAMP *, SUM *, TAN, TANH, VAR_POP *, and VAR_SAMP *.

The remaining functions either convert to a different Teradata construct or have no direct equivalent:

- BITAND - No Teradata equivalent.
- CEIL - <ceil>, a Teradata UDF.
- COLLECT - No equivalent; Teradata does not support nested tables.
- DENSE_RANK * - RANK.
- FIRST *, LAST * - Use the TOP n operator or a QUALIFY clause with RANK or ROW_NUMBER. The following requests are equivalent: SELECT TOP 1 * FROM sales ORDER BY county; and SELECT * FROM sales QUALIFY ROW_NUMBER() OVER (ORDER BY county) <= 1;. Refer to the Teradata SQL Reference: Data Manipulation Statements, Revision 12.0, manual for usage restrictions. Note: always use TOP n where possible.
- FIRST_VALUE *, LAG *, LAST_VALUE *, LEAD * - Rewritten using a self-join.
- MOD - The Teradata MOD is an arithmetic operator, not a function, so the Oracle function MOD(n2,n1) is represented in Teradata by the clause n2 MOD n1.
- NTILE * - Rewritten using RANK. For example, the following divides the values in the salary column of the employees table into 4 buckets. Oracle: SELECT last_name, salary, NTILE(4) OVER (ORDER BY salary DESC) AS quartile FROM employees; Teradata: SELECT last_name, salary, (RANK() OVER (ORDER BY salary DESC) - 1) * 4 / COUNT(*) OVER () AS quartile FROM employee;
- POWER - The exponentiation operator **, so POWER(x,y) becomes x ** y.
- REMAINDER - The MOD operator, so the Oracle function REMAINDER(n2,n1) is represented in Teradata by the clause n2 MOD n1.
- ROUND - <round>, a Teradata UDF. Generic rounding in Teradata is controlled by built-in rules plus the RoundHalfwayMagUp flag in DBSControl.
- STDDEV * - STDDEV_POP.
- TRUNC - <trunc>, a Teradata UDF.
- VARIANCE * - VAR_SAMP. VARIANCE is the same as VAR_SAMP except that, given an input set of one element, VAR_SAMP returns NULL and VARIANCE returns 0.
- WIDTH_BUCKET - Teradata UDF.
- CUME_DIST *, FLOOR, GROUP_ID, GROUPING, GROUPING_ID, MEDIAN, NANVL, PERCENTILE_CONT *, PERCENTILE_DISC *, RATIO_TO_REPORT *, SIGN, and the STATS_* family (STATS_BINOMIAL_TEST, STATS_CROSSTAB, STATS_F_TEST, STATS_KS_TEST, STATS_MODE, STATS_MW_TEST, STATS_ONE_WAY_ANOVA, STATS_T_TEST_INDEP, STATS_T_TEST_INDEPU, STATS_T_TEST_ONE, STATS_T_TEST_PAIRED, STATS_WSR_TEST) - Converted case by case with a field-developed UDF, a rewrite, or no Teradata equivalent.
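The operator-style conversions are easiest to see side by side. The statements below are a sketch against a hypothetical invoice table (the names are illustrative):

-- Oracle: MOD and POWER are functions.
SELECT MOD(invoice_amt, 10)    AS rem_10,
       POWER(exchange_rate, 2) AS rate_sq
FROM   invoice;

-- Teradata: MOD and ** are operators, not functions.
SELECT invoice_amt MOD 10      AS rem_10,
       exchange_rate ** 2      AS rate_sq
FROM   invoice;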
Data Consistency [9]
In a single-user database, the user can modify data in the database without concern for other users modifying the same data at the same time. However, in a multiuser database, the statements within multiple simultaneous transactions can update the same data. Transactions executing at the same time need to produce meaningful and consistent results. Therefore, control of data concurrency and data consistency is vital in a multiuser database. Data concurrency means that many users can access data at the same time. Data consistency means that each user sees a consistent view of the data, including visible changes made by the user's own transactions and transactions of other users.
Transaction serializability To describe consistent transaction behavior when transactions run at the same time, database researchers have defined a transaction isolation model called serializability. The serializable mode of transaction behavior tries to ensure that transactions run in such a way that they appear to be executed one at a time, or serially, rather than concurrently. Additionally, concurrent database accesses by transactions are controlled in a manner such that any arbitrary serial execution of those transactions preserves the integrity of the database.
While the degree of isolation provided by serializability between transactions is generally desirable, running many applications in this mode can seriously compromise application throughput. Complete isolation of concurrently running transactions could mean that one transaction cannot perform an insert into a table being queried by another transaction. In short, real-world considerations usually require a compromise between perfect transaction isolation and performance. Levels of isolation The ANSI SQL-2003 standard defines isolation level as follows: The isolation level of an SQL-transaction defines the degree to which the operations on SQL-data or schemas in that SQL-transaction are affected by the effects of and can affect operations on SQL-data or schemas in concurrent SQL-transactions. Note that isolation level is a concept related to concurrently running transactions and how well their updates are protected from one another as a system processes their respective transactions. Serializability defines transaction isolation. A transaction is either isolated from other concurrently running transactions or it is not. If you can achieve greater concurrency at the expense of imperfect isolation by using a lower isolation level, while at the same time being certain that you can avoid concurrency errors, then there is no reason not to run under that isolation level. The result is that you use CPU resources more effectively, while still guaranteeing serializable execution for the specific workload implemented in these transactions. The ANSI SQL standard formalizes what it refers to as four isolation levels for transactions. To be precise, this section of the standard defines isolation (called serializable) and three weaker, non-serializable isolation levels that permit certain prohibited operation sequences to occur.
9 Teradata SQL Reference: Statement and Transaction Processing, Release 12.0; Oracle Database Concepts 11g Release 1 (11.1)
The standard collectively refers to these prohibited operation sequences as phenomena. Note that the ANSI isolation levels are defined in terms of these phenomena, not in terms of locking, even though all commercial RDBMSs implement transaction isolation using locks. The three preventable phenomena are:
- Dirty reads - A transaction reads data that has been written by another transaction that has not been committed yet.
- Nonrepeatable (fuzzy) reads - A transaction rereads data it has previously read and finds that another committed transaction has modified or deleted the data.
- Phantom reads (phantoms) - A transaction re-runs a query returning a set of rows that satisfies a search condition and finds that another committed transaction has inserted additional rows that satisfy the condition.
Figure 6 is taken from the ANSI SQL standard with slight modification. It specifies the phenomena that are or are not possible for each isolation level.
Note 1: The Oracle Database offers the read committed and serializable isolation levels, as well as a read-only mode that is not part of the ANSI SQL standard. Read committed is the default.
Note 2: The Teradata Database does not support the isolation levels READ COMMITTED and REPEATABLE READ. READ UNCOMMITTED is implemented using an ACCESS-level lock.
Sometimes you might be willing to give up a level of transaction isolation insurance in return for better performance. While this makes no sense for operations that write data, it can sometimes make sense to permit dirty read operations, particularly if you are only interested in gaining a general impression of some aspect of the data rather than obtaining consistent, reliable, repeatable results.

Locking
In general, multiuser databases use some form of data locking to solve the problems associated with data concurrency, consistency, and integrity (in other words, to enforce full transaction isolation as described in the previous section). Locks are mechanisms that prevent destructive interaction between transactions accessing the same resource, either user objects such as tables and rows or system objects not visible to users, such as data dictionary rows or Oracle shared data structures in memory. The following sections discuss the differences between the locking philosophies used by Oracle and Teradata.
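As noted above, Teradata implements READ UNCOMMITTED with ACCESS locks. A dirty read can therefore be requested either per request, with the LOCKING modifier, or per session, as in the sketch below; the table and column names are illustrative.

-- Per request: downgrade the read to an ACCESS lock for this query only.
LOCKING TABLE sales FOR ACCESS
SELECT region_id, SUM(sale_amt)
FROM   sales
GROUP  BY 1;

-- Per session: make READ UNCOMMITTED the default isolation level.
SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;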
What in the world is a pseudo-table?
A pseudo-table can be thought of as an alias for the physical table it represents. The purpose of pseudo-tables is to provide a mechanism for queueing table locks in order to avoid the global deadlocking that can otherwise occur in response to a full-table scan request in a parallel system. So, when you make an all-AMP request for a READ, WRITE, or EXCLUSIVE lock, the system automatically imposes pseudo-table locking.
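Pseudo-table locking can be observed by explaining any all-AMP request, as in this sketch (the table name is illustrative, and the exact EXPLAIN wording varies by release):

EXPLAIN
SELECT * FROM sales;
-- The first step of the plan typically reads along the lines of
-- "we lock a distinct SALES."pseudo table" for read on a RowHash to prevent
-- global deadlock", i.e., the table-level READ lock is queued through the
-- pseudo-table mechanism rather than being requested on every AMP at once.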
Oracle locks
Oracle Database automatically provides read consistency to a query so that all the data that the query sees comes from a single point in time (statement-level read consistency). Oracle Database can also provide read consistency to all of the queries in a transaction (transaction-level read consistency).
Oracle Database uses the information maintained in its rollback segments to provide these consistent views. The rollback segments contain the old values of data that have been changed by uncommitted or recently committed transactions. Please see Teradata for Oracle Developers Research Paper for a detailed discussion of Oracle locking.
Teradata Locks
Lock manager
Overview
The Teradata Database Lock Manager imposes concurrency control by locking the database object being accessed by each transaction and releasing those locks when the transaction either commits or rolls back its work. This control ensures that the data remains consistent for all users. Note that with the exception of pseudo-table locks, locks in the Teradata Database are not managed globally, but by each AMP individually. (While the Lock Manager imposes locks automatically, a user can upgrade locks explicitly by using the SQL LOCKING modifier.)

Locking considerations
The Teradata Database always makes an effort to lock database objects at the least restrictive level and severity possible to ensure database integrity while at the same time maintaining maximum concurrency. [10] When determining whether to grant a lock, the Lock Manager takes into consideration both the requested locking severity and the object to be locked. For example, a READ lock requested at the table level cannot be granted if a WRITE or EXCLUSIVE lock already exists on any of the following database objects:
- The database that owns the table
- The table itself
- Any rows in the table
A WRITE lock requested at the row hash level cannot be granted if a READ, WRITE, or EXCLUSIVE lock already exists on any of the following database objects:
10 This is not strictly true for host utility (HUT) locks, where the same locking levels and severities are always used for a given Archive/Recovery utility command. That does not mean the locking levels and severities for HUT locks are not optimal, but rather that the optimum locks for those operations are predetermined.
- The owner database for the table
- The parent table for the row
- The row hash itself
In each case, the request is queued until the conflicting lock is released. It is possible to exhaust Lock Manager resources. Any transaction that requests a lock when Lock Manager resources are exhausted aborts. In such cases, row hash locking for DDL statements can be disabled. You can review all the active locks and determine which other user locks are blocking your transactions if you have the performance monitor MONITOR SESSION privilege and an application that uses the Performance Monitor/Application Programming Interface.

Releasing locks
The Lock Manager releases all locks held by a transaction under the following conditions:
- An implicit transaction commits (Teradata session mode). All ANSI mode transactions are explicit.
- A two-phase commit (2PC) transaction commits (Teradata session mode). 2PC is not valid in ANSI mode.
- An explicit transaction commits by issuing its outermost END TRANSACTION statement (Teradata session mode). Explicit transactions are not valid in ANSI mode.
- A transaction issues a COMMIT statement (ANSI session mode). The COMMIT statement is not valid in Teradata mode.
- A transaction issues a ROLLBACK or ABORT statement (all session modes).
Unlike the case for database transaction locks, you must release HUT locks explicitly. This is done either by specifying the RELEASE LOCK option as part of a given Archive/Recovery-related utility command or by issuing a separate RELEASE LOCK command in your job script, placed to execute after the command that set those HUT locks completes. Unreleased HUT locks persist after job completion, user logoff, and even after system restarts, so you must be careful to ensure that any HUT locks set by an Archive/Recovery-related job are explicitly released after that job completes.
ANSI v. Teradata Session Modes Teradata Database was created prior to the ANSI SQL standard(s) being created. Thus the original Teradata SQL syntax and transaction processing philosophies were different from those eventually adopted by the ANSI committee. In subsequent releases, Teradata Database has added more and more compliance with ANSI-2003. However, to maintain compatibility (and since the ANSI standard has some very silly and often confusing syntax), two modes of operation are available: Teradata session mode and ANSI session mode. Please see the Teradata for Oracle Developers Research Paper for details.
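The practical difference between the two session modes shows up in how transactions are delimited. A minimal sketch, with illustrative table names:

-- Teradata session mode: an explicit transaction bracketed by BT/ET.
BT;
UPDATE employee   SET salary = salary * 1.03 WHERE dept_no = 100;
DELETE FROM bonus WHERE dept_no = 100;
ET;   -- all locks are held until this outermost END TRANSACTION

-- ANSI session mode: the transaction is implicit and ends at COMMIT (or ROLLBACK).
UPDATE employee   SET salary = salary * 1.03 WHERE dept_no = 100;
DELETE FROM bonus WHERE dept_no = 100;
COMMIT;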
Locking and transaction processing
Overview
A lock placed as part of a transaction is held during processing of the transaction and continues to be held until one of the following events occurs:
- The transaction completes.
- The transaction aborts and has completed its rollback.
- The system restarts and aborted transactions have completed rollback. During system restart, only update transactions that were in progress at the time of the crash are aborted and rolled back. WRITE and EXCLUSIVE locks remain in place for those transactions until they are rolled back.

Implicit transactions
In Teradata session mode, because an implicit (system-generated) transaction is taken as a single request, the Optimizer can determine what kinds of locks are required by the entire transaction at the time the request is parsed. Before processing begins, the Optimizer can arrange any table locks in an ordered fashion to minimize deadlocks. For a single statement transaction, the Optimizer specifies a lock on a row hash only while the step that accesses those rows is executing.

Explicit transactions
When several requests are submitted as an explicit (user-generated) transaction, the requests are processed one at a time. This means that the Optimizer has no way of determining what locks will be needed by the transaction as a whole. Because of this, locks are placed as each request is received [11], and all locks are held until one of the following events occurs:
- Completion of either of the following, depending on the mode, but regardless of when the user receives the data (the spool file [12] might exist beyond the end of the transaction):
  - the outermost END TRANSACTION statement in Teradata session mode;
  - the COMMIT statement in ANSI mode.
- The transaction aborts.
- The system restarts.

Locking levels and severities
Teradata Database locks have two orthogonal dimensions: level and severity. [13]
11 The exception to this is the WRITE locks placed by a SELECT AND CONSUME operation on a queue table. The system grants WRITE locks on a queue table only when one or more rows exist in the table. The system does not grant locks at the time it receives the SELECT AND CONSUME request.
12 A spool file is simply a temporary work file usually required during the execution of an SQL statement.
13 There is a small, but unimportant, correlation between locking levels and locking severity. For example, it makes no sense to apply a row hash-level lock with EXCLUSIVE severity because the row hash level is atomic and so cannot be shared.
The level of a lock refers to its scope or granularity: the type and, by inference, the size of the object locked. [14] For example, a database lock is a higher, less finely grained level lock than a row hash lock, which operates at a lower level and finer granularity. The selection of lock granularity is always a tradeoff between the conflicting demands of concurrency and overhead. For example, concurrency increases as the choice of locking level becomes increasingly granular. Exerting a row hash-level lock permits more users to access a given table than exerting a table-level lock on the same table. This is why the system provides multiple levels of locking granularity. The severity of a lock refers to its degree of restrictiveness or exclusivity, such as a WRITE lock being more restrictive than an ACCESS lock, or an EXCLUSIVE lock being more restrictive than a READ lock.

Locking levels
The hierarchy of locking levels for a database management system is a function of the available granularities of locking, with database-level locks having the lowest (coarsest) granularity and row hash-level locks having the highest (finest) granularity. Depending on the request being processed, the system places a certain default lock level on the object of the request, as shown in Figure 7.
The lock levels, from most restrictive to least restrictive, are:
- Database - Locks all rows of all tables in the database and their associated secondary index subtables.
- Table - Locks all rows in the base table and in any secondary index and fallback subtables associated with it.
- View - Locks all underlying tables accessed by the view.
- Row hash - Locks the primary copy of rows sharing the same row hash value. A row hash lock permits other users to access other data in the table and is the least restrictive type of lock. A row hash lock applies to a set of rows that shares the same hash code; it does not necessarily, nor even generally, lock only one row. A row hash lock is applied whenever a table is accessed using a unique primary index (UPI). For an update that uses a unique secondary index (USI), the appropriate row of the secondary index subtable is also locked. It is not necessary to lock the fallback copy of the row, nor any associated row of a nonunique secondary index (NUSI) subtable.
14 Note that the Teradata Database locks rows at the level of row hash, not the individual row, which means that a row hash-level lock typically locks more than one row.
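The difference between a row hash lock and a table lock is easiest to see with primary index access; the sketch below assumes a hypothetical account table with a unique primary index on acct_id.

CREATE TABLE account (
  acct_id  INTEGER NOT NULL,
  balance  DECIMAL(12,2)
) UNIQUE PRIMARY INDEX (acct_id);

-- Access through the UPI: only the row hash for acct_id = 1001 is WRITE locked.
UPDATE account SET balance = balance - 50.00 WHERE acct_id = 1001;

-- No usable index in the predicate: a full scan, so a table-level WRITE lock is used.
UPDATE account SET balance = balance * 1.01 WHERE balance < 0;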
The locking level determines whether other users can access the target object. Locking severities and locking levels combine to exert various locking granularities. The less granular the combination, the greater the impact on concurrency and system performance, and the greater the delay in processing time.

Locking severities
Five types of locks are used, as shown in Figure 8.
Description Placed only on a database or table when the object is undergoing structural changes (for example, a column is being created or dropped) or when a database is being restored, copied, rolled back, rolled forward, or built by an Archive/Recovery utility command, in which case a HUT EXCLUSIVE lock is placed on the resource. An EXCLUSIVE lock restricts access to the object by any other user. You can also place this lock explicitly using the LOCKING modifier Placed in response to an INSERT, UPDATE, or DELETE request or when a set of database objects is being recovered or restored by an Archive/Recovery utility command. A WRITE lock restricts access by other users (except for applications that are not concerned with data consistency and choose to override the automatically applied WRITE lock by specifying a less restrictive ACCESS lock). You can also place this lock explicitly using the LOCKING modifier Placed in response to a SELECT request. A READ lock restricts access by users who require EXCLUSIVE or WRITE locks. You can also place this lock explicitly using the LOCKING modifier.
Archive requests can also place HUT READ and HUT GROUP READ locks on database resources. You must release any HUT READ or HUT GROUP READ locks you set either by submitting the RELEASE LOCK Archive utility command after the command that set those locks completes, or by specifying the RELEASE LOCK option in the command text itself. The CHECKSUM, ACCESS, and HUT ACCESS locking severities are all at the same level in the restrictiveness hierarchy.
CHECKSUM: Placed in response to a user-specified LOCKING FOR CHECKSUM modifier when using cursors in embedded SQL. CHECKSUM locking is identical to ACCESS locking except that it adds checksums to the rows of a spool file to allow a test of whether a row in the cursor has been modified by another user or session at the time an update is being made through the cursor.

ACCESS (least restrictive): Placed in response to a user-specified LOCKING FOR ACCESS modifier or by setting the session default isolation level to READ UNCOMMITTED using the SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL statement. Permits a user to have READ access to an object that might already be locked for READ or WRITE. An ACCESS lock does not restrict access by another user except when an EXCLUSIVE lock is required; therefore, it is sometimes referred to as a dirty READ lock. A user requesting an ACCESS lock disregards all data consistency issues. Because ACCESS and WRITE locks are compatible, the data might be undergoing updates while the user who requested the access is reading it; therefore, any query that requests an ACCESS lock might return incorrect or inconsistent results. Note that the GROUP READ lock set by the Archive utility also places a HUT ACCESS lock internally on its subject table for the duration of the GROUP READ lock.
The Archive/Recovery (ARC) utility uses HUT Exclusive locks when restoring, copying, rolling back, rolling forward, or building a database. Unlike with database locks, you must explicitly release HUT locks with the RELEASE LOCK option in the ARC utility command or by issuing a separate RELEASE LOCK in a script that sets HUT locks. Client utilities (BTEQ, FastExport, FastLoad, MultiLoad, Teradata Parallel Data Pump, and Teradata Parallel Transporter) use standard database locks. When using the RcvManager utility to cancel a transaction rollback, use the LOCKING FOR READ OVERRIDE modifier for the LOCKING request if you want to inspect the rows. Note: In some cases, you can override locks by using locking modifiers. However, these modifiers should not be used for OLTP because OLTP uses table level locks that incur significant overhead. Compatibility among locking severities No locking severity is compatible with all other locking severities. There are two basic types of locking severity for any relational database management system: Read and Write. The transaction literature refers to these two basic types as Access locking and Exclusive locking, though the terms are not defined in the same way as they are used in the Teradata Database. Read locks always conflict with Write locks, while Write locks always conflict with other Write locks. The Teradata Lock Manager controls three types of Read lock:
ACCESS
CHECKSUM
READ

The Teradata Lock Manager also controls two types of Write lock:

WRITE
EXCLUSIVE

Various commands of the Archive/Recovery utility use three types of Read lock:

HUT ACCESS
HUT READ
HUT GROUP READ

Various commands of the Archive/Recovery utility use only one type of Write lock:

HUT EXCLUSIVE

Figure 9 summarizes the compatibilities and incompatibilities among the various locking severities used by the Teradata Lock Manager. Note that the system enforces the identical compatibilities, where relevant, for HUT locks.
Access, HUT Access, or Checksum: compatible with Access, HUT Access, Checksum, Read, HUT Read, HUT Group Read, and Write; not compatible with Exclusive or HUT Exclusive.
Read, HUT Read, or HUT Group Read: compatible with Access, HUT Access, Checksum, Read, HUT Read, and HUT Group Read; not compatible with Write, Exclusive, or HUT Exclusive.
Write: compatible with Access, HUT Access, and Checksum; not compatible with Read, HUT Read, HUT Group Read, Write, Exclusive, or HUT Exclusive.
Exclusive or HUT Exclusive: not compatible with any other locking severity (Access, HUT Access, Checksum, Read, HUT Read, HUT Group Read, Write, Exclusive, or HUT Exclusive).
Lock concurrency

Figure 10 shows whether a new lock request is granted or must wait behind other locks that are either in queue or already in effect. For example, a new Read request must wait until the Write lock ahead of it is released before the new Read request can go into effect. Note: The system enforces the identical compatibilities with other locks for HUT ACCESS, HUT READ, HUT GROUP READ, and HUT EXCLUSIVE locks as it does for the comparably named database locks.
Figure 10 is a matrix that pairs the level of lock already held (None; Access, HUT Access, or Checksum; Read, HUT Read, or HUT Group Read15; Write; Exclusive or HUT Exclusive) with the severity of a new lock request, and each cell shows whether the new request is granted or must wait. When no lock is held, any request is granted.
A queued request is in an I/O wait state and is said to be blocked. An operation can wait for a requested lock indefinitely unless you specify the NOWAIT option. If you specify a LOCKING FOR NOWAIT request modifier, the transaction aborts if it is blocked instead of queuing. The type and level of locks are automatically chosen by the system based on the type of SQL statement you issue. You can, however, upgrade a lock or downgrade a read lock to an access lock as necessary. Override locks by specifying the LOCKING request modifier in your SQL statements.
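A rough sketch of the LOCKING request modifier follows; the employee table and its columns are illustrative, not from the original. The first request downgrades the default READ lock of a SELECT to an ACCESS lock, and the second requests a WRITE lock that aborts rather than queues:

    LOCKING TABLE employee FOR ACCESS
    SELECT dept_no, COUNT(*)
    FROM employee
    GROUP BY dept_no;          -- reads while updates may be in flight (a dirty read)

    LOCKING TABLE employee FOR WRITE NOWAIT
    UPDATE employee
    SET dept_no = 42
    WHERE emp_id = 1001;       -- aborts instead of queuing if the lock cannot be granted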
15 HUT GROUP READ locks are implemented internally as RANGE READ locks. This lock places a HUT ACCESS lock on the subject table for the duration of the Read operation and a series of rolling HUT READ locks on the component row sets of the table as rows are read in sequence from a set of data blocks ranging between system-determined starting and ending row hash values.
If dirty reads are acceptable, use the locking severity of ACCESS at the locking level ROW hash as the best method to avoid deadlocks and minimize overhead. A WRITE or EXCLUSIVE lock on a database, table, or view restricts all requests or transactions except the one holding the lock from accessing data within the domain of that object. Because a lock on an entire database can restrict access to a large quantity of data, the Lock Manager ensures that default database locks are applied at the lowest possible level and severity required to secure the integrity of the database while simultaneously maximizing concurrency. Table-level WRITE locks on dictionary tables prevent contending tasks from accessing the dictionary, so the Lock Manager attempts to lock dictionary tables at the row hash level whenever possible.

Isolation levels

Isolation levels, shown in Figure 11, determine how a program or SQL statement acquires locks. To control the level for an individual request, use the LOCKING statement modifier. To control the isolation level for a session, use the TRANSACTION ISOLATION LEVEL statement.
If you need fast response time more than data accuracy, use READ UNCOMMITTED. Response time is quick because queries can retrieve data without taking locks. Use READ UNCOMMITTED for read-only operations when it does not matter that the data has not yet been committed. ("Read-only" operations do not include any SELECT statements specified within DELETE, INSERT, or UPDATE requests.) READ UNCOMMITTED is especially useful for:

Looking up data that is generally static in nature, for example, codes or references in a look-up table.
Gathering information for statistical processing on a large amount of data when you only want to obtain averages or estimates for a general impression.

Note: Under READ UNCOMMITTED, the queries of the transaction read data as though they held ACCESS locks, which may result in dirty reads. Dirty reads mean the data might be inaccurate if it was in the process of being updated by some other transaction. Updates, inserts, and deletes are not affected because write operations obtain exclusive locks that are not released until the end of the transaction.
If you need accurate and committed data, use SERIALIZABLE. SERIALIZABLE isolation provides a stable view of the database for SELECT transactions. For transactions containing UPDATE, INSERT, and DELETE requests, the SERIALIZABLE option causes the system to execute all transactions as though they were run one after another, even if they are run concurrently. Therefore, when processing locks, the system must wait for transactions to commit and may take longer to return results.
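A minimal sketch of setting the session isolation level using the statement named above; the look-up query and its ref_codes table are illustrative:

    SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    SELECT code, description
    FROM ref_codes;            -- read-only lookup; a dirty read is acceptable here

    SET SESSION CHARACTERISTICS AS TRANSACTION ISOLATION LEVEL SERIALIZABLE;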
Teradata utilities
Please see the Appendix Teradata utilities for the tools and utilities documentation available as of Teradata Database Release 12.0. Due to their importance and fundamental differences from Oracle, the data load and unload utilities will be discussed in the section Teradata data load & unload. It is left as an exercise for the reader to investigate other Teradata topics of interest vis-à-vis Oracle.
Born to be parallel
Introduction
The original charter for development of the Teradata system included the following goals:

A large-capacity, parallel processing database machine with thousands of MIPS capable of storing terabytes of total user data and billions of rows in a single table.
Fault tolerance, with no single point of failure, to ensure data integrity.
Redundant network connectivity to ensure system throughput.
Manageable, scalable growth.
A fully relational database management system using a standard, non-proprietary access language.
Faster response times than any competing relational database management systems.
A centralized shared information architecture in which a single version of the truth is presented to users.16

Because the Teradata architecture was designed from the outset to support a relational database management system, its component subsystems were all designed to optimally support the established norms for a relational database management system, including full normalization of the database schema. Significant for the handling of normalized data was the incorporation of parallel processing into the system. For example, because it was designed to perform parallel processing from the outset, the Teradata architecture does not suffer from the shared-resource allocation problems experienced by systems that have been adapted for parallelism after the fact. This is because the system is designed to maximize throughput while multiple dimensions of parallel processing are available for each individual system user.
16 There is a difference between a single version (or source) of the truth and a single view of the truth. It is quite possible, and often very necessary, to have multiple views of the truth, but these multiple views should all be based on a single version of the truth if they are to be relied upon for decision making.
Repeating for emphasis: unlike Oracle, the Teradata architecture is parallel from the ground up and has always been so. Its file system, message subsystem, lock manager, and query optimizer all fit together snugly, all working in parallel. Among the fundamental aspects of Teradata parallelism are the following:

Data placement to support parallel processing
Intelligent internodal communication
Parallel-aware query optimization
Request parallelism
Synchronization of parallel operations
Row Hash

To determine the AMP to which a row is assigned, the primary index value of the row is transformed by a mathematical function, called a hash function, to produce an abstract value, the row hash, that is not related to the original data value in an obvious way. Row hashes are assigned to hash buckets, which are memory-resident routing structures that map a particular hash code, or range of hash codes, to an AMP location in a 1:1 manner. And, in case you're wondering, Teradata handles hash collisions (different primary index values producing the same row hash) quite nicely.
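As a rough illustration (the employee table and emp_id column are hypothetical, not from the original), Teradata's built-in hash functions can show how primary index values map to AMPs:

    SELECT HASHAMP(HASHBUCKET(HASHROW(emp_id))) AS owning_amp,
           COUNT(*) AS rows_on_amp
    FROM employee
    GROUP BY 1
    ORDER BY 1;     -- one output row per AMP shows how evenly the PI distributes the rows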
Indexes

An index is a physical mechanism used to store and access the rows of a table. Indexes on tables in a relational database function much like indexes in books: they speed up information retrieval. In general, Teradata Database uses indexes to:

Distribute data rows.
Locate data rows.
Improve performance. Indexed access is usually more efficient than searching all rows of a table.
Ensure uniqueness of the index values. Only one row of a table can have a particular value in the column or columns defined as a unique index.

Teradata Database supports the following types of indexes:

Primary
Secondary
Join
Hash
Special indexes for referential integrity
Primary indexes
Teradata Database requires one Primary Index (PI) for each table in the database, except for some data dictionary tables and global temporary tables. If unique, the PI is a column, or columns, that has no duplicate values. If nonunique, the PI is a column, or columns, that may have nonunique, or duplicate, values. Unique Primary Indexes (UPIs) guarantee uniform distribution of table rows. Nonunique Primary Indexes (NUPIs) can cause skewed data. While not a guarantor of uniform row distribution, the degree of uniqueness of the index will determine the degree of uniformity of the distribution. Because all rows with the same PI value end up on the same AMP, columns with a small number of distinct values that are repeated frequently do not make good PI candidates. The most efficient access method to data in a table is through the PI.
Both Unique Primary Indexes (UPIs) and Nonunique Primary Indexes (NUPIs) can be partitioned, though a non-partitioned PI is the default Teradata Database PI. Teradata Database supports a Multilevel Partitioned Primary Index (MLPPI) wherever a Partitioned Primary Index (PPI) is supported.
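A minimal sketch of both forms, assuming hypothetical employee and sales tables (names and columns are illustrative):

    CREATE TABLE employee
      (emp_id     INTEGER NOT NULL,
       ssn        CHAR(9),
       last_name  VARCHAR(30),
       dept_no    INTEGER)
    UNIQUE PRIMARY INDEX (emp_id);                  -- UPI: guarantees uniform distribution

    CREATE TABLE sales
      (store_id   INTEGER NOT NULL,
       sale_date  DATE NOT NULL,
       amount     DECIMAL(10,2))
    PRIMARY INDEX (store_id)                        -- NUPI
    PARTITION BY RANGE_N(sale_date BETWEEN DATE '2007-01-01'
                                       AND DATE '2007-12-31'
                         EACH INTERVAL '1' MONTH);  -- PPI on the date column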
Secondary indexes
Secondary Indexes (SIs) allow access to information in a table by alternate, less frequently used paths and improve performance by avoiding full-table scans. Although SIs add to table overhead in terms of disk space and maintenance, you can drop and recreate SIs as needed. SIs:

Do not affect the distribution of rows across AMPs.
Can be unique or nonunique.
Are used by the Optimizer when the indexes can improve query performance.

The system builds subtables for all SIs. The subtable contains the index rows that associate the SI value with one or more rows in the base table.
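A minimal sketch, reusing the illustrative employee table and its ssn and dept_no columns:

    CREATE UNIQUE INDEX (ssn) ON employee;          -- USI: exactly one row per value
    CREATE INDEX dept_ix (dept_no) ON employee;     -- NUSI: many rows may share a value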
Join indexes
A Join Index (JI) is an indexing structure containing columns from one or more base tables. Some queries can be satisfied by examining only the JI when all referenced columns are stored in the index. Such queries are said to be covered by the JI. Other queries may use the JI to qualify a few rows, then refer to the base tables to obtain requested columns that are not stored in the JI. Such queries are said to be partially covered by the index. Because Teradata Database supports multitable, partially covering JIs, all types of JIs, except the aggregate JI, can be joined to their base tables to retrieve columns that are referenced by a query but are not stored in the JI. Aggregate JIs can be defined for commonly used aggregation queries. Much like SIs, JIs impose additional processing on insert and delete operations and on update operations that change the value of columns stored in the JI. The performance trade-off considerations are similar to those for SIs. There are three types of JIs:

Single table
Multiple table
Aggregate

When you enter a query, the Optimizer determines whether accessing the JI gives the correct answer and is more efficient than accessing the base tables. A JI defined with a WHERE clause that limits it to a subset of rows is called a sparse JI; a sparse JI restricted to 2007 data, for example, would be selected by the Optimizer only for queries that restrict themselves to data from the year 2007.
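A minimal sketch of such a sparse join index over the hypothetical sales table introduced earlier (names are illustrative):

    CREATE JOIN INDEX sales_2007_ji AS
    SELECT store_id, sale_date, amount
    FROM sales
    WHERE sale_date BETWEEN DATE '2007-01-01'
                        AND DATE '2007-12-31'       -- the WHERE clause makes the JI sparse
    PRIMARY INDEX (store_id);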
Hash indexes
The hash index provides an index structure that can be hash-distributed to AMPs in various ways. The index has characteristics similar to a single-table JI with a row identifier that provides transparent access to the base table. A hash index may be simpler to create than a corresponding JI. The hash index has been designed to improve query performance in a manner similar to a single-table JI. In particular, you can specify a hash index to:
Cover columns in a query so that the base table does not need to be accessed.
Serve as an alternate access method to the base table in a join or retrieval operation.

Strengths and weaknesses of various index types

Teradata Database does not require, or even allow, users to explicitly dictate how indexes should be used for a particular query. The Optimizer costs all of the reasonable alternatives and selects the least expensive. The object of any query plan is to return accurate results as quickly as possible. Therefore, the Optimizer uses an index or indexes only if the index speeds up query processing. In some cases, the Optimizer processes the query without using any index. Selection of indexes:

Can have a direct impact on overall Teradata Database performance.
Is not always a straightforward process.
Is based partly on usage expectations.

The following table assumes execution of a simple SELECT statement and explains the strengths and weaknesses of some of the various indexing methods.
Unique primary index (UPI)
Strengths: is the most efficient access method when the SQL statement contains the PI value; involves one AMP and one row; requires no spool file (for a simple SELECT); can obtain the most granular locks.
Possible drawbacks: none, in the context of a SELECT statement specifying a PI value; however, a poorly chosen PI can cause poor overall performance in a large workload.

Nonunique primary index (NUPI)
Strengths: provides efficient access when the SQL statement contains the PI value; involves one AMP; can obtain granular locks; may not require a spool file as long as the number of rows returned is small.
Possible drawbacks: may slow down INSERTs for a SET table with no USIs; may decrease the efficiency of SELECTs containing the PI value when some values are repeated in many rows.

Unique secondary index (USI)
Strengths: provides efficient access when the SQL statement contains the USI values and you do not specify PI values; involves two AMPs and one row; requires no spool file (for a simple SELECT).

Nonunique secondary index (NUSI)
Strengths: provides efficient access when the number of rows per value in the table is relatively small; involves all AMPs and probably multiple rows; provides access using information that may be more readily available than a UPI value, such as employee last name compared to an employee number; may require a spool file.
Possible drawbacks: requires additional overhead for INSERTs, UPDATEs, MERGEs, and DELETEs; will not be used by the Optimizer if the number of data blocks accessed is a significant percentage of the data blocks in the table, because the Optimizer will determine that a full-table scan is cheaper.

Full-table scan
Strengths: accesses each row only once; provides access using any arbitrary set of column conditions.
Possible drawbacks: examines every row; usually requires a spool file possibly as large as the base table.

Multitable join index
Strengths: can eliminate the need to perform certain joins and aggregates repetitively; may be able to satisfy a query without referencing the base tables; can have a different PI from that of the base table; can replace an NUSI or a USI.
Possible drawbacks: requires additional overhead for INSERTs, UPDATEs, MERGEs, and DELETEs for any of the base tables that contribute to the multitable JI; usually is not suitable for data in tables subjected to a large number of daily INSERTs, UPDATEs, MERGEs, and DELETEs; imposes some restrictions on operations performed on the base table.

Hash index or single-table join index
Strengths: can isolate frequently used columns (or their aggregates, for JIs only) from those that are seldom used; can reduce the number of physical I/Os when only commonly used columns are referenced; can have a different PI from that of the base table.
Possible drawbacks: requires additional overhead for INSERTs, UPDATEs, MERGEs, and DELETEs; imposes some restrictions on operations performed on the base table.

Sparse join index
Strengths: can be stored in less space than an ordinary JI; reduces the additional overhead associated with INSERTs, UPDATEs, MERGEs, and DELETEs to the base table when compared with an ordinary JI; can exclude common values that occur in many rows to help ensure that the Optimizer chooses to use the JI to access less common values.
Possible drawbacks: requires additional overhead for INSERTs, UPDATEs, MERGEs, and DELETEs to the base table; imposes some restrictions on operations performed on the base table.
Data placement

Teradata rows are hashed across the AMPs of a system using the row hash value of their primary index as the hash key. The system also uses the row hash of the primary index to retrieve a row. Figure 13 shows how a row is hashed and assigned to an AMP.
By carefully choosing the primary index for each table, you can ensure that rows that are frequently joined hash to the same AMP, eliminating the need to redistribute the rows across the BYNET in order to join them. Figure 14 shows how you can set up rows from commonly joined tables on the same AMP to ensure that a join operation avoids being redistributed across the BYNET.
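A minimal sketch of PI-based co-location, assuming hypothetical customer and orders tables (names are illustrative): giving both tables the same primary index column causes matching rows to hash to the same AMP, so the join below needs no row redistribution.

    CREATE TABLE customer
      (cust_id    INTEGER NOT NULL,
       cust_name  VARCHAR(50))
    UNIQUE PRIMARY INDEX (cust_id);

    CREATE TABLE orders
      (order_id    INTEGER NOT NULL,
       cust_id     INTEGER NOT NULL,
       order_total DECIMAL(10,2))
    PRIMARY INDEX (cust_id);                       -- same PI column as customer

    SELECT c.cust_name, SUM(o.order_total)
    FROM customer c
    JOIN orders o ON c.cust_id = o.cust_id         -- AMP-local join on the shared PI
    GROUP BY c.cust_name;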
Because of its shared-nothing architecture, each AMP in a Teradata system exclusively controls its own virtual disk space. As a result, each row is owned by exactly one AMP. That AMP is the only one in the system that can create, read, update, or lock its data. The AMP-local control of logging and locking not only enhances system parallelism, but also reduces BYNET traffic significantly. Figure 15 shows how local autonomy provides each AMP (AMP 92 in this particular example) with total accountability for its own data.
All aspects of the Teradata architecture were designed with parallelism in mind, and the hash-based row distribution and AMP-local processing work hand in hand with the BYNET to optimize parallel processing. By co-locating prospective join rows on the same AMP and then processing tasks involving them directly on and by that AMP, BYNET traffic is minimized. Figure 16 diagrams the BYNET for a 64-node Teradata system configuration. Each system node has multiple BYNET paths to every other node in the system.
Thus, Teradata optimizes BYNET communication by implementing a data communications protocol specifically designed for parallel query processing rather than a general-purpose packet switching protocol like TCP/IP.
An optimizer that estimates operational costs based on a single-threaded model will make inappropriate decisions for a parallel processing system. Using the previous table as a guide, you can easily see that any such inappropriate decisions are magnified exponentially as the number of tables to be joined increases. Among the demographics and statistics available to the Optimizer are the number of AMPs per node in the system, table cardinalities, and how many of those rows are likely to be touched for a particular operation. From this, the Optimizer knows the number of rows each AMP needs to manage, which it can use to manage file system I/Os most effectively. The Optimizer determines an AMP-to-CPU ratio that permits it to compare the number of AMPs per node as a function of available CPU power to build the most efficient query plan possible. For example, operations that are more CPU-intensive, like a product join, can be avoided and less CPU-intensive operations substituted for them when possible. This means that the Optimizer will select a product join less frequently in a configuration that has less powerful or fewer CPUs than for a system that has a larger AMP-to-CPU ratio or more powerful CPUs. Another example of the parallel awareness of the Teradata Database query optimizer is deciding whether to redistribute rows for a join operation. The Optimizer selects row redistribution less often in a configuration with many AMPs than it does for a system with few AMPs, because row redistribution is an activity that impinges heavily on BYNET traffic17 as well as being a CPU-intensive operation.
Request parallelism
Request parallelism is multidimensional in Teradata Database. Among the more important of these dimensions are the following:

Query parallelism
Within-step parallelism
Multistep parallelism
Multistatement request parallelism

Query (request) parallelism

Teradata Database request parallelism is enabled by the hash distribution of rows across all AMPs in the system. All relational operations perform in parallel, simultaneously, and unconditionally across the AMPs, and each operation on an AMP is performed independently of the data on other AMPs in the system.

Within-step parallelism

When the Optimizer generates a query plan, it parcels out the components of a query into a number of suboperations referred to as concrete steps, which are then dispatched to the appropriate AMPs for execution. An individual step can perform a substantial quantity of work. Within an individual step, multiple relational operations are pipelined for parallel processing. For example, while a table scan is occurring, rows that have already been selected can be pipelined into an ongoing join process. The Optimizer chooses the mix of relational operations within a step very carefully to avoid the possibility of stalls within the pipeline. Figure 18 illustrates how the operations performed by a single step can be pipelined within an AMP.
17 See Teradata SQL Reference: Statement and Transaction Processing for additional details on how Teradata performs query optimization.
Multistep parallelism

Teradata Database is the only commercially available relational database management system to implement multistep parallelism.18 Whenever possible, Teradata Database invokes one or more processes for each step in a request to perform an operation. Multiple steps for the same request execute simultaneously as long as they are not dependent on the results of previous steps. You can recognize multistep parallelism in EXPLAIN reports by the dot notation used for the parallel steps, as seen in Figure 19.
18 Multistep parallelism is the simultaneous performance of multiple query steps across all units of parallelism in the system.
This diagram shows the steps the Optimizer might generate for a 4-AMP system configuration. Note the multistep parallelism in steps 1 and 2, where steps 1.1 and 1.2 run in parallel with one another, as do steps 2.1 and 2.2.

Multistatement request parallelism

Multistatement requests are a Teradata SQL extension that bundles any number of distinct SQL statements together in such a way that the Optimizer treats them as a single unit. Teradata Database always attempts to perform the SQL statements in a multistatement request in parallel. An example of multistatement parallelism is common subexpression elimination, an operation in which the Optimizer collapses any subexpressions that are common to the SQL statements in the request and performs the extracted operation one time only. For example, suppose you bundle six SELECT statements together in one multistatement request, each containing a common subquery. That common subquery is executed only once and the result substituted back into the respective individual select operations. Even though the individual SQL statements within a multistatement request are performed in an interdependent, overlapping set of operations, each returns its own, distinct answer set.
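A minimal sketch, assuming the job is submitted through BTEQ, where ending one statement with a semicolon placed at the start of the next statement's line bundles the two into a single multistatement request (the sales table is the illustrative one used earlier):

    SELECT COUNT(*)     FROM sales WHERE sale_date = DATE '2007-06-01'
    ;SELECT SUM(amount) FROM sales WHERE sale_date = DATE '2007-06-01';

Both SELECTs are optimized together as one request, yet each returns its own answer set.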
This elegant solution is possible because relational theory holds that the order in which rows are retrieved from a table must be considered irrelevant. In other words, unlike in Oracle, the SQL developer must not assume a priori knowledge of the sequence in which rows will be retrieved. That is why the ORDER BY clause exists!

Spool file reuse

The Teradata Database always reuses the intermediate answer sets referred to as spool files within a request if they are needed at a later point to process the same query. Two common examples of spool file reuse are table self-joins and correlated subqueries.

Synchronized BYNET operations

Several Teradata Database features act to minimize message flow within the database software. Primary among them are dynamic BYNET groups and global semaphores. A dynamic BYNET group is an ad hoc grouping of the AMPs working on a single query step. Any number of dynamic BYNET groups can coexist simultaneously. Dynamic BYNET groups are established when the first step of an optimized query is about to be dispatched to the AMPs. Before the first step is dispatched, a message is sent across the BYNET to only those AMPs that will be involved in processing the step. As a result, a dynamic BYNET group might be composed of a single AMP, a group of AMPs, or all AMPs in the system. When an AMP acknowledges reception of the message, it is enrolled in the dynamic BYNET group, sparing the database software from having to initiate a separate communication. Success and failure global semaphores are associated with dynamic BYNET groups. These objects exist in the BYNET software and signal the completion (or failure) of a step by the first and last AMPs in the dynamic BYNET group. Each success semaphore maintains a count of the number of AMPs in the group still processing the step. As each AMP reports its success, the semaphore count is reduced by 1, so when the last AMP in the group reports, the value of the semaphore is 0. Each AMP reads the value of the success semaphore when it finishes processing the step, and when its value reaches 0, that AMP knows it is the last in the group to complete the task. At that point, the AMP sends a message back to the
Dispatcher to notify it that all AMPs in the dynamic BYNET group have finished processing the step, so it can then send the next step to the group. This is the only message sent to the Dispatcher for each step irrespective of the number of AMPs in the dynamic BYNET group. This is illustrated by Figure 21.
Failure semaphores are the mechanism the Teradata Database uses to support global abort processing. A failure semaphore signals to the other AMPs within a dynamic BYNET group that one of its members has experienced a serious error condition, so they can abort the step and free system resources immediately for other uses. With success and failure semaphores, no external coordinator is required. Global synchronization is built into the Teradata architecture, and because there is no single point of control, performance can be scaled up easily to handle an increasing volume of system users.
19 For this discussion, the Archive/Recovery utility is not considered a true data movement tool.
Teradata FastLoad
Teradata MultiLoad
Teradata Parallel Data Pump (TPump)
Teradata Parallel Transporter (TPT)
BTEQ
BTEQ (pronounced bee-teek) is an abbreviation of Basic Teradata Query. It is a general-purpose, command-based program that allows users, usually on a workstation,20 to communicate with one or more Teradata Database systems, and optionally to format reports for both print and screen output. Using BTEQ you can submit SQL queries to the Teradata Database. BTEQ formats the results and returns them to the screen, a file, or a designated printer. BTEQ commands and operating system (IBM VM, IBM MVS, DOS, UNIX, Linux) commands can also be entered. A BTEQ session provides a quick and easy way to access a Teradata Database. In a BTEQ session, you can do the following:

Enter Teradata SQL statements to view, add, modify, and delete data.
Enter BTEQ commands.
Enter operating system commands.
Create and use Teradata stored procedures.

BTEQ commands perform data control functions; they do not act directly on data. BTEQ commands perform four types of activities:

File Control: Specifies input and output formats and identifies information sources and destinations.
Format Control: Controls the format of screen and printer output.
Sequence Control: Controls the sequence in which other BTEQ commands and Teradata SQL statements are executed within scripts and macros.
Session Control: Begins and ends BTEQ sessions and controls session characteristics.

BTEQ example

Use the following BTEQ input stream to produce a report similar to that shown in Figure 22:

.LOGON userid,password;
DATABASE PERSONNEL;
.SET FORMAT ON
.SET WIDTH 80
.SET HEADING 'Total Salaries by Location, Department'
.SET FOOTING '&DATE &TIME||Confidential'
.SET SUPPRESS ON 1,2
SELECT Loc (TITLE 'Location')
     , Department.DeptNo (TITLE 'Dept.//No.')
     , Name (TITLE 'Employee//Name')
     , JobTitle (TITLE 'Position')
     , Salary
     , YrsExp (TITLE 'Years//Experience')
FROM Department
20 You install and run BTEQ on the client portion of either a channel-attached or a network-attached system.
     , Employee
WHERE Loc IN ('NYC', 'ATL')
  AND Salary > 15000
  AND Department.DeptNo=Employee.DeptNo
ORDER BY Loc
     , Department.DeptNo
     , Name
WITH SUM(Salary) (TITLE 'Total for Department &2')
     , SUM(YrsExp) (TITLE ' ', FORMAT 'zz9')
     BY Loc, Department.DeptNo
WITH SUM(Salary) (TITLE 'Total for Location &1')
     , SUM(YrsExp) (TITLE ' ', FORMAT 'zz9')
     BY Loc
WITH SUM(Salary) (TITLE 'GRAND TOTAL')
     , SUM(YrsExp) (TITLE ' ', FORMAT 'zz9')
;
.LOGOFF
.EXIT

where:

LOGON logs you onto a BTEQ session.
SET FORMAT activates BTEQ format commands.
SET WIDTH centers the report on 80 characters (narrow printer paper).
SET HEADING centers the specified heading on the report page.
SET FOOTING specifies the footing on one line in three parts: the first part contains the current date, the second contains the current time, and the third is the word Confidential.
SET SUPPRESS suppresses repeating values in column 1 (entitled Location) and column 2 (entitled Dept. No.).
LOGOFF terminates the Teradata Database session without exiting BTEQ.
The BTEQ substitution feature (&) inserts the column 2 value in the first subtotal line, and the column 1 value in the second subtotal line.
FastExport
FastExport is a command-driven utility that uses multiple sessions to quickly transfer large amounts of data from tables and views of the Teradata Database to a client-based application. You can export data from any table or view to which you have the SELECT access privilege. The destination for the exported data can be:

A file on your channel-attached or network-attached client system
An Output Modification (OUTMOD) routine you write to select, validate, and preprocess the exported data
FastExport processes a series of FastExport commands and Teradata SQL statements. These direct FastExport to:

Log you on to the Teradata Database for a specified number of sessions, using your username, password, and tdpid/acctid information
Retrieve the specified data from the Teradata Database, in accordance with your format and selection specifications
Export the data to the specified file or OUTMOD routine on your client system
Log you off the Teradata Database

The FastExport commands provide the session control and data handling specifications for the data transfer operations. The Teradata SQL statements perform the actual data export functions on the Teradata Database tables and views.

FastExport example

Figure 23 shows a FastExport job script example that executes a single SELECT statement and returns the results to a data set on the client system.
FastLoad
Teradata FastLoad is a command-driven utility you can use to quickly load large amounts of data into an empty table on a Teradata Database. You can load data from:
Disk or tape files on a channel-attached client system
Input files on a network-attached workstation
Special input module (INMOD) routines you write to select, validate, and preprocess input data
Any other device providing properly formatted source data

Teradata FastLoad uses multiple sessions to load data. However, it loads data into only one table on a Teradata Database per job. If you want to load data into more than one table in the Teradata Database, you must submit multiple Teradata FastLoad jobs, one for each table. When you invoke Teradata FastLoad, the utility executes the Teradata FastLoad commands and Teradata SQL statements in your Teradata FastLoad job script. These direct Teradata FastLoad to:

Log you on to the Teradata Database for a specified number of sessions, using your username, password, and tdpid/acctid information.
Load the input data into the Teradata FastLoad table on the Teradata Database.
Log you off from the Teradata Database.
If the load operation was successful, return the following information about the Teradata FastLoad operation and then terminate: total number of records read, skipped, and sent to the Teradata Database; number of errors posted to the Teradata FastLoad error tables; number of inserts applied; and number of duplicate rows.
FastLoad example
MultiLoad
The MultiLoad utility is an efficient way to deal with batch maintenance of large databases. MultiLoad is a command-driven utility for fast, high-volume maintenance on multiple tables and views of a Teradata Database. A single MultiLoad job performs a number of different import and delete tasks on database tables and views:

Each MultiLoad import task can do multiple data insert, update, and delete functions on up to five different tables or views.
Each MultiLoad delete task can remove large numbers of rows from a single table.

Use MultiLoad to import data from:

Disk or tape files (using a custom Access Module) on a channel-attached client system
Input files on a network-attached workstation
Special input module (INMOD) programs you write to select, validate, and preprocess input data
Access modules
Any device providing properly formatted source data

The table or view in the database receiving the data can be any existing table or view for which you have access privileges for the maintenance tasks you want to do. When you invoke MultiLoad, the utility executes the MultiLoad commands and Teradata SQL statements in your MultiLoad job script. Figure 25 describes the phases of MultiLoad operations.
Preliminary: Parses and validates all of the MultiLoad commands and Teradata SQL statements in your MultiLoad job; establishes sessions and process control with the Teradata Database; submits special Teradata SQL requests to the Teradata Database; creates and protects temporary work tables and error tables in the Teradata Database.
DML transaction: Submits the DML statements specifying the insert, update, and delete tasks to the Teradata Database.
Acquisition: Imports data from the specified input data source; evaluates each record according to specified application conditions; loads the selected records into the worktables in the Teradata Database. (There is no acquisition phase activity for a MultiLoad delete task.)
Application: Acquires locks on the specified target tables and views in the Teradata Database; for an import task, inserts the data from the temporary work tables into the target tables or views in the Teradata Database; for a delete task, deletes the specified rows from the target table in the Teradata Database; updates the error tables associated with each MultiLoad task; forces an automatic restart/rebuild if an AMP went offline and came back online during the application phase.
Clean-up: Releases all locks on the target tables and views; drops the temporary work tables and all empty error tables from the Teradata Database; reports the transaction statistics associated with the import and delete tasks.
MultiLoad example
TPump
The Teradata Parallel Data Pump (TPump) is a data loading utility that helps you maintain (update, delete, insert, and atomic upsert) the data in your Teradata Database. TPump allows you to achieve near-real time data in your data warehouse.
TPump uses standard Teradata SQL to achieve moderate to high data loading rates to the Teradata Database. Multiple sessions and multistatement requests are typically used to increase throughput. TPump provides an alternative to Teradata MultiLoad for the low-volume batch maintenance of large databases under control of a Teradata system. Instead of updating Teradata Databases overnight, or in batches throughout the day, TPump updates information in real time, acquiring data from the client system with low processor utilization. It does this through a continuous feed of data into the data warehouse, rather than through traditional batch updates. Continuous updates result in more accurate, timely data. Unlike most load utilities, TPump uses row hash locks rather than table-level locks. This allows you to run queries while TPump is running. This also means that TPump can be stopped instantaneously. TPump provides a dynamic throttling feature that enables it to run all out during batch windows, but within limits when it may impact other business uses of the Teradata Database. Operators can specify the number of statements run per minute, or may alter throttling minute-by-minute, if necessary. TPump's main attributes are the following:

Simple, hassle-free setup: does not require staging of data, intermediary files, or special hardware.
Efficient, time-saving operation: jobs can continue running in spite of database restarts, dirty data, and network slowdowns. Jobs restart without intervention.
Flexible data management: accepts an infinite variety of data forms from an infinite number of data sources, including direct feeds from other databases. TPump is also able to transform that data on the fly before sending it to Teradata. SQL statements and conditional logic are usable within the utilities, making it unnecessary to write wrapper jobs around the utilities.

TPump example

The following example shows what a simple TPump script and its corresponding output might look like. The lines that begin with 4-digit numbers (for example, 0001) are script lines; the rest are output.
TPT
The Teradata Parallel Transporter21 (Teradata PT, or TPT22) is a comprehensive, multi-function product for handling all facets of loading, retrieving, processing, and transferring data within the Teradata Database and between external databases and the Teradata Database. Teradata PT is packaged as an object-oriented client application suite that provides parallel extract and load capabilities that can be extended with third-party products or customizations. Teradata PT combines a parallel execution structure, process-specific operators, an application programming interface (API), a graphical user interface (GUI), and a log-viewing service that work together to execute multiple instances of data extraction, transformation, and loading functions in a scalable, high-speed, parallel processing environment:

Teradata PT combines and expands on the functionality of the traditional Teradata extract/load utilities (FastLoad, MultiLoad, FastExport, and TPump, also known as standalone utilities) into a single product through the use of a single scripting language.
Jobs are run using operators, which are discrete object-oriented modules that perform specific extract and load processes.
Teradata PT can be invoked with scripts or with the Teradata PT API, which allows third-party applications to directly execute Teradata PT operators.
21 Teradata Parallel Transporter User Guide, Release 12.0 and Teradata Parallel Transporter Reference, Release 12.0
22 Teradata PT and TPT will be used interchangeably in this document.
A GUI-based Teradata PT Wizard is available to generate simple scripts.

Major TPT features

Following are the key features of Teradata PT:

SQL-like Job Script Language: Unlike the traditional Teradata standalone utilities, which each use their own script language, Teradata PT uses a single script language to specify export, transform, and load (ETL) operations. This language is a combination of SQL and a syntactically similar proprietary language, sometimes referred to as Teradata SQL.
Multi-Step Loading: A single script can contain multiple job steps, each performing a separate load or unload function. This ability dramatically increases the potential for creating complex jobs with a single script. Teradata PT can simultaneously load data from multiple and dissimilar sources in a single job, and execute multiple instances of most operators. It can export, transform, and load data to multiple targets in a single job. It can perform inline filtering and transformations.
Increased Throughput: In addition to allowing the multi-session capabilities of the Teradata standalone utilities, Teradata PT permits multiple instances of operators to access multiple sources and multiple targets in a single job. Teradata PT also automatically distributes input and output data into data streams that can be shared with multiple operator instances. The result is increased throughput capacity and performance.
Checkpoints and Restarts: In addition to manual or client restarts, Teradata PT can automatically resume jobs from the last checkpoint if a job fails.
Direct API: The Teradata PT API allows developers to create a direct program-to-program calling structure that interfaces with the load and unload protocols of the Teradata standalone utilities. Using the C or C++ languages with the API, developers can create third-party tools that can load and unload Teradata tables without scripts.
Reduced File Storage: For intermediary steps that require temporary storage, Teradata PT stores data in buffers, called data streams, eliminating the need to write temporary data to flat files. This capability permits large amounts of data to be transferred from sources to targets without file size limits imposed by system resources or the operating system.
Teradata PT Wizard: For help in creating, managing, and running simple Teradata PT scripts, use the Teradata PT Wizard. The Wizard steps through the process of specifying source data, destination data, and operators. Scripts generated by the Wizard can be copied into other scripts. Scripts in the Wizard can be run immediately or saved for later use.
Reusability: Operators are reusable components that can be combined in many ways to address a wide variety of extract, transform, and load (ETL) operations. For example, producer operators and consumer operators can work together as long as the output and input schema of interconnected operators match.

TPT example

Figure 28 shows an example using the Load and DataConnector operators.
ETL design
ETL APIs
The major database vendors, such as IBM, Oracle, and Teradata, provide a formal methodology for third-party developers to create interfaces between applications and the database. Creating a highly structured environment for building such tools eases the development task:

An open standard reduces research and development costs to integrate with the database
A robust API facilitates easier access to key database functions
Better control of the runtime environment simplifies the overall management processes
Approaches to providing useful APIs vary widely from vendor to vendor. However, in order to discuss the concept, the remainder of this section will discuss the highly mature Teradata Parallel Transporter Application Programming Interface (Teradata PT API). This short introduction will then be followed by a review of Informatica's PowerExchange for Teradata Parallel Transporter.
In an API environment, though, the third-party application and the vendor-specific utilities can be integrated into a single application and run as one job (see Figure 30).
User Guide: Informatica PowerExchange for Teradata Parallel Transporter (version 8.6.1). It is also assumed that the reader has knowledge of and experience with Informatica fundamentals of operation.
Please refer to Teradata for Oracle Developers Research Paper for a more detailed discussion of Informatica PowerExchange for Teradata Parallel Transporter.
Sequences
When designing a database, simple, unique sequence numbers are frequently required. Such an object starts at one and is incremented by some interval each time it is used. The most common use is to provide a unique identifier for some database object, such as a table row's primary key. Both the Oracle database and the Teradata database provide built-in facilities to generate sequences. In most cases, this eliminates the requirement to produce sequences programmatically. However, basic philosophical differences produce vastly different implementations.
Oracle sequences
Oracle sequences24 are database objects from which multiple users can generate unique integers. The sequence generator provides a sequential series of numbers. It is especially useful in multiuser environments for generating unique sequential numbers without the overhead of disk I/O or transaction locking. For example, assume two users are simultaneously inserting new employee rows into the employees table. By using a sequence to generate unique employee numbers for the employee_id column, neither user has to wait for the other to enter the next available employee number; the sequence automatically generates the correct values for each user. The sequence generator therefore reduces serialization where the statements of two transactions must generate sequential numbers at the same time. By avoiding the serialization that results when multiple users wait for each other to generate and use a sequence number, the sequence generator improves transaction throughput, and a user's wait is considerably shorter. Please refer to the Teradata for Oracle Developers Research Paper for a more detailed discussion of Oracle sequences.
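A minimal sketch of the usual Oracle pattern (the sequence name and CACHE setting are illustrative; the employees table and employee_id column are the ones mentioned above):

    CREATE SEQUENCE employee_seq
      START WITH 1
      INCREMENT BY 1
      CACHE 20;

    INSERT INTO employees (employee_id, last_name)
    VALUES (employee_seq.NEXTVAL, 'Smith');   -- each session gets the next unique number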
24 Oracle Database Administrator's Guide 11g Release 1 (11.1); Oracle Database Concepts 11g Release 1 (11.1)
25 Teradata SQL Reference: Data Definition Statements Release 12.0
Teradata identity columns25

IDENTITY is an optional attribute of the Teradata SQL CREATE TABLE statement that causes the system to generate a number for every row inserted into the data table on which it is defined. The identity column does not have to be the first column in the table or be defined as an index. To make the identity column numbers unique, you must use the GENERATED ALWAYS and NO CYCLE options of CREATE TABLE. You can then avoid load preprocessing by eliminating duplicates when loading tables and achieve uniqueness of PK values without incurring the overhead of a UNIQUE constraint. When you use an identity column as the PI, you can accomplish the following:

Achieve row uniqueness without the need for a composite index.
Ensure row uniqueness during a merge of several tables that each has a UPI. The identity column PI is the UPI for the final result table.

When a table has an identity column, the system generates identity column numbers for rows inserted by any of the following operations:

Single INSERT statements made through multiple concurrent sessions (BTEQ IMPORTs into the same table)
Multiple INSERT statements made through multiple concurrent sessions (TPump inserts)
INSERT-SELECT statements
Atomic UPSERTs (with a non-PI identity column)
Single-row MERGE-INTO

The GENERATED AS IDENTITY keywords introduce a clause that specifies the following:

The column is an identity column.
The system generates values for the column when a new row is inserted into the table, except in certain cases described later.
If you specify GENERATED ALWAYS, then the system always generates a unique value for the column when a new row is inserted into the table and NO CYCLE is specified.26 If you specify GENERATED BY DEFAULT, then the system generates a unique value for the column when a new row is inserted into the table only if the INSERT statement does not specify a value for the column.27
26 Note, however, that if you load the same row twice into an identity column table, it is not rejected as a duplicate because it is made unique as soon as an identity column value is generated for it. This means that some preprocessing must still be performed on rows to be loaded into identity column tables if real-world uniqueness is a concern.
27 The generated value is guaranteed to be unique within the set of generated values only if you specify the NO CYCLE option.
The choice between GENERATED ALWAYS and GENERATED BY DEFAULT depends on the purpose of the identity column: ensuring a UPI, USI, PK, or some other row-uniqueness property; loading data into or unloading data from a table; copying rows from one table to another table; filling in gaps in the sequence; or reusing numbers that once belonged to now-deleted rows.
Process for generating identity column numbers

The system allocates identity column numbers differently depending on whether an operation is a single-row or USING clause-based insert or an INSERT ... SELECT insert. In either case, Teradata Database uses a batch numbering scheme to generate identity column numbers. When the initial batch of rows for a bulk insert arrives at a PE or AMP, the system reserves a range of numbers before it begins to process the rows. Each PE or AMP retrieves the next available value for the identity column from dictionary table DBC.IdCol and immediately increments it by the value of the setting for the IdCol Batch Size flag in the DBSControl record. Once a range of numbers has been reserved, the system stores the first number in the range in a vproc-local identity column cache. Different tasks doing concurrent inserts on the same identity column table allot a number for each row being inserted and increment it in the cache. When the last reserved number has been issued, the PE or AMP reserves another range of numbers and updates the entry for this identity column in DBC.IdCol. This process explains the following apparent numbering anomalies:

Because the Teradata architecture is highly parallel, generated identity column numbers do not necessarily reflect the chronological order in which rows are inserted.
Sequential numbering gaps can occur.
Because the cached range of reserved numbers is not preserved across system restarts, exact increments cannot be guaranteed. For example, the identity column values for 1,000 rows inserted into a table with an INCREMENT BY value of 1 might not be numbered from 1 to 1,000 if a system restart occurs before the identity column number pool is exhausted.

Please see the Teradata for Oracle Developers Research Paper for more details on Teradata identity columns.
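A minimal sketch of an identity column used as the PI (the employee table and its columns are illustrative):

    CREATE TABLE employee
      (emp_id     INTEGER GENERATED ALWAYS AS IDENTITY
                    (START WITH 1
                     INCREMENT BY 1
                     NO CYCLE),
       last_name  VARCHAR(30),
       dept_no    INTEGER)
    UNIQUE PRIMARY INDEX (emp_id);

    INSERT INTO employee (last_name, dept_no)
    VALUES ('Smith', 42);          -- emp_id is generated by the system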
Data compression
Is data compression really necessary? Given the ongoing reduction in storage and processor costs, the answer is: not nearly as much as it used to be. Today, the primary motivations to implement data compression are usually:

Requirements to store extremely large amounts of data
The desire to minimize the amount of data archive media
The use of data encryption techniques, which are frequently related to compression algorithms
Oracle database supports multiple types of compression:

Table: The Oracle Database table compression feature compresses data in heap-organized tables by eliminating duplicate values in a database block.
Key: Key compression lets you compress portions of the primary key column values in an index or index-organized table, which reduces the storage overhead of repeated values.

Please see the Teradata for Oracle Developers Research Paper for a detailed discussion of Oracle compression.
29 See Teradata SQL Reference: Stored Procedures and Embedded SQL Release 12.0 for stored procedure syntax; see Teradata SQL Reference: Data Manipulation Statements Release 12.0 for details on executing stored procedures via the SQL CALL statement.
30 See Teradata SQL Reference: Data Definition Statements Release 12.0 for details on macros.
Teradata compression
The term compression31 is used to mean two entirely different things in the Teradata system. Both forms are lossless, meaning that the original data can be reconstructed exactly from its compressed form:

When describing compression of hash and join indexes, compression refers to a logical row compression in which multiple sets of nonrepeating column values are appended to a single set of repeating column values. This allows the system to store the repeating value set only once, while any nonrepeating column values are stored as logical segmental extensions of the base repeating set.
When describing compression of column values, compression refers to the storage of those values one time only in the table header, not in the row itself, and pointing to them by means of an array of presence bits in the row header.

This section deals with the second definition of compression in the preceding list.

Column compression

Note that with this kind of data compression, there is essentially no uncompression necessary to access compressed data values. The system does not need to uncompress blocks or other large chunks of data to be able to access a single row or value. This removes the significant CPU cost tradeoff common to most compression implementations. Because the compression used by Teradata Database is value-based, rather than block-based, run-length based, or any of the other common compression schemes, and because the list of compressed data values is memory-resident any time the table is being accessed, the system can access compressed values in two ways, depending on the presence bits for the column:

Using a pointer reference to the value in the current row
Using a pointer reference to the value in the value list

There is a small cost for compressing a value when a row is stored for the first time, and a one-time cost to convert an existing uncompressed column to a compressed column. But for queries, even those made against small tables, compression is a net win as long as the chosen compression reduces the size of the table. With respect to compressed spool files, if a column is copied to spool with no expressions applied against it, then the system copies just the compressed bits into the spool file, saving on both CPU and I/O costs. Once in spool, compression works exactly as it does in a base table: there is a compress value list in the table header of the spool file, which is memory-resident while the system is operating on the spool. The column attributes COMPRESS and NULL are useful for minimizing table storage space. You can use these attributes to selectively compress as many as 255 distinct, frequently repeated column values, to compress all nulls in a column, or both. Compression has two principal applications:

Reducing storage costs
Enhancing system performance
31 Teradata Database Design, Release 12.0; Teradata SQL Data Types and Literals, Release 12.0
Compression reduces storage costs by storing more logical data per unit of physical capacity. Optimal application of compression produces smaller rows, which in turn results in more rows stored per data block and fewer data blocks. Similarly, compression enhances system performance because there is less physical data to retrieve per row for queries. Additionally, because compressed data remains compressed while in memory, the PDE file segment (FSG) cache32 can hold more logical rows, thus reducing disk I/O.
Compression is transparent to applications, ETL utilities, ad hoc queries, and views. Experience with real-world customer production databases with very large tables indicates that compression produces performance benefits for a table even when more than 100 of its columns have been compressed. There does not appear to be a downside to properly employing this technique.
Column compression provides the following capacity and performance benefits:
Enhanced storage capacity
Improved response time for table scans
Reduced disk I/O traffic
Moderate to little CPU savings
Compressed values are stored in the table header. Each table has only one 64 or 128 Kbyte table header. Depending on table specifications for factors such as journaling, fallback, logging, disk I/O integrity, indexes, and other things, you might find that you must limit the number of values you want to compress.
How Teradata Database compresses data
Teradata Database uses a non-adaptive, lossless, corruption-resistant compression algorithm called the Dictionary Index method to compress values independently on a column-by-column basis. Lossless means that although the data is compacted, there is no loss of information, as there is, for example, with many audio and video compression schemes.
Unlike many common compression algorithms, Teradata Database does not replace values to be compressed with encoded representations of those values. Instead, it stores one copy of a compressed value per column in the table header and nothing at all in any row that contains that value. The mechanism for resolving which compressed values belong to which rows is based on presence bits that index particular values stored in a well-known place in the table header.
The granularity of Teradata Database compression is the individual field of a row, which is the finest level possible. Field compression optimizes concurrency and offers superior performance for query and update processing when compared to row-level or block-level compression schemes. Row- and block-level compression methods require additional system resources to uncompress the row or block whenever it might contribute to a result set, whether it eventually contributes to the result or not. Furthermore, field compression allows compression to be optimized for the data type of each column.
When you designate a set of column values for compression, the system adds space to Field 5 of the appropriate table header33 to store the compressed value set. Because Teradata Database compression is completely internal to the system, it is transparent to ETL operations, queries using base table access, queries using view access, and all application software.
32 The FSG cache buffers disk I/O. FSG stands for File SeGment.
33 This does not apply to null compression, which is handled at the level of the presence bits array in the row header.
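One way to observe the space effect described above is to compare the permanent space consumed by compressed and uncompressed copies of the same data. The sketch below queries the DBC.TableSize view; the database and table names are hypothetical placeholders, so treat this as illustrative rather than a prescribed procedure.

    -- Compare physical space for uncompressed and compressed versions of a table
    SELECT TableName,
           SUM(CurrentPerm) AS current_perm_bytes
    FROM   DBC.TableSize
    WHERE  DatabaseName = 'sales_db'
      AND  TableName IN ('employee_nocompress', 'employee_compress')
    GROUP  BY TableName
    ORDER  BY TableName;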
Default settings for NULL and COMPRESS
Unless you explicitly specify different options, the following attributes are defined by default for any column in a Teradata Database table:
NULL: nulls are allowed and are compressed.
COMPRESS: compression is not in effect.
Meaning of the COMPRESS attribute in table DDL
The effect of a COMPRESS specification depends on the format of the definition.
COMPRESS with no argument: all nulls for the column are compressed to zero space.
COMPRESS followed by one or more constants: each occurrence of a specified constant is compressed to zero space, and all nulls for the column are compressed to zero space.
The following CREATE TABLE fragment specifies that all occurrences of the job titles cashier, manager, and programmer in the jobtitle column, as well as all nulls, are to be compressed to zero space. This definition saves 30 bytes for each row in which an employee has a null jobtitle or one of the job titles cashier, manager, or programmer.
CREATE TABLE employee (
  employee_number INTEGER,
  ...
  jobtitle CHARACTER(30) COMPRESS ('cashier', 'manager', 'programmer'),
  ...
);
Please see the Teradata for Oracle Developers Research Paper for additional information regarding Teradata compression.
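The two forms of COMPRESS can also be combined in one table definition. The fragment below is a minimal sketch under the assumption of hypothetical table and column names; it shows null-only compression, multi-value compression, and value compression on a NOT NULL column.

    CREATE TABLE employee_address (
      employee_number INTEGER NOT NULL,
      middle_name     CHARACTER(30) COMPRESS,                      -- nulls only
      state_code      CHARACTER(2)  COMPRESS ('CA', 'NY', 'TX'),   -- listed values and nulls
      country_code    CHARACTER(2)  NOT NULL COMPRESS ('US')       -- listed value only; nulls not allowed
    );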
Teradata does not have DUAL.34 Instead, you can simply write "SELECT current_date" and leave it at that. You can also write "SELECT table.column" rather than "SELECT column FROM table". You can even write "SELECT table1.column FROM table2", which is equivalent to "SELECT table1.column FROM table1, table2" in Oracle.
A Teradata database is more like a schema in Oracle. Rather than issuing an ALTER SESSION SET CURRENT_SCHEMA=..., you do a SET DATABASE ...
There is no ROWNUM in Teradata. Instead you write "SELECT TOP 5 * FROM table". If you write "SELECT TOP 5 * FROM table ORDER BY column", the ORDER BY is applied first and then the first five rows are returned.
Like pretty much everyone but Oracle, Teradata differentiates between a zero-length string and a null value.
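A side-by-side sketch of the DUAL and ROWNUM points above; the emp table and salary column are hypothetical names used only for illustration.

    -- Oracle
    SELECT sysdate FROM dual;
    SELECT * FROM emp WHERE ROWNUM <= 5;            -- ROWNUM is assigned before any ORDER BY

    -- Teradata
    SELECT CURRENT_DATE;                            -- no DUAL table needed
    SELECT TOP 5 * FROM emp ORDER BY salary DESC;   -- ORDER BY is applied first, then TOP 5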
Teradata also has a QUALIFY clause. Just as HAVING is like a WHERE for a GROUP BY, QUALIFY is like a WHERE for an analytic function, so you can write
SELECT id, start_date, end_date
FROM table
QUALIFY RANK() OVER (PARTITION BY id ORDER BY start_date DESC) = 1
rather than making the whole thing an inline view.
When using Teradata mode (not ANSI mode), comparisons of character strings ignore case; in other words, ABC = abc is TRUE. Case specificity must be specifically enabled. (See the Teradata for Oracle Developers Research Paper for a detailed case-specificity discussion.)
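For comparison, a rough sketch of the Oracle-style inline view that QUALIFY makes unnecessary; the table name history and alias h are hypothetical stand-ins for the generic table in the example above.

    SELECT id, start_date, end_date
    FROM  (SELECT h.id, h.start_date, h.end_date,
                  RANK() OVER (PARTITION BY h.id ORDER BY h.start_date DESC) AS rnk
           FROM   history h)
    WHERE  rnk = 1;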
Appendices
Please see the Appendices in the Teradata for Oracle Developers Research Paper for details regarding:
Teradata reserved words and keywords
Teradata EXPLAIN command
Teradata maxima
Teradata system-derived and system-generated column data types
Naming conventions
34 http://igor-db.blogspot.com/2007/10/oracle-coder-in-teradata-world.html
Teradata utilities
Figure 35 Teradata Release 12.0 documentation
This book lists and explains the messages produced by the Teradata Database on MP-RAS systems. It also contains the PDE and Gateway messages for Microsoft Windows and Linux systems. Because of their DBC/1012 heritage, some messages still mention the DBC or DBC/1012 and use old terminology such as IFP and COP, which is now the PE. This book does not include messages without numbers.
This book provides a summary description of the command syntax for the following Teradata Database client software utilities:
Archive/Recovery (Teradata ARC)
Basic Teradata Query (BTEQ)
Teradata FastExport
Teradata FastLoad
Teradata MultiLoad
Teradata Director Program (TDP)
Teradata Parallel Data Pump (TPump)
Teradata Parallel Transporter (TPT)
This book provides information about ODBC Driver for Teradata. The ODBC Driver for Teradata enables UNIX, Linux, and Microsoft Windows operating systems to communicate with the Teradata Database across local area networks (LANs) using the open-standards ODBC interface. With the ODBC Driver for Teradata, you can connect Microsoft Windows, Linux, and UNIX applications to the Teradata Database.
This book provides information about OLE DB Provider for Teradata. The OLE DB Provider for Teradata allows programmers to design application programs that interface with the Teradata Database.
This book provides information about Teradata Driver for the JDBC Interface. The Teradata Driver for the JDBC Interface provides access to the Teradata Database using Java applications.
This manual describes the access module component of the Teradata Tools and Utilities infrastructure that links client utilities to external data source/destination storage devices. Explained are the standard software functions and protocols for providing a block-level I/O interface between the external devices and the client utility Data Connector Application Program Interface (API).
This reference details how to use the access modules that link the Teradata Tools and Utilities to external data sources and destination storage devices. For information about other third-party access modules, refer to the third-party vendor documentation.
This book provides information about Teradata Call-Level Interface Version 2 for Channel-Attached Systems (CLIv2). CLIv2 is a library of routines that enable an application program to access data on a Teradata Database. An overview of the product and its components is presented, and its operational functions and features and how they work are described.
This book provides information about Teradata Call-Level Interface Version 2 for Network-Attached Systems. Teradata Call-Level Interface Version 2 for Network-Attached Systems (CLIv2) is a library of routines that enable an application program to access data on a Teradata Database.
Interactive Teradata Query User Guide
Teradata Preprocessor2 for Embedded SQL Programmer Guide
This book provides information about the CICS Interface for application programs using IBM's Customer Information Control System/Virtual Storage (CICS/VS).
This book provides information about the Information Management System (IMS) interface to the Teradata Database.
This book provides information about Interactive Teradata Query Facility (ITEQ).
This book provides information about using the ITEQ command language.
This book provides information about Interactive Teradata Query Facility (ITEQ).
This book provides information about Teradata Preprocessor2 for Embedded SQL. PP2 is used to incorporate Structured Query Language (SQL) statements into application programs that access data in a Teradata Database.
This book provides information about Teradata Director Program (TDP). TDP is the Teradata software that resides on a mainframe client and provides channel communication between applications and a Teradata Database.
This book provides information on the Transparency Series/Application Program Interface (TS/API) product. It includes an overview of the product and its components, and it describes the operational functions and features of the product. The TS/API application program provides access to relational databases stored on the Teradata RDBMS via a selected set of Independent Software Vendors (ISV) products designed to retrieve data stored in DB2 or SQL/DS databases. TS/API intercepts database requests from applications and passes them to the Teradata RDBMS instead of to DB2 or SQL/DS. Data and error information are returned to the application in the same format used by DB2 and SQL/DS.
This book provides information about installing Teradata Tools and Utilities software on an IBM Virtual Machine (VM).
This book provides information about installing Teradata Tools and Utilities software for z/OS on an IBM-compatible mainframe.
This book provides information about installing the Teradata Tools and Utilities Release 12.00.00 software on a computer that runs a Windows operating system.
This book provides information about installing Teradata Tools and Utilities version 12.00.00 products on a client system that runs the NCR UNIX SVR4 MP-RAS, IBM AIX, HP-UX, Sun Solaris, or Linux operating system.
Storage Management Tools
Teradata Access Module for Tivoli Installation and User Guide
This book provides information about Teradata Access Module for Tivoli. The book also provides information about installing Teradata Access Module for Tivoli on Windows 2000 and 2003 operating systems. The Teradata Access Module for Tivoli is supported by ARCMAIN as an interface for backing up and restoring objects in a Teradata Database. The Teradata Access Module manages the input/output (I/O) interfaces between the Teradata ARCMAIN client utility and IBM's Tivoli Storage Manager (TSM).
This book provides information about Teradata Archive/Recovery Utility (Teradata ARC). Teradata ARC writes and reads sequential files on a Teradata client system to archive, restore, recover, and copy Teradata Database table data. Through its associated script language, it also provides an interface between Teradata's Open Teradata Backup (OTB) solutions and the Teradata Database.
This book provides information about Teradata Administrator. Teradata Administrator provides an easy-to-use Windows-based graphical interface to the Teradata Database Data Dictionary for performing a multitude of database administration tasks on the Teradata Database.
This book provides information about Teradata Dynamic Workload Manager (Teradata DWM). Teradata DWM enables the database administrator (DBA) to manage queries submitted to the Teradata Database.
This book provides information on Teradata Index Wizard. Teradata Index Wizard allows a Teradata Database administrator to create or identify a workload, perform index analysis for a workload, and verify and apply index recommendations to increase efficiency and maximize system performance.
This book provides information about installing Teradata Manager. This book provides pre-installation, installation, configuration, and troubleshooting information for Teradata Manager.
This book provides information about Teradata Manager. As the command center for the Teradata Database, Teradata Manager supplies an extensive suite of indispensable DBA tools for managing your Teradata Database. Teradata Manager collects, analyzes, and displays database performance and utilization information in either report or graphic format, displaying it all on a Windows PC. The client-server feature in Teradata Manager replicates performance data on the server for access by any number of clients.
This book provides information about Teradata Query Director (Teradata QD). Teradata Tools and Utilities are a group of products designed to work with Teradata Database. This book explains how to use Teradata QD to increase database availability and efficiency by routing sessions between two or more Teradata Databases.
This book provides information about Teradata Query Scheduler (Teradata QS). Teradata QS enables the database administrator to manage workloads submitted to the Teradata Database. This book will help the database administrator (DBA) learn, manage, and skillfully utilize the Teradata QS features.
This book provides information about the Teradata Query Scheduler (Teradata QS) client. This user guide describes the Teradata QS client components and features. Using Teradata QS, you can submit scheduled SQL requests to a Teradata QS server and also view information about your scheduled jobs.
This book provides information about Teradata SQL Assistant for Microsoft Windows. Teradata SQL Assistant is a Windows-based information discovery tool designed to retrieve, manipulate, and store data from ODBC-compliant database servers.
This book provides information about Teradata SQL Assistant/Web Edition. Teradata SQL Assistant/Web Edition is a Web-based query tool that enables you to compose a query, submit it to the Teradata Database, and view the results from a Web browser. You can then save the data on your PC for analysis.
This book provides information about Teradata Statistics Wizard. Teradata Statistics Wizard is a graphical tool that can improve the performance of queries and, as a result, the entire Teradata Database. It reduces the time to collect data and eliminates the need for constant customizing.
This book provides information about Teradata System Emulation Tool (SET). Teradata SET imitates (emulates) the Optimizer-generated data from a target system, generates query plans, and then imports that information to a test system where queries can be run without impacting the production system.
This book provides information on Teradata Visual Explain. Teradata Visual Explain adds another dimension to the EXPLAIN modifier by visually depicting the execution plan of complex SQL statements in a simplified manner. It presents a graphical view of the statement broken down into discrete steps that show the flow of data during execution.
This book provides information about Teradata Workload Analyzer. Teradata Workload Analyzer (Teradata WA) is a tool that analyzes and generates candidate workloads from a Windows PC. Teradata WA provides three major areas of guidance:
Recommending workload group definitions, guided by the DBA based on business knowledge and assisted by the Analyzer based on the existing workload mix and characteristics
Recommending appropriate workload goals
Recommending workload-to-AG mapping plus PSF weights
This book provides information about Basic Teradata Query. Teradata BTEQ software is a general-purpose, command-based program that allows users on a workstation to communicate with one or more Teradata Database systems, and to format reports for both print and screen output.
This book provides information about Teradata FastExport (FastExport). Teradata FastExport is a command-driven utility that uses multiple sessions to quickly transfer large amounts of data from tables and views of the Teradata Database to a client-based application. This book describes the operational features and capabilities of the utility, and includes the syntax for Teradata FastExport commands.
This book provides information about Teradata FastLoad (FastLoad). Teradata FastLoad is a command-driven utility that quickly loads large amounts of data into empty tables in a Teradata Database. FastLoad uses multiple sessions to load data; however, it loads data into only one table on a Teradata Database per job.
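To give a flavor of what a command-based BTEQ session looks like, the following is a minimal sketch; the logon string, database, table, and file names are placeholders rather than values from this document.

    .LOGON tdpid/username,password
    DATABASE sales_db;

    .EXPORT REPORT FILE = jobtitle_counts.txt
    SELECT jobtitle, COUNT(*) AS headcount
    FROM   employee
    GROUP  BY 1
    ORDER  BY 2 DESC;
    .EXPORT RESET

    .LOGOFF
    .QUIT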
This book provides information about Teradata MultiLoad (MultiLoad). MultiLoad provides an efficient way to deal with batch maintenance of large databases. MultiLoad is a command-driven utility for fast, high-volume maintenance on multiple tables and views of a Teradata Database.
This book provides information about Teradata TPump (TPump). TPump is a data loading utility that helps you maintain (update, delete, insert, and atomic upsert) the data in your Teradata Database. TPump uses standard Teradata SQL to achieve moderate to high data loading rates to the Teradata Database. Multiple sessions and multi-statement requests are typically used to increase throughput.
This book provides information about Teradata Parallel Transporter Application Programming Interface. Teradata Parallel Transporter Application Programming Interface is a set of application programming interfaces used to load and export data to and from Teradata Database systems. This book describes how to set up the interface, error reporting, checkpoint and restart, and other relevant topics. Coding examples are also included.
This book provides information about Teradata Parallel Transporter (TPT). TPT provides high-performance extraction, transformation, and loading operations on the Teradata Database. This programmer guide provides the information necessary to develop custom TPT operators, and information related to the operator interface functions that support all functional intercommunication between operators and the TPT infrastructure.
This book provides information about Teradata Parallel Transporter (TPT). TPT provides high-performance extraction, transformation, and loading operations on the Teradata Database. This book provides information on how to use TPT. It includes an overview of the product and its components, and it describes the functions and features of the product. This book also describes the TPT infrastructure for channel-attached and network-attached client systems, including the features, capabilities, and syntax for TPT.
This book provides information about Teradata Parallel Transporter (TPT). TPT provides high-performance extraction, transformation, and loading operations on the Teradata Database. This book provides information about how to use TPT to extract, load, and update data.
This book provides information about Teradata Meta Data Services. This book explains how to install and configure Teradata Meta Data Services (MDS) Release 12.0. It also supports the operational responsibilities of the Teradata Meta Data Services Administrator.
This book provides information about the Teradata Meta Data Services Meta Data Development Kit. Programming information and descriptions of the Application Programming Interfaces (APIs), Meta Data Services DLLs, libraries, and other files needed to interface to the MDS Repository from customer applications are provided. The subjects covered in this book include:
Overview of Interfaces to MDS
Application Information Models
General API Information
Data Types Used in APIs
C++ Class - CMetaRepository
C++ Class - CMetaPersist
C++ Class - CMetaObject
C++ Class - CMetaFilterInfo
C++ Class - CMetaObjectKey
C++ Class - CMetaObjectClassKey
C++ Class - AIM
C++ Class - Security
COM Interfaces
XML Scripting Interface
Error Messages
Steps for Using the MDS APIs and Example Code