IDQ 9.0.1
Agenda
Analyst and Developer Tools
Perform Column, Rule, Join and Mid-Stream Profiling
Manage reference tables
Collaborate on projects
Scorecard data
Design and develop Mapplets and Rules
Create standardization, cleansing and parsing routines
Validate addresses
Identify duplicate records
Associate and consolidate matched records
Migrating from 8.6.2 to 9.0.1
Logs and troubleshooting in 9.0.1
Conformity
Consistency
Accuracy
Duplicates
Integrity
1. Profile
Identify DQ problems through profiling using either the Analyst or Developer Tools.
2. Collaborate
Developers and Analysts can work together to build the DQ management process.
3. Standardize
Once the problems with the data have been identified, develop your standardization process to cleanse, standardize, enrich and validate your data.
4. Match and Consolidate
Identify duplicate records in your data using a variety of matching techniques, then automatically or manually consolidate your matched records.
Informatica Analyst
14
Data Quality Scorecarding
Scorecards in the Analyst Tool
Data Quality Assistant
Management of Bad Records and Duplicate Records
Auditing of changes
Project 1
Project 2
Folder 1
Folder 2
Folder 2-1
Folder 2-2
Folder 22-1
Folder 22-2
16
Projects
The Shared option is available at project creation time only and cannot be changed afterwards
Shared Projects Non-shared Projects
17
Project Navigator
Project Contents
Profiles, Scorecards, DQA, Data Objects, Reference Tables, Rules
18
Table
19
Data Objects
Data Objects are listed in your project. To view one, double-click its link.
20
Flat Files
The Analyst tool enables any browser user to import flat files. There are two import options for flat files:
Browse and Upload
Network path/shared directory
21
[Diagram: Browse and Upload: the flat file is uploaded from the client/browser machine to the flatfilecache directory on the server]
[Diagram: Network path/shared directory: the flat file is referenced in place]
Relational Tables
Analyst users can:
Create new DB connections
Leverage existing DB connections
25
Data Profiling
Analyst Profiling
There are two types of profiling available in the Analyst Tool:
Column and Rule Profiling
Column Profiling:
A process of discovering physical characteristics of each column in a file. It is the analysis of data quality based on the content and structure of the data.
Review Column profiling results to:
Identify possible anomalies in the data
Build reference tables
Apply or build Rules
Develop Scorecards
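For illustration, a minimal column-profiling sketch in Python (the sample rows, column names and pattern convention are illustrative, not IDQ's internal model):

```python
from collections import Counter
import re

rows = [
    {"name": "Alice", "zip": "10547"},
    {"name": "bob",   "zip": "1054A"},
    {"name": "",      "zip": "10547"},
]

def profile_column(rows, column):
    values = [r.get(column, "") for r in rows]
    non_blank = [v for v in values if v not in ("", None)]
    # Pattern analysis: digits -> 9, letters -> X (a common profiling convention)
    patterns = Counter(re.sub(r"[A-Za-z]", "X", re.sub(r"\d", "9", v)) for v in non_blank)
    return {
        "rows": len(values),
        "nulls_or_blanks": len(values) - len(non_blank),
        "distinct": len(set(non_blank)),
        "value_frequencies": Counter(non_blank).most_common(),
        "patterns": patterns.most_common(),
    }

for col in ("name", "zip"):
    print(col, profile_column(rows, col))
```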
28
Column Profiling
Two methods of creating profiles exist:
Quick Profile
Default name Profile_
Profiles all columns and rows
Drill down on live data
Custom Profile
User can select settings
29
Custom Profile
Specify name and location
Select columns to profile
Discard/keep profile results for columns not selected
Select number of rows to profile
Drill down on live or staged data
Select Columns to view in drilldowns
30
Column Profiling
Value/Patterns/Statistics
Drilldown
31
Drilldowns
Click on the Drilldown arrow in the value frequencies to drill down to the associated records. To drill down on multiple values, select the values in the viewer, right-click and choose Show Matching Rows.
Project Collaboration
Seamless collaboration between Analysts and Developers
Projects created in either tool are visible in the other
Team members can easily communicate and share work & findings through comments, bookmarks, shared data profiles & data quality scorecards
Data can be easily exported from profiles or rules and emailed to the appropriate owner for review or correction
36
Collaboration - Comments
Analysts and Developers can use comments in profiles to collaborate on projects.
Document DQ issues
Collaboration via a simple URL in email, portals, links in docs/specs, etc.
HTTPS protocol supported
Metadata bookmarks: all objects sharable via common metadata
39
High-Fidelity Collaboration
[Diagram: a Mapplet and a Rule share common metadata (Mapplet = Rule)]
40
Rule Profiling
A Rule is a constraint written against data that is used to identify possible inconsistencies in the data.
Rule creation and editing (expression-based)
Leveraging OOTB Rules / Developer-created rules
Join analysis and mid-stream profiling are performed in the Developer Tool only
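For illustration, a minimal sketch of a rule as a row-level predicate in Python (the ZIP rule and sample rows are hypothetical; real IDQ rules are expression-based objects):

```python
import re

rows = [
    {"id": 1, "zip": "10547"},
    {"id": 2, "zip": "1054"},        # too short -> fails the rule
    {"id": 3, "zip": "10547-1521"},
]

def zip_is_valid(row):
    # Rule: ZIP must be 5 digits, optionally followed by -NNNN
    return bool(re.fullmatch(r"\d{5}(-\d{4})?", row["zip"] or ""))

print([r["id"] for r in rows if not zip_is_valid(r)])   # -> [2]
```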
41
Rule Profiling
Apply rules within profiles and analyze results in-line with original source data
Custom Developer Created Rules
Use the icons to find, edit and modify the data and the reference table
48
Reference tables can be edited to add columns and rows, or to change the data values.
Search and replace values
Editing activities are tracked in the audit trail log
View properties for the reference table in the Properties view
Scorecarding
A single scorecard supports scores from multiple Data Objects.
Scores added to a scorecard via profiles are not connected to the profile(s) from which the column/virtual column originated:
You can delete the profile without impacting the scorecard
Deleting the source would invalidate both the profile and the scorecard
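For illustration, a minimal sketch of how a score could be computed as the percentage of values in a column that pass a validity check (the check and sample values are illustrative only):

```python
def score(values, is_valid):
    # Score = percentage of values that pass the validity check
    valid = sum(1 for v in values if is_valid(v))
    return 100.0 * valid / len(values) if values else 0.0

zips = ["10547", "1054", "10547-1521", ""]
print(score(zips, lambda z: len(z) in (5, 10)))   # 50.0 -> the column's score
```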
Scorecard
Outline View
62
View/edit Properties
Preview Data
63
PDOs that represent just the native metadata: reuse the native metadata and customize read/write at the mapping level, e.g. provide different filter or join conditions.
PDOs that represent both the native metadata and the configuration rules for read/write: reuse the customized PDO in mappings; it cannot be overridden further at the mapping level.
Mappings
A Mapping reads data from sources, applies transformation logic to data and writes transformed data to targets. They can be used in IDQ to logically define the Data Quality/Integration Process.
67
Mapping elements
Physical Data Objects with Read access - Sources (file-based or database)
Operational transformations: tools to cleanse, enhance and match the data
Mapplet Example
Mapping: source and target data are defined outside the Mapplet
70
Transformations
A transformation is an object that generates, modifies, or passes data. Data passes through linked ports in a mapping or mapplet. Reusable transformations:
Can be used in multiple mappings or mapplets. All instances inherit changes.
Output Ports Input Ports
Data Preview
Data can be previewed even in incomplete, partially valid mappings
Immediate feedback as you develop, high productivity gains
Shows output ports only
73
Data Preview
You can configure how many rows are read and displayed during the preview.
You can also configure how many rows are processed when running/testing mappings.
74
Troubleshooting
The first error is displayed in the Output view. View the log file for more detailed information.
75
Search
Search within a particular context
Search within a particular folder
Search within search
76
Search Results
Double-click or right-click on results to open them directly. Show In Object Explorer (available elsewhere as well).
77
Developer Profiling
78
Column Profiling
79
Value Frequencies
Create or update reference tables using frequency values output from profiling
Join Condition
Join analysis evaluates the degree of overlap between two columns. Click on the Join Condition to view the Venn diagram. Double-click on an area in the Venn diagram to view the join/orphan records.
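For illustration, a minimal join-analysis sketch in Python showing the overlap and orphans that the Venn diagram summarizes (the column values are made up):

```python
# Foreign-key style columns from two data objects
orders_customer_id = ["C1", "C2", "C2", "C4"]
customers_id       = ["C1", "C2", "C3"]

left, right = set(orders_customer_id), set(customers_id)
print("join (in both):      ", sorted(left & right))   # ['C1', 'C2']
print("orphans (left only): ", sorted(left - right))   # ['C4']
print("orphans (right only):", sorted(right - left))   # ['C3']
```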
Data Standardization
84
What is Standardization?
Standardization addresses the data quality issues identified through data profiling. The key objectives in data standardization are:
To transform and parse data from single multi-token fields to multiple fields
To correct completeness, conformity, and consistency problems
To standardize field formats and extract important data from free-text fields
Standardization Transformations
The Case Converter transformation creates data uniformity by standardizing the case of strings in input data.
The Merge transformation reads the data values from multiple input fields to create a single output field.
The Standardizer transformation standardizes characters and strings in data. It also can be used to remove noise from a field.
The Decision transformation can be used to build rules.
The Parser transformation can parse input data using the following methods:
Token set
Regular expression
Reference table
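For illustration, a minimal parsing and standardization sketch in Python using a token-set lookup, a regular-expression split and reference-table style replacements (the tokens and replacements are illustrative only):

```python
import re

TITLES = {"MR", "MRS", "DR"}                           # token-set style lookup
REPLACEMENTS = {"SVCS": "SERVICES", "RD": "ROAD"}      # reference-table style replacements

def parse_name(value):
    tokens = re.split(r"\s+", value.strip().upper())   # regular-expression tokenizing
    title = tokens[0].rstrip(".") if tokens and tokens[0].rstrip(".") in TITLES else ""
    rest = tokens[1:] if title else tokens
    rest = [REPLACEMENTS.get(t, t) for t in rest]       # standardize individual tokens
    return {"title": title,
            "first": rest[0] if rest else "",
            "last":  rest[-1] if len(rest) > 1 else ""}

print(parse_name("Dr. john smith"))
# {'title': 'DR', 'first': 'JOHN', 'last': 'SMITH'}
```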
87
Address Validation
88
Address Validation
240+ countries supported by a single vendor, AddressDoctor
Ability to parse addresses
New input strategies to support different customer storage formats
Additional output transformations to support varying international addressing formats
CASS and SERP reports
Standardized address match codes across all countries
Significant improvements in performance, with the ability to multistream
Improved data download processes via AddressDoctor fulfilment processes
Single license key location
89
Output Groups
Predefined output groups:
Geo Coding: Latitude, Longitude
Country: Country name, ISO country code
Status Info: Information on the quality of each input address
Residue: Unrecognized elements in the input address
GeoCoding
GeoCoding is available for the following countries
Andorra Australia Austria Belgium Canada Croatia Czech Republic Denmark Estonia Finland France Germany Gibraltar Greece Hungary Italy Latvia Liechtenstein Luxembourg Mexico Monaco Netherlands Norway Poland Portugal San Marino Singapore Slovakia Slovenia Spain Sweden Switzerland United Kingdom United States
Matching Theory
Consider the following records. How many duplicates are there? There are 2 records that could be considered matches. How did you work that out? There are 3 logical phases in the matching process:
Pair Generation
Scoring (matching)
Processing
Name                    Address
George W Bush           Texas
William J Clinton       New York
Hilary Rodham Clinton   New York
Nancy Pelosi            San Francisco
George H W Bush         Texas
I. Pair Generation
In this example, each record in the dataset will be compared with all others. This gives a total of 10 pairs.
Name1                   Address1        Name2                   Address2
George W Bush           Texas           William J Clinton       New York
George W Bush           Texas           Hilary Rodham Clinton   New York
George W Bush           Texas           Nancy Pelosi            San Francisco
George W Bush           Texas           George H W Bush         Texas
William J Clinton       New York        Hilary Rodham Clinton   New York
William J Clinton       New York        Nancy Pelosi            San Francisco
William J Clinton       New York        George H W Bush         Texas
Hilary Rodham Clinton   New York        Nancy Pelosi            San Francisco
Hilary Rodham Clinton   New York        George H W Bush         Texas
Nancy Pelosi            San Francisco   George H W Bush         Texas
100
II. Scoring (matching)
The next phase assigns a score (1 indicates they are identical) to each pair, which indicates how similar they are.
Name1                   Address1        Name2                   Address2        Score
George W Bush           Texas           William J Clinton       New York        0
George W Bush           Texas           Hilary Rodham Clinton   New York        0
George W Bush           Texas           Nancy Pelosi            San Francisco   0
George W Bush           Texas           George H W Bush         Texas           0.9
William J Clinton       New York        Hilary Rodham Clinton   New York        0.6
William J Clinton       New York        Nancy Pelosi            San Francisco   0
William J Clinton       New York        George H W Bush         Texas           0
Hilary Rodham Clinton   New York        Nancy Pelosi            San Francisco   0
Hilary Rodham Clinton   New York        George H W Bush         Texas           0
Nancy Pelosi            San Francisco   George H W Bush         Texas           0
101
Name                    ClusterID
George W Bush           1
William J Clinton       2
Hilary Rodham Clinton   3
Nancy Pelosi            4
George H W Bush         1
102
Transformations
Matching Transformations:
Key Generator: used to group the data
Match: used to match the data
Typically the following will be used in matching Mapplets:
Comparison
Weighted Average
103
Grouping
The number of pairs that a dataset with n records will generate is given by the formula n(n-1)/2:
5 records will create 10 pairs
50 records will create 1,225 pairs
500 records will create 124,750 pairs
5,000 records will generate nearly 12.5 million pairs
We need to consider ways to reduce the number of pairs created, and so reduce the impact on performance.
To do this, we should only generate pairs for records that are likely to match, by only comparing records that share one (or more) particular characteristics.
104
1. Grouping
We do this by nominating a Group Key. All records that have the same Group Key are compared against each other.
If we nominate Address as the Group Key, we only get two pairs created.
Name1               Address1   Name2                   Address2
George W Bush       Texas      George H W Bush         Texas
William J Clinton   New York   Hilary Rodham Clinton   New York
If a data set of 5,000 records is grouped so there are 10 groups of 500 records, it will generate 1.2 million pairs instead of 12 million.
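A quick check of these numbers with the n(n-1)/2 formula (group sizes as used above):

```python
def pairs(n):
    # n * (n - 1) / 2 pairs for a group of n records
    return n * (n - 1) // 2

print(pairs(5_000))        # 12,497,500 pairs without grouping
print(10 * pairs(500))     # 1,247,500 pairs with 10 groups of 500 records
```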
105
Group 1 Group 2
NYSIIS
The NYSIIS transformation converts a word into its phonetic equivalent.
Soundex
The Soundex transformation generates an alphanumeric code that represents the characters at the start of a string. It creates a code based on how the word sounds and takes variations of spelling into account.
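For illustration, a sketch of phonetic group keys assuming the third-party jellyfish library (pip install jellyfish); IDQ's own NYSIIS and Soundex implementations may differ in detail:

```python
import jellyfish

for name in ("Smith", "Smyth", "Schmidt"):
    print(name, jellyfish.soundex(name), jellyfish.nysiis(name))
# Smith and Smyth share the same Soundex code (S530), so they can land in the same group.
```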
2. Data Matching
Matching will identify related or duplicate records within a dataset or across two datasets. Matching scores records between 0 and 1 on the strength of the match between them, with a score of 1 indicating a perfect match between records. Informatica 9 provides a wide range of matching capabilities for each data type. Users have the flexibility to decide which algorithms they would like to use, as well as to configure null rules, weightings and thresholds.
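For illustration, a minimal 0-to-1 similarity score using Python's standard-library SequenceMatcher; this is not one of IDQ's matching algorithms, it only mirrors the idea of scoring pairs between 0 and 1:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # 0.0 = no similarity, 1.0 = identical
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

print(similarity("George W Bush", "George H W Bush"))   # high score -> likely a match
print(similarity("George W Bush", "Nancy Pelosi"))      # low score
```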
110
Matching
The Match transformation reads values in selected input columns and calculates match scores representing the degrees of similarity between the pairs of values.
Match Type (Pair Generation)
Strategies (Scoring)
Match Output (Processing)
111
Input ports: Unique Sequence ID, Group Key, Sorted Data, Match fields
Algorithm-based: Single/Dual Source
Identity (covered later): Single/Dual Source
[Screenshot: match strategy configuration, e.g. the HAMMING algorithm scored between 0 and 1]
Weights (e.g. 0.734402): define the threshold that must be met before records will be output as a possible match
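For illustration, a minimal sketch of combining per-field scores with weights and comparing the result to a match threshold (the weights and the 0.73 threshold are illustrative values, not product defaults):

```python
def weighted_score(scores, weights):
    # Weighted average of the per-field match scores
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

name_score, address_score = 0.9, 0.6
total = weighted_score([name_score, address_score], [0.7, 0.3])
threshold = 0.73
print(total, total >= threshold)   # 0.81 True -> output as a possible match
```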
115
Comparison Transformation
Evaluates the similarity between pairs of input strings and calculates the degree of similarity for each pair as a numerical score. To configure, select a pair of input columns and assign a matching strategy to them. Outputs match scores in a range from 0 to 1, where 1 indicates a perfect match.
The strategies available are also available in the Match transformation. Used to define match comparison operations in a matching mapplet. Multiple Comparison transformations can be added to the mapplet.
116
Comparison Transformation
Expects pairs of records to be passed to it
Outputs a score
Specify the algorithm to use
Specify the input ports
Define match parameters
Dual-Source Matching
Select the Dual Source pair generation option
Two Key Generators to group the data
Single Match transformation
An output group per source
ClusterID is the same for records in the same group
119
Identity Matching
Populations
Populations contain key building algorithms that have been developed for specific countries and languages. Rules differ depending on the country/language
E.g. when building keys using the UK population:
Name field: it assumes the surname is on the right of the field
Organization names: assumes the major part of the name is on the left
Address: St, Rd, Ave are all markers; the word before is typically the street name
Rules differ for each field; for example, with the name field Bob = Robert, but for address Bob <> Robert
123
Category Name                | Rule Type                                | Examples                       | Examples (Germany)
Noise Word                   | Word is Deleted                          | e.g. THE, AND                  | e.g. DAS, UND
Company Word Delete          | Word is Deleted                          | e.g. INC, LTD, CO              | e.g. AG, GMBH, KG
Company Word Skip            | Word is marked Skip                      | e.g. DEPARTMENT, ASSOCIATION   | e.g. ABTEIL, VEREIN
Personal Title Delete        | Word is Deleted                          | e.g. MR, MRS, DR, JR           | e.g. HR., FR, FRL, DR.
Nickname Replace Diminutives | Word and its Diminutives are Replaced    | e.g. CATH(E,IE,Y) => CATHERINE | e.g. KATHY => CATHERINE
Nickname Replace             | Word is Replaced                         | e.g. MIKE => MICHAEL           | e.g. HANS => JOHANNES
Word Replace                 | Word is Replaced                         | e.g. SVCS => SERVICES          | e.g. DIENSTE => DIENST
Secondary Lookup             | Word generates additional search ranges  | e.g. AL => ALBERT, ALFRED      | e.g. AL => ALBERT, ALFRED, ALFONS
124
Population Key Level Key Type Search Level Key Field Index Folder
125
Match Type
Key Level and Search Level specify how hard Identity will work to find a candidate. Key Field and Key Type specify which input should be used for keying, and also what type of field it is (Organization Name, Contact or Address); Identity logic will change depending on the type selected. Index Folder: the key index folder where the index and data will be written.
126
Identity Matching
127
Conservative
Accepts close matches
Loose
Accepts matches with a higher degree of variation
Association Example
If we match on all of the columns below, the three records would not be identified as matching.
ID   Name           Address              City         State   Zip          SSN
1    David Jones    100 All Saints Ave   New York     NY      10547        987-65-4320
2    Dennis Jones   1000 Alberta Rd      New Jersey   NY      10547-1521   987-65-4320
3    D. Jones       All Saints Ave       New York     NY
In order to identify all three of these records as matching, you need to match on two different criteria: 1) Name and Address 2) Name and SSN
132
Association Transformation
ID   Name           Address              City         State   Zip          SSN           Name and Address Cluster ID   Assoc Cluster ID
1    David Jones    100 All Saints Ave   New York     NY      10547        987-65-4320   1                             1
2    Dennis Jones   1000 Alberta Rd      New Jersey   NY      10547-1521   987-65-4320   2                             1
3    D. Jones       All Saints Ave       New York     NY                                 1                             1
After matching on name and address, records 1 and 3 are in the same cluster; however, record 2 is in a different cluster.
After matching on name and SSN, records 1 and 2 are in the same cluster and record 3 is in a different cluster.
The Association transformation creates links between records that share duplicate characteristics across more than one data field, so they are treated as members of a single set in data consolidation.
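For illustration, a minimal association sketch in Python: a simple union-find links records that share a cluster in either match pass, mimicking the Assoc Cluster ID above (cluster ids are illustrative):

```python
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

records = [1, 2, 3]
name_addr_cluster = {1: "A1", 2: "A2", 3: "A1"}   # records 1 and 3 match on name+address
name_ssn_cluster  = {1: "S1", 2: "S1", 3: "S2"}   # records 1 and 2 match on name+SSN

for clusters in (name_addr_cluster, name_ssn_cluster):
    first_in_cluster = {}
    for rec, cid in clusters.items():
        first_in_cluster.setdefault(cid, rec)
        union(rec, first_in_cluster[cid])          # link each record to its cluster's first member

print({rec: find(rec) for rec in records})          # all three records share one association id
```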
133
Consolidation Transformation
Create a single version of the truth
Merges related records, eliminating duplicates (de-duping)
Append data from an additional data set
Take the best data based on a rule and/or replace inaccurate data
Example:
Consolidation rule = longest string of matched records for each field
Nicholas Jones
755 Tramway Av
Onalaska, WI 54650
(555) 5555555
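For illustration, a minimal consolidation sketch in Python applying the "longest non-blank value per field" rule (the matched records and field names are illustrative):

```python
matched_records = [
    {"name": "Nick Jones",     "address": "755 Tramway Av", "phone": ""},
    {"name": "Nicholas Jones", "address": "755 Tramway",    "phone": "(555) 5555555"},
]

def consolidate(records, fields):
    # For each field, keep the longest value (non-blank values win automatically)
    return {f: max((r.get(f) or "" for r in records), key=len) for f in fields}

print(consolidate(matched_records, ["name", "address", "phone"]))
# {'name': 'Nicholas Jones', 'address': '755 Tramway Av', 'phone': '(555) 5555555'}
```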
134
Consolidation functions:
Most frequent
Most frequent non-blank
Longest
Shortest
Minimum (integer)
Maximum (integer)
135
Consolidation Functions
MostFrequent
Returns the most frequently occurring value for the port, including blank and null values
MostFrequentNonBlank
Returns the most frequently occurring value for the port, ignoring blank and null values
Longest
Returns the longest value
Shortest
Returns the shortest value
Minimum (integer)
Returns the minimum value
Maximum (integer)
Returns the maximum value
DQ rules
Exceptions
Required Tables
The DQA uses 3 staging tables.
Bad Record Management:
The main data table. This table stores your data as well as the matching information after matching is performed, e.g. dqacustomer.
The issue table. This must have the name of the main data table suffixed with _issue, e.g. dqacustomer_Issue. This table stores the issues identified per field.
Consolidation:
The duplicate record table. This is used to hold the duplicate record clusters for consolidation.
Within each table there are certain columns that must exist and are reserved for use internally by the DQA
Duplicate Records
The DQA allows for manual record consolidation after duplicates are detected
Audit trail of manual changes
PowerCenter Integration
145
Export Options
Choose the PC domain and repository
Export: to file OR to the PC repository
Export mappings: as mappings OR as mapplets
IDQ 9.0.1 integrates with PowerCenter 9.0.1 and PowerCenter 8.6/8.6.1.
No separate integration install: the DQ/PC integration files for both the client and the server side are placed by the 9.0.1 universal installer.
149
Content
Identity Populations
Populations need to be installed
Parameter Sets are pre-populated in the Match transformation
156
IN_901_Content_InstallationGuide.pdf
Why is it Called Migration? Migrate and convert all user content.
Overview
Version Differences
8.6.2
One repository per user
Reference data on the local file system
Data quality metadata contained in the IDQ Plan
Connection details embedded within the IDQ Plan
9.0.1
Central repository shared by all users
Reference data in the Reference Table Manager
Data quality metadata in 9.0.1 models
Connection details stored centrally
161
Domain
[Diagram: Informatica domain with the Data Integration Service (Profile Service, Mapping Service, SQL Service, Profile Warehouse), the Domain Repository, the PowerCenter Integration Service and Repository Service, and clients: IDQ, PC Workflow Manager, PC Designer, PC Monitor]
Informatica Domain
The Informatica domain includes objects and services for the Informatica platform. The Admin Console is now known as Administrator. The Informatica domain includes services for PowerExchange, Informatica Analyst, and Informatica Developer.
165
Informatica Domain
IDQ Migration
Direct migration from 8.6.2 to 9.0.1
Direct upgrade from 9.0 to 9.0.1
To migrate pre-8.6.2 installations you must first upgrade to IDQ 8.6.2, then migrate to 9.0.1
Security
The Informatica 9 platform provides full READ, WRITE, EXECUTE and GRANT permissions for domain connection objects
Support for MS SQL Server trusted connections for hosting the domain repository (MRS)
Ability to set and enforce permissions for all services and folders in the domain
166
New Services
Analyst Service
Application service that runs Informatica Analyst in the Informatica domain. Create and enable an Analyst Service on the Domain tab of Informatica Administrator. When you enable the Analyst Service, the Service Manager starts Informatica Analyst. You can open Informatica Analyst from Informatica Administrator.
ClientPackage Overview
Export IDQ plans from the IDQ repository
Identify connection details
Gather local dictionaries
Package data for the next step, ServerImport
170
ClientPackage - Report
Default Location:
<MigrationPackageLocation>/Package/PackageReport.html
Identifies dictionaries used by plans, and dictionaries that exist but are not used by any plan
Database connections used by plans; one entry for every DSN/Username/Password combination
171
ServerImport Overview
Unpack data from ClientPackage
Create connections
Import dictionary data into the Reference Table Manager
Convert 8.6.2 Plans to 9.0.1 Mapping XML
Imported Mappings
Imported Dictionaries
Imported Plan
178
Zip files generated by ClientPackage are not editable in WinZip (or similar)
On a 64-bit client, manual export is required due to Java version incompatibility with the IDQ 8.6.2 32-bit libraries
Dictionaries from the previous All World package are not automatically recognized as Informatica dictionaries
179
Post-Migration Notes
Incompatible components may require editing the plan in 9.0.1
Address Validation components will require editing in 9.0.1 (e.g. QAS and Melissa have been replaced with AddressDoctor)
IDQ 8.6.2 connections that source or target MySQL will have to be edited by hand
Logs
The purpose of this section is to identify the logs populated by Informatica 9 IDQ (Informatica Data Quality): which logs exist, where they are located, and their main purpose. Armed with this information, the user will be able to quickly identify issues during the installation process and with day-to-day operation. The user will also be able to identify areas requiring periodic maintenance (i.e. log removal).
182
Installation Logs
Server, Client and Content installation logs are located mostly in the root installation directory. On Windows, the default is C:\informatica\9.0.1; for the rest of the document it will be referred to as <Informatica install dir>. There are two logs for each installation: one shows the commands executed and the other shows the output of the installation. For debugging purposes, you will need to look at the InstallLog files.
196
Once you view the log and close it, it is no longer accessible via the client. You would need to go to the <Informatica install dir>.
www.informatica.com
Products & Services tab
199