
Informatica Data Quality 9.0.1

Bev Duddridge Principal Instructor Global Education Services [email protected]

Agenda
Analyst and Developer Tools
Perform Column, Rule, Join and Mid-Stream Profiling
Manage reference tables
Collaborate on projects
Scorecard data
Design and develop Mapplets and Rules
Create standardization, cleansing and parsing routines
Validate addresses
Identify duplicate records
Associate and consolidate matched records
Migrating from 8.6.2 to 9.0.1
Logs and troubleshooting 9.0.1

Informatica Analyst 9.0.1


Informatica Analyst is a web-based application client that analysts can use to analyze, profile, and score data in an enterprise. Business analysts and developers use Informatica Analyst for data-driven collaboration. You can perform column and rule profiling, Scorecarding, bad record and duplicate record management. You can also manage reference data and provide the data to developers in a data quality solution.
3

Informatica Developer 9.0.1


Informatica Developer is an application client that developers use to design and implement data quality and data services solutions. Use the data quality capabilities in the Developer tool to analyze the content and structure of your data and enhance the data in ways that meet your business needs.
Profile, Standardize and Parse data. Validate postal addresses. Identify duplicate records. Create and run data quality rules. Collaborate with Informatica users.
4

Introduction to Data Quality Management

What is Data Quality Management?


A set of processes that measure and improve the quality of important data on an ongoing basis. It ensures that data-dependent business processes and applications deliver expected results.

Six dimensions of Data Quality


Completeness

What data is missing or unusable?

Conformity

What data is stored in a non-standard format?

Consistency

What data values give conflicting information?

Accuracy

What data is incorrect or out of date?

Duplicates

What data records or attributes are repeated?

Integrity

What data is missing or not referenced?

Data Quality Problems


COMPLETENESS

CONFORMITY

CONSISTENCY

DUPLICATION

INTEGRITY

ACCURACY

Data Quality Management


1. Profile
Identify DQ problems through Profiling using either the Analyst or Developer Tools.

2. Collaborate
Developers and Analysts can work together to build the DQ management process.

3. Standardize, Match, Consolidate
Once the problems with the data have been identified, develop your standardization process to cleanse, standardize, enrich and validate your data. Identify duplicate records in your data using a variety of matching techniques. Automatically or manually consolidate your matched records.

4. Collaborate and Monitor
Developers and Analysts continue to work together to monitor and improve the DQ management process.

13

Informatica Analyst

14

Informatica Analyst Tool


Data Objects
  Metadata import for Data Sources
  Data access and preview
Profiling
  Column Profiling
  Rule Profiling
  Expression based Rule creation/editing
Project Collaboration
Reference Table Manager
  Authoring and editing of reference data
  Auditing of changes
15

Data Quality Scorecarding
  Scorecards in the Analyst Tool
Data Quality Assistant
  Management of Bad Records and Duplicate Records
  Auditing of changes

Repository, Projects and Folders


Projects are the highest-level containers for metadata. A project can contain objects or folders. Folders can be nested. Organize objects in folders as per your business needs.
(Diagram: a Repository containing Project 1 and Project 2, with nested folders, e.g. Folder 1, Folder 2, Folder 2-1, Folder 2-2, Folder 22-1, Folder 22-2.)

16

Projects
The shared option is available at project creation time only and cannot be changed afterwards.
Shared Projects Non-shared Projects

Indicates shared project

Indicates non shared project

17

The Informatica Analyst GUI


Actions

Project Navigator

Project Contents
Profiles Scorecards DQA Data Objects Reference Tables Rules
18

Physical Data Objects


Physical Data Objects
File
Browse and Upload Network path/shared directory

Table

Data Sources can be
  Previewed
  Profiled
  Scorecarded

19

Data Objects
Data Objects are listed in your project. To view one, double-click its link.

20

Flat Files
Analyst enables any browser user to import flat files. There are 2 import options for flat files:
Browse and Upload Network path/shared directory

21

Flat Files - Browse and Upload


The Browse and Upload action uploads a copy of the file via HTTP to the server.
  Preview and Profile reference the uploaded copy, not the original
  Edits made to the local file will not be visible in Preview or Profile
  Edits to the uploaded file will be seen
  Recommended option for files 10 MB or smaller
Server Machine

Client/Browser Machine

Copy of file (via HTTP)

flatfilecache Directory

22

Flat Files - Network Path/Directory


References files located in a shared directory or file system.
  The share is specific to the server machine, not the browser client; there is no browse option for this reason
  The file is referenced in place, so there is no upload lag
  Preview/Profile reference the file on the network share
  Edits to the network-shared file will be seen
  Recommended option for files larger than 10 MB
Network Shared directory on server

File referenced

23

Relational Tables
Analyst users can
Create new DB connections

24

Relational Tables
Analyst users can
Leverage existing DB connections

25

Data Profiling

26

Why profile data?


Data profiling examines data in an existing data source in order to identify possible data quality problems. It collects statistics and information about the data to:
  Assess the quality levels of the data, including whether the data conforms to particular standards or patterns
  Understand the type of data quality issues that exist
  Find out whether existing data can easily be used for other purposes
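As a rough illustration of the kind of checks profiling automates, the sketch below computes a completeness and a conformity figure for one column of sample data; the column values and the ZIP-code pattern are hypothetical and not taken from the course data.

```python
import re

# Hypothetical sample column: US ZIP codes with some missing and malformed values.
zip_codes = ["10547", "94105", None, "1054", "10547-1521", "", "ABCDE", "60601"]

total = len(zip_codes)
non_blank = [z for z in zip_codes if z not in (None, "")]
conforming = [z for z in non_blank if re.fullmatch(r"\d{5}(-\d{4})?", z)]

print(f"Completeness: {len(non_blank) / total:.0%}")   # share of non-null, non-blank values
print(f"Conformity:   {len(conforming) / total:.0%}")  # share matching the 5- or 9-digit ZIP pattern
```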

27

Analyst Profiling
There are two types of profiling available in the Analyst Tool:
Column and Rule Profiling

Column Profiling:
A process of discovering physical characteristics of each column in a file. It is the analysis of data quality based on the content and structure of the data.
Review Column profiling results to:
Identify possible anomalies in the data Build reference tables Apply or build Rules Develop Scorecards

28

Column Profiling
Two methods of creating profiles exist:
Quick Profile
Default Name: Profile_
Profiles all columns and rows
Drill down on live data

Custom Profile
User can select settings

29

Custom Profile
Specify name and location
Select columns to profile
Discard/keep profile results for columns not selected
Select number of rows to profile
Drilldown on live or staged data
Select columns to view in drilldowns

30

Column Profiling

Column & Rule Profiling

Value/Patterns/Statistics

Drilldown

31

Drilldowns
Click on the Drilldown arrow in value frequencies to drill down to the associated records. To drill down on multiple values, select the values in the viewer, right-click, and choose Show Matching Rows.

32

Column Profiling - Values


Distinct values for the Column, with their frequencies
Value: The column values in order of decreasing frequency
Frequency: The number of times each value appears
Percent: The percentage that each value represents
Chart: Bar graph representing the percentage of each value found
Drilldown: Click the arrow to see the associated records
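A minimal sketch of what the Values view reports, using a standard library counter; the column data here is hypothetical.

```python
from collections import Counter

# Hypothetical column data.
states = ["NY", "CA", "NY", "TX", "ny", "NY", "TX", None]

freq = Counter(states)
total = len(states)

# Value, frequency, and percent, in order of decreasing frequency.
for value, count in freq.most_common():
    print(f"{str(value):<6} frequency={count:<3} percent={count / total:.1%}")
```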

33

Column Profiling - Patterns


Patterns inferred for the Column, with their frequencies and the percentage of values matching each
Patterns: The patterns that exist in each column
Frequency: The number of values in the data profiled which match each pattern
Percent: The percentage of the values in the data profiled which match each pattern
Chart: A bar graph representing the percentage of the data which match each pattern
Drilldown: Click the arrow to see the associated records
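A minimal sketch of the pattern-inference idea: map each character to a class symbol and count the resulting shapes. The convention of "9" for digits and "X" for letters is an assumption for illustration, not necessarily the notation the Analyst tool uses.

```python
from collections import Counter

def infer_pattern(value: str) -> str:
    # Map digits to "9" and letters to "X"; keep other characters as-is.
    return "".join("9" if c.isdigit() else "X" if c.isalpha() else c for c in value)

phone_numbers = ["555-1234", "555 9876", "5551234", "KL5-0192"]   # hypothetical column
patterns = Counter(infer_pattern(p) for p in phone_numbers)

total = len(phone_numbers)
for pattern, count in patterns.most_common():
    print(f"{pattern:<10} frequency={count} percent={count / total:.0%}")
```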

34

Column Profiling - Statistics


Statistics about the column values, such as average, length, and top and bottom values
Average: Average of the values for the column (integer)
Standard Deviation: The variability between column values (integer)
Maximum Length: Length of the longest value for the column
Minimum Length: Length of the shortest value for the column
Bottom 5: The lowest values for the column
Top 5: The highest values for the column

35

Project Collaboration
Seamless collaboration between Analysts and Developers
  Projects created in either tool are visible in the other
  Team members can easily communicate and share work & findings through comments, bookmarks, shared data profiles & data quality scorecards
  Data can be easily exported from profiles or rules and emailed to the appropriate owner for review or correction

36

Collaboration - Comments
Analysts and Developers can use comments in profiles to collaborate on projects.

Document DQ issues

Lossless translation of information.


Leave comments within Profiles for team members

37

Collaboration - Exporting data


To export drill-down results, click the Export Data button. Choose what you want to export:
Value frequencies Pattern frequencies Drill-down results

The file can be sent to the appropriate data owner.

38

Collaboration - Metadata Bookmarks (URLs)

Collaboration via a simple URL in email, portals, links in docs/specs, etc.
HTTPS protocol supported
Metadata Bookmarks: all objects sharable via common metadata

39

High-Fidelity Collaboration

Mapplet
Common Metadata Mapplet=Rule

Rule

40

Rule Profiling
A Rule is a constraint written against data that is used to identify possible inconsistencies in the data.
Rule creation and editing (Expression based) Leveraging OOTB Rules / Developer created rules

Join Analysis and mid-stream profiling are performed in the Developer Tool only.

41

Rule Profiling
Apply rules within profiles and analyze results in-line with original source data
Custom Developer Created Rules

Select from one of the prebuilt rules or create your own

42

Apply Rules to the profile


Apply the rules to the profile. Run the profile to view the results.

43

Value Frequency Rules


Select the value frequency results to include in the Rule, right-click and choose Add Rule
Choose to create a Value Frequency Rule
The expression is written based on your selection
Can be reusable

Run profile (on all or just the rule column)

44

Value Frequency Rules


After running the profile, click on the new frequency rule created
1: represents the records that met the criteria
0: represents the records that did not meet the criteria
The rule will be available as a mapplet in the Developer tool
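A minimal sketch of what a value frequency rule expresses: return 1 when the column value is one of the values selected from the frequency list, 0 otherwise. The selected values here are hypothetical.

```python
# Hypothetical set of values selected from the frequency list when the rule was created.
selected_values = {"NY", "CA", "TX"}

def value_frequency_rule(value):
    # 1: the record met the criteria; 0: it did not.
    return 1 if value in selected_values else 0

for v in ["NY", "WA", "TX", None]:
    print(v, "->", value_frequency_rule(v))
```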
45

Reference Table Management

46

What are Reference Tables?


Reference tables enable validation, parsing, enrichment and enhancement of data. Reference data can include accurate and standardized values that can be used by analysts and developers in cleansing and validation rules. Create, edit, and import data quality dictionary files as reference tables.

47

Sample Reference Table

Use the icons to find, edit and modify the data and the reference table
48

How to create Reference Tables


Reference Tables are created in the Analyst Tool and also in the Developer Tool, and can be created:
  using the reference table editor
  by importing a flat file
  from a column profile

They can be edited to add columns and rows, or make changes to the data values.
  Search and replace values
  Editing activities are tracked in the audit trail log
  View properties for the reference table in the Properties view

49

How to create Reference Tables

50

Reference Table Editor


1. Define the table structure

2. Add the data values

51

Import Flat File


Browse and Upload the file Enter Name

Define Code Page

52

Scorecarding

53

What are Data Quality Scorecards?


A scorecard is the graphical representation of valid values for a column in a profile. Scorecards can be easily shared with Stakeholders via a URL. Further DQ rules can be created in the Developer and applied to the profile in the Analyst Tool. Use scorecards to measure data quality progress.

54

Data Quality Scorecards


Scores are based on value frequencies
  Includes virtual columns (the output of any rule)
Single scorecard supports scores from multiple Data Objects
Scores added to a scorecard via profiles:
  Are not connected to the profile(s) from which the column/virtual column originated
  Delete the profile without impacting the scorecard
  Deleting the source would invalidate both the profile and the scorecard
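A minimal sketch of how a value-frequency score can be read: the share of column values judged valid. The column data and valid-value set below are hypothetical.

```python
# Hypothetical column data and the values marked as valid when the score was defined.
column_values = ["M", "F", "F", "U", "", "M", "female"]
valid_values = {"M", "F"}

score = sum(v in valid_values for v in column_values) / len(column_values)
print(f"Gender column score: {score:.1%}")   # share of valid values for the column
```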

55

Scorecard - Valid Values per column


Run Profile and select Add to Scorecard

Add and rename columns/rules you want to include in the scorecard

56

Scorecard - Valid Values per column


Create a new scorecard/group or add to an existing one
Select the valid values from the frequency list for the column
Once completed, choose Finish
Scorecards can be modified after creation
57

Scorecard - Out Of The Box Rules


Add the rule to the profile
From the profile, add the measure to your scorecard
In the Scorecard, select the valid/true value

58

Scorecard Custom Rules


Build the rule in Developer and validate it as a Rule
Add the rule to the profile and, from the profile, add the measure to your scorecard
In the Scorecard, select the valid values
Edit the Scorecard to move the measures into the Group
59

Scorecard

60

Informatica Developer Overview

61

Informatica Developer GUI


Multiple objects can be opened simultaneously

Editor Connection Explorer Object Explorer Properties

Outline View

62

Informatica Developer GUI

View/edit Properties

Preview Data

63

Physical Data Objects


Represents the native metadata in physical data sources and how it is accessed in the tool
Physical data objects are used as sources, targets or lookups
Relational tables are organized by connection names
Connections are name-based references
64

Relational Physical Data Objects


Relational PDO
  Represents just the native metadata
  Reuse the native metadata and customize read/write at the mapping level, e.g. provide different filter or join conditions

Customized PDO
  Represents both the native metadata and the configuration rules for read/write
  Reuse the customized PDO in mappings; it cannot be overridden further at the mapping level

65

Configuring Physical Data Objects - File


Configure the Read and Write tabs to indicate where the source file will be read from and written to
(server based)

Configured in the Physical Data Objects, not at mapping level

66

Mappings
A Mapping reads data from sources, applies transformation logic to the data, and writes the transformed data to targets. Mappings can be used in IDQ to logically define the Data Quality/Integration process.

67

Mapping elements
Physical Data Objects with Read access - Sources
file-based database

Operational transformations
tools to cleanse, enhance and match the data

Physical Data Objects with Write access - Target


file-based database

Reference tables enable validation, parsing, enrichment and enhancement of data

68

Mapplets and Rules


A reusable object containing a set of transformations that you can use in multiple mappings. Use a mapplet in a mapping, or validate the mapplet as a rule and use it in Informatica Analyst. When you use a mapplet in a mapping, you use an instance of the mapplet. Changes made are inherited by all instances of the mapplet.

69

Mapplet Example
Mapping: source and target data are defined outside the Mapplet

Mapplet Input transformation
  Passes data from a mapping into a mapplet

Mapplet Output transformation
  Passes data from a mapplet into a mapping

70

Transformations
A transformation is an object that generates, modifies, or passes data. Data passes through linked ports in a mapping / mapplet.
Reusable transformations:
  Can be used in multiple mappings or mapplets
  All instances inherit changes
Input Ports / Output Ports

71

Autolink & Propagate Attributes


Autolink ports from one transformation to another Autolink by using Prefix / Suffix

Propagate attribute changes in mapping Doesnt affect reusable transformations

72

Data Preview
Data can be previewed even in incomplete, partially valid mappings
Immediate feedback as you develop; high productivity gains
Shows output ports only
73

Data Preview
You can configure how many rows are read and displayed during the preview.

You can also configure how many rows are processed when running/testing mappings.
74

Troubleshooting

The first error is displayed in the Output view. View the log file to get more detailed information.
75

Search
Search within a particular context
Search within a particular folder Search within search

76

Search Results
Double-click or right-click on results to open directly Show In Object Explorer (Available elsewhere as well)

77

Developer Profiling

78

Column Profiling
Column Profiling

Value & Pattern Frequencies

Drill Down Results

79

Value Frequencies
Create or update reference tables using frequency values output from profiling

80

Exporting Profiling Results


Up to 200 value frequencies are displayed; to see more, export to a CSV output. Drill-down results can also be exported for review.
Export Value Frequencies

Export Drill down Results

81

Join Analysis Profiling


Use Join Analysis to evaluate the degree of overlap between two columns
Click on the Join Condition to view the Venn diagram with the join results
Double-click on an area in the Venn diagram to view the join/orphan records
82

Mid-Stream Profiling Profile at any point within a Mapping

*Targets cannot be profiled

Profile Source Profile Mapplet/Rule

Profile any Transformation

83

Data Standardization

84

What is Standardization?
Standardization addresses the data quality issues identified through data profiling. The key objectives in data standardization are:
  to transform and parse data from single multi-token fields to multiple fields
  to correct completeness, conformity, and consistency problems
  to standardize field formats and extract important data from free text fields
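A minimal sketch of the kind of cleanup a standardization rule might perform: fixing case, removing noise tokens, and replacing abbreviations via a small reference dictionary. The reference values and noise words here are illustrative, not Informatica content.

```python
import re

# Hypothetical reference data mapping non-standard tokens to standard forms.
replacements = {"ST": "STREET", "AVE": "AVENUE", "RD": "ROAD"}
noise = {"C/O", "ATTN"}

def standardize_address_line(line: str) -> str:
    tokens = re.split(r"\s+", line.strip().upper())
    out = []
    for token in tokens:
        token = token.rstrip(".,")
        if token in noise:
            continue                                  # remove noise tokens
        out.append(replacements.get(token, token))    # standardize known abbreviations
    return " ".join(out)

print(standardize_address_line("c/o 755 tramway ave."))   # -> "755 TRAMWAY AVENUE"
```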

85

Pattern Based Parsing


Create a reference table using output from the Labeler
Add a Pattern Parser and apply the new reference table
Parse the patterns
Output fields: Parsed Data, Parse Status, Overflow

86

Standardization Transformations
The Case Converter transformation creates data uniformity by standardizing the case of strings in input data.
The Merge transformation reads the data values from multiple input fields to create a single output field.
The Standardizer transformation standardizes characters and strings in data. It can also be used to remove noise from a field.
The Decision transformation can be used to build rules.
The Parser transformation can parse input data using the following methods:
  Token set
  Regular expression
  Reference table
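A minimal sketch of regular-expression parsing, one of the methods listed above: split a single name field into parsed outputs with a parse status and an overflow field, loosely mirroring the pattern-parser outputs mentioned earlier. The regex and output field names are assumptions for illustration.

```python
import re

# Hypothetical pattern: "<first> <middle-initial>. <last>" or "<first> <last>".
NAME_RE = re.compile(r"^(\w+)(?:\s+(\w)\.)?\s+(\w+)$")

def parse_name(value: str):
    match = NAME_RE.match(value.strip())
    if not match:
        return {"parse_status": "overflow", "overflow": value}
    first, middle, last = match.groups()
    return {"parse_status": "parsed", "first": first, "middle_initial": middle or "", "last": last}

print(parse_name("George W. Bush"))
print(parse_name("Nancy Pelosi"))
print(parse_name("Hilary Rodham Clinton"))   # does not fit the pattern -> overflow
```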
87

Address Validation

88

Address Validation
240+ countries supported by a single vendor, AddressDoctor
Ability to parse addresses
New input strategies to support different customer storage formats
Additional output transformations to support varying international addressing formats
CASS and SERP reports
Standardized address match codes across all countries
Significant improvements in performance, with the ability to multistream
Improved data download processes via AddressDoctor fulfilment processes
Single license key location
89

Output Groups
Predefined output groups:
Geo Coding
  Latitude, Longitude
Country
  Country name, ISO country code
Status Info
  Information on the quality of each input address
Formatted Address Line
  Formats addresses for mailing
Residue
  Unrecognized elements in the input address
90

Address Validation Configuration


Define default/force country
Define casing
Define mode
Define input template
Add input ports - select ports from one input group only
Add output ports - add ports from multiple output groups
Configure advanced settings (*performance improvements X5+)

91

Address Validation level: A+


A+: Street or Building coverage for more than 98% of the country. The following countries are available:

92

Address Validation Level: A


A: Street, Building or Block coverage for major parts of the country. The following countries are available:

93

Address Validation Level: B


B Locality and Postal Code. Countries include:

94

Address Validation Level: B

95

GeoCoding
GeoCoding is available for the following countries
Andorra Australia Austria Belgium Canada Croatia Czech Republic Denmark Estonia Finland France Germany Gibraltar Greece Hungary Italy Latvia Liechtenstein Luxembourg Mexico Monaco Netherlands Norway Poland Portugal San Marino Singapore Slovakia Slovenia Spain Sweden Switzerland United Kingdom United States

96

Address Validation Parameters


Define the license key in Informatica Administrator (separate license for Geocoding)
Define the location of the reference data
The license expires (the data does not, except CASS data)

97

Grouping and Matching

98

Matching Theory
Consider the following records. How many duplicates are there? There are 2 records that could be considered matches. How did you work that out?
There are 3 logical phases in the matching process:
  Pair Generation
  Scoring (matching)
  Processing

Name                    Address
George W Bush           Texas
William J Clinton       New York
Hilary Rodham Clinton   New York
Nancy Pelosi            San Francisco
George H W Bush         Texas

99

I. Matching Theory - Pair Generation

In this example, each record in the dataset will be compared with all others. This gives a total of 10 pairs.

Name1                   Address1        Name2                   Address2
George W Bush           Texas           William J Clinton       New York
George W Bush           Texas           Hilary Rodham Clinton   New York
George W Bush           Texas           Nancy Pelosi            San Francisco
George W Bush           Texas           George H W Bush         Texas
William J Clinton       New York        Hilary Rodham Clinton   New York
William J Clinton       New York        Nancy Pelosi            San Francisco
William J Clinton       New York        George H W Bush         Texas
Hilary Rodham Clinton   New York        Nancy Pelosi            San Francisco
Hilary Rodham Clinton   New York        George H W Bush         Texas
Nancy Pelosi            San Francisco   George H W Bush         Texas

100

II. Matching Theory - Scoring

The next phase assigns a score (1 indicates they are identical) to each pair, which indicates how similar they are.

Name1                   Address1        Name2                   Address2        Score
George W Bush           Texas           William J Clinton       New York        0
George W Bush           Texas           Hilary Rodham Clinton   New York        0
George W Bush           Texas           Nancy Pelosi            San Francisco   0
George W Bush           Texas           George H W Bush         Texas           0.9
William J Clinton       New York        Hilary Rodham Clinton   New York        0.6
William J Clinton       New York        Nancy Pelosi            San Francisco   0
William J Clinton       New York        George H W Bush         Texas           0
Hilary Rodham Clinton   New York        Nancy Pelosi            San Francisco   0
Hilary Rodham Clinton   New York        George H W Bush         Texas           0
Nancy Pelosi            San Francisco   George H W Bush         Texas           0
101

III. Matching Theory - Processing

The same number of rows that were originally received are output, with an identifier added to each row. Rows that are similar will have the same identifier or ClusterID. To determine if two rows are related, we specify a threshold value. Pairs with a score equal to or above the threshold are deemed to match. Our threshold is 0.8. Only one pair meets the threshold.

Name                    Address         ClusterID
George W Bush           Texas           1
William J Clinton       New York        2
Hilary Rodham Clinton   New York        3
Nancy Pelosi            San Francisco   4
George H W Bush         Texas           1

102

Transformations
Matching Transformations:
Key Generator used to group the data Match - used to match the data Typically the following will be used in Matching Mapplets:
Comparison Weighted Average

103

Grouping
The number of pairs that a dataset with N records will generate is given by the formula: (n² - n) / 2

5 records will create 10 pairs
50 records will create 1,225 pairs
500 records will create 124,750 pairs
5,000 records will generate nearly 12.5 million pairs

We need to consider ways to reduce the number of pairs created, and so reduce the impact on performance.
To do this, we should only generate pairs for records that are likely to match, by comparing only records that share one (or more) particular characteristics.

104

1. Grouping
We do this by nominating a Group Key. All records that have the same Group Key are compared against each other.
If we nominate Address as the Group Key, we only get two pairs created.
Name                Address     Name                    Address
George W Bush       Texas       George H W Bush         Texas
William J Clinton   New York    Hilary Rodham Clinton   New York

If a data set of 5,000 records is grouped so there are 10 groups of 500 records, it will generate 1.2 million pairs instead of 12 million.
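The counts above follow directly from the (n² - n) / 2 formula; a quick sketch, taking the 10-groups-of-500 split as an assumption matching the slide.

```python
def pairs(n: int) -> int:
    # Number of pairs generated by n records: (n*n - n) / 2, i.e. n * (n - 1) / 2.
    return n * (n - 1) // 2

for n in (5, 50, 500, 5_000):
    print(f"{n:>5} records -> {pairs(n):>10,} pairs")

# Grouping 5,000 records into 10 groups of 500 cuts the work dramatically:
print(f"10 groups of 500 -> {10 * pairs(500):,} pairs")   # ~1.25 million vs ~12.5 million
```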

105

IDQ Grouping and matching


In matching, the records within each group are compared against each other. Matching is not performed across groups, therefore be sure to group on a complete and accurate field.

Group 1 Group 2

106

Key Generator Transformation


The Key Generator transformation has three purposes:
Assign a unique identifier to each record in a dataset if one does not exist
Apply an operation to a field so that it is more suitable for grouping
Sort the outgoing data so that rows with the same group key value are contiguous
Only required for classic matching

107

Key creation strategy


String
Builds a group key using the first or last number of characters

NYSIIS
The NYSIIS transformation converts a word into its phonetic equivalent.

Soundex
The Soundex generates an alphanumeric code that represents the characters at the start of a string. It creates a code based on how the word sounds and takes variations of spelling into account.
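A minimal sketch of a Soundex-style group key, so records whose names sound alike land in the same group; this follows the classic Soundex algorithm and may differ in detail from the Key Generator's implementation.

```python
def soundex(word: str, length: int = 4) -> str:
    # Classic Soundex: keep the first letter, encode the rest by sound, drop vowels and repeats.
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4", **dict.fromkeys("MN", "5"), "R": "6"}
    word = "".join(c for c in word.upper() if c.isalpha())
    if not word:
        return ""
    encoded = word[0]
    prev = codes.get(word[0], "")
    for c in word[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            encoded += code
        if c not in "HW":            # H and W do not reset the previous code
            prev = code
    return (encoded + "000")[:length]

# Similar-sounding surnames receive the same group key and are compared during matching.
for name in ("Smith", "Smyth", "Johnson", "Jhonson"):
    print(name, "->", soundex(name))
```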

108

Mid-Stream Profiling for Group Analysis


Profile the Key Generator Transformation. Review:
  Number of records per group
  NULL keys
  Single record groups

109

2. Data Matching
Matching will identify related or duplicate records within a dataset or across two datasets. Matching scores records between 0 and 1 on the strength of the match between them, with a score of 1 indicating a perfect match between records. Informatica 9 provides a wide range of matching capabilities for each data type. Users have the flexibility to decide which algorithms they would like to use as well as configuring null rules, weightings and thresholds.

110

Matching
The Match transformation reads values in selected input columns and calculates match scores representing the degrees of similarity between the pairs of values.
Match Type (Pair Generation), Strategies (Scoring), Match Output (Processing)

Classic Matching strategies:


Jaro Distance
Bigram Distance
Hamming Distance
Edit Distance
Reverse Hamming Distance

111

Match Transformation 1 - Pair Generation

Input ports: Unique Sequence ID, Group Key, Sorted Data, Match fields
Algorithm Based: Single/Dual Source
Identity (covered later): Single/Dual Source

112

Match Transformation 2 - Strategies

113

Match Transformation 3 Match Output


Clustered or Matched Pairs
Select the threshold that must be met for records to be identified as a match
Choose the Scoring method

114

Example product data


Field      Record 1          Record 2              Strategy   Score
Type       SP                SC                    EDIT       0.5
Material   CHKS IN JY CKN    CHKS IN JY CKN + BF   BIGRAM     0.83871
Shelf      24M               12M                   HAMMING    0.333
Weight     3KG               1.3KG                 HAMMING    0
Quantity   X6                X6                    HAMMING    1
Color      Red               Red                   EDIT       1

Weighted score: 0.734402
Define the threshold that must be met before records will be output as a possible match.

115

Comparison Transformation
Evaluates the similarity between pairs of input strings and calculates the degree of similarity for each pair as a numerical score. To configure, select a pair of input columns and assign a matching strategy to them. Outputs match scores in a range from 0 to 1, where 1 indicates a perfect match.
The strategies available are also available in the Match transformation. Used to define match comparison operations in a matching mapplet. Multiple Comparison transformations can be added to the mapplet.
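A minimal sketch of one such string-comparison strategy: Edit (Levenshtein) distance normalized into a 0-1 similarity score, where 1 indicates a perfect match. It illustrates the scoring idea rather than reproducing the transformation's exact algorithm.

```python
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein distance via dynamic programming.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    # Normalize the distance to a 0..1 score; identical strings score 1.
    if not a and not b:
        return 1.0
    return 1 - edit_distance(a, b) / max(len(a), len(b))

print(round(edit_similarity("Nick Jones", "Nicholas Jones"), 3))
print(round(edit_similarity("Onalaska", "Onalaska"), 3))   # 1.0 = perfect match
```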
116

Comparison Transformation
Expects pairs of records to be passed to it
Outputs a Score
Specify the Algorithm to use
Specify the Input ports
Define Match Parameters

117

Weighted Average Transformation


Inputs: Similarity scores
Outputs: Weighted average of the similarity scores
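A minimal sketch of how a weighted average combines per-field similarity scores into a single record-pair score; the weights below are hypothetical and are not the ones behind the product-data example above.

```python
# Per-field similarity scores from Comparison strategies (hypothetical values).
field_scores = {"material": 0.84, "shelf": 0.33, "weight": 0.0, "quantity": 1.0, "color": 1.0}

# Hypothetical weights reflecting how much each field should influence the match decision.
weights = {"material": 0.4, "shelf": 0.1, "weight": 0.1, "quantity": 0.2, "color": 0.2}

weighted_average = sum(field_scores[f] * weights[f] for f in field_scores) / sum(weights.values())
print(round(weighted_average, 3))   # compared against the match threshold downstream
```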

118

Dual-Source Matching
Select the Dual Source Pair Generation option
Two Key Generators to group the data
Single Match Transformation
An output group per source
ClusterID is the same for records in the same group

119

Identity Matching

120

What is Identity Matching?


Identity Matching delivers next-generation linguistic and statistical matching algorithms to ensure highly accurate matching out of the box for over 60 countries
Identity matching enables a business user to deliver accurate matches across multiple languages
Emulates a human expert's ability to determine a match based on numerous fields & attributes
Despite data having errors, variation, and duplication, Identity delivers the highest possible reliability when matching data based on names, addresses, descriptions, and other identification data
Identity Matching works through the use of prebuilt populations and matching strategies (purposes)
121

20 Common Data Errors & Variation


Variation or Error              Example
Sequence errors                 Mark Douglas or Douglas Mark
Involuntary corrections         Browne, Brown
Concatenated names              Mary Anne, Maryanne
Nicknames and aliases           Chris: Christine, Christopher, Tina
Noise                           Full stops, dashes, slashes, titles, apostrophes
Abbreviations                   Wlm/William, Mfg/Manufacturing
Truncations                     Credit Suisse First Bost
Prefix/suffix errors            MacDonald/McDonald/Donald
Spelling errors                 P0rter
Typing errors                   Beht
Transcription mistakes          Hannah, Hamah
Missing tokens                  George W Smith: George Smith, Smith
Extra tokens
Transliteration differences     Gang, Kang, Kwang
Phonetic errors                 Graeme, Graham
Foreign sourced data            Khader AL Ghamdi, Khadir A. AlGamdey
Unpredictable use of initials   John Alan Smith, J A Smith
Transposed characters           Johnson, Jhonson
Localization                    Stanislav Milosovich, Stan Milo
Inaccurate dates                12/10/1915, 21/10/1951, 10121951, 00001951

122

Populations
Populations contain key-building algorithms that have been developed for specific countries and languages. Rules differ depending on the country/language. E.g. when building keys using the UK population:
  Name field: it is assumed the surname is on the right of the field
  Organization names: the major part of the name is assumed to be on the left
  Address: St, Rd, Ave are all markers; the word before is typically the street name
Rules also differ for each field: for example, with the name field Bob = Robert, but for address Bob <> Robert

123

Identity Populations sample rules


USA
Category                       Rule Type                                  Examples
Noise Word                     Word is Deleted                            e.g. THE, AND
Company Word Delete            Word is Deleted                            e.g. INC, LTD, CO
Company Word Skip              Word is marked Skip                        e.g. DEPARTMENT, ASSOCIATION
Personal Title Delete          Word is Deleted                            e.g. MR, MRS, DR, JR
Nickname Replace Diminutives   Word and its Diminutives are Replaced      e.g. CATH(E,IE,Y) => CATHERINE
Nickname Replace               Word is Replaced                           e.g. MIKE => MICHAEL
Word Replace                   Word is Replaced                           e.g. SVCS => SERVICES
Secondary Lookup               Word generates additional search ranges    e.g. AL => ALBERT, ALFRED

Germany
Category                       Rule Type                                  Examples
Noise Word                     Word is Deleted                            e.g. DAS, UND
Company Word Delete            Word is Deleted                            e.g. AG, GMBH, KG
Company Word Skip              Word is marked Skip                        e.g. ABTEIL, VEREIN
Personal Title Delete          Word is Deleted                            e.g. HR., FR, FRL, DR.
Nickname Replace Diminutives   Word and its Diminutives are Replaced      e.g. KATHY => CATHERINE
Nickname Replace               Word is Replaced                           e.g. HANS => JOHANNES
Word Replace                   Word is Replaced                           e.g. DIENSTE => DIENST
Secondary Lookup               Word generates additional search ranges    e.g. AL => ALBERT, ALFRED, ALFONS

124

Match Type Pair Generation

Population Key Level Key Type Search Level Key Field Index Folder

125

Match Type
Key Level and Search Level specify how hard Identity will work to find a candidate. Key Field and Key Type specify which input should be used for keying, and also what type of field it is (Organization Name, Contact or Address); Identity logic will change depending on the type selected. Index folder: the key index folder where the index and data will be written.

126

Identity Matching

127

Identity Match Strategy


For each Identity Match Strategy, three Match Levels are available:
Typical
Accepts reasonable matches Default if no Match_Level specified

Conservative
Accepts close matches

Loose
Accepts matches with a higher degree of variation

128

Match Output - Processing


Identity clustering can only be used with Identity Pair Generation.
It is possible to group using the key generator (instead of Identity) and match using Identity matching. In this case check Field Match on the Match Type tab

129

List of Identity Populations


Americas Argentina Brazil Canada Chile Mexico Peru USA Industry Solutions AML OFAC APAC Australia China (5) India Indonesia Japan (3) Korea (2) Malaysia New Zealand Philippines Singapore Taiwan Thailand (2) Hong Kong Vietnam EMEA Arabic (3) Belgium Czech Republic Denmark Finland France Germany Greece (2) Hungary Ireland Italy Luxembourg Netherlands Norway Poland Portugal Spain Sweden Switzerland Turkey United Kingdom
130

50 countries 65 populations e.g. China has 5 populations

Automatic Data Consolidation

131

Association Example
If we match on all of the columns below, the three records would not be identified as matching.
ID   Name           Address              City         State   Zip          SSN
1    David Jones    100 All Saints Ave   New York     NY      10547        987-65-4320
2    Dennis Jones   1000 Alberta Rd      New Jersey   NY                   987-65-4320
3    D. Jones       All Saints Ave       New York     NY      10547-1521

In order to identify all three of these records as matching, you need to match on two different criteria: 1) Name and Address 2) Name and SSN
132

Association Transformation
ID   Name           Address              City         State   Zip          SSN           Name+Address Cluster ID   Name+SSN Cluster ID   Assoc Cluster ID
1    David Jones    100 All Saints Ave   New York     NY      10547        987-65-4320   1                         1                     1
2    Dennis Jones   1000 Alberta Rd      New Jersey   NY                   987-65-4320   2                         1                     1
3    D. Jones       All Saints Ave       New York     NY      10547-1521                 1                         2                     1

After matching on name and address, records 1 and 3 are in the same cluster; however, record 2 is in a different cluster. After matching on name and SSN, records 1 and 2 are in the same cluster and record 3 is in a different cluster. The Association transformation creates links between records that share duplicate characteristics across more than one data field, so they are treated as members of a single set in data consolidation.
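A minimal sketch of the association step, using a union-find over the two sets of cluster IDs to reproduce the single Assoc Cluster ID for the three records above; the data structure is a standard technique, not a description of the transformation's internals.

```python
# (record_id, name_address_cluster, name_ssn_cluster) from the two match runs above.
records = [(1, 1, 1), (2, 2, 1), (3, 1, 2)]

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

# Records sharing a cluster ID on either criterion end up in the same associated set.
for rec_id, addr_cluster, ssn_cluster in records:
    union(("rec", rec_id), ("addr", addr_cluster))
    union(("rec", rec_id), ("ssn", ssn_cluster))

assoc_ids = {}
for rec_id, *_ in records:
    root = find(("rec", rec_id))
    assoc_ids.setdefault(root, len(assoc_ids) + 1)
    print(f"record {rec_id} -> Assoc Cluster ID {assoc_ids[root]}")
```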

133

Consolidation Transformation
Create single version of the truth
Merges related records, eliminating duplicates (de-duping)
Append data from an additional data set
Take best data based on rule and/or replace inaccurate data

Example: Consolidation rule = longest string of the matched records for each field

Input records:
Nick Jones       755 Tramway Av   Onalaska, WI 54650   (555) 5555555
Nicholas Jones   755 Tramway Av   Onalaska, WI 54650                   [email protected]

Survivor record:
Nicholas Jones   755 Tramway Av   Onalaska, WI 54650   (555) 5555555   [email protected]

134

Consolidation Transformation - Create Survivor Record


Input data from Association or Match Transformation

Consolidation functions:
Most frequent
Most frequent non-blank
Longest
Shortest
Minimum (integer)
Maximum (integer)

Select Group By Field

135

Consolidation Functions
MostFrequent
Returns the most frequently occurring value for the port, including blank and null values

MostFrequentNonBlank
Returns the most frequently occurring value for the port, ignoring blank and null values

Longest
Returns the longest value

Shortest
Returns the shortest value

Minimum (integer)
Returns the minimum value

Maximum (integer)
Returns the maximum value
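A minimal sketch applying such consolidation functions per field across a cluster of matched records to build a survivor record. The cluster data echoes the Nick Jones example above, with a hypothetical email address, and the choice of function per field is illustrative.

```python
from collections import Counter

cluster = [
    {"name": "Nick Jones",     "address": "755 Tramway Av", "phone": "(555) 5555555", "email": ""},
    {"name": "Nicholas Jones", "address": "755 Tramway Av", "phone": "",
     "email": "nicholas.jones@example.com"},   # hypothetical email
]

def longest(values):
    return max(values, key=len)

def most_frequent_non_blank(values):
    non_blank = [v for v in values if v not in ("", None)]
    return Counter(non_blank).most_common(1)[0][0] if non_blank else ""

# Consolidation rule per field (hypothetical choice of functions).
rules = {"name": longest, "address": most_frequent_non_blank,
         "phone": most_frequent_non_blank, "email": most_frequent_non_blank}

survivor = {field: rule([rec[field] for rec in cluster]) for field, rule in rules.items()}
print(survivor)
```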

136

Data Quality Assistant

137

Exception Management process


(Diagram: Sources flow into Data Quality Checks built from cleansing and matching DQ rules; records that pass continue to the Target as high quality data, while exceptions / low quality data are routed to the Data Quality Assistant, a browser-based exception review and manual consolidation process.)

138

The Data Quality Assistant


The DQA is a web-based (Analyst) application for record management. It works in conjunction with data quality mappings to sort and filter data records by data quality issue. It can be used to:

Manage bad records
  Users can easily view and update bad data in a table through an easy-to-use GUI

Consolidate duplicate records into a Master Record


Users can create a Master record from multiple duplicate records

View the audit trail


View the audit trail on changes made to the data

139

Required Tables
The DQA uses 3 staging tables:

Bad Record Management
  The main data table: stores your data as well as the matching information after matching is performed, e.g. dqacustomer
  The issue table: must have the name of the main data table suffixed with _issue, e.g. dqacustomer_Issue. This table stores the issue identified per field

Consolidation
  The duplicate record table: holds the duplicate record clusters for consolidation

Within each table there are certain columns that must exist and are reserved for use internally by the DQA

140

Bad Record Table


Data Quality Assistant (DQA) allows users to review and correct exceptions Audit trail of manual changes

141

Duplicate Records
DQA allows for manual record consolidation after duplicates are detected Audit trail of manual changes

142

Business User - Manage Bad Records

143

Business User - Record Consolidation

144

PowerCenter Integration

145

Integration with PowerCenter


PowerCenter 8.6/8.6.1 or 9.0.1
  Deployment to PC for:
    Performance
    Scalability
    Connectivity
    Batch access
    Web Services
    DQ as part of the ETL process

Informatica Developer 9.0.1
  Informatica Developer objects exportable to the PC repository: Mappings, Mapplets, Data Object Read Maps
  Executed natively within PC
  No requirement to install PC Designer on the same machine as the 9.0.1 Developer
146

Export Options

Choose PC domain and repository
Export: to file OR to the PC repository
Export mappings: as mappings OR as mapplets
147

Export Reference Data


Defined content dependencies are identified at export. RTM tables are converted to flat files.

148

DQ/PC Integration Installation

PowerCenter 9.0.1

PowerCenter 8.6/8.6.1

IDQ 9.0.1

PowerCenter 9.0.1: no separate integration installer; all required files are placed by the 9.0.1 universal installer.
PowerCenter 8.6/8.6.1: DQ/PC Integration installers are required on both the Client and the Server side.

149

Content

150

What makes up OOTB Content?


Mapplets - Snippets of DQ functionality used by the Developer
Rules - Mapplets that have been validated as Rules for the Analyst to consume
Reference Tables - Reference data used in mapplets, rules, and mappings
Address Validation data - Subscription data used with the Address Validator transformation
Identity Populations - Contain metadata on types of personal, household, and corporate identity, including algorithms that apply the metadata to input data

151

Pre-Built Mapplets and Rules

152

Pre-Built Reference Tables

153

Add OOTB rules to Mappings

154

Address Validation Data

155

Identity Populations
Populations need to be installed Parameter Sets are pre-populated in the Match transformation

156

Installation Tips and Tricks


Client and Server Install
Client install has to be done first
Imports the mapplets

Server install has to be done second


Installs the content

Content is Database Specific

IN_901_Content_InstallationGuide.pdf

157

IDQ 9.0.1 Migration 8.6.2 to 9.0.1

158

Why is it Called Migration?

Migrate and convert all user content to implement DQ logic designed in an 8.6.2 environment in a 9.0.1 environment.

159

Why is it Called Migration?


Why isn't it called Upgrade?
  Significant changes to components
  Significant change from Dictionaries to Reference Tables
  Significant change in moving Plans from one architecture to another

160

Overview
Version Differences
8.6.2
  One repository per user
  Reference data on the local file system
  Data quality metadata contained in the IDQ Plan
  Connection details embedded within the IDQ Plan

9.0.1
  Central repository shared by all users
  Reference data in the Reference Table Manager
  Data quality metadata in 9.0.1 models
  Connection details stored centrally

161

Domain

162

Informatica 9 Architecture for IDQ


(Architecture diagram: Informatica Analyst connects over http(s) to the Analyst Service, and Informatica Developer connects to the Model Repository Service and the Data Integration Service (Profile Service, Mapping Service, SQL Service), which uses the Profile Warehouse. The domain also includes Informatica Administrator, the Model Repository, and the Domain Repository.)

Informatica 9 Architecture for IDQ & PC


(Architecture diagram: the same IDQ services as above, plus the PowerCenter Integration Service and Repository Service in the same domain, the PowerCenter clients - Workflow Manager, Designer, Repository Manager, Monitor - and the PC Repository.)

Informatica Domain
The Informatica domain includes objects and services for the Informatica platform. The Admin Console is now known as Administrator. The Informatica domain includes services for PowerExchange, Informatica Analyst, and Informatica Developer.

165

Informatica Domain
IDQ Migration
  Direct migration from 8.6.2 to 9.0.1
  Direct upgrade from 9.0 to 9.0.1
  To migrate pre-8.6.2 installations you must first upgrade to IDQ 8.6.2, then migrate to 9.0.1

Security
The Informatica 9 platform provides full READ, WRITE, EXECUTE and GRANT permissions for domain connection objects.
Support for MS SQL Server Trusted connection for hosting the domain repository (MRS).
Ability to set and enforce permissions for all services and folders in the domain.

166

New Services
Analyst Service
Application service that runs Informatica Analyst in the Informatica domain. Create and enable an Analyst Service on the Domain tab of Informatica Administrator. When you enable the Analyst Service, the Service Manager starts Informatica Analyst. You can open Informatica Analyst from Informatica Administrator.

Model Repository Service


Application service that manages the Model repository. The Model repository is a relational database that stores the metadata for projects created in Informatica Analyst and Informatica Developer. The Model repository also stores run-time and configuration information for applications deployed to a Data Integration Service. Create and enable a Model Repository Service on the Domain tab of Informatica Administrator.

167

Migrating the Repository and Dictionaries

168

Steps for Migration


1. ClientPackage - on the IDQ 8.6.2 client, a single-step process to:
  Export IDQ plans from the IDQ repository
  Identify connection details
  Gather local dictionaries
  Package data for the next step

2. ServerImport - on the 9.0.1 Server, a single process to:
  Unpack data from ClientPackage
  Create connections
  Import dictionary data into the Reference Table Manager
  Convert Plans to 9.0.1 mapping XML

3. XML Import - on the 9.0.1 Client:
  Import mapping XML from ServerImport into the 9.0.1 repository via Developer

169

ClientPackage Overview
Export IDQ plans from the IDQ repository
Identify connection details
Gather local dictionaries
Package data for the next step - ServerImport

170

ClientPackage - Report
Default Location:
<MigrationPackageLocation>/Package/PackageReport.html

Identifies dictionaries used by plans, and dictionaries that exist but are not used by any plan
Database connections used by plans - one entry for every DSN/Username/Password combination

171

ServerImport Overview
Unpack data from ClientPackage
Create connections
Import dictionary data into the Reference Table Manager
Convert 8.6.2 Plans to 9.0.1 Mapping XML

172

Steps to perform before ServerImport


Create a new blank project for the mappings to be imported into
Create a new folder for imported reference tables
Install the Informatica Content packages in a shared project

173

ServerImport Summary / Overview Report


Overall status of conversion Links to detail / individual reports Default location
<MigrationPackageLocation>/migration_reports

174

ServerImport Detail Reports


One Detail report per 8.6.2 plan/9.0.1 mapping Component / Port level detail Includes warnings / errors Default location
<MigrationPackageLocation>/migration_reports

175

Client XML Import Overview


Import mapping XML generated through ServerImport into 9.0.1 repository
Through Informatica Developer, or through infacmd

Default location for XML file:


<MigrationPackageLocation>/Output/MigratedMappings.xml

176

XML Import via Developer

177

Imported Mappings

Imported Dictionaries

Imported Plan

178

Tips and Tricks - General


Migration packages require Java 1.6 or later to be installed
  e.g. C:\Informatica\9.0.1\Java\bin

Zip files generated by ClientPackage are not editable in WinZip (or similar)
On a 64-bit client, manual export is required due to Java version incompatibility with the IDQ 8.6.2 32-bit libraries
Dictionaries from the previous All World package are not automatically recognized as Informatica dictionaries
179

Post-Migration Notes
Incompatible components may require editing the Plan in 9.0.1
Address Validation components will require editing in 9.0.1
  e.g. QAS and Melissa have been replaced with Address Doctor

IDQ8.6.2 Connections that source or target MySQL will have to be edited by hand

180

Logging and Logs in IDQ v9

181

Logs
The purpose of this section is to identify the logs populated by Informatica 9 IDQ (Informatica Data Quality): what logs exist, where they are located, and what their main purposes are. Armed with this information, the user will be able to quickly identify issues during the installation process and with day-to-day operation. The user will also be able to identify areas requiring periodic maintenance (i.e. log removal).

182

Installation Logs
Server, Client and Content installation logs are located mostly in the root installation directory. On Windows, the default is C:\informatica\9.0.1; for the rest of the document, it will be referred to as <Informatica install dir>. There are two logs for each installation: one shows the commands executed and the other shows the output of the installation. For debugging purposes, you will need to look at the InstallLog files.
183

Installation Logs: Client, Server and Content


All of these look the same - look for the Summary information:

Summary
-------
Installation: Successful.
18 Successes
0 Warnings
0 NonFatalErrors
0 FatalErrors

184

Additional Content Installation Logs


There are also content installation log files located at <Informatica install dir>\Content_Install_Logs

185

Day to Day Operations


Initial errors when starting up
  If the services don't start when initially brought up, look here: <Informatica install dir>\tomcat\logs
  There are two logs of interest: exceptions.log and catalina.out.

186

Day to Day Operations


Catalina.out and Exceptions.log
  While the services are up and running, these files are locked.
  Catalina.out has messages about the errors found when the domain starts.
  Exceptions.log has messages referring to what happens after the domain has come up, such as the status of gateway elections; it is found at <Informatica install dir>\tomcat\logs

187

Day to Day Operations - Analyst


When creating a physical object, the Analyst tool uses the data integration service. As it performs the task, it adds entries to the Data Integration Service (DIS) Logs located at: <Informatica install dir>\tomcat\temp\DIS\logs The logs are dated

188

Day to Day Operations - Analyst


Keep this area in mind, because this is one of the areas that will eventually need to be cleaned up. The Analyst Tool log (analyst.log) can be found at <Informatica install dir>\tomcat\logs\as

189

Day to Day Operations Profiling Logs


There are two logs created for each profiling job in <Informatica install dir>\tomcat\bin\disLogs\profiling. There is a summary log, which just tells you the mappings completed, and a detail log with details such as what tables were updated in the profiling warehouse, but not a lot of detail about the profile itself. Live drill down and export of profiling results will also create log files here.

190

Day to Day Operations Profiling Logs


These logs can and should be moved to a location that is more accessible to the general community. A directory containing a software installation is usually inaccessible to the general user community, so a more logical place than <Informatica install dir>\tomcat\bin\disLogs would help people find them.

191

Day to Day Operations Profiling Logs


The location can be configured in the admin console:

The temp logs can also be configured somewhere else.

192

Day to Day Operations Profiling Logs


When you do mid-stream profiling, it creates a log in this directory, but the logs are not accessible from the client tool. This is true for any profiling operation (from Dxt (Developer) or AT (Analyst Tool)).

193

Day to Day Operations MRS Logs


When the service is initially brought up, an MRS log is started at <Informatica install dir>\tomcat\logs\MRS. Also, when you connect to the MRS with the client, the attempt and its success are recorded here. While the services are up, this file is locked.

194

Day to Day Operations Mapping Service Logs


The mapping service logs are a little more helpful when looking for errors in a mapping (remember that profiling is done by a mapping). Among other things, they can confirm that the file was read without errors. They can be found at <Informatica install dir>\tomcat\bin\disLogs\ms. This is another area that will need occasional maintenance.
195

Day to Day Operations Mapping Service Logs


Anything you do in the client with regard to a mapping will update these logs. They are also accessible from the client. A simple Run Data Viewer action produced this log, and it was accessed via the client by double-clicking on the Show Logs icon.

196

Day to Day Operations Mapping Service Logs


When you run a mapping, you can view the logs by clicking here

Once you view the log and close it, it is no longer accessible via the client. You would need to go to the <Informatica install dir>\tomcat\bin\disLogs\ms directory and view it there.

197

Day to Day Operations Other Logs


Reference Table Manager Command Line Interface (CLI) logs: The Reference Table Manager CLI logs can be found at <Informatica install dir>\server\bin\rtm_cli_logs. They are generated when reference tables are imported. Import / Export logs: You can find some import/export logs at <Informatica install dir>\clients\DeveloperClient\infacmd\rtm_cli_logs

198

ESG Additional Training


Our classes are available:
  On-site at your company location
  Virtual Academy on-line, including conference calling
  Public classes at our training sites throughout the world

IDQ 9.0.1 - 4 days
IDQ Migration - 1 day

List of classes and dates available are at:

www.informatica.com
Products & Services tab

199
