
Informatica® Big Data Management

10.2.2

Integration Guide
Informatica Big Data Management Integration Guide
10.2.2
February 2019
© Copyright Informatica LLC 2014, 2019

This software and documentation are provided only under a separate license agreement containing restrictions on use and disclosure. No part of this document may be
reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC.

U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial
computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such,
the use, duplication, disclosure, modification, and adaptation is subject to the restrictions and license terms set forth in the applicable Government contract, and, to the
extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License.

Informatica, the Informatica logo [and any other Informatica-owned trademarks appearing in the document] are trademarks or registered trademarks of Informatica
LLC in the United States and many jurisdictions throughout the world. A current list of Informatica trademarks is available on the web at https://www.informatica.com/
trademarks.html. Other company and product names may be trade names or trademarks of their respective owners.

Portions of this software and/or documentation are subject to copyright held by third parties. Required third party notices are included with the product.

The information in this documentation is subject to change without notice. If you find any problems in this documentation, report them to us at
[email protected].

Informatica products are warranted according to the terms and conditions of the agreements under which they are provided. INFORMATICA PROVIDES THE
INFORMATION IN THIS DOCUMENT "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.

Publication Date: 2019-02-25


Table of Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Informatica Resources. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Informatica Network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Informatica Knowledge Base. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Informatica Documentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Informatica Product Availability Matrices. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Informatica Velocity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Informatica Marketplace. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Informatica Global Customer Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

Part I: Hadoop Integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

Chapter 1: Introduction to Hadoop Integration. . . . . . . . . . . . . . . . . . . . . . . . . . 12


Hadoop Integration Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
Big Data Management Component Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Hadoop Integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Clients and Tools. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Application Services. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Repositories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Integration with Other Informatica Products. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

Chapter 2: Before You Begin. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16


Read the Release Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Verify System Requirements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Verify Product Installations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Verify HDFS Disk Space. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Verify the Hadoop Distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Verify Port Requirements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Uninstall Big Data Management. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
Uninstall for Amazon EMR, Azure HDInsight, and MapR. . . . . . . . . . . . . . . . . . . . . . . . . . 19
Uninstall for Cloudera CDH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Uninstall for Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Prepare Directories, Users, and Permissions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Verify and Create Users. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Grant Permissions to an Azure Active Directory User . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Create a Cluster Staging Directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Grant Permissions on the Hive Warehouse Directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Create a Hive Staging Directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Create a Spark Staging Directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Create a Sqoop Staging Directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

Create Blaze Engine Directories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Edit the hosts File for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Create a Reject File Directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Create a Proxy Directory for MapR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Configure Access to Secure Hadoop Clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Configure the Metadata Access Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Configure the Data Integration Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Download the Informatica Server Binaries for the Hadoop Environment. . . . . . . . . . . . . . . . 27
Configure Data Integration Service Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Prepare a Python Installation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Edit the hosts File for Access to Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

Chapter 3: Amazon EMR Integration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32


Amazon EMR Task Flows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
Task Flow to Integrate with Amazon EMR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Task Flow to Upgrade from Version 10.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Task Flow to Upgrade from Version 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Task Flow to Upgrade from a Version Earlier than 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Prepare for Cluster Import from Amazon EMR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Configure *-site.xml Files for Amazon EMR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Prepare the Archive File for Amazon EMR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
Create a Cluster Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Importing a Hadoop Cluster Configuration from a File. . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Verify or Refresh the Cluster Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Verify JDBC Drivers for Sqoop Connectivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Verify Design-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Verify Run-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Configure the Files for Hive Tables on S3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Setting S3 Access Policies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Step 1. Identify the S3 Access Policy Elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Step 2. Optionally Copy an Existing S3 Access Policy as a Template. . . . . . . . . . . . . . . . . . 47
Step 3. Create or Edit an S3 Access Policy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Configure the Developer Tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Configure developerCore.ini. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Complete Upgrade Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Update Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Update Streaming Objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Chapter 4: Azure HDInsight Integration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . 53


Azure HDInsight Task Flows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Task Flow to Integrate with Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Task Flow to Upgrade from Version 10.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Task Flow to Upgrade from Version 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Task Flow to Upgrade from a Version Earlier than 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Prepare for Cluster Import from Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Configure *-site.xml Files for Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Prepare for Direct Import from Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Prepare the Archive File for Import from Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . 63
Create a Cluster Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Before You Import. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Importing a Hadoop Cluster Configuration from the Cluster. . . . . . . . . . . . . . . . . . . . . . . . 64
Importing a Hadoop Cluster Configuration from a File. . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Verify or Refresh the Cluster Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Verify JDBC Drivers for Sqoop Connectivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Verify Design-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Verify Run-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Configure the Developer Tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Configure developerCore.ini. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Complete Upgrade Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Update Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Update Streaming Objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

Chapter 5: Cloudera CDH Integration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . 74


Cloudera CDH Task Flows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Task Flow to Integrate with Cloudera CDH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Task Flow to Upgrade from Version 10.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Task Flow to Upgrade from Version 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Task Flow to Upgrade from a Version Earlier than 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Prepare for Cluster Import from Cloudera CDH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Configure *-site.xml Files for Cloudera CDH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Prepare for Direct Import from Cloudera CDH. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Prepare the Archive File for Import from Cloudera CDH. . . . . . . . . . . . . . . . . . . . . . . . . . 83
Create a Cluster Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Before You Import. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
Importing a Hadoop Cluster Configuration from the Cluster. . . . . . . . . . . . . . . . . . . . . . . . 84
Importing a Hadoop Cluster Configuration from a File. . . . . . . . . . . . . . . . . . . . . . . . . . . 86
Verify or Refresh the Cluster Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Verify JDBC Drivers for Sqoop Connectivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Verify Design-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Verify Run-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
Import Security Certificates to Clients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Configure the Developer Tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Configure developerCore.ini. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Complete Upgrade Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Update Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
Update Streaming Objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

Chapter 6: Hortonworks HDP Integration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . 94
Hortonworks HDP Task Flows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Task Flow to Integrate with Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Task Flow to Upgrade from Version 10.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Task Flow to Upgrade from Version 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Task Flow to Upgrade from a Version Earlier than 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Prepare for Cluster Import from Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Configure *-site.xml Files for Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Prepare for Direct Import from Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Prepare the Archive File for Import from Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . 103
Create a Cluster Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Before You Import. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Importing a Hadoop Cluster Configuration from the Cluster. . . . . . . . . . . . . . . . . . . . . . . 105
Importing a Hadoop Cluster Configuration from a File. . . . . . . . . . . . . . . . . . . . . . . . . . 106
Verify or Refresh the Cluster Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Verify JDBC Drivers for Sqoop Connectivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Verify Design-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Verify Run-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Import Security Certificates to Clients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Configure the Developer Tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Configure developerCore.ini. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Complete Upgrade Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Update Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Update Streaming Objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

Chapter 7: MapR Integration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115


MapR Task Flows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Task Flow to Integrate with MapR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Task Flow to Upgrade from Version 10.2.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Task Flow to Upgrade from Version 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Task Flow to Upgrade from a Version Earlier than 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . 119
Install and Configure the MapR Client . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Prepare for Cluster Import from MapR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
Configure *-site.xml Files for MapR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Prepare the Archive File for Import from MapR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
Create a Cluster Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Importing a Hadoop Cluster Configuration from a File. . . . . . . . . . . . . . . . . . . . . . . . . . 125
Verify or Refresh the Cluster Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Verify JDBC Drivers for Sqoop Connectivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Verify Design-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
Verify Run-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Generate MapR Tickets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

Generate Tickets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Configure the Data Integration Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Configure the Metadata Access Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Configure the Analyst Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Configure the Developer Tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Configure developerCore.ini. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Complete Upgrade Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Update Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

Part II: Databricks Integration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

Chapter 8: Introduction to Databricks Integration. . . . . . . . . . . . . . . . . . . . . . 136


Databricks Integration Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Run-time Process on the Databricks Spark Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Native Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Databricks Environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Databricks Integration Task Flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

Chapter 9: Before You Begin Databricks Integration. . . . . . . . . . . . . . . . . . . . 140


Read the Release Notes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Verify System Requirements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Configure Preemption for Concurrent Jobs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Configure Storage Access. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Create a Staging Directory for Binary Archive Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Create a Staging Directory for Run-time Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Prepare for Token Authentication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
Configure the Data Integration Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

Chapter 10: Databricks Integration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144


Create a Databricks Cluster Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Importing a Databricks Cluster Configuration from the Cluster. . . . . . . . . . . . . . . . . . . . . 144
Importing a Databricks Cluster Configuration from a File. . . . . . . . . . . . . . . . . . . . . . . . . 145
Configure the Databricks Connection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

Appendix A: Connections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148


Connections. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Cloud Provisioning Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
AWS Cloud Provisioning Configuration Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Azure Cloud Provisioning Configuration Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Databricks Cloud Provisioning Configuration Properties. . . . . . . . . . . . . . . . . . . . . . . . . 154
Amazon Redshift Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Amazon S3 Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
Cassandra Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

Databricks Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Google Analytics Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Google BigQuery Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Google Cloud Spanner Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Google Cloud Storage Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Hadoop Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Hadoop Cluster Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Common Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Reject Directory Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Hive Pushdown Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Blaze Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Spark Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
HDFS Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
HBase Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
HBase Connection Properties for MapR-DB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Hive Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
JDBC Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Sqoop Connection-Level Arguments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Kafka Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Microsoft Azure Blob Storage Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Microsoft Azure Cosmos DB SQL API Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . 182
Microsoft Azure Data Lake Store Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Microsoft Azure SQL Data Warehouse Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . 184
Snowflake Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Creating a Connection to Access Sources or Targets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Creating a Hadoop Connection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Configuring Hadoop Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Cluster Environment Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Cluster Library Path. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Common Advanced Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Blaze Engine Advanced Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Spark Advanced Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190

Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

Preface
The Informatica Big Data Management™ Integration Guide is written for the system administrator who is
responsible for integrating the native environment of the Informatica domain with a non-native environment,
such as Hadoop or Databricks. This guide contains instructions to integrate the Informatica and non-native
environments.

Integration tasks are required on the Hadoop cluster, the Data Integration Service machine, and the Developer
tool machine. As a result, this guide contains tasks for administrators of the non-native environments,
Informatica administrators, and Informatica mapping developers. Tasks required by the Hadoop or
Databricks administrator are directed to the administrator.

Use this guide for new integrations and for upgrades. The instructions follow the same task flow for both. Tasks
that apply only to an upgrade are identified as upgrade tasks.

Informatica Resources
Informatica provides you with a range of product resources through the Informatica Network and other online
portals. Use the resources to get the most from your Informatica products and solutions and to learn from
other Informatica users and subject matter experts.

Informatica Network
The Informatica Network is the gateway to many resources, including the Informatica Knowledge Base and
Informatica Global Customer Support. To enter the Informatica Network, visit
https://network.informatica.com.

As an Informatica Network member, you have the following options:

• Search the Knowledge Base for product resources.


• View product availability information.
• Create and review your support cases.
• Find your local Informatica User Group Network and collaborate with your peers.

Informatica Knowledge Base


Use the Informatica Knowledge Base to find product resources such as how-to articles, best practices, video
tutorials, and answers to frequently asked questions.

To search the Knowledge Base, visit https://search.informatica.com. If you have questions, comments, or
ideas about the Knowledge Base, contact the Informatica Knowledge Base team at
[email protected].

Informatica Documentation
Use the Informatica Documentation Portal to explore an extensive library of documentation for current and
recent product releases. To explore the Documentation Portal, visit https://docs.informatica.com.

Informatica maintains documentation for many products on the Informatica Knowledge Base in addition to
the Documentation Portal. If you cannot find documentation for your product or product version on the
Documentation Portal, search the Knowledge Base at https://search.informatica.com.

If you have questions, comments, or ideas about the product documentation, contact the Informatica
Documentation team at [email protected].

Informatica Product Availability Matrices


Product Availability Matrices (PAMs) indicate the versions of the operating systems, databases, and types of
data sources and targets that a product release supports. You can browse the Informatica PAMs at
https://network.informatica.com/community/informatica-network/product-availability-matrices.

Informatica Velocity
Informatica Velocity is a collection of tips and best practices developed by Informatica Professional Services
and based on real-world experiences from hundreds of data management projects. Informatica Velocity
represents the collective knowledge of Informatica consultants who work with organizations around the
world to plan, develop, deploy, and maintain successful data management solutions.

You can find Informatica Velocity resources at http://velocity.informatica.com. If you have questions,
comments, or ideas about Informatica Velocity, contact Informatica Professional Services at
[email protected].

Informatica Marketplace
The Informatica Marketplace is a forum where you can find solutions that extend and enhance your
Informatica implementations. Leverage any of the hundreds of solutions from Informatica developers and
partners on the Marketplace to improve your productivity and speed up time to implementation on your
projects. You can find the Informatica Marketplace at https://marketplace.informatica.com.

Informatica Global Customer Support


You can contact a Global Support Center by telephone or through the Informatica Network.

To find your local Informatica Global Customer Support telephone number, visit the Informatica website at
the following link:
https://www.informatica.com/services-and-training/customer-success-services/contact-us.html.

To find online support resources on the Informatica Network, visit https://network.informatica.com and
select the eSupport option.

Part I: Hadoop Integration
This part contains the following chapters:

• Introduction to Hadoop Integration, 12


• Before You Begin, 16
• Amazon EMR Integration Tasks, 32
• Azure HDInsight Integration Tasks, 53
• Cloudera CDH Integration Tasks, 74
• Hortonworks HDP Integration Tasks, 94
• MapR Integration Tasks, 115

Chapter 1

Introduction to Hadoop
Integration
This chapter includes the following topics:

• Hadoop Integration Overview, 12


• Big Data Management Component Architecture, 13
• Integration with Other Informatica Products, 15

Hadoop Integration Overview


You can integrate the Informatica domain with the Hadoop cluster through Big Data Management.

The Data Integration Service automatically installs the Hadoop binaries to integrate the Informatica domain
with the Hadoop environment. The integration requires Informatica connection objects and cluster
configurations. A cluster configuration is a domain object that contains configuration parameters that you
import from the Hadoop cluster. You then associate the cluster configuration with connections to access the
Hadoop environment.

Perform the following tasks to integrate the Informatica domain with the Hadoop environment:

1. Install or upgrade to the current Informatica version.


2. Perform pre-import tasks, such as verifying system requirements and user permissions.
3. Import the cluster configuration into the domain. The cluster configuration contains properties from the
*-site.xml files on the cluster.
4. Create a Hadoop connection and other connections to run mappings within the Hadoop environment.
5. Perform post-import tasks specific to the Hadoop distribution that you integrate with.

When you run a mapping, the Data Integration Service checks for the binary files on the cluster. If they do not
exist or if they are not synchronized, the Data Integration Service prepares the files for transfer. It transfers
the files to the distributed cache through the Informatica Hadoop staging directory on HDFS. By default, the
staging directory is /tmp. This transfer process replaces the requirement to install distribution packages on
the Hadoop cluster.

Big Data Management Component Architecture
The Big Data Management components include client tools, application services, repositories, and third-party
tools that Big Data Management uses for a big data project. The specific components involved depend on the
task you perform.

Hadoop Integration
Big Data Management can connect to clusters that run different Hadoop distributions. Hadoop is an open-
source software framework that enables distributed processing of large data sets across clusters of
machines. You might also need to use third-party software clients to set up and manage your Hadoop cluster.

Big Data Management can connect to the supported data source in the Hadoop environment, such as HDFS,
HBase, or Hive, and push job processing to the Hadoop cluster. To enable high performance access to files
across the cluster, you can connect to an HDFS source. You can also connect to a Hive source, which is a
data warehouse that connects to HDFS.

It can also connect to NoSQL databases such as HBase, which is a database of key-value pairs on
Hadoop that performs operations in real time. The Data Integration Service can push mapping jobs to the
Spark or Blaze engine, and it can push profile jobs to the Blaze engine in the Hadoop environment.

Big Data Management supports more than one version of some Hadoop distributions. By default, the cluster
configuration wizard populates the latest supported version.

Clients and Tools


Based on your product license, you can use multiple Informatica tools and clients to manage big data
projects.

Use the following tools to manage big data projects:


Informatica Administrator

Monitor the status of profile, mapping, and MDM Big Data Relationship Management jobs on the
Monitoring tab of the Administrator tool. The Monitoring tab of the Administrator tool is called the
Monitoring tool. You can also design a Vibe Data Stream workflow in the Administrator tool.

Informatica Analyst

Create and run profiles on big data sources, and create mapping specifications to collaborate on
projects and define business logic that populates a big data target with data.

Informatica Developer

Create and run profiles against big data sources, and run mappings and workflows on the Hadoop
cluster from the Developer tool.

Application Services
Big Data Management uses application services in the Informatica domain to process data.

Use the Administrator tool to create connections, monitor jobs, and manage application services that Big
Data Management uses.

Big Data Management uses the following application services:

Analyst Service

The Analyst Service runs the Analyst tool in the Informatica domain. The Analyst Service manages the
connections between service components and the users that have access to the Analyst tool.

Data Integration Service

The Data Integration Service can process mappings in the native environment or push the mapping for
processing to a compute cluster in a non-native environment. The Data Integration Service also retrieves
metadata from the Model repository when you run a Developer tool mapping or workflow. The Analyst
tool and Developer tool connect to the Data Integration Service to run profile jobs and store profile
results in the profiling warehouse.

Mass Ingestion Service

The Mass Ingestion Service manages and validates mass ingestion specifications that you create in the
Mass Ingestion tool. The Mass Ingestion Service deploys specifications to the Data Integration Service.
When a specification runs, the Mass Ingestion Service generates ingestion statistics.

Metadata Access Service

The Metadata Access Service allows the Developer tool to import and preview metadata from a Hadoop
cluster.

The Metadata Access Service contains information about the Service Principal Name (SPN) and keytab
information if the Hadoop cluster uses Kerberos authentication. You can create one or more Metadata
Access Services on a node. Based on your license, the Metadata Access Service can be highly available.

HBase, HDFS, Hive, and MapR-DB connections use the Metadata Access Service when you import an
object from a Hadoop cluster. Create and configure a Metadata Access Service before you create HBase,
HDFS, Hive, and MapR-DB connections.

Model Repository Service

The Model Repository Service manages the Model repository. The Model Repository Service connects to
the Model repository when you run a mapping, mapping specification, profile, or workflow.

REST Operations Hub

The REST Operations Hub Service is an application service in the Informatica domain that exposes
Informatica product functionality to external clients through REST APIs.

Repositories
Big Data Management uses repositories and other databases to store data related to connections, source
metadata, data domains, data profiling, data masking, and data lineage. Big Data Management uses
application services in the Informatica domain to access data in repositories.

Big Data Management uses the following databases:

Model repository

The Model repository stores profiles, data domains, mappings, and workflows that you manage in the
Developer tool. The Model repository also stores profiles, data domains, and mapping specifications that
you manage in the Analyst tool.

Profiling warehouse

The Data Integration Service runs profiles and stores profile results in the profiling warehouse.

Integration with Other Informatica Products
To expand functionality and to process data more efficiently, you can use Big Data Management in
conjunction with other Informatica products.

Big Data Management integrates with the following Informatica products:

• PowerExchange adapters. Connect to data sources through adapters.


• Enterprise Data Catalog. Perform data lineage analysis for big data sources and targets.
• Enterprise Data Lake. Discover raw data and publish it in a lake as a Hive table.
• Data Quality. Perform address validation and data discovery.
• Data Replication. Replicate change data to a Hadoop Distributed File System (HDFS).
• Data Transformation. Process complex file sources from the Hadoop environment.
• Big Data Streaming. Stream data as messages, and process it as it becomes available.
• Edge Data Streaming. Collect and ingest data in real time to a Kafka queue.
• Dynamic Data Masking. Mask or prevent access to sensitive data.

Chapter 2

Before You Begin


This chapter includes the following topics:

• Read the Release Notes, 16


• Verify System Requirements, 16
• Uninstall Big Data Management, 18
• Prepare Directories, Users, and Permissions, 21
• Configure Access to Secure Hadoop Clusters, 26
• Configure the Metadata Access Service, 26
• Configure the Data Integration Service, 27

Read the Release Notes


Read the Release Notes for updates to the installation and upgrade process. You can also find information
about known and fixed limitations for the release.

Verify System Requirements


Verify that your environment meets the minimum system requirements for the installation process, disk
space requirements, port availability, and third-party software.

For more information about product requirements and supported platforms, see the Product Availability
Matrix on Informatica Network:
https://network.informatica.com/community/informatica-network/product-availability-matrices

Verify Product Installations


Before you begin the Big Data Management integration between the domain and Hadoop environments, verify
that Informatica and third-party products are installed.

You must install the following products:

Informatica domain and clients

Install and configure the Informatica domain and the Developer tool. The Informatica domain must have
a Model Repository Service, a Data Integration Service, and a Metadata Access Service.

Hadoop File System and MapReduce

The Hadoop installation must include a Hive data warehouse with a non-embedded database for the
Hive metastore. Verify that Hadoop is installed with Hadoop File System (HDFS) and MapReduce on
each node. Install Hadoop in a single node environment or in a cluster. For more information, see the
Apache website: http://hadoop.apache.org.

Database client software

To access relational databases in the Hadoop environment, install database client software and drivers
on each node in the cluster.

Verify HDFS Disk Space


When the Data Integration Service integrates the domain with the Hadoop cluster, it uploads the Informatica
binaries to HDFS.

Verify with the Hadoop administrator that the distributed cache has at least 1.5 GB of free disk space.
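
One way to check the free space is to run the following command from a machine with an HDFS client. The
path /tmp is the default Informatica staging directory and is only an example; substitute your staging path if
it differs:
hadoop fs -df -h /tmp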

Verify the Hadoop Distribution


Verify the version of the Hadoop distribution in the Hadoop environment.

The following table lists the supported distribution versions:

Distribution        Version

Amazon EMR          5.16

Azure HDInsight     3.6.x

Cloudera CDH        5.15, 5.16

Hortonworks HDP     2.6.x

MapR                6.0.x MEP 5.0
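
If you are not sure which version a cluster node runs, one quick check is the Hadoop client itself. This is a
sketch rather than an authoritative check; the cluster management console for your distribution reports the
exact version:
hadoop version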

Verify Port Requirements


Open a range of ports to enable the Informatica domain to communicate with the Hadoop cluster and the
distribution engine.

To ensure access to ports, the network administrator needs to complete additional tasks in the following
situations:

• The Hadoop cluster is behind a firewall. Work with the network administrator to open a range of ports that
a distribution engine uses.
• The Hadoop environment uses Azure HDInsight. Work with the network administrator to enable VPN
between the Informatica domain and the Azure cloud network.

The following table lists the ports to open:

Port Description

7180 Cluster management web app for Cloudera. Required for Cloudera only.

8020 NameNode RPC. Required for all supported distributions except MapR.

8032 ResourceManager. Required for all distributions.

8080 Cluster management web app. Used by distributions that use Ambari to manage the cluster: Azure
HDInsight and Hortonworks HDP.

8088 Resource Manager web app. Required for all distributions.

8443 MapR control system. Required for MapR only.

9080 Blaze monitoring console. Required for all distributions if you run mappings using Blaze.

9083 Hive metastore. Required for all distributions.

12300 to 12600 Default port range for the Blaze distribution engine. A port range is required for all distributions if
you run mappings using Blaze.

19888 YARN JobHistory server webapp. Optional for all distributions.

50070 HDFS Namenode HTTP. Required for all distributions.

Note: The network administrators must ensure that the port used by the Metadata Access Service is
accessible from the cluster nodes.

Spark Engine Monitoring Port


Spark engine monitoring requires the cluster nodes to communicate with the Data Integration Service over a
socket. The Data Integration Service picks the socket port randomly from the port range configured for the
domain. You can view the port range in the advanced properties of the primary node. By default, the minimum
port number is 12000 and the maximum port number is 13000. The network administrators must ensure that
the port range is accessible from the cluster nodes to the Data Integration Service. If the administrators
cannot provide access to the port range, you can configure the Data Integration Service to use a fixed port with
the SparkMonitoringPort custom property. The network administrator must ensure that the configured port is
accessible from the cluster nodes to the Data Integration Service.
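
Before you run mappings, you can optionally test connectivity with a utility such as nc. The host names and
ports in the following sketch are placeholders; substitute the actual cluster hosts, the Data Integration Service
host, and the ports that apply to your distribution:
# From the Data Integration Service machine, verify that a cluster port is open.
nc -zv resourcemanager.example.com 8032
# From a cluster node, verify that a port in the Spark monitoring range on the Data Integration Service host is reachable.
nc -zv dis-host.example.com 12000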

Uninstall Big Data Management


If you are upgrading Big Data Management from a version earlier than 10.2 and have a previous version of
Big Data Management installed on the Hadoop environment, Informatica recommends that you uninstall the
previous version.

Perform this task in the following situation:

• You upgraded from version 10.1.1 or earlier.

Uninstall for Amazon EMR, Azure HDInsight, and MapR
Complete the following prerequisite tasks before you uninstall Big Data Management:

1. Verify that the Big Data Management administrator can run sudo commands.
2. If you are uninstalling Big Data Management in a cluster environment, configure the root user to use a
passwordless Secure Shell (SSH) connection between the machine where you want to run the Big Data
Management uninstall and all of the nodes where Big Data Management is installed.
3. If you are uninstalling Big Data Management in a cluster environment using the HadoopDataNodes file,
verify that the HadoopDataNodes file contains the IP address or machine host name of each node in the
Hadoop cluster from which you want to uninstall Big Data Management, with one entry per line. The
HadoopDataNodes file is located on the node from which you launch the uninstallation. A sample file
appears after this list.
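
The following sketch shows the expected format of the HadoopDataNodes file, with one node per line. The
host names and IP address are placeholders:
node01.cluster.example.com
node02.cluster.example.com
10.20.30.41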

Complete the following tasks to perform the uninstallation:

1. Log in to the machine as root user. The machine you log in to depends on the Big Data Management
environment and uninstallation method.
• To uninstall in a single node environment, log in to the machine on which Big Data Management is
installed.
• To uninstall in a cluster environment using the HADOOP_HOME environment variable, log in to the
primary name node.
• To uninstall in a cluster environment using the HadoopDataNodes file, log in to any node.
2. Run one of the following commands to start the uninstallation in console mode:
bash InformaticaHadoopInstall.sh
sh InformaticaHadoopInstall.sh
./InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Management terms of agreement.
4. Press Enter.
5. Select 3 to uninstall Big Data Management.
6. Press Enter.
7. Select the uninstallation option, depending on the Big Data Management environment:
• Select 1 to uninstall Big Data Management from a single node environment.
• Select 2 to uninstall Big Data Management from a cluster environment.
8. Press Enter.
9. If you are uninstalling Big Data Management in a cluster environment, select the uninstallation option,
depending on the uninstallation method:
• Select 1 to uninstall Big Data Management from the primary name node.
• Select 2 to uninstall Big Data Management using the HadoopDataNodes file.
10. Press Enter.
11. If you are uninstalling Big Data Management from a cluster environment from the primary name node,
type the absolute path for the Hadoop installation directory. Start the path with a slash.
The uninstaller deletes all of the Big Data Management binary files from the following directory: /<Big Data
Management installation directory>/Informatica
In a cluster environment, the uninstaller deletes the binary files from all nodes within the Hadoop cluster.

Uninstall for Cloudera CDH
Uninstall Big Data Management on Cloudera from the Cloudera Manager.

1. In Cloudera Manager, browse to Hosts > Parcels > Informatica.


2. Select Deactivate.
Cloudera Manager stops the Informatica Big Data Management instance.
3. Select Remove.
The cluster uninstalls Informatica Big Data Management.

Uninstall for Hortonworks HDP


To uninstall the stack deployment of Big Data Management, you use the Ambari configuration manager to
stop and deregister the Big Data Management service, and then perform manual removal of Informatica files
from the cluster.

1. In the Ambari configuration manager, select INFORMATICA BDM from the list of services.
2. Click the Service Actions dropdown menu and select Delete Service.
3. To confirm that you want to delete Informatica Big Data Management, perform the following steps:
a. In the Delete Service dialog box, click Delete.
b. In the Confirm Delete dialog box, type delete and then click Delete.
c. When the deletion process is complete, click OK.
Ambari stops the Big Data Management service and deletes it from the listing of available services.
To fully delete Big Data Management from the cluster, continue with the next steps.
4. In a command window, delete the INFORMATICABDM folder from the following directory on the name node
of the cluster: /var/lib/ambari-server/resources/stacks/<Hadoop distribution>/<Hadoop
version>/services/
5. Delete the INFORMATICABDM folder from the following location on all cluster nodes where it was
installed: /var/lib/ambari-agent/cache/stacks/<Hadoop distribution>/<Hadoop version>/
services
6. Perform the following steps to remove RPM binary files:
a. Run the following command to determine the name of the RPM binary archive:
rpm -qa |grep Informatica
b. Run the following command to remove RPM binary files:
rpm -ev <output_from_above_command>
For example:
rpm -ev InformaticaHadoop-10.1.1-1.x86_64

7. Repeat the previous step to remove RPM binary files from each cluster node.
8. Delete the following directory, if it exists, from the name node and each client node: /opt/Informatica/.
9. Repeat the last step on each cluster node where Big Data Management was installed.
10. On the name node, restart the Ambari server.
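
For example, if the Ambari server runs as its default service on the name node, you might restart it with the
following command; your environment might manage the service differently:
ambari-server restart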

Prepare Directories, Users, and Permissions
The Data Integration Service needs access to the Hadoop environment for integration and staging.

Prepare the following directories, users, and permissions:

• Informatica cluster staging directory


• Hive warehouse directory
• Hive staging directory
• Blaze engine directories
• Spark engine staging directory
• Reject file directory

Verify and Create Users


The Data Integration Service requires different users to access the Hadoop environment. Any user that you
create for an Azure HDInsight distribution must be an Azure Active Directory user. For other distributions, use
Linux users.

Hadoop impersonation user


Verify that every node on the cluster has an impersonation user that can be used in a Hadoop connection.
Create one if it does not exist. The Data Integration Service impersonates this user to run jobs in the Hadoop
environment.

The following distributions have additional requirements for the Hadoop impersonation user:

MapR distribution

If the MapR distribution uses Ticket or Kerberos authentication, the name must match the system user
that starts the Informatica daemon and the gid of the user must match the gid of the MapR user.

Azure HDInsight

If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the required
permissions. For the permissions, see “Grant Permissions to an Azure Active Directory User ” on page
22.

To run Sqoop mappings on the Spark engine, add the Hadoop impersonation user as a Linux user on the
machine that hosts the Data Integration Service.
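
For example, if the Hadoop connection impersonates a user named hadoop_user (a placeholder name), you
might create the matching Linux account on the Data Integration Service machine with commands such as
the following:
useradd hadoop_user
id hadoop_user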

Service principal name (SPN) for the Data Integration Service


If the cluster uses Kerberos authentication, verify that the SPN corresponding to the cluster keytab file
matches the name of the system user that starts the Informatica daemon.

Hadoop staging user


Optionally, create an HDFS user that performs operations on the cluster staging directory. If you do not
create a staging user, the Data Integration Service uses the operating system user that starts the Informatica
daemon.

If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the required
permissions. For the permissions, see “Grant Permissions to an Azure Active Directory User ” on page 22.

Blaze user
Optionally, create an operating system user account that the Blaze engine uses to write to staging and log
directories. If you do not create a Blaze user, the Data Integration Service uses the Hadoop impersonation
user.

If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the required
permissions. For the permissions, see “Grant Permissions to an Azure Active Directory User ” on page 22.

Operating system profile user


If operating system profiles are configured for the Data Integration Service, the Data Integration Service runs
jobs with permissions of the operating system user that you define in the profile. You can choose to use the
operating system profile user instead of the Hadoop impersonation users to run jobs in a Hadoop
environment.

If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the required
permissions. Users must be present in the Azure Active Directory that matches the name on the Data
Integration Service machine. For the permissions, see “Grant Permissions to an Azure Active Directory
User ” on page 22.

Mapping impersonation user


A mapping impersonation user is valid for the native run-time environment. Use mapping impersonation to
impersonate the Data Integration Service user that connects to Hive, HBase, or HDFS sources and targets
that use Kerberos authentication. Configure mapping impersonation in the Data Integration Service
properties and in the mapping properties. The mapping impersonation user uses the following format:
<Hadoop service name>/<host name>@<Kerberos realm>

If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the required
permissions. For the permissions, see “Grant Permissions to an Azure Active Directory User ” on page 22.

Grant Permissions to an Azure Active Directory User


If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the following
permissions to all the users:

• Execute permission on the root folder and its subfolders of the Azure Data Lake Storage account.
• Read and execute permissions on the following directory and its contents: /hdp/apps/<version>
• Read, write, and execute permissions on the following directories:

/tmp
/app-logs
/hive/warehouse
/blaze/workdir
/user
/var/log/hadoop-yarn/apps
/mr-history
/tezstaging
/mapreducestaging

Note: If the directories are not available, create the directories and grant the required permissions.



Create a Cluster Staging Directory
Optionally, create a directory on HDFS that the Data Integration Service uses to stage the Informatica binary
archive files.

By default, the Data Integration Service writes the files to the HDFS directory /tmp.

Grant permission to the Hadoop staging user. If you did not create a Hadoop staging user, the Data
Integration Service uses the operating system user that starts the Informatica daemon.
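For example, you might create the staging directory and grant ownership to the staging user with commands similar to the following. The directory name /informatica/staging and the user name infa_staging are examples, not required values:
hadoop fs -mkdir -p /informatica/staging
hadoop fs -chown infa_staging:infa_staging /informatica/staging
If you use a directory other than the default /tmp, update the Cluster Staging Directory property in the Data Integration Service properties to match the directory that you created.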

Grant Permissions on the Hive Warehouse Directory


Grant access to the absolute HDFS file path of the default database for the Hive warehouse.

Grant read and write permissions on the Hive warehouse directory. You can find the location of the
warehouse directory in the hive.metastore.warehouse.dir property of the hive-site.xml file. For example, the
default might be /user/hive/warehouse or /apps/hive/warehouse.

Grant permission to the Hadoop impersonation user. Optionally, you can assign 777 permissions on the
directory.
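For example, if the warehouse directory is the default /user/hive/warehouse, you might run one of the following commands. The path is an example; verify it against the hive.metastore.warehouse.dir property, and choose between changing ownership and assigning open permissions based on the cluster security requirements:
hadoop fs -chown -R <Hadoop impersonation user> /user/hive/warehouse
hadoop fs -chmod -R 777 /user/hive/warehouse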

Create a Hive Staging Directory


The Blaze and Spark engines require access to the Hive staging directory. You can use the default directory,
or you can create a directory on HDFS. For example, if you create a directory, you might run the following
command:
hadoop fs -mkdir /staging
If you use the default directory or create a directory, you must grant execute permission to the Hadoop
impersonation user and the mapping impersonation users.
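For example, the following command is one way to grant the required access to a named user on the /staging directory, assuming ACLs are enabled on HDFS. The user name is a placeholder for the Hadoop impersonation user or a mapping impersonation user:
hdfs dfs -setfacl -m user:<impersonation user>:rwx /staging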

Create a Spark Staging Directory


When the Spark engine runs a job, it stores temporary files in a staging directory.

Optionally, create a staging directory on HDFS for the Spark engine. For example:
hadoop fs -mkdir -p /spark/staging
If you want to write the logs to the Informatica Hadoop staging directory, you do not need to create a Spark
staging directory. By default, the Data Integration Service uses the HDFS directory /tmp/SPARK_<user name>.

Grant permission to the following users:

• Hadoop impersonation user


• SPN of the Data Integration Service
• Mapping impersonation users
Optionally, you can assign 777 permissions on the directory.
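For example, the following command assigns 777 permissions to the staging directory created in the earlier example:
hadoop fs -chmod -R 777 /spark/staging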

Create a Sqoop Staging Directory


When you run Sqoop jobs on the Spark engine, the Data Integration Service creates a Sqoop staging directory
named sqoop_staging within the Spark staging directory by default. You can configure the Spark staging
directory that you want to use in the Hadoop connection.



However, based on your processing requirements, you might need to create the directory manually and give
write permissions to the Hive super user. When you create the sqoop_staging directory manually, the Data
Integration Service uses this directory instead of creating another one.

Create a Sqoop staging directory named sqoop_staging manually in the following situations:

• You run a Sqoop pass-through mapping on the Spark engine to read data from a Sqoop source and write
data to a Hive target that uses the Text format.
• You use a Cloudera CDH cluster with Sentry authorization or a Hortonworks HDP cluster with Ranger
authorization.
After you create the sqoop_staging directory, you must add an Access Control List (ACL) for the
sqoop_staging directory and grant write permissions to the Hive super user. Run the following command on
the Cloudera CDH cluster or the Hortonworks HDP cluster to add an ACL for the sqoop_staging directory and
grant write permissions to the Hive super user:

hdfs dfs -setfacl -m default:user:hive:rwx /<Spark staging directory>/sqoop_staging/

For information about Sentry authorization, see the Cloudera documentation. For information about Ranger
authorization, see the Hortonworks documentation.
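After you add the ACL, you might verify the ACL entries with a command similar to the following. The Spark staging directory path is a placeholder:
hdfs dfs -getfacl /<Spark staging directory>/sqoop_staging/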

Create Blaze Engine Directories


Create a blaze user account and directories required by the Blaze engine.

Complete the following tasks to prepare the Hadoop cluster for the Blaze engine:

Create a home directory for the blaze user.

If you created a blaze user, create a home directory for the blaze user. For example:
hadoop fs -mkdir /user/blaze
hadoop fs -chown blaze:blaze /user/blaze
If you did not create a blaze user, the Hadoop impersonation user is the default user.

Optionally, create a local services log directory.

By default, the Blaze engine writes the service logs to the YARN distributed cache. For example, run the
following command:
mkdir -p /opt/informatica/blazeLogs
$HADOOP_NODE_INFA_HOME gets set to the YARN distributed cache. If you create a directory, you must
update the value of the advanced property in the Hadoop connection.

Create an aggregated HDFS log directory.

Create a log directory on HDFS to contain aggregated logs for local services. For example:
hadoop fs -mkdir -p /var/log/hadoop-yarn/apps/informatica
Ensure that value of the advanced property in the Hadoop connection matches the directory that you
created.

Optionally, create a Blaze staging directory.

You can write the logs to the Informatica Hadoop staging directory, or you can create a Blaze staging
directory. If you do not want to use the default location, create a staging directory on the HDFS. For
example:
hadoop fs -mkdir -p /blaze/workdir
Note: If you do not create a staging directory, clear the Blaze staging directory property value in the
Hadoop connection and the Data Integration Service uses the HDFS directory /tmp/blaze_<user name>.



Grant permissions on the local services log directory, aggregated HDFS log directory, and the staging directory.

Grant permission to the following users:

• Blaze user
• Hadoop impersonation user
• Mapping impersonation users
If the blaze user does not have permission, the Blaze engine uses a different user, based on the cluster
security and the mapping impersonation configuration.
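For example, if you created a blaze user and the HDFS directories from the previous examples, you might grant access with commands similar to the following. The paths and the blaze group name are examples, and the local services log directory requires operating system permissions on each node rather than HDFS permissions:
hadoop fs -chown -R blaze:blaze /blaze/workdir /var/log/hadoop-yarn/apps/informatica
hadoop fs -chmod -R 777 /blaze/workdir /var/log/hadoop-yarn/apps/informatica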

Edit the hosts File for the Blaze Engine


To run the Blaze engine on every node in the cluster, verify that the /etc/hosts file on every node has entries
for all other nodes.

Each node in the cluster requires an entry for the IP address and the fully qualified domain name (FQDN) of
all other nodes. For example,
127.0.0.1 localhost node1.node.com
208.164.186.1 node1.node.com node1
208.164.186.2 node2.node.com node2
208.164.186.3 node3.node.com node3
Changes take effect after you restart the network.

Create a Reject File Directory


You can choose to store reject files on HDFS for the Blaze and Spark engines.

Reject files can be very large, and you can choose to write them to HDFS instead of the Data Integration
Service machine. You can configure the Hadoop connection object to write to the reject file directory.

Grant permission to the following users:

• Blaze user
• Hadoop impersonation user
• Mapping impersonation users
If the blaze user does not have permission, the Blaze engine uses a different user, based on the cluster
security and the mapping impersonation configuration.
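For example, the following commands create a reject file directory on HDFS and open its permissions. The directory name is an example; configure the same path in the Hadoop connection object:
hadoop fs -mkdir -p /informatica/rejects
hadoop fs -chmod -R 777 /informatica/rejects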

Create a Proxy Directory for MapR


If the Hadoop cluster runs on MapR, you must create a proxy directory for the user who will impersonate
other users.

Verify the following requirements for the proxy user:

• Create a user or verify that a user exists on every Data Integration Service machine and on every node in
the Hadoop cluster.
• Verify that the uid and the gid of the user match in both environments.
• Verify that a directory exists for the user on the cluster. For example, /opt/mapr/conf/proxy/<user
name>
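For example, you might verify the requirements with commands similar to the following. These commands only verify the user and the proxy entry; to create the proxy entry itself, follow the MapR documentation. Run the id command on the Data Integration Service machine and on the cluster nodes and compare the uid and gid values:
id <user name>
ls /opt/mapr/conf/proxy/<user name>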



Configure Access to Secure Hadoop Clusters
If the Hadoop cluster uses Kerberos authentication or SSL/TLS, you must configure the Informatica domain
to access the cluster. If the cluster uses transparent encryption, you must configure the Key Management
Server (KMS) for Informatica user access.

Depending on the security implementation on the cluster, you must perform the following tasks:

Cluster uses Kerberos authentication.

You must configure the Kerberos configuration file on the Data Integration Service machine to match the
Kerberos realm properties of the Hadoop cluster. Verify that the Hadoop Kerberos properties are
configured in the Data Integration Service and the Metadata Access Service.

Cluster uses SSL/TLS.

You must import security certificates to the Data Integration Service and the Metadata Access Service
machines.

Cluster uses transparent encryption.

If the transparent encryption uses Cloudera Java KMS, Cloudera Navigator KMS, or Apache Ranger KMS,
you must configure the KMS for Informatica user access.

Cluster uses Enterprise Security Package.

If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, perform the following
tasks:

• Create a keytab file on any one of the cluster nodes for the specific user. To create a keytab file, use
the ktutil command.
• In the Azure portal, assign the Owner role to the Azure HDInsight cluster service principal display
name.
• Log in to Ambari Web UI with the Azure Active Directory user credentials to generate the OAuth token
for authentication for the following users:
- Keytab user

- Hadoop impersonation user

- Hadoop staging user

- Blaze user

- Operating system profile user

For more information, see the Informatica Big Data Management Administrator Guide.
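For the keytab file in the first task, a minimal ktutil session might look like the following. The principal, encryption type, and output path are examples; use the values that apply to your Kerberos realm. ktutil prompts for the user password after the addent command:
ktutil
addent -password -p <user>@<REALM> -k 1 -e aes256-cts
wkt /tmp/<user>.keytab
quit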

Configure the Metadata Access Service


Configure the Metadata Access Service to integrate with the Hadoop environment.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.2 or earlier.



Configure the following Metadata Access Service properties:

Use Operating System Profiles and Impersonation

If enabled, the Metadata Access Service uses the operating system profiles to access the Hadoop cluster.

Hadoop Kerberos Service Principal Name

Service Principal Name (SPN) of the Metadata Access Service to connect to a Hadoop cluster that uses Kerberos authentication. Not applicable for the MapR distribution.

Hadoop Kerberos Keytab

The file path to the Kerberos keytab file on the machine on which the Metadata Access Service runs. Not applicable for the MapR distribution.

Use logged in user as impersonation user

Required if the Hadoop cluster uses Kerberos authentication. If enabled, the Metadata Access Service uses the impersonation user to access the Hadoop environment. Default is false.

Configure the Data Integration Service


Configure the Data Integration Service to integrate with the Hadoop environment.

Perform the following pre-integration tasks:

1. Download Informatica Hadoop binaries to the Data Integration Service machine if the operating systems
of the Hadoop environment and the Data Integration Service are different.
2. Configure the Data Integration Service properties, such as the cluster staging directory, Hadoop
Kerberos service principal name, and the path to the Kerberos keytab file.
3. Prepare an installation of Python on the Data Integration Service machine or on the Hadoop cluster if you
plan to run the Python transformation.
4. Copy the krb5.conf file to the following location on the machine that hosts the Data Integration Service:
• <Informatica installation directory>/java/jre/lib/security
• <Informatica installation directory>/services/shared/security
5. Copy the keytab file to the following directory: <Informatica installation directory>/isp/config/
keys
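A minimal sketch of steps 4 and 5 in the previous list, assuming the cluster Kerberos configuration file is available as /etc/krb5.conf on the Data Integration Service machine; the keytab file name is a placeholder:
cp /etc/krb5.conf <Informatica installation directory>/java/jre/lib/security/
cp /etc/krb5.conf <Informatica installation directory>/services/shared/security/
cp <keytab file> <Informatica installation directory>/isp/config/keys/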

Download the Informatica Server Binaries for the Hadoop Environment
If the domain and the Hadoop environments use different supported operating systems, you must configure
the Data Integration Service to be compatible with the Hadoop environment. To run a mapping, the local path
to the Informatica server binaries must be compatible with the Hadoop operating system.

The Data Integration Service can synchronize the following operating systems: SUSE and Redhat

The Data Integration Service machine must include the Informatica server binaries that are compatible with
the Hadoop cluster operating system. The Data Integration Service uses the operating system binaries to
integrate the domain with the Hadoop cluster.



You must run the installer to extract the installation binaries into the custom Hadoop OS path and then exit the
installer.

1. Create a directory on the Data Integration Service host machine to store the Informatica server binaries
associated with the Hadoop operating system.
If the Data Integration Service runs on a grid, Informatica recommends extracting the files to a location
that is shared by all services on the grid. If the location is not shared, you must extract the files to all
Data Integration Service machines that run on the grid.
The directory names in the path must not contain spaces or the following special characters: @ | * $ # !
%(){}[]
2. Download and extract the Informatica server binaries from the Informatica download site. For example,
tar -xvf <Informatica server binary tar file>
3. Run the installer to extract the installation binaries into the custom OS path.
Perform the following steps to run the installer:
• Run the following command: sh Server/install.bin -DINSTALL_MODE=CONSOLE -DINSTALL_TYPE=0
• Press Y to continue the installation.
• Press 1 to install Informatica Big Data Suite Products.
• Press 3 to run the installer.
• Press 2 to accept the terms and conditions.
• Press 2 to continue the installation for big data products only.
• Press 2 to configure the Informatica domain to run on a network with Kerberos authentication.
• Enter the path and file name of the Informatica license key and press an option to tune the services.
• Enter the custom Hadoop OS path.
• Type Quit to quit the installation.
4. Set the custom Hadoop OS path in the Data Integration Service and then restart the service.
5. Optionally, you can delete files that are not required. For example, run the following command:
rm -Rf <Informatica server binary file> ./source/*.7z
Note: If you subsequently install an Informatica EBF, you must also install it in the path of the Informatica
server binaries associated with the Hadoop environment.

Configure Data Integration Service Properties


The Data Integration Service contains properties that integrate the domain with the Hadoop cluster.

Configure the following Data Integration Service properties:

Cluster Staging Directory

The directory on the cluster where the Data Integration Service pushes the binaries to integrate the native and non-native environments and to store temporary files during processing. Default is /tmp.

Hadoop Staging User

The HDFS user that performs operations on the Hadoop staging directory. The user requires write permissions on the Hadoop staging directory. Default is the operating system user that starts the Informatica daemon.

Custom Hadoop OS Path

The local path to the Informatica server binaries compatible with the Hadoop operating system. Required when the Hadoop cluster and the Data Integration Service are on different supported operating systems. The Data Integration Service uses the binaries in this directory to integrate the domain with the Hadoop cluster. The Data Integration Service can synchronize the following operating systems:
- SUSE and Redhat
Include the source directory in the path. For example, <Informatica server binaries>/source.
Changes take effect after you recycle the Data Integration Service.
Note: When you install an Informatica EBF, you must also install it in this directory.

Hadoop Kerberos Service Principal Name

Service Principal Name (SPN) of the Data Integration Service to connect to a Hadoop cluster that uses Kerberos authentication. Not required for the MapR distribution.

Hadoop Kerberos Keytab

The file path to the Kerberos keytab file on the machine on which the Data Integration Service runs. Not required for the MapR distribution.

Custom Properties

Properties that are unique to specific environments. You can configure run-time properties for the Hadoop environment in the Data Integration Service, the Hadoop connection, and in the mapping. You can override a property configured at a high level by setting the value at a lower level. For example, if you configure a property in the Data Integration Service custom properties, you can override it in the Hadoop connection or in the mapping. The Data Integration Service processes property overrides based on the following priorities:
1. Mapping custom properties set using infacmd ms runMapping with the -cp option
2. Mapping run-time properties for the Hadoop environment
3. Hadoop connection advanced properties for run-time engines
4. Hadoop connection advanced general properties, environment variables, and classpaths
5. Data Integration Service custom properties

Prepare a Python Installation


If you want to use the Python transformation, you must ensure that the worker nodes on the Hadoop cluster
contain an installation of Python. You must complete different tasks depending on the product that you use.

Installing Python for Big Data Management


To use the Python transformation in Big Data Management, the worker nodes on the cluster must contain a
uniform installation of Python. You can ensure that the installation is uniform in one of the following ways:

• Verify that all worker nodes on the cluster contain an installation of Python in the same directory, such as
/usr/lib/python, and that each Python installation contains all required modules. You do not re-install
Python, but you must reconfigure the following Spark advanced property in the Hadoop connection:
infaspark.pythontx.executorEnv.PYTHONHOME
• Install Python on every Data Integration Service machine. You can create a custom installation of Python
that contains specific modules that you can reference in the Python code. When you run mappings, the
Python installation is propagated to the worker nodes on the cluster.



If you choose to install Python on the Data Integration Service machines, complete the following tasks:

1. Install Python.
2. Optionally, install any third-party libraries such as numpy, scikit-learn, and cv2. You can access the third-
party libraries in the Python transformation.
3. Copy the Python installation folder to the following location on the Data Integration Service machine:
<Informatica installation directory>/services/shared/spark/python
Note: If the Data Integration Service machine already contains an installation of Python, you can copy the
existing Python installation to the above location.

Changes take effect after you recycle the Data Integration Service.
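For example, if you installed a custom Python under /opt/custom-python on the Data Integration Service machine, you might copy it with a command similar to the following. The source path is an example, and the exact layout of the destination folder should follow the instruction above:
cp -r /opt/custom-python/. <Informatica installation directory>/services/shared/spark/python/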

Installing Python for Big Data Streaming


To use the Python transformation in Big Data Streaming, you must install Python and the Jep package.
Because you must install Jep, the Python version that you use must be compatible with Jep. You can use one
of the following versions of Python:

2.7
3.3
3.4
3.5
3.6

To install Python and Jep, complete the following tasks:

1. Install Python with the --enable-shared option to ensure that shared libraries are accessible by Jep.
2. Install Jep. To install Jep, consider the following installation options:
• Run pip install jep. Use this option if Python is installed with the pip package.
• Configure the Jep binaries. Ensure that jep.jar can be accessed by Java classloaders, the shared
Jep library can be accessed by Java, and Jep Python files can be accessed by Python.
3. Optionally, install any third-party libraries such as numpy, scikit-learn, and cv2. You can access the third-
party libraries in the Python transformation.
4. Copy the Python installation folder to the following location on the Data Integration Service machine:
<Informatica installation directory>/services/shared/spark/python
Note: If the Data Integration Service machine already contains an installation of Python, you can copy the
existing Python installation to the above location.

Changes take effect after you recycle the Data Integration Service.

Edit the hosts File for Access to Azure HDInsight


To ensure that Informatica can access the HDInsight cluster, edit the /etc/hosts file on the machine that
hosts the Data Integration Service to add the following information:

• Enter the IP address, DNS name, and DNS short name for each data node on the cluster. Use
headnodehost to identify the host as the cluster headnode host.
For example:
10.75.169.19 hn0-rndhdi.grg2yxlb0aouniiuvfp3bet13d.ix.internal.cloudapp.net
headnodehost
• If the HDInsight cluster is integrated with ADLS storage, you also need to enter the IP addresses and DNS
names for the hosts listed in the cluster property fs.azure.datalake.token.provider.service.urls.



For example:
1.2.3.67 gw1-ltsa.1320suh5npyudotcgaz0izgnhe.gx.internal.cloudapp.net
1.2.3.68 gw0-ltsa.1320suh5npyudotcgaz0izgnhe.gx.internal.cloudapp.net
Note: To get the IP addresses, run a telnet command from the cluster host using each host name found in
the fs.azure.datalake.token.provider.service.urls property.



Chapter 3

Amazon EMR Integration Tasks


This chapter includes the following topics:

• Amazon EMR Task Flows, 32


• Prepare for Cluster Import from Amazon EMR, 37
• Create a Cluster Configuration, 42
• Verify or Refresh the Cluster Configuration , 43
• Verify JDBC Drivers for Sqoop Connectivity, 44
• Configure the Files for Hive Tables on S3, 45
• Setting S3 Access Policies, 46
• Configure the Developer Tool, 48
• Complete Upgrade Tasks, 49

Amazon EMR Task Flows


Depending on whether you want to integrate or upgrade Big Data Management in an Amazon EMR
environment, you can use the flow charts to perform the following tasks:

• Integrate the Informatica domain with Amazon EMR for the first time.
• Upgrade from version 10.2.1.
• Upgrade from version 10.2.
• Upgrade from a version earlier than 10.2.

Task Flow to Integrate with Amazon EMR
The following diagram shows the task flow to integrate the Informatica domain with Amazon EMR:



Task Flow to Upgrade from Version 10.2.1
The following diagram shows the task flow to upgrade Big Data Management 10.2.1 for Amazon EMR:



Task Flow to Upgrade from Version 10.2
The following diagram shows the task flow to upgrade Big Data Management 10.2 for Amazon EMR:



Task Flow to Upgrade from a Version Earlier than 10.2
The following diagram shows the task flow to upgrade Big Data Management from a version earlier than 10.2
for Amazon EMR:



Prepare for Cluster Import from Amazon EMR
Before the Informatica administrator can import cluster information to create a cluster configuration in the
Informatica domain, the Hadoop administrator must perform some preliminary tasks.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any previous version.

Note: If you are upgrading from a previous version, verify the properties and suggested values, as Big Data
Management might require additional properties or different values for existing properties.

Complete the following tasks to prepare the cluster before the Informatica administrator creates the cluster
configuration:

1. Verify property values in *-site.xml files that Big Data Management needs to run mappings in the Hadoop
environment.
2. Prepare the archive file to import into the domain.

Note: You cannot import cluster information directly from the Amazon EMR cluster into the Informatica
domain.

Configure *-site.xml Files for Amazon EMR


The Hadoop administrator needs to configure *-site.xml file properties and restart impacted services before
the Informatica administrator imports cluster information into the domain.

core-site.xml
Configure the following properties in the core-site.xml file:
fs.s3.awsAccessKeyID

The ID for the run-time engine to connect to the Amazon S3 file system. Required for the Blaze engine,
and for the Spark engine if the S3 policy does not allow EMR access.

Note: If the Data Integration Service is deployed on an EC2 instance and the IAM roles and policies allow
access to S3 and other resources, this property is not required. If the Data Integration Service is
deployed on-premises, then you can choose to configure the value for this property in the cluster
configuration on the Data Integration Service after you import the cluster configuration. Configuring the
AccessKeyID value on the cluster configuration is more secure than configuring it in core-site.xml on the
cluster.

Set to your access ID.


fs.s3.awsSecretAccessKey

The access key for the Blaze and Spark engines to connect to the Amazon S3 file system. Required for
the Blaze engine, and for the Spark engine if the S3 policy does not allow EMR access.

Note: If the Data Integration Service is deployed on an EC2 instance and the IAM roles and policies allow
access to S3 and other resources, this property is not required. If the Data Integration Service is
deployed on-premises, then you can choose to configure the value for this property in the cluster
configuration on the Data Integration Service after you import the cluster configuration. Configuring the
secret access key value on the cluster configuration is more secure than configuring it in core-site.xml on the
cluster.



Set to your access key.

fs.s3.enableServerSideEncryption

Enables server side encryption for S3 buckets. Required for SSE and SSE-KMS encryption.

Set to: TRUE

fs.s3a.server-side-encryption-algorithm

The server-side encryption algorithm for S3. Required for SSE and SSE-KMS encryption. Set to the
encryption algorithm used.

fs.s3a.endpoint

URL of the entry point for the web service.

For example:
<property>
<name>fs.s3a.endpoint</name>
<value>s3-us-west-1.amazonaws.com</value>
</property>
fs.s3a.bucket.BUCKET_NAME.server-side-encryption.key

Server-side encryption key for the S3 bucket. Required if the S3 bucket is encrypted with SSE-KMS.

For example:
<property>
<name>fs.s3a.bucket.BUCKET_NAME.server-side-encryption.key</name>
<value>arn:aws:kms:us-west-1*******</value>
<source>core-site.xml</source>
</property>
where BUCKET_NAME is the name of the S3 bucket.

hadoop.proxyuser.<proxy user>.groups

Defines the groups that the proxy user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.

Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.

hadoop.proxyuser.<proxy user>.hosts
Defines the host machines that a user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.

Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.

hadoop.proxyuser.yarn.groups

Comma-separated list of groups that you want to allow the YARN user to impersonate on a non-secure
cluster.

Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.

hadoop.proxyuser.yarn.hosts

Comma-separated list of hosts that you want to allow the YARN user to impersonate on a non-secure
cluster.



Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.

hadoop.security.auth_to_local

Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.

Set to: RULE:[1:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g

Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g

io.compression.codecs

Enables compression on temporary staging tables.

Set to a comma-separated list of compression codec classes on the cluster.
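For example, the hadoop.proxyuser entries in core-site.xml might look like the following. The user name infauser is a placeholder for the proxy user described above, and the wildcard values trade security for convenience:
<property>
  <name>hadoop.proxyuser.infauser.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.infauser.hosts</name>
  <value>*</value>
</property>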

hbase-site.xml
Configure the following properties in the hbase-site.xml file:
hbase.use.dynamic.jars

Enables metadata import and test connection from the Developer tool. Required for an HDInsight cluster
that uses ADLS storage or an Amazon EMR cluster that uses HBase resources in S3 storage.

Set to: false

zookeeper.znode.parent

Identifies HBase master and region servers.

Set to the relative path to the znode directory of HBase.

hive-site.xml
Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class

The token store implementation. Required for HiveServer2 high availability and load balancing.

Set to: org.apache.hadoop.hive.thrift.DBTokenStore

hive.compactor.initiator.on

Runs the initiator and cleaner threads on metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.

Set to: TRUE

hive.compactor.worker.threads

The number of worker threads to run in a metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.

Set to: 1

hive.conf.hidden.list

Comma-separated list of hidden configuration properties.

Set to:
javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.access.key,fs.s3a.secret.key,fs.s3a.proxy.password

hive.enforce.bucketing

Enables dynamic bucketing while loading to Hive. Required for an Update Strategy transformation in a
mapping that writes to a Hive target.



Set to: TRUE

hive.exec.dynamic.partition

Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.

Set to: TRUE

hive.exec.dynamic.partition.mode

Allows all partitions to be dynamic. Required for the Update Strategy transformation in a mapping that
writes to a Hive target. Also required if you use Sqoop and define a DDL query to create or replace a
partitioned Hive target at run time.

Set to: nonstrict

hive.support.concurrency

Enables table locking in Hive. Required for an Update Strategy transformation in a mapping that writes to
a Hive target.

Set to: TRUE

hive.txn.manager

Turns on transaction support. Required for an Update Strategy transformation in a mapping that writes
to a Hive target.

Set to: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
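For example, two of the transaction-related entries listed above might look like the following in hive-site.xml. This sketch shows only hive.support.concurrency and hive.txn.manager; set the remaining properties the same way:
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>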

kms-site.xml
Configure the following properties in the kms-site.xml file:
hadoop.kms.authentication.kerberos.name.rules

Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.

Set to: RULE:[1:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g

Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.COM$)s/^(.*)@YOUR.REALM\.COM$/$1/g

mapred-site.xml
Configure the following properties in the mapred-site.xml file:

mapreduce.framework.name

The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for
Sqoop.

Set to: yarn

yarn.app.mapreduce.am.staging-dir

The HDFS staging directory used while submitting jobs.

Set to the staging directory path.

yarn-site.xml
Configure the following properties in the yarn-site.xml file:
yarn.application.classpath

Required for dynamic resource allocation.

Add spark_shuffle.jar to the class path. The .jar file must contain the class
"org.apache.spark.network.yarn.YarnShuffleService."



yarn.nodemanager.resource.memory-mb

The maximum RAM available for each container. Set the maximum memory on the cluster to increase
resource memory available to the Blaze engine.

Set to 16 GB if the value is less than 16 GB.

yarn.nodemanager.resource.cpu-vcores

The number of virtual cores for each container. Required for Blaze engine resource allocation.

Set to 10 if the value is less than 10.

yarn.scheduler.minimum-allocation-mb

The minimum RAM available for each container. Required for Blaze engine resource allocation.

Set to 6 GB if the value is less than 6 GB.

yarn.nodemanager.vmem-check-enabled

Disables virtual memory limits for containers. Required for the Blaze and Spark engines.

Set to: FALSE

yarn.nodemanager.aux-services

Required for dynamic resource allocation for the Spark engine.

Add an entry for "spark_shuffle."

yarn.nodemanager.aux-services.spark_shuffle.class

Required for dynamic resource allocation for the Spark engine.

Set to: org.apache.spark.network.yarn.YarnShuffleService

yarn.resourcemanager.scheduler.class

Defines the YARN scheduler that the Data Integration Service uses to assign resources.

Set to: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler

yarn.node-labels.enabled

Enables node labeling.

Set to: TRUE

yarn.node-labels.fs-store.root-dir

The HDFS location to update node label dynamically.

Set to: <hdfs://[Node name]:[Port]/[Path to store]/[Node labels]/>
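For example, the dynamic resource allocation entries in yarn-site.xml might look like the following. The mapreduce_shuffle value is an assumption about the auxiliary services already configured on the cluster; append spark_shuffle to the existing values rather than replacing them:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>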

Prepare the Archive File for Amazon EMR


After you verify property values in the *-site.xml files, create a .zip or a .tar file that the Informatica
administrator can use to import the cluster configuration into the domain.

Create an archive file that contains the following files from the cluster:

• core-site.xml
• hbase-site.xml. Required only if you access HBase sources and targets.
• hdfs-site.xml
• hive-site.xml



• mapred-site.xml or tez-site.xml. Include the mapred-site.xml file or the tez-site.xml file based on the Hive
execution type used on the Hadoop cluster.
• yarn-site.xml

Note: To import from Amazon EMR, the Informatica administrator must use an archive file.
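For example, from a directory that contains copies of the cluster configuration files, you might create the archive with one of the following commands. The archive file name is an example:
tar -cvf emr-cluster-conf.tar core-site.xml hbase-site.xml hdfs-site.xml hive-site.xml mapred-site.xml yarn-site.xml
zip emr-cluster-conf.zip core-site.xml hbase-site.xml hdfs-site.xml hive-site.xml mapred-site.xml yarn-site.xml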

Create a Cluster Configuration


After the Hadoop administrator prepares the cluster for import, the Informatica administrator must create a
cluster configuration.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.1.1 or earlier.

A cluster configuration is an object in the domain that contains configuration information about the Hadoop
cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop
environment. Import configuration properties from the Hadoop cluster to create a cluster configuration.

The import process imports values from *-site.xml files into configuration sets based on the individual *-
site.xml files. When you perform the import, the cluster configuration wizard can create Hadoop, HBase,
HDFS, and Hive connection to access the Hadoop environment. If you choose to create the connections, the
wizard also associates the cluster configuration with the connections.

Note: If you are integrating for the first time and you imported the cluster configuration when you ran the
installer, you must re-create or refresh the cluster configuration.

Importing a Hadoop Cluster Configuration from a File


You can import properties from an archive file to create a cluster configuration.

Before you import from the cluster, you must get the archive file from the Hadoop administrator.

1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:

Cluster configuration name

Name of the cluster configuration.

Description

Optional description of the cluster configuration.

Distribution type

The cluster Hadoop distribution type.

Distribution version

Version of the Hadoop distribution. Each distribution type has a default version, which is the latest version of the Hadoop distribution that Big Data Management supports. When the cluster version differs from the default version, the cluster configuration wizard populates the cluster configuration Hadoop distribution property with the most recent supported version relative to the cluster version. For example, suppose Informatica supports versions 5.10 and 5.13, and the cluster version is 5.12. In this case, the wizard populates the version with 5.10. You can edit the property to choose any supported version. Restart the Data Integration Service for the changes to take effect.

Method to import the cluster configuration

Choose Import from file to import properties from an archive file.

Create connections

Choose to create Hadoop, HDFS, Hive, and HBase connections. If you choose to create connections, the Cluster Configuration wizard associates the cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster environment variables, cluster path variables, and advanced properties. Based on the cluster environment and the functionality that you use, you can add to the default values or change the default values of these properties. For a list of Hadoop connection properties to configure, see “Configuring Hadoop Connection Properties” on page 188.
If you do not choose to create connections, you must manually create them and associate the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata Connection String and the Data Access Connection String properties with the value from the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on different nodes, you must update the Metadata Connection String to point to the HiveServer2 host.

4. Click Browse to select a file. Select the file and click Open.
5. Click Next and verify the cluster configuration information on the summary page.

Verify or Refresh the Cluster Configuration


You might need to refresh the cluster configuration or update the distribution version in the cluster
configuration when you upgrade.

Perform this task in the following situation:

- You upgraded from version 10.2 or later.

Verify the Cluster Configuration


The cluster configuration contains a property for the distribution version. The verification task depends on
the version that you upgraded from:



Upgrade from 10.2

If you upgraded from 10.2 and you changed the distribution version, you need to verify the distribution
version in the General properties of the cluster configuration.

Upgrade from 10.2.1

Effective in version 10.2.1, Informatica assigns a default version to each Hadoop distribution type. If you
configure the cluster configuration to use the default version, the upgrade process upgrades to the
assigned default version if the version changes. If you have not upgraded your Hadoop distribution to
Informatica's default version, you need to update the distribution version property.

For example, suppose the assigned default Hadoop distribution version for 10.2.1 is n, and for 10.2.2 is
n+1. If the cluster configuration uses the default supported Hadoop version of n, the upgraded cluster
configuration uses the default version of n+1. If you have not upgraded the distribution in the Hadoop
environment, you need to change the cluster configuration to use version n.

If you configure the cluster configuration to use a distribution version that is not the default version, you
need to update the distribution version property in the following circumstances:

• Informatica dropped support for the distribution version.


• You changed the distribution version.

Refresh the Cluster Configuration


If you updated any of the *-site.xml files noted in the topic to prepare for cluster import, you need to refresh
the cluster configuration in the Administrator tool.

Verify JDBC Drivers for Sqoop Connectivity


Verify that you have the JDBC drivers to access JDBC-compliant databases in the Hadoop environment. You
might need separate drivers for metadata import and for run-time processing.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.2.1 or earlier.

You download drivers based on design-time and run-time requirements:

• Design-time. To import metadata, you can use the DataDirect drivers packaged with the Informatica
installer if they are available. If they are not available, use any Type 4 JDBC driver that the database
vendor recommends.
• Run-time. To run mappings, use any Type 4 JDBC driver that the database vendor recommends. Some
distributions support other drivers to use Sqoop connectors. You cannot use the DataDirect drivers for
run-time processing.

Verify Design-time Drivers


Use the DataDirect JDBC drivers packaged with the Informatica installer to import metadata from JDBC-
compliant databases. If the DataDirect JDBC drivers are not available for a specific JDBC-compliant
database, download the Type 4 JDBC driver associated with that database.

Copy the JDBC driver .jar files to the following location on the Developer tool machine:



<Informatica installation directory>\clients\externaljdbcjars

Verify Run-time Drivers


Verify run-time drivers for mappings that access JDBC-compliant databases in the Hadoop environment. Use
any Type 4 JDBC driver that the database vendor recommends.

1. Download Type 4 JDBC drivers associated with the JDBC-compliant databases that you want to access.
2. To optimize the Sqoop mapping performance on the Spark engine while writing data to an HDFS
complex file target of the Parquet format, download the following .jar files:
• parquet-hadoop-bundle-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-hadoop-bundle/1.6.0/
• parquet-avro-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-avro/1.6.0/
• parquet-column-1.5.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-column/1.5.0/
3. Copy all of the .jar files to the following directory on the machine where the Data Integration Service
runs:
<Informatica installation directory>\externaljdbcjars
Changes take effect after you recycle the Data Integration Service. At run time, the Data Integration
Service copies the .jar files to the Hadoop distribution cache so that the .jar files are accessible to all
nodes in the cluster.

Configure the Files for Hive Tables on S3


To run mappings with Hive sources or targets on S3, you need to configure the files from the master node to
the Data Integration Service machine.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any Informatica version and changed the distribution version.

You can perform one of the following steps to configure the files:

Copy the .jar file

To integrate with EMR 5.16, get emrfs-hadoop-assembly-2.25.0.jar from the Hadoop administrator.
Copy the file to the following locations on each Data Integration Service machine:
/<Informatica installation directory>/services/shared/hadoop/EMR_<version number>/lib
/<Informatica installation directory>/services/shared/hadoop/EMR_<version number>/
extras/hive-auxjars
Note: If you upgraded from EMR 5.10 to EMR 5.14, the part of the file path that includes EMR_<version
number> remains EMR_5.10.

Create a file

Create a ~/.aws/config file on the Data Integration Service machine. The file must contain the AWS region. For example:
[default]
region=us-west-2

Create an environment variable

Create an AWS_CONFIG_FILE environment variable on the Data Integration Service machine. Set the value to
<EMR_5.10>/conf/aws.default

Setting S3 Access Policies


The AWS administrator must set S3 access policies to grant users the required access to S3 resources.

Perform this task in the following situations:

- You are integrating for the first time.

S3 access policies allow control of user access to S3 resources and the actions that users can perform. The
AWS administrator uses policies to control access and actions for specific users and resources, depending
on the use case that mappings and workflows require.

AWS uses a JSON statement for S3 access policies. To set the S3 access policy, determine the principal,
actions, and resources to define, then create or edit an existing S3 access policy JSON statement.

For more information about Amazon S3 access policies, see AWS documentation.

Step 1. Identify the S3 Access Policy Elements


Identify the principal, actions, and resources to insert in the access policy.

Set the following tags in the access policy:

Principal

The user, service, or account that receives permissions that are defined in a policy. Assign the owner of the S3 bucket resources as the principal.
Note: The S3 bucket owner and the owner of resources within the bucket can be different.

Action

The activity that the principal has permission to perform. In the sample, the Action tag lists two put actions and one get action. You must specify both get and put actions to grant read and write access to the S3 resource.

Resource

The S3 bucket, or a folder within a bucket. Include only resources in the same bucket.

Sample S3 Policy JSON Statement


The following JSON statement contains the basic elements of an S3 bucket access policy:
{
  "Version": "<date>",
  "Id": "Allow",
  "Statement": [
    {
      "Sid": "<Statement ID>",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account_2_ID>:<user>"
      },
      "Action": [
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::<bucket_1_name>/foldername/*"
      ]
    }
  ]
}

Step 2. Optionally Copy an Existing S3 Access Policy as a Template
When the AWS administrator selects a role for cluster users, the AWS console generates a default access
policy. After the AWS console generates the default policy, you can copy it and customize it to grant access
to specific resources to specific users.

Complete the following steps to copy an existing S3 access policy:

1. In the AWS console, click the Services menu.


The image below shows the Services menu in the menu bar:

2. Type "IAM" in the search bar and press Enter.


The Welcome to Identity and Access Management screen opens.
3. In the menu on the left, select Policies.
The console displays a list of existing policies.
4. Type "S3" in the search bar and press Enter.
The console displays a list of existing S3 access policies.
The image below shows an example of a list of S3 access policies:

5. Click the name of the policy that you want to copy.


The policy opens in a read-only window.
6. Highlight and copy the policy statement.

After you copy the JSON statement, you can edit it in a text editor or in the bucket policy editor.

Step 3. Create or Edit an S3 Access Policy


Create an S3 access policy or edit an existing policy. The AWS administrator can enter a JSON statement,
based on a template. The administrator can copy and customize the S3 policy from another bucket.

1. In the AWS console, click the Services menu.



2. In the Storage section, choose S3.
The AWS console displays a list of existing buckets.
3. Use the search box to find the bucket you want to set a policy for, and select the bucket from the results.
4. Click the Permissions tab, then click Bucket Policy.
The Bucket Policy Editor opens.
The image below shows the Bucket Policy button:

5. Type the bucket access policy, or edit the existing policy, and click Save.
AWS applies the access policy to the bucket.

Configure the Developer Tool


To access the Hadoop environment from the Developer tool, the mapping developers must perform tasks on
each Developer tool machine.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any previous version.

Configure developerCore.ini
Edit developerCore.ini to successfully import local complex files available on the Developer tool machine.

When you import a complex file, such as Avro or Parquet, the imported object includes metadata associated
with the distribution in the Hadoop environment. If the file resides on the Developer tool machine, the import
process picks up the distribution information from the developerCore.ini file. You must edit the
developerCore.ini file to point to the distribution directory on the Developer tool machine.

You can find developerCore.ini in the following directory:


<Informatica installation directory>\clients\DeveloperClient
Add the following property:
-DINFA_HADOOP_DIST_DIR=hadoop\<distribution>_<version>
The change takes effect when you restart the Developer tool.



Complete Upgrade Tasks
If you upgraded the Informatica platform, you need to perform some additional tasks within the Informatica
domain.

Based on the version that you upgraded from, you might need to update the following types of objects:
Connections

Based on the version you are upgrading from, you might need to update Hadoop connections or replace
connections to the Hadoop environment.

The Hadoop connection contains additional properties. You need to manually update it to include
customized configuration in the hadoopEnv.properties file from previous versions.

Streaming mappings

After you upgrade, streaming mappings become invalid. Mappings might also contain data objects or
transformations for which support is deferred; support will be reinstated in a future release.

You must re-create the physical data objects to run the mappings on the Spark engine, which uses
Structured Streaming. After you re-create the physical data objects, some properties are not available for
Azure Event Hubs data objects.

Update Connections
You might need to update connections based on the version you are upgrading from.

Consider the following types of updates that you might need to make:
Configure the Hadoop connection.

Configure the Hadoop connection to incorporate properties from the hadoopEnv.properties file.

Replace connections.

If you chose the option to create connections when you ran the Cluster Configuration wizard, you need
to replace connections in mappings with the new connections.

Complete connection upgrades.

If you did not create connections when you created the cluster configuration, you need to update the
connections.

Configure the Hadoop Connection


To use properties that you customized in the hadoopEnv.properties file, you must configure the Hadoop
connection properties such as cluster environment variables, cluster path variables, and advanced properties.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.

When you run the Informatica upgrade, the installer backs up the existing hadoopEnv.properties file. You can
find the backup hadoopEnv.properties file in the following location:

<Previous Informatica installation directory>/services/shared/hadoop/<Hadoop distribution


name>_<version>/infaConf



Edit the Hadoop connection in the Administrator tool or the Developer tool to include any properties that you
manually configured in the hadoopEnv.properties file. The Hadoop connection contains default values for
properties such as cluster environment and path variables and advanced properties. You can update the
default values to match the properties in the hadoopEnv.properties file.

Replace the Connections with New Connections


If you created connections when you imported the cluster configuration, you need to replace connections in
mappings with the new connections.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.

The method that you use to replace connections in mappings depends on the type of connection.
Hadoop connection

Run the following commands to replace the connections:

• infacmd dis replaceMappingHadoopRuntimeConnections. Replaces connections associated with


mappings that are deployed in applications.
• infacmd mrs replaceMappingHadoopRuntimeConnections. Replaces connections associated with
mappings that you run from the Developer tool.

For information about the infacmd commands, see the Informatica Command Reference.

Hive, HDFS, and HBase connections

You must replace the connections manually.

Complete Connection Upgrade


If you did not create connections when you imported the cluster configuration, you need to update connection
properties for Hadoop, Hive, HDFS, and HBase connections.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.


- You upgraded from version 10.2 or later and changed the distribution version.

Perform the following tasks to update the connections:

Update changed properties

Review connections that you created in a previous release to update the values for connection
properties. For example, if you added nodes to the cluster or if you updated the distribution version, you
might need to verify host names, URIs, or port numbers for some of the properties.
Associate the cluster configuration

The Hadoop, Hive, HDFS, and HBase connections must be associated with a cluster configuration.
Complete the following tasks:

1. Run infacmd isp listConnections to identify the connections that you need to upgrade. Use -ct
to list connections of a particular type.



2. Run infacmd isp UpdateConnection to associate the cluster configuration with the connection.
Use -cn to name the connection and -o clusterConfigID to associate the cluster configuration
with the connection.
For more information about infacmd, see the Informatica Command Reference.
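For example, the commands might look like the following. The domain, user, password, connection type, connection name, and cluster configuration values are placeholders, and the exact option syntax is documented in the Informatica Command Reference:
infacmd isp listConnections -dn <domain name> -un <user name> -pd <password> -ct <connection type>
infacmd isp UpdateConnection -dn <domain name> -un <user name> -pd <password> -cn <connection name> -o clusterConfigID=<cluster configuration name>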

Update Streaming Objects


Big Data Streaming uses Spark Structured Streaming instead of Spark Streaming to process data. To support
Spark Structured Streaming, some header ports are added to the data objects, and support for some of the
data objects and transformations is deferred to a future release. The behavior of some of the data objects
is also updated.

After you upgrade, the existing streaming mappings become invalid because of the unavailable header ports,
the unsupported transformations or data objects, and the behavior changes of some data objects.

Perform this task in the following situations:

- You upgraded from version 10.1.1, 10.2.0, or 10.2.1.

To use an existing streaming mapping, perform the following tasks:

• Re-create the physical data objects. After you re-create the physical data objects, the data objects get the
required header ports, such as timestamp, partitionID, or key based on the data object.
• In a Normalizer transformation, if the Occurs column is set to Auto, re-create the Normalizer
transformation. You must re-create the Normalizer transformation because the type configuration
property of the complex port refers to the physical data object that you plan to replace.
• Update the streaming mapping. If the mapping contains a Kafka target, an Aggregator transformation, a Joiner
transformation, or a Normalizer transformation, replace the data object or transformation, and then update
the mapping because of the changed behavior of these transformations and data objects.
• Verify the deferred data object types. If the streaming mapping contains unsupported transformations or
data objects, contact Informatica Global Customer Support.

Re-create the Physical Data Objects


When you re-create the physical data objects, the data objects get the required header ports, and some
properties are no longer available for some data objects. Update the existing mapping with the newly created
physical data objects.

1. Open the existing mapping and select the data object.
2. Click the Properties tab. On the Column Projection tab, click Edit Schema.
3. Note the schema information from the Edit Schema dialog box.
4. Note the parameters information from the Parameters tab.
5. Create new physical data objects.
After you re-create the data objects, the physical data objects get the required header ports. Microsoft
Azure does not support the following properties, so they are not available for Azure Event Hubs data objects:

• Consumer Properties
• Partition Count



Re-create the Normalizer Transformation
If the mapping contains a Normalizer transformation with the Occurs column set to Auto, re-create the
Normalizer transformation. When you re-create the Normalizer transformation, the type configuration
property of the complex port refers to the re-created physical data object.

Update the Streaming Mappings


After you re-create the data objects, replace the existing data objects with the re-created data objects. If the
mapping contains a Normalizer transformation, an Aggregator transformation, or a Joiner transformation, update
the mapping because of the changed behavior of these transformations and data objects.

Transformation Updates

If a transformation uses a complex port, configure the type configuration property of the port because
the property refers to the physical data object that you replaced.

Aggregator and Joiner Transformation Updates

An Aggregator transformation must be downstream from a Joiner transformation. A Window transformation
must be directly upstream from both Aggregator and Joiner transformations. Previously, you could use an
Aggregator transformation anywhere in the streaming mapping.

If a mapping contains an Aggregator transformation upstream from a Joiner transformation, move the
Aggregator transformation downstream from a Joiner transformation. Add a Window transformation
directly upstream from both Aggregator and Joiner transformations.

Verify the Deferred Data Object Types


After you upgrade, the streaming mappings might contain some transformations and data objects that are
deferred.

The following table lists the data object types to which the support is deferred to a future release:

Object Type        Object

Source             JMS, MapR Streams

Target             MapR Streams

Transformation     Data Masking, Joiner, Rank, Sorter

If you want to continue using the mappings that contain deferred data objects or transformations, you must
contact Informatica Global Customer Support.



Chapter 4

Azure HDInsight Integration Tasks
This chapter includes the following topics:

• Azure HDInsight Task Flows, 53


• Prepare for Cluster Import from Azure HDInsight, 58
• Create a Cluster Configuration, 64
• Verify or Refresh the Cluster Configuration , 67
• Verify JDBC Drivers for Sqoop Connectivity, 67
• Configure the Developer Tool, 69
• Complete Upgrade Tasks, 69

Azure HDInsight Task Flows


Depending on whether you want to integrate or upgrade Big Data Management in an Azure HDInsight
environment, you can use the flow charts to perform the following tasks:

• Integrate the Informatica domain with Azure HDInsight for the first time.
• Upgrade from version 10.2.1.
• Upgrade from version 10.2.
• Upgrade from a version earlier than 10.2.

Task Flow to Integrate with Azure HDInsight
The following diagram shows the task flow to integrate the Informatica domain with Azure HDInsight:



Task Flow to Upgrade from Version 10.2.1
The following diagram shows the task flow to upgrade Big Data Management 10.2.1 for Azure HDInsight:



Task Flow to Upgrade from Version 10.2
The following diagram shows the task flow to upgrade Big Data Management 10.2 for Azure HDInsight:



Task Flow to Upgrade from a Version Earlier than 10.2
The following diagram shows the task flow to upgrade Big Data Management from a version earlier than 10.2
for Azure HDInsight:



Prepare for Cluster Import from Azure HDInsight
Before the Informatica administrator can import cluster information to create a cluster configuration in the
Informatica domain, the Hadoop administrator must perform some preliminary tasks.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any previous version.

Note: If you are upgrading from a previous version, verify the properties and suggested values, as Big Data
Management might require additional properties or different values for existing properties.

Complete the following tasks to prepare the cluster before the Informatica administrator creates the cluster
configuration:

1. Verify that the VPN is enabled between the Informatica domain and the Azure HDInsight cloud network.
2. Verify property values in *-site.xml files that Big Data Management needs to run mappings in the Hadoop
environment.
3. Provide information to the Informatica administrator that is required to import cluster information into
the domain. Depending on the method of import, perform one of the following tasks:
• To import directly from the cluster, give the Informatica administrator cluster authentication
information to connect to the cluster.
• To import from an archive file, export cluster information and provide an archive file to the
Informatica administrator.

Configure *-site.xml Files for Azure HDInsight


The Hadoop administrator needs to configure *-site.xml file properties and restart the credential service and
other impacted services before the Informatica administrator imports cluster information into the domain.

core-site.xml
Configure the following properties in the core-site.xml file:
fs.azure.account.key.<youraccount>.blob.core.windows.net

Required for an Azure HDInsight cluster that uses WASB storage. The storage account access key required
to access the storage.

You can contact the HDInsight cluster administrator to get the storage account key associated with the
HDInsight cluster. If you are unable to contact the administrator, perform the following steps to decrypt
the encrypted storage account key:

• Copy the value of the fs.azure.account.key.<youraccount>.blob.core.windows.net property.


<property>
<name>fs.azure.account.key.<youraccount>.blob.core.windows.net</name>
<value>STORAGE ACCOUNT KEY</value>
</property>
• Decrypt the storage account key. Run the decrypt.sh script specified in the
fs.azure.shellkeyprovider.script property, passing the encrypted value that you copied in the
previous step.
<property>
<name>fs.azure.shellkeyprovider.script</name>
<value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property>
• Copy the decrypted value and update the value of the
fs.azure.account.key.<youraccount>.blob.core.windows.net property in the cluster configuration
core-site.xml.
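
For example, assuming that the script accepts the encrypted key as a command-line argument, the call might
look like the following; the encrypted value is a placeholder:

# Run the script path taken from fs.azure.shellkeyprovider.script, passing the copied encrypted key
/usr/lib/hdinsight-common/scripts/decrypt.sh "<encrypted storage account key>"
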
dfs.adls.oauth2.client.id

Required for an Azure HDInsight cluster that uses ADLS storage without the Enterprise Security Package. The
application ID associated with the Service Principal required to authorize the service principal and
access the storage.

To find the application ID for a service principal, in the Azure Portal, click Azure Active Directory > App
registrations > Service Principal Display Name.

dfs.adls.oauth2.refresh.url

Required for an Azure HDInsight cluster that uses ADLS storage without the Enterprise Security Package. The
OAuth 2.0 token endpoint required to authorize the service principal and access the storage.

To find the refresh URL OAuth 2.0 endpoint, in the Azure portal, click Azure Active Directory > App
registrations > Endpoints.

dfs.adls.oauth2.credential

Required for an Azure HDInsight cluster that uses ADLS storage without the Enterprise Security Package. The
password required to authorize the service principal and access the storage.

To find the password for a service principal, in the Azure portal, click Azure Active Directory > App
registrations > Service Principal Display Name > Settings > Keys.

hadoop.proxyuser.<proxy user>.groups

Defines the groups that the proxy user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.

Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.

hadoop.proxyuser.<proxy user>.users

Required for an Azure HDInsight cluster that uses the Enterprise Security Package and ADLS storage. Defines
the user accounts that the proxy user account can impersonate. On a secure cluster, the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.

Set to a single user account or set to a comma-separated list. If less security is preferred, use the
wildcard " * " to allow impersonation from any user.

hadoop.proxyuser.<proxy user>.hosts

Defines the host machines from which the proxy user account can impersonate other users. On a secure
cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a
non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.



Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.
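
For reference, a minimal core-site.xml sketch of the three proxy user properties described above. The proxy
user name and the group, user, and host values are placeholders that you must replace with values from your
environment:

<!-- Example only: replace "infauser" and the values with your own proxy user, groups, users, and hosts -->
<property>
  <name>hadoop.proxyuser.infauser.groups</name>
  <value>hadoopusers</value>
</property>
<property>
  <name>hadoop.proxyuser.infauser.users</name>
  <value>devuser1,devuser2</value>
</property>
<property>
  <name>hadoop.proxyuser.infauser.hosts</name>
  <value>*</value>
</property>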

hadoop.proxyuser.yarn.groups

Comma-separated list of groups that you want to allow the YARN user to impersonate on a non-secure
cluster.

Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.

hadoop.proxyuser.yarn.hosts

Comma-separated list of hosts that you want to allow the YARN user to impersonate on a non-secure
cluster.

Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.

io.compression.codecs

Enables compression on temporary staging tables.

Set to a comma-separated list of compression codec classes on the cluster.

hadoop.security.auth_to_local

Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.

Set to: RULE:[1:$1@$0](^.*@YOUR.REALM)s/^(.*)@YOUR.REALM\.COM$/$1/g

Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.$)s/^(.*)@YOUR.REALM\.COM$/$1/g

hbase-site.xml
Configure the following properties in the hbase-site.xml file:
hbase.use.dynamic.jars

Enables metadata import and test connection from the Developer tool. Required for an HDInsight cluster
that uses ADLS storage or an Amazon EMR cluster that uses HBase resources in S3 storage.

Set to: false

zookeeper.znode.parent

Identifies HBase master and region servers.

Set to the relative path to the znode directory of HBase.

hive-site.xml
Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class

The token store implementation. Required for HiveServer2 high availability and load balancing.

Set to: org.apache.hadoop.hive.thrift.DBTokenStore

hive.compactor.initiator.on

Runs the initiator and cleaner threads on metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.

Set to: TRUE

hive.compactor.worker.threads

The number of worker threads to run in a metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.



Set to: 1

hive.enforce.bucketing

Enables dynamic bucketing while loading to Hive. Required for an Update Strategy transformation in a
mapping that writes to a Hive target.

Set to: TRUE

hive.exec.dynamic.partition

Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.

Set to: TRUE

hive.exec.dynamic.partition.mode

Allows all partitions to be dynamic. Required for the Update Strategy transformation in a mapping that
writes to a Hive target. Also required if you use Sqoop and define a DDL query to create or replace a
partitioned Hive target at run time.

Set to: nonstrict

hive.support.concurrency

Enables table locking in Hive. Required for an Update Strategy transformation in a mapping that writes to
a Hive target.

Set to: TRUE

hive.server2.support.dynamic.service.discovery

Enables HiveServer2 dynamic service discovery. Required for HiveServer2 high availability.

Set to: TRUE

hive.server2.zookeeper.namespace

The value of the ZooKeeper namespace in the JDBC connection string. Required for HiveServer2 high
availability.

Set to: jdbc:hive2://<zookeeper_ensemble>/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2

hive.txn.manager

Turns on transaction support. Required for an Update Strategy transformation in a mapping that writes
to a Hive target.

Set to: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager

hive.zookeeper.quorum

Comma-separated list of ZooKeeper server host:ports in a cluster. The value of the ZooKeeper ensemble
in the JDBC connection string. Required for HiveServer2 high availability.

Set to: jdbc:hive2://<zookeeper_ensemble>/default;serviceDiscoveryMode=zooKeeper;
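
For reference, a hive-site.xml fragment with the Update Strategy and transaction-related settings described
above; the values are the suggested values from this list:

<!-- Suggested values for Update Strategy support on a Hive target -->
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>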

mapred-site.xml
Configure the following properties in the mapred-site.xml file:

mapreduce.framework.name

The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for
Sqoop.

Set to: yarn



yarn.app.mapreduce.am.staging-dir

The HDFS staging directory used while submitting jobs.

Set to the staging directory path.

yarn-site.xml
Configure the following properties in the yarn-site.xml file:
yarn.application.classpath

Required for dynamic resource allocation.

Add spark_shuffle.jar to the class path. The .jar file must contain the class
"org.apache.spark.network.yarn.YarnShuffleService."

yarn.nodemanager.resource.memory-mb

The maximum RAM available for each container. Set the maximum memory on the cluster to increase
resource memory available to the Blaze engine.

Set to 16 GB if the value is less than 16 GB.

yarn.nodemanager.resource.cpu-vcores

The number of virtual cores for each container. Required for Blaze engine resource allocation.

Set to 10 if the value is less than 10.

yarn.scheduler.minimum-allocation-mb

The minimum RAM available for each container. Required for Blaze engine resource allocation.

Set to 6 GB if the value is less than 6 GB.

yarn.nodemanager.vmem-check-enabled

Disables virtual memory limits for containers. Required for the Blaze and Spark engines.

Set to: FALSE

yarn.nodemanager.aux-services

Required for dynamic resource allocation for the Spark engine.

Add an entry for "spark_shuffle."

yarn.nodemanager.aux-services.spark_shuffle.class

Required for dynamic resource allocation for the Spark engine.

Set to: org.apache.spark.network.yarn.YarnShuffleService

yarn.resourcemanager.scheduler.class

Defines the YARN scheduler that the Data Integration Service uses to assign resources.

Set to: org.apache.hadoop.yarn.server.resourcemanager.scheduler

yarn.node-labels.enabled

Enables node labeling.

Set to: TRUE

yarn.node-labels.fs-store.root-dir

The HDFS location to update node label dynamically.

Set to: <hdfs://[Node name]:[Port]/[Path to store]/[Node labels]/>
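
For reference, a yarn-site.xml sketch of the dynamic resource allocation settings for the Spark engine
described above. The spark_shuffle entry and shuffle class are the values listed in this section;
mapreduce_shuffle is shown as an assumed existing aux-services entry:

<!-- Dynamic resource allocation for the Spark engine -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>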

tez-site.xml
Configure the following properties in the tez-site.xml file:



tez.runtime.io.sort.mb

The sort buffer memory. Required when the output needs to be sorted for Blaze and Spark engines.

Set value to 270 MB.

Prepare for Direct Import from Azure HDInsight


If you plan to provide direct access to the Informatica administrator to import cluster information, provide the
required connection information.

The following table describes the information that you need to provide to the Informatica administrator to
create the cluster configuration directly from the cluster:

Property Description

Host IP address of the cluster manager.

Port Port of the cluster manager.

User ID Cluster user ID.

Password Password for the user.

Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If you do
not provide a cluster name, the wizard imports information based on the default cluster.

Prepare the Archive File for Import from Azure HDInsight


When you prepare the archive file for cluster configuration import from HDInsight, include all required *-
site.xml files and edit the file manually after you create it.

Create a .zip or .tar file that contains the following *-site.xml files (a sample archive command appears after the list):

• core-site.xml
• hbase-site.xml. Required only to access HBase sources and targets.
• hdfs-site.xml
• hive-site.xml
• mapred-site.xml or tez-site.xml. Include the mapred-site.xml file or the tez-site.xml file based on the Hive
execution type used on the Hadoop cluster.
• yarn-site.xml
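
For example, a minimal command to create the archive might look like the following, assuming that the
*-site.xml files have been copied to the current directory and that the cluster uses Tez; the archive
name is a placeholder:

# Package the required *-site.xml files into a .tar archive for the Informatica administrator
tar -cvf hdinsight_cluster_conf.tar core-site.xml hbase-site.xml hdfs-site.xml hive-site.xml tez-site.xml yarn-site.xml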

Update the Archive File


After you create the archive file, edit the Hortonworks Data Platform (HDP) version string wherever it appears
in the archive file. Search for the string ${hdp.version} and replace all instances with the HDP version that
HDInsight includes in the Hadoop distribution.

For example, the edited tez.task.launch.cluster-default.cmd-opts property value looks similar to the following:
<property>
<name>tez.task.launch.cluster-default.cmd-opts</name>
<value>-server -Djava.net.preferIPv4Stack=true -Dhdp.version=2.6.0.2-76</value>
</property>
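
To replace every occurrence across the files in the extracted archive, you might use a command similar to the
following; the version shown is the example value from above, and the actual value depends on the HDP build
that your HDInsight cluster includes:

# Replace ${hdp.version} in all *-site.xml files with the actual HDP version string
sed -i 's/\${hdp\.version}/2.6.0.2-76/g' *-site.xml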



Create a Cluster Configuration
After the Hadoop administrator prepares the cluster for import, the Informatica administrator must create a
cluster configuration.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.1.1 or earlier.

A cluster configuration is an object in the domain that contains configuration information about the Hadoop
cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop
environment. Import configuration properties from the Hadoop cluster to create a cluster configuration.

The import process imports values from *-site.xml files into configuration sets based on the individual *-
site.xml files. When you perform the import, the cluster configuration wizard can create Hadoop, HBase,
HDFS, and Hive connections to access the Hadoop environment. If you choose to create the connections, the
wizard also associates the cluster configuration with the connections.

Note: If you are integrating for the first time and you imported the cluster configuration when you ran the
installer, you must re-create or refresh the cluster configuration.

Before You Import


Before you can import the cluster configuration, you must get information from the Hadoop administrator
based on the method of import.

If you import directly from the cluster, contact the Hadoop administrator to get cluster connection
information. If you import from a file, get an archive file of exported cluster information.

Importing a Hadoop Cluster Configuration from the Cluster


When you import the Hadoop cluster configuration directly from the cluster, you provide information to
connect to the cluster.

Get cluster connection information from the Hadoop administrator.

1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following General properties:

Property Description

Cluster configuration name    Name of the cluster configuration.

Description Optional description of the cluster configuration.

Distribution type The cluster Hadoop distribution type.




Distribution version Version of the Hadoop distribution.


Each distribution type has a default version. The default version is the latest version of the
Hadoop distribution that Big Data Management supports.
Note: When the cluster version differs from the default version and Informatica supports
more than one version, the cluster configuration import process populates the property
with the most recent supported version. For example, consider the case where Informatica
supports versions 5.10 and 5.13, and the cluster version is 5.12. In this case, the cluster
configuration import process populates this property with 5.10, because 5.10 is the most
recent supported version before 5.12.
You can edit the property to choose any supported version. Restart the Data Integration
Service for the changes to take effect.

Method to import the cluster configuration    Choose Import from cluster.

Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster environment
variables, cluster path variables, and advanced properties. Based on the cluster
environment and the functionality that you use, you can add to the default values or
change the default values of these properties. For a list of Hadoop connection properties
to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.

The cluster properties appear.

4. Configure the following properties:

Property Description

Host IP address of the cluster manager.

Port Port of the cluster manager.

User ID Cluster user ID.

Password Password for the user.

Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If
you do not provide a cluster name, the wizard imports information based on the default cluster.

5. Click Next and verify the cluster configuration information on the summary page.



Importing a Hadoop Cluster Configuration from a File
You can import properties from an archive file to create a cluster configuration.

Before you import from an archive file, you must get the file from the Hadoop administrator.

1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:

Property Description

Cluster configuration name    Name of the cluster configuration.

Description Optional description of the cluster configuration.

Distribution type The cluster Hadoop distribution type.

Distribution version Version of the Hadoop distribution.


Each distribution type has a default version. This is the latest version of the Hadoop
distribution that Big Data Management supports.
When the cluster version differs from the default version, the cluster configuration wizard
populates the cluster configuration Hadoop distribution property with the most recent
supported version relative to the cluster version. For example, suppose Informatica
supports versions 5.10 and 5.13, and the cluster version is 5.12. In this case, the wizard
populates the version with 5.10.
You can edit the property to choose any supported version. Restart the Data Integration
Service for the changes to take effect.

Method to import the cluster configuration    Choose Import from file to import properties from an archive file.

Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster
environment variables, cluster path variables, and advanced properties. Based on the
cluster environment and the functionality that you use, you can add to the default values
or change the default values of these properties. For a list of Hadoop connection
properties to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.

4. Click Browse to select a file. Select the file and click Open.
5. Click Next and verify the cluster configuration information on the summary page.



Verify or Refresh the Cluster Configuration
You might need to refresh the cluster configuration or update the distribution version in the cluster
configuration when you upgrade.

Perform this task in the following situation:

- You upgraded from version 10.2 or later.

Verify the Cluster Configuration


The cluster configuration contains a property for the distribution version. The verification task depends on
the version that you upgraded from:
Upgrade from 10.2

If you upgraded from 10.2 and you changed the distribution version, you need to verify the distribution
version in the General properties of the cluster configuration.

Upgrade from 10.2.1

Effective in version 10.2.1, Informatica assigns a default version to each Hadoop distribution type. If you
configure the cluster configuration to use the default version, the upgrade process upgrades to the
assigned default version if the version changes. If you have not upgraded your Hadoop distribution to
Informatica's default version, you need to update the distribution version property.
For example, suppose the assigned default Hadoop distribution version for 10.2.1 is n, and for 10.2.2 is n+1.
If the cluster configuration uses the default supported Hadoop version of n, the upgraded cluster
configuration uses the default version of n+1. If you have not upgraded the distribution in the Hadoop
environment, you need to change the cluster configuration to use version n.

If you configure the cluster configuration to use a distribution version that is not the default version, you
need to update the distribution version property in the following circumstances:

• Informatica dropped support for the distribution version.


• You changed the distribution version.

Refresh the Cluster Configuration


If you updated any of the *-site.xml files noted in the topic to prepare for cluster import, you need to refresh
the cluster configuration in the Administrator tool.

Verify JDBC Drivers for Sqoop Connectivity


Verify that you have the JDBC drivers to access JDBC-compliant databases in the Hadoop environment. You
might need separate drivers for metadata import and for run-time processing.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.2.1 or earlier.



You download drivers based on design-time and run-time requirements:

• Design-time. To import metadata, you can use the DataDirect drivers packaged with the Informatica
installer if they are available. If they are not available, use any Type 4 JDBC driver that the database
vendor recommends.
• Run-time. To run mappings, use any Type 4 JDBC driver that the database vendor recommends. Some
distributions support other drivers to use Sqoop connectors. You cannot use the DataDirect drivers for
run-time processing.

Verify Design-time Drivers


Use the DataDirect JDBC drivers packaged with the Informatica installer to import metadata from JDBC-
compliant databases. If the DataDirect JDBC drivers are not available for a specific JDBC-compliant
database, download the Type 4 JDBC driver associated with that database.

Copy the JDBC driver .jar files to the following location on the Developer tool machine:

<Informatica installation directory>\clients\externaljdbcjars

Verify Run-time Drivers


Verify run-time drivers for mappings that access JDBC-compliant databases in the Hadoop environment. Use
any Type 4 JDBC driver that the database vendor recommends.

1. Download Type 4 JDBC drivers associated with the JDBC-compliant databases that you want to access.
2. To optimize the Sqoop mapping performance on the Spark engine while writing data to an HDFS
complex file target of the Parquet format, download the following .jar files:
• parquet-hadoop-bundle-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-hadoop-bundle/1.6.0/
• parquet-avro-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-avro/1.6.0/
• parquet-column-1.5.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-column/1.5.0/
3. Copy all of the .jar files to the following directory on the machine where the Data Integration Service
runs:
<Informatica installation directory>\externaljdbcjars
Changes take effect after you recycle the Data Integration Service. At run time, the Data Integration
Service copies the .jar files to the Hadoop distribution cache so that the .jar files are accessible to all
nodes in the cluster.
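
For example, assuming that the JDBC driver and Parquet .jar files were downloaded to the current directory,
copying them on a Linux Data Integration Service machine might look like the following; the driver file name
and installation path are placeholders:

# Copy the run-time JDBC and Parquet .jar files to the externaljdbcjars directory
cp <vendor_jdbc_driver>.jar parquet-hadoop-bundle-1.6.0.jar parquet-avro-1.6.0.jar parquet-column-1.5.0.jar \
   /opt/Informatica/externaljdbcjars/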



Configure the Developer Tool
To access the Hadoop environment from the Developer tool, the mapping developers must perform tasks on
each Developer tool machine.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any previous version.

Configure developerCore.ini
Edit developerCore.ini to successfully import local complex files available on the Developer tool machine.

When you import a complex file, such as Avro or Parquet, the imported object includes metadata associated
with the distribution in the Hadoop environment. If the file resides on the Developer tool machine, the import
process picks up the distribution information from the developerCore.ini file. You must edit the
developerCore.ini file to point to the distribution directory on the Developer tool machine.

You can find developerCore.ini in the following directory:


<Informatica installation directory>\clients\DeveloperClient
Add the following property:
-DINFA_HADOOP_DIST_DIR=hadoop\<distribution>_<version>
The change takes effect when you restart the Developer tool.
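
For example, if the Developer tool machine has a hypothetical distribution directory named azurehdinsight_3.6
under <Informatica installation directory>\clients\DeveloperClient\hadoop, the entry might look like the
following; use the directory name that actually exists on your machine:

-DINFA_HADOOP_DIST_DIR=hadoop\azurehdinsight_3.6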

Complete Upgrade Tasks


If you upgraded the Informatica platform, you need to perform some additional tasks within the Informatica
domain.

Based on the version that you upgraded from, you might need to update the following types of objects:
Connections

Based on the version you are upgrading from, you might need to update Hadoop connections or replace
connections to the Hadoop environment.

The Hadoop connection contains additional properties. You need to manually update it to include
customized configuration in the hadoopEnv.properties file from previous versions.
Streaming mappings

A streaming mapping might contain data objects or transformations for which support is deferred. Support
will be reinstated in a future release.

After you upgrade, the streaming mappings become invalid. You must re-create the physical data objects
to run the mappings on the Spark engine, which uses Spark Structured Streaming.

After you re-create the physical data objects, some properties are no longer available for Azure Event Hubs
data objects.



Update Connections
You might need to update connections based on the version you are upgrading from.

Consider the following types of updates that you might need to make:
Configure the Hadoop connection.

Configure the Hadoop connection to incorporate properties from the hadoopEnv.properties file.

Replace connections.

If you chose the option to create connections when you ran the Cluster Configuration wizard, you need
to replace connections in mappings with the new connections.

Complete connection upgrades.

If you did not create connections when you created the cluster configuration, you need to update the
connections.

Configure the Hadoop Connection


To use properties that you customized in the hadoopEnv.properties file, you must configure the Hadoop
connection properties such as cluster environment variables, cluster path variables, and advanced properties.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.

When you run the Informatica upgrade, the installer backs up the existing hadoopEnv.properties file. You can
find the backup hadoopEnv.properties file in the following location:

<Previous Informatica installation directory>/services/shared/hadoop/<Hadoop distribution name>_<version>/infaConf

Edit the Hadoop connection in the Administrator tool or the Developer tool to include any properties that you
manually configured in the hadoopEnv.properties file. The Hadoop connection contains default values for
properties such as cluster environment variables, cluster path variables, and advanced properties. You can
update the default values to match the properties in the hadoopEnv.properties file.

Replace the Connections with New Connections


If you created connections when you imported the cluster configuration, you need to replace connections in
mappings with the new connections.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.

The method that you use to replace connections in mappings depends on the type of connection.
Hadoop connection

Run the following commands to replace the connections:

• infacmd dis replaceMappingHadoopRuntimeConnections. Replaces connections associated with
  mappings that are deployed in applications.
• infacmd mrs replaceMappingHadoopRuntimeConnections. Replaces connections associated with
  mappings that you run from the Developer tool.



For information about the infacmd commands, see the Informatica Command Reference.

Hive, HDFS, and HBase connections

You must replace the connections manually.

Complete Connection Upgrade


If you did not create connections when you imported the cluster configuration, you need to update connection
properties for Hadoop, Hive, HDFS, and HBase connections.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.


- You upgraded from version 10.2 or later and changed the distribution version.

Perform the following tasks to update the connections:

Update changed properties

Review connections that you created in a previous release to update the values for connection
properties. For example, if you added nodes to the cluster or if you updated the distribution version, you
might need to verify host names, URIs, or port numbers for some of the properties.

Associate the cluster configuration

The Hadoop, Hive, HDFS, and HBase connections must be associated with a cluster configuration.
Complete the following tasks:

1. Run infacmd isp listConnections to identify the connections that you need to upgrade. Use -ct
to list connections of a particular type.
2. Run infacmd isp UpdateConnection to associate the cluster configuration with the connection.
Use -cn to name the connection and -o clusterConfigID to associate the cluster configuration
with the connection.

For more information about infacmd, see the Informatica Command Reference.

Update Streaming Objects


Big Data Streaming uses Spark Structured Streaming to process data instead of Spark Streaming. To support
Spark Structured Streaming, some header ports are added to the data objects, and support for some of the
data objects and transformations is deferred to a future release. The behavior of some of the data objects
is also updated.

After you upgrade, the existing streaming mappings become invalid because of the unavailable header ports,
the unsupported transformations or data objects, and the behavior change of some data objects.

Perform this task in the following situations:

- You upgraded from version 10.1.1, 10.2.0, or 10.2.1.

To use an existing streaming mapping, perform the following tasks:

• Re-create the physical data objects. After you re-create the physical data objects, the data objects get the
required header ports, such as timestamp, partitionID, or key based on the data object.



• In a Normalizer transformation, if the Occurs column is set to Auto, re-create the Normalizer
transformation. You must re-create the Normalizer transformation because the type configuration
property of the complex port refers to the physical data object that you plan to replace.
• Update the streaming mapping. If the mapping contains a Kafka target, an Aggregator transformation, a Joiner
transformation, or a Normalizer transformation, replace the data object or transformation, and then update
the mapping because of the changed behavior of these transformations and data objects.
• Verify the deferred data object types. If the streaming mapping contains unsupported transformations or
data objects, contact Informatica Global Customer Support.

Re-create the Physical Data Objects


When you re-create the physical data objects, the data objects get the required header ports, and some
properties are no longer available for some data objects. Update the existing mapping with the newly created
physical data objects.

1. Open the existing mapping and select the data object.
2. Click the Properties tab. On the Column Projection tab, click Edit Schema.
3. Note the schema information from the Edit Schema dialog box.
4. Note the parameters information from the Parameters tab.
5. Create new physical data objects.
After you re-create the data objects, the physical data objects get the required header ports. Microsoft
Azure does not support the following properties, so they are not available for Azure Event Hubs data objects:

• Consumer Properties
• Partition Count

Update the Streaming Mappings


After you re-create the data objects, replace the existing data objects with the re-created data objects. If the
mapping contains a Normalizer transformation, an Aggregator transformation, or a Joiner transformation, update
the mapping because of the changed behavior of these transformations and data objects.

Transformation Updates

If a transformation uses a complex port, configure the type configuration property of the port because
the property refers to the physical data object that you replaced.

Aggregator and Joiner Transformation Updates

An Aggregator transformation must be downstream from a Joiner transformation. A Window transformation
must be directly upstream from both Aggregator and Joiner transformations. Previously, you could use an
Aggregator transformation anywhere in the streaming mapping.

If a mapping contains an Aggregator transformation upstream from a Joiner transformation, move the
Aggregator transformation downstream from a Joiner transformation. Add a Window transformation
directly upstream from both Aggregator and Joiner transformations.



Verify the Deferred Data Object Types
After you upgrade, the streaming mappings might contain some transformations and data objects that are
deferred.

The following table lists the data object types to which the support is deferred to a future release:

Object Type        Object

Source             JMS, MapR Streams

Target             MapR Streams

Transformation     Data Masking, Joiner, Rank, Sorter

If you want to continue using the mappings that contain deferred data objects or transformations, you must
contact Informatica Global Customer Support.



Chapter 5

Cloudera CDH Integration Tasks


This chapter includes the following topics:

• Cloudera CDH Task Flows, 74


• Prepare for Cluster Import from Cloudera CDH, 79
• Create a Cluster Configuration, 84
• Verify or Refresh the Cluster Configuration , 87
• Verify JDBC Drivers for Sqoop Connectivity, 87
• Import Security Certificates to Clients, 89
• Configure the Developer Tool, 89
• Complete Upgrade Tasks, 90

Cloudera CDH Task Flows


Depending on whether you want to integrate or upgrade Big Data Management in a Cloudera CDH
environment, you can use the flow charts to perform the following tasks:

• Integrate the Informatica domain with Cloudera CDH for the first time.
• Upgrade from version 10.2.1.
• Upgrade from version 10.2.
• Upgrade from a version earlier than 10.2.

Task Flow to Integrate with Cloudera CDH
The following diagram shows the task flow to integrate the Informatica domain with Cloudera CDH:



Task Flow to Upgrade from Version 10.2.1
The following diagram shows the task flow to upgrade Big Data Management 10.2.1 for Cloudera CDH:



Task Flow to Upgrade from Version 10.2
The following diagram shows the task flow to upgrade Big Data Management 10.2 for Cloudera CDH:



Task Flow to Upgrade from a Version Earlier than 10.2
The following diagram shows the task flow to upgrade Big Data Management from a version earlier than 10.2
for Cloudera CDH:



Prepare for Cluster Import from Cloudera CDH
Before the Informatica administrator can import cluster information to create a cluster configuration in the
Informatica domain, the Hadoop administrator must perform some preliminary tasks.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any previous version.

Note: If you are upgrading from a previous version, verify the properties and suggested values, as Big Data
Management might require additional properties or different values for existing properties.

Complete the following tasks to prepare the cluster before the Informatica administrator creates the cluster
configuration:

1. Verify property values in *-site.xml files that Big Data Management needs to run mappings in the Hadoop
environment.
2. Provide information to the Informatica administrator that is required to import cluster information into
the domain. Depending on the method of import, perform one of the following tasks:
• To import directly from the cluster, give the Informatica administrator cluster authentication
information to connect to the cluster.
• To import from an archive file, export cluster information and provide an archive file to the Big Data
Management administrator.

Configure *-site.xml Files for Cloudera CDH


The Hadoop administrator needs to configure *-site.xml file properties and restart impacted services before
the Informatica administrator imports cluster information into the domain.

core-site.xml
Configure the following properties in the core-site.xml file:
fs.s3.enableServerSideEncryption

Enables server-side encryption for S3 buckets. Required for SSE and SSE-KMS encryption.

Set to: TRUE

fs.s3a.access.key

The ID for the Blaze and Spark engines to connect to the Amazon S3 file system.

Set to your access key.

fs.s3a.secret.key

The password for the Blaze and Spark engines to connect to the Amazon S3 file system.

Set to your secret access key.



fs.s3a.server-side-encryption-algorithm

The server-side encryption algorithm for S3. Required for SSE and SSE-KMS encryption. Set to the
encryption algorithm used.
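
For reference, a core-site.xml sketch of the Amazon S3 properties described above; the access key, secret key,
and algorithm values are placeholders, with AES256 shown as an example SSE algorithm:

<!-- Example only: replace the credentials and algorithm with values for your S3 buckets -->
<property>
  <name>fs.s3.enableServerSideEncryption</name>
  <value>true</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.server-side-encryption-algorithm</name>
  <value>AES256</value>
</property>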

hadoop.proxyuser.<proxy user>.groups

Defines the groups that the proxy user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.

Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.

hadoop.proxyuser.<proxy user>.hosts

Defines the host machines from which the proxy user account can impersonate other users. On a secure
cluster, the <proxy user> is the Service Principal Name that corresponds to the cluster keytab file. On a
non-secure cluster, the <proxy user> is the system user that runs the Informatica daemon.

Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.

io.compression.codecs

Enables compression on temporary staging tables.

Set to a comma-separated list of compression codec classes on the cluster.

hadoop.security.auth_to_local

Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.

Set to: RULE:[1:$1@$0](^.*@YOUR.REALM)s/^(.*)@YOUR.REALM\.COM$/$1/g

Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.$)s/^(.*)@YOUR.REALM\.COM$/$1/g

hbase-site.xml
Configure the following properties in the hbase-site.xml file:
zookeeper.znode.parent

Identifies HBase master and region servers.

Set to the relative path to the znode directory of HBase.

hdfs-site.xml
Configure the following properties in the hdfs-site.xml file:
dfs.encryption.key.provider.uri

The KeyProvider used to interact with encryption keys when reading and writing to an encryption zone.
Required if sources or targets reside in the HDFS encrypted zone on Java KeyStore KMS-enabled
Cloudera CDH cluster or a Ranger KMS-enabled Hortonworks HDP cluster.

Set to: kms://[email protected]:16000/kms

hive-site.xml
Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class

The token store implementation. Required for HiveServer2 high availability and load balancing.

Set to: org.apache.hadoop.hive.thrift.DBTokenStore



hive.exec.dynamic.partition

Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.

Set to: TRUE

hive.exec.dynamic.partition.mode

Allows all partitions to be dynamic. Required if you use Sqoop and define a DDL query to create or
replace a partitioned Hive target at run time.

Set to: nonstrict

hiveserver2_load_balancer

Enables high availability for multiple HiveServer2 hosts.

Set to: jdbc:hive2://<HiveServer2 Load Balancer>:<HiveServer2 Port>/default;principal=hive/<HiveServer2 Load Balancer>@<REALM>

mapred-site.xml
Configure the following properties in the mapred-site.xml file:
mapreduce.application.classpath

A comma-separated list of CLASSPATH entries for MapReduce applications. Required for Sqoop.

Include the entries: $HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$MR2_CLASSPATH,$CDH_MR2_HOME

mapreduce.framework.name

The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for
Sqoop.

Set to: yarn

mapreduce.jobhistory.address

Location of the MapReduce JobHistory Server. The default port is 10020. Required for Sqoop.

Set to: <MapReduce JobHistory Server>:<port>

mapreduce.jobhistory.intermediate-done-dir

Directory where MapReduce jobs write history files. Required for Sqoop.

Set to: /mr-history/tmp

mapreduce.jobhistory.done-dir

Directory where the MapReduce JobHistory Server manages history files. Required for Sqoop.

Set to: /mr-history/done

mapreduce.jobhistory.principal
The Service Principal Name for the MapReduce JobHistory Server. Required for Sqoop.

Set to: mapred/_HOST@YOUR-REALM

mapreduce.jobhistory.webapp.address

Web address of the MapReduce JobHistory Server. The default value is 19888. Required for Sqoop.

Set to: <host>:<port>

yarn.app.mapreduce.am.staging-dir

The HDFS staging directory used while submitting jobs.

Set to the staging directory path.
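
For reference, a mapred-site.xml sketch of the JobHistory-related settings required for Sqoop, using the
suggested values from this section; the host, port, and realm values are placeholders:

<!-- Example only: replace the host, port, and realm with values from your cluster -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
<property>
  <name>mapreduce.jobhistory.address</name>
  <value>jobhistory.example.com:10020</value>
</property>
<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>jobhistory.example.com:19888</value>
</property>
<property>
  <name>mapreduce.jobhistory.principal</name>
  <value>mapred/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/mr-history/tmp</value>
</property>
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>/mr-history/done</value>
</property>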



yarn-site.xml
Configure the following properties in the yarn-site.xml file:
yarn.application.classpath

Required for dynamic resource allocation.

Add spark_shuffle.jar to the class path. The .jar file must contain the class
"org.apache.spark.network.yarn.YarnShuffleService."

yarn.nodemanager.resource.memory-mb

The maximum RAM available for each container. Set the maximum memory on the cluster to increase
resource memory available to the Blaze engine.

Set to 16 GB if the value is less than 16 GB.

yarn.nodemanager.resource.cpu-vcores

The number of virtual cores for each container. Required for Blaze engine resource allocation.

Set to 10 if the value is less than 10.

yarn.scheduler.minimum-allocation-mb

The minimum RAM available for each container. Required for Blaze engine resource allocation.

Set to 6 GB if the value is less than 6 GB.

yarn.nodemanager.vmem-check-enabled

Disables virtual memory limits for containers. Required for the Blaze and Spark engines.

Set to: FALSE

yarn.nodemanager.aux-services

Required for dynamic resource allocation for the Spark engine.

Add an entry for "spark_shuffle."

yarn.nodemanager.aux-services.spark_shuffle.class

Required for dynamic resource allocation for the Spark engine.

Set to: org.apache.spark.network.yarn.YarnShuffleService

yarn.resourcemanager.scheduler.class

Defines the YARN scheduler that the Data Integration Service uses to assign resources.

Set to: org.apache.hadoop.yarn.server.resourcemanager.scheduler

yarn.node-labels.enabled

Enables node labeling.

Set to: TRUE

yarn.node-labels.fs-store.root-dir

The HDFS location to update node label dynamically.

Set to: <hdfs://[Node name]:[Port]/[Path to store]/[Node labels]/>



Prepare for Direct Import from Cloudera CDH
If you plan to provide direct access to the Informatica administrator to import cluster information, provide the
required connection information.

The following table describes the information that you need to provide to the Informatica administrator to
create the cluster configuration directly from the cluster:

Property Description

Host IP address of the cluster manager.

Port Port of the cluster manager.

User ID Cluster user ID.

Password Password for the user.

Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If you do
not provide a cluster name, the wizard imports information based on the default cluster.
To find the correct Cloudera cluster name when you have multiple clusters, perform the following
steps:
1. Log in to Cloudera Manager, appending the following string to the URL: /api/v8/clusters
2. Provide the Informatica Administrator the cluster property name that appears in the browser tab.
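
For example, if Cloudera Manager runs on a hypothetical host cm.example.com on the default port 7180, the URL in step 1 might look like: http://cm.example.com:7180/api/v8/clusters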

Prepare the Archive File for Import from Cloudera CDH


If you plan to provide an archive file for the Informatica administrator, ensure that you include all required
*-site.xml files.

Create a .zip or .tar file that contains the following *-site.xml files:

• core-site.xml
• hbase-site.xml. Required only for access to HBase sources and targets.
• hdfs-site.xml
• hive-site.xml
• mapred-site.xml
• yarn-site.xml

Give the Informatica administrator access to the archive file to import the cluster information into the
domain.



Create a Cluster Configuration
After the Hadoop administrator prepares the cluster for import, the Informatica administrator must create a
cluster configuration.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.1.1 or earlier.

A cluster configuration is an object in the domain that contains configuration information about the Hadoop
cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop
environment. Import configuration properties from the Hadoop cluster to create a cluster configuration.

The import process imports values from *-site.xml files into configuration sets based on the individual *-
site.xml files. When you perform the import, the cluster configuration wizard can create Hadoop, HBase,
HDFS, and Hive connections to access the Hadoop environment. If you choose to create the connections, the
wizard also associates the cluster configuration with the connections.

Note: If you are integrating for the first time and you imported the cluster configuration when you ran the
installer, you must re-create or refresh the cluster configuration.

Before You Import


Before you can import the cluster configuration, you must get information from the Hadoop administrator
based on the method of import.

If you import directly from the cluster, contact the Hadoop administrator to get cluster connection
information. If you import from a file, get an archive file of exported cluster information.

Importing a Hadoop Cluster Configuration from the Cluster


When you import the Hadoop cluster configuration directly from the cluster, you provide information to
connect to the cluster.

Get cluster connection information from the Hadoop administrator.

1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following General properties:

Property Description

Cluster configuration name    Name of the cluster configuration.

Description Optional description of the cluster configuration.

Distribution type The cluster Hadoop distribution type.




Distribution version Version of the Hadoop distribution.


Each distribution type has a default version. The default version is the latest version of the
Hadoop distribution that Big Data Management supports.
Note: When the cluster version differs from the default version and Informatica supports
more than one version, the cluster configuration import process populates the property
with the most recent supported version. For example, consider the case where Informatica
supports versions 5.10 and 5.13, and the cluster version is 5.12. In this case, the cluster
configuration import process populates this property with 5.10, because 5.10 is the most
recent supported version before 5.12.
You can edit the property to choose any supported version. Restart the Data Integration
Service for the changes to take effect.

Method to import the cluster configuration    Choose Import from cluster.

Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster environment
variables, cluster path variables, and advanced properties. Based on the cluster
environment and the functionality that you use, you can add to the default values or
change the default values of these properties. For a list of Hadoop connection properties
to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.

The cluster properties appear.

4. Configure the following properties:

Property Description

Host IP address of the cluster manager.

Port Port of the cluster manager.

User ID Cluster user ID.

Password Password for the user.

Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If
you do not provide a cluster name, the wizard imports information based on the default cluster.

5. Click Next and verify the cluster configuration information on the summary page.



Importing a Hadoop Cluster Configuration from a File
You can import properties from an archive file to create a cluster configuration.

Before you import from the cluster, you must get the archive file from the Hadoop administrator.

1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:

Property Description

Cluster configuration name Name of the cluster configuration.

Description Optional description of the cluster configuration.

Distribution type The cluster Hadoop distribution type.

Distribution version Version of the Hadoop distribution.


Each distribution type has a default version. This is the latest version of the Hadoop
distribution that Big Data Management supports.
When the cluster version differs from the default version, the cluster configuration wizard
populates the cluster configuration Hadoop distribution property with the most recent
supported version relative to the cluster version. For example, suppose Informatica
supports versions 5.10 and 5.13, and the cluster version is 5.12. In this case, the wizard
populates the version with 5.10.
You can edit the property to choose any supported version. Restart the Data Integration
Service for the changes to take effect.

Method to import the cluster configuration Choose Import from file to import properties from an archive file.

Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster
environment variables, cluster path variables, and advanced properties. Based on the
cluster environment and the functionality that you use, you can add to the default values
or change the default values of these properties. For a list of Hadoop connection
properties to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.

4. Click Browse to select a file. Select the file and click Open.
5. Click Next and verify the cluster configuration information on the summary page.



Verify or Refresh the Cluster Configuration
You might need to refresh the cluster configuration or update the distribution version in the cluster
configuration when you upgrade.

Perform this task in the following situation:

- You upgraded from version 10.2 or later.

Verify the Cluster Configuration


The cluster configuration contains a property for the distribution version. The verification task depends on
the version that you upgraded from:
Upgrade from 10.2

If you upgraded from 10.2 and you changed the distribution version, you need to verify the distribution
version in the General properties of the cluster configuration.

Upgrade from 10.2.1

Effective in version 10.2.1, Informatica assigns a default version to each Hadoop distribution type. If you
configure the cluster configuration to use the default version, the upgrade process upgrades to the
assigned default version if the version changes. If you have not upgraded your Hadoop distribution to
Informatica's default version, you need to update the distribution version property.
For example, suppose the assigned default Hadoop distribution version for 10.2.1 is n, and for 10.2.2 is n+1.
If the cluster configuration uses the default supported Hadoop version of n, the upgraded cluster
configuration uses the default version of n+1. If you have not upgraded the distribution in the Hadoop
environment, you need to change the cluster configuration to use version n.

If you configure the cluster configuration to use a distribution version that is not the default version, you
need to update the distribution version property in the following circumstances:

• Informatica dropped support for the distribution version.


• You changed the distribution version.

Refresh the Cluster Configuration


If you updated any of the *-site.xml files noted in the topic to prepare for cluster import, you need to refresh
the cluster configuration in the Administrator tool.

Verify JDBC Drivers for Sqoop Connectivity


Verify that you have the JDBC drivers to access JDBC-compliant databases in the Hadoop environment. You
might need separate drivers for metadata import and for run-time processing.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.2.1 or earlier.



You download drivers based on design-time and run-time requirements:

• Design-time. To import metadata, you can use the DataDirect drivers packaged with the Informatica
installer if they are available. If they are not available, use any Type 4 JDBC driver that the database
vendor recommends.
• Run-time. To run mappings, use any Type 4 JDBC driver that the database vendor recommends. Some
distributions support other drivers to use Sqoop connectors. You cannot use the DataDirect drivers for
run-time processing.

Verify Design-time Drivers


Use the DataDirect JDBC drivers packaged with the Informatica installer to import metadata from JDBC-
compliant databases. If the DataDirect JDBC drivers are not available for a specific JDBC-compliant
database, download the Type 4 JDBC driver associated with that database.

Copy the JDBC driver .jar files to the following location on the Developer tool machine:

<Informatica installation directory>\clients\externaljdbcjars

Verify Run-time Drivers


Verify run-time drivers for mappings that access JDBC-compliant databases in the Hadoop environment. Use
any Type 4 JDBC driver that the database vendor recommends.

1. Download Type 4 JDBC drivers associated with the JDBC-compliant databases that you want to access.
2. To use Sqoop TDCH Cloudera Connector Powered by Teradata, perform the following tasks:
• Download all .jar files in the Cloudera Connector Powered by Teradata package from the following
location: http://www.cloudera.com/downloads.html. The package has the following naming
convention: sqoop-connector-teradata-<version>.tar
• Download terajdbc4.jar and tdgssconfig.jar from the following location:
http://downloads.teradata.com/download/connectivity/jdbc-driver
3. To optimize the Sqoop mapping performance on the Spark engine while writing data to an HDFS
complex file target of the Parquet format, download the following .jar files:
• parquet-hadoop-bundle-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-hadoop-bundle/1.6.0/
• parquet-avro-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-avro/1.6.0/
• parquet-column-1.5.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-column/1.5.0/
4. Copy all of the .jar files to the following directory on the machine where the Data Integration Service
runs:
<Informatica installation directory>\externaljdbcjars
Changes take effect after you recycle the Data Integration Service. At run time, the Data Integration
Service copies the .jar files to the Hadoop distribution cache so that the .jar files are accessible to all
nodes in the cluster.
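
For example, on a Linux machine that hosts the Data Integration Service, the copy step might look like the
following sketch. The download directory, the Informatica installation path, and the driver file names are
hypothetical; substitute the files that you actually downloaded:

# Hypothetical paths and file names
cp ~/downloads/terajdbc4.jar ~/downloads/tdgssconfig.jar ~/downloads/parquet-*.jar \
   /opt/Informatica/externaljdbcjars/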



Import Security Certificates to Clients
When you use custom, special, or self-signed security certificates to secure the Hadoop cluster, Informatica
clients that connect to the cluster require these certificates to be present in the client machine truststore.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.1.1 or earlier.

To connect to the Hadoop cluster to develop a mapping, the Developer tool requires security certificate
aliases on the machine that hosts the Developer tool. To run a mapping, the machine that hosts the Data
Integration Service requires these same certificate alias files.

Perform the following steps from the Developer tool host machine, and then repeat them from the Data
Integration Service host machine:

1. Run the following command to export the certificates from the cluster:
keytool -export -alias <alias name> -keystore <custom.truststore file location> -
file <exported certificate file location> -storepass <password>
For example,
keytool -export -alias <alias name> -keystore ~/custom.truststore -file ~/
exported.cer
The command produces a certificate file.
2. Import the security certificates to an SSL-enabled domain or to a domain that is not SSL-enabled by
using the following command:
keytool -import -trustcacerts -alias <alias name> -file <exported certificate file
location> -keystore <java cacerts location> -storepass <password>
For example,
keytool -import -alias <alias name> -file ~/exported.cer -keystore <Informatica
installation directory>/java/jre/lib/security/cacerts
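
To confirm that the import succeeded, you can list the alias in the target truststore. The following
command is a sketch; the default password for the Java cacerts file is typically changeit, but use the
password that is configured for your truststore:

keytool -list -alias <alias name> -keystore <Informatica installation directory>/java/jre/lib/
security/cacerts -storepass <password>

The command prints the certificate fingerprint if the alias is present.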

Configure the Developer Tool


To access the Hadoop environment from the Developer tool, the mapping developers must perform tasks on
each Developer tool machine.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any previous version.

Configure developerCore.ini
Edit developerCore.ini to successfully import local complex files available on the Developer tool machine.

When you import a complex file, such as Avro or Parquet, the imported object includes metadata associated
with the distribution in the Hadoop environment. If the file resides on the Developer tool machine, the import
process picks up the distribution information from the developerCore.ini file. You must edit the
developerCore.ini file to point to the distribution directory on the Developer tool machine.

You can find developerCore.ini in the following directory:


<Informatica installation directory>\clients\DeveloperClient
Add the following property:
-DINFA_HADOOP_DIST_DIR=hadoop\<distribution>_<version>
The change takes effect when you restart the Developer tool.
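
For example, if the Developer tool machine contains a Cloudera distribution directory named
cloudera_cdh6.1 under the Developer tool hadoop directory (a hypothetical directory name; use the
directory that exists on your machine), the entry looks similar to the following:

-DINFA_HADOOP_DIST_DIR=hadoop\cloudera_cdh6.1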

Complete Upgrade Tasks


If you upgraded the Informatica platform, you need to perform some additional tasks within the Informatica
domain.

Based on the version that you upgraded from, you might need to update the following types of objects:
Connections

Based on the version you are upgrading from, you might need to update Hadoop connections or replace
connections to the Hadoop environment.

The Hadoop connection contains additional properties. You need to manually update it to include any
configuration that you customized in the hadoopEnv.properties file in previous versions.

Streaming mappings

The mapping contains deferred data objects or transformations. Support will be reinstated in a future
release.

After you upgrade, the streaming mappings become invalid. You must re-create the physical data objects
to run the mappings on the Spark engine, which uses Spark Structured Streaming.

After you re-create the physical data objects, some properties are not available for Azure Event Hubs
data objects.

Update Connections
You might need to update connections based on the version you are upgrading from.

Consider the following types of updates that you might need to make:
Configure the Hadoop connection.

Configure the Hadoop connection to incorporate properties from the hadoopEnv.properties file.

Replace connections.

If you chose the option to create connections when you ran the Cluster Configuration wizard, you need
to replace connections in mappings with the new connections.

Complete connection upgrades.

If you did not create connections when you created the cluster configuration, you need to update the
connections.



Configure the Hadoop Connection
To use properties that you customized in the hadoopEnv.properties file, you must configure the Hadoop
connection properties such as cluster environment variables, cluster path variables, and advanced properties.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.

When you run the Informatica upgrade, the installer backs up the existing hadoopEnv.properties file. You can
find the backup hadoopEnv.properties file in the following location:

<Previous Informatica installation directory>/services/shared/hadoop/<Hadoop distribution name>_<version>/infaConf

Edit the Hadoop connection in the Administrator tool or the Developer tool to include any properties that you
manually configured in the hadoopEnv.properties file. The Hadoop connection contains default values for
properties such as cluster environment and path variables and advanced properties. You can update the
default values to match the properties in the hadoopEnv.properties file.

Replace the Connections with New Connections


If you created connections when you imported the cluster configuration, you need to replace connections in
mappings with the new connections.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.

The method that you use to replace connections in mappings depends on the type of connection.
Hadoop connection

Run the following commands to replace the connections:

• infacmd dis replaceMappingHadoopRuntimeConnections. Replaces connections associated with
mappings that are deployed in applications.
• infacmd mrs replaceMappingHadoopRuntimeConnections. Replaces connections associated with
mappings that you run from the Developer tool.

For information about the infacmd commands, see the Informatica Command Reference.

Hive, HDFS, and HBase connections

You must replace the connections manually.

Complete Connection Upgrade


If you did not create connections when you imported the cluster configuration, you need to update connection
properties for Hadoop, Hive, HDFS, and HBase connections.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.


- You upgraded from version 10.2 or later and changed the distribution version.



Perform the following tasks to update the connections:

Update changed properties

Review connections that you created in a previous release to update the values for connection
properties. For example, if you added nodes to the cluster or if you updated the distribution version, you
might need to verify host names, URIs, or port numbers for some of the properties.

Associate the cluster configuration

The Hadoop, Hive, HDFS, and HBase connections must be associated with a cluster configuration.
Complete the following tasks:

1. Run infacmd isp listConnections to identify the connections that you need to upgrade. Use -ct
to list connections of a particular type.
2. Run infacmd isp UpdateConnection to associate the cluster configuration with the connection.
Use -cn to name the connection and -o clusterConfigID to associate the cluster configuration
with the connection.

For more information about infacmd, see the Informatica Command Reference.
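
For example, the following commands are a sketch of the sequence. The domain, user, connection, and
cluster configuration names are hypothetical, and the connection type value and the exact -o syntax can
vary; see the Informatica Command Reference for the full command syntax:

# Hypothetical names; verify the connection type value and -o syntax in the Command Reference
infacmd isp listConnections -dn MyDomain -un Administrator -pd <password> -ct Hadoop
infacmd isp UpdateConnection -dn MyDomain -un Administrator -pd <password> -cn my_hadoop_connection -o clusterConfigId=my_cluster_config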

Update Streaming Objects


Big Data Streaming uses Spark Structured Streaming to process data instead of Spark Streaming. To support
Spark Structured Streaming, some header ports are added to the data objects, and support for some of the
data objects and transformations is deferred to a future release. The behavior of some of the data objects
is also updated.

After you upgrade, the existing streaming mappings become invalid because of the unavailable header ports,
the unsupported transformations or data objects, and the behavior change of some data objects.

Perform this task in the following situations:

- You upgraded from version 10.1.1, 10.2.0, or 10.2.1.

To use an existing streaming mapping, perform the following tasks:

• Re-create the physical data objects. After you re-create the physical data objects, the data objects get the
required header ports, such as timestamp, partitionID, or key based on the data object.
• In a Normalizer transformation, if the Occurs column is set to Auto, re-create the Normalizer
transformation. You must re-create the Normalizer transformation because the type configuration
property of the complex port refers to the physical data object that you plan to replace.
• Update the streaming mapping. If the mapping contains a Kafka target, Aggregator transformation, Joiner
transformation, or Normalizer transformation, replace the data object or transformation, and then update
the mapping because of the changed behavior of these transformations and data objects.
• Verify the deferred data object types. If the streaming mapping contains unsupported transformations or
data objects, contact Informatica Global Customer Support.

Re-create the Physical Data Objects


When you re-create the physical data objects, the physical data objects get the header ports and some
properties are not available for some data objects. Update the existing mapping with the newly created
physical data objects.

1. Open the existing mapping and select the data object in the mapping.
2. Click the Properties tab. On the Column Projection tab, click Edit Schema.
3. Note the schema information from the Edit Schema dialog box.
4. Note the parameters information from the Parameters tab.
5. Create new physical data objects.
After you re-create the data objects, the physical data objects get the required header ports. Microsoft
Azure does not support the following properties, so they are not available for Azure Event Hubs data objects:

• Consumer Properties
• Partition Count

Update the Streaming Mappings


After you re-create the data object, replace the existing data objects with the re-created data objects. If the
mapping contains a Normalizer transformation, Aggregator transformation, or Joiner transformation, update
the mapping because of the changed behavior of these transformations and data objects.

Transformation Updates

If a transformation uses a complex port, configure the type configuration property of the port because
the property refers to the physical data object that you replaced.

Aggregator and Joiner Transformation Updates

An Aggregator transformation must be downstream from a Joiner transformation. A Window
transformation must be directly upstream from both Aggregator and Joiner transformations. Previously,
you could use an Aggregator transformation anywhere in the streaming mapping.

If a mapping contains an Aggregator transformation upstream from a Joiner transformation, move the
Aggregator transformation downstream from a Joiner transformation. Add a Window transformation
directly upstream from both Aggregator and Joiner transformations.

Verify the Deferred Data Object Types


After you upgrade, the streaming mappings might contain some transformations and data objects that are
deferred.

The following table lists the data object types to which the support is deferred to a future release:

Object Type Object

Source JMS
MapR Streams

Target MapR Streams

Transformation Data Masking


Joiner
Rank
Sorter

If you want to continue using the mappings that contain deferred data objects or transformations, you must
contact Informatica Global Customer Support.



Chapter 6

Hortonworks HDP Integration Tasks
This chapter includes the following topics:

• Hortonworks HDP Task Flows, 94


• Prepare for Cluster Import from Hortonworks HDP, 99
• Create a Cluster Configuration, 104
• Verify or Refresh the Cluster Configuration , 107
• Verify JDBC Drivers for Sqoop Connectivity, 108
• Import Security Certificates to Clients, 109
• Configure the Developer Tool, 110
• Complete Upgrade Tasks, 110

Hortonworks HDP Task Flows


Depending on whether you want to integrate or upgrade Big Data Management in a Hortonworks HDP
environment, you can use the flow charts to perform the following tasks:

• Integrate the Informatica domain with Hortonworks HDP for the first time.
• Upgrade from version 10.2.1.
• Upgrade from version 10.2.
• Upgrade from a version earlier than 10.2.

Task Flow to Integrate with Hortonworks HDP
The following diagram shows the task flow to integrate the Informatica domain with Hortonworks HDP:



Task Flow to Upgrade from Version 10.2.1
The following diagram shows the task flow to upgrade Big Data Management from version 10.2.1 for
Hortonworks HDP:



Task Flow to Upgrade from Version 10.2
The following diagram shows the task flow to upgrade Big Data Management 10.2 for Hortonworks HDP:



Task Flow to Upgrade from a Version Earlier than 10.2
The following diagram shows the task flow to upgrade Big Data Management from a version earlier than 10.2
for Hortonworks HDP:



Prepare for Cluster Import from Hortonworks HDP
Before the Informatica administrator can import cluster information to create a cluster configuration in the
Informatica domain, the Hadoop administrator must perform some preliminary tasks.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any previous version.

Note: If you are upgrading from a previous version, verify the properties and suggested values, as Big Data
Management might require additional properties or different values for existing properties.

Complete the following tasks to prepare the cluster before the Informatica administrator creates the cluster
configuration:

1. Verify property values in *-site.xml files that Big Data Management needs to run mappings in the Hadoop
environment.
2. Provide information to the Informatica administrator that is required to import cluster information into
the domain. Depending on the method of import, perform one of the following tasks:
• To import directly from the cluster, give the Informatica administrator cluster authentication
information to connect to the cluster.
• To import from an archive file, export cluster information and provide an archive file to the Big Data
Management administrator.

Configure *-site.xml Files for Hortonworks HDP


The Hadoop administrator needs to configure *-site.xml file properties and restart impacted services before
the Informatica administrator imports cluster information into the domain.

core-site.xml
Configure the following properties in the core-site.xml file:
fs.s3.enableServerSideEncryption

Enables server side encryption for S3 buckets. Required for SSE and SSE-KMS encryption.

Set to: TRUE

fs.s3a.access.key

The ID for the Blaze and Spark engines to connect to the Amazon S3 file system.

Set to your access key.

fs.s3a.secret.key

The password for the Blaze and Spark engines to connect to the Amazon S3 file system.

Set to your secret access key.



fs.s3a.server-side-encryption-algorithm

The server-side encryption algorithm for S3. Required for SSE and SSE-KMS encryption. Set to the
encryption algorithm used.

hadoop.proxyuser.<proxy user>.groups

Defines the groups that the proxy user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.

Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.

hadoop.proxyuser.<proxy user>.hosts

Defines the host machines that a user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.

Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.

hadoop.proxyuser.yarn.groups

Comma-separated list of groups that you want to allow the YARN user to impersonate on a non-secure
cluster.

Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.

hadoop.proxyuser.yarn.hosts

Comma-separated list of hosts that you want to allow the YARN user to impersonate on a non-secure
cluster.

Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.

hadoop.security.auth_to_local

Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.

Set to: RULE:[1:$1@$0](^.*@YOUR.REALM)s/^(.*)@YOUR.REALM\.COM$/$1/g

Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.$)s/^(.*)@YOUR.REALM\.COM$/$1/g
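
For example, the proxy user entries described above might look like the following sketch on a non-secure
cluster where the Informatica daemon runs as the user infa (a hypothetical user name) and less restrictive
impersonation is acceptable:

<!-- "infa" is a hypothetical proxy user name; use the user that runs the Informatica daemon -->
<property>
  <name>hadoop.proxyuser.infa.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.infa.hosts</name>
  <value>*</value>
</property>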

hbase-site.xml
Configure the following properties in the hbase-site.xml file:
zookeeper.znode.parent

Identifies HBase master and region servers.

Set to the relative path to the znode directory of HBase.

hdfs-site.xml
Configure the following properties in the hdfs-site.xml file:
dfs.encryption.key.provider.uri

The KeyProvider used to interact with encryption keys when reading and writing to an encryption zone.
Required if sources or targets reside in the HDFS encrypted zone on Java KeyStore KMS-enabled
Cloudera CDH cluster or a Ranger KMS-enabled Hortonworks HDP cluster.

Set to: kms://[email protected]:16000/kms



hive-site.xml
Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class

The token store implementation. Required for HiveServer2 high availability and load balancing.

Set to: org.apache.hadoop.hive.thrift.DBTokenStore

hive.compactor.initiator.on

Runs the initiator and cleaner threads on metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.

Set to: TRUE

hive.compactor.worker.threads

The number of worker threads to run in a metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.

Set to: 1

hive.enforce.bucketing

Enables dynamic bucketing while loading to Hive. Required for an Update Strategy transformation in a
mapping that writes to a Hive target.

Set to: TRUE

io.compression.codecs

Enables compression on temporary staging tables.

Set to a comma-separated list of compression codec classes on the cluster.

hive.exec.dynamic.partition.mode

Allows all partitions to be dynamic. Required for the Update Strategy transformation in a mapping that
writes to a Hive target. Also required if you use Sqoop and define a DDL query to create or replace a
partitioned Hive target at run time.

Set to: nonstrict

hive.support.concurrency

Enables table locking in Hive. Required for an Update Strategy transformation in a mapping that writes to
a Hive target.

Set to: TRUE

hive.server2.support.dynamic.service.discovery

Enables HiveServer2 dynamic service discovery. Required for HiveServer2 high availability.

Set to: TRUE

hive.server2.zookeeper.namespace

The value of the ZooKeeper namespace in the JDBC connection string. Required for HiveServer2 high
availability.

Set to: jdbc:hive2://<zookeeper_ensemble>/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2

hive.txn.manager

Turns on transaction support. Required for an Update Strategy transformation in a mapping that writes
to a Hive target.

Set to: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager

hive.zookeeper.quorum

Comma-separated list of ZooKeeper server host:ports in a cluster. The value of the ZooKeeper ensemble
in the JDBC connection string. Required for HiveServer2 high availability.

Set to: jdbc:hive2://<zookeeper_ensemble>/default;serviceDiscoveryMode=zooKeeper;
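
For example, with a hypothetical three-node ZooKeeper ensemble, the HiveServer2 high availability
connection string that these properties describe looks similar to the following:

jdbc:hive2://zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2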

mapred-site.xml
Configure the following properties in the mapred-site.xml file:

mapreduce.framework.name
The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for
Sqoop.

Set to: yarn


yarn.app.mapreduce.am.staging-dir

The HDFS staging directory used while submitting jobs.

Set to the staging directory path.

yarn-site.xml
Configure the following properties in the yarn-site.xml file:
yarn.application.classpath

Required for dynamic resource allocation.

Add spark_shuffle.jar to the class path. The .jar file must contain the class
"org.apache.spark.network.yarn.YarnShuffleService."

yarn.nodemanager.resource.memory-mb

The maximum RAM available for each container. Set the maximum memory on the cluster to increase
resource memory available to the Blaze engine.

Set to 16 GB if the value is less than 16 GB.

yarn.nodemanager.resource.cpu-vcores

The number of virtual cores for each container. Required for Blaze engine resource allocation.

Set to 10 if the value is less than 10.

yarn.scheduler.minimum-allocation-mb

The minimum RAM available for each container. Required for Blaze engine resource allocation.

Set to 6 GB if the value is less than 6 GB.

yarn.nodemanager.vmem-check-enabled
Disables virtual memory limits for containers. Required for the Blaze and Spark engines.

Set to: FALSE

yarn.nodemanager.aux-services

Required for dynamic resource allocation for the Spark engine.

Add an entry for "spark_shuffle."

yarn.nodemanager.aux-services.spark_shuffle.class

Required for dynamic resource allocation for the Spark engine.

Set to: org.apache.spark.network.yarn.YarnShuffleService

yarn.resourcemanager.scheduler.class

Defines the YARN scheduler that the Data Integration Service uses to assign resources.

Set to: org.apache.hadoop.yarn.server.resourcemanager.scheduler

yarn.node-labels.enabled

Enables node labeling.

Set to: TRUE

yarn.node-labels.fs-store.root-dir

The HDFS location to update node label dynamically.

Set to: <hdfs://[Node name]:[Port]/[Path to store]/[Node labels]/>
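
The following sketch shows how the dynamic resource allocation entries for the Spark engine might appear
in yarn-site.xml. The mapreduce_shuffle entry is an assumption based on a common default; keep the
auxiliary services that your cluster already lists and append spark_shuffle:

<!-- mapreduce_shuffle is a typical existing entry; keep your cluster's current list and append spark_shuffle -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>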

tez-site.xml
Configure the following properties in the tez-site.xml file:
tez.runtime.io.sort.mb

The sort buffer memory. Required when the output needs to be sorted for Blaze and Spark engines.

Set value to 270 MB.

Prepare for Direct Import from Hortonworks HDP


If you plan to provide direct access to the Informatica administrator to import cluster information, provide the
required connection information.

The following table describes the information that you need to provide to the Informatica administrator to
create the cluster configuration directly from the cluster:

Property Description

Host IP address of the cluster manager.

Port Port of the cluster manager.

User ID Cluster user ID.

Password Password for the user.

Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If you do
not provide a cluster name, the wizard imports information based on the default cluster.

Prepare the Archive File for Import from Hortonworks HDP


When you prepare the archive file for cluster configuration import from Hortonworks, include all required *-
site.xml files and edit the file manually after you create it.

The Hortonworks cluster configuration archive file must have the following contents:

• core-site.xml
• hbase-site.xml. hbase-site.xml is required only if you access HBase sources and targets.
• hdfs-site.xml
• hive-site.xml

• mapred-site.xml or tez-site.xml. Include the mapred-site.xml file or the tez-site.xml file based on the Hive
execution type used on the Hadoop cluster.
• yarn-site.xml

Update the Archive File


After you create the archive file, edit the Hortonworks Data Platform (HDP) version string wherever it appears
in the archive file. Search for the string ${hdp.version} and replace all instances with the HDP version that
Hortonworks includes in the Hadoop distribution.

For example, the edited tez.lib.uris property looks similar to the following:
<property>
<name>tez.lib.uris</name>
<value>/hdp/apps/2.5.0.0-1245/tez/tez.tar.gz</value>
</property>
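
For example, if you extracted the archive contents to a working directory, you can replace the version
string in all of the *-site.xml files with commands similar to the following sketch, and then re-create the
archive. The working directory and the HDP version are examples; use the version that Hortonworks
includes in your Hadoop distribution:

# 2.5.0.0-1245 is an example HDP version string; the directory name is hypothetical
cd ~/hdp_cluster_conf
sed -i 's/\${hdp.version}/2.5.0.0-1245/g' *-site.xml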

Create a Cluster Configuration


After the Hadoop administrator prepares the cluster for import, the Informatica administrator must create a
cluster configuration.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.1.1 or earlier.

A cluster configuration is an object in the domain that contains configuration information about the Hadoop
cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop
environment. Import configuration properties from the Hadoop cluster to create a cluster configuration.

The import process imports values from *-site.xml files into configuration sets based on the individual *-
site.xml files. When you perform the import, the cluster configuration wizard can create Hadoop, HBase,
HDFS, and Hive connections to access the Hadoop environment. If you choose to create the connections, the
wizard also associates the cluster configuration with the connections.

Note: If you are integrating for the first time and you imported the cluster configuration when you ran the
installer, you must re-create or refresh the cluster configuration.

Before You Import


Before you can import the cluster configuration, you must get information from the Hadoop administrator
based on the method of import.

If you import directly from the cluster, contact the Hadoop administrator to get cluster connection
information. If you import from a file, get an archive file of exported cluster information.



Importing a Hadoop Cluster Configuration from the Cluster
When you import the Hadoop cluster configuration directly from the cluster, you provide information to
connect to the cluster.

Get cluster connection information from the Hadoop administrator.

1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following General properties:

Property Description

Cluster configuration name Name of the cluster configuration.

Description Optional description of the cluster configuration.

Distribution type The cluster Hadoop distribution type.

Distribution version Version of the Hadoop distribution.


Each distribution type has a default version. The default version is the latest version of the
Hadoop distribution that Big Data Management supports.
Note: When the cluster version differs from the default version and Informatica supports
more than one version, the cluster configuration import process populates the property
with the most recent supported version. For example, consider the case where Informatica
supports versions 5.10 and 5.13, and the cluster version is 5.12. In this case, the cluster
configuration import process populates this property with 5.10, because 5.10 is the most
recent supported version before 5.12.
You can edit the property to choose any supported version. Restart the Data Integration
Service for the changes to take effect.

Method to import the cluster configuration Choose Import from cluster.

Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster environment
variables, cluster path variables, and advanced properties. Based on the cluster
environment and the functionality that you use, you can add to the default values or
change the default values of these properties. For a list of Hadoop connection properties
to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.

The cluster properties appear.

4. Configure the following properties:

Property Description

Host IP address of the cluster manager.

Port Port of the cluster manager.

User ID Cluster user ID.

Password Password for the user.

Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If
you do not provide a cluster name, the wizard imports information based on the default cluster.

5. Click Next and verify the cluster configuration information on the summary page.

Importing a Hadoop Cluster Configuration from a File


You can import properties from an archive file to create a cluster configuration.

Before you import from the cluster, you must get the archive file from the Hadoop administrator.

1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:

Property Description

Cluster configuration name Name of the cluster configuration.

Description Optional description of the cluster configuration.

Distribution type The cluster Hadoop distribution type.

Distribution version Version of the Hadoop distribution.


Each distribution type has a default version. This is the latest version of the Hadoop
distribution that Big Data Management supports.
When the cluster version differs from the default version, the cluster configuration wizard
populates the cluster configuration Hadoop distribution property with the most recent
supported version relative to the cluster version. For example, suppose Informatica
supports versions 5.10 and 5.13, and the cluster version is 5.12. In this case, the wizard
populates the version with 5.10.
You can edit the property to choose any supported version. Restart the Data Integration
Service for the changes to take effect.

Method to import the cluster configuration Choose Import from file to import properties from an archive file.

Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster
environment variables, cluster path variables, and advanced properties. Based on the
cluster environment and the functionality that you use, you can add to the default values
or change the default values of these properties. For a list of Hadoop connection
properties to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.

4. Click Browse to select a file. Select the file and click Open.
5. Click Next and verify the cluster configuration information on the summary page.

Verify or Refresh the Cluster Configuration


You might need to refresh the cluster configuration or update the distribution version in the cluster
configuration when you upgrade.

Perform this task in the following situation:

- You upgraded from version 10.2 or later.

Verify the Cluster Configuration


The cluster configuration contains a property for the distribution version. The verification task depends on
the version that you upgraded from:
Upgrade from 10.2

If you upgraded from 10.2 and you changed the distribution version, you need to verify the distribution
version in the General properties of the cluster configuration.

Upgrade from 10.2.1

Effective in version 10.2.1, Informatica assigns a default version to each Hadoop distribution type. If you
configure the cluster configuration to use the default version, the upgrade process upgrades to the
assigned default version if the version changes. If you have not upgraded your Hadoop distribution to
Informatica's default version, you need to update the distribution version property.

For example, suppose the assigned default Hadoop distribution version for 10.2.1 is n, and for 10.2.2 is n+1.
If the cluster configuration uses the default supported Hadoop version of n, the upgraded cluster
configuration uses the default version of n+1. If you have not upgraded the distribution in the Hadoop
environment, you need to change the cluster configuration to use version n.

If you configure the cluster configuration to use a distribution version that is not the default version, you
need to update the distribution version property in the following circumstances:

• Informatica dropped support for the distribution version.


• You changed the distribution version.

Refresh the Cluster Configuration


If you updated any of the *-site.xml files noted in the topic to prepare for cluster import, you need to refresh
the cluster configuration in the Administrator tool.

Verify JDBC Drivers for Sqoop Connectivity


Verify that you have the JDBC drivers to access JDBC-compliant databases in the Hadoop environment. You
might need separate drivers for metadata import and for run-time processing.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.2.1 or earlier.

You download drivers based on design-time and run-time requirements:

• Design-time. To import metadata, you can use the DataDirect drivers packaged with the Informatica
installer if they are available. If they are not available, use any Type 4 JDBC driver that the database
vendor recommends.
• Run-time. To run mappings, use any Type 4 JDBC driver that the database vendor recommends. Some
distributions support other drivers to use Sqoop connectors. You cannot use the DataDirect drivers for
run-time processing.

Verify Design-time Drivers


Use the DataDirect JDBC drivers packaged with the Informatica installer to import metadata from JDBC-
compliant databases. If the DataDirect JDBC drivers are not available for a specific JDBC-compliant
database, download the Type 4 JDBC driver associated with that database.

Copy the JDBC driver .jar files to the following location on the Developer tool machine:

<Informatica installation directory>\clients\externaljdbcjars

Verify Run-time Drivers


Verify run-time drivers for mappings that access JDBC-compliant databases in the Hadoop environment. Use
any Type 4 JDBC driver that the database vendor recommends.

1. Download Type 4 JDBC drivers associated with the JDBC-compliant databases that you want to access.
2. To use Sqoop TDCH Hortonworks Connector for Teradata, perform the following task:

Download all .jar files in the Hortonworks Connector for Teradata package from the following
location: http://hortonworks.com/downloads/#addons
The package has the following naming convention: hdp-connector-for-teradata-<version>-
distro.tar.gz
3. To optimize the Sqoop mapping performance on the Spark engine while writing data to an HDFS
complex file target of the Parquet format, download the following .jar files:
• parquet-hadoop-bundle-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-hadoop-bundle/1.6.0/
• parquet-avro-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-avro/1.6.0/
• parquet-column-1.5.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-column/1.5.0/
4. Copy all of the .jar files to the following directory on the machine where the Data Integration Service
runs:
<Informatica installation directory>\externaljdbcjars
Changes take effect after you recycle the Data Integration Service. At run time, the Data Integration
Service copies the .jar files to the Hadoop distribution cache so that the .jar files are accessible to all
nodes in the cluster.
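
You can recycle the Data Integration Service from the Administrator tool, or from the command line with
commands similar to the following sketch. The domain, user, and service names are hypothetical; see the
Informatica Command Reference for the full syntax:

# Hypothetical domain, user, and service names
infacmd isp DisableService -dn MyDomain -un Administrator -pd <password> -sn Data_Integration_Service
infacmd isp EnableService -dn MyDomain -un Administrator -pd <password> -sn Data_Integration_Service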

Import Security Certificates to Clients


When you use custom, special, or self-signed security certificates to secure the Hadoop cluster, Informatica
clients that connect to the cluster require these certificates to be present in the client machine truststore.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.1.1 or earlier.

To connect to the Hadoop cluster to develop a mapping, the Developer tool requires security certificate
aliases on the machine that hosts the Developer tool. To run a mapping, the machine that hosts the Data
Integration Service requires these same certificate alias files.

Perform the following steps from the Developer tool host machine, and then repeat them from the Data
Integration Service host machine:

1. Run the following command to export the certificates from the cluster:
keytool -export -alias <alias name> -keystore <custom.truststore file location> -
file <exported certificate file location> -storepass <password>
For example,
keytool -export -alias <alias name> -keystore ~/custom.truststore -file ~/
exported.cer
The command produces a certificate file.

2. Import the security certificates to an SSL-enabled domain or to a domain that is not SSL-enabled by
using the following command:
keytool -import -trustcacerts -alias <alias name> -file <exported certificate file
location> -keystore <java cacerts location> -storepass <password>
For example,
keytool -import -alias <alias name> -file ~/exported.cer -keystore <Informatica
installation directory>/java/jre/lib/security/cacerts

Configure the Developer Tool


To access the Hadoop environment from the Developer tool, the mapping developers must perform tasks on
each Developer tool machine.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any previous version.

Configure developerCore.ini
Edit developerCore.ini to successfully import local complex files available on the Developer tool machine.

When you import a complex file, such as Avro or Parquet, the imported object includes metadata associated
with the distribution in the Hadoop environment. If the file resides on the Developer tool machine, the import
process picks up the distribution information from the developerCore.ini file. You must edit the
developerCore.ini file to point to the distribution directory on the Developer tool machine.

You can find developerCore.ini in the following directory:


<Informatica installation directory>\clients\DeveloperClient
Add the following property:
-DINFA_HADOOP_DIST_DIR=hadoop\<distribution>_<version>
The change takes effect when you restart the Developer tool.

Complete Upgrade Tasks


If you upgraded the Informatica platform, you need to perform some additional tasks within the Informatica
domain.

Based on the version that you upgraded from, you might need to update the following types of objects:
Connections

Based on the version you are upgrading from, you might need to update Hadoop connections or replace
connections to the Hadoop environment.

The Hadoop connection contains additional properties. You need to manually update it to include any
configuration that you customized in the hadoopEnv.properties file in previous versions.

Streaming mappings

The mapping contains deferred data objects or transformations. Support will be reinstated in a future
release.

After you upgrade, the streaming mappings become invalid. You must re-create the physical data objects
to run the mappings on the Spark engine, which uses Spark Structured Streaming.

After you re-create the physical data objects, some properties are not available for Azure Event Hubs
data objects.

Update Connections
You might need to update connections based on the version you are upgrading from.

Consider the following types of updates that you might need to make:
Configure the Hadoop connection.

Configure the Hadoop connection to incorporate properties from the hadoopEnv.properties file.

Replace connections.

If you chose the option to create connections when you ran the Cluster Configuration wizard, you need
to replace connections in mappings with the new connections.

Complete connection upgrades.

If you did not create connections when you created the cluster configuration, you need to update the
connections.

Configure the Hadoop Connection


To use properties that you customized in the hadoopEnv.properties file, you must configure the Hadoop
connection properties such as cluster environment variables, cluster path variables, and advanced properties.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.

When you run the Informatica upgrade, the installer backs up the existing hadoopEnv.properties file. You can
find the backup hadoopEnv.properties file in the following location:

<Previous Informatica installation directory>/services/shared/hadoop/<Hadoop distribution name>_<version>/infaConf

Edit the Hadoop connection in the Administrator tool or the Developer tool to include any properties that you
manually configured in the hadoopEnv.properties file. The Hadoop connection contains default values for
properties such as cluster environment and path variables and advanced properties. You can update the
default values to match the properties in the hadoopEnv.properties file.



Replace the Connections with New Connections
If you created connections when you imported the cluster configuration, you need to replace connections in
mappings with the new connections.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.

The method that you use to replace connections in mappings depends on the type of connection.
Hadoop connection

Run the following commands to replace the connections:

• infacmd dis replaceMappingHadoopRuntimeConnections. Replaces connections associated with
mappings that are deployed in applications.
• infacmd mrs replaceMappingHadoopRuntimeConnections. Replaces connections associated with
mappings that you run from the Developer tool.
For information about the infacmd commands, see the Informatica Command Reference.

Hive, HDFS, and HBase connections

You must replace the connections manually.

Complete Connection Upgrade


If you did not create connections when you imported the cluster configuration, you need to update connection
properties for Hadoop, Hive, HDFS, and HBase connections.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.


- You upgraded from version 10.2 or later and changed the distribution version.

Perform the following tasks to update the connections:

Update changed properties

Review connections that you created in a previous release to update the values for connection
properties. For example, if you added nodes to the cluster or if you updated the distribution version, you
might need to verify host names, URIs, or port numbers for some of the properties.

Associate the cluster configuration

The Hadoop, Hive, HDFS, and HBase connections must be associated with a cluster configuration.
Complete the following tasks:

1. Run infacmd isp listConnections to identify the connections that you need to upgrade. Use -ct
to list connections of a particular type.
2. Run infacmd isp UpdateConnection to associate the cluster configuration with the connection.
Use -cn to name the connection and -o clusterConfigID to associate the cluster configuration
with the connection.

For more information about infacmd, see the Informatica Command Reference.



Update Streaming Objects
Big Data Streaming uses Spark Structured Streaming to process data instead of Spark Streaming. To support
Spark Structured Streaming, some header ports are added to the data objects, and support for some of the
data objects and transformations is deferred to a future release. The behavior of some of the data objects
is also updated.

After you upgrade, the existing streaming mappings become invalid because of the unavailable header ports,
the unsupported transformations or data objects, and the behavior change of some data objects.

Perform this task in the following situations:

- You upgraded from version 10.1.1, 10.2.0, or 10.2.1.

To use an existing streaming mapping, perform the following tasks:

• Re-create the physical data objects. After you re-create the physical data objects, the data objects get the
required header ports, such as timestamp, partitionID, or key based on the data object.
• In a Normalizer transformation, if the Occurs column is set to Auto, re-create the Normalizer
transformation. You must re-create the Normalizer transformation because the type configuration
property of the complex port refers to the physical data object that you plan to replace.
• Update the streaming mapping. If the mapping contains a Kafka target, Aggregator transformation, Joiner
transformation, or Normalizer transformation, replace the data object or transformation, and then update
the mapping because of the changed behavior of these transformations and data objects.
• Verify the deferred data object types. If the streaming mapping contains unsupported transformations or
data objects, contact Informatica Global Customer Support.

Re-create the Physical Data Objects


When you re-create the physical data objects, the physical data objects get the header ports and some
properties are not available for some data objects. Update the existing mapping with the newly created
physical data objects.

1. Open the existing mapping and select the data object in the mapping.
2. Click the Properties tab. On the Column Projection tab, click Edit Schema.
3. Note the schema information from the Edit Schema dialog box.
4. Note the parameters information from the Parameters tab.
5. Create new physical data objects.
After you re-create the data objects, the physical data objects get the required header ports. Microsoft
Azure does not support the following properties, so they are not available for Azure Event Hubs data objects:

• Consumer Properties
• Partition Count



Update the Streaming Mappings
After you re-create the data objects, replace the existing data objects in the mapping with the re-created data
objects. If the mapping contains a Normalizer transformation, an Aggregator transformation, or a Joiner
transformation, update the mapping because of the changed behavior of these transformations and data objects.

Transformation Updates

If a transformation uses a complex port, configure the type configuration property of the port because
the property refers to the physical data object that you replaced.

Aggregator and Joiner Transformation Updates

An Aggregator transformation must be downstream from a Joiner transformation. A Window
transformation must be directly upstream from both Aggregator and Joiner transformations. Previously,
you could use an Aggregator transformation anywhere in the streaming mapping.

If a mapping contains an Aggregator transformation upstream from a Joiner transformation, move the
Aggregator transformation downstream from a Joiner transformation. Add a Window transformation
directly upstream from both Aggregator and Joiner transformations.

Verify the Deferred Data Object Types


After you upgrade, the streaming mappings might contain some transformations and data objects that are
deferred.

The following table lists the data object types and transformations for which support is deferred to a future
release:

Object Type Objects

Source JMS, MapR Streams

Target MapR Streams

Transformation Data Masking, Joiner, Rank, Sorter

If you want to continue using the mappings that contain deferred data objects or transformations, you must
contact Informatica Global Customer Support.



Chapter 7

MapR Integration Tasks


This chapter includes the following topics:

• MapR Task Flows, 115


• Install and Configure the MapR Client , 120
• Prepare for Cluster Import from MapR, 120
• Create a Cluster Configuration, 125
• Verify or Refresh the Cluster Configuration , 126
• Verify JDBC Drivers for Sqoop Connectivity, 127
• Generate MapR Tickets, 128
• Configure the Developer Tool, 131
• Complete Upgrade Tasks, 132

MapR Task Flows


Depending on whether you want to integrate or upgrade Big Data Management in a MapR environment, you
can use the flow charts to perform the following tasks:

• Integrate the Informatica domain with MapR for the first time.
• Upgrade from version 10.2.1.
• Upgrade from version 10.2.
• Upgrade from a version earlier than 10.2.

Task Flow to Integrate with MapR
The following diagram shows the task flow to integrate the Informatica domain with MapR:



Task Flow to Upgrade from Version 10.2.1
The following diagram shows the task flow to upgrade Big Data Management 10.2.1 for MapR:



Task Flow to Upgrade from Version 10.2
The following diagram shows the task flow to upgrade Big Data Management 10.2 for MapR:



Task Flow to Upgrade from a Version Earlier than 10.2
The following diagram shows the task flow to upgrade Big Data Management from a version earlier than 10.2
for MapR:



Install and Configure the MapR Client
To enable communication between the Informatica domain and the MapR cluster, install and configure the
MapR client on the application service machines. The MapR client version on the MapR cluster and the
application service machines must match.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.2 or earlier.

You install the MapR client on the Data Integration Service, Metadata Access Service, and Analyst Service
machines in the following directory:

/opt/mapr

For instructions about installing and configuring the MapR client, refer to the MapR documentation at
https://mapr.com/docs/60/AdvancedInstallation/SettingUptheClient-install-mapr-client.html.
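For example, one quick way to confirm that the client version on an application service machine matches the cluster, assuming a default /opt/mapr installation, is to compare the build version file on both the client machine and a cluster node:

cat /opt/mapr/MapRBuildVersion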

Prepare for Cluster Import from MapR


Before the Informatica administrator can import cluster information to create a cluster configuration in the
Informatica domain, the Hadoop administrator must perform some preliminary tasks.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any previous version.

Note: If you are upgrading from a previous version, verify the properties and suggested values, as Big Data
Management might require additional properties or different values for existing properties.

Complete the following tasks to prepare the cluster before the Informatica administrator creates the cluster
configuration:

1. Verify property values in *-site.xml files that Big Data Management needs to run mappings in the Hadoop
environment.
2. Prepare the archive file to import into the domain.

Note: You cannot import cluster information directly from the MapR cluster into the Informatica domain.



Configure *-site.xml Files for MapR
The Hadoop administrator needs to configure *-site.xml file properties and restart impacted services before
the Informatica administrator imports cluster information into the domain.

core-site.xml
Configure the following properties in the core-site.xml file:
fs.s3.enableServerSideEncryption

Enables server side encryption for S3 buckets. Required for SSE and SSE-KMS encryption.

Set to: TRUE

fs.s3a.access.key

The ID for the Blaze and Spark engines to connect to the Amazon S3 file system.

Set to your access key.

fs.s3a.secret.key

The password for the Blaze and Spark engines to connect to the Amazon S3 file system.

Set to your secret access key.

fs.s3a.server-side-encryption-algorithm

The server-side encryption algorithm for S3. Required for SSE and SSE-KMS encryption. Set to the
encryption algorithm used.

hadoop.proxyuser.<proxy user>.groups

Defines the groups that the proxy user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.

Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.

hadoop.proxyuser.<proxy user>.hosts

Defines the host machines that a user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.

Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.

hadoop.proxyuser.yarn.groups

Comma-separated list of groups that you want to allow the YARN user to impersonate on a non-secure
cluster.

Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.

hadoop.proxyuser.yarn.hosts

Comma-separated list of hosts that you want to allow the YARN user to impersonate on a non-secure
cluster.

Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.

io.compression.codecs

Enables compression on temporary staging tables.



Set to a comma-separated list of compression codec classes on the cluster.

hadoop.security.auth_to_local

Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.

Set to: RULE:[1:$1@$0](^.*@YOUR.REALM)s/^(.*)@YOUR.REALM\.COM$/$1/g

Set to: RULE:[2:$1@$0](^.*@YOUR.REALM\.$)s/^(.*)@YOUR.REALM\.COM$/$1/g
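For reference, the following core-site.xml fragment shows how a few of these properties appear in the file. The property values and the proxy user name are illustrative placeholders; substitute the values that apply to your cluster:

<property>
  <name>fs.s3a.access.key</name>
  <value>your_access_key</value>
</property>
<property>
  <name>hadoop.proxyuser.infauser.groups</name>
  <value>group1,group2</value>
</property>
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>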

hbase-site.xml
Configure the following properties in the hbase-site.xml file:
zookeeper.znode.parent

Identifies HBase master and region servers.

Set to the relative path to the znode directory of HBase.

hive-site.xml
Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class

The token store implementation. Required for HiveServer2 high availability and load balancing.

Set to: org.apache.hadoop.hive.thrift.DBTokenStore

hive.compactor.initiator.on

Runs the initiator and cleaner threads on metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.

Set to: TRUE

hive.compactor.worker.threads

The number of worker threads to run in a metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.

Set to: 1

hive.enforce.bucketing

Enables dynamic bucketing while loading to Hive. Required for an Update Strategy transformation in a
mapping that writes to a Hive target.

Set to: TRUE

hive.exec.dynamic.partition

Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.

Set to: TRUE

hive.exec.dynamic.partition.mode

Allows all partitions to be dynamic. Required for the Update Strategy transformation in a mapping that
writes to a Hive target. Also required if you use Sqoop and define a DDL query to create or replace a
partitioned Hive target at run time.

Set to: nonstrict

hive.support.concurrency

Enables table locking in Hive. Required for an Update Strategy transformation in a mapping that writes to
a Hive target.

Set to: TRUE



hive.server2.support.dynamic.service.discovery

Enables HiveServer2 dynamic service discovery. Required for HiveServer2 high availability.

Set to: TRUE

hive.server2.zookeeper.namespace

The value of the ZooKeeper namespace in the JDBC connection string. Required for HiveServer2 high
availability.

Set to: jdbc:hive2://<zookeeper_ensemble>/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2

hive.txn.manager

Turns on transaction support. Required for an Update Strategy transformation in a mapping that writes
to a Hive target.
Set to: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager

hive.zookeeper.quorum

Comma-separated list of ZooKeeper server host:ports in a cluster. The value of the ZooKeeper ensemble
in the JDBC connection string. Required for HiveServer2 high availability.

Set to: jdbc:hive2://<zookeeper_ensemble>/default;serviceDiscoveryMode=zooKeeper;
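For example, with a three-node ZooKeeper ensemble on the default MapR ZooKeeper port, the resulting connection string might look like the following. The host names and port are illustrative:

jdbc:hive2://zk1.example.com:5181,zk2.example.com:5181,zk3.example.com:5181/default;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2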

mapred-site.xml
Configure the following properties in the mapred-site.xml file:

mapreduce.framework.name

The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for
Sqoop.

Set to: yarn

mapreduce.jobhistory.address

Location of the MapReduce JobHistory Server. The default port is 10020. Required for Sqoop.

Set to: <MapReduce JobHistory Server>:<port>

yarn.app.mapreduce.am.staging-dir

The HDFS staging directory used while submitting jobs.

Set to the staging directory path.

yarn-site.xml
Configure the following properties in the yarn-site.xml file:
yarn.application.classpath

Required for dynamic resource allocation.

Add spark_shuffle.jar to the class path. The .jar file must contain the class
"org.apache.spark.network.yarn.YarnShuffleService."

yarn.nodemanager.resource.memory-mb

The maximum RAM available for each container. Set the maximum memory on the cluster to increase
resource memory available to the Blaze engine.

Set to 16 GB if value is less than 16 GB.

yarn.nodemanager.resource.cpu-vcores

The number of virtual cores for each container. Required for Blaze engine resource allocation.



Set to 10 if the value is less than 10.

yarn.scheduler.minimum-allocation-mb

The minimum RAM available for each container. Required for Blaze engine resource allocation.

Set to 6 GB if the value is less than 6 GB.

yarn.nodemanager.vmem-check-enabled

Disables virtual memory limits for containers. Required for the Blaze and Spark engines.

Set to: FALSE

yarn.nodemanager.aux-services

Required for dynamic resource allocation for the Spark engine.

Add an entry for "spark_shuffle."

yarn.nodemanager.aux-services.spark_shuffle.class

Required for dynamic resource allocation for the Spark engine.

Set to: org.apache.spark.network.yarn.YarnShuffleService

yarn.resourcemanager.scheduler.class

Defines the YARN scheduler that the Data Integration Service uses to assign resources.

Set to: org.apache.hadoop.yarn.server.resourcemanager.scheduler

yarn.node-labels.enabled

Enables node labeling.

Set to: TRUE

yarn.node-labels.fs-store.root-dir

The HDFS location to update node label dynamically.

Set to: <hdfs://[Node name]:[Port]/[Path to store]/[Node labels]/>

Prepare the Archive File for Import from MapR


After you verify property values in the *-site.xml files, create a .zip or a .tar file that the Informatica
administrator can use to import the cluster configuration into the domain.

Create an archive file that contains the following files from the cluster:

• core-site.xml
• hbase-site.xml. Required only if you access HBase sources and targets.
• hive-site.xml
• mapred-site.xml
• yarn-site.xml

Note: To import from MapR, the Informatica administrator must use an archive file.
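For example, assuming you have already copied the *-site.xml files from the cluster into a working directory, you might package them as follows. The directory and archive names are illustrative:

cd /tmp/mapr-site-files
zip mapr_cluster_conf.zip core-site.xml hbase-site.xml hive-site.xml mapred-site.xml yarn-site.xml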



Create a Cluster Configuration
After the Hadoop administrator prepares the cluster for import, the Informatica administrator must create a
cluster configuration.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.1.1 or earlier.

A cluster configuration is an object in the domain that contains configuration information about the Hadoop
cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop
environment. Import configuration properties from the Hadoop cluster to create a cluster configuration.

The import process imports values from *-site.xml files into configuration sets based on the individual *-
site.xml files. When you perform the import, the cluster configuration wizard can create Hadoop, HBase,
HDFS, and Hive connections to access the Hadoop environment. If you choose to create the connections, the
wizard also associates the cluster configuration with the connections.

Note: If you are integrating for the first time and you imported the cluster configuration when you ran the
installer, you must re-create or refresh the cluster configuration.

Importing a Hadoop Cluster Configuration from a File


You can import properties from an archive file to create a cluster configuration.

Before you import from the cluster, you must get the archive file from the Hadoop administrator.

1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:

Property Description

Cluster configuration name Name of the cluster configuration.

Description Optional description of the cluster configuration.

Distribution type The cluster Hadoop distribution type.


Distribution version Version of the Hadoop distribution.


Each distribution type has a default version. This is the latest version of the Hadoop
distribution that Big Data Management supports.
When the cluster version differs from the default version, the cluster configuration wizard
populates the cluster configuration Hadoop distribution property with the most recent
supported version relative to the cluster version. For example, suppose Informatica
supports versions 5.10 and 5.13, and the cluster version is 5.12. In this case, the wizard
populates the version with 5.10.
You can edit the property to choose any supported version. Restart the Data Integration
Service for the changes to take effect.

Method to import the cluster configuration Choose Import from file to import properties from an archive file.

Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster
environment variables, cluster path variables, and advanced properties. Based on the
cluster environment and the functionality that you use, you can add to the default values
or change the default values of these properties. For a list of Hadoop connection
properties to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.

4. Click Browse to select a file. Select the file and click Open.
5. Click Next and verify the cluster configuration information on the summary page.

Verify or Refresh the Cluster Configuration


You might need to refresh the cluster configuration or update the distribution version in the cluster
configuration when you upgrade.

Perform this task in the following situation:

- You upgraded from version 10.2 or later.

Verify the Cluster Configuration


The cluster configuration contains a property for the distribution version. The verification task depends on
the version that you upgraded from:



Upgrade from 10.2

If you upgraded from 10.2 and you changed the distribution version, you need to verify the distribution
version in the General properties of the cluster configuration.

Upgrade from 10.2.1

Effective in version 10.2.1, Informatica assigns a default version to each Hadoop distribution type. If you
configure the cluster configuration to use the default version, the upgrade process upgrades to the
assigned default version if the version changes. If you have not upgraded your Hadoop distribution to
Informatica's default version, you need to update the distribution version property.

For example, suppose the assigned default Hadoop distribution version for 10.2.1 is n, and for 10.2.2 is n
+1. If the cluster configuration uses the default supported Hadoop version of n, the upgraded cluster
configuration uses the default version of n+1. If you have not upgraded the distribution in the Hadoop
environment, you need to change the cluster configuration to use version n.

If you configure the cluster configuration to use a distribution version that is not the default version, you
need to update the distribution version property in the following circumstances:

• Informatica dropped support for the distribution version.


• You changed the distribution version.

Refresh the Cluster Configuration


If you updated any of the *-site.xml files noted in the topic to prepare for cluster import, you need to refresh
the cluster configuration in the Administrator tool.

Verify JDBC Drivers for Sqoop Connectivity


Verify that you have the JDBC drivers to access JDBC-compliant databases in the Hadoop environment. You
might need separate drivers for metadata import and for run-time processing.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.2.1 or earlier.

You download drivers based on design-time and run-time requirements:

• Design-time. To import metadata, you can use the DataDirect drivers packaged with the Informatica
installer if they are available. If they are not available, use any Type 4 JDBC driver that the database
vendor recommends.
• Run-time. To run mappings, use any Type 4 JDBC driver that the database vendor recommends. Some
distributions support other drivers to use Sqoop connectors. You cannot use the DataDirect drivers for
run-time processing.

Verify Design-time Drivers


Use the DataDirect JDBC drivers packaged with the Informatica installer to import metadata from JDBC-
compliant databases. If the DataDirect JDBC drivers are not available for a specific JDBC-compliant
database, download the Type 4 JDBC driver associated with that database.

Copy the JDBC driver .jar files to the following location on the Developer tool machine:



<Informatica installation directory>\clients\externaljdbcjars

Verify Run-time Drivers


Verify run-time drivers for mappings that access JDBC-compliant databases in the Hadoop environment. Use
any Type 4 JDBC driver that the database vendor recommends.

1. Download Type 4 JDBC drivers associated with the JDBC-compliant databases that you want to access.
2. To use Sqoop TDCH MapR Connector for Teradata, download the following files:
• sqoop-connector-tdch-1.1-mapr-1707.jar from
http://repository.mapr.com/nexus/content/groups/mapr-public/org/apache/sqoop/connector/
sqoop-connector-tdch/1.1-mapr-1707/
• terajdbc4.jar and tdgssconfig.jar from
http://downloads.teradata.com/download/connectivity/jdbc-driver
• The MapR Connector for Teradata .jar file from the Teradata website.
3. To optimize the Sqoop mapping performance on the Spark engine while writing data to an HDFS
complex file target of the Parquet format, download the following .jar files:
• parquet-hadoop-bundle-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-hadoop-bundle/1.6.0/
• parquet-avro-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-avro/1.6.0/
• parquet-column-1.5.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-column/1.5.0/
4. Copy all of the .jar files to the following directory on the machine where the Data Integration Service
runs:
<Informatica installation directory>\externaljdbcjars
Changes take effect after you recycle the Data Integration Service. At run time, the Data Integration
Service copies the .jar files to the Hadoop distribution cache so that the .jar files are accessible to all
nodes in the cluster.
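For example, assuming the downloaded .jar files are in the current directory on a Linux machine, the copy and recycle might look like the following. The domain, service, and user names are placeholders, and you can also recycle the Data Integration Service from the Administrator tool instead of using infacmd:

cp *.jar "<Informatica installation directory>/externaljdbcjars/"
infacmd isp DisableService -dn MyDomain -un Administrator -pd MyPassword -sn MyDataIntegrationService
infacmd isp EnableService -dn MyDomain -un Administrator -pd MyPassword -sn MyDataIntegrationService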

Generate MapR Tickets


To run mappings on a MapR cluster that uses Kerberos or MapR Ticket authentication with information in
Hive tables, generate a MapR ticket for the Data Integration Service user.

The Data Integration Service user requires an account on the MapR cluster and a MapR ticket on the
application service machines that require access to MapR. When the MapR cluster uses both Kerberos and
Ticket authentication, you generate a single ticket for the Data Integration Service user for both
authentication systems.

After you generate and save MapR tickets, you perform additional steps to configure the Data Integration
Service, the Metadata Access Service, and the Analyst Service to communicate with the MapR cluster.



Generate Tickets
After you create a MapR user account for the Data Integration Service user, name the ticket file using the
following naming convention:
maprticket_<user name>
For example, for a user ID 1234, a MapR ticket file named maprticket_1234 is generated.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.1.1 or earlier.

Save the ticket on the machines that host the Data Integration Service, the Metadata Access Service, and the
Analyst Service. The Data Integration Service and the Analyst Service access the ticket at run time. The
Metadata Access Service accesses the ticket for the Developer tool at design time.

By default, the services access the ticket in the /tmp directory. If you save the ticket to any other location,
you must configure the MAPR_TICKETFILE_LOCATION environment variable in the service properties.
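For example, one way to generate and place a ticket for the Data Integration Service user, assuming MapR Ticket authentication and illustrative user names and paths, is to run the MapR maprlogin utility as that user. On a Kerberos-enabled cluster, maprlogin kerberos might be used instead:

maprlogin password -user infa_dis_user
cp /tmp/maprticket_30103 /export/home/username1/Keytabs_and_krb5conf/Tickets/project1/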

Configure the Data Integration Service


When the MapR cluster is secured with Kerberos or MapR Ticket authentication, edit Data Integration Service
properties to enable communication between the Informatica domain and the cluster.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.1.1 or earlier.

In the Administrator tool Domain Navigator, select the Data Integration Service to configure, and then select
the Processes tab.



In the Environment Variables area, configure the following property to define the Kerberos authentication
protocol:

Property Value

JAVA_OPTS -Dhadoop.login=<MAPR_ECOSYSTEM_LOGIN_OPTS> -
Dhttps.protocols=TLSv1.2
where <MAPR_ECOSYSTEM_LOGIN_OPTS> is the value of the
MAPR_ECOSYSTEM_LOGIN_OPTS property in the file /opt/mapr/conf/env.sh.

MAPR_HOME MapR client directory on the machine that runs the Data Integration Service.
For example, /opt/mapr
Required if you want to fetch a MapR Streams data object.

MAPR_TICKETFILE_LOCATION Required when the MapR cluster uses Kerberos or MapR Ticket authentication.
Location of the MapR ticket file if you saved it to a directory other than /tmp.
For example:
/export/home/username1/Keytabs_and_krb5conf/Tickets/project1/
maprticket_30103

Changes take effect when you restart the Data Integration Service.
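For example, you might check the value on a cluster node first and then enter the variables in the Environment Variables area. The value hybrid is only an illustration of what MAPR_ECOSYSTEM_LOGIN_OPTS might contain on your cluster, and the paths follow the examples above:

grep MAPR_ECOSYSTEM_LOGIN_OPTS /opt/mapr/conf/env.sh

JAVA_OPTS=-Dhadoop.login=hybrid -Dhttps.protocols=TLSv1.2
MAPR_HOME=/opt/mapr
MAPR_TICKETFILE_LOCATION=/export/home/username1/Keytabs_and_krb5conf/Tickets/project1/maprticket_30103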

Configure the Metadata Access Service


When the MapR cluster is secured with MapR Kerberos or ticketed authentication, edit Metadata Access
Service properties to enable communication between the Developer tool and the cluster.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.2 or earlier.

In the Administrator tool Domain Navigator, select the Metadata Access Service to configure, and then select
the Processes tab.

In the Environment Variables area, configure the following property to define the Kerberos authentication
protocol:

Property Value

JAVA_OPTS -Dhadoop.login=<MAPR_ECOSYSTEM_LOGIN_OPTS> -
Dhttps.protocols=TLSv1.2
where <MAPR_ECOSYSTEM_LOGIN_OPTS> is the value of the
MAPR_ECOSYSTEM_LOGIN_OPTS property in the file /opt/mapr/conf/env.sh.

MAPR_TICKETFILE_LOCATION Required when the MapR cluster uses Kerberos or MapR Ticket authentication.
Location of the MapR ticket file if you saved it to a directory other than /tmp.
For example,
/export/home/username1/Keytabs_and_krb5conf/Tickets/project1/
maprticket_30103

Changes take effect when you restart the Metadata Access Service.



Configure the Analyst Service
If you use the Analyst tool to profile data in Hive data objects, configure properties on the Analyst Service to
enable communication between the Analyst tool and the cluster, including testing of the Hive connection.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from version 10.1.1 or earlier.

In the Administrator tool Domain Navigator, select the Analyst Service to configure, then select the
Processes tab.

In the Environment Variables area, configure the following property to define the Kerberos authentication
protocol:

Property Value

JAVA_OPTS -Dhadoop.login=hybrid -Dhttps.protocols=TLSv1.2

MAPR_TICKETFILE_LOCATION Required when the MapR cluster uses Kerberos or MapR Ticket authentication.
Location of the MapR ticket file if you saved it to a directory other than /tmp.
For example,
/export/home/username1/Keytabs_and_krb5conf/Tickets/project1/
maprticket_30103

LD_LIBRARY_PATH The location of Hadoop libraries.


For example,
<Informatica installation directory>/java/jre/lib:<Informatica
installation directory>/services/shared/bin:<Informatica
installation directory>/server/bin:<Informatica installation
directory>/services/shared/hadoop/<MapR location>/lib/native/
Linux-amd64-64

Changes take effect when you restart the Analyst Service.

Configure the Developer Tool


To access the Hadoop environment from the Developer tool, the mapping developers must perform tasks on
each Developer tool machine.

Perform this task in the following situations:

- You are integrating for the first time.


- You upgraded from any previous version.



Configure developerCore.ini
Edit developerCore.ini to successfully import local complex files available on the Developer tool machine.

When you import a complex file, such as Avro or Parquet, the imported object includes metadata associated
with the distribution in the Hadoop environment. If the file resides on the Developer tool machine, the import
process picks up the distribution information from the developerCore.ini file. You must edit the
developerCore.ini file to point to the distribution directory on the Developer tool machine.

You can find developerCore.ini in the following directory: <Informatica installation directory>
\clients\DeveloperClient

Add the following property:


-DINFA_HADOOP_DIST_DIR=hadoop\<distribution>_<version>
For example, -DINFA_HADOOP_DIST_DIR=hadoop\mapr_5.2.0

Complete Upgrade Tasks


If you upgraded the Informatica platform, you need to perform some additional tasks within the Informatica
domain.

Based on the version that you upgraded from, you might need to update the following types of objects:
Connections

Based on the version you are upgrading from, you might need to update Hadoop connections or replace
connections to the Hadoop environment.

The Hadoop connection contains additional properties. You need to manually update it to include any
customized configuration from the hadoopEnv.properties file of previous versions.

Streaming mappings

The mapping contains deferred data objects or transformations. Support will be reinstated in a future
release.

After you upgrade, the streaming mappings become invalid. You must re-create the physical data objects
to run the mappings on the Spark engine, which uses Spark Structured Streaming.

After you re-create the physical data objects, some properties are no longer available for Azure Event
Hubs data objects.

Update Connections
You might need to update connections based on the version you are upgrading from.

Consider the following types of updates that you might need to make:
Configure the Hadoop connection.

Configure the Hadoop connection to incorporate properties from the hadoopEnv.properties file.

Replace connections.

If you chose the option to create connections when you ran the Cluster Configuration wizard, you need
to replace connections in mappings with the new connections.



Complete connection upgrades.

If you did not create connections when you created the cluster configuration, you need to update the
connections.

Configure the Hadoop Connection


To use properties that you customized in the hadoopEnv.properties file, you must configure the Hadoop
connection properties such as cluster environment variables, cluster path variables, and advanced properties.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.

When you run the Informatica upgrade, the installer backs up the existing hadoopEnv.properties file. You can
find the backup hadoopEnv.properties file in the following location:

<Previous Informatica installation directory>/services/shared/hadoop/<Hadoop distribution name>_<version>/infaConf

Edit the Hadoop connection in the Administrator tool or the Developer tool to include any properties that you
manually configured in the hadoopEnv.properties file. The Hadoop connection contains default values for
properties such as cluster environment and path variables and advanced properties. You can update the
default values to match the properties in the hadoopEnv.properties file.

Replace the Connections with New Connections


If you created connections when you imported the cluster configuration, you need to replace connections in
mappings with the new connections.

Perform this task in the following situation:

- You upgraded from version 10.1.1 or earlier.

The method that you use to replace connections in mappings depends on the type of connection.
Hadoop connection

Run the following commands to replace the connections:

• infacmd dis replaceMappingHadoopRuntimeConnections. Replaces connections associated with
mappings that are deployed in applications.
• infacmd mrs replaceMappingHadoopRuntimeConnections. Replaces connections associated with
mappings that you run from the Developer tool.

For information about the infacmd commands, see the Informatica Command Reference.

Hive, HDFS, and HBase connections

You must replace the connections manually.



Complete Connection Upgrade
If you did not create connections when you imported the cluster configuration, you need to update connection
properties for Hadoop, Hive, HDFS, and HBase connections.

Perform this task in the following situations:

- You upgraded from version 10.1.1 or earlier.


- You upgraded from version 10.2 or later and changed the distribution version.

Perform the following tasks to update the connections:

Update changed properties

Review connections that you created in a previous release to update the values for connection
properties. For example, if you added nodes to the cluster or if you updated the distribution version, you
might need to verify host names, URIs, or port numbers for some of the properties.

Associate the cluster configuration

The Hadoop, Hive, HDFS, and HBase connections must be associated with a cluster configuration.
Complete the following tasks:

1. Run infacmd isp listConnections to identify the connections that you need to upgrade. Use -ct
to list connections of a particular type.
2. Run infacmd isp UpdateConnection to associate the cluster configuration with the connection.
Use -cn to name the connection and -o clusterConfigID to associate the cluster configuration
with the connection.

For more information about infacmd, see the Informatica Command Reference.



Part II: Databricks Integration
This part contains the following chapters:

• Introduction to Databricks Integration, 136


• Before You Begin Databricks Integration, 140
• Databricks Integration Tasks, 144

Chapter 8

Introduction to Databricks
Integration
This chapter includes the following topics:

• Databricks Integration Overview, 136


• Run-time Process on the Databricks Spark Engine, 136
• Databricks Integration Task Flow, 139

Databricks Integration Overview


Big Data Management can connect to Azure Databricks. Azure Databricks is an analytics cloud platform that
is optimized for the Microsoft Azure cloud services. It incorporates the open-source Apache Spark cluster
technologies and capabilities.

The Data Integration Service automatically installs the binaries required to integrate the Informatica domain
with the Databricks environment. The integration requires Informatica connection objects and cluster
configurations. A cluster configuration is a domain object that contains configuration parameters that you
import from the Databricks cluster. You then associate the cluster configuration with connections to access
the Databricks environment.

Perform the following tasks to integrate the Informatica domain with the Databricks environment:

1. Install or upgrade to the current Informatica version.


2. Perform pre-import tasks, such as verifying system requirements and permissions.
3. Import the cluster configuration into the domain.
4. Create a Databricks connection to run mappings within the Databricks environment.

Run-time Process on the Databricks Spark Engine


When you run a job on the Databricks Spark engine, the Data Integration Service pushes the processing to the
Databricks cluster, and the Databricks Spark engine runs the job.

The following image shows the components of the Informatica and the Databricks environments:

1. The Logical Data Transformation Manager translates the mapping into a Scala program, packages it as
an application, and sends it to the Databricks Engine Executor on the Data Integration Service machine.
2. The Databricks Engine Executor submits the application through REST API to the Databricks cluster,
requests to run the application, and stages files for access during run time.
3. The Databricks cluster passes the request to the Databricks Spark driver on the driver node.
4. The Databricks Spark driver distributes the job to one or more Databricks Spark executors that reside on
worker nodes.
5. The executors run the job and stage run-time data to the Databricks File System (DBFS) of the
workspace.

Native Environment
The integration with Databricks requires tools, services, and a repository database in the Informatica domain.

Clients and Tools


When the Informatica domain is integrated with Databricks, you can use the following tools:
Informatica Administrator

Use the Administrator tool to manage the Informatica domain and application services. You can also
create objects such as connections, cluster configurations, and cloud provisioning configurations to
enable big data operations.

The Developer tool

Use the Developer tool to import sources and targets and create mappings to run in the Databricks
environment.

Application Services
The domain integration with Databricks uses the following services:
Data Integration Service

The Data Integration Service can process mappings in the native environment, or it can push the
processing to the Databricks environment. The Data Integration Service retrieves metadata from the
Model repository when you run a mapping.



Model Repository Service

The Model Repository Service manages the Model repository. All requests to save or access Model
repository metadata go through the Model Repository Service.

Model Repository
The Model repository stores mappings that you create and manage in the Developer tool.

Databricks Environment
Integration with the Databricks environment includes the following components:

Databricks Spark engine

The Databricks run-time engine based on the open-source Apache Spark engine.

Databricks File System (DBFS)

A distributed file system installed on Databricks Runtime clusters. Run-time data is staged in the DBFS
and is persisted to a mounted Blob storage container.



Databricks Integration Task Flow
The following diagram shows the task flow to integrate the Informatica domain with Azure Databricks:



Chapter 9

Before You Begin Databricks


Integration
This chapter includes the following topics:

• Read the Release Notes, 140


• Verify System Requirements, 140
• Configure Preemption for Concurrent Jobs, 141
• Configure Storage Access, 142
• Create a Staging Directory for Binary Archive Files, 142
• Create a Staging Directory for Run-time Processing, 143
• Prepare for Token Authentication, 143
• Configure the Data Integration Service, 143

Read the Release Notes


Read the Release Notes for updates to the installation and upgrade process. You can also find information
about known and fixed limitations for the release.

Verify System Requirements


Verify that your environment meets the following minimum system requirements for the integration:
Informatica services

Install and configure the Informatica services and the Developer tool. Verify that the domain contains a
Model Repository Service and a Data Integration Service.

Domain access to Databricks

Verify domain access to Databricks through one of the following methods:

• VPN is enabled between the Informatica domain and the Azure cloud network.
• The Informatica domain is installed within the Azure ecosystem.

Databricks distribution

Verify that the Databricks distribution is a version 5.1 standard concurrency distribution.

Databricks File System (DBFS)

Verify that the DBFS has WASB storage or a mounted Blob storage container.

For more information about product requirements and supported platforms, see the Product Availability
Matrix on Informatica Network:
https://network.informatica.com/community/informatica-network/product-availability-matrices

Configure Preemption for Concurrent Jobs


Configure the Databricks cluster to improve concurrency of jobs.

When you submit a job to Databricks, it allocates resources to run the job. If it does not have enough
resources, it puts the job in a queue. Pending jobs fail if resources do not become available before the
timeout of 30 minutes.

You can configure preemption on the cluster to control the amount of resources that Databricks allocates to
each job, thereby allowing more jobs to run concurrently. You can also configure the timeout for the queue
and the interval at which the Databricks Spark engine checks for available resources.

Configure the following environment variables for the Databricks Spark engine:
spark.databricks.preemption.enabled

Enables the Spark scheduler for preemption. Default is false.

Set to: true

spark.databricks.preemption.threshold

A percentage of resources that are allocated to each submitted job. The job runs with the allocated
resources until completion. Default is 0.5, or 50 percent.

Set to a value lower than default, such as 0.1.

spark.databricks.preemption.timeout

The number of seconds that a job remains in the queue before failing. Default is 30.

Set to: 1,800.

Note: If you set a value higher than 1,800, Databricks ignores the value and uses the maximum timeout
of 1,800.

spark.databricks.preemption.interval

The number of seconds to check for available resources to assign to a job in the queue. Default is 5.

Set to a value lower than the timeout.
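For example, if these settings are entered in the cluster Spark configuration, they might look like the following. The threshold of 0.1 and the default check interval are illustrative choices; adjust them for your workload:

spark.databricks.preemption.enabled true
spark.databricks.preemption.threshold 0.1
spark.databricks.preemption.timeout 1800
spark.databricks.preemption.interval 5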

Changes take effect after you restart the cluster.

Important: Informatica integrates with Databricks, supporting standard concurrency clusters. Standard
concurrency clusters have a maximum queue time of 30 minutes, and jobs fail when the timeout is reached.
The maximum queue time cannot be extended. Setting the preemption threshold allows more jobs to run
concurrently, but with a lower percentage of allocated resources, the jobs can take longer to run. Also,
configuring the environment for preemption does not ensure that all jobs will run. In addition to configuring
preemption, you might choose to run cluster workflows to create ephemeral clusters that create the cluster,



run the job, and then delete the cluster. For more information about Databricks concurrency, contact Azure
Databricks.

Configure Storage Access


Based on the cluster storage type, you can configure the storage key or get the client credentials of the
service principal to access the storage in the cluster. Add the configuration to the Spark configuration on the
Databricks cluster.

Configure ADLS Storage Access


If you use ADLS storage, you need to set some Hadoop credential configuration options as Databricks Spark
options. Add "spark.hadoop" as a prefix to the Hadoop configuration keys as shown in the following text:
spark.hadoop.dfs.adls.oauth2.access.token.provider.type ClientCredential
spark.hadoop.dfs.adls.oauth2.client.id <your-service-client-id>
spark.hadoop.dfs.adls.oauth2.credential <your-service-credentials>
spark.hadoop.dfs.adls.oauth2.refresh.url "https://login.microsoftonline.com/<your-directory-id>/oauth2/token"

Configure WASB Storage Access


If you use WASB storage, you need to set a Hadoop configuration key based on one of the following methods of
access:
Account access key

If you use an account access key, add "spark.hadoop" as a prefix to the Hadoop configuration key as
shown in the following text:
spark.hadoop.fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net
<your-storage-account-access-key>
SAS token

If you use an SAS token, add "spark.hadoop" as a prefix to the Hadoop configuration key as shown in the
following text:
spark.hadoop.fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net
<complete-query-string-of-your-sas-for-the-container>

Create a Staging Directory for Binary Archive Files


Optionally, create a directory on DBFS that the Data Integration Service uses to stage the Informatica binary
archive files.

By default, the Data Integration Service writes the files to the DBFS directory /tmp.

If you create a staging directory, you configure this path in the Cluster Staging Directory property of the Data
Integration Service.
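For example, if you use the Databricks CLI, you might create the staging directory as follows. The path is illustrative; any DBFS path that the token user can write to works:

dbfs mkdirs dbfs:/informatica/staging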



Create a Staging Directory for Run-time Processing
When the Databricks Spark engine runs a job, it stores temporary files in a staging directory.

Optionally, you can create a directory on DBFS to stage temporary files during run time. By default, the Data
Integration Service uses the DBFS directory /<Cluster Staging Directory>/DATABRICKS.

Prepare for Token Authentication


The Data Integration Service uses token-based authentication to provide access to the Databricks
environment.

Create a Databricks user to generate the authentication token. Complete the following tasks to prepare for
authentication.

1. Enable tokens from the Databricks Admin Console.


2. Verify that the Databricks environment contains a user to generate the token.
3. Grant the token user permissions.
• If you created staging directories, grant permission to access and write to the directories.
• If you did not create staging directories, grant permission to create directories.
4. Log in to the Databricks Admin Console as the token user and generate the token.

Configure the Data Integration Service


If you created a staging directory, configure the path in the Data Integration Service properties.

Configure the following property in the Data Integration Service:


Cluster Staging Directory
The directory on the cluster where the Data Integration Service pushes the binaries to integrate the
native and non-native environments and to store temporary files during processing. Default is /tmp.



Chapter 10

Databricks Integration Tasks


This chapter includes the following topics:

• Create a Databricks Cluster Configuration, 144


• Configure the Databricks Connection, 147

Create a Databricks Cluster Configuration


A Databricks cluster configuration is an object in the domain that contains configuration information about
the Databricks cluster. The cluster configuration enables the Data Integration Service to push mapping logic
to the Databricks environment.

Use the Administrator tool to import configuration properties from the Databricks cluster to create a cluster
configuration. You can import configuration properties from the cluster or from a file that contains cluster
properties. You can choose to create a Databricks connection when you perform the import.

Importing a Databricks Cluster Configuration from the Cluster


When you import the cluster configuration directly from the cluster, you provide information to connect to the
cluster.

Before you import the cluster configuration, get cluster information from the Databricks administrator.

1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:

Property Description

Cluster configuration name Name of the cluster configuration.

Description Optional description of the cluster configuration.

Distribution type The distribution type. Choose Databricks.

Method to import the cluster configuration Choose Import from cluster.


Databricks domain Domain name of the Databricks deployment.

Databricks access token The token ID created within Databricks required for authentication.
Note: If the token has an expiration date, verify that you get a new token from the
Databricks administrator before it expires.

Databricks cluster ID The cluster ID of the Databricks cluster.

Create connection Choose to create a Databricks connection.


If you choose to create a connection, the Cluster Configuration wizard associates
the cluster configuration with the Databricks connection.
If you do not choose to create a connection, you must manually create one and
associate the cluster configuration with it.

4. Click Next to verify the information on the summary page.

Importing a Databricks Cluster Configuration from a File


You can import properties from an archive file to create a cluster configuration.

Complete the following tasks to import a Databricks cluster from a file:

1. Get required cluster properties from the Databricks administrator.


2. Create an .xml file with the cluster properties, and compress it into a .zip or .tar file.
3. Log in to the Administrator tool and import the file.

Create the Import File


To import the cluster configuration from a file, you must create an archive file.

To create the .xml file for import, you must get required information from the Databricks administrator. You
can provide any name for the file and store it locally.

The following table describes the properties required to import the cluster information:

Property Name Description

cluster_name Name of the Databricks cluster.

cluster_ID The cluster ID of the Databricks cluster.

baseURL URL to access the Databricks cluster.

accesstoken The token ID created within Databricks required for authentication.

Optionally, you can include other properties specific to the Databricks environment.

When you complete the .xml file, compress it into a .zip or .tar file for import.



Sample Import File
The following text shows a sample import file with the required properties:
<?xml version="1.0" encoding="UTF-8"?><configuration>
<property>
<name>cluster_name</name>
<value>my_cluster</value>
</property>
<property>
<name>cluster_id</name>
<value>0926-294544-bckt123</value>
</property>
<property>
<name>baseURL</name>
<value>https://provide.adatabricks.net/</value>
</property>
<property>
<name>accesstoken</name>
<value>dapicf76c2d4567c6sldn654fe875936e778</value>
</property>
</configuration>

Import the Cluster Configuration


After you create the .xml file with the cluster properties, use the Administrator tool to import into the domain
and create the cluster configuration.

1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:

Property Description

Cluster configuration name Name of the cluster configuration.

Description Optional description of the cluster configuration.

Distribution type The distribution type. Choose Databricks.

Method to import the cluster configuration Choose Import from file.

Upload configuration archive file The full path and file name of the file. Click the Browse button to navigate to the file.

Create connection Choose to create a Databricks connection.


If you choose to create a connection, the Cluster Configuration wizard associates
the cluster configuration with the Databricks connection.
If you do not choose to create a connection, you must manually create one and
associate the cluster configuration with it.

4. Click Next to verify the information on the summary page.



Configure the Databricks Connection
Databricks connections contain information to connect to the Databricks cluster. If you did not choose to
create a Databricks connection when you imported the cluster configuration, you must manually create one.

For information about the Databricks connection properties, see “Databricks Connection Properties” on page
159.



Appendix A

Connections
This appendix includes the following topics:

• Connections, 149
• Cloud Provisioning Configuration, 149
• Amazon Redshift Connection Properties, 155
• Amazon S3 Connection Properties, 156
• Cassandra Connection Properties, 158
• Databricks Connection Properties, 159
• Google Analytics Connection Properties, 161
• Google BigQuery Connection Properties, 161
• Google Cloud Spanner Connection Properties, 162
• Google Cloud Storage Connection Properties, 163
• Hadoop Connection Properties, 164
• HDFS Connection Properties, 169
• HBase Connection Properties, 171
• HBase Connection Properties for MapR-DB, 171
• Hive Connection Properties, 172
• JDBC Connection Properties, 175
• Kafka Connection Properties, 180
• Microsoft Azure Blob Storage Connection Properties, 181
• Microsoft Azure Cosmos DB SQL API Connection Properties, 182
• Microsoft Azure Data Lake Store Connection Properties, 183
• Microsoft Azure SQL Data Warehouse Connection Properties, 184
• Snowflake Connection Properties, 185
• Creating a Connection to Access Sources or Targets, 186
• Creating a Hadoop Connection, 186
• Configuring Hadoop Connection Properties, 188

Connections
Create a connection to access non-native environments such as Hadoop and Databricks. If you access HBase, HDFS,
or Hive sources or targets in the Hadoop environment, you must also create those connections. You can
create the connections using the Developer tool, Administrator tool, and infacmd.

You can create the following types of connections:


Hadoop connection

Create a Hadoop connection to run mappings in the Hadoop environment.

HBase connection

Create an HBase connection to access HBase. The HBase connection is a NoSQL connection.

HDFS connection

Create an HDFS connection to read data from or write data to the HDFS file system on a Hadoop cluster.

Hive connection

Create a Hive connection to access Hive as a source or target. You can access Hive as a source if the
mapping is enabled for the native or Hadoop environment. You can access Hive as a target if the
mapping runs on the Blaze engine.

JDBC connection

Create a JDBC connection and configure Sqoop properties in the connection to import and export
relational data through Sqoop.

Databricks connection

Create a Databricks connection to run mappings in the Databricks environment.

Note: For information about creating connections to other sources or targets such as social media web sites
or Teradata, see the respective PowerExchange adapter user guide for information.

Cloud Provisioning Configuration


The cloud provisioning configuration establishes a relationship between the Create Cluster task and the
cluster connection that the workflows use to run mapping tasks. The Create Cluster task must include a
reference to the cloud provisioning configuration. In turn, the cloud provisioning configuration points to the
cluster connection that you create for use by the cluster workflow.

The properties to populate depend on the Hadoop distribution you choose to build a cluster on. Choose one
of the following connection types:

• AWS Cloud Provisioning. Connects to an Amazon EMR cluster on Amazon Web Services.
• Azure Cloud Provisioning. Connects to an HDInsight cluster on the Azure platform.
• Databricks Cloud Provisioning. Connects to a Databricks cluster on the Azure Databricks platform.

AWS Cloud Provisioning Configuration Properties
The properties in the AWS cloud provisioning configuration enable the Data Integration Service to contact
and create resources on the AWS cloud platform.

General Properties
The following table describes cloud provisioning configuration general properties:

Property Description

Name Name of the cloud provisioning configuration.

ID ID of the cloud provisioning configuration. Default: Same as the cloud provisioning configuration name.

Description Optional. Description of the cloud provisioning configuration.

AWS Access Key ID Optional. ID of the AWS access key, which AWS uses to control REST or HTTP query protocol
requests to AWS service APIs.
If you do not specify a value, Informatica attempts to follow the Default Credential Provider
Chain.

AWS Secret Access Key Secret component of the AWS access key. Required if you specify the AWS Access Key ID.

Region Region in which to create the cluster. This must be the region in which the VPC is running.
Use AWS region values. For a list of acceptable values, see AWS documentation.
Note: The region where you want to create the cluster can be different from the region in which
the Informatica domain is installed.

Permissions
The following table describes cloud provisioning configuration permissions properties:

Property Description

EMR Role Name of the service role for the EMR cluster that you create. The role must have sufficient
permissions to create a cluster, access S3 resources, and run jobs on the cluster.
When the AWS administrator creates this role, they select the “EMR” role. This contains the default
AmazonElasticMapReduceRole policy. You can edit the services in this policy.

EC2 Instance Profile Name of the EC2 instance profile role that controls permissions on processes that run on the cluster.
When the AWS administrator creates this role, they select the “EMR Role for EC2” role. This includes S3 access by default.

Auto Scaling Role Required if you configure auto-scaling for the EMR cluster.
This role is created when the AWS administrator configures auto-scaling on any cluster in the VPC.
Default: When you leave this field blank, it is equivalent to setting the Auto Scaling role to “Proceed without role” when the AWS administrator creates a cluster in the AWS console.



EC2 Configuration
The following table describes cloud provisioning configuration EC2 configuration properties:

Property Description

EC2 Key Pair EC2 key pair to enable communication with the EMR cluster master node.
Optional. This credential enables you to log into the cluster. Configure this property if you intend
the cluster to be non-ephemeral.

EC2 Subnet ID of the subnet on the VPC in which to create the cluster.
Use the subnet ID of the EC2 instance where the cluster runs.

Master Security Group Optional. ID of the security group for the cluster master node. Acts as a virtual firewall to control inbound and outbound traffic to cluster nodes.
Security groups are created when the AWS administrator creates and configures a cluster in a
VPC. In the AWS console, the property is equivalent to ElasticMapReduce-master.
You can use existing security groups, or the AWS administrator might create dedicated security
groups for the ephemeral cluster.
If you do not specify a value, the cluster applies the default security group for the VPC.

Additional Master Security Groups Optional. IDs of additional security groups to attach to the cluster master node. Use a comma-separated list of security group IDs.

Core and Task Security Group Optional. ID of the security group for the cluster core and task nodes. When the AWS administrator creates and configures a cluster in the AWS console, the property is equivalent to the ElasticMapReduce-slave security group.
If you do not specify a value, the cluster applies the default security group for the VPC.

Additional Core and Task Security Groups Optional. IDs of additional security groups to attach to cluster core and task nodes. Use a comma-separated list of security group IDs.

Service Access Security Group EMR managed security group for service access. Required when you provision an EMR cluster in a private subnet.

Azure Cloud Provisioning Configuration Properties


The properties in the Azure cloud provisioning configuration enable the Data Integration Service to contact
and create resources on the Azure cloud platform.

Authentication Details
The following table describes authentication properties to configure:

Property Description

Name Name of the cloud provisioning configuration.

ID ID of the cloud provisioning configuration. Default: Same as the cloud provisioning configuration
name.

Description Optional. Description of the cloud provisioning configuration.


Subscription ID ID of the Azure account to use in the cluster creation process.

Tenant ID A GUID string associated with the Azure Active Directory.

Client ID A GUID string that is the same as the Application ID associated with the Service Principal. The
Service Principal must be assigned to a role that has permission to create resources in the
subscription that you identified in the Subscription ID property.

Client Secret An octet string that provides a key associated with the client ID.

Storage Account Details


Choose to configure access to one of the following storage types:

• Azure Data Lake Storage (ADLS). See Azure documentation.


• An Azure Storage Account, known as general or blob storage. See Azure documentation.

The following table describes the information you need to configure Azure Data Lake Storage (ADLS) with the
HDInsight cluster:

Property Description

Azure Data Lake Store Name Name of the ADLS storage to access. The ADLS storage and the cluster to create must reside in the same region.

Data Lake Service Principal Client ID A credential that enables programmatic access to ADLS storage. Enables the Informatica domain to communicate with ADLS and run commands and mappings on the HDInsight cluster.
The service principal is an Azure user that meets the following requirements:
- Permissions to access required directories in ADLS storage.
- Certificate-based authentication for ADLS storage.
- Key-based authentication for ADLS storage.

Data Lake Service Principal Certificate Contents The Base64 encoded text of the public certificate used with the service principal.
Leave this property blank when you create the cloud provisioning configuration. After you save the cloud provisioning configuration, log in to the VM where the Informatica domain is installed and run infacmd ccps updateADLSCertificate to populate this property.

Data Lake Service Principal Certificate Password Private key for the service principal. This private key must be associated with the service principal certificate.

Data Lake Service Principal Client Secret An octet string that provides a key associated with the service principal.

Data Lake Service Principal OAUTH Token Endpoint Endpoint for OAUTH token based authentication.



The following table describes the information you need to configure Azure General Storage, also known as
blob storage, with the HDInsight cluster:

Property Description

Azure Storage Account Name Name of the storage account to access. Get the value from the Storage Accounts node in the Azure web console. The storage and the cluster to create must reside in the same region.

Azure Storage Account Key A key to authenticate access to the storage account. To get the value from the Azure web console, select the storage account, then Access Keys. The console displays the account keys.

Cluster Deployment Details


The following table describes the cluster deployment properties that you configure:

Property Description

Resource Group Resource group in which to create the cluster. A resource group is a logical set of Azure resources.

Virtual Network Resource Group Optional. Resource group to which the virtual network belongs.
If you do not specify a resource group, the Data Integration Service assumes that the virtual network is a member of the same resource group as the cluster.

Virtual Network Name of the virtual network or vnet where you want to create the cluster. Specify a vnet that resides in the resource group that you specified in the Virtual Network Resource Group property.
The vnet must be in the same region as the region in which to create the cluster.

Subnet Name Subnet in which to create the cluster. The subnet must be a part of the vnet that you designated in
the previous property.
Each vnet can have one or more subnets. The Azure administrator can choose an existing subnet or
create one for the cluster.

External Hive Metastore Details


You can specify the properties to enable the cluster to connect to a Hive metastore database that is external
to the cluster.

You can use an external relational database like MySQL or Amazon RDS as the Hive metastore database. The
external database must be on the same cloud platform as the cluster to create.

If you do not specify an existing external database in this dialog box, the cluster creates its own database on
the cluster. This database is terminated when the cluster is terminated.



The following table describes the Hive metastore database properties that you configure:

Property Description

Database Name Name of the Hive metastore database.

Database Server Name Server on which the database resides.
Note: The database server name on the Azure web console commonly includes the suffix database.windows.net. For example: server123xyz.database.windows.net. You can specify the database server name without the suffix and Informatica will automatically append the suffix. For example, you can specify server123xyz.

Database User Name User name of the account for the domain to use to access the database.

Database Password Password for the user account.

Databricks Cloud Provisioning Configuration Properties


The properties in the Databricks cloud provisioning configuration enable the Data Integration Service to
contact and create resources on the Databricks cloud platform.

The following table describes the Databricks cloud provisioning configuration properties:

Property Description

Name Name of the cloud provisioning configuration.

ID The cluster ID of the Databricks cluster.

Description Optional description of the cloud provisioning configuration.

Databricks domain Domain name of the Databricks deployment.

Databricks token ID The token ID created within Databricks required for authentication.
Note: If the token has an expiration date, verify that you get a new token from the Databricks
administrator before it expires.



Amazon Redshift Connection Properties
When you set up an Amazon Redshift connection, you must configure the connection properties.

The following table describes the Amazon Redshift connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the domain. You
can change this property after you create the connection. The name cannot exceed 128 characters,
contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It
must be 255 characters or less and must be unique in the domain. You cannot change this property
after you create the connection. Default value is the connection name.

Description The description of the connection. The description cannot exceed 4,000 characters.

Location The domain where you want to create the connection.

Type The connection type. Select Amazon Redshift from the Database category.

The Details tab contains the connection attributes of the Amazon Redshift connection. The following table
describes the connection attributes:

Property Description

Username User name of the Amazon Redshift account.

Password Password for the Amazon Redshift account.

Schema Optional. Amazon Redshift schema name. Do not specify the schema name if you want to use
multiple schemas. The Data Object wizard displays all the user-defined schemas available for the
Amazon Redshift objects.
Default is public.

AWS Access Key ID Amazon S3 bucket access key ID.
Note: Required if you do not use AWS Identity and Access Management (IAM) authentication.

AWS Secret Access Key Amazon S3 bucket secret access key ID.
Note: Required if you do not use AWS Identity and Access Management (IAM) authentication.

Master Symmetric Key Optional. Provide a 256-bit AES encryption key in the Base64 format when you enable client-side encryption. You can generate a key using a third-party tool.
If you specify a value, ensure that you specify the encryption type as client side encryption in the advanced target properties.


Customer Master Key ID Optional. Specify the customer master key ID or alias name generated by AWS Key Management Service (AWS KMS). You must generate the customer master key corresponding to the region where the Amazon S3 bucket resides. You can specify any of the following values:
Customer generated customer master key
Enables client-side or server-side encryption.
Default customer master key
Enables client-side or server-side encryption. Only the administrator user of the account can use the default customer master key ID to enable client-side encryption.
Note: You can use the customer master key ID when you run a mapping in the native environment or on the Spark engine.

Cluster Node Type Node type of the Amazon Redshift cluster.


You can select the following options:
- ds1.xlarge
- ds1.8xlarge
- dc1.large
- dc1.8xlarge
- ds2.xlarge
- ds2.8xlarge
For more information about nodes in the cluster, see the Amazon Redshift documentation.

Number of Nodes in Cluster Number of nodes in the Amazon Redshift cluster.
For more information about nodes in the cluster, see the Amazon Redshift documentation.

JDBC URL Amazon Redshift connection URL. See the example after this table.

Note: If you upgrade mappings created in version 10.1.1 Update 2 or earlier, you must select the relevant schema in the connection property. Otherwise, the mappings fail when you run them on the current version.
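For example, an Amazon Redshift JDBC URL typically has the following form, where the cluster endpoint, port, and database name are placeholders that you copy from the cluster details in the Amazon Redshift console:

jdbc:redshift://<cluster_name>.<unique_id>.<region>.redshift.amazonaws.com:5439/<database_name>

A concrete, illustrative value might look like jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev.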

Amazon S3 Connection Properties


When you set up an Amazon S3 connection, you must configure the connection properties.

The following table describes the Amazon S3 connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the domain.
You can change this property after you create the connection. The name cannot exceed 128
characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + = { [ } ] |
\:;"'<,>.?/

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive.
It must be 255 characters or less and must be unique in the domain. You cannot change this
property after you create the connection. Default value is the connection name.

Description Optional. The description of the connection. The description cannot exceed 4,000 characters.


Location The domain where you want to create the connection.

Type The Amazon S3 connection type.

Access Key The access key ID for access to Amazon account resources.
Note: Required if you do not use AWS Identity and Access Management (IAM) authentication.

Secret Key The secret access key for access to Amazon account resources. The secret key is associated with
the access key and uniquely identifies the account.
Note: Required if you do not use AWS Identity and Access Management (IAM) authentication.

Folder Path The complete path to Amazon S3 objects. The path must include the bucket name and any folder
name.
Do not use a slash at the end of the folder path. For example, <bucket name>/<my folder
name>.

Master Symmetric Key Optional. Provide a 256-bit AES encryption key in the Base64 format when you enable client-side encryption. You can generate a master symmetric key using a third-party tool. See the example after this table.

Customer Master Key ID Optional. Specify the customer master key ID or alias name generated by AWS Key Management Service (AWS KMS). You must generate the customer master key for the same region where the Amazon S3 bucket resides.
You can specify any of the following values:
Customer generated customer master key

Enables client-side or server-side encryption.

Default customer master key

Enables client-side or server-side encryption. Only the administrator user of the account can use
the default customer master key ID to enable client-side encryption.
Note: Applicable when you run a mapping in the native environment or on the Spark engine.

Region Name Select the AWS region in which the bucket you want to access resides.
Select one of the following regions:
- Asia Pacific (Mumbai)
- Asia Pacific (Seoul)
- Asia Pacific (Singapore)
- Asia Pacific (Sydney)
- Asia Pacific (Tokyo)
- AWS GovCloud (US)
- Canada (Central)
- China (Beijing)
- China (Ningxia)
- EU (Ireland)
- EU (Frankfurt)
- EU (London)
- EU (Paris)
- South America (Sao Paulo)
- US East (Ohio)
- US East (N. Virginia)
- US West (N. California)
- US West (Oregon)
Default is US East (N. Virginia).
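The Master Symmetric Key property expects a 256-bit AES key encoded in Base64. As one example of a third-party tool, you can generate a suitable random key with OpenSSL; the command below is only a sketch, and any tool that produces 32 random bytes encoded in Base64 works as well:

openssl rand -base64 32

Paste the generated string into the Master Symmetric Key property of the Amazon S3 or Amazon Redshift connection.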



Cassandra Connection Properties
When you set up a Cassandra connection, you must configure the connection properties.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes the Cassandra connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the domain.
You can change this property after you create the connection. The name cannot exceed 128
characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection.
The ID is not case sensitive. The ID must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection.
Default value is the connection name.

Description Optional. The description of the connection. The description cannot exceed 4,000 characters.

Location The domain where you want to create the connection.

Type The connection type. Select Cassandra.

Host Name Host name or IP address of the Cassandra server.

Port Cassandra server port number. Default is 9042.

User Name User name to access the Cassandra server.

Password Password corresponding to the user name to access the Cassandra server.

Default Keyspace Name of the Cassandra keyspace to use by default.

SQL Identifier Type of character that the database uses to enclose delimited identifiers in SQL or CQL queries. The
Character available characters depend on the database type.
Select None if the database uses regular identifiers. When the Data Integration Service generates
SQL or CQL queries, the service does not place delimited characters around any identifiers.
Select a character if the database uses delimited identifiers. When the Data Integration Service
generates SQL or CQL queries, the service encloses delimited identifiers within this character.

Additional Connection Properties Enter one or more JDBC connection parameters in the following format (see the example after this table):
<param1>=<value>;<param2>=<value>;<param3>=<value>
PowerExchange for Cassandra JDBC supports the following JDBC connection parameters:
- BinaryColumnLength
- DecimalColumnScale
- EnableCaseSensitive
- EnableNullInsert
- EnablePaging
- RowsPerPage
- StringColumnLength
- VTTableNameSeparator


SSL Mode Not applicable for PowerExchange for Cassandra JDBC. Select disabled.

SSL Truststore Path Not applicable for PowerExchange for Cassandra JDBC.

SSL Truststore Password Not applicable for PowerExchange for Cassandra JDBC.

SSL Keystore Path Not applicable for PowerExchange for Cassandra JDBC.

SSL Keystore Password Not applicable for PowerExchange for Cassandra JDBC.
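For example, to enable paging and increase the default string column length, the Additional Connection Properties value might look like the following. The parameter names come from the list above; the values are illustrative only:

EnablePaging=true;RowsPerPage=5000;StringColumnLength=4096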

Databricks Connection Properties


Use the Databricks connection to run mappings on a Databricks cluster.

A Databricks connection is a cluster type connection. You can create and manage a Databricks connection in
the Administrator tool or the Developer tool. You can use infacmd to create a Databricks connection.
Configure properties in the Databricks connection to enable communication between the Data Integration
Service and the Databricks cluster.

The following table describes the general connection properties for the Databricks connection:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the
domain. You can change this property after you create the connection. The name cannot exceed
128 characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + =
{[}]|\:;"'<,>.?/

ID String that the Data Integration Service uses to identify the connection. The ID is not case
sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change
this property after you create the connection. Default value is the connection name.

Description Optional. The description of the connection. The description cannot exceed 4,000 characters.

Connection Type Choose Databricks.

Cluster Configuration Name of the cluster configuration associated with the Databricks environment.
Required if you do not configure the cloud provisioning configuration.

Cloud Provisioning Configuration Name of the cloud provisioning configuration associated with a Databricks cloud platform.
Required if you do not configure the cluster configuration.


Staging Directory The directory where the Databricks Spark engine stages run-time files.
If you specify a directory that does not exist, the Data Integration Service creates it at run time.
If you do not provide a directory path, the run-time staging files are written to /<cluster staging
directory>/DATABRICKS.

Advanced Properties List of advanced properties that are unique to the Databricks environment.
You can configure run-time properties for the Databricks environment in the Data Integration
Service and in the Databricks connection. You can override a property configured at a high level
by setting the value at a lower level. For example, if you configure a property in the Data
Integration Service custom properties, you can override it in the Databricks connection. The Data
Integration Service processes property overrides based on the following priorities:
1. Databricks connection advanced properties
2. Data Integration Service custom properties
Note: Informatica does not recommend changing these property values before you consult with
third-party documentation, Informatica documentation, or Informatica Global Customer Support. If
you change a value without knowledge of the property, you might experience performance
degradation or other unexpected results.

Advanced Properties
Configure the following properties in the Advanced Properties of the Databricks configuration section:

infaspark.json.parser.mode

Specifies how the parser handles corrupt JSON records. You can set the value to one of the following modes:

• DROPMALFORMED. The parser ignores all corrupted records. Default mode.


• PERMISSIVE. The parser accepts non-standard fields as nulls in corrupted records.
• FAILFAST. The parser generates an exception when it encounters a corrupted record and the Spark
application goes down.

infaspark.json.parser.multiLine

Specifies whether the parser can read a multiline record in a JSON file. You can set the value to true or
false. Default is false. Applies only to non-native distributions that use Spark version 2.2.x and above.

infaspark.flatfile.writer.nullValue

When the Databricks Spark engine writes to a target, it converts null values to empty strings (" "). For
example, 12, AB,"",23p09udj.

The Databricks Spark engine can write the empty strings to string columns, but when it tries to write an
empty string to a non-string column, the mapping fails with a type mismatch.

To allow the Databricks Spark engine to convert the empty strings back to null values and write to the
target, configure the following advanced property in the Databricks Spark connection:

infaspark.flatfile.writer.nullValue=true
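For example, to fail mappings on corrupt JSON records, read multiline JSON files, and write null values instead of empty strings to flat file targets, you might add the following entries to the Advanced Properties list of the Databricks connection. The property names are the ones described above; the values shown are only one possible combination:

infaspark.json.parser.mode=FAILFAST
infaspark.json.parser.multiLine=true
infaspark.flatfile.writer.nullValue=true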



Google Analytics Connection Properties
When you set up a Google Analytics connection, you must configure the connection properties.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes the Google Analytics connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the
domain. You can change this property after you create the connection. The name cannot exceed
128 characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + =
{[}]|\:;"'<,>.?/

ID String that the Data Integration Service uses to identify the connection.
The ID is not case sensitive. The ID must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection.
Default value is the connection name.

Description Optional. The description of the connection. The description cannot exceed 4,000 characters.

Location The domain where you want to create the connection.

Type The connection type. Select Google Analytics.

Service Account ID Specifies the client_email value present in the JSON file that you download after you create a service account. See the sample service account key file after this table.

Service Account Key Specifies the private_key value present in the JSON file that you download after you create a service account.

APIVersion API that PowerExchange for Google Analytics uses to read from Google Analytics reports.
Select Core Reporting API v3.
Note: PowerExchange for Google Analytics does not support Analytics Reporting API v4.
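The client_email and private_key values come from the service account key file that you download from the Google Cloud console when you create the service account. The key file is a JSON document that contains, among other fields, entries similar to the following; the values shown are placeholders:

{
  "type": "service_account",
  "project_id": "my-project-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "[email protected]"
}

The same key file supplies the project_id value that the Google BigQuery, Google Cloud Spanner, and Google Cloud Storage connections described in this appendix also require.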

Google BigQuery Connection Properties


When you set up a Google BigQuery connection, you must configure the connection properties.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes the Google BigQuery connection properties:

Property Description

Service Account ID Specifies the client_email value present in the JSON file that you download after you create a service account in Google BigQuery.

Service Account Key Specifies the private_key value present in the JSON file that you download after you create a service account in Google BigQuery.


Connection The mode that you want to use to read data from or write data to Google BigQuery.
mode Select one of the following connection modes:
- Simple. Flattens each field within the Record data type field as a separate field in the mapping.
- Hybrid. Displays all the top-level fields in the Google BigQuery table including Record data type
fields. PowerExchange for Google BigQuery displays the top-level Record data type field as a
single field of the String data type in the mapping.
- Complex. Displays all the columns in the Google BigQuery table as a single field of the String
data type in the mapping.
Default is Simple.

Schema Definition File Path Specifies a directory on the client machine where the Data Integration Service must create a JSON file with the sample schema of the Google BigQuery table. The JSON file name is the same as the Google BigQuery table name.
Alternatively, you can specify a storage path in Google Cloud Storage where the Data Integration Service must create a JSON file with the sample schema of the Google BigQuery table. You can download the JSON file from the specified storage path in Google Cloud Storage to a local machine.

Project ID Specifies the project_id value present in the JSON file that you download after you create a service
account in Google BigQuery.
If you have created multiple projects with the same service account, enter the ID of the project that
contains the dataset that you want to connect to.

Storage Path This property applies when you read or write large volumes of data.
Path in Google Cloud Storage where the Data Integration Service creates a local stage file to store the data temporarily.
You can either enter the bucket name or the bucket name and folder name.
For example, enter gs://<bucket_name> or gs://<bucket_name>/<folder_name>

Google Cloud Spanner Connection Properties


When you set up a Google Cloud Spanner connection, you must configure the connection properties.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes the Google Cloud Spanner connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the
domain. You can change this property after you create the connection. The name cannot exceed
128 characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + =
{[}]|\:;"'<,>.?/

ID String that the Data Integration Service uses to identify the connection.
The ID is not case sensitive. The ID must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection.
Default value is the connection name.

Description Optional. The description of the connection. The description cannot exceed 4,000 characters.


Location The domain where you want to create the connection.

Type The connection type. Select Google Cloud Spanner.

Project ID Specifies the project_id value present in the JSON file that you download after you create a service
account.
If you have created multiple projects with the same service account, enter the ID of the project
that contains the bucket that you want to connect to.

Service Account ID Specifies the client_email value present in the JSON file that you download after you create a service account.

Service Account Key Specifies the private_key value present in the JSON file that you download after you create a service account.

Instance ID Name of the instance that you created in Google Cloud Spanner.

Google Cloud Storage Connection Properties


When you set up a Google Cloud Storage connection, you must configure the connection properties.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes the Google Cloud Storage connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the
domain. You can change this property after you create the connection. The name cannot exceed
128 characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + =
{[}]|\:;"'<,>.?/

ID String that the Data Integration Service uses to identify the connection.
The ID is not case sensitive. The ID must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection.
Default value is the connection name.

Description Optional. The description of the connection. The description cannot exceed 4,000 characters.

Location The domain where you want to create the connection.

Type The connection type. Select Google Cloud Storage.

Project ID Specifies the project_id value present in the JSON file that you download after you create a service
account.
If you have created multiple projects with the same service account, enter the ID of the project that
contains the bucket that you want to connect to.


Service Account ID Specifies the client_email value present in the JSON file that you download after you create a service account.

Service Account Key Specifies the private_key value present in the JSON file that you download after you create a service account.

Hadoop Connection Properties


Use the Hadoop connection to configure mappings to run on a Hadoop cluster. A Hadoop connection is a
cluster type connection. You can create and manage a Hadoop connection in the Administrator tool or the
Developer tool. You can use infacmd to create a Hadoop connection. Hadoop connection properties are case
sensitive unless otherwise noted.

Hadoop Cluster Properties


Configure properties in the Hadoop connection to enable communication between the Data Integration
Service and the Hadoop cluster.

The following table describes the general connection properties for the Hadoop connection:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the
domain. You can change this property after you create the connection. The name cannot
exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case
sensitive. It must be 255 characters or less and must be unique in the domain. You cannot
change this property after you create the connection. Default value is the connection name.

Description The description of the connection. Enter a string that you can use to identify the connection.
The description cannot exceed 4,000 characters.

Cluster Configuration The name of the cluster configuration associated with the Hadoop environment.
Required if you do not configure the Cloud Provisioning Configuration.

Cloud Provisioning Configuration Name of the cloud provisioning configuration associated with a cloud platform such as Amazon AWS or Microsoft Azure.
Required if you do not configure the Cluster Configuration.


Cluster Environment Variables* Environment variables that the Hadoop cluster uses.
For example, the variable ORACLE_HOME represents the directory where the Oracle database client software is installed.
You can configure run-time properties for the Hadoop environment in the Data Integration
Service, the Hadoop connection, and in the mapping. You can override a property configured at
a high level by setting the value at a lower level. For example, if you configure a property in the
Data Integration Service custom properties, you can override it in the Hadoop connection or in
the mapping. The Data Integration Service processes property overrides based on the
following priorities:
1. Mapping custom properties set using infacmd ms runMapping with the -cp option
2. Mapping run-time properties for the Hadoop environment
3. Hadoop connection advanced properties for run-time engines
4. Hadoop connection advanced general properties, environment variables, and classpaths
5. Data Integration Service custom properties

Cluster Library Path* The path for shared libraries on the cluster.
The $DEFAULT_CLUSTER_LIBRARY_PATH variable contains a list of default directories.

Cluster Classpath* The classpath to access the Hadoop jar files and the required libraries.
The $DEFAULT_CLUSTER_CLASSPATH variable contains a list of paths to the default jar files
and libraries.
You can configure run-time properties for the Hadoop environment in the Data Integration
Service, the Hadoop connection, and in the mapping. You can override a property configured at
a high level by setting the value at a lower level. For example, if you configure a property in the
Data Integration Service custom properties, you can override it in the Hadoop connection or in
the mapping. The Data Integration Service processes property overrides based on the
following priorities:
1. Mapping custom properties set using infacmd ms runMapping with the -cp option
2. Mapping run-time properties for the Hadoop environment
3. Hadoop connection advanced properties for run-time engines
4. Hadoop connection advanced general properties, environment variables, and classpaths
5. Data Integration Service custom properties

Cluster Executable Path* The path for executable files on the cluster.
The $DEFAULT_CLUSTER_EXEC_PATH variable contains a list of paths to the default executable files.

* Informatica does not recommend changing these property values before you consult with third-party documentation,
Informatica documentation, or Informatica Global Customer Support. If you change a value without knowledge of the
property, you might experience performance degradation or other unexpected results.



Common Properties
The following table describes the common connection properties that you configure for the Hadoop
connection:

Property Description

Impersonation User Name Required if the Hadoop cluster uses Kerberos authentication. Hadoop impersonation user. The user name that the Data Integration Service impersonates to run mappings in the Hadoop environment.
The Data Integration Service runs mappings based on the user that is configured. Refer to the following order to determine which user the Data Integration Service uses to run mappings:
1. Operating system profile user. The mapping runs with the operating system profile user if the
profile user is configured. If there is no operating system profile user, the mapping runs with
the Hadoop impersonation user.
2. Hadoop impersonation user. The mapping runs with the Hadoop impersonation user if the
operating system profile user is not configured. If the Hadoop impersonation user is not
configured, the Data Integration Service runs mappings with the Data Integration Service user.
3. Informatica services user. The mapping runs with the operating user that starts the
Informatica daemon if the operating system profile user and the Hadoop impersonation user
are not configured.

Temporary Table Compression Codec Hadoop compression library for a compression codec class name.
Note: The Spark engine does not support compression settings for temporary tables. When you run mappings on the Spark engine, the Spark engine stores temporary tables in an uncompressed file format.

Codec Class Name Codec class name that enables data compression and improves performance on temporary staging tables.

Hive Staging Database Name Namespace for Hive staging tables. Use the name default for tables that do not have a specified database name.
If you do not configure a namespace, the Data Integration Service uses the Hive database name in the Hive target connection to create staging tables.
When you run a mapping in the native environment to write data to Hive, you must configure the Hive staging database name in the Hive connection. The Data Integration Service ignores the value you configure in the Hadoop connection.

Advanced Properties List of advanced properties that are unique to the Hadoop environment. The properties are
common to the Blaze and Spark engines. The advanced properties include a list of default
properties.
You can configure run-time properties for the Hadoop environment in the Data Integration
Service, the Hadoop connection, and in the mapping. You can override a property configured at a
high level by setting the value at a lower level. For example, if you configure a property in the
Data Integration Service custom properties, you can override it in the Hadoop connection or in
the mapping. The Data Integration Service processes property overrides based on the following
priorities:
1. Mapping custom properties set using infacmd ms runMapping with the -cp option
2. Mapping run-time properties for the Hadoop environment
3. Hadoop connection advanced properties for run-time engines
4. Hadoop connection advanced general properties, environment variables, and classpaths
5. Data Integration Service custom properties
Note: Informatica does not recommend changing these property values before you consult with
third-party documentation, Informatica documentation, or Informatica Global Customer Support.
If you change a value without knowledge of the property, you might experience performance
degradation or other unexpected results.



Reject Directory Properties
The following table describes the connection properties that you configure to the Hadoop Reject Directory.

Property Description

Write Reject Files to Hadoop If you use the Blaze engine to run mappings, select the check box to specify a location to move reject files. If checked, the Data Integration Service moves the reject files to the HDFS location listed in the property, Reject File Directory.
By default, the Data Integration Service stores the reject files based on the RejectDir system parameter.

Reject File Directory The directory for Hadoop mapping files on HDFS when you run mappings.

Hive Pushdown Configuration


Note: Effective in version 10.2.2, Informatica dropped support for the Hive engine. Do not configure the
pushdown properties related to the Hive engine.

Blaze Configuration
The following table describes the connection properties that you configure for the Blaze engine:

Property Description

Blaze Staging Directory The HDFS file path of the directory that the Blaze engine uses to store temporary files. Verify that the directory exists. The YARN user, Blaze engine user, and mapping impersonation user must have write permission on this directory.
Default is /blaze/workdir. If you clear this property, the staging files are written to the Hadoop staging directory /tmp/blaze_<user name>.

Blaze User Name The owner of the Blaze service and Blaze service logs.
When the Hadoop cluster uses Kerberos authentication, the default user is the Data Integration Service SPN user. When the Hadoop cluster does not use Kerberos authentication and the Blaze user is not configured, the default user is the Data Integration Service user.

Minimum Port The minimum value for the port number range for the Blaze engine. Default is 12300.

Maximum Port The maximum value for the port number range for the Blaze engine. Default is 12600.

YARN Queue Name The YARN scheduler queue name used by the Blaze engine that specifies available resources on a cluster.

Blaze Job Monitor Address The host name and port number for the Blaze Job Monitor.
Use the following format:
<hostname>:<port>
Where
- <hostname> is the host name or IP address of the Blaze Job Monitor server.
- <port> is the port on which the Blaze Job Monitor listens for remote procedure calls (RPC).
For example, enter: myhostname:9080


Blaze YARN Node Label Node label that determines the node on the Hadoop cluster where the Blaze engine runs. If you do not specify a node label, the Blaze engine runs on the nodes in the default partition.
If the Hadoop cluster supports logical operators for node labels, you can specify a list of node
labels. To list the node labels, use the operators && (AND), || (OR), and ! (NOT).

Advanced Properties List of advanced properties that are unique to the Blaze engine. The advanced properties include a list of default properties.
You can configure run-time properties for the Hadoop environment in the Data Integration Service,
the Hadoop connection, and in the mapping. You can override a property configured at a high level
by setting the value at a lower level. For example, if you configure a property in the Data Integration
Service custom properties, you can override it in the Hadoop connection or in the mapping. The Data
Integration Service processes property overrides based on the following priorities:
1. Mapping custom properties set using infacmd ms runMapping with the -cp option
2. Mapping run-time properties for the Hadoop environment
3. Hadoop connection advanced properties for run-time engines
4. Hadoop connection advanced general properties, environment variables, and classpaths
5. Data Integration Service custom properties
Note: Informatica does not recommend changing these property values before you consult with
third-party documentation, Informatica documentation, or Informatica Global Customer Support. If
you change a value without knowledge of the property, you might experience performance
degradation or other unexpected results.

Spark Configuration
The following table describes the connection properties that you configure for the Spark engine:

Property Description

Spark Staging Directory The HDFS file path of the directory that the Spark engine uses to store temporary files for running jobs. The YARN user, Data Integration Service user, and mapping impersonation user must have write permission on this directory.
If you do not specify a file path, by default, the temporary files are written to the Hadoop staging directory /tmp/SPARK_<user name>.
When you run Sqoop jobs on the Spark engine, the Data Integration Service creates a Sqoop staging directory within the Spark staging directory to store temporary files: <Spark staging directory>/sqoop_staging

Spark Event Log Directory Optional. The HDFS file path of the directory that the Spark engine uses to log events.


YARN Queue Name The YARN scheduler queue name used by the Spark engine that specifies available resources on a cluster. The name is case sensitive.

Advanced Properties List of advanced properties that are unique to the Spark engine. The advanced properties include a list of default properties.
You can configure run-time properties for the Hadoop environment in the Data Integration Service,
the Hadoop connection, and in the mapping. You can override a property configured at a high level
by setting the value at a lower level. For example, if you configure a property in the Data Integration
Service custom properties, you can override it in the Hadoop connection or in the mapping. The Data
Integration Service processes property overrides based on the following priorities:
1. Mapping custom properties set using infacmd ms runMapping with the -cp option
2. Mapping run-time properties for the Hadoop environment
3. Hadoop connection advanced properties for run-time engines
4. Hadoop connection advanced general properties, environment variables, and classpaths
5. Data Integration Service custom properties
Note: Informatica does not recommend changing these property values before you consult with
third-party documentation, Informatica documentation, or Informatica Global Customer Support. If
you change a value without knowledge of the property, you might experience performance
degradation or other unexpected results.
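The override priorities above mean that a custom property supplied at run time takes precedence over the same property set in the connection or in the Data Integration Service. As an illustrative sketch only, a mapping custom property can be passed with the -cp option of infacmd ms runMapping; the domain, service, application, mapping names, and the property shown are placeholders, and the full option list is in the Informatica Command Reference:

infacmd.sh ms runMapping -dn MyDomain -un Administrator -pd MyPassword -sn MyDataIntegrationService -a MyApplication -m MyMapping -cp "spark.executor.memory=4G"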

HDFS Connection Properties


Use a Hadoop File System (HDFS) connection to access data in the Hadoop cluster. The HDFS connection is
a file system type connection. You can create and manage an HDFS connection in the Administrator tool,
Analyst tool, or the Developer tool. HDFS connection properties are case sensitive unless otherwise noted.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes HDFS connection properties:

Property Description

Name Name of the connection. The name is not case sensitive and must be unique within the domain. The
name cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive.
It must be 255 characters or less and must be unique in the domain. You cannot change this
property after you create the connection. Default value is the connection name.

Description The description of the connection. The description cannot exceed 765 characters.

Location The domain where you want to create the connection. Not valid for the Analyst tool.

Type The connection type. Default is Hadoop File System.


User Name User name to access HDFS.

NameNode URI The URI to access the storage system.


You can find the value for fs.defaultFS in the core-site.xml configuration set of the cluster
configuration.
Note: If you create connections when you import the cluster configuration, the NameNode URI
property is populated by default, and it is updated each time you refresh the cluster configuration. If
you manually set this property or override the value, the refresh operation does not update this
property.
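A typical fs.defaultFS entry in core-site.xml looks like the following; the value shown is a placeholder, and on your cluster it might be a NameNode host and port or a nameservice name:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://nameservice1</value>
</property>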

Accessing Multiple Storage Types


Use the NameNode URI property in the connection parameters to connect to various storage types. The
following table lists the storage type and the NameNode URI format for the storage type:

Storage NameNode URI Format

HDFS hdfs://<namenode>:<port>
where:
- <namenode> is the host name or IP address of the NameNode.
- <port> is the port that the NameNode listens for remote procedure calls (RPC).
hdfs://<nameservice> in case of NameNode high availability.

MapR-FS maprfs:///

WASB in HDInsight wasb://<container_name>@<account_name>.blob.core.windows.net/<path>
where:
- <container_name> identifies a specific Azure Storage Blob container.
  Note: <container_name> is optional.
- <account_name> identifies the Azure Storage Blob object.
Example:
wasb://infabdmoffering1storage.blob.core.windows.net/infabdmoffering1cluster/mr-history

ADLS in HDInsight adl://home

When you create a cluster configuration from an Azure HDInsight cluster, the cluster configuration uses
either ADLS or WASB as the primary storage. You cannot create a cluster configuration with ADLS or WASB
as the secondary storage. You can edit the NameNode URI property in the HDFS connection to connect to a
local HDFS location.



HBase Connection Properties
Use an HBase connection to access HBase. The HBase connection is a NoSQL connection. You can create
and manage an HBase connection in the Administrator tool or the Developer tool. HBase connection
properties are case sensitive unless otherwise noted.

The following table describes HBase connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique
within the domain. You can change this property after you create the connection.
The name cannot exceed 128 characters, contain spaces, or contain the following
special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is
not case sensitive. It must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection. Default
value is the connection name.

Description The description of the connection. The description cannot exceed 4,000
characters.

Location The domain where you want to create the connection.

Type The connection type. Select HBase.

Database Type Type of database that you want to connect to.


Select HBase to create a connection for an HBase table.

HBase Connection Properties for MapR-DB


Use an HBase connection to connect to a MapR-DB table. The HBase connection is a NoSQL connection. You
can create and manage an HBase connection in the Administrator tool or the Developer tool. HBase
connection properties are case sensitive unless otherwise noted.

The following table describes the HBase connection properties for MapR-DB:

Property Description

Name Name of the connection. The name is not case sensitive and must be unique
within the domain. You can change this property after you create the connection.
The name cannot exceed 128 characters, contain spaces, or contain the following
special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is
not case sensitive. It must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection. Default
value is the connection name.


Description Description of the connection. The description cannot exceed 4,000 characters.

Location Domain where you want to create the connection.

Type Connection type. Select HBase.

Database Type Type of database that you want to connect to.


Select MapR-DB to create a connection for a MapR-DB table.

Cluster Configuration The name of the cluster configuration associated with the Hadoop environment.

MapR-DB Database Path Database path that contains the MapR-DB table that you want to connect to. Enter
a valid MapR cluster path.
When you create an HBase data object for MapR-DB, you can browse only tables
that exist in the MapR-DB path that you specify in the Database Path field. You
cannot access tables that are available in sub-directories in the specified path.
For example, if you specify the path as /user/customers/, you can access the
tables in the customers directory. However, if the customers directory contains
a sub-directory named regions, you cannot access the tables in the following
directory:
/user/customers/regions

Hive Connection Properties


Use the Hive connection to access Hive data. A Hive connection is a database type connection. You can
create and manage a Hive connection in the Administrator tool, Analyst tool, or the Developer tool. Hive
connection properties are case sensitive unless otherwise noted.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes Hive connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique
within the domain. You can change this property after you create the connection.
The name cannot exceed 128 characters, contain spaces, or contain the following
special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is
not case sensitive. It must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection. Default
value is the connection name.

Description The description of the connection. The description cannot exceed 4000
characters.


Location The domain where you want to create the connection. Not valid for the Analyst
tool.

Type The connection type. Select Hive.

LDAP username LDAP user name of the user that the Data Integration Service impersonates to run
mappings on a Hadoop cluster. The user name depends on the JDBC connection
string that you specify in the Metadata Connection String or Data Access
Connection String for the native environment.
If the Hadoop cluster uses Kerberos authentication, the principal name for the
JDBC connection string and the user name must be the same. Otherwise, the user
name depends on the behavior of the JDBC driver. With Hive JDBC driver, you can
specify a user name in many ways and the user name can become a part of the
JDBC URL.
If the Hadoop cluster does not use Kerberos authentication, the user name
depends on the behavior of the JDBC driver.
If you do not specify a user name, the Hadoop cluster authenticates jobs based on
the following criteria:
- The Hadoop cluster does not use Kerberos authentication. It authenticates jobs
based on the operating system profile user name of the machine that runs the
Data Integration Service.
- The Hadoop cluster uses Kerberos authentication. It authenticates jobs based
on the SPN of the Data Integration Service. LDAP username will be ignored.

Password Password for the LDAP username.

Environment SQL SQL commands to set the Hadoop environment. In the native environment, the Data Integration Service executes the environment SQL each time it creates a connection to a Hive metastore. If you use the Hive connection to run profiles on a Hadoop cluster, the Data Integration Service executes the environment SQL at the beginning of each Hive session.
The following rules and guidelines apply to the usage of environment SQL in both
connection modes:
- Use the environment SQL to specify Hive queries.
- Use the environment SQL to set the classpath for Hive user-defined functions
and then use environment SQL or PreSQL to specify the Hive user-defined
functions. You cannot use PreSQL in the data object properties to specify the
classpath. If you use Hive user-defined functions, you must copy the .jar files to
the following directory:
<Informatica installation directory>/services/shared/hadoop/
<Hadoop distribution name>/extras/hive-auxjars
- You can use environment SQL to define Hadoop or Hive parameters that you
want to use in the PreSQL commands or in custom queries.
- If you use multiple values for the Environment SQL property, ensure that there is no space between the values. See the example after this table.

SQL Identifier Character The type of character used to identify special characters and reserved SQL
keywords, such as WHERE. The Data Integration Service places the selected
character around special characters and reserved SQL keywords. The Data
Integration Service also uses this character for the Support mixed-case
identifiers property.
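For example, an Environment SQL value that sets two Hive session parameters might look like the following. The parameters shown are illustrative Hive settings rather than required values; note that there is no space between the semicolon-separated statements:

SET hive.exec.dynamic.partition=true;SET hive.exec.dynamic.partition.mode=nonstrict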



Properties to Access Hive as Source or Target
The following table describes the connection properties that you configure to access Hive as a source or
target:

Property Description

JDBC Driver Class Name Name of the Hive JDBC driver class. If you leave this option blank, the Developer tool uses the default Apache Hive JDBC driver shipped with the distribution. If the default Apache Hive JDBC driver does not fit your requirements, you can override it with a third-party Hive JDBC driver by specifying the driver class name.

Metadata Connection String The JDBC connection URI used to access the metadata from the Hadoop server.
You can use PowerExchange for Hive to communicate with a HiveServer service or HiveServer2 service. To connect to HiveServer, specify the connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>
Where
- <hostname> is the name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database name to which you want to connect. If you do not provide the database name, the Data Integration Service uses the default database details.
To connect to HiveServer2, use the connection string format that Apache Hive implements for that specific Hadoop distribution. For more information about Apache Hive connection string formats, see the Apache Hive documentation.
For user impersonation, you must add hive.server2.proxy.user=<xyz> to the JDBC connection URI. If you do not configure user impersonation, the current user's credentials are used to connect to HiveServer2.
If the Hadoop cluster uses SSL or TLS authentication, you must add ssl=true to the JDBC connection URI. For example: jdbc:hive2://<hostname>:<port>/<db>;ssl=true
If you use a self-signed certificate for SSL or TLS authentication, ensure that the certificate file is available on the client machine and the Data Integration Service machine. For more information, see the Informatica Big Data Management Integration Guide.
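For example, a complete Metadata Connection String for a hypothetical HiveServer2 host that combines a database name, SSL, and user impersonation might look like the following. The host, port, database, and proxy user are placeholders, and the exact placement of the parameters follows the Apache Hive JDBC URL syntax for your distribution:
jdbc:hive2://hiveserver2.example.com:10000/sales;ssl=true;hive.server2.proxy.user=etl_user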

Bypass Hive JDBC Server JDBC driver mode. Select the check box to use the embedded JDBC driver mode.
To use the JDBC embedded mode, perform the following tasks:
- Verify that the Hive client and Informatica services are installed on the same machine.
- Configure the Hive connection properties to run mappings on a Hadoop cluster.
If you choose the non-embedded mode, you must configure the Data Access Connection String. Informatica recommends that you use the JDBC embedded mode.

Fine Grained Authorization When you select the option to observe fine grained authorization in a Hive source, the mapping observes the following:
- Row and column level restrictions. Applies to Hadoop clusters where Sentry or Ranger security modes are enabled.
- Data masking rules. Applies to masking rules set on columns containing sensitive data by Dynamic Data Masking.
If you do not select the option, the Blaze and Spark engines ignore the restrictions and masking rules, and results include restricted or sensitive data.


Data Access Connection String The connection string to access data from the Hadoop data store. To connect to HiveServer, specify the non-embedded JDBC mode connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>
Where
- <hostname> is the name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database to which you want to connect. If you do not provide the database name, the Data Integration Service uses the default database details.
To connect to HiveServer2, use the connection string format that Apache Hive implements for the specific Hadoop distribution. For more information about Apache Hive connection string formats, see the Apache Hive documentation.
For user impersonation, you must add hive.server2.proxy.user=<xyz> to the JDBC connection URI. If you do not configure user impersonation, the current user's credentials are used to connect to HiveServer2.
If the Hadoop cluster uses SSL or TLS authentication, you must add ssl=true to the JDBC connection URI. For example: jdbc:hive2://<hostname>:<port>/<db>;ssl=true
If you use a self-signed certificate for SSL or TLS authentication, ensure that the certificate file is available on the client machine and the Data Integration Service machine. For more information, see the Informatica Big Data Management Integration Guide.

Hive Staging Directory on HDFS HDFS directory for Hive staging tables. You must grant execute permission to the Hadoop impersonation user and the mapping impersonation users.
This option is applicable and required when you write data to a Hive target in the native environment.

Hive Staging Database Name Namespace for Hive staging tables. Use the name default for tables that do not have a specified database name.
This option is applicable when you run a mapping in the native environment to write data to a Hive target.
If you run the mapping on the Blaze or Spark engine, you do not need to configure the Hive staging database name in the Hive connection. The Data Integration Service uses the value that you configure in the Hadoop connection.

JDBC Connection Properties


You can use a JDBC connection to access tables in a database. You can create and manage a JDBC
connection in the Administrator tool, the Developer tool, or the Analyst tool.

Note: The order of the connection properties might vary depending on the tool where you view them.



The following table describes JDBC connection properties:

Property Description

Database Type The database type.

Name Name of the connection. The name is not case sensitive and must be unique within the domain. The
name cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive.
It must be 255 characters or less and must be unique in the domain. You cannot change this property
after you create the connection. Default value is the connection name.

Description The description of the connection. The description cannot exceed 765 characters.

User Name The database user name.

Password The password for the database user name.

JDBC Driver Class Name Name of the JDBC driver class. The following list provides the driver class name that you can enter for the applicable database type:
- DataDirect JDBC driver class name for Oracle:
com.informatica.jdbc.oracle.OracleDriver
- DataDirect JDBC driver class name for IBM DB2:
com.informatica.jdbc.db2.DB2Driver
- DataDirect JDBC driver class name for Microsoft SQL Server:
com.informatica.jdbc.sqlserver.SQLServerDriver
- DataDirect JDBC driver class name for Sybase ASE:
com.informatica.jdbc.sybase.SybaseDriver
- DataDirect JDBC driver class name for Informix:
com.informatica.jdbc.informix.InformixDriver
- DataDirect JDBC driver class name for MySQL:
com.informatica.jdbc.mysql.MySQLDriver
For more information about which driver class to use with specific databases, see the vendor
documentation.


Connection String Connection string to connect to the database. Use the following connection string:
jdbc:<subprotocol>:<subname>
The following list provides sample connection strings that you can enter for the applicable database
type:
- Connection string for DataDirect Oracle JDBC driver:
jdbc:informatica:oracle://<host>:<port>;SID=<value>
- Connection string for Oracle JDBC driver:
jdbc:oracle:thin:@//<host>:<port>:<SID>
- Connection string for DataDirect IBM DB2 JDBC driver:
jdbc:informatica:db2://<host>:<port>;DatabaseName=<value>
- Connection string for IBM DB2 JDBC driver:
jdbc:db2://<host>:<port>/<database_name>
- Connection string for DataDirect Microsoft SQL Server JDBC driver:
jdbc:informatica:sqlserver://<host>;DatabaseName=<value>
- Connection string for Microsoft SQL Server JDBC driver:
jdbc:sqlserver://<host>;DatabaseName=<value>
- Connection string for Netezza JDBC driver:
jdbc:netezza://<host>:<port>/<database_name>
- Connection string for Pivotal Greenplum driver:
jdbc:pivotal:greenplum://<host>:<port>;/database_name=<value>
- Connection string for Postgres Greenplum driver:
jdbc:postgresql://<host>:<port>/<database_name>
- Connection string for Teradata JDBC driver:
jdbc:teradata://<host>/database_name=<value>,tmode=<value>,charset=<value>
For more information about the connection string to use with specific drivers, see the vendor
documentation.

Environment SQL Optional. Enter SQL commands to set the database environment when you connect to the database. The Data Integration Service executes the connection environment SQL each time it connects to the database.
Note: If you enable Sqoop, Sqoop ignores this property.

Transaction SQL Optional. Enter SQL commands to set the database environment when you connect to the database. The Data Integration Service executes the transaction environment SQL at the beginning of each transaction.
Note: If you enable Sqoop, Sqoop ignores this property.

SQL Identifier Character Type of character that the database uses to enclose delimited identifiers in SQL queries. The available characters depend on the database type.
Select (None) if the database uses regular identifiers. When the Data Integration Service generates SQL queries, the service does not place delimited characters around any identifiers.
Select a character if the database uses delimited identifiers. When the Data Integration Service generates SQL queries, the service encloses delimited identifiers within this character.
Note: If you enable Sqoop, Sqoop ignores this property.

Support Mixed-case Identifiers Enable if the database uses case-sensitive identifiers. When enabled, the Data Integration Service encloses all identifiers within the character selected for the SQL Identifier Character property.
When the SQL Identifier Character property is set to none, the Support Mixed-case Identifiers property is disabled.
Note: If you enable Sqoop, Sqoop honors this property when you generate and execute a DDL script to create or replace a target at run time. In all other scenarios, Sqoop ignores this property.
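For illustration, if you select the double quote as the SQL Identifier Character and enable Support Mixed-case Identifiers, the Data Integration Service encloses every identifier in double quotes in the SQL that it generates. The table and column names below are hypothetical:
SELECT "OrderId", "OrderDate" FROM "Sales"."OrderHeader" WHERE "OrderDate" > ?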



Sqoop Connection-Level Arguments
In the JDBC connection, you can define the arguments that Sqoop must use to connect to the database. The
Data Integration Service merges the arguments that you specify with the default command that it constructs
based on the JDBC connection properties. The arguments that you specify take precedence over the JDBC
connection properties.

If you want to use the same driver to import metadata and run the mapping, and do not want to specify any
additional Sqoop arguments, select Sqoop v1.x from the Use Sqoop Version list and leave the Sqoop
Arguments field empty in the JDBC connection. The Data Integration Service constructs the Sqoop command
based on the JDBC connection properties that you specify.

However, if you want to use a different driver for run-time tasks or specify additional run-time Sqoop
arguments, select Sqoop v1.x from the Use Sqoop Version list and specify the arguments in the Sqoop
Arguments field.

You can configure the following Sqoop arguments in the JDBC connection:

driver

Defines the JDBC driver class that Sqoop must use to connect to the database.

Use the following syntax:

--driver <JDBC driver class>

For example, use the following syntax depending on the database type that you want to connect to:

• Aurora: --driver com.mysql.jdbc.Driver


• Greenplum: --driver org.postgresql.Driver
• IBM DB2: --driver com.ibm.db2.jcc.DB2Driver
• IBM DB2 z/OS: --driver com.ibm.db2.jcc.DB2Driver
• Microsoft SQL Server: --driver com.microsoft.sqlserver.jdbc.SQLServerDriver
• Netezza: --driver org.netezza.Driver
• Oracle: --driver oracle.jdbc.driver.OracleDriver
• Teradata: --driver com.teradata.jdbc.TeraDriver

connect

Defines the JDBC connection string that Sqoop must use to connect to the database. The JDBC
connection string must be based on the driver that you define in the driver argument.

Use the following syntax:

--connect <JDBC connection string>

For example, use the following syntax depending on the database type that you want to connect to:

• Aurora: --connect "jdbc:mysql://<host_name>:<port>/<schema_name>"


• Greenplum: --connect jdbc:postgresql://<host_name>:<port>/<database_name>
• IBM DB2: --connect jdbc:db2://<host_name>:<port>/<database_name>
• IBM DB2 z/OS: --connect jdbc:db2://<host_name>:<port>/<database_name>
• Microsoft SQL Server: --connect jdbc:sqlserver://<host_name>:<port or
named_instance>;databaseName=<database_name>
• Netezza: --connect "jdbc:netezza://<database_server_name>:<port>/
<database_name>;schema=<schema_name>"

• Oracle: --connect jdbc:oracle:thin:@<database_host_name>:<database_port>:<database_SID>
• Teradata: --connect jdbc:teradata://<host_name>/database=<database_name>
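For example, assuming a hypothetical Oracle database, you might combine the driver and connect arguments in the Sqoop Arguments field as follows. The host, port, and SID are placeholders:
--driver oracle.jdbc.driver.OracleDriver --connect jdbc:oracle:thin:@dbhost.example.com:1521:ORCL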

connection-manager

Defines the connection manager class name that Sqoop must use to connect to the database.

Use the following syntax:

--connection-manager <connection manager class name>

For example, use the following syntax to use the generic JDBC manager class name:

--connection-manager org.apache.sqoop.manager.GenericJdbcManager

direct

When you read data from or write data to Oracle, you can configure the direct argument to enable Sqoop
to use OraOop. OraOop is a specialized Sqoop plug-in for Oracle that uses native protocols to connect to
the Oracle database. When you configure OraOop, the performance improves.

You can configure OraOop when you run Sqoop mappings on the Spark engine.

Use the following syntax:

--direct

When you use OraOop, you must use the following syntax to specify multiple arguments:

-D<argument=value> -D<argument=value>

Note: If you specify multiple arguments and include a space character between -D and the argument
name-value pair, Sqoop considers only the first argument and ignores the remaining arguments.

If you do not direct the job to a specific queue, the Spark engine uses the default queue.
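For example, to enable OraOop and direct the job to a hypothetical YARN queue, you might specify the following arguments. The queue name is a placeholder, and there is no space between -D and the name-value pair:
--direct -Dmapreduce.job.queuename=sqoop_queue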

-Dsqoop.connection.factories

To run the mapping on the Blaze engine with the Teradata Connector for Hadoop (TDCH) specialized
connectors for Sqoop, you must configure the -Dsqoop.connection.factories argument. Use the
argument to define the TDCH connection factory class that Sqoop must use. The connection factory
class varies based on the TDCH Sqoop Connector that you want to use.

• To use Cloudera Connector Powered by Teradata, configure the -Dsqoop.connection.factories


argument as follows:
-Dsqoop.connection.factories=com.cloudera.connector.teradata.TeradataManagerFactory
• To use Hortonworks Connector for Teradata (powered by the Teradata Connector for Hadoop),
configure the -Dsqoop.connection.factories argument as follows:
-Dsqoop.connection.factories=org.apache.sqoop.teradata.TeradataManagerFactory

Note: To run the mapping on the Spark engine, you do not need to configure the -
Dsqoop.connection.factories argument. The Data Integration Service invokes Cloudera Connector
Powered by Teradata and Hortonworks Connector for Teradata (powered by the Teradata Connector for
Hadoop) by default.

--infaoptimize

Use this argument to disable the performance optimization of Sqoop pass-through mappings on the
Spark engine.



When you run a Sqoop pass-through mapping on the Spark engine, the Data Integration Service
optimizes mapping performance in the following scenarios:

• You read data from a Sqoop source and write data to a Hive target that uses the Text format.
• You read data from a Sqoop source and write data to an HDFS target that uses the Flat, Avro, or
Parquet format.

If you want to disable the performance optimization, set the --infaoptimize argument to false. For
example, if you see data type issues after you run an optimized Sqoop mapping, you can disable the
performance optimization.

Use the following syntax:

--infaoptimize false

For a complete list of the Sqoop arguments that you can configure, see the Sqoop documentation.

Kafka Connection Properties


The Kafka connection is a messaging connection. Use the Kafka connection to access Kafka as a message
target. You can create and manage a Kafka connection in the Developer tool or through infacmd.

The following table describes the general connection properties for the Kafka connection:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the domain. You
can change this property after you create the connection. The name cannot exceed 128 characters,
contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID The string that the Data Integration Service uses to identify the connection. The ID is not case sensitive.
It must be 255 characters or less and must be unique in the domain. You cannot change this property
after you create the connection. Default value is the connection name.

Description The description of the connection. Enter a string that you can use to identify the connection. The
description cannot exceed 4,000 characters.

Location The domain where you want to create the connection.

Type The connection type.



The following table describes the Kafka broker properties for the Kafka connection:

Property Description

Kafka Broker List Comma-separated list of the Kafka brokers that maintain the
configuration of the Kafka messaging broker.
To specify a Kafka broker, use the following format:
<IP Address>:<port>

ZooKeeper Host Port List Optional. Comma-separated list of the Apache ZooKeeper hosts that maintain
the configuration of the Kafka messaging broker.
To specify the ZooKeeper, use the following format:
<IP Address>:<port>

Retry Timeout Number of seconds the Integration Service attempts to reconnect to the
Kafka broker to write data. If the source or target is not available for the
time you specify, the mapping execution stops to avoid any data loss.

Kafka Broker Version Configure the Kafka messaging broker version to 0.10.1.x-2.0.0.
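For example, a hypothetical three-broker list and its optional ZooKeeper list might look like the following. The addresses and ports are placeholders for your cluster:
Kafka Broker List: 192.0.2.11:9092,192.0.2.12:9092,192.0.2.13:9092
ZooKeeper Host Port List: 192.0.2.21:2181,192.0.2.22:2181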

Microsoft Azure Blob Storage Connection Properties


Use a Microsoft Azure Blob Storage connection to access Microsoft Azure Blob Storage.

Note: The order of the connection properties might vary depending on the tool where you view them.

You can create and manage a Microsoft Azure Blob Storage connection in the Administrator tool or the
Developer tool. The following table describes the Microsoft Azure Blob Storage connection properties:

Property Description

Name Name of the Microsoft Azure Blob Storage connection.

ID String that the Data Integration Service uses to identify the connection. The ID is not
case sensitive. It must be 255 characters or less and must be unique in the domain.
You cannot change this property after you create the connection. Default value is the
connection name.

Description Description of the connection.

Location The domain where you want to create the connection.

Type Type of connection. Select AzureBlob.



The Connection Details tab contains the connection attributes of the Microsoft Azure Blob Storage
connection. The following table describes the connection attributes:

Property Description

Account Name Name of the Microsoft Azure Storage account.

Account Key Microsoft Azure Storage access key.

Container Name The root container or sub-folders with the absolute path.

Endpoint Suffix Type of Microsoft Azure end-points. You can select any of the following end-
points:
- core.windows.net: Default
- core.usgovcloudapi.net: To select the US government Microsoft Azure end-
points
- core.chinacloudapi.cn: Not applicable

Microsoft Azure Cosmos DB SQL API Connection Properties
Use a Microsoft Azure Cosmos DB connection to connect to the Cosmos DB database. When you create a
Microsoft Azure Cosmos DB connection, you enter information for metadata and data access.

The following table describes the Microsoft Azure Cosmos DB connection properties:

Property Description

Name Name of the Cosmos DB connection.

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive.
It must be 255 characters or less and must be unique in the domain. You cannot change this
property after you create the connection. Default value is the connection name.

Description Description of the connection. The description cannot exceed 765 characters.

Location The project or folder in the Model repository where you want to store the Cosmos DB connection.

Type Select Microsoft Azure Cosmos DB SQL API.

Cosmos DB URI The URI of the Microsoft Azure Cosmos DB account.

Key The primary or secondary key that provides you complete administrative access to the resources within the Microsoft Azure Cosmos DB account.

Database Name of the database that contains the collections from which you want to read or write JSON
documents.

Note: You can find the Cosmos DB URI and Key values in the Keys settings on Azure portal. Contact your
Azure administrator for more details.



Microsoft Azure Data Lake Store Connection
Properties
Use a Microsoft Azure Data Lake Store connection to access a Microsoft Azure Data Lake Store.

Note: The order of the connection properties might vary depending on the tool where you view them.

You can create and manage a Microsoft Azure Data Lake Store connection in the Administrator tool or
the Developer tool. The following table describes the Microsoft Azure Data Lake Store connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the domain. You
can change this property after you create the connection. The name cannot exceed 128 characters,
contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It
must be 255 characters or less and must be unique in the domain. You cannot change this property
after you create the connection.
Default value is the connection name.

Description The description of the connection. The description cannot exceed 4,000 characters.

Location The domain where you want to create the connection.

Type The connection type. Select Microsoft Azure Data Lake Store.

The following table describes the properties for metadata access:

Property Description

ADLS Account Name The name of the Microsoft Azure Data Lake Store.

ClientID The ID of your application to complete the OAuth Authentication in the Active Directory.

Client Secret The client secret key to complete the OAuth Authentication in the Active Directory.

Directory The Microsoft Azure Data Lake Store directory that you use to read data or write data. The
default is root directory.

AuthEndpoint The OAuth 2.0 token endpoint from which the access code is generated based on the
Client ID and the Client Secret.

For more information about creating a client ID, client secret, and auth end point, contact the Azure
administrator or see Microsoft Azure Data Lake Store documentation.
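For example, an Azure Active Directory OAuth 2.0 token endpoint typically takes the following form, where the tenant ID is a placeholder that you obtain from the Azure administrator:
https://login.microsoftonline.com/<tenant_id>/oauth2/token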



Microsoft Azure SQL Data Warehouse Connection
Properties
Use a Microsoft Azure SQL Data Warehouse connection to access a Microsoft Azure SQL Data Warehouse.

Note: The order of the connection properties might vary depending on the tool where you view them.

You can create and manage a Microsoft Azure SQL Data Warehouse connection in the Administrator tool or
the Developer tool. The following table describes the Microsoft Azure SQL Data Warehouse connection
properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the domain. You
can change this property after you create the connection. The name cannot exceed 128 characters,
contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It
must be 255 characters or less and must be unique in the domain. You cannot change this property
after you create the connection. Default value is the connection name.

Description The description of the connection. The description cannot exceed 4,000 characters.

Location The domain where you want to create the connection.

Type The connection type. Select Microsoft Azure SQL Data Warehouse.

The following table describes the properties for metadata access:

Property Description

Azure DW JDBC URL Microsoft Azure Data Warehouse JDBC connection string. For example, you can enter the following connection string: jdbc:sqlserver://<Server>.database.windows.net:1433;database=<Database>. The Administrator can download the URL from the Microsoft Azure portal.

Azure DW JDBC Username User name to connect to the Microsoft Azure SQL Data Warehouse account. You must have permission to read, write, and truncate data in Microsoft Azure SQL Data Warehouse.

Azure DW JDBC Password Password to connect to the Microsoft Azure SQL Data Warehouse account.

Azure DW Schema Name Name of the schema in Microsoft Azure SQL Data Warehouse.

Azure Blob Account Name Name of the Microsoft Azure Storage account to stage the files.


Azure Blob Account Key The key that authenticates the access to the Blob storage account.

Blob End-point Type of Microsoft Azure end-points. You can select any of the following end-points:
- core.windows.net: Default
- core.usgovcloudapi.net: To select the US government Microsoft Azure end-points
- core.chinacloudapi.cn: Not applicable
You can configure the US government Microsoft Azure end-points when a mapping runs in the
native environment and on the Spark engine.

Snowflake Connection Properties


When you set up a Snowflake connection, you must configure the connection properties.

Note: The order of the connection properties might vary depending on the tool where you view them.

The following table describes the Snowflake connection properties:

Property Description

Name The name of the connection. The name is not case sensitive and must be unique within the domain. You can change this property after you create the connection. The name cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /

ID String that the Data Integration Service uses to identify the connection.
The ID is not case sensitive. The ID must be 255 characters or less and must be unique in the domain.
You cannot change this property after you create the connection.
Default value is the connection name.

Description Optional. The description of the connection. The description cannot exceed 4,000 characters.

Location The domain where you want to create the connection.

Type The connection type. Select SnowFlake.

Username The user name to connect to the Snowflake account.

Password The password to connect to the Snowflake account.

Account The name of the Snowflake account.

Warehouse The Snowflake warehouse name.


Role The Snowflake role assigned to the user.

Additional JDBC URL Parameters Enter one or more JDBC connection parameters in the following format:
<param1>=<value>&<param2>=<value>&<param3>=<value>....
For example:
user=jon&warehouse=mywh&db=mydb&schema=public
To access Snowflake through Okta SSO authentication, enter the web-based IdP implementing SAML
2.0 protocol in the following format:
authenticator=https://<Your_Okta_Account_Name>.okta.com
Note: Microsoft ADFS is not supported.
For more information about configuring Okta authentication, see the following website:
https://docs.snowflake.net/manuals/user-guide/admin-security-fed-auth-configure-
snowflake.html#configuring-snowflake-to-use-federated-authentication

Creating a Connection to Access Sources or Targets


Create connections before you import data objects, preview data, and profile data.

1. Within the Administrator tool, click Manage > Connections.


2. Select Actions > New > Connection.
3. Select the type of connection that you want to create:
• To select an HBase connection, select NoSQL > HBase.
• To select an HDFS connection, select File Systems > Hadoop File System.
• To select a Hive connection, select Database > Hive.
• To select a JDBC connection, select Database > JDBC.
4. Click OK.
5. Enter a connection name, ID, and optional description.
6. Configure the connection properties. For a Hive connection, you must choose the Access Hive as a
source or target option to use Hive as a source or a target.
7. Click Test Connection to verify the connection.
8. Click Finish.

Creating a Hadoop Connection


Create a Hadoop connection before you run a mapping in the Hadoop environment.

1. Click Window > Preferences.


2. Select Informatica > Connections.



3. Expand the domain in the Available Connections list.
4. Select the Cluster connection type in the Available Connections list and click Add.
The New Cluster Connection dialog box appears.
5. Enter the general properties for the connection.

6. Click Next.
7. Enter the Hadoop cluster properties, common properties, and the reject directory properties.
8. Click Next.
9. Click Next.
Effective in version 10.2.2, Informatica dropped support for the Hive engine. Do not enter Hive
configuration properties.
10. Enter configuration properties for the Blaze engine and click Next.
11. Enter configuration properties for the Spark engine and click Finish.



Configuring Hadoop Connection Properties
When you create a Hadoop connection, default values are assigned to cluster environment variables, cluster
path properties, and advanced properties. You can add or edit values for these properties. You can also reset
to default values.

You can configure the following Hadoop connection properties based on the cluster environment and
functionality that you use:

• Cluster Environment Variables


• Cluster Library Path
• Common Advanced Properties
• Blaze Engine Advanced Properties
• Spark Engine Advanced Properties

Note: Informatica does not recommend changing these property values before you consult with third-party
documentation, Informatica documentation, or Informatica Global Customer Support. If you change a value
without knowledge of the property, you might experience performance degradation or other unexpected
results.

To reset to default values, delete the property values. For example, if you delete the values of an edited
Cluster Library Path property, the value resets to the default $DEFAULT_CLUSTER_LIBRARY_PATH.

Cluster Environment Variables


The Cluster Environment Variables property lists the environment variables that the cluster uses. Each
environment variable contains a name and a value. You can add environment variables or edit environment
variables.

To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following environment variables in the Cluster Environment Variables property:

HADOOP_NODE_JDK_HOME

Represents the directory from which you run the cluster services and the JDK version that the cluster
nodes use. Required to run the Java transformation in the Hadoop environment and Sqoop mappings on
the Blaze engine. Default is /usr/java/default. The JDK version that the Data Integration Service uses
must be compatible with the JDK version on the cluster.

Set to <cluster JDK home>/jdk<version>.

For example, HADOOP_NODE_JDK_HOME=<cluster JDK home>/jdk<version>.
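For example, a hypothetical Cluster Environment Variables value that sets the JDK home and one additional variable, with the name-value pairs separated by &:, might look like the following. The paths and the second variable are placeholders for your cluster:
HADOOP_NODE_JDK_HOME=/usr/java/jdk1.8.0_181&:TMPDIR=/opt/infa/tmp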

Cluster Library Path


The Cluster Library Path property is a list of path variables for shared libraries on the cluster. You can add or edit
library path variables.

To edit the property in the text box, use the following format with : to separate each path variable:
<variable1>[:<variable2>…:<variableN>]
Configure the library path variables in the Cluster Library Path property.
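For example, a hypothetical Cluster Library Path value that appends a native Hadoop library directory to the default path might look like the following. The appended directory is a placeholder:
$DEFAULT_CLUSTER_LIBRARY_PATH:/usr/lib/hadoop/lib/native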



Common Advanced Properties
Common advanced properties are a list of advanced or custom properties that are unique to the Hadoop
environment. The properties are common to the Blaze and Spark engines. Each property contains a name and
a value. You can add or edit advanced properties.

To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following property in the Advanced Properties of the common properties section:

infapdo.java.opts

List of Java options to customize the Java run-time environment. The property contains default values.

If mappings in a MapR environment contain a Consolidation transformation or a Match transformation, change the following value:

• -Xmx512M. Specifies the maximum size for the Java virtual memory. Default is 512 MB. Increase the
value to at least 700 MB.
For example, infapdo.java.opts=-Xmx700M

Blaze Engine Advanced Properties


Blaze advanced properties are a list of advanced or custom properties that are unique to the Blaze engine.
Each property contains a name and a value. You can add or edit advanced properties.

To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following properties in the Advanced Properties of the Blaze configuration section:

infagrid.cadi.namespace

Namespace for the Data Integration Service to use. Required to set up multiple Blaze instances.

Set to <unique namespace>.

For example, infagrid.cadi.namespace=TestUser1_namespace

infagrid.blaze.console.jsfport

JSF port for the Blaze engine console. Use a port number that no other cluster processes use. Required
to set up multiple Blaze instances.

Set to <unique JSF port value>.

For example, infagrid.blaze.console.jsfport=9090

infagrid.blaze.console.httpport

HTTP port for the Blaze engine console. Use a port number that no other cluster processes use.
Required to set up multiple Blaze instances.

Set to <unique HTTP port value>.

For example, infagrid.blaze.console.httpport=9091

infagrid.node.local.root.log.dir

Path for the Blaze service logs. Default is /tmp/infa/logs/blaze. Required to set up multiple Blaze
instances.

Set to <local Blaze services log directory>.



For example, infagrid.node.local.root.log.dir=<directory path>

infacal.hadoop.logs.directory

Path in HDFS for the persistent Blaze logs. Default is /var/log/hadoop-yarn/apps/informatica. Required
to set up multiple Blaze instances.

Set to <persistent log directory path>.

For example, infacal.hadoop.logs.directory=<directory path>

infagrid.node.hadoop.local.root.log.dir

Path in the Hadoop connection for the service log directory.

Set to <service log directory path>.

For example, infagrid.node.hadoop.local.root.log.dir=$HADOOP_NODE_INFA_HOME/blazeLogs
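For example, to set up a second Blaze instance, you might combine the Blaze properties described above in the Advanced Properties text box, separating the name-value pairs with &:. The namespace, ports, and directories below are placeholders:
infagrid.cadi.namespace=TestUser1_namespace&:infagrid.blaze.console.jsfport=9090&:infagrid.blaze.console.httpport=9091&:infagrid.node.local.root.log.dir=/tmp/infa/logs/blaze_user1&:infacal.hadoop.logs.directory=/var/log/hadoop-yarn/apps/informatica_user1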

Spark Advanced Properties


Spark advanced properties are a list of advanced or custom properties that are unique to the Spark engine. Each property contains a name and a value. You can add or edit advanced properties.

To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]

Configure the following properties in the Advanced Properties of the Spark configuration section:
spark.authenticate

Enables authentication for the Spark service on Hadoop. Required for Spark encryption.

Set to TRUE.

For example, spark.authenticate=TRUE

spark.authenticate.enableSaslEncryption

Enables encrypted communication when SASL authentication is enabled. Required if Spark encryption
uses SASL authentication.

Set to TRUE.

For example, spark.authenticate.enableSaslEncryption=TRUE

spark.executor.cores

Indicates the number of cores that each executor process uses to run tasklets on the Spark engine.

Set to: spark.executor.cores=1

spark.executor.instances

Indicates the number of executor instances that run tasklets on the Spark engine.

Set to: spark.executor.instances=1

spark.executor.memory

Indicates the amount of memory that each executor process uses to run tasklets on the Spark engine.

Set to: spark.executor.memory=3G



infaspark.driver.cluster.mode.extraJavaOptions

List of extra Java options for the Spark driver that runs inside the cluster. Required for streaming
mappings to read from or write to a Kafka cluster that uses Kerberos authentication.

For example, set to:


infaspark.driver.cluster.mode.extraJavaOptions=
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -Djavax.security.auth.useSubjectCredsOnly=true
-Djava.security.krb5.conf=/<path to keytab file>/krb5.conf
-Djava.security.auth.login.config=<path to jaas config>/kafka_client_jaas.config
To configure the property for a specific user, you can include the following lines of code:
infaspark.driver.cluster.mode.extraJavaOptions =
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500
-Djava.security.krb5.conf=/etc/krb5.conf
infaspark.executor.extraJavaOptions

List of extra Java options for the Spark executor. Required for streaming mappings to read from or write
to a Kafka cluster that uses Kerberos authentication.

For example, set to:


infaspark.executor.extraJavaOptions=
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -Djavax.security.auth.useSubjectCredsOnly=true
-Djava.security.krb5.conf=/<path to krb5.conf file>/krb5.conf
-Djava.security.auth.login.config=/<path to jAAS config>/kafka_client_jaas.config
To configure the property for a specific user, you can include the following lines of code:
infaspark.executor.extraJavaOptions =
-Djava.security.egd=file:/dev/./urandom
-XX:MaxMetaspaceSize=256M -XX:+UseG1GC -XX:MaxGCPauseMillis=500
-Djava.security.krb5.conf=/etc/krb5.conf
infaspark.flatfile.writer.nullValue

When the Databricks Spark engine writes to a target, it converts null values to empty strings (" "). For
example, 12, AB,"",23p09udj.

The Databricks Spark engine can write the empty strings to string columns, but when it tries to write an
empty string to a non-string column, the mapping fails with a type mismatch.

To allow the Databricks Spark engine to convert the empty strings back to null values and write to the
target, configure the following advanced property in the Databricks Spark connection:

infaspark.flatfile.writer.nullValue=true

spark.hadoop.validateOutputSpecs

Validates whether the HBase table exists. Required for streaming mappings to write to an HBase target in an Amazon EMR cluster. Set the value to false.

infaspark.json.parser.mode

Specifies how the parser handles corrupt JSON records. You can set the value to one of the following modes:

• DROPMALFORMED. The parser ignores all corrupted records. Default mode.


• PERMISSIVE. The parser accepts non-standard fields as nulls in corrupted records.
• FAILFAST. The parser generates an exception when it encounters a corrupted record and the Spark
application goes down.



infaspark.json.parser.multiLine

Specifies whether the parser can read a multiline record in a JSON file. You can set the value to true or
false. Default is false. Applies only to non-native distributions that use Spark version 2.2.x and above.
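For example, to accept corrupted records as nulls and to allow multiline JSON records, you might set the following values:
infaspark.json.parser.mode=PERMISSIVE&:infaspark.json.parser.multiLine=true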

infaspark.pythontx.exec

Required to run a Python transformation on the Spark engine for Big Data Management. The location of
the Python executable binary on the worker nodes in the Hadoop cluster.

For example, set to:


infaspark.pythontx.exec=/usr/bin/python3.4
If you use the installation of Python on the Data Integration Service machine, set the value to the Python
executable binary in the Informatica installation directory on the Data Integration Service machine.

For example, set to:


infaspark.pythontx.exec=INFA_HOME/services/shared/spark/python/lib/python3.4
infaspark.pythontx.executorEnv.PYTHONHOME

Required to run a Python transformation on the Spark engine for Big Data Management and Big Data
Streaming. The location of the Python installation directory on the worker nodes in the Hadoop cluster.

If the Python installation directory on the worker nodes is in a directory such as usr/lib/python, set the
property to the following value:
infaspark.pythontx.executorEnv.PYTHONHOME=usr/lib/python
If you use the installation of Python on the Data Integration Service machine, use the location of the
Python installation directory on the Data Integration Service machine.

For example, set the property to the following value:


infaspark.pythontx.executorEnv.PYTHONHOME=
INFA_HOME/services/shared/spark/python/
infaspark.pythontx.executorEnv.LD_PRELOAD

Required to run a Python transformation on the Spark engine for Big Data Streaming. The location of the
Python shared library in the Python installation folder on the Data Integration Service machine.

For example, set to:


infaspark.pythontx.executorEnv.LD_PRELOAD=
INFA_HOME/services/shared/spark/python/lib/libpython3.6m.so
infaspark.pythontx.submit.lib.JEP_HOME

Required to run a Python transformation on the Spark engine for Big Data Streaming. The location of the
Jep package in the Python installation folder on the Data Integration Service machine.

For example, set to:


infaspark.pythontx.submit.lib.JEP_HOME=
INFA_HOME/services/shared/spark/python/lib/python3.6/site-packages/jep/
spark.shuffle.encryption.enabled

Enables encrypted communication when authentication is enabled. Required for Spark encryption.

Set to TRUE.

For example, spark.shuffle.encryption.enabled=TRUE



spark.scheduler.maxRegisteredResourcesWaitingTime

The number of milliseconds to wait for resources to register before scheduling a task. Default is 30000.
Decrease the value to reduce delays before starting the Spark job execution. Required to improve
performance for mappings on the Spark engine.

Set to 15000.

For example, spark.scheduler.maxRegisteredResourcesWaitingTime=15000

spark.scheduler.minRegisteredResourcesRatio

The minimum ratio of registered resources to acquire before task scheduling begins. Default is 0.8.
Decrease the value to reduce any delay before starting the Spark job execution. Required to improve
performance for mappings on the Spark engine.

Set to: 0.5

For example, spark.scheduler.minRegisteredResourcesRatio=0.5
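For example, to enable Spark encryption with SASL authentication, you might combine the related properties in the Advanced Properties text box as follows:
spark.authenticate=TRUE&:spark.authenticate.enableSaslEncryption=TRUE&:spark.shuffle.encryption.enabled=TRUE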



