Informatica Big Data Management Integration Guide
10.2.2
February 2019
© Copyright Informatica LLC 2014, 2019
This software and documentation are provided only under a separate license agreement containing restrictions on use and disclosure. No part of this document may be
reproduced or transmitted in any form, by any means (electronic, photocopying, recording or otherwise) without prior consent of Informatica LLC.
U.S. GOVERNMENT RIGHTS Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial
computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such,
the use, duplication, disclosure, modification, and adaptation is subject to the restrictions and license terms set forth in the applicable Government contract, and, to the
extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License.
Informatica, the Informatica logo [and any other Informatica-owned trademarks appearing in the document] are trademarks or registered trademarks of Informatica
LLC in the United States and many jurisdictions throughout the world. A current list of Informatica trademarks is available on the web at https://www.informatica.com/
trademarks.html. Other company and product names may be trade names or trademarks of their respective owners.
Portions of this software and/or documentation are subject to copyright held by third parties. Required third party notices are included with the product.
The information in this documentation is subject to change without notice. If you find any problems in this documentation, report them to us at
[email protected].
Informatica products are warranted according to the terms and conditions of the agreements under which they are provided. INFORMATICA PROVIDES THE
INFORMATION IN THIS DOCUMENT "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.
Create Blaze Engine Directories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Edit the hosts File for the Blaze Engine. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Create a Reject File Directory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Create a Proxy Directory for MapR. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Configure Access to Secure Hadoop Clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Configure the Metadata Access Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Configure the Data Integration Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Download the Informatica Server Binaries for the Hadoop Environment. . . . . . . . . . . . . . . . 27
Configure Data Integration Service Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Prepare a Python Installation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Edit the hosts File for Access to Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Task Flow to Upgrade from a Version Earlier than 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Prepare for Cluster Import from Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Configure *-site.xml Files for Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Prepare for Direct Import from Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Prepare the Archive File for Import from Azure HDInsight. . . . . . . . . . . . . . . . . . . . . . . . . 63
Create a Cluster Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Before You Import. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Importing a Hadoop Cluster Configuration from the Cluster. . . . . . . . . . . . . . . . . . . . . . . . 64
Importing a Hadoop Cluster Configuration from a File. . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Verify or Refresh the Cluster Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Verify JDBC Drivers for Sqoop Connectivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Verify Design-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Verify Run-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Configure the Developer Tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Configure developerCore.ini. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Complete Upgrade Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Update Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Update Streaming Objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Chapter 6: Hortonworks HDP Integration Tasks. . . . . . . . . . . . . . . . . . . . . . . . . 94
Hortonworks HDP Task Flows. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Task Flow to Integrate with Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Task Flow to Upgrade from Version 10.2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
Task Flow to Upgrade from Version 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Task Flow to Upgrade from a Version Earlier than 10.2. . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Prepare for Cluster Import from Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Configure *-site.xml Files for Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Prepare for Direct Import from Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Prepare the Archive File for Import from Hortonworks HDP. . . . . . . . . . . . . . . . . . . . . . . 103
Create a Cluster Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Before You Import. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Importing a Hadoop Cluster Configuration from the Cluster. . . . . . . . . . . . . . . . . . . . . . . 105
Importing a Hadoop Cluster Configuration from a File. . . . . . . . . . . . . . . . . . . . . . . . . . 106
Verify or Refresh the Cluster Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Verify JDBC Drivers for Sqoop Connectivity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Verify Design-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Verify Run-time Drivers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Import Security Certificates to Clients. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Configure the Developer Tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Configure developerCore.ini. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Complete Upgrade Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Update Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
Update Streaming Objects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Generate Tickets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Configure the Data Integration Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Configure the Metadata Access Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Configure the Analyst Service. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Configure the Developer Tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
Configure developerCore.ini. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Complete Upgrade Tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Update Connections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Databricks Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Google Analytics Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Google BigQuery Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
Google Cloud Spanner Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
Google Cloud Storage Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
Hadoop Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Hadoop Cluster Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Common Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Reject Directory Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Hive Pushdown Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Blaze Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
Spark Configuration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
HDFS Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
HBase Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
HBase Connection Properties for MapR-DB. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
Hive Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
JDBC Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
Sqoop Connection-Level Arguments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Kafka Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
Microsoft Azure Blob Storage Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
Microsoft Azure Cosmos DB SQL API Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . 182
Microsoft Azure Data Lake Store Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Microsoft Azure SQL Data Warehouse Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . 184
Snowflake Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Creating a Connection to Access Sources or Targets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Creating a Hadoop Connection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
Configuring Hadoop Connection Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Cluster Environment Variables. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Cluster Library Path. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
Common Advanced Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Blaze Engine Advanced Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
Spark Advanced Properties. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Preface
The Informatica Big Data Management™ Integration Guide is written for the system administrator who is
responsible for integrating the native environment of the Informatica domain with a non-native environment,
such as Hadoop or Databricks. This guide contains instructions to integrate the Informatica and non-native
environments.
Integration tasks are required on the Hadoop cluster, the Data Integration Service machine, and the Developer
tool machine. As a result, this guide contains tasks for administrators of the non-native environments,
Informatica administrators, and Informatica mapping developers. Tasks required by the Hadoop or
Databricks administrator are directed to the administrator.
Use this guide for new integrations and for upgrades. The instructions follow the same task flow. Tasks
that apply only to an upgrade are identified as upgrade tasks.
Informatica Resources
Informatica provides you with a range of product resources through the Informatica Network and other online
portals. Use the resources to get the most from your Informatica products and solutions and to learn from
other Informatica users and subject matter experts.
Informatica Network
The Informatica Network is the gateway to many resources, including the Informatica Knowledge Base and
Informatica Global Customer Support. To enter the Informatica Network, visit
https://network.informatica.com.
To search the Knowledge Base, visit https://search.informatica.com. If you have questions, comments, or
ideas about the Knowledge Base, contact the Informatica Knowledge Base team at
[email protected].
Informatica Documentation
Use the Informatica Documentation Portal to explore an extensive library of documentation for current and
recent product releases. To explore the Documentation Portal, visit https://docs.informatica.com.
Informatica maintains documentation for many products on the Informatica Knowledge Base in addition to
the Documentation Portal. If you cannot find documentation for your product or product version on the
Documentation Portal, search the Knowledge Base at https://search.informatica.com.
If you have questions, comments, or ideas about the product documentation, contact the Informatica
Documentation team at [email protected].
Informatica Velocity
Informatica Velocity is a collection of tips and best practices developed by Informatica Professional Services
and based on real-world experiences from hundreds of data management projects. Informatica Velocity
represents the collective knowledge of Informatica consultants who work with organizations around the
world to plan, develop, deploy, and maintain successful data management solutions.
You can find Informatica Velocity resources at http://velocity.informatica.com. If you have questions,
comments, or ideas about Informatica Velocity, contact Informatica Professional Services at
[email protected].
Informatica Marketplace
The Informatica Marketplace is a forum where you can find solutions that extend and enhance your
Informatica implementations. Leverage any of the hundreds of solutions from Informatica developers and
partners on the Marketplace to improve your productivity and speed up time to implementation on your
projects. You can find the Informatica Marketplace at https://marketplace.informatica.com.
To find your local Informatica Global Customer Support telephone number, visit the Informatica website at
the following link:
https://www.informatica.com/services-and-training/customer-success-services/contact-us.html.
To find online support resources on the Informatica Network, visit https://network.informatica.com and
select the eSupport option.
Part I: Hadoop Integration
This part contains the following chapters:
Chapter 1
Introduction to Hadoop
Integration
This chapter includes the following topics:
The Data Integration Service automatically installs the Hadoop binaries to integrate the Informatica domain
with the Hadoop environment. The integration requires Informatica connection objects and cluster
configurations. A cluster configuration is a domain object that contains configuration parameters that you
import from the Hadoop cluster. You then associate the cluster configuration with connections to access the
Hadoop environment.
Perform the following tasks to integrate the Informatica domain with the Hadoop environment:
When you run a mapping, the Data Integration Service checks for the binary files on the cluster. If they do not
exist or if they are not synchronized, the Data Integration Service prepares the files for transfer. It transfers
the files to the distributed cache through the Informatica Hadoop staging directory on HDFS. By default, the
staging directory is /tmp. This transfer process replaces the requirement to install distribution packages on
the Hadoop cluster.
Big Data Management Component Architecture
The Big Data Management components include client tools, application services, repositories, and third-party
tools that Big Data Management uses for a big data project. The specific components involved depend on the
task you perform.
Hadoop Integration
Big Data Management can connect to clusters that run different Hadoop distributions. Hadoop is an open-
source software framework that enables distributed processing of large data sets across clusters of
machines. You might also need to use third-party software clients to set up and manage your Hadoop cluster.
Big Data Management can connect to the supported data source in the Hadoop environment, such as HDFS,
HBase, or Hive, and push job processing to the Hadoop cluster. To enable high performance access to files
across the cluster, you can connect to an HDFS source. You can also connect to a Hive source, which is a
data warehouse that connects to HDFS.
Big Data Management can also connect to NoSQL databases such as HBase, a key-value store on Hadoop that
performs operations in real time. The Data Integration Service can push mapping jobs to the
Spark or Blaze engine, and it can push profile jobs to the Blaze engine in the Hadoop environment.
Big Data Management supports more than one version of some Hadoop distributions. By default, the cluster
configuration wizard populates the latest supported version.
Monitor the status of profile, mapping, and MDM Big Data Relationship Management jobs on the
Monitoring tab of the Administrator tool. The Monitoring tab of the Administrator tool is called the
Monitoring tool. You can also design a Vibe Data Stream workflow in the Administrator tool.
Informatica Analyst
Create and run profiles on big data sources, and create mapping specifications to collaborate on
projects and define business logic that populates a big data target with data.
Informatica Developer
Create and run profiles against big data sources, and run mappings and workflows on the Hadoop
cluster from the Developer tool.
Application Services
Big Data Management uses application services in the Informatica domain to process data.
Use the Administrator tool to create connections, monitor jobs, and manage application services that Big
Data Management uses.
The Analyst Service runs the Analyst tool in the Informatica domain. The Analyst Service manages the
connections between service components and the users that have access to the Analyst tool.
The Data Integration Service can process mappings in the native environment or push the mapping for
processing to a compute cluster in a non-native environment. The Data Integration Service also retrieves
metadata from the Model repository when you run a Developer tool mapping or workflow. The Analyst
tool and Developer tool connect to the Data Integration Service to run profile jobs and store profile
results in the profiling warehouse.
The Mass Ingestion Service manages and validates mass ingestion specifications that you create in the
Mass Ingestion tool. The Mass Ingestion Service deploys specifications to the Data Integration Service.
When a specification runs, the Mass Ingestion Service generates ingestion statistics.
The Metadata Access Service allows the Developer tool to import and preview metadata from a Hadoop
cluster.
The Metadata Access Service contains information about the Service Principal Name (SPN) and keytab
information if the Hadoop cluster uses Kerberos authentication. You can create one or more Metadata
Access Services on a node. Based on your license, the Metadata Access Service can be highly available.
HBase, HDFS, Hive, and MapR-DB connections use the Metadata Access Service when you import an
object from a Hadoop cluster. Create and configure a Metadata Access Service before you create HBase,
HDFS, Hive, and MapR-DB connections.
The Model Repository Service manages the Model repository. The Model Repository Service connects to
the Model repository when you run a mapping, mapping specification, profile, or workflow.
The REST Operations Hub Service is an application service in the Informatica domain that exposes
Informatica product functionality to external clients through REST APIs.
Repositories
Big Data Management uses repositories and other databases to store data related to connections, source
metadata, data domains, data profiling, data masking, and data lineage. Big Data Management uses
application services in the Informatica domain to access data in repositories.
Model repository
The Model repository stores profiles, data domains, mapping, and workflows that you manage in the
Developer tool. The Model repository also stores profiles, data domains, and mapping specifications that
you manage in the Analyst tool.
Profiling warehouse
The Data Integration Service runs profiles and stores profile results in the profiling warehouse.
For more information about product requirements and supported platforms, see the Product Availability
Matrix on Informatica Network:
https://network.informatica.com/community/informatica-network/product-availability-matrices
Install and configure the Informatica domain and the Developer tool. The Informatica domain must have
a Model Repository Service, a Data Integration Service, and a Metadata Access Service.
Hadoop File System and MapReduce
The Hadoop installation must include a Hive data warehouse with a non-embedded database for the
Hive metastore. Verify that Hadoop is installed with Hadoop File System (HDFS) and MapReduce on
each node. Install Hadoop in a single node environment or in a cluster. For more information, see the
Apache website: http://hadoop.apache.org.
To access relational databases in the Hadoop environment, install database client software and drivers
on each node in the cluster.
Verify with the Hadoop administrator that the distributed cache has at least 1.5 GB of free disk space.
Distribution Version
To ensure access to ports, the network administrator needs to complete additional tasks in the following
situations:
• The Hadoop cluster is behind a firewall. Work with the network administrator to open a range of ports that
a distribution engine uses.
• The Hadoop environment uses Azure HDInsight. Work with the network administrator to enable VPN
between the Informatica domain and the Azure cloud network.
Port Description
7180 Cluster management web app for Cloudera. Required for Cloudera only.
8020 NameNode RPC. Required for all supported distributions except MapR.
8080 Cluster management web app. Used by distributions that use Ambari to manage the cluster:
HDInsight and Hortonworks.
9080 Blaze monitoring console. Required for all distributions if you run mappings using Blaze.
12300 to 12600 Default port range for the Blaze distribution engine. A port range is required for all distributions if
you run mappings using Blaze.
Note: The network administrators must ensure that the port used by the Metadata Access Service is
accessible from the cluster nodes.
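To confirm that a required port is reachable from the Data Integration Service machine, you can use a quick connectivity check similar to the following sketch. The host names are placeholders for your cluster nodes.
nc -zv namenode.example.com 8020    # placeholder NameNode host
nc -zv manager.example.com 7180     # placeholder Cloudera Manager host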
1. Verify that the Big Data Management administrator can run sudo commands.
2. If you are uninstalling Big Data Management in a cluster environment, configure the root user to use a
passwordless Secure Shell (SSH) connection between the machine where you want to run the Big Data
Management uninstall and all of the nodes where Big Data Management is installed.
3. If you are uninstalling Big Data Management in a cluster environment using the HadoopDataNodes file,
verify that the HadoopDataNodes file contains the IP addresses or machine host names of each node in
the Hadoop cluster from which you want to uninstall Big Data Management. The HadoopDataNodes file is
located on the node from which you launch the uninstallation. Add the IP address or host name of one
Hadoop cluster node on each line of the file.
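For example, a HadoopDataNodes file for a three-node cluster might contain entries similar to the following. The host names are placeholders.
node1.cluster.example.com
node2.cluster.example.com
node3.cluster.example.com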
1. Log in to the machine as root user. The machine you log in to depends on the Big Data Management
environment and uninstallation method.
• To uninstall in a single node environment, log in to the machine on which Big Data Management is
installed.
• To uninstall in a cluster environment using the HADOOP_HOME environment variable, log in to the
primary name node.
• To uninstall in a cluster environment using the HadoopDataNodes file, log in to any node.
2. Run one of the following commands to start the uninstallation in console mode:
bash InformaticaHadoopInstall.sh
sh InformaticaHadoopInstall.sh
./InformaticaHadoopInstall.sh
3. Press y to accept the Big Data Management terms of agreement.
4. Press Enter.
5. Select 3 to uninstall Big Data Management.
6. Press Enter.
7. Select the uninstallation option, depending on the Big Data Management environment:
• Select 1 to uninstall Big Data Management from a single node environment.
• Select 2 to uninstall Big Data Management from a cluster environment.
8. Press Enter.
9. If you are uninstalling Big Data Management in a cluster environment, select the uninstallation option,
depending on the uninstallation method:
• Select 1 to uninstall Big Data Management from the primary name node.
• Select 2 to uninstall Big Data Management using the HadoopDataNodes file.
10. Press Enter.
11. If you are uninstalling Big Data Management from a cluster environment from the primary name node,
type the absolute path for the Hadoop installation directory. Start the path with a slash.
The uninstaller deletes all of the Big Data Management binary files from the following directory: /<Big Data
Management installation directory>/Informatica
In a cluster environment, the uninstaller deletes the binary files from all nodes within the Hadoop cluster.
1. In the Ambari configuration manager, select INFORMATICA BDM from the list of services.
2. Click the Service Actions dropdown menu and select Delete Service.
3. To confirm that you want to delete Informatica Big Data Management, perform the following steps:
a. In the Delete Service dialog box, click Delete.
b. In the Confirm Delete dialog box, type delete and then click Delete.
c. When the deletion process is complete, click OK.
Ambari stops the Big Data Management service and deletes it from the listing of available services.
To fully delete Big Data Management from the cluster, continue with the next steps.
4. In a command window, delete the INFORMATICABDM folder from the following directory on the name node
of the cluster: /var/lib/ambari-server/resources/stacks/<Hadoop distribution>/<Hadoop
version>/services/
5. Delete the INFORMATICABDM folder from the following location on all cluster nodes where it was
installed: /var/lib/ambari-agent/cache/stacks/<Hadoop distribution>/<Hadoop version>/
services
6. Perform the following steps to remove RPM binary files:
a. Run the following command to determine the name of the RPM binary archive:
rpm -qa |grep Informatica
b. Run the following command to remove RPM binary files:
rpm -ev <output_from_above_command>
For example:
rpm -ev InformaticaHadoop-10.1.1-1.x86_64
7. Repeat the previous step to remove RPM binary files from each cluster node.
8. Delete the following directory, if it exists, from the name node and each client node: /opt/Informatica/.
9. Repeat the last step on each cluster node where Big Data Management was installed.
10. On the name node, restart the Ambari server.
MapR distribution
If the MapR distribution uses Ticket or Kerberos authentication, the user name must match the system user
that starts the Informatica daemon, and the gid of the user must match the gid of the MapR user.
Azure HDInsight
If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the required
permissions. For the permissions, see “Grant Permissions to an Azure Active Directory User ” on page
22.
To run Sqoop mappings on the Spark engine, add the Hadoop impersonation user as a Linux user on the
machine that hosts the Data Integration Service.
If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the required
permissions. For the permissions, see “Grant Permissions to an Azure Active Directory User ” on page 22.
If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the required
permissions. For the permissions, see “Grant Permissions to an Azure Active Directory User ” on page 22.
If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the required
permissions. Users must be present in the Azure Active Directory that matches the name on the Data
Integration Service machine. For the permissions, see “Grant Permissions to an Azure Active Directory
User ” on page 22.
If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, grant the required
permissions. For the permissions, see “Grant Permissions to an Azure Active Directory User ” on page 22.
• Execute permission on the root folder and its subfolders of the Azure Data Lake Storage account.
• Read and execute permissions on the following directory and its contents: /hdp/apps/<version>
• Read, write, and execute permissions on the following directories:
/tmp
/app-logs
/hive/warehouse
/blaze/workdir
/user
/var/log/hadoop-yarn/apps
/mr-history
/tezstaging
/mapreducestaging
Note: If the directories are not available, create the directories and grant the required permissions.
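For example, if the /blaze/workdir directory does not exist, the Hadoop administrator might create it and grant permissions with commands similar to the following sketch, run as a user with HDFS superuser privileges:
hadoop fs -mkdir -p /blaze/workdir
hadoop fs -chmod -R 777 /blaze/workdir    # 777 is an example; apply the permissions your security policy requires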
By default, the Data Integration Service writes the files to the HDFS directory /tmp.
Grant permission to the Hadoop staging user. If you did not create a Hadoop staging user, the Data
Integration Services uses the operating system user that starts the Informatica daemon.
Grant read and write permissions on the Hive warehouse directory. You can find the location of the
warehouse directory in the hive.metastore.warehouse.dir property of the hive-site.xml file. For example, the
default might be /user/hive/warehouse or /apps/hive/warehouse.
Grant permission to the Hadoop impersonation user. Optionally, you can assign -777 permissions on the
directory.
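For example, a sketch that assumes the warehouse directory is the default /user/hive/warehouse and that open permissions are acceptable in your environment:
hadoop fs -chmod -R 777 /user/hive/warehouse    # adjust the path to match the hive.metastore.warehouse.dir value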
Optionally, create a staging directory on HDFS for the Spark engine. For example:
hadoop fs -mkdir -p /spark/staging
If you want to write the logs to the Informatica Hadoop staging directory, you do not need to create a Spark
staging directory. By default, the Data Integration Service uses the HDFS directory /tmp/SPARK_<user name>.
Create a Sqoop staging directory named sqoop_staging manually in the following situations:
• You run a Sqoop pass-through mapping on the Spark engine to read data from a Sqoop source and write
data to a Hive target that uses the Text format.
• You use a Cloudera CDH cluster with Sentry authorization or a Hortonworks HDP cluster with Ranger
authorization.
After you create the sqoop_staging directory, you must add an Access Control List (ACL) for the
sqoop_staging directory and grant write permissions to the Hive super user. Run the ACL command on the
Cloudera CDH cluster or the Hortonworks HDP cluster.
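A sketch of such a command, assuming the staging directory is /sqoop_staging and the Hive super user is named hive:
hadoop fs -setfacl -m default:user:hive:rwx /sqoop_staging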
For information about Sentry authorization, see the Cloudera documentation. For information about Ranger
authorization, see the Hortonworks documentation.
Complete the following tasks to prepare the Hadoop cluster for the Blaze engine:
If you created a blaze user, create a home directory for the blaze user. For example, run the following
commands as a user with HDFS superuser privileges:
hadoop fs -mkdir /user/blaze
hadoop fs -chown blaze:blaze /user/blaze
If you did not create a blaze user, the Hadoop impersonation user is the default user.
By default, the Blaze engine writes the service logs to the YARN distributed cache. For example, run the
following command:
mkdir -p /opt/informatica/blazeLogs
$HADOOP_NODE_INFA_HOME gets set to the YARN distributed cache. If you create a directory, you must
update the value of the advanced property in the Hadoop connection.
Create a log directory on HDFS to contain aggregated logs for local services. For example:
hadoop fs -mkdir -p /var/log/hadoop-yarn/apps/informatica
Ensure that value of the advanced property in the Hadoop connection matches the directory that you
created.
You can write the logs to the Informatica Hadoop staging directory, or you can create a Blaze staging
directory. If you do not want to use the default location, create a staging directory on the HDFS. For
example:
hadoop fs -mkdir -p /blaze/workdir
Note: If you do not create a staging directory, clear the Blaze staging directory property value in the
Hadoop connection and the Data Integration Service uses the HDFS directory /tmp/blaze_<user name>.
• Blaze user
• Hadoop impersonation user
• Mapping impersonation users
If the blaze user does not have permission, the Blaze engine uses a different user, based on the cluster
security and the mapping impersonation configuration.
Each node in the cluster requires an entry for the IP address and the fully qualified domain name (FQDN) of
all other nodes. For example,
127.0.0.1 localhost node1.node.com
208.164.186.1 node1.node.com node1
208.164.186.2 node2.node.com node2
208.164.186.3 node3.node.com node3
Changes take effect after you restart the network.
Reject files can be very large, and you can choose to write them to HDFS instead of the Data Integration
Service machine. You can configure the Hadoop connection object to write to the reject file directory.
• Blaze user
• Hadoop impersonation user
• Mapping impersonation users
If the blaze user does not have permission, the Blaze engine uses a different user, based on the cluster
security and the mapping impersonation configuration.
• Create a user or verify that a user exists on every Data Integration Service machine and on every node in
the Hadoop cluster.
• Verify that the uid and the gid of the user match in both environments.
• Verify that a directory exists for the user on the cluster. For example, /opt/mapr/conf/proxy/<user
name>
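For example, assuming the user name is blaze, you can compare the uid and gid on the Data Integration Service machine and on a cluster node, and then check for the proxy entry on the cluster:
id blaze                              # run on both machines and compare the uid and gid
ls -ld /opt/mapr/conf/proxy/blaze     # run on the cluster node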
Depending on the security implementation on the cluster, you must perform the following tasks:
You must configure the Kerberos configuration file on the Data Integration Service machine to match the
Kerberos realm properties of the Hadoop cluster. Verify that the Hadoop Kerberos properties are
configured in the Data Integration Service and the Metadata Access Service.
You must import security certificates to the Data Integration Service and the Metadata Access Service
machines.
If the transparent encryption uses Cloudera Java KMS, Cloudera Navigator KMS, or Apache Ranger KMS,
you must configure the KMS for Informatica user access.
If an Azure HDInsight cluster uses Enterprise Security Package and ADLS storage, perform the following
tasks:
• Create a keytab file on any one of the cluster nodes for the specific user. To create a keytab file, use
the ktutil command.
• In the Azure portal, assign the Owner role to the Azure HDInsight cluster service principal display
name.
• Log in to Ambari Web UI with the Azure Active Directory user credentials to generate the OAuth token
for authentication for the following users:
- Keytab user
- Blaze user
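A sketch of creating a keytab file with ktutil on a cluster node. The principal name, encryption type, and output path are placeholders that depend on your Kerberos configuration:
ktutil
addent -password -p [email protected] -k 1 -e aes256-cts
wkt /tmp/adlsuser.keytab
quit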
For more information, see the Informatica Big Data Management Administrator Guide.
Configure the following Metadata Access Service properties:
Use Operating System Profiles and Impersonation
If enabled, the Metadata Access Service uses the operating system profiles to access the Hadoop cluster.
Hadoop Kerberos Service Principal Name
Service Principal Name (SPN) of the Metadata Access Service to connect to a Hadoop cluster that uses Kerberos authentication. Not applicable for the MapR distribution.
Hadoop Kerberos Keytab
The file path to the Kerberos keytab file on the machine on which the Metadata Access Service runs. Not applicable for the MapR distribution.
Use logged in user as impersonation user
Required if the Hadoop cluster uses Kerberos authentication. If enabled, the Metadata Access Service uses the impersonation user to access the Hadoop environment. Default is false.
1. Download Informatica Hadoop binaries to the Data Integration Service machine if the operating systems
of the Hadoop environment and the Data Integration Service are different.
2. Configure the Data Integration Service properties, such as the cluster staging directory, Hadoop
Kerberos service principal name, and the path to the Kerberos keytab file.
3. Prepare an installation of Python on the Data Integration Service machine or on the Hadoop cluster if you
plan to run the Python transformation.
4. Copy the krb5.conf file to the following location on the machine that hosts the Data Integration Service:
• <Informatica installation directory>/java/jre/lib/security
• <Informatica installation directory>/services/shared/security
5. Copy the keytab file to the following directory: <Informatica installation directory>/isp/config/
keys
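For example, a sketch that assumes the Informatica installation directory is /opt/informatica and the keytab file is /etc/security/keytabs/infa_hadoop.keytab:
cp /etc/krb5.conf /opt/informatica/java/jre/lib/security/
cp /etc/krb5.conf /opt/informatica/services/shared/security/
cp /etc/security/keytabs/infa_hadoop.keytab /opt/informatica/isp/config/keys/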
The Data Integration Service can synchronize the following operating systems: SUSE and Redhat
The Data Integration Service machine must include the Informatica server binaries that are compatible with
the Hadoop cluster operating system. The Data Integration Service uses the operating system binaries to
integrate the domain with the Hadoop cluster.
1. Create a directory on the Data Integration Service host machine to store the Informatica server binaries
associated with the Hadoop operating system.
If the Data Integration Service runs on a grid, Informatica recommends extracting the files to a location
that is shared by all services on the grid. If the location is not shared, you must extract the files to all
Data Integration Service machines that run on the grid.
The directory names in the path must not contain spaces or the following special characters: @ | * $ # !
%(){}[]
2. Download and extract the Informatica server binaries from the Informatica download site. For example,
tar -xvf <Informatica server binary tar file>
3. Run the installer to extract the installation binaries into the custom OS path.
Perform the following steps to run the installer:
• Run the following command: sh Server/install.bin -DINSTALL_MODE=CONSOLE -DINSTALL_TYPE=0
• Press Y to continue the installation.
• Press 1 to install Informatica Big Data Suite Products.
• Press 3 to run the installer.
• Press 2 to accept the terms and conditions.
• Press 2 to continue the installation for big data products only.
• Press 2 to configure the Informatica domain to run on a network with Kerberos authentication.
• Enter the path and file name of the Informatica license key and press an option to tune the services.
• Enter the custom Hadoop OS path.
• Type Quit to quit the installation.
4. Set the custom Hadoop OS path in the Data Integration Service and then restart the service.
5. Optionally, you can delete files that are not required. For example, run the following command:
rm -Rf <Informatica server binary file> ./source/*.7z
Note: If you subsequently install an Informatica EBF, you must also install it in the path of the Informatica
server binaries associated with the Hadoop environment.
The following table describes the Data Integration Service properties that you need to configure:
Cluster Staging Directory
The directory on the cluster where the Data Integration Service pushes the binaries to integrate the native and non-native environments and to store temporary files during processing. Default is /tmp.
Hadoop Staging User
The HDFS user that performs operations on the Hadoop staging directory. The user requires write permissions on the Hadoop staging directory. Default is the operating system user that starts the Informatica daemon.
Custom Hadoop OS Path
The local path to the Informatica server binaries compatible with the Hadoop operating system. Required when the Hadoop cluster and the Data Integration Service are on different supported operating systems. The Data Integration Service uses the binaries in this directory to integrate the domain with the Hadoop cluster. The Data Integration Service can synchronize the following operating systems: SUSE and Redhat.
Include the source directory in the path. For example, <Informatica server binaries>/source.
Changes take effect after you recycle the Data Integration Service.
Note: When you install an Informatica EBF, you must also install it in this directory.
Hadoop Kerberos Service Principal Name
Service Principal Name (SPN) of the Data Integration Service to connect to a Hadoop cluster that uses Kerberos authentication. Not required for the MapR distribution.
Hadoop Kerberos Keytab
The file path to the Kerberos keytab file on the machine on which the Data Integration Service runs. Not required for the MapR distribution.
• Verify that all worker nodes on the cluster contain an installation of Python in the same directory, such as
/usr/lib/python, and that each Python installation contains all required modules. You do not re-install
Python, but you must reconfigure the following Spark advanced property in the Hadoop connection:
infaspark.pythontx.executorEnv.PYTHONHOME
• Install Python on every Data Integration Service machine. You can create a custom installation of Python
that contains specific modules that you can reference in the Python code. When you run mappings, the
Python installation is propagated to the worker nodes on the cluster.
1. Install Python.
2. Optionally, install any third-party libraries such as numpy, scikit-learn, and cv2. You can access the third-
party libraries in the Python transformation.
3. Copy the Python installation folder to the following location on the Data Integration Service machine:
<Informatica installation directory>/services/shared/spark/python
Note: If the Data Integration Service machine already contains an installation of Python, you can copy the
existing Python installation to the above location.
Changes take effect after you recycle the Data Integration Service.
Supported Python versions: 2.7, 3.3, 3.4, 3.5, and 3.6.
1. Install Python with the --enable-shared option to ensure that shared libraries are accessible by Jep.
2. Install Jep. To install Jep, consider the following installation options:
• Run pip install jep. Use this option if Python is installed with the pip package.
• Configure the Jep binaries. Ensure that jep.jar can be accessed by Java classloaders, the shared
Jep library can be accessed by Java, and Jep Python files can be accessed by Python.
3. Optionally, install any third-party libraries such as numpy, scikit-learn, and cv2. You can access the third-
party libraries in the Python transformation.
4. Copy the Python installation folder to the following location on the Data Integration Service machine:
<Informatica installation directory>/services/shared/spark/python
Note: If the Data Integration Service machine already contains an installation of Python, you can copy the
existing Python installation to the above location.
Changes take effect after you recycle the Data Integration Service.
• Enter the IP address, DNS name, and DNS short name for each data node on the cluster. Use
headnodehost to identify the host as the cluster headnode host.
For example:
10.75.169.19 hn0-rndhdi.grg2yxlb0aouniiuvfp3bet13d.ix.internal.cloudapp.net headnodehost
• If the HDInsight cluster is integrated with ADLS storage, you also need to enter the IP addresses and DNS
names for the hosts listed in the cluster property fs.azure.datalake.token.provider.service.urls.
• Integrate the Informatica domain with Amazon EMR for the first time.
• Upgrade from version 10.2.1.
• Upgrade from version 10.2.
• Upgrade from a version earlier than 10.2.
Task Flow to Integrate with Amazon EMR
The following diagram shows the task flow to integrate the Informatica domain with Amazon EMR:
Note: If you are upgrading from a previous version, verify the properties and suggested values, as Big Data
Management might require additional properties or different values for existing properties.
Complete the following tasks to prepare the cluster before the Informatica administrator creates the cluster
configuration:
1. Verify property values in *-site.xml files that Big Data Management needs to run mappings in the Hadoop
environment.
2. Prepare the archive file to import into the domain.
Note: You cannot import cluster information directly from the Amazon EMR cluster into the Informatica
domain.
core-site.xml
Configure the following properties in the core-site.xml file:
fs.s3.awsAccessKeyID
The ID for the run-time engine to connect to the Amazon S3 file system. Required for the Blaze engine
and for the Spark engine if the S3 policy does not allow EMR access.
Note: If the Data Integration Service is deployed on an EC2 instance and the IAM roles and policies allow
access to S3 and other resources, this property is not required. If the Data Integration Service is
deployed on-premises, then you can choose to configure the value for this property in the cluster
configuration on the Data Integration Service after you import the cluster configuration. Configuring the
AccessKeyID value on the cluster configuration is more secure than configuring it in core-site.xml on the
cluster.
fs.s3.awsSecretAccessKey
The access key for the Blaze and Spark engines to connect to the Amazon S3 file system. Required for
the Blaze engine and for the Spark engine if the S3 policy does not allow EMR access.
Note: If the Data Integration Service is deployed on an EC2 instance and the IAM roles and policies allow
access to S3 and other resources, this property is not required. If the Data Integration Service is
deployed on-premises, then you can choose to configure the value for this property in the cluster
configuration on the Data Integration Service after you import the cluster configuration. Configuring the
AccessKeyID value on the cluster configuration is more secure than configuring it in core-site.xml on the
cluster.
fs.s3.enableServerSideEncryption
Enables server side encryption for S3 buckets. Required for SSE and SSE-KMS encryption.
fs.s3a.server-side-encryption-algorithm
The server-side encryption algorithm for S3. Required for SSE and SSE-KMS encryption. Set to the
encryption algorithm used.
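For example, assuming the S3 data is encrypted with SSE-KMS:
<property>
<name>fs.s3a.server-side-encryption-algorithm</name>
<value>SSE-KMS</value>
</property>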
fs.s3a.endpoint
For example:
<property>
<name>fs.s3a.endpoint</name>
<value>s3-us-west-1.amazonaws.com</value>
</property>
fs.s3a.bucket.BUCKET_NAME.server-side-encryption.key
Server-side encryption key for the S3 bucket. Required if the S3 bucket is encrypted with SSE-KMS.
For example:
<property>
<name>fs.s3a.bucket.BUCKET_NAME.server-side-encryption.key</name>
<value>arn:aws:kms:us-west-1*******</value>
<source>core-site.xml</source>
</property>
where BUCKET_NAME is the name of the S3 bucket.
hadoop.proxyuser.<proxy user>.groups
Defines the groups that the proxy user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.<proxy user>.hosts
Defines the host machines that a user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.
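For example, a sketch that assumes the proxy user is impuser and that impersonation is allowed from any group and any host. Tighten the values if your security policy requires it:
<property>
<name>hadoop.proxyuser.impuser.groups</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.impuser.hosts</name>
<value>*</value>
</property>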
hadoop.proxyuser.yarn.groups
Comma-separated list of groups that you want to allow the YARN user to impersonate on a non-secure
cluster.
Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.yarn.hosts
Comma-separated list of hosts that you want to allow the YARN user to impersonate on a non-secure
cluster.
hadoop.security.auth_to_local
Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.
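For example, a sketch of a rule that strips a placeholder realm MYREALM.COM from one-component principals and then applies the default rule:
<property>
<name>hadoop.security.auth_to_local</name>
<value>RULE:[1:$1@$0](.*@MYREALM.COM)s/@.*//
DEFAULT</value>
</property>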
io.compression.codecs
hbase-site.xml
Configure the following properties in the hbase-site.xml file:
hbase.use.dynamic.jars
Enables metadata import and test connection from the Developer tool. Required for an HDInsight cluster
that uses ADLS storage or an Amazon EMR cluster that uses HBase resources in S3 storage.
zookeeper.znode.parent
hive-site.xml
Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class
The token store implementation. Required for HiveServer2 high availability and load balancing.
hive.compactor.initiator.on
Runs the initiator and cleaner threads on metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.
hive.compactor.worker.threads
The number of worker threads to run in a metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.
Set to: 1
hive.conf.hidden.list
Set to:
javax.jdo.option.ConnectionPassword,hive.server2.keystore.password,fs.s3n.awsAccessKeyId,fs.s3n.awsSecretAccessKey,fs.s3a.access.key,fs.s3a.secret.key,fs.s3a.proxy.password
hive.enforce.bucketing
Enables dynamic bucketing while loading to Hive. Required for an Update Strategy transformation in a
mapping that writes to a Hive target.
hive.exec.dynamic.partition
Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.
hive.exec.dynamic.partition.mode
Allows all partitions to be dynamic. Required for the Update Strategy transformation in a mapping that
writes to a Hive target. Also required if you use Sqoop and define a DDL query to create or replace a
partitioned Hive target at run time.
hive.support.concurrency
Enables table locking in Hive. Required for an Update Strategy transformation in a mapping that writes to
a Hive target.
hive.txn.manager
Turns on transaction support. Required for an Update Strategy transformation in a mapping that writes
to a Hive target.
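As a consolidated illustration, the Update Strategy-related settings described above typically resemble the following in hive-site.xml. The values shown are the common Hive ACID settings; confirm them against your distribution's documentation:
<property>
<name>hive.support.concurrency</name>
<value>true</value>
</property>
<property>
<name>hive.enforce.bucketing</name>
<value>true</value>
</property>
<property>
<name>hive.exec.dynamic.partition.mode</name>
<value>nonstrict</value>
</property>
<property>
<name>hive.txn.manager</name>
<value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
<name>hive.compactor.initiator.on</name>
<value>true</value>
</property>
<property>
<name>hive.compactor.worker.threads</name>
<value>1</value>
</property>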
kms-site.xml
Configure the following properties in the kms-site.xml file:
hadoop.kms.authentication.kerberos.name.rules
Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.
mapred-site.xml
Configure the following properties in the mapred-site.xml file:
mapreduce.framework.name
The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for
Sqoop.
yarn.app.mapreduce.am.staging-dir
yarn-site.xml
Configure the following properties in the yarn-site.xml file:
yarn.application.classpath
Add spark_shuffle.jar to the class path. The .jar file must contain the class
"org.apache.spark.network.yarn.YarnShuffleService."
yarn.scheduler.maximum-allocation-mb
The maximum RAM available for each container. Set the maximum memory on the cluster to increase
resource memory available to the Blaze engine.
yarn.nodemanager.resource.cpu-vcores
The number of virtual cores for each container. Required for Blaze engine resource allocation.
yarn.scheduler.minimum-allocation-mb
The minimum RAM available for each container. Required for Blaze engine resource allocation.
yarn.nodemanager.vmem-check-enabled
Disables virtual memory limits for containers. Required for the Blaze and Spark engines.
yarn.nodemanager.aux-services
yarn.nodemanager.aux-services.spark_shuffle.class
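For example, a sketch that registers the Spark shuffle service alongside the MapReduce shuffle service:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
<value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>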
yarn.resourcemanager.scheduler.class
Defines the YARN scheduler that the Data Integration Service uses to assign resources.
yarn.node-labels.enabled
yarn.node-labels.fs-store.root-dir
Create an archive file that contains the following files from the cluster:
• core-site.xml
• hbase-site.xml. Required only if you access HBase sources and targets.
• hdfs-site.xml
• hive-site.xml
Note: To import from Amazon EMR, the Informatica administrator must use an archive file.
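For example, after collecting the *-site.xml files into a single directory on a cluster node, one way to package them is with a command similar to the following. The archive name is a placeholder:
zip emr_cluster_conf.zip core-site.xml hbase-site.xml hdfs-site.xml hive-site.xml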
A cluster configuration is an object in the domain that contains configuration information about the Hadoop
cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop
environment. Import configuration properties from the Hadoop cluster to create a cluster configuration.
The import process imports values from *-site.xml files into configuration sets based on the individual *-
site.xml files. When you perform the import, the cluster configuration wizard can create Hadoop, HBase,
HDFS, and Hive connections to access the Hadoop environment. If you choose to create the connections, the
wizard also associates the cluster configuration with the connections.
Note: If you are integrating for the first time and you imported the cluster configuration when you ran the
installer, you must re-create or refresh the cluster configuration.
Before you import from the cluster, you must get the archive file from the Hadoop administrator.
1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:
Method to import the cluster configuration
Choose Import from file to import properties from an archive file.
Create connections
Choose to create Hadoop, HDFS, Hive, and HBase connections. If you choose to create connections, the Cluster Configuration wizard associates the cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster environment variables, cluster path variables, and advanced properties. Based on the cluster environment and the functionality that you use, you can add to the default values or change the default values of these properties. For a list of Hadoop connection properties to configure, see "Configuring Hadoop Connection Properties" on page 188.
If you do not choose to create connections, you must manually create them and associate the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata Connection String and the Data Access Connection String properties with the value from the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on different nodes, you must update the Metadata Connection String to point to the HiveServer2 host.
4. Click Browse to select a file. Select the file and click Open.
5. Click Next and verify the cluster configuration information on the summary page.
If you upgraded from 10.2 and you changed the distribution version, you need to verify the distribution
version in the General properties of the cluster configuration.
Effective in version 10.2.1, Informatica assigns a default version to each Hadoop distribution type. If you
configure the cluster configuration to use the default version, the upgrade process upgrades to the
assigned default version if the version changes. If you have not upgraded your Hadoop distribution to
Informatica's default version, you need to update the distribution version property.
For example, suppose the assigned default Hadoop distribution version for 10.2.1 is n, and for 10.2.2 is n
+1. If the cluster configuration uses the default supported Hadoop version of n, the upgraded cluster
configuration uses the default version of n+1. If you have not upgraded the distribution in the Hadoop
environment, you need to change the cluster configuration to use version n.
If you configure the cluster configuration to use a distribution version that is not the default version, you
need to update the distribution version property in the following circumstances:
• Design-time. To import metadata, you can use the DataDirect drivers packaged with the Informatica
installer if they are available. If they are not available, use any Type 4 JDBC driver that the database
vendor recommends.
• Run-time. To run mappings, use any Type 4 JDBC driver that the database vendor recommends. Some
distributions support other drivers to use Sqoop connectors. You cannot use the DataDirect drivers for
run-time processing.
Copy the JDBC driver .jar files to the following location on the Developer tool machine:
1. Download Type 4 JDBC drivers associated with the JDBC-compliant databases that you want to access.
2. To optimize the Sqoop mapping performance on the Spark engine while writing data to an HDFS
complex file target of the Parquet format, download the following .jar files:
• parquet-hadoop-bundle-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-hadoop-bundle/1.6.0/
• parquet-avro-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-avro/1.6.0/
• parquet-column-1.5.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-column/1.5.0/
3. Copy all of the .jar files to the following directory on the machine where the Data Integration Service
runs:
<Informatica installation directory>\externaljdbcjars
Changes take effect after you recycle the Data Integration Service. At run time, the Data Integration
Service copies the .jar files to the Hadoop distribution cache so that the .jar files are accessible to all
nodes in the cluster.
You can perform one of the following steps to configure the files:
To integrate with EMR 5.16, get emrfs-hadoop-assembly-2.25.0.jar from the Hadoop administrator.
Copy the file to the following locations on each Data Integration Service machine:
/<Informatica installation directory>/services/shared/hadoop/EMR_<version number>/lib
/<Informatica installation directory>/services/shared/hadoop/EMR_<version number>/
extras/hive-auxjars
Note: If you upgraded from EMR 5.10 to EMR 5.14, the part of the file path that includes EMR_<version
number> remains EMR_5.10.
Create a file
Create a ~/.aws/config file on the Data Integration Service machine. The file must contain the AWS location.
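For example, a minimal ~/.aws/config file that sets the location looks like the following. The region value is
a placeholder; use the region of your S3 resources:
[default]
region = us-west-2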
S3 access policies allow control of user access to S3 resources and the actions that users can perform. The
AWS administrator uses policies to control access and actions for specific users and resources, depending
on the use case that mappings and workflows require.
AWS uses a JSON statement for S3 access policies. To set the S3 access policy, determine the principal,
actions, and resources to define, then create or edit an existing S3 access policy JSON statement.
For more information about Amazon S3 access policies, see AWS documentation.
The following table describes the tags to set in the access policy:
Tag Description
Principal The user, service, or account that receives permissions that are defined in a policy.
Assign the owner of the S3 bucket resources as the principal.
Note: The S3 bucket owner and the owner of resources within the bucket can be different.
After you copy the JSON statement, you can edit it in a text editor or in the bucket policy editor.
5. Type the bucket access policy, or edit the existing policy, and click Save.
AWS applies the access policy to the bucket.
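The following JSON statement is a minimal sketch of a bucket policy. The principal ARN, actions, and bucket
name are placeholders that the AWS administrator replaces with values for the environment:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowBucketAccess",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::123456789012:user/bucket-owner"},
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::examplebucket", "arn:aws:s3:::examplebucket/*"]
    }
  ]
}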
Configure developerCore.ini
Edit developerCore.ini to successfully import local complex files available on the Developer tool machine.
When you import a complex file, such as Avro or Parquet, the imported object includes metadata associated
with the distribution in the Hadoop environment. If the file resides on the Developer tool machine, the import
process picks up the distribution information from the developerCore.ini file. You must edit the
developerCore.ini file to point to the distribution directory on the Developer tool machine.
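As a sketch, the entry that points the Developer tool to a distribution directory typically takes the following
form. The property name and path shown here are assumptions to verify against your installation; replace the
distribution name and version with the values for your environment:
-DINFA_HADOOP_DIST_DIR=hadoop\<distribution>_<version>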
Based on the version that you upgraded from, you might need to update the following types of objects:
Connections
Based on the version you are upgrading from, you might need to update Hadoop connections or replace
connections to the Hadoop environment.
The Hadoop connection contains additional properties. You need to manually update it to include
customized configuration in the hadoopEnv.properties file from previous versions.
Streaming mappings
A streaming mapping is invalid if it contains data objects or transformations for which support is deferred.
Support will be reinstated in a future release.
After you upgrade, the streaming mappings become invalid. You must re-create the physical data objects
to run the mappings on the Spark engine that uses Structured Streaming.
After you re-create the physical data objects to run the mappings on the Spark engine that uses Structured
Streaming, some properties are not available for Azure Event Hubs data objects.
Update Connections
You might need to update connections based on the version you are upgrading from.
Consider the following types of updates that you might need to make:
Configure the Hadoop connection.
Configure the Hadoop connection to incorporate properties from the hadoopEnv.properties file.
Replace connections.
If you chose the option to create connections when you ran the Cluster Configuration wizard, you need
to replace connections in mappings with the new connections.
If you did not create connections when you created the cluster configuration, you need to update the
connections.
When you run the Informatica upgrade, the installer backs up the existing hadoopEnv.properties file. You can
find the backup hadoopEnv.properties file in the following location:
The method that you use to replace connections in mappings depends on the type of connection.
Hadoop connection
For information about the infacmd commands, see the Informatica Command Reference.
Review connections that you created in a previous release to update the values for connection
properties. For example, if you added nodes to the cluster or if you updated the distribution version, you
might need to verify host names, URIs, or port numbers for some of the properties.
Associate the cluster configuration
The Hadoop, Hive, HDFS, and HBase connections must be associated with a cluster configuration.
Complete the following tasks:
1. Run infacmd isp listConnections to identify the connections that you need to upgrade. Use -ct
to list connections of a particular type.
2. Run infacmd isp UpdateConnection to associate the cluster configuration with the connection.
Use -cn to name the connection and -o clusterConfigID to associate the cluster configuration
with the connection.
After you upgrade, the existing streaming mappings become invalid because of the unavailable header ports,
the unsupported transformations or data objects, and the behavior change of some data objects.
• Re-create the physical data objects. After you re-create the physical data objects, the data objects get the
required header ports, such as timestamp, partitionID, or key based on the data object.
• In a Normalizer transformation, if the Occurs column is set to Auto, re-create the Normalizer
transformation. You must re-create the Normalizer transformation because the type configuration
property of the complex port refers to the physical data object that you plan to replace.
• Update the streaming mapping. If the mapping contains Kafka target, Aggregator transformation, Joiner
transformation, or Normalizer transformation, replace the data object or transformation, and then update
the mapping because of the changed behavior of these transformations and data objects.
• Verify the deferred data object types. If the streaming mapping contains unsupported transformations or
data objects, contact Informatica Global Customer Support.
1. Go to the existing mapping and select the data object from the mapping.
2. Click the Properties tab. On the Column Projection tab, click Edit Schema.
3. Note the schema information from the Edit Schema dialog box.
4. Note the parameters information from the Parameters tab.
5. Create new physical data objects.
After you re-create the data objects, the physical data objects get the required header ports. Microsoft
Azure does not support the following properties, and they are not available for Azure Event Hubs data objects:
• Consumer Properties
• Partition Count
Transformation Updates
If a transformation uses a complex port, configure the type configuration property of the port because
the property refers to the physical data object that you replaced.
If a mapping contains an Aggregator transformation upstream from a Joiner transformation, move the
Aggregator transformation downstream from a Joiner transformation. Add a Window transformation
directly upstream from both Aggregator and Joiner transformations.
The following table lists the data object types to which the support is deferred to a future release:
Source    JMS
Source    MapR Streams
If you want to continue using the mappings that contain deferred data objects or transformations, you must
contact Informatica Global Customer Support.
Use the tasks in this chapter for any of the following integration scenarios:
• Integrate the Informatica domain with Azure HDInsight for the first time.
• Upgrade from version 10.2.1.
• Upgrade from version 10.2.
• Upgrade from a version earlier than 10.2.
Task Flow to Integrate with Azure HDInsight
The following diagram shows the task flow to integrate the Informatica domain with Azure HDInsight:
Note: If you are upgrading from a previous version, verify the properties and suggested values, as Big Data
Management might require additional properties or different values for existing properties.
Complete the following tasks to prepare the cluster before the Informatica administrator creates the cluster
configuration:
1. Verify that the VPN is enabled between the Informatica domain and the Azure HDInsight cloud network.
2. Verify property values in *-site.xml files that Big Data Management needs to run mappings in the Hadoop
environment.
3. Provide information to the Informatica administrator that is required to import cluster information into
the domain. Depending on the method of import, perform one of the following tasks:
• To import directly from the cluster, give the Informatica administrator cluster authentication
information to connect to the cluster.
• To import from an archive file, export cluster information and provide an archive file to the
Informatica administrator.
core-site.xml
Configure the following properties in the core-site.xml file:
fs.azure.account.key.<youraccount>.blob.core.windows.net
Required for Azure HDInsight cluster that uses WASB storage. The storage account access key required
to access the storage.
You can contact the HDInsight cluster administrator to get the storage account key associated with the
HDInsight cluster. If you are unable to contact the administrator, perform the following steps to decrypt
the encrypted storage account key:
<property>
<name>fs.azure.shellkeyprovider.script</name>
<value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property>
- Copy the decrypted value and update the value of
fs.azure.account.key.youraccount.blob.core.windows.net property in the cluster configuration
core-site.xml.
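After you have the decrypted key, the entry in the cluster configuration core-site.xml takes a form similar to
the following. The account name and key value are placeholders:
<property>
<name>fs.azure.account.key.youraccount.blob.core.windows.net</name>
<value><decrypted storage account key></value>
</property>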
dfs.adls.oauth2.client.id
Required for Azure HDInsight cluster that uses ADLS storage without Enterprise Security Package. The
application ID associated with the Service Principal required to authorize the service principal and
access the storage.
To find the application ID for a service principal, in the Azure Portal, click Azure Active Directory > App
registrations > Service Principal Display Name.
dfs.adls.oauth2.refresh.url
Required for Azure HDInsight cluster that uses ADLS storage without Enterprise Security Package. The
OAuth 2.0 token endpoint required to authorize the service principal and access the storage.
To find the refresh URL OAuth 2.0 endpoint, in the Azure portal, click Azure Active Directory > App
registrations > Endpoints.
dfs.adls.oauth2.credential
Required for Azure HDInsight cluster that uses ADLS storage without Enterprise Security Package. The
password required to authorize the service principal and access the storage.
To find the password for a service principal, in the Azure portal, click Azure Active Directory > App
registrations > Service Principal Display Name > Settings > Keys.
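Taken together, the entries in core-site.xml look similar to the following. The application ID, tenant ID, and
key are placeholders:
<property>
<name>dfs.adls.oauth2.client.id</name>
<value><application ID></value>
</property>
<property>
<name>dfs.adls.oauth2.refresh.url</name>
<value>https://login.microsoftonline.com/<tenant ID>/oauth2/token</value>
</property>
<property>
<name>dfs.adls.oauth2.credential</name>
<value><service principal key></value>
</property>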
hadoop.proxyuser.<proxy user>.groups
Defines the groups that the proxy user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.<proxy user>.users
Required for Azure HDInsight cluster that uses Enterprise Security Package and ADLS storage. Defines
the user account that the proxy user account can impersonate. On a secure cluster, the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to a single user account or set to a comma-separated list. If less security is preferred, use the
wildcard " * " to allow impersonation of any user.
hadoop.proxyuser.<proxy user>.hosts
Defines the host machines that a user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.
hadoop.proxyuser.yarn.groups
Comma-separated list of groups that you want to allow the YARN user to impersonate on a non-secure
cluster.
Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.yarn.hosts
Comma-separated list of hosts that you want to allow the YARN user to impersonate on a non-secure
cluster.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.
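For example, if the Informatica daemon runs as the system user infauser on a non-secure cluster, the proxy
user entries might look like the following. The user, group, and host values are placeholders:
<property>
<name>hadoop.proxyuser.infauser.groups</name>
<value>hadoop,users</value>
</property>
<property>
<name>hadoop.proxyuser.infauser.hosts</name>
<value>*</value>
</property>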
io.compression.codecs
hadoop.security.auth_to_local
Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.
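A sketch of a rule that strips the realm from principal names follows. The realm is a placeholder; add the
rules that your cluster requires before the DEFAULT rule:
<property>
<name>hadoop.security.auth_to_local</name>
<value>
RULE:[1:$1@$0](.*@MYREALM\.COM)s/@.*//
DEFAULT
</value>
</property>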
hbase-site.xml
Configure the following properties in the hbase-site.xml file:
hbase.use.dynamic.jars
Enables metadata import and test connection from the Developer tool. Required for an HDInsight cluster
that uses ADLS storage or an Amazon EMR cluster that uses HBase resources in S3 storage.
zookeeper.znode.parent
hive-site.xml
Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class
The token store implementation. Required for HiveServer2 high availability and load balancing.
hive.compactor.initiator.on
Runs the initiator and cleaner threads on metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.
hive.compactor.worker.threads
The number of worker threads to run in a metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.
hive.enforce.bucketing
Enables dynamic bucketing while loading to Hive. Required for an Update Strategy transformation in a
mapping that writes to a Hive target.
hive.exec.dynamic.partition
Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.
hive.exec.dynamic.partition.mode
Allows all partitions to be dynamic. Required for the Update Strategy transformation in a mapping that
writes to a Hive target. Also required if you use Sqoop and define a DDL query to create or replace a
partitioned Hive target at run time.
hive.support.concurrency
Enables table locking in Hive. Required for an Update Strategy transformation in a mapping that writes to
a Hive target.
hive.server2.support.dynamic.service.discovery
Enables HiveServer2 dynamic service discovery. Required for HiveServer2 high availability.
hive.server2.zookeeper.namespace
The value of the ZooKeeper namespace in the JDBC connection string. Required for HiveServer2 high
availability.
hive.txn.manager
Turns on transaction support. Required for an Update Strategy transformation in a mapping that writes
to a Hive target.
hive.zookeeper.quorum
Comma-separated list of ZooKeeper server host:ports in a cluster. The value of the ZooKeeper ensemble
in the JDBC connection string. Required for HiveServer2 high availability.
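For HiveServer2 high availability, the ZooKeeper-related entries typically look like the following. The host
names and namespace are placeholders:
<property>
<name>hive.server2.support.dynamic.service.discovery</name>
<value>true</value>
</property>
<property>
<name>hive.server2.zookeeper.namespace</name>
<value>hiveserver2</value>
</property>
<property>
<name>hive.zookeeper.quorum</name>
<value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>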
mapred-site.xml
Configure the following properties in the mapred-site.xml file:
mapreduce.framework.name
The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for
Sqoop.
yarn-site.xml
Configure the following properties in the yarn-site.xml file:
yarn.application.classpath
Add spark_shuffle.jar to the class path. The .jar file must contain the class
"org.apache.spark.network.yarn.YarnShuffleService."
yarn.nodemanager.resource.memory-mb
The maximum RAM available for each container. Set the maximum memory on the cluster to increase
resource memory available to the Blaze engine.
yarn.nodemanager.resource.cpu-vcores
The number of virtual cores for each container. Required for Blaze engine resource allocation.
yarn.scheduler.minimum-allocation-mb
The minimum RAM available for each container. Required for Blaze engine resource allocation.
yarn.nodemanager.vmem-check-enabled
Disables virtual memory limits for containers. Required for the Blaze and Spark engines.
yarn.nodemanager.aux-services
The auxiliary services that run on node manager hosts. Add the spark_shuffle service to support dynamic
resource allocation for the Spark engine.
yarn.nodemanager.aux-services.spark_shuffle.class
The class that implements the Spark shuffle auxiliary service. Set to
org.apache.spark.network.yarn.YarnShuffleService.
yarn.resourcemanager.scheduler.class
Defines the YARN scheduler that the Data Integration Service uses to assign resources.
yarn.node-labels.enabled
Enables YARN node labeling.
yarn.node-labels.fs-store.root-dir
The HDFS location where YARN stores node label information.
tez-site.xml
Configure the following properties in the tez-site.xml file:
tez.runtime.io.sort.mb
The sort buffer memory. Required when the output needs to be sorted for Blaze and Spark engines.
The following table describes the information that you need to provide to the Informatica administrator to
create the cluster configuration directly from the cluster:
Property Description
Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If you do
not provide a cluster name, the wizard imports information based on the default cluster.
Create a .zip or .tar file that contains the following *-site.xml files:
• core-site.xml
• hbase-site.xml. Required only to access HBase sources and targets.
• hdfs-site.xml
• hive-site.xml
• mapred-site.xml or tez-site.xml. Include the mapred-site.xml file or the tez-site.xml file based on the Hive
execution type used on the Hadoop cluster.
• yarn-site.xml
For example, the edited tez.task.launch.cluster-default.cmd-opts property value looks similar to the following:
<property>
<name>tez.task.launch.cluster-default.cmd-opts</name>
<value>-server -Djava.net.preferIPv4Stack=true -Dhdp.version=2.6.0.2-76</value>
</property>
A cluster configuration is an object in the domain that contains configuration information about the Hadoop
cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop
environment. Import configuration properties from the Hadoop cluster to create a cluster configuration.
The import process imports values from *-site.xml files into configuration sets based on the individual *-
site.xml files. When you perform the import, the cluster configuration wizard can create Hadoop, HBase,
HDFS, and Hive connections to access the Hadoop environment. If you choose to create the connections, the
wizard also associates the cluster configuration with the connections.
Note: If you are integrating for the first time and you imported the cluster configuration when you ran the
installer, you must re-create or refresh the cluster configuration.
If you import directly from the cluster, contact the Hadoop administrator to get cluster connection
information. If you import from a file, get an archive file of exported cluster information.
1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following General properties:
Property Description
Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster environment
variables, cluster path variables, and advanced properties. Based on the cluster
environment and the functionality that you use, you can add to the default values or
change the default values of these properties. For a list of Hadoop connection properties
to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.
4. Configure the following cluster connection properties:
Property Description
Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If
you do not provide a cluster name, the wizard imports information based on the default cluster.
5. Click Next and verify the cluster configuration information on the summary page.
Before you import from the cluster, you must get the archive file from the Hadoop administrator.
1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:
Property Description
Method to import the cluster configuration    Choose Import from file to import properties from an archive file.
Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster
environment variables, cluster path variables, and advanced properties. Based on the
cluster environment and the functionality that you use, you can add to the default values
or change the default values of these properties. For a list of Hadoop connection
properties to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.
4. Click Browse to select a file. Select the file and click Open.
5. Click Next and verify the cluster configuration information on the summary page.
If you upgraded from 10.2 and you changed the distribution version, you need to verify the distribution
version in the General properties of the cluster configuration.
Effective in version 10.2.1, Informatica assigns a default version to each Hadoop distribution type. If you
configure the cluster configuration to use the default version, the upgrade process upgrades to the
assigned default version if the version changes. If you have not upgraded your Hadoop distribution to
Informatica's default version, you need to update the distribution version property.
For example, suppose the assigned default Hadoop distribution version for 10.2.1 is n, and for 10.2.2 is n
+1. If the cluster configuration uses the default supported Hadoop version of n, the upgraded cluster
configuration uses the default version of n+1. If you have not upgraded the distribution in the Hadoop
environment, you need to change the cluster configuration to use version n.
If you configure the cluster configuration to use a distribution version that is not the default version, you
need to update the distribution version property in the following circumstances:
• Design-time. To import metadata, you can use the DataDirect drivers packaged with the Informatica
installer if they are available. If they are not available, use any Type 4 JDBC driver that the database
vendor recommends.
• Run-time. To run mappings, use any Type 4 JDBC driver that the database vendor recommends. Some
distributions support other drivers to use Sqoop connectors. You cannot use the DataDirect drivers for
run-time processing.
Copy the JDBC driver .jar files to the following location on the Developer tool machine:
1. Download Type 4 JDBC drivers associated with the JDBC-compliant databases that you want to access.
2. To optimize the Sqoop mapping performance on the Spark engine while writing data to an HDFS
complex file target of the Parquet format, download the following .jar files:
• parquet-hadoop-bundle-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-hadoop-bundle/1.6.0/
• parquet-avro-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-avro/1.6.0/
• parquet-column-1.5.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-column/1.5.0/
3. Copy all of the .jar files to the following directory on the machine where the Data Integration Service
runs:
<Informatica installation directory>\externaljdbcjars
Changes take effect after you recycle the Data Integration Service. At run time, the Data Integration
Service copies the .jar files to the Hadoop distribution cache so that the .jar files are accessible to all
nodes in the cluster.
Configure developerCore.ini
Edit developerCore.ini to successfully import local complex files available on the Developer tool machine.
When you import a complex file, such as Avro or Parquet, the imported object includes metadata associated
with the distribution in the Hadoop environment. If the file resides on the Developer tool machine, the import
process picks up the distribution information from the developerCore.ini file. You must edit the
developerCore.ini file to point to the distribution directory on the Developer tool machine.
Based on the version that you upgraded from, you might need to update the following types of objects:
Connections
Based on the version you are upgrading from, you might need to update Hadoop connections or replace
connections to the Hadoop environment.
The Hadoop connection contains additional properties. You need to manually update it to include
customized configuration in the hadoopEnv.properties file from previous versions.
Streaming mappings
A streaming mapping is invalid if it contains data objects or transformations for which support is deferred.
Support will be reinstated in a future release.
After you upgrade, the streaming mappings become invalid. You must re-create the physical data objects
to run the mappings on the Spark engine that uses Structured Streaming.
After you re-create the physical data objects to run the mappings on the Spark engine that uses Structured
Streaming, some properties are not available for Azure Event Hubs data objects.
Consider the following types of updates that you might need to make:
Configure the Hadoop connection.
Configure the Hadoop connection to incorporate properties from the hadoopEnv.properties file.
Replace connections.
If you chose the option to create connections when you ran the Cluster Configuration wizard, you need
to replace connections in mappings with the new connections.
If you did not create connections when you created the cluster configuration, you need to update the
connections.
When you run the Informatica upgrade, the installer backs up the existing hadoopEnv.properties file. You can
find the backup hadoopEnv.properties file in the following location:
Edit the Hadoop connection in the Administrator tool or the Developer tool to include any properties that you
manually configured in the hadoopEnv.properties file. The Hadoop connection contains default values for
properties such as cluster environment and path variables and advanced properties. You can update the
default values to match the properties in the hadoopEnv.properties file.
The method that you use to replace connections in mappings depends on the type of connection.
Hadoop connection
Review connections that you created in a previous release to update the values for connection
properties. For example, if you added nodes to the cluster or if you updated the distribution version, you
might need to verify host names, URIs, or port numbers for some of the properties.
The Hadoop, Hive, HDFS, and HBase connections must be associated with a cluster configuration.
Complete the following tasks:
1. Run infacmd isp listConnections to identify the connections that you need to upgrade. Use -ct
to list connections of a particular type.
2. Run infacmd isp UpdateConnection to associate the cluster configuration with the connection.
Use -cn to name the connection and -o clusterConfigID to associate the cluster configuration
with the connection.
For more information about infacmd, see the Informatica Command Reference.
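For example, the following commands list the Hadoop connections in a domain and then associate a cluster
configuration with one of them. The domain, user, connection, and cluster configuration names are
placeholders, and the -ct value and the -o option format are assumptions to verify in the Informatica Command
Reference:
infacmd isp listConnections -dn MyDomain -un Administrator -pd <password> -ct HADOOP
infacmd isp UpdateConnection -dn MyDomain -un Administrator -pd <password> -cn My_Hadoop_Connection -o clusterConfigID=MyClusterConfig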
After you upgrade, the existing streaming mappings become invalid because of the unavailable header ports,
the unsupported transformations or data objects, and the behavior change of some data objects.
• Re-create the physical data objects. After you re-create the physical data objects, the data objects get the
required header ports, such as timestamp, partitionID, or key based on the data object.
1. Go to the existing mapping and select the data object from the mapping.
2. Click the Properties tab. On the Column Projection tab, click Edit Schema.
3. Note the schema information from the Edit Schema dialog box.
4. Note the parameters information from the Parameters tab.
5. Create new physical data objects.
After you re-create the data objects, the physical data objects get the required header ports. Microsoft
Azure does not support the following properties, and they are not available for Azure Event Hubs data objects:
• Consumer Properties
• Partition Count
Transformation Updates
If a transformation uses a complex port, configure the type configuration property of the port because
the property refers to the physical data object that you replaced.
If a mapping contains an Aggregator transformation upstream from a Joiner transformation, move the
Aggregator transformation downstream from a Joiner transformation. Add a Window transformation
directly upstream from both Aggregator and Joiner transformations.
The following table lists the data object types to which the support is deferred to a future release:
Source    JMS
Source    MapR Streams
If you want to continue using the mappings that contain deferred data objects or transformations, you must
contact Informatica Global Customer Support.
Use the tasks in this chapter for any of the following integration scenarios:
• Integrate the Informatica domain with Cloudera CDH for the first time.
• Upgrade from version 10.2.1.
• Upgrade from version 10.2.
• Upgrade from a version earlier than 10.2.
Task Flow to Integrate with Cloudera CDH
The following diagram shows the task flow to integrate the Informatica domain with Cloudera CDH:
Note: If you are upgrading from a previous version, verify the properties and suggested values, as Big Data
Management might require additional properties or different values for existing properties.
Complete the following tasks to prepare the cluster before the Informatica administrator creates the cluster
configuration:
1. Verify property values in *-site.xml files that Big Data Management needs to run mappings in the Hadoop
environment.
2. Provide information to the Informatica administrator that is required to import cluster information into
the domain. Depending on the method of import, perform one of the following tasks:
• To import directly from the cluster, give the Informatica administrator cluster authentication
information to connect to the cluster.
• To import from an archive file, export cluster information and provide an archive file to the Big Data
Management administrator.
core-site.xml
Configure the following properties in the core-site.xml file:
fs.s3.enableServerSideEncryption
Enables server side encryption for S3 buckets. Required for SSE and SSE-KMS encryption.
fs.s3a.access.key
The ID for the Blaze and Spark engines to connect to the Amazon S3 file system.
fs.s3a.secret.key
The password for the Blaze and Spark engines to connect to the Amazon S3 file system.
fs.s3a.server-side-encryption-algorithm
The server-side encryption algorithm for S3. Required for SSE and SSE-KMS encryption. Set to the
encryption algorithm used.
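For example, the S3 entries in core-site.xml take a form similar to the following. The key values are
placeholders:
<property>
<name>fs.s3.enableServerSideEncryption</name>
<value>true</value>
</property>
<property>
<name>fs.s3a.access.key</name>
<value><your access key ID></value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value><your secret access key></value>
</property>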
hadoop.proxyuser.<proxy user>.groups
Defines the groups that the proxy user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.<proxy user>.hosts
Defines the host machines that a user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.
io.compression.codecs
hadoop.security.auth_to_local
Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.
hbase-site.xml
Configure the following properties in the hbase-site.xml file:
zookeeper.znode.parent
hdfs-site.xml
Configure the following properties in the hdfs-site.xml file:
dfs.encryption.key.provider.uri
The KeyProvider used to interact with encryption keys when reading and writing to an encryption zone.
Required if sources or targets reside in the HDFS encrypted zone on Java KeyStore KMS-enabled
Cloudera CDH cluster or a Ranger KMS-enabled Hortonworks HDP cluster.
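For example, on a cluster that runs Java KeyStore KMS, the value typically uses the kms scheme. The host name
and port are placeholders:
<property>
<name>dfs.encryption.key.provider.uri</name>
<value>kms://http@kms-host.example.com:16000/kms</value>
</property>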
hive-site.xml
Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class
The token store implementation. Required for HiveServer2 high availability and load balancing.
hive.exec.dynamic.partition
Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.
hive.exec.dynamic.partition.mode
Allows all partitions to be dynamic. Required if you use Sqoop and define a DDL query to create or
replace a partitioned Hive target at run time.
hiveserver2_load_balancer
mapred-site.xml
Configure the following properties in the mapred-site.xml file:
mapreduce.application.classpath
A comma-separated list of CLASSPATH entries for MapReduce applications. Required for Sqoop.
mapreduce.framework.name
The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for
Sqoop.
mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server. The default port is 10020. Required for Sqoop.
mapreduce.jobhistory.intermediate-done-dir
Directory where MapReduce jobs write history files. Required for Sqoop.
mapreduce.jobhistory.done-dir
Directory where the MapReduce JobHistory Server manages history files. Required for Sqoop.
mapreduce.jobhistory.principal
The Service Principal Name for the MapReduce JobHistory Server. Required for Sqoop.
mapreduce.jobhistory.webapp.address
Web address of the MapReduce JobHistory Server. The default value is 19888. Required for Sqoop.
yarn.app.mapreduce.am.staging-dir
The HDFS staging directory that MapReduce uses while submitting jobs. Required for Sqoop.
yarn-site.xml
Configure the following properties in the yarn-site.xml file:
yarn.application.classpath
Add spark_shuffle.jar to the class path. The .jar file must contain the class
"org.apache.spark.network.yarn.YarnShuffleService."
yarn.nodemanager.resource.memory-mb
The maximum RAM available for each container. Set the maximum memory on the cluster to increase
resource memory available to the Blaze engine.
yarn.nodemanager.resource.cpu-vcores
The number of virtual cores for each container. Required for Blaze engine resource allocation.
yarn.scheduler.minimum-allocation-mb
The minimum RAM available for each container. Required for Blaze engine resource allocation.
yarn.nodemanager.vmem-check-enabled
Disables virtual memory limits for containers. Required for the Blaze and Spark engines.
yarn.nodemanager.aux-services
The auxiliary services that run on node manager hosts. Add the spark_shuffle service to support dynamic
resource allocation for the Spark engine.
yarn.nodemanager.aux-services.spark_shuffle.class
The class that implements the Spark shuffle auxiliary service. Set to
org.apache.spark.network.yarn.YarnShuffleService.
yarn.resourcemanager.scheduler.class
Defines the YARN scheduler that the Data Integration Service uses to assign resources.
yarn.node-labels.enabled
Enables YARN node labeling.
yarn.node-labels.fs-store.root-dir
The HDFS location where YARN stores node label information.
The following table describes the information that you need to provide to the Informatica administrator to
create the cluster configuration directly from the cluster:
Property Description
Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If you do
not provide a cluster name, the wizard imports information based on the default cluster.
To find the correct Cloudera cluster name when you have multiple clusters, perform the following
steps:
1. Log in to Cloudera Manager and add the following string to the URL: /api/v8/clusters
2. Provide the Informatica administrator with the cluster property name that appears in the browser tab.
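For example, if Cloudera Manager runs on its default HTTP port, the resulting URL looks similar to the
following. The host name is a placeholder:
http://cm-host.example.com:7180/api/v8/clusters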
Create a .zip or .tar file that contains the following *-site.xml files:
• core-site.xml
• hbase-site.xml. Required only for access to HBase sources and targets.
• hdfs-site.xml
• hive-site.xml
• mapred-site.xml
• yarn-site.xml
Give the Informatica administrator access to the archive file to import the cluster information into the
domain.
A cluster configuration is an object in the domain that contains configuration information about the Hadoop
cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop
environment. Import configuration properties from the Hadoop cluster to create a cluster configuration.
The import process imports values from *-site.xml files into configuration sets based on the individual *-
site.xml files. When you perform the import, the cluster configuration wizard can create Hadoop, HBase,
HDFS, and Hive connections to access the Hadoop environment. If you choose to create the connections, the
wizard also associates the cluster configuration with the connections.
Note: If you are integrating for the first time and you imported the cluster configuration when you ran the
installer, you must re-create or refresh the cluster configuration.
If you import directly from the cluster, contact the Hadoop administrator to get cluster connection
information. If you import from a file, get an archive file of exported cluster information.
1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following General properties:
Property Description
Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster environment
variables, cluster path variables, and advanced properties. Based on the cluster
environment and the functionality that you use, you can add to the default values or
change the default values of these properties. For a list of Hadoop connection properties
to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.
4. Configure the following cluster connection properties:
Property Description
Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If
you do not provide a cluster name, the wizard imports information based on the default cluster.
5. Click Next and verify the cluster configuration information on the summary page.
Before you import from the cluster, you must get the archive file from the Hadoop administrator.
1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:
Property Description
Method to import the cluster configuration    Choose Import from file to import properties from an archive file.
Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster
environment variables, cluster path variables, and advanced properties. Based on the
cluster environment and the functionality that you use, you can add to the default values
or change the default values of these properties. For a list of Hadoop connection
properties to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.
4. Click Browse to select a file. Select the file and click Open.
5. Click Next and verify the cluster configuration information on the summary page.
If you upgraded from 10.2 and you changed the distribution version, you need to verify the distribution
version in the General properties of the cluster configuration.
Effective in version 10.2.1, Informatica assigns a default version to each Hadoop distribution type. If you
configure the cluster configuration to use the default version, the upgrade process upgrades to the
assigned default version if the version changes. If you have not upgraded your Hadoop distribution to
Informatica's default version, you need to update the distribution version property.
For example, suppose the assigned default Hadoop distribution version for 10.2.1 is n, and for 10.2.2 is n
+1. If the cluster configuration uses the default supported Hadoop version of n, the upgraded cluster
configuration uses the default version of n+1. If you have not upgraded the distribution in the Hadoop
environment, you need to change the cluster configuration to use version n.
If you configure the cluster configuration to use a distribution version that is not the default version, you
need to update the distribution version property in the following circumstances:
• Design-time. To import metadata, you can use the DataDirect drivers packaged with the Informatica
installer if they are available. If they are not available, use any Type 4 JDBC driver that the database
vendor recommends.
• Run-time. To run mappings, use any Type 4 JDBC driver that the database vendor recommends. Some
distributions support other drivers to use Sqoop connectors. You cannot use the DataDirect drivers for
run-time processing.
Copy the JDBC driver .jar files to the following location on the Developer tool machine:
1. Download Type 4 JDBC drivers associated with the JDBC-compliant databases that you want to access.
2. To use Sqoop TDCH Cloudera Connector Powered by Teradata, perform the following tasks:
• Download all .jar files in the Cloudera Connector Powered by Teradata package from the following
location: http://www.cloudera.com/downloads.html. The package has the following naming
convention: sqoop-connector-teradata-<version>.tar
• Download terajdbc4.jar and tdgssconfig.jar from the following location:
http://downloads.teradata.com/download/connectivity/jdbc-driver
3. To optimize the Sqoop mapping performance on the Spark engine while writing data to an HDFS
complex file target of the Parquet format, download the following .jar files:
• parquet-hadoop-bundle-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-hadoop-bundle/1.6.0/
• parquet-avro-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-avro/1.6.0/
• parquet-column-1.5.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-column/1.5.0/
4. Copy all of the .jar files to the following directory on the machine where the Data Integration Service
runs:
<Informatica installation directory>\externaljdbcjars
Changes take effect after you recycle the Data Integration Service. At run time, the Data Integration
Service copies the .jar files to the Hadoop distribution cache so that the .jar files are accessible to all
nodes in the cluster.
To connect to the Hadoop cluster to develop a mapping, the Developer tool requires security certificate
aliases on the machine that hosts the Developer tool. To run a mapping, the machine that hosts the Data
Integration Service requires these same certificate alias files.
Perform the following steps from the Developer tool host machine, and then repeat them from the Data
Integration Service host machine:
1. Run the following command to export the certificates from the cluster:
keytool -export -alias <alias name> -keystore <custom.truststore file location> -
file <exported certificate file location> -storepass <password>
For example,
keytool -export -alias <alias name> -keystore ~/custom.truststore -file ~/exported.cer
The command produces a certificate file.
2. Import the security certificates into an SSL-enabled domain or a domain that is not SSL-enabled by
using the following command:
keytool -import -trustcacerts -alias <alias name> -file <exported certificate file
location> -keystore <java cacerts location> -storepass <password>
For example,
keytool -import -alias <alias name> -file ~/exported.cer -keystore <Informatica installation directory>/java/jre/lib/security/cacerts
Configure developerCore.ini
Edit developerCore.ini to successfully import local complex files available on the Developer tool machine.
When you import a complex file, such as Avro or Parquet, the imported object includes metadata associated
with the distribution in the Hadoop environment. If the file resides on the Developer tool machine, the import
process picks up the distribution information from the developerCore.ini file. You must edit the
developerCore.ini file to point to the distribution directory on the Developer tool machine.
Based on the version that you upgraded from, you might need to update the following types of objects:
Connections
Based on the version you are upgrading from, you might need to update Hadoop connections or replace
connections to the Hadoop environment.
The Hadoop connection contains additional properties. You need to manually update it to include
customized configuration in the hadoopEnv.properties file from previous versions.
Streaming mappings
A streaming mapping is invalid if it contains data objects or transformations for which support is deferred.
Support will be reinstated in a future release.
After you upgrade, the streaming mappings become invalid. You must re-create the physical data objects
to run the mappings on the Spark engine that uses Structured Streaming.
After you re-create the physical data objects to run the mappings on the Spark engine that uses Structured
Streaming, some properties are not available for Azure Event Hubs data objects.
Update Connections
You might need to update connections based on the version you are upgrading from.
Consider the following types of updates that you might need to make:
Configure the Hadoop connection.
Configure the Hadoop connection to incorporate properties from the hadoopEnv.properties file.
Replace connections.
If you chose the option to create connections when you ran the Cluster Configuration wizard, you need
to replace connections in mappings with the new connections.
If you did not create connections when you created the cluster configuration, you need to update the
connections.
When you run the Informatica upgrade, the installer backs up the existing hadoopEnv.properties file. You can
find the backup hadoopEnv.properties file in the following location:
Edit the Hadoop connection in the Administrator tool or the Developer tool to include any properties that you
manually configured in the hadoopEnv.properties file. The Hadoop connection contains default values for
properties such as cluster environment and path variables and advanced properties. You can update the
default values to match the properties in the hadoopEnv.properties file.
The method that you use to replace connections in mappings depends on the type of connection.
Hadoop connection
For information about the infacmd commands, see the Informatica Command Reference.
Review connections that you created in a previous release to update the values for connection
properties. For example, if you added nodes to the cluster or if you updated the distribution version, you
might need to verify host names, URIs, or port numbers for some of the properties.
The Hadoop, Hive, HDFS, and HBase connections must be associated with a cluster configuration.
Complete the following tasks:
1. Run infacmd isp listConnections to identify the connections that you need to upgrade. Use -ct
to list connections of a particular type.
2. Run infacmd isp UpdateConnection to associate the cluster configuration with the connection.
Use -cn to name the connection and -o clusterConfigID to associate the cluster configuration
with the connection.
For more information about infacmd, see the Informatica Command Reference.
After you upgrade, the existing streaming mappings become invalid because of the unavailable header ports,
the unsupported transformations or data objects, and the behavior change of some data objects.
• Re-create the physical data objects. After you re-create the physical data objects, the data objects get the
required header ports, such as timestamp, partitionID, or key based on the data object.
• In a Normalizer transformation, if the Occurs column is set to Auto, re-create the Normalizer
transformation. You must re-create the Normalizer transformation because the type configuration
property of the complex port refers to the physical data object that you plan to replace.
• Update the streaming mapping. If the mapping contains Kafka target, Aggregator transformation, Joiner
transformation, or Normalizer transformation, replace the data object or transformation, and then update
the mapping because of the changed behavior of these transformations and data objects.
• Verify the deferred data object types. If the streaming mapping contains unsupported transformations or
data objects, contact Informatica Global Customer Support.
1. Go to the existing mapping and select the data object from the mapping.
2. Click the Properties tab. On the Column Projection tab, click Edit Schema.
3. Note the schema information from the Edit Schema dialog box.
4. Note the parameters information from the Parameters tab.
5. Create new physical data objects.
After you re-create the data objects, the physical data objects get the required header ports. Microsoft
Azure does not support the following properties, and they are not available for Azure Event Hubs data
objects:
• Consumer Properties
• Partition Count
Transformation Updates
If a transformation uses a complex port, configure the type configuration property of the port because
the property refers to the physical data object that you replaced.
If a mapping contains an Aggregator transformation upstream from a Joiner transformation, move the
Aggregator transformation downstream from a Joiner transformation. Add a Window transformation
directly upstream from both Aggregator and Joiner transformations.
The following table lists the data object types to which the support is deferred to a future release:
Source    JMS
Source    MapR Streams
If you want to continue using the mappings that contain deferred data objects or transformations, you must
contact Informatica Global Customer Support.
Use the tasks in this chapter for any of the following integration scenarios:
• Integrate the Informatica domain with Hortonworks HDP for the first time.
• Upgrade from version 10.2.1.
• Upgrade from version 10.2.
• Upgrade from a version earlier than 10.2.
Task Flow to Integrate with Hortonworks HDP
The following diagram shows the task flow to integrate the Informatica domain with Hortonworks HDP:
Note: If you are upgrading from a previous version, verify the properties and suggested values, as Big Data
Management might require additional properties or different values for existing properties.
Complete the following tasks to prepare the cluster before the Informatica administrator creates the cluster
configuration:
1. Verify property values in *-site.xml files that Big Data Management needs to run mappings in the Hadoop
environment.
2. Provide information to the Informatica administrator that is required to import cluster information into
the domain. Depending on the method of import, perform one of the following tasks:
• To import directly from the cluster, give the Informatica administrator cluster authentication
information to connect to the cluster.
• To import from an archive file, export cluster information and provide an archive file to the Big Data
Management administrator.
core-site.xml
Configure the following properties in the core-site.xml file:
fs.s3.enableServerSideEncryption
Enables server side encryption for S3 buckets. Required for SSE and SSE-KMS encryption.
fs.s3a.access.key
The ID for the Blaze and Spark engines to connect to the Amazon S3 file system.
fs.s3a.secret.key
The password for the Blaze and Spark engines to connect to the Amazon S3 file system.
fs.s3a.server-side-encryption-algorithm
The server-side encryption algorithm for S3. Required for SSE and SSE-KMS encryption. Set to the
encryption algorithm used.
hadoop.proxyuser.<proxy user>.groups
Defines the groups that the proxy user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.<proxy user>.hosts
Defines the host machines that a user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.
hadoop.proxyuser.yarn.groups
Comma-separated list of groups that you want to allow the YARN user to impersonate on a non-secure
cluster.
Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.yarn.hosts
Comma-separated list of hosts that you want to allow the YARN user to impersonate on a non-secure
cluster.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.
hadoop.security.auth_to_local
Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.
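For example, core-site.xml entries that allow an Informatica system user named infa_user to impersonate users from any group and any host, with a simple realm-stripping rule, might look similar to the following sketch. The user name infa_user and the EXAMPLE.COM realm are illustrative assumptions, not required values:
<property>
  <name>hadoop.proxyuser.infa_user.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.infa_user.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>RULE:[1:$1@$0](.*@EXAMPLE.COM)s/@.*//
DEFAULT</value>
</property>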
hbase-site.xml
Configure the following properties in the hbase-site.xml file:
zookeeper.znode.parent
hdfs-site.xml
Configure the following properties in the hdfs-site.xml file:
dfs.encryption.key.provider.uri
The KeyProvider used to interact with encryption keys when reading and writing to an encryption zone.
Required if sources or targets reside in the HDFS encrypted zone on Java KeyStore KMS-enabled
Cloudera CDH cluster or a Ranger KMS-enabled Hortonworks HDP cluster.
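For example, on a cluster that runs a KMS service, the key provider entry might look similar to the following sketch. The host and port are placeholders for your KMS service:
<property>
  <name>dfs.encryption.key.provider.uri</name>
  <value>kms://http@<kms_host>:<kms_port>/kms</value>
</property>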
hive-site.xml
Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class
The token store implementation. Required for HiveServer2 high availability and load balancing.
hive.compactor.initiator.on
Runs the initiator and cleaner threads on metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.
hive.compactor.worker.threads
The number of worker threads to run in a metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.
Set to: 1
hive.enforce.bucketing
Enables dynamic bucketing while loading to Hive. Required for an Update Strategy transformation in a
mapping that writes to a Hive target.
io.compression.codecs
hive.exec.dynamic.partition.mode
Allows all partitions to be dynamic. Required for the Update Strategy transformation in a mapping that
writes to a Hive target. Also required if you use Sqoop and define a DDL query to create or replace a
partitioned Hive target at run time.
hive.support.concurrency
Enables table locking in Hive. Required for an Update Strategy transformation in a mapping that writes to
a Hive target.
hive.server2.support.dynamic.service.discovery
Enables HiveServer2 dynamic service discovery. Required for HiveServer2 high availability.
hive.server2.zookeeper.namespace
The value of the ZooKeeper namespace in the JDBC connection string. Required for HiveServer2 high
availability.
hive.txn.manager
Turns on transaction support. Required for an Update Strategy transformation in a mapping that writes
to a Hive target.
hive.zookeeper.quorum
Comma-separated list of ZooKeeper server host:ports in a cluster. The value of the ZooKeeper ensemble
in the JDBC connection string. Required for HiveServer2 high availability.
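For example, hive-site.xml entries that support an Update Strategy transformation writing to a Hive target might look similar to the following sketch. The hive.compactor.worker.threads and hive.txn.manager values come from the property descriptions above; the remaining values are typical settings and should be confirmed against your cluster:
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.txn.manager</name>
  <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>
</property>
<property>
  <name>hive.compactor.initiator.on</name>
  <value>true</value>
</property>
<property>
  <name>hive.compactor.worker.threads</name>
  <value>1</value>
</property>
<property>
  <name>hive.enforce.bucketing</name>
  <value>true</value>
</property>
<property>
  <name>hive.exec.dynamic.partition.mode</name>
  <value>nonstrict</value>
</property>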
mapred-site.xml
Configure the following properties in the mapred-site.xml file:
mapreduce.framework.name
The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for
Sqoop.
yarn-site.xml
Configure the following properties in the yarn-site.xml file:
yarn.application.classpath
Add spark_shuffle.jar to the class path. The .jar file must contain the class
"org.apache.spark.network.yarn.YarnShuffleService."
yarn.nodemanager.resource.memory-mb
The maximum RAM available for each container. Set the maximum memory on the cluster to increase
resource memory available to the Blaze engine.
yarn.nodemanager.resource.cpu-vcores
The number of virtual cores for each container. Required for Blaze engine resource allocation.
yarn.scheduler.minimum-allocation-mb
The minimum RAM available for each container. Required for Blaze engine resource allocation.
yarn.nodemanager.vmem-check-enabled
Disables virtual memory limits for containers. Required for the Blaze and Spark engines.
yarn.nodemanager.aux-services
yarn.nodemanager.aux-services.spark_shuffle.class
yarn.resourcemanager.scheduler.class
Defines the YARN scheduler that the Data Integration Service uses to assign resources.
yarn.node-labels.enabled
yarn.node-labels.fs-store.root-dir
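For example, the shuffle service entries in yarn-site.xml might look similar to the following sketch. The mapreduce_shuffle entry is an assumption that the cluster already uses the default MapReduce shuffle service; retain whatever services are already configured:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>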
tez-site.xml
Configure the following properties in the tez-site.xml file:
tez.runtime.io.sort.mb
The sort buffer memory. Required when the output needs to be sorted for Blaze and Spark engines.
The following table describes the information that you need to provide to the Informatica administrator to
create the cluster configuration directly from the cluster:
Property Description
Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If you do
not provide a cluster name, the wizard imports information based on the default cluster.
The Hortonworks cluster configuration archive file must have the following contents:
• core-site.xml
• hbase-site.xml. Required only if you access HBase sources and targets.
• hdfs-site.xml
• hive-site.xml
For example, the edited tez.lib.uris property looks similar to the following:
<property>
<name>tez.lib.uris</name>
<value>/hdp/apps/2.5.0.0-1245/tez/tez.tar.gz</value>
</property>
A cluster configuration is an object in the domain that contains configuration information about the Hadoop
cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop
environment. Import configuration properties from the Hadoop cluster to create a cluster configuration.
The import process imports values from *-site.xml files into configuration sets based on the individual *-
site.xml files. When you perform the import, the cluster configuration wizard can create Hadoop, HBase,
HDFS, and Hive connections to access the Hadoop environment. If you choose to create the connections, the
wizard also associates the cluster configuration with the connections.
Note: If you are integrating for the first time and you imported the cluster configuration when you ran the
installer, you must re-create or refresh the cluster configuration.
If you import directly from the cluster, contact the Hadoop administrator to get cluster connection
information. If you import from a file, get an archive file of exported cluster information.
1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following General properties:
Property Description
Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster environment
variables, cluster path variables, and advanced properties. Based on the cluster
environment and the functionality that you use, you can add to the default values or
change the default values of these properties. For a list of Hadoop connection properties
to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.
Property Description
Cluster name Name of the cluster. Use the display name if the cluster manager manages multiple clusters. If
you do not provide a cluster name, the wizard imports information based on the default cluster.
5. Click Next and verify the cluster configuration information on the summary page.
Before you import from the cluster, you must get the archive file from the Hadoop administrator.
1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:
Property Description
Method to import the cluster configuration
Choose Import from file to import properties from an archive file.
Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster
environment variables, cluster path variables, and advanced properties. Based on the
cluster environment and the functionality that you use, you can add to the default values
or change the default values of these properties. For a list of Hadoop connection
properties to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.
4. Click Browse to select a file. Select the file and click Open.
5. Click Next and verify the cluster configuration information on the summary page.
If you upgraded from 10.2 and you changed the distribution version, you need to verify the distribution
version in the General properties of the cluster configuration.
Effective in version 10.2.1, Informatica assigns a default version to each Hadoop distribution type. If you
configure the cluster configuration to use the default version, the upgrade process upgrades to the
assigned default version if the version changes. If you have not upgraded your Hadoop distribution to
Informatica's default version, you need to update the distribution version property.
For example, suppose the assigned default Hadoop distribution version for 10.2.1 is n, and for 10.2.2 is n+1. If the cluster configuration uses the default supported Hadoop version of n, the upgraded cluster configuration uses the default version of n+1. If you have not upgraded the distribution in the Hadoop environment, you need to change the cluster configuration to use version n.
If you configure the cluster configuration to use a distribution version that is not the default version, you
need to update the distribution version property in the following circumstances:
• Design-time. To import metadata, you can use the DataDirect drivers packaged with the Informatica
installer if they are available. If they are not available, use any Type 4 JDBC driver that the database
vendor recommends.
• Run-time. To run mappings, use any Type 4 JDBC driver that the database vendor recommends. Some
distributions support other drivers to use Sqoop connectors. You cannot use the DataDirect drivers for
run-time processing.
Copy the JDBC driver .jar files to the following location on the Developer tool machine:
1. Download Type 4 JDBC drivers associated with the JDBC-compliant databases that you want to access.
2. To use Sqoop TDCH Hortonworks Connector for Teradata, perform the following task:
To connect to the Hadoop cluster to develop a mapping, the Developer tool requires security certificate
aliases on the machine that hosts the Developer tool. To run a mapping, the machine that hosts the Data
Integration Service requires these same certificate alias files.
Perform the following steps from the Developer tool host machine, and then repeat them from the Data
Integration Service host machine:
1. Run the following command to export the certificates from the cluster:
keytool -export -alias <alias name> -keystore <custom.truststore file location> -file <exported certificate file location> -storepass <password>
For example,
keytool -export -alias <alias name> -keystore ~/custom.truststore -file ~/exported.cer
The command produces a certificate file.
Configure developerCore.ini
Edit developerCore.ini to successfully import local complex files available on the Developer tool machine.
When you import a complex file, such as Avro or Parquet, the imported object includes metadata associated
with the distribution in the Hadoop environment. If the file resides on the Developer tool machine, the import
process picks up the distribution information from the developerCore.ini file. You must edit the
developerCore.ini file to point to the distribution directory on the Developer tool machine.
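For example, you might add a line similar to the following to developerCore.ini, where the value points to the Hadoop distribution folder installed with the Developer tool. The property name and folder pattern shown here are an assumption for illustration; use the distribution directory that exists on your machine:
-DINFA_HADOOP_DIST_DIR=hadoop\<distribution>_<version>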
Based on the version that you upgraded from, you might need to update the following types of objects:
Connections
Based on the version you are upgrading from, you might need to update Hadoop connections or replace
connections to the Hadoop environment.
The Hadoop connection contains additional properties. You need to manually update it to include
customized configuration in the hadoopEnv.properties file from previous versions.
Streaming mappings
The mapping contains deferred data objects or transformations. Support will be reinstated in a future release.
After you upgrade, the streaming mappings become invalid. You must re-create the physical data objects to run the mappings on the Spark engine that uses Structured Streaming.
After you re-create the physical data objects, some properties are not available for Azure Event Hubs data objects.
Update Connections
You might need to update connections based on the version you are upgrading from.
Consider the following types of updates that you might need to make:
Configure the Hadoop connection.
Configure the Hadoop connection to incorporate properties from the hadoopEnv.properties file.
Replace connections.
If you chose the option to create connections when you ran the Cluster Configuration wizard, you need
to replace connections in mappings with the new connections.
If you did not create connections when you created the cluster configuration, you need to update the
connections.
When you run the Informatica upgrade, the installer backs up the existing hadoopEnv.properties file. You can
find the backup hadoopEnv.properties file in the following location:
Edit the Hadoop connection in the Administrator tool or the Developer tool to include any properties that you
manually configured in the hadoopEnv.properties file. The Hadoop connection contains default values for
properties such as cluster environment and path variables and advanced properties. You can update the
default values to match the properties in the hadoopEnv.properties file.
The method that you use to replace connections in mappings depends on the type of connection.
Hadoop connection
Review connections that you created in a previous release to update the values for connection
properties. For example, if you added nodes to the cluster or if you updated the distribution version, you
might need to verify host names, URIs, or port numbers for some of the properties.
The Hadoop, Hive, HDFS, and HBase connections must be associated with a cluster configuration.
Complete the following tasks:
1. Run infacmd isp listConnections to identify the connections that you need to upgrade. Use -ct
to list connections of a particular type.
2. Run infacmd isp UpdateConnection to associate the cluster configuration with the connection.
Use -cn to name the connection and -o clusterConfigID to associate the cluster configuration
with the connection.
For more information about infacmd, see the Informatica Command Reference.
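For example, the following commands list the Hadoop connections in the domain and then associate a cluster configuration with one of them. The domain name, user name, connection name, and cluster configuration ID are placeholders for illustration:
infacmd isp listConnections -dn MyDomain -un Administrator -pd <password> -ct HADOOP
infacmd isp UpdateConnection -dn MyDomain -un Administrator -pd <password> -cn My_Hadoop_Connection -o clusterConfigID=My_Cluster_Config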
After you upgrade, the existing streaming mappings become invalid because of the unavailable header ports,
the unsupported transformations or data objects, and the behavior change of some data objects.
• Re-create the physical data objects. After you re-create the physical data objects, the data objects get the
required header ports, such as timestamp, partitionID, or key based on the data object.
• In a Normalizer transformation, if the Occurs column is set to Auto, re-create the Normalizer
transformation. You must re-create the Normalizer transformation because the type configuration
property of the complex port refers to the physical data object that you plan to replace.
• Update the streaming mapping. If the mapping contains Kafka target, Aggregator transformation, Joiner
transformation, or Normalizer transformation, replace the data object or transformation, and then update
the mapping because of the changed behavior of these transformations and data objects.
• Verify the deferred data object types. If the streaming mapping contains unsupported transformations or
data objects, contact Informatica Global Customer Support.
1. Go to the existing mapping, select the data object from the mapping.
2. Click the Properties tab. On the Column Projection tab, click Edit Schema.
3. Note the schema information from the Edit Schema dialog box.
4. Note the parameters information from the Parameters tab.
5. Create new physical data objects.
After you re-create the data objects, the physical data objects get the required header ports. Microsoft Azure does not support the following properties, so they are not available for Azure Event Hubs data objects:
• Consumer Properties
• Partition Count
Transformation Updates
If a transformation uses a complex port, configure the type configuration property of the port because
the property refers to the physical data object that you replaced.
If a mapping contains an Aggregator transformation upstream from a Joiner transformation, move the
Aggregator transformation downstream from a Joiner transformation. Add a Window transformation
directly upstream from both Aggregator and Joiner transformations.
The following table lists the data object types for which support is deferred to a future release:
Source: JMS, MapR Streams
If you want to continue using the mappings that contain deferred data objects or transformations, you must
contact Informatica Global Customer Support.
• Integrate the Informatica domain with MapR for the first time.
• Upgrade from version 10.2.1.
• Upgrade from version 10.2.
• Upgrade from a version earlier than 10.2.
Task Flow to Integrate with MapR
The following diagram shows the task flow to integrate the Informatica domain with MapR:
You install the MapR client on the Data Integration Service, Metadata Access Service, and Analyst Service
machines in the following directory:
/opt/mapr
For instructions about installing and configuring the MapR client, refer to the MapR documentation at
https://mapr.com/docs/60/AdvancedInstallation/SettingUptheClient-install-mapr-client.html.
Note: If you are upgrading from a previous version, verify the properties and suggested values, as Big Data
Management might require additional properties or different values for existing properties.
Complete the following tasks to prepare the cluster before the Informatica administrator creates the cluster
configuration:
1. Verify property values in *-site.xml files that Big Data Management needs to run mappings in the Hadoop
environment.
2. Prepare the archive file to import into the domain.
Note: You cannot import cluster information directly from the MapR cluster into the Informatica domain.
core-site.xml
Configure the following properties in the core-site.xml file:
fs.s3.enableServerSideEncryption
Enables server side encryption for S3 buckets. Required for SSE and SSE-KMS encryption.
fs.s3a.access.key
The ID for the Blaze and Spark engines to connect to the Amazon S3 file system.
fs.s3a.secret.key
The password for the Blaze and Spark engines to connect to the Amazon S3 file system.
fs.s3a.server-side-encryption-algorithm
The server-side encryption algorithm for S3. Required for SSE and SSE-KMS encryption. Set to the
encryption algorithm used.
hadoop.proxyuser.<proxy user>.groups
Defines the groups that the proxy user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.<proxy user>.hosts
Defines the host machines that a user account can impersonate. On a secure cluster the <proxy user> is
the Service Principal Name that corresponds to the cluster keytab file. On a non-secure cluster, the
<proxy user> is the system user that runs the Informatica daemon.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.
hadoop.proxyuser.yarn.groups
Comma-separated list of groups that you want to allow the YARN user to impersonate on a non-secure
cluster.
Set to group names of impersonation users separated by commas. If less security is preferred, use the
wildcard " * " to allow impersonation from any group.
hadoop.proxyuser.yarn.hosts
Comma-separated list of hosts that you want to allow the YARN user to impersonate on a non-secure
cluster.
Set to a single host name or IP address, or set to a comma-separated list. If less security is preferred,
use the wildcard " * " to allow impersonation from any host.
io.compression.codecs
hadoop.security.auth_to_local
Translates the principal names from the Active Directory and MIT realm into local names within the
Hadoop cluster. Based on the Hadoop cluster used, you can set multiple rules.
hbase-site.xml
Configure the following properties in the hbase-site.xml file:
zookeeper.znode.parent
hive-site.xml
Configure the following properties in the hive-site.xml file:
hive.cluster.delegation.token.store.class
The token store implementation. Required for HiveServer2 high availability and load balancing.
hive.compactor.initiator.on
Runs the initiator and cleaner threads on metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.
hive.compactor.worker.threads
The number of worker threads to run in a metastore instance. Required for an Update Strategy
transformation in a mapping that writes to a Hive target.
Set to: 1
hive.enforce.bucketing
Enables dynamic bucketing while loading to Hive. Required for an Update Strategy transformation in a
mapping that writes to a Hive target.
hive.exec.dynamic.partition
Enables dynamic partitioned tables for Hive tables. Applicable for Hive versions 0.9 and earlier.
hive.exec.dynamic.partition.mode
Allows all partitions to be dynamic. Required for the Update Strategy transformation in a mapping that
writes to a Hive target. Also required if you use Sqoop and define a DDL query to create or replace a
partitioned Hive target at run time.
hive.support.concurrency
Enables table locking in Hive. Required for an Update Strategy transformation in a mapping that writes to
a Hive target.
hive.server2.support.dynamic.service.discovery
Enables HiveServer2 dynamic service discovery. Required for HiveServer2 high availability.
hive.server2.zookeeper.namespace
The value of the ZooKeeper namespace in the JDBC connection string. Required for HiveServer2 high
availability.
hive.txn.manager
Turns on transaction support. Required for an Update Strategy transformation in a mapping that writes
to a Hive target.
Set to: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
hive.zookeeper.quorum
Comma-separated list of ZooKeeper server host:ports in a cluster. The value of the ZooKeeper ensemble
in the JDBC connection string. Required for HiveServer2 high availability.
mapred-site.xml
Configure the following properties in the mapred-site.xml file:
mapreduce.framework.name
The run-time framework to run MapReduce jobs. Values can be local, classic, or yarn. Required for
Sqoop.
mapreduce.jobhistory.address
Location of the MapReduce JobHistory Server. The default port is 10020. Required for Sqoop.
yarn.app.mapreduce.am.staging-dir
yarn-site.xml
Configure the following properties in the yarn-site.xml file:
yarn.application.classpath
Add spark_shuffle.jar to the class path. The .jar file must contain the class
"org.apache.spark.network.yarn.YarnShuffleService."
yarn.nodemanager.resource.memory-mb
The maximum RAM available for each container. Set the maximum memory on the cluster to increase
resource memory available to the Blaze engine.
yarn.nodemanager.resource.cpu-vcores
The number of virtual cores for each container. Required for Blaze engine resource allocation.
yarn.scheduler.minimum-allocation-mb
The minimum RAM available for each container. Required for Blaze engine resource allocation.
yarn.nodemanager.vmem-check-enabled
Disables virtual memory limits for containers. Required for the Blaze and Spark engines.
yarn.nodemanager.aux-services
yarn.nodemanager.aux-services.spark_shuffle.class
yarn.resourcemanager.scheduler.class
Defines the YARN scheduler that the Data Integration Service uses to assign resources.
yarn.node-labels.enabled
yarn.node-labels.fs-store.root-dir
Create an archive file that contains the following files from the cluster:
• core-site.xml
• hbase-site.xml. Required only if you access HBase sources and targets.
• hive-site.xml
• mapred-site.xml
• yarn-site.xml
Note: To import from MapR, the Informatica administrator must use an archive file.
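For example, assuming the *-site.xml files have been copied to a working directory on the cluster, you might create the archive with a command similar to the following. The archive name is a placeholder:
zip mapr_cluster_conf.zip core-site.xml hbase-site.xml hive-site.xml mapred-site.xml yarn-site.xml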
A cluster configuration is an object in the domain that contains configuration information about the Hadoop
cluster. The cluster configuration enables the Data Integration Service to push mapping logic to the Hadoop
environment. Import configuration properties from the Hadoop cluster to create a cluster configuration.
The import process imports values from *-site.xml files into configuration sets based on the individual *-
site.xml files. When you perform the import, the cluster configuration wizard can create Hadoop, HBase,
HDFS, and Hive connections to access the Hadoop environment. If you choose to create the connections, the
wizard also associates the cluster configuration with the connections.
Note: If you are integrating for the first time and you imported the cluster configuration when you ran the
installer, you must re-create or refresh the cluster configuration.
Before you import from the cluster, you must get the archive file from the Hadoop administrator.
1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:
Property Description
Method to import the cluster configuration
Choose Import from file to import properties from an archive file.
Create connections Choose to create Hadoop, HDFS, Hive, and HBase connections.
If you choose to create connections, the Cluster Configuration wizard associates the
cluster configuration with each connection that it creates.
The Hadoop connection contains default values for properties such as cluster
environment variables, cluster path variables, and advanced properties. Based on the
cluster environment and the functionality that you use, you can add to the default values
or change the default values of these properties. For a list of Hadoop connection
properties to configure, see “Configuring Hadoop Connection Properties” on page 188 .
If you do not choose to create connections, you must manually create them and associate
the cluster configuration with them.
Important: When the wizard creates the Hive connection, it populates the Metadata
Connection String and the Data Access Connection String properties with the value from
the hive.metastore.uris property. If the Hive metastore and HiveServer2 are running on
different nodes, you must update the Metadata Connection String to point to the
HiveServer2 host.
4. Click Browse to select a file. Select the file and click Open.
5. Click Next and verify the cluster configuration information on the summary page.
If you upgraded from 10.2 and you changed the distribution version, you need to verify the distribution
version in the General properties of the cluster configuration.
Effective in version 10.2.1, Informatica assigns a default version to each Hadoop distribution type. If you
configure the cluster configuration to use the default version, the upgrade process upgrades to the
assigned default version if the version changes. If you have not upgraded your Hadoop distribution to
Informatica's default version, you need to update the distribution version property.
For example, suppose the assigned default Hadoop distribution version for 10.2.1 is n, and for 10.2.2 is n
+1. If the cluster configuration uses the default supported Hadoop version of n, the upgraded cluster
configuration uses the default version of n+1. If you have not upgraded the distribution in the Hadoop
environment you need to change the cluster configuration to use version n.
If you configure the cluster configuration to use a distribution version that is not the default version, you
need to update the distribution version property in the following circumstances:
• Design-time. To import metadata, you can use the DataDirect drivers packaged with the Informatica
installer if they are available. If they are not available, use any Type 4 JDBC driver that the database
vendor recommends.
• Run-time. To run mappings, use any Type 4 JDBC driver that the database vendor recommends. Some
distributions support other drivers to use Sqoop connectors. You cannot use the DataDirect drivers for
run-time processing.
Copy the JDBC driver .jar files to the following location on the Developer tool machine:
1. Download Type 4 JDBC drivers associated with the JDBC-compliant databases that you want to access.
2. To use Sqoop TDCH MapR Connector for Teradata, download the following files:
• sqoop-connector-tdch-1.1-mapr-1707.jar from
http://repository.mapr.com/nexus/content/groups/mapr-public/org/apache/sqoop/connector/
sqoop-connector-tdch/1.1-mapr-1707/
• terajdbc4.jar and tdgssconfig.jar from
http://downloads.teradata.com/download/connectivity/jdbc-driver
• The MapR Connector for Teradata .jar file from the Teradata website.
3. To optimize the Sqoop mapping performance on the Spark engine while writing data to an HDFS
complex file target of the Parquet format, download the following .jar files:
• parquet-hadoop-bundle-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-hadoop-bundle/1.6.0/
• parquet-avro-1.6.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-avro/1.6.0/
• parquet-column-1.5.0.jar from
http://central.maven.org/maven2/com/twitter/parquet-column/1.5.0/
4. Copy all of the .jar files to the following directory on the machine where the Data Integration Service
runs:
<Informatica installation directory>\externaljdbcjars
Changes take effect after you recycle the Data Integration Service. At run time, the Data Integration
Service copies the .jar files to the Hadoop distribution cache so that the .jar files are accessible to all
nodes in the cluster.
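For example, assuming the driver files were downloaded to the current directory and Informatica is installed in /opt/informatica, the copy step might look similar to the following command. The installation path is an assumption:
cp terajdbc4.jar tdgssconfig.jar sqoop-connector-tdch-1.1-mapr-1707.jar /opt/informatica/externaljdbcjars/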
The Data Integration Service user requires an account on the MapR cluster and a MapR ticket on the
application service machines that require access to MapR. When the MapR cluster uses both Kerberos and
Ticket authentication, you generate a single ticket for the Data Integration Service user for both
authentication systems.
After you generate and save MapR tickets, you perform additional steps to configure the Data Integration
Service, the Metadata Access Service, and the Analyst Service to communicate with the MapR cluster.
Save the ticket on the machines that host the Data Integration Service, the Metadata Access Service, and the
Analyst Service. The Data Integration Service and the Analyst Service access the ticket at run time. The
Metadata Access Service accesses the ticket for the Developer tool at design time.
By default, the services access the ticket in the /tmp directory. If you save the ticket to any other location,
you must configure the MAPR_TICKETFILE_LOCATION environment variable in the service properties.
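For example, on a cluster that uses MapR Ticket authentication, you might generate a ticket for the Data Integration Service user with a command similar to the following, run as that user on each service machine. The maprlogin utility and the user name are assumptions based on a typical MapR setup; follow the MapR documentation for the authentication scheme that your cluster uses:
maprlogin password -user infa_dis_user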
In the Administrator tool Domain Navigator, select the Data Integration Service to configure, and then select
the Processes tab.
Property Value
JAVA_OPTS -Dhadoop.login=<MAPR_ECOSYSTEM_LOGIN_OPTS> -Dhttps.protocols=TLSv1.2
where <MAPR_ECOSYSTEM_LOGIN_OPTS> is the value of the
MAPR_ECOSYSTEM_LOGIN_OPTS property in the file /opt/mapr/conf/env.sh.
MAPR_HOME MapR client directory on the machine that runs the Data Integration Service.
For example, /opt/mapr
Required if you want to fetch a MapR Streams data object.
MAPR_TICKETFILE_LOCATION Required when the MapR cluster uses Kerberos or MapR Ticket authentication.
Location of the MapR ticket file if you saved it to a directory other than /tmp.
For example:
/export/home/username1/Keytabs_and_krb5conf/Tickets/project1/
maprticket_30103
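For example, if MAPR_ECOSYSTEM_LOGIN_OPTS in /opt/mapr/conf/env.sh resolves to hybrid, the environment variables might look similar to the following sketch. The hybrid value and the ticket path are illustrative assumptions:
JAVA_OPTS=-Dhadoop.login=hybrid -Dhttps.protocols=TLSv1.2
MAPR_HOME=/opt/mapr
MAPR_TICKETFILE_LOCATION=/export/home/username1/Keytabs_and_krb5conf/Tickets/project1/maprticket_30103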
Changes take effect when you restart the Data Integration Service.
In the Administrator tool Domain Navigator, select the Metadata Access Service to configure, and then select
the Processes tab.
In the Environment Variables area, configure the following property to define the Kerberos authentication
protocol:
Property Value
JAVA_OPTS -Dhadoop.login=<MAPR_ECOSYSTEM_LOGIN_OPTS> -Dhttps.protocols=TLSv1.2
where <MAPR_ECOSYSTEM_LOGIN_OPTS> is the value of the
MAPR_ECOSYSTEM_LOGIN_OPTS property in the file /opt/mapr/conf/env.sh.
MAPR_TICKETFILE_LOCATION Required when the MapR cluster uses Kerberos or MapR Ticket authentication.
Location of the MapR ticket file if you saved it to a directory other than /tmp.
For example,
/export/home/username1/Keytabs_and_krb5conf/Tickets/project1/
maprticket_30103
Changes take effect when you restart the Metadata Access Service.
In the Administrator tool Domain Navigator, select the Analyst Service to configure, then select the
Processes tab.
In the Environment Variables area, configure the following property to define the Kerberos authentication
protocol:
Property Value
MAPR_TICKETFILE_LOCATION Required when the MapR cluster uses Kerberos or MapR Ticket authentication.
Location of the MapR ticket file if you saved it to a directory other than /tmp.
For example,
/export/home/username1/Keytabs_and_krb5conf/Tickets/project1/
maprticket_30103
When you import a complex file, such as Avro or Parquet, the imported object includes metadata associated
with the distribution in the Hadoop environment. If the file resides on the Developer tool machine, the import
process picks up the distribution information from the developerCore.ini file. You must edit the
developerCore.ini file to point to the distribution directory on the Developer tool machine.
You can find developerCore.ini in the following directory: <Informatica installation directory>\clients\DeveloperClient
Based on the version that you upgraded from, you might need to update the following types of objects:
Connections
Based on the version you are upgrading from, you might need to update Hadoop connections or replace
connections to the Hadoop environment.
The Hadoop connection contains additional properties. You need to manually update it to include
customized configuration in the hadoopEnv.properties file from previous versions.
Streaming mappings
The mapping contains deferred data objects or transformations. Support will be reinstated in a future
release.
After you upgrade, the streaming mappings become invalid. You must re-create the physical data objects to run the mappings on the Spark engine that uses Structured Streaming.
After you re-create the physical data objects, some properties are not available for Azure Event Hubs data objects.
Update Connections
You might need to update connections based on the version you are upgrading from.
Consider the following types of updates that you might need to make:
Configure the Hadoop connection.
Configure the Hadoop connection to incorporate properties from the hadoopEnv.properties file.
Replace connections.
If you chose the option to create connections when you ran the Cluster Configuration wizard, you need
to replace connections in mappings with the new connections.
If you did not create connections when you created the cluster configuration, you need to update the
connections.
When you run the Informatica upgrade, the installer backs up the existing hadoopEnv.properties file. You can
find the backup hadoopEnv.properties file in the following location:
Edit the Hadoop connection in the Administrator tool or the Developer tool to include any properties that you
manually configured in the hadoopEnv.properties file. The Hadoop connection contains default values for
properties such as cluster environment and path variables and advanced properties. You can update the
default values to match the properties in the hadoopEnv.properties file.
The method that you use to replace connections in mappings depends on the type of connection.
Hadoop connection
For information about the infacmd commands, see the Informatica Command Reference.
Review connections that you created in a previous release to update the values for connection
properties. For example, if you added nodes to the cluster or if you updated the distribution version, you
might need to verify host names, URIs, or port numbers for some of the properties.
The Hadoop, Hive, HDFS, and HBase connections must be associated with a cluster configuration.
Complete the following tasks:
1. Run infacmd isp listConnections to identify the connections that you need to upgrade. Use -ct
to list connections of a particular type.
2. Run infacmd isp UpdateConnection to associate the cluster configuration with the connection.
Use -cn to name the connection and -o clusterConfigID to associate the cluster configuration
with the connection.
For more information about infacmd, see the Informatica Command Reference.
Chapter 8
Introduction to Databricks Integration
This chapter includes the following topics:
The Data Integration Service automatically installs the binaries required to integrate the Informatica domain
with the Databricks environment. The integration requires Informatica connection objects and cluster
configurations. A cluster configuration is a domain object that contains configuration parameters that you
import from the Databricks cluster. You then associate the cluster configuration with connections to access
the Databricks environment.
Perform the following tasks to integrate the Informatica domain with the Databricks environment:
The following image shows the components of the Informatica and the Databricks environments:
1. The Logical Data Transformation Manager translates the mapping into a Scala program, packages it as
an application, and sends it to the Databricks Engine Executor on the Data Integration Service machine.
2. The Databricks Engine Executor submits the application through REST API to the Databricks cluster,
requests to run the application, and stages files for access during run time.
3. The Databricks cluster passes the request to the Databricks Spark driver on the driver node.
4. The Databricks Spark driver distributes the job to one or more Databricks Spark executors that reside on
worker nodes.
5. The executors run the job and stage run-time data to the Databricks File System (DBFS) of the
workspace.
Native Environment
The integration with Databricks requires tools, services, and a repository database in the Informatica domain.
Administrator tool
Use the Administrator tool to manage the Informatica domain and application services. You can also create objects such as connections, cluster configurations, and cloud provisioning configurations to enable big data operations.
Developer tool
Use the Developer tool to import sources and targets and create mappings to run in the Databricks environment.
Application Services
The domain integration with Databricks uses the following services:
Data Integration Service
The Data Integration Service can process mappings in the native environment, or it can push the
processing to the Databricks environment. The Data Integration Service retrieves metadata from the
Model repository when you run a mapping.
Model Repository Service
The Model Repository Service manages the Model repository. All requests to save or access Model
repository metadata go through the Model repository.
Model Repository
The Model repository stores mappings that you create and manage in the Developer tool.
Databricks Environment
Integration with the Databricks environment includes the following components:
Databricks Spark engine
The Databricks run-time engine based on the open-source Apache Spark engine.
Databricks File System (DBFS)
A distributed file system installed on Databricks Runtime clusters. Run-time data is staged in the DBFS
and is persisted to a mounted Blob storage container.
Install and configure the Informatica services and the Developer tool. Verify that the domain contains a
Model Repository Service and a Data Integration Service.
Verify domain access to Databricks through one of the following methods:
• VPN is enabled between the Informatica domain and the Azure cloud network.
• The Informatica domain is installed within the Azure ecosystem.
Databricks distribution
Verify that the Databricks distribution is a version 5.1 standard concurrency distribution.
Verify that the DBFS has WASB storage or a mounted Blob storage container.
For more information about product requirements and supported platforms, see the Product Availability
Matrix on Informatica Network:
https://network.informatica.com/community/informatica-network/product-availability-matrices
When you submit a job to Databricks, it allocates resources to run the job. If it does not have enough
resources, it puts the job in a queue. Pending jobs fail if resources do not become available before the
timeout of 30 minutes.
You can configure preemption on the cluster to control the amount of resources that Databricks allocates to
each job, thereby allowing more jobs to run concurrently. You can also configure the timeout for the queue
and the interval at which the Databricks Spark engine checks for available resources.
Configure the following environment variables for the Databricks Spark engine:
spark.databricks.preemption.enabled
Enables preemption for the Databricks Spark engine. Set to true to enable preemption.
spark.databricks.preemption.threshold
A percentage of resources that are allocated to each submitted job. The job runs with the allocated
resources until completion. Default is 0.5, or 50 percent.
spark.databricks.preemption.timeout
The number of seconds that a job remains in the queue before failing. Default is 30.
Note: If you set a value higher than 1,800, Databricks ignores the value and uses the maximum timeout
of 1,800.
spark.databricks.preemption.interval
The number of seconds to check for available resources to assign to a job in the queue. Default is 5.
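For example, a Spark configuration on the Databricks cluster that enables preemption with the default threshold, timeout, and check interval described above might look similar to the following sketch:
spark.databricks.preemption.enabled true
spark.databricks.preemption.threshold 0.5
spark.databricks.preemption.timeout 30
spark.databricks.preemption.interval 5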
Important: Informatica integrates with Databricks, supporting standard concurrency clusters. Standard
concurrency clusters have a maximum queue time of 30 minutes, and jobs fail when the timeout is reached.
The maximum queue time cannot be extended. Setting the preemption threshold allows more jobs to run
concurrently, but with a lower percentage of allocated resources, the jobs can take longer to run. Also,
configuring the environment for preemption does not ensure that all jobs will run. In addition to configuring
preemption, you might choose to run cluster workflows to create ephemeral clusters that create the cluster,
If you use an account access key, add "spark.hadoop" as a prefix to the Hadoop configuration key as
shown in the following text:
spark.hadoop.fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net
<your-storage-account-access-key>
SAS token
If you use an SAS token, add "spark.hadoop" as a prefix to the Hadoop configuration key as shown in the
following text:
spark.hadoop.fs.azure.sas.<your-container-name>.<your-storage-account-
name>.blob.core.windows.net <complete-query-string-of-your-sas-for-the-container>
By default, the Data Integration Service writes the files to the DBFS directory /tmp.
If you create a staging directory, you configure this path in the Cluster Staging Directory property of the Data
Integration Service.
Optionally, you can create a directory on DBFS to stage temporary files during run time. By default, the Data
Integration Service uses the DBFS directory /<Cluster Staging Directory>/DATABRICKS.
Create a Databricks user to generate the authentication token. Complete the following tasks to prepare for
authentication.
Use the Administrator tool to import configuration properties from the Databricks cluster to create a cluster
configuration. You can import configuration properties from the cluster or from a file that contains cluster
properties. You can choose to create a Databricks connection when you perform the import.
Before you import the cluster configuration, get cluster information from the Databricks administrator.
1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:
Property Description
Databricks access token The token ID created within Databricks required for authentication.
Note: If the token has an expiration date, verify that you get a new token from the
Databricks administrator before it expires.
To create the .xml file for import, you must get required information from the Databricks administrator. You
can provide any name for the file and store it locally.
The following table describes the properties required to import the cluster information:
Optionally, you can include other properties specific to the Databricks environment.
When you complete the .xml file, compress it into a .zip or .tar file for import.
1. From the Connections tab, click the ClusterConfigurations node in the Domain Navigator.
2. From the Actions menu, select New > Cluster Configuration.
The Cluster Configuration wizard opens.
3. Configure the following properties:
Property Description
Upload configuration The full path and file name of the file. Click the Browse button to navigate to the
archive file file.
For information about the Databricks connection properties, see “Databricks Connection Properties” on page
159.
Connections
This appendix includes the following topics:
• Connections, 149
• Cloud Provisioning Configuration, 149
• Amazon Redshift Connection Properties, 155
• Amazon S3 Connection Properties, 156
• Cassandra Connection Properties, 158
• Databricks Connection Properties, 159
• Google Analytics Connection Properties, 161
• Google BigQuery Connection Properties, 161
• Google Cloud Spanner Connection Properties, 162
• Google Cloud Storage Connection Properties, 163
• Hadoop Connection Properties, 164
• HDFS Connection Properties, 169
• HBase Connection Properties, 171
• HBase Connection Properties for MapR-DB, 171
• Hive Connection Properties, 172
• JDBC Connection Properties, 175
• Kafka Connection Properties, 180
• Microsoft Azure Blob Storage Connection Properties, 181
• Microsoft Azure Cosmos DB SQL API Connection Properties, 182
• Microsoft Azure Data Lake Store Connection Properties, 183
• Microsoft Azure SQL Data Warehouse Connection Properties, 184
• Snowflake Connection Properties, 185
• Creating a Connection to Access Sources or Targets, 186
• Creating a Hadoop Connection, 186
• Configuring Hadoop Connection Properties, 188
Connections
Create a connection to access non-native environments, Hadoop and Databricks. If you access HBase, HDFS,
or Hive sources or targets in the Hadoop environment, you must also create those connections. You can
create the connections using the Developer tool, Administrator tool, and infacmd.
HBase connection
Create an HBase connection to access HBase. The HBase connection is a NoSQL connection.
HDFS connection
Create an HDFS connection to read data from or write data to the HDFS file system on a Hadoop cluster.
Hive connection
Create a Hive connection to access Hive as a source or target. You can access Hive as a source if the
mapping is enabled for the native or Hadoop environment. You can access Hive as a target if the
mapping runs on the Blaze engine.
JDBC connection
Create a JDBC connection and configure Sqoop properties in the connection to import and export
relational data through Sqoop.
Databricks connection
Create a Databricks connection to run mappings in the Databricks environment.
Note: For information about creating connections to other sources or targets such as social media web sites
or Teradata, see the respective PowerExchange adapter user guide for information.
The properties to populate depend on the Hadoop distribution you choose to build a cluster on. Choose one
of the following connection types:
• AWS Cloud Provisioning. Connects to an Amazon EMR cluster on Amazon Web Services.
• Azure Cloud Provisioning. Connects to an HDInsight cluster on the Azure platform.
• Databricks Cloud Provisioning. Connects to a Databricks cluster on the Azure Databricks platform.
AWS Cloud Provisioning Configuration Properties
The properties in the AWS cloud provisioning configuration enable the Data Integration Service to contact
and create resources on the AWS cloud platform.
General Properties
The following table describes cloud provisioning configuration general properties:
Property Description
AWS Access Key ID Optional. ID of the AWS access key, which AWS uses to control REST or HTTP query protocol
requests to AWS service APIs.
If you do not specify a value, Informatica attempts to follow the Default Credential Provider
Chain.
Region Region in which to create the cluster. This must be the region in which the VPC is running.
Use AWS region values. For a list of acceptable values, see AWS documentation.
Note: The region where you want to create the cluster can be different from the region in which
the Informatica domain is installed.
Permissions
The following table describes cloud provisioning configuration permissions properties:
Property Description
EMR Role Name of the service role for the EMR cluster that you create. The role must have sufficient
permissions to create a cluster, access S3 resources, and run jobs on the cluster.
When the AWS administrator creates this role, they select the “EMR” role. This contains the default
AmazonElasticMapReduceRole policy. You can edit the services in this policy.
EC2 Instance Name of the EC2 instance profile role that controls permissions on processes that run on the cluster.
Profile When the AWS administrator creates this role, they select the “EMR Role for EC2” role. This includes
S3 access by default.
Auto Scaling Required if you configure auto-scaling for the EMR cluster.
Role This role is created when the AWS administrator configures auto-scaling on any cluster in the VPC.
Default: When you leave this field blank, it is equivalent to setting the Auto Scaling role to “Proceed
without role” when the AWS administrator creates a cluster in the AWS console.
Property Description
EC2 Key Pair EC2 key pair to enable communication with the EMR cluster master node.
Optional. This credential enables you to log into the cluster. Configure this property if you intend
the cluster to be non-ephemeral.
EC2 Subnet ID of the subnet on the VPC in which to create the cluster.
Use the subnet ID of the EC2 instance where the cluster runs.
Master Security Optional. ID of the security group for the cluster master node. Acts as a virtual firewall to control
Group inbound and outbound traffic to cluster nodes.
Security groups are created when the AWS administrator creates and configures a cluster in a
VPC. In the AWS console, the property is equivalent to ElasticMapReduce-master.
You can use existing security groups, or the AWS administrator might create dedicated security
groups for the ephemeral cluster.
If you do not specify a value, the cluster applies the default security group for the VPC.
Additional Master Optional. IDs of additional security groups to attach to the cluster master node. Use a comma-
Security Groups separated list of security group IDs.
Core and Task Optional. ID of the security group for the cluster core and task nodes. When the AWS
Security Group administrator creates and configures a cluster in the AWS console, the property is equivalent to
the ElasticMapReduce-slave security group.
If you do not specify a value, the cluster applies the default security group for the VPC.
Additional Core Optional. IDs of additional security groups to attach to cluster core and task nodes. Use a
and Task Security comma-separated list of security group IDs.
Groups
Service Access EMR managed security group for service access. Required when you provision an EMR cluster in
Security Group a private subnet.
Authentication Details
The following table describes authentication properties to configure:
Property Description
ID ID of the cloud provisioning configuration. Default: Same as the cloud provisioning configuration
name.
Client ID A GUID string that is the same as the Application ID associated with the Service Principal. The
Service Principal must be assigned to a role that has permission to create resources in the
subscription that you identified in the Subscription ID property.
Client Secret An octet string that provides a key associated with the client ID.
The following table describes the information you need to configure Azure Data Lake Storage (ADLS) with the
HDInsight cluster:
Property Description
Azure Data Lake Store Name of the ADLS storage to access. The ADLS storage and the cluster to create must
Name reside in the same region.
Data Lake Service A credential that enables programmatic access to ADLS storage. Enables the Informatica
Principal Client ID domain to communicate with ADLS and run commands and mappings on the HDInsight
cluster.
The service principal is an Azure user that meets the following requirements:
- Permissions to access required directories in ADLS storage.
- Certificate-based authentication for ADLS storage.
- Key-based authentication for ADLS storage.
Data Lake Service The Base64 encoded text of the public certificate used with the service principal.
Principal Certificate Leave this property blank when you create the cloud provisioning configuration. After you
Contents save the cloud provisioning configuration, log in to the VM where the Informatica domain
is installed and run infacmd ccps updateADLSCertificate to populate this property.
Data Lake Service Private key for the service principal. This private key must be associated with the service
Principal Certificate principal certificate.
Password
Data Lake Service An octet string that provides a key associated with the service principal.
Principal Client Secret
Property Description
Azure Storage Name of the storage account to access. Get the value from the Storage Accounts node in the
Account Name Azure web console. The storage and the cluster to create must reside in the same region.
Azure Storage A key to authenticate access to the storage account. To get the value from the Azure web
Account Key console, select the storage account, then Access Keys. The console displays the account keys.
Property Description
Resource Resource group in which to create the cluster. A resource group is a logical set of Azure resources.
Group
Virtual Name of the virtual network or vnet where you want to create the cluster. Specify a vnet that resides
Network in the resource group that you specified in the Virtual Network Resource Group property.
The vnet must be in the same region as the region in which to create the cluster.
Subnet Name Subnet in which to create the cluster. The subnet must be a part of the vnet that you designated in
the previous property.
Each vnet can have one or more subnets. The Azure administrator can choose an existing subnet or
create one for the cluster.
You can use an external relational database like MySQL or Amazon RDS as the Hive metastore database. The
external database must be on the same cloud platform as the cluster to create.
If you do not specify an existing external database in this dialog box, the cluster creates its own database on
the cluster. This database is terminated when the cluster is terminated.
Property Description
Database User User name of the account for the domain to use to access the database.
Name
The following table describes the Databricks cloud provisioning configuration properties:
Property Description
Databricks token ID The token ID created within Databricks required for authentication.
Note: If the token has an expiration date, verify that you get a new token from the Databricks
administrator before it expires.
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the domain. You
can change this property after you create the connection. The name cannot exceed 128 characters,
contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It
must be 255 characters or less and must be unique in the domain. You cannot change this property
after you create the connection. Default value is the connection name.
Description The description of the connection. The description cannot exceed 4,000 characters.
The Details tab contains the connection attributes of the Amazon Redshift connection. The following table
describes the connection attributes:
Property Description
Schema Optional. Amazon Redshift schema name. Do not specify the schema name if you want to use
multiple schemas. The Data Object wizard displays all the user-defined schemas available for the
Amazon Redshift objects.
Default is public.
Master Symmetric Optional. Provide a 256-bit AES encryption key in the Base64 format when you enable client-side
Key encryption. You can generate a key using a third-party tool.
If you specify a value, ensure that you specify the encryption type as client side encryption in the
advanced target properties.
Customer Master Optional. Specify the customer master key ID or alias name generated by AWS Key Management
Key ID Service (AWS KMS). You must generate the customer master key corresponding to the region
where the Amazon S3 bucket resides. You can specify any of the following values:
Customer generated customer master key
Enables client-side or server-side encryption. Only the administrator user of the account can
use the default customer master key ID to enable client-side encryption.
Note: You can use customer master key ID when you run a mapping in the native environment or
on the Spark engine.
Note: If you upgrade mappings created in version 10.1.1 Update 2 or earlier, you must select the relevant
schema in the connection property. Otherwise, the mappings fail when you run them on the current version.
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the domain.
You can change this property after you create the connection. The name cannot exceed 128
characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + = { [ } ] |
\:;"'<,>.?/
ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive.
It must be 255 characters or less and must be unique in the domain. You cannot change this
property after you create the connection. Default value is the connection name.
Description Optional. The description of the connection. The description cannot exceed 4,000 characters.
Access Key The access key ID for access to Amazon account resources.
Note: Required if you do not use AWS Identity and Access Management (IAM) authentication.
Secret Key The secret access key for access to Amazon account resources. The secret key is associated with
the access key and uniquely identifies the account.
Note: Required if you do not use AWS Identity and Access Management (IAM) authentication.
Folder Path The complete path to Amazon S3 objects. The path must include the bucket name and any folder
name.
Do not use a slash at the end of the folder path. For example, <bucket name>/<my folder
name>.
Master Optional. Provide a 256-bit AES encryption key in the Base64 format when you enable client-side
Symmetric Key encryption. You can generate a master symmetric key using a third-party tool.
Customer Optional. Specify the customer master key ID or alias name generated by AWS Key Management
Master Key ID Service (AWS KMS). You must generate the customer master key for the same region where the
Amazon S3 bucket resides.
You can specify any of the following values:
Customer generated customer master key
Enables client-side or server-side encryption. Only the administrator user of the account can use
the default customer master key ID to enable client-side encryption.
Note: Applicable when you run a mapping in the native environment or on the Spark engine.
Region Name Select the AWS region in which the bucket you want to access resides.
Select one of the following regions:
- Asia Pacific (Mumbai)
- Asia Pacific (Seoul)
- Asia Pacific (Singapore)
- Asia Pacific (Sydney)
- Asia Pacific (Tokyo)
- AWS GovCloud (US)
- Canada (Central)
- China (Beijing)
- China (Ningxia)
- EU (Ireland)
- EU (Frankfurt)
- EU (London)
- EU (Paris)
- South America (Sao Paulo)
- US East (Ohio)
- US East (N. Virginia)
- US West (N. California)
- US West (Oregon)
Default is US East (N. Virginia).
Note: The order of the connection properties might vary depending on the tool where you view them.
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the domain.
You can change this property after you create the connection. The name cannot exceed 128
characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID String that the Data Integration Service uses to identify the connection.
The ID is not case sensitive. The ID must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection.
Default value is the connection name.
Description Optional. The description of the connection. The description cannot exceed 4,000 characters.
Password Password corresponding to the user name to access the Cassandra server.
SQL Identifier Character
Type of character that the database uses to enclose delimited identifiers in SQL or CQL queries. The
available characters depend on the database type.
Select None if the database uses regular identifiers. When the Data Integration Service generates
SQL or CQL queries, the service does not place delimited characters around any identifiers.
Select a character if the database uses delimited identifiers. When the Data Integration Service
generates SQL or CQL queries, the service encloses delimited identifiers within this character.
Additional Connection Properties
Enter one or more JDBC connection parameters in the following format:
<param1>=<value>;<param2>=<value>;<param3>=<value>
PowerExchange for Cassandra JDBC supports the following JDBC connection parameters:
- BinaryColumnLength
- DecimalColumnScale
- EnableCaseSensitive
- EnableNullInsert
- EnablePaging
- RowsPerPage
- StringColumnLength
- VTTableNameSeparator
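For example, to enable paging and set the page size, you might enter the following parameters. The values shown are illustrative, not defaults:
EnablePaging=true;RowsPerPage=10000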
A Databricks connection is a cluster type connection. You can create and manage a Databricks connection in
the Administrator tool or the Developer tool. You can use infacmd to create a Databricks connection.
Configure properties in the Databricks connection to enable communication between the Data Integration
Service and the Databricks cluster.
The following table describes the general connection properties for the Databricks connection:
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the
domain. You can change this property after you create the connection. The name cannot exceed
128 characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + =
{[}]|\:;"'<,>.?/
ID String that the Data Integration Service uses to identify the connection. The ID is not case
sensitive. It must be 255 characters or less and must be unique in the domain. You cannot change
this property after you create the connection. Default value is the connection name.
Description Optional. The description of the connection. The description cannot exceed 4,000 characters.
Cluster Name of the cluster configuration associated with the Databricks environment.
Configuration Required if you do not configure the cloud provisioning configuration.
Cloud Name of the cloud provisioning configuration associated with a Databricks cloud platform.
Provisioning Required if you do not configure the cluster configuration.
Configuration
Staging Directory The directory where the Databricks Spark engine stages run-time files.
If you specify a directory that does not exist, the Data Integration Service creates it at run time.
If you do not provide a directory path, the run-time staging files are written to /<cluster staging
directory>/DATABRICKS.
Advanced List of advanced properties that are unique to the Databricks environment.
Properties You can configure run-time properties for the Databricks environment in the Data Integration
Service and in the Databricks connection. You can override a property configured at a high level
by setting the value at a lower level. For example, if you configure a property in the Data
Integration Service custom properties, you can override it in the Databricks connection. The Data
Integration Service processes property overrides based on the following priorities:
1. Databricks connection advanced properties
2. Data Integration Service custom properties
Note: Informatica does not recommend changing these property values before you consult with
third-party documentation, Informatica documentation, or Informatica Global Customer Support. If
you change a value without knowledge of the property, you might experience performance
degradation or other unexpected results.
Advanced Properties
Configure the following properties in the Advanced Properties of the Databricks configuration section:
infaspark.json.parser.mode
Specifies how the parser handles corrupt JSON records. You can set the value to one of the following
modes:
infaspark.json.parser.multiLine
Specifies whether the parser can read a multiline record in a JSON file. You can set the value to true or
false. Default is false. Applies only to non-native distributions that use Spark version 2.2.x and above.
infaspark.flatfile.writer.nullValue
When the Databricks Spark engine writes to a target, it converts null values to empty strings (""). For
example, 12, AB,"",23p09udj.
The Databricks Spark engine can write the empty strings to string columns, but when it tries to write an
empty string to a non-string column, the mapping fails with a type mismatch.
To allow the Databricks Spark engine to convert the empty strings back to null values and write to the
target, configure the following advanced property in the Databricks Spark connection:
infaspark.flatfile.writer.nullValue=true
Note: The order of the connection properties might vary depending on the tool where you view them.
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the
domain. You can change this property after you create the connection. The name cannot exceed
128 characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + =
{[}]|\:;"'<,>.?/
ID String that the Data Integration Service uses to identify the connection.
The ID is not case sensitive. The ID must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection.
Default value is the connection name.
Description Optional. The description of the connection. The description cannot exceed 4,000 characters.
Service Account Specifies the client_email value present in the JSON file that you download after you create a
ID service account.
Service Account Specifies the private_key value present in the JSON file that you download after you create a
Key service account.
APIVersion API that PowerExchange for Google Analytics uses to read from Google Analytics reports.
Select Core Reporting API v3.
Note: PowerExchange for Google Analytics does not support Analytics Reporting API v4.
Note: The order of the connection properties might vary depending on the tool where you view them.
Property Description
Service Specifies the client_email value present in the JSON file that you download after you create a
Account ID service account in Google BigQuery.
Service Specifies the private_key value present in the JSON file that you download after you create a service
Account Key account in Google BigQuery.
Connection The mode that you want to use to read data from or write data to Google BigQuery.
mode Select one of the following connection modes:
- Simple. Flattens each field within the Record data type field as a separate field in the mapping.
- Hybrid. Displays all the top-level fields in the Google BigQuery table including Record data type
fields. PowerExchange for Google BigQuery displays the top-level Record data type field as a
single field of the String data type in the mapping.
- Complex. Displays all the columns in the Google BigQuery table as a single field of the String
data type in the mapping.
Default is Simple.
Schema Definition File Path
Specifies a directory on the client machine where the Data Integration Service must create a JSON file with the sample schema of the Google BigQuery table. The JSON file name is the same as the Google BigQuery table name.
Alternatively, you can specify a storage path in Google Cloud Storage where the Data Integration Service must create a JSON file with the sample schema of the Google BigQuery table. You can download the JSON file from the specified storage path in Google Cloud Storage to a local machine.
Project ID Specifies the project_id value present in the JSON file that you download after you create a service
account in Google BigQuery.
If you have created multiple projects with the same service account, enter the ID of the project that
contains the dataset that you want to connect to.
Storage Path This property applies when you read or write large volumes of data.
Path in Google Cloud Storage where the Data Integration Service creates a local stage file to store the
data temporarily.
You can either enter the bucket name or the bucket name and folder name.
For example, enter gs://<bucket_name> or gs://<bucket_name>/<folder_name>
Note: The order of the connection properties might vary depending on the tool where you view them.
The following table describes the Google Cloud Spanner connection properties:
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the
domain. You can change this property after you create the connection. The name cannot exceed
128 characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + =
{[}]|\:;"'<,>.?/
ID String that the Data Integration Service uses to identify the connection.
The ID is not case sensitive. The ID must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection.
Default value is the connection name.
Description Optional. The description of the connection. The description cannot exceed 4,000 characters.
Project ID Specifies the project_id value present in the JSON file that you download after you create a service
account.
If you have created multiple projects with the same service account, enter the ID of the project
that contains the bucket that you want to connect to.
Service Account Specifies the client_email value present in the JSON file that you download after you create a
ID service account.
Service Account Specifies the private_key value present in the JSON file that you download after you create a
Key service account.
Instance ID Name of the instance that you created in Google Cloud Spanner.
Note: The order of the connection properties might vary depending on the tool where you view them.
The following table describes the Google Cloud Storage connection properties:
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the
domain. You can change this property after you create the connection. The name cannot exceed
128 characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + =
{[}]|\:;"'<,>.?/
ID String that the Data Integration Service uses to identify the connection.
The ID is not case sensitive. The ID must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection.
Default value is the connection name.
Description Optional. The description of the connection. The description cannot exceed 4,000 characters.
Project ID Specifies the project_id value present in the JSON file that you download after you create a service
account.
If you have created multiple projects with the same service account, enter the ID of the project that
contains the bucket that you want to connect to.
Service Account Specifies the client_email value present in the JSON file that you download after you create a
ID service account.
Service Account Specifies the private_key value present in the JSON file that you download after you create a
Key service account.
The following table describes the general connection properties for the Hadoop connection:
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the
domain. You can change this property after you create the connection. The name cannot
exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID String that the Data Integration Service uses to identify the connection. The ID is not case
sensitive. It must be 255 characters or less and must be unique in the domain. You cannot
change this property after you create the connection. Default value is the connection name.
Description The description of the connection. Enter a string that you can use to identify the connection.
The description cannot exceed 4,000 characters.
Cluster The name of the cluster configuration associated with the Hadoop environment.
Configuration Required if you do not configure the Cloud Provisioning Configuration.
Cloud Provisioning Name of the cloud provisioning configuration associated with a cloud platform such as
Configuration Amazon AWS or Microsoft Azure.
Required if you do not configure the Cluster Configuration.
Cluster Library Path* The path for shared libraries on the cluster.
The $DEFAULT_CLUSTER_LIBRARY_PATH variable contains a list of default directories.
Cluster Classpath* The classpath to access the Hadoop jar files and the required libraries.
The $DEFAULT_CLUSTER_CLASSPATH variable contains a list of paths to the default jar files
and libraries.
You can configure run-time properties for the Hadoop environment in the Data Integration
Service, the Hadoop connection, and in the mapping. You can override a property configured at
a high level by setting the value at a lower level. For example, if you configure a property in the
Data Integration Service custom properties, you can override it in the Hadoop connection or in
the mapping. The Data Integration Service processes property overrides based on the
following priorities:
1. Mapping custom properties set using infacmd ms runMapping with the -cp option
2. Mapping run-time properties for the Hadoop environment
3. Hadoop connection advanced properties for run-time engines
4. Hadoop connection advanced general properties, environment variables, and classpaths
5. Data Integration Service custom properties
* Informatica does not recommend changing these property values before you consult with third-party documentation,
Informatica documentation, or Informatica Global Customer Support. If you change a value without knowledge of the
property, you might experience performance degradation or other unexpected results.
Property Description
Impersonation User Name
Required if the Hadoop cluster uses Kerberos authentication. Hadoop impersonation user. The user name that the Data Integration Service impersonates to run mappings in the Hadoop environment.
The Data Integration Service runs mappings based on the user that is configured. Refer to the following order to determine which user the Data Integration Service uses to run mappings:
1. Operating system profile user. The mapping runs with the operating system profile user if the
profile user is configured. If there is no operating system profile user, the mapping runs with
the Hadoop impersonation user.
2. Hadoop impersonation user. The mapping runs with the Hadoop impersonation user if the
operating system profile user is not configured. If the Hadoop impersonation user is not
configured, the Data Integration Service runs mappings with the Data Integration Service user.
3. Informatica services user. The mapping runs with the operating user that starts the
Informatica daemon if the operating system profile user and the Hadoop impersonation user
are not configured.
Temporary Table Compression Codec
Hadoop compression library for a compression codec class name.
Note: The Spark engine does not support compression settings for temporary tables. When you run mappings on the Spark engine, the Spark engine stores temporary tables in an uncompressed file format.
Codec Class Name
Codec class name that enables data compression and improves performance on temporary staging tables.
Hive Staging Database Name
Namespace for Hive staging tables. Use the name default for tables that do not have a specified database name.
If you do not configure a namespace, the Data Integration Service uses the Hive database name
in the Hive target connection to create staging tables.
When you run a mapping in the native environment to write data to Hive, you must configure the
Hive staging database name in the Hive connection. The Data Integration Service ignores the
value you configure in the Hadoop connection.
Advanced List of advanced properties that are unique to the Hadoop environment. The properties are
Properties common to the Blaze and Spark engines. The advanced properties include a list of default
properties.
You can configure run-time properties for the Hadoop environment in the Data Integration
Service, the Hadoop connection, and in the mapping. You can override a property configured at a
high level by setting the value at a lower level. For example, if you configure a property in the
Data Integration Service custom properties, you can override it in the Hadoop connection or in
the mapping. The Data Integration Service processes property overrides based on the following
priorities:
1. Mapping custom properties set using infacmd ms runMapping with the -cp option
2. Mapping run-time properties for the Hadoop environment
3. Hadoop connection advanced properties for run-time engines
4. Hadoop connection advanced general properties, environment variables, and classpaths
5. Data Integration Service custom properties
Note: Informatica does not recommend changing these property values before you consult with
third-party documentation, Informatica documentation, or Informatica Global Customer Support.
If you change a value without knowledge of the property, you might experience performance
degradation or other unexpected results.
Property Description
Write Reject Files to Hadoop
If you use the Blaze engine to run mappings, select the check box to specify a location to move reject files. If checked, the Data Integration Service moves the reject files to the HDFS location listed in the Reject File Directory property.
By default, the Data Integration Service stores the reject files based on the RejectDir system parameter.
Reject File Directory
The directory for Hadoop mapping files on HDFS when you run mappings.
Blaze Configuration
The following table describes the connection properties that you configure for the Blaze engine:
Property Description
Blaze Staging The HDFS file path of the directory that the Blaze engine uses to store temporary files. Verify that
Directory the directory exists. The YARN user, Blaze engine user, and mapping impersonation user must have
write permission on this directory.
Default is /blaze/workdir. If you clear this property, the staging files are written to the Hadoop
staging directory /tmp/blaze_<user name>.
Blaze User The owner of the Blaze service and Blaze service logs.
Name When the Hadoop cluster uses Kerberos authentication, the default user is the Data Integration
Service SPN user. When the Hadoop cluster does not use Kerberos authentication and the Blaze
user is not configured, the default user is the Data Integration Service user.
Minimum Port The minimum value for the port number range for the Blaze engine. Default is 12300.
Maximum Port The maximum value for the port number range for the Blaze engine. Default is 12600.
YARN Queue The YARN scheduler queue name used by the Blaze engine that specifies available resources on a
Name cluster.
Blaze Job Monitor Address
The host name and port number for the Blaze Job Monitor.
Use the following format:
<hostname>:<port>
Where
- <hostname> is the host name or IP address of the Blaze Job Monitor server.
- <port> is the port on which the Blaze Job Monitor listens for remote procedure calls (RPC).
For example, enter: myhostname:9080
Blaze YARN Node label that determines the node on the Hadoop cluster where the Blaze engine runs. If you do
Node Label not specify a node label, the Blaze engine runs on the nodes in the default partition.
If the Hadoop cluster supports logical operators for node labels, you can specify a list of node
labels. To list the node labels, use the operators && (AND), || (OR), and ! (NOT).
Advanced List of advanced properties that are unique to the Blaze engine. The advanced properties include a
Properties list of default properties.
You can configure run-time properties for the Hadoop environment in the Data Integration Service,
the Hadoop connection, and in the mapping. You can override a property configured at a high level
by setting the value at a lower level. For example, if you configure a property in the Data Integration
Service custom properties, you can override it in the Hadoop connection or in the mapping. The Data
Integration Service processes property overrides based on the following priorities:
1. Mapping custom properties set using infacmd ms runMapping with the -cp option
2. Mapping run-time properties for the Hadoop environment
3. Hadoop connection advanced properties for run-time engines
4. Hadoop connection advanced general properties, environment variables, and classpaths
5. Data Integration Service custom properties
Note: Informatica does not recommend changing these property values before you consult with
third-party documentation, Informatica documentation, or Informatica Global Customer Support. If
you change a value without knowledge of the property, you might experience performance
degradation or other unexpected results.
Spark Configuration
The following table describes the connection properties that you configure for the Spark engine:
Property Description
Spark Staging The HDFS file path of the directory that the Spark engine uses to store temporary files for running
Directory jobs. The YARN user, Data Integration Service user, and mapping impersonation user must have
write permission on this directory.
If you do not specify a file path, by default, the temporary files are written to the Hadoop staging
directory /tmp/SPARK_<user name>.
When you run Sqoop jobs on the Spark engine, the Data Integration Service creates a Sqoop staging
directory within the Spark staging directory to store temporary files: <Spark staging
directory>/sqoop_staging
Spark Event Optional. The HDFS file path of the directory that the Spark engine uses to log events.
Log Directory
YARN Queue The YARN scheduler queue name used by the Spark engine that specifies available resources on a
Name cluster. The name is case sensitive.
Advanced List of advanced properties that are unique to the Spark engine. The advanced properties include a
Properties list of default properties.
You can configure run-time properties for the Hadoop environment in the Data Integration Service,
the Hadoop connection, and in the mapping. You can override a property configured at a high level
by setting the value at a lower level. For example, if you configure a property in the Data Integration
Service custom properties, you can override it in the Hadoop connection or in the mapping. The Data
Integration Service processes property overrides based on the following priorities:
1. Mapping custom properties set using infacmd ms runMapping with the -cp option
2. Mapping run-time properties for the Hadoop environment
3. Hadoop connection advanced properties for run-time engines
4. Hadoop connection advanced general properties, environment variables, and classpaths
5. Data Integration Service custom properties
Note: Informatica does not recommend changing these property values before you consult with
third-party documentation, Informatica documentation, or Informatica Global Customer Support. If
you change a value without knowledge of the property, you might experience performance
degradation or other unexpected results.
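For example, a minimal sketch of an entry in the Spark engine advanced properties that raises the executor memory. The property name is the standard Spark setting and the value is illustrative:
spark.executor.memory=4G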
Note: The order of the connection properties might vary depending on the tool where you view them.
Property Description
Name Name of the connection. The name is not case sensitive and must be unique within the domain. The
name cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive.
It must be 255 characters or less and must be unique in the domain. You cannot change this
property after you create the connection. Default value is the connection name.
Description The description of the connection. The description cannot exceed 765 characters.
Location The domain where you want to create the connection. Not valid for the Analyst tool.
HDFS hdfs://<namenode>:<port>
where:
- <namenode> is the host name or IP address of the NameNode.
- <port> is the port on which the NameNode listens for remote procedure calls (RPC).
hdfs://<nameservice> in case of NameNode high availability.
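For example, for a highly available cluster with a name service called nameservice1 (an assumed name), you might enter:
hdfs://nameservice1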
MapR-FS maprfs:///
WASB in HDInsight
wasb://<container_name>@<account_name>.blob.core.windows.net/<path>
where:
- <container_name> identifies a specific Azure Storage Blob container.
Note: <container_name> is optional.
- <account_name> identifies the Azure Storage Blob object.
Example:
wasb://infabdmoffering1storage.blob.core.windows.net/infabdmoffering1cluster/mr-history
ADLS in HDInsight
adl://home
When you create a cluster configuration from an Azure HDInsight cluster, the cluster configuration uses
either ADLS or WASB as the primary storage. You cannot create a cluster configuration with ADLS or WASB
as the secondary storage. You can edit the NameNode URI property in the HDFS connection to connect to a
local HDFS location.
Property Description
Name The name of the connection. The name is not case sensitive and must be unique
within the domain. You can change this property after you create the connection.
The name cannot exceed 128 characters, contain spaces, or contain the following
special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID String that the Data Integration Service uses to identify the connection. The ID is
not case sensitive. It must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection. Default
value is the connection name.
Description The description of the connection. The description cannot exceed 4,000
characters.
The following table describes the HBase connection properties for MapR-DB:
Property Description
Name Name of the connection. The name is not case sensitive and must be unique
within the domain. You can change this property after you create the connection.
The name cannot exceed 128 characters, contain spaces, or contain the following
special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID String that the Data Integration Service uses to identify the connection. The ID is
not case sensitive. It must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection. Default
value is the connection name.
Description Description of the connection. The description cannot exceed 4,000 characters.
Cluster Configuration The name of the cluster configuration associated with the Hadoop environment.
MapR-DB Database Path Database path that contains the MapR-DB table that you want to connect to. Enter
a valid MapR cluster path.
When you create an HBase data object for MapR-DB, you can browse only tables
that exist in the MapR-DB path that you specify in the Database Path field. You
cannot access tables that are available in sub-directories in the specified path.
For example, if you specify the path as /user/customers/, you can access the
tables in the customers directory. However, if the customers directory contains
a sub-directory named regions, you cannot access the tables in the following
directory:
/user/customers/regions
Note: The order of the connection properties might vary depending on the tool where you view them.
Property Description
Name The name of the connection. The name is not case sensitive and must be unique
within the domain. You can change this property after you create the connection.
The name cannot exceed 128 characters, contain spaces, or contain the following
special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID String that the Data Integration Service uses to identify the connection. The ID is
not case sensitive. It must be 255 characters or less and must be unique in the
domain. You cannot change this property after you create the connection. Default
value is the connection name.
Description The description of the connection. The description cannot exceed 4,000
characters.
Location The domain where you want to create the connection. Not valid for the Analyst
tool.
LDAP username LDAP user name of the user that the Data Integration Service impersonates to run
mappings on a Hadoop cluster. The user name depends on the JDBC connection
string that you specify in the Metadata Connection String or Data Access
Connection String for the native environment.
If the Hadoop cluster uses Kerberos authentication, the principal name for the
JDBC connection string and the user name must be the same. Otherwise, the user
name depends on the behavior of the JDBC driver. With Hive JDBC driver, you can
specify a user name in many ways and the user name can become a part of the
JDBC URL.
If the Hadoop cluster does not use Kerberos authentication, the user name
depends on the behavior of the JDBC driver.
If you do not specify a user name, the Hadoop cluster authenticates jobs based on
the following criteria:
- The Hadoop cluster does not use Kerberos authentication. It authenticates jobs
based on the operating system profile user name of the machine that runs the
Data Integration Service.
- The Hadoop cluster uses Kerberos authentication. It authenticates jobs based
on the SPN of the Data Integration Service. The LDAP username is ignored.
Environment SQL SQL commands to set the Hadoop environment. In the native environment, the
Data Integration Service executes the environment SQL each time it creates a
connection to a Hive metastore. If you use the Hive connection to run profiles on
a Hadoop cluster, the Data Integration Service executes the environment SQL at
the beginning of each Hive session.
The following rules and guidelines apply to the usage of environment SQL in both
connection modes:
- Use the environment SQL to specify Hive queries.
- Use the environment SQL to set the classpath for Hive user-defined functions
and then use environment SQL or PreSQL to specify the Hive user-defined
functions. You cannot use PreSQL in the data object properties to specify the
classpath. If you use Hive user-defined functions, you must copy the .jar files to
the following directory:
<Informatica installation directory>/services/shared/hadoop/
<Hadoop distribution name>/extras/hive-auxjars
- You can use environment SQL to define Hadoop or Hive parameters that you
want to use in the PreSQL commands or in custom queries.
- If you use multiple values for the Environment SQL property, ensure that there is
no space between the values.
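For example, a minimal sketch of an environment SQL entry that sets a Hive parameter for use in PreSQL commands or custom queries. The parameter and value are illustrative:
SET hive.exec.dynamic.partition.mode=nonstrict;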
SQL Identifier Character The type of character used to identify special characters and reserved SQL
keywords, such as WHERE. The Data Integration Service places the selected
character around special characters and reserved SQL keywords. The Data
Integration Service also uses this character for the Support mixed-case
identifiers property.
Property Description
JDBC Driver Name of the Hive JDBC driver class. If you leave this option blank, the Developer tool uses the
Class Name default Apache Hive JDBC driver shipped with the distribution. If the default Apache Hive JDBC
driver does not fit your requirements, you can override the Apache Hive JDBC driver with a third-
party Hive JDBC driver by specifying the driver class name.
Metadata Connection String
The JDBC connection URI used to access the metadata from the Hadoop server.
You can use PowerExchange for Hive to communicate with a HiveServer service or HiveServer2 service. To connect to HiveServer, specify the connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>
Where
- <hostname> is name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database name to which you want to connect. If you do not provide the database
name, the Data Integration Service uses the default database details.
To connect to HiveServer2, use the connection string format that Apache Hive implements for
that specific Hadoop Distribution. For more information about Apache Hive connection string
formats, see the Apache Hive documentation.
For user impersonation, you must add hive.server2.proxy.user=<xyz> to the JDBC
connection URI. If you do not configure user impersonation, the current user's credentials are
used to connect to HiveServer2.
If the Hadoop cluster uses SSL or TLS authentication, you must add ssl=true to the JDBC
connection URI. For example: jdbc:hive2://<hostname>:<port>/<db>;ssl=true
If you use a self-signed certificate for SSL or TLS authentication, ensure that the certificate file is
available on the client machine and the Data Integration Service machine. For more information,
see the Informatica Big Data Management Integration Guide.
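For example, a connection string that combines SSL and user impersonation might look like the following sketch. The host, port, database, and user are illustrative values:
jdbc:hive2://hivehost.example.com:10000/default;ssl=true;hive.server2.proxy.user=etl_user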
Bypass Hive JDBC Server
JDBC driver mode. Select the check box to use the embedded JDBC driver mode.
To use the JDBC embedded mode, perform the following tasks:
- Verify that Hive client and Informatica services are installed on the same machine.
- Configure the Hive connection properties to run mappings on a Hadoop cluster.
If you choose the non-embedded mode, you must configure the Data Access Connection String.
Informatica recommends that you use the JDBC embedded mode.
Fine Grained Authorization
When you select the option to observe fine grained authorization in a Hive source, the mapping observes the following:
- Row and column level restrictions. Applies to Hadoop clusters where Sentry or Ranger security
modes are enabled.
- Data masking rules. Applies to masking rules set on columns containing sensitive data by
Dynamic Data Masking.
If you do not select the option, the Blaze and Spark engines ignore the restrictions and masking
rules, and results include restricted or sensitive data.
Data Access Connection String
The connection string to access data from the Hadoop data store. To connect to HiveServer, specify the non-embedded JDBC mode connection string in the following format:
jdbc:hive2://<hostname>:<port>/<db>
Where
- <hostname> is name or IP address of the machine on which HiveServer2 runs.
- <port> is the port number on which HiveServer2 listens.
- <db> is the database to which you want to connect. If you do not provide the database name,
the Data Integration Service uses the default database details.
To connect to HiveServer2, use the connection string format that Apache Hive implements for the
specific Hadoop Distribution. For more information about Apache Hive connection string formats,
see the Apache Hive documentation.
For user impersonation, you must add hive.server2.proxy.user=<xyz> to the JDBC
connection URI. If you do not configure user impersonation, the current user's credentials are
used to connect to HiveServer2.
If the Hadoop cluster uses SSL or TLS authentication, you must add ssl=true to the JDBC
connection URI. For example: jdbc:hive2://<hostname>:<port>/<db>;ssl=true
If you use a self-signed certificate for SSL or TLS authentication, ensure that the certificate file is
available on the client machine and the Data Integration Service machine. For more information,
see the Informatica Big Data Management Integration Guide.
Hive Staging Directory on HDFS
HDFS directory for Hive staging tables. You must grant execute permission to the Hadoop impersonation user and the mapping impersonation users.
This option is applicable and required when you write data to a Hive target in the native environment.
Hive Staging Database Name
Namespace for Hive staging tables. Use the name default for tables that do not have a specified database name.
This option is applicable when you run a mapping in the native environment to write data to a
Hive target.
If you run the mapping on the Blaze or Spark engine, you do not need to configure the Hive
staging database name in the Hive connection. The Data Integration Service uses the value that
you configure in the Hadoop connection.
Note: The order of the connection properties might vary depending on the tool where you view them.
Property Description
Name Name of the connection. The name is not case sensitive and must be unique within the domain. The
name cannot exceed 128 characters, contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive.
It must be 255 characters or less and must be unique in the domain. You cannot change this property
after you create the connection. Default value is the connection name.
Description The description of the connection. The description cannot exceed 765 characters.
Connection String
Connection string to connect to the database. Use the following connection string:
jdbc:<subprotocol>:<subname>
The following list provides sample connection strings that you can enter for the applicable database
type:
- Connection string for DataDirect Oracle JDBC driver:
jdbc:informatica:oracle://<host>:<port>;SID=<value>
- Connection string for Oracle JDBC driver:
jdbc:oracle:thin:@//<host>:<port>:<SID>
- Connection string for DataDirect IBM DB2 JDBC driver:
jdbc:informatica:db2://<host>:<port>;DatabaseName=<value>
- Connection string for IBM DB2 JDBC driver:
jdbc:db2://<host>:<port>/<database_name>
- Connection string for DataDirect Microsoft SQL Server JDBC driver:
jdbc:informatica:sqlserver://<host>;DatabaseName=<value>
- Connection string for Microsoft SQL Server JDBC driver:
jdbc:sqlserver://<host>;DatabaseName=<value>
- Connection string for Netezza JDBC driver:
jdbc:netezza://<host>:<port>/<database_name>
- Connection string for Pivotal Greenplum driver:
jdbc:pivotal:greenplum://<host>:<port>;/database_name=<value>
- Connection string for Postgres Greenplum driver:
jdbc:postgresql://<host>:<port>/<database_name>
- Connection string for Teradata JDBC driver:
jdbc:teradata://<host>/database_name=<value>,tmode=<value>,charset=<value>
For more information about the connection string to use with specific drivers, see the vendor
documentation.
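For example, a Microsoft SQL Server JDBC connection string with illustrative host and database values might look like the following:
jdbc:sqlserver://sqlhost.example.com;DatabaseName=sales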
Environment Optional. Enter SQL commands to set the database environment when you connect to the database.
SQL The Data Integration Service executes the connection environment SQL each time it connects to the
database.
Note: If you enable Sqoop, Sqoop ignores this property.
Transaction Optional. Enter SQL commands to set the database environment when you connect to the database.
SQL The Data Integration Service executes the transaction environment SQL at the beginning of each
transaction.
Note: If you enable Sqoop, Sqoop ignores this property.
SQL Identifier Type of character that the database uses to enclose delimited identifiers in SQL queries. The
Character available characters depend on the database type.
Select (None) if the database uses regular identifiers. When the Data Integration Service generates
SQL queries, the service does not place delimited characters around any identifiers.
Select a character if the database uses delimited identifiers. When the Data Integration Service
generates SQL queries, the service encloses delimited identifiers within this character.
Note: If you enable Sqoop, Sqoop ignores this property.
Support Enable if the database uses case-sensitive identifiers. When enabled, the Data Integration Service
Mixed-case encloses all identifiers within the character selected for the SQL Identifier Character property.
Identifiers When the SQL Identifier Character property is set to none, the Support Mixed-case Identifiers
property is disabled.
Note: If you enable Sqoop, Sqoop honors this property when you generate and execute a DDL script
to create or replace a target at run time. In all other scenarios, Sqoop ignores this property.
If you want to use the same driver to import metadata and run the mapping, and do not want to specify any
additional Sqoop arguments, select Sqoop v1.x from the Use Sqoop Version list and leave the Sqoop
Arguments field empty in the JDBC connection. The Data Integration Service constructs the Sqoop command
based on the JDBC connection properties that you specify.
However, if you want to use a different driver for run-time tasks or specify additional run-time Sqoop
arguments, select Sqoop v1.x from the Use Sqoop Version list and specify the arguments in the Sqoop
Arguments field.
You can configure the following Sqoop arguments in the JDBC connection:
driver
Defines the JDBC driver class that Sqoop must use to connect to the database.
For example, use the following syntax depending on the database type that you want to connect to:
connect
Defines the JDBC connection string that Sqoop must use to connect to the database. The JDBC
connection string must be based on the driver that you define in the driver argument.
For example, use the following syntax depending on the database type that you want to connect to:
connection-manager
Defines the connection manager class name that Sqoop must use to connect to the database.
For example, use the following syntax to use the generic JDBC manager class name:
--connection-manager org.apache.sqoop.manager.GenericJdbcManager
direct
When you read data from or write data to Oracle, you can configure the direct argument to enable Sqoop
to use OraOop. OraOop is a specialized Sqoop plug-in for Oracle that uses native protocols to connect to
the Oracle database. When you configure OraOop, the performance improves.
You can configure OraOop when you run Sqoop mappings on the Spark engine.
--direct
When you use OraOop, you must use the following syntax to specify multiple arguments:
-D<argument=value> -D<argument=value>
Note: If you specify multiple arguments and include a space character between -D and the argument
name-value pair, Sqoop considers only the first argument and ignores the remaining arguments.
If you do not direct the job to a specific queue, the Spark engine uses the default queue.
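For example, to direct the Sqoop job to a specific YARN queue, you might specify the following argument. The queue name is illustrative:
-Dmapreduce.job.queuename=root.sqoop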
-Dsqoop.connection.factories
To run the mapping on the Blaze engine with the Teradata Connector for Hadoop (TDCH) specialized
connectors for Sqoop, you must configure the -Dsqoop.connection.factories argument. Use the
argument to define the TDCH connection factory class that Sqoop must use. The connection factory
class varies based on the TDCH Sqoop Connector that you want to use.
Note: To run the mapping on the Spark engine, you do not need to configure the -
Dsqoop.connection.factories argument. The Data Integration Service invokes Cloudera Connector
Powered by Teradata and Hortonworks Connector for Teradata (powered by the Teradata Connector for
Hadoop) by default.
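For example, to use Cloudera Connector Powered by Teradata on the Blaze engine, you might set the argument as follows. The factory class name is taken from the connector documentation and should be verified for your connector version:
-Dsqoop.connection.factories=com.cloudera.connector.teradata.TeradataManagerFactory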
--infaoptimize
Use this argument to disable the performance optimization of Sqoop pass-through mappings on the
Spark engine. The Data Integration Service optimizes Sqoop pass-through mappings in the following scenarios:
• You read data from a Sqoop source and write data to a Hive target that uses the Text format.
• You read data from a Sqoop source and write data to an HDFS target that uses the Flat, Avro, or
Parquet format.
If you want to disable the performance optimization, set the --infaoptimize argument to false. For
example, if you see data type issues after you run an optimized Sqoop mapping, you can disable the
performance optimization.
--infaoptimize false
For a complete list of the Sqoop arguments that you can configure, see the Sqoop documentation.
The following table describes the general connection properties for the Kafka connection:
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the domain. You
can change this property after you create the connection. The name cannot exceed 128 characters,
contain spaces, or contain the following special characters:
~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID The string that the Data Integration Service uses to identify the connection. The ID is not case sensitive.
It must be 255 characters or less and must be unique in the domain. You cannot change this property
after you create the connection. Default value is the connection name.
Description The description of the connection. Enter a string that you can use to identify the connection. The
description cannot exceed 4,000 characters.
Property Description
Kafka Broker List Comma-separated list of Kafka brokers that maintain the
configuration of the Kafka messaging broker.
To specify a Kafka broker, use the following format:
<IP Address>:<port>
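For example, a two-broker list with illustrative addresses and the conventional Kafka port:
192.0.2.11:9092,192.0.2.12:9092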
ZooKeeper Host Port List Optional. Comma-separated list of Apache ZooKeeper hosts that maintain
the configuration of the Kafka messaging broker.
To specify a ZooKeeper host, use the following format:
<IP Address>:<port>
Retry Timeout Number of seconds the Integration Service attempts to reconnect to the
Kafka broker to write data. If the source or target is not available for the
time you specify, the mapping execution stops to avoid any data loss.
Kafka Broker Version Configure the Kafka messaging broker version to 0.10.1.x-2.0.0.
Note: The order of the connection properties might vary depending on the tool where you view them.
You can create and manage a Microsoft Azure Blob Storage connection in the Administrator tool or the
Developer tool. The following table describes the Microsoft Azure Blob Storage connection properties:
Property Description
ID String that the Data Integration Service uses to identify the connection. The ID is not
case sensitive. It must be 255 characters or less and must be unique in the domain.
You cannot change this property after you create the connection. Default value is the
connection name.
Property Description
Container Name The root container or sub-folders with the absolute path.
Endpoint Suffix Type of Microsoft Azure end-points. You can select any of the following end-
points:
- core.windows.net: Default
- core.usgovcloudapi.net: To select the US government Microsoft Azure end-
points
- core.chinacloudapi.cn: Not applicable
The following table describes the Microsoft Azure Cosmos DB connection properties:
Property Description
ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive.
It must be 255 characters or less and must be unique in the domain. You cannot change this
property after you create the connection. Default value is the connection name.
Description Description of the connection. The description cannot exceed 765 characters.
Location The project or folder in the Model repository where you want to store the Cosmos DB connection.
Key The primary or secondary key that provides complete administrative access to the
resources within the Microsoft Azure Cosmos DB account.
Database Name of the database that contains the collections from which you want to read or write JSON
documents.
Note: You can find the Cosmos DB URI and Key values in the Keys settings on Azure portal. Contact your
Azure administrator for more details.
Note: The order of the connection properties might vary depending on the tool where you view them.
You can create and manage a Microsoft Azure Data Lake Store connection in the Administrator tool or
the Developer tool. The following table describes the Microsoft Azure Data Lake Store connection properties:
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the domain. You
can change this property after you create the connection. The name cannot exceed 128 characters,
contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It
must be 255 characters or less and must be unique in the domain. You cannot change this property
after you create the connection.
Default value is the connection name.
Description The description of the connection. The description cannot exceed 4,000 characters.
Type The connection type. Select Microsoft Azure Data Lake Store.
Property Description
ADLS Account Name The name of the Microsoft Azure Data Lake Store.
ClientID The ID of your application to complete the OAuth Authentication in the Active Directory.
Client Secret The client secret key to complete the OAuth Authentication in the Active Directory.
Directory The Microsoft Azure Data Lake Store directory that you use to read data or write data. The
default is the root directory.
AuthEndpoint The OAuth 2.0 token endpoint from which the access code is generated based on the
client ID and the client secret.
For more information about creating a client ID, client secret, and auth end point, contact the Azure
administrator or see Microsoft Azure Data Lake Store documentation.
Note: The order of the connection properties might vary depending on the tool where you view them.
You can create and manage a Microsoft Azure SQL Data Warehouse connection in the Administrator tool or
the Developer tool. The following table describes the Microsoft Azure SQL Data Warehouse connection
properties:
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the domain. You
can change this property after you create the connection. The name cannot exceed 128 characters,
contain spaces, or contain the following special characters: ~ ` ! $ % ^ & * ( ) - + = { [ } ] | \ : ; " ' < , > . ? /
ID String that the Data Integration Service uses to identify the connection. The ID is not case sensitive. It
must be 255 characters or less and must be unique in the domain. You cannot change this property
after you create the connection. Default value is the connection name.
Description The description of the connection. The description cannot exceed 4,000 characters.
Type The connection type. Select Microsoft Azure SQL Data Warehouse.
Property Description
Azure DW JDBC URL Microsoft Azure SQL Data Warehouse JDBC connection string. For example, you can enter the
following connection string:
jdbc:sqlserver://<Server>.database.windows.net:1433;database=<Database>
The administrator can download the URL from the Microsoft Azure portal.
Azure DW JDBC User name to connect to the Microsoft Azure SQL Data Warehouse account. You must have
Username permission to read, write, and truncate data in Microsoft Azure SQL Data Warehouse.
Azure DW JDBC Password to connect to the Microsoft Azure SQL Data Warehouse account.
Password
Azure DW Schema Name of the schema in Microsoft Azure SQL Data Warehouse.
Name
Azure Blob Account Name of the Microsoft Azure Storage account to stage the files.
Name
Azure Blob Account The key that authenticates the access to the Blob storage account.
Key
Blob End-point Type of Microsoft Azure end-points. You can select any of the following end-points:
- core.windows.net: Default
- core.usgovcloudapi.net: To select the US government Microsoft Azure end-points
- core.chinacloudapi.cn: Not applicable
You can configure the US government Microsoft Azure end-points when a mapping runs in the
native environment and on the Spark engine.
Note: The order of the connection properties might vary depending on the tool where you view them.
Property Description
Name The name of the connection. The name is not case sensitive and must be unique within the domain.
You can change this property after you create the connection. The name cannot exceed 128
characters, contain spaces, or contain the following special characters:~ ` ! $ % ^ & * ( ) - + = { [ } ] |
\:;"'<,>.?/
ID String that the Data Integration Service uses to identify the connection.
The ID is not case sensitive. The ID must be 255 characters or less and must be unique in the domain.
You cannot change this property after you create the connection.
Default value is the connection name.
Description Optional. The description of the connection. The description cannot exceed 4,000 characters.
Additional JDBC URL Parameters Enter one or more JDBC connection parameters in the following format:
<param1>=<value>&<param2>=<value>&<param3>=<value>....
For example:
user=jon&warehouse=mywh&db=mydb&schema=public
To access Snowflake through Okta SSO authentication, enter the URL of the web-based IdP that implements the SAML 2.0 protocol in the following format:
authenticator=https://<Your_Okta_Account_Name>.okta.com
Note: Microsoft ADFS is not supported.
For more information about configuring Okta authentication, see the following website:
https://docs.snowflake.net/manuals/user-guide/admin-security-fed-auth-configure-snowflake.html#configuring-snowflake-to-use-federated-authentication
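For example, to combine Okta SSO authentication with other JDBC parameters, you might enter a value similar to the following. The warehouse, database, schema, and Okta account names are placeholders for illustration:
warehouse=mywh&db=mydb&schema=public&authenticator=https://<Your_Okta_Account_Name>.okta.com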
6. Click Next.
7. Enter the Hadoop cluster properties, common properties, and the reject directory properties.
8. Click Next.
9. Click Next.
Effective in version 10.2.2, Informatica dropped support for the Hive engine. Do not enter Hive
configuration properties.
10. Enter configuration properties for the Blaze engine and click Next.
11. Enter configuration properties for the Spark engine and click Finish.
You can configure the following Hadoop connection properties based on the cluster environment and
functionality that you use:
Note: Informatica does not recommend changing these property values before you consult with third-party
documentation, Informatica documentation, or Informatica Global Customer Support. If you change a value
without knowledge of the property, you might experience performance degradation or other unexpected
results.
To reset to default values, delete the property values. For example, if you delete the values of an edited
Cluster Library Path property, the value resets to the default $DEFAULT_CLUSTER_LIBRARY_PATH.
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following environment variables in the Cluster Environment Variables property:
HADOOP_NODE_JDK_HOME
Represents the directory from which you run the cluster services and the JDK version that the cluster
nodes use. Required to run the Java transformation in the Hadoop environment and Sqoop mappings on
the Blaze engine. Default is /usr/java/default. The JDK version that the Data Integration Service uses
must be compatible with the JDK version on the cluster.
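For example, to use the default JDK location on the cluster nodes, the Cluster Environment Variables property might contain the following entry. Adjust the path to match the JDK installation on your cluster:
HADOOP_NODE_JDK_HOME=/usr/java/default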
To edit the property in the text box, use the following format with : to separate each path variable:
<variable1>[:<variable2>…:<variableN>]
Configure the library path variables in the Cluster Library Path property.
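For example, a Cluster Library Path value might list native library directories such as the following. The paths are hypothetical and depend on your Hadoop distribution:
/usr/lib/hadoop/lib/native:/usr/lib/hive/lib/native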
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following property in the Advanced Properties of the common properties section:
infapdo.java.opts
List of Java options to customize the Java run-time environment. The property contains default values.
• -Xmx512M. Specifies the maximum size for the Java virtual memory. Default is 512 MB. Increase the
value to at least 700 MB.
For example, infapdo.java.opts=-Xmx700M
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
Configure the following properties in the Advanced Properties of the Blaze configuration section:
infagrid.cadi.namespace
Namespace for the Data Integration Service to use. Required to set up multiple Blaze instances.
infagrid.blaze.console.jsfport
JSF port for the Blaze engine console. Use a port number that no other cluster processes use. Required
to set up multiple Blaze instances.
infagrid.blaze.console.httpport
HTTP port for the Blaze engine console. Use a port number that no other cluster processes use.
Required to set up multiple Blaze instances.
infagrid.node.local.root.log.dir
Path for the Blaze service logs. Default is /tmp/infa/logs/blaze. Required to set up multiple Blaze
instances.
infacal.hadoop.logs.directory
Path in HDFS for the persistent Blaze logs. Default is /var/log/hadoop-yarn/apps/informatica. Required
to set up multiple Blaze instances.
infagrid.node.hadoop.local.root.log.dir
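For example, to set up a second Blaze instance, the Blaze advanced properties might contain an entry similar to the following. The namespace, port numbers, and log directories are illustrative values only; choose ports that no other cluster processes use:
infagrid.cadi.namespace=Blaze2&:infagrid.blaze.console.jsfport=9091&:infagrid.blaze.console.httpport=9092&:infagrid.node.local.root.log.dir=/tmp/infa/logs/blaze2&:infacal.hadoop.logs.directory=/var/log/hadoop-yarn/apps/informatica2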
Configure the following properties in the Advanced Properties of the Spark configuration section:
To edit the property in the text box, use the following format with &: to separate each name-value pair:
<name1>=<value1>[&:<name2>=<value2>…&:<nameN>=<valueN>]
A combined example that uses this format appears after the property descriptions in this section.
spark.authenticate
Enables authentication for the Spark service on Hadoop. Required for Spark encryption.
Set to TRUE.
spark.authenticate.enableSaslEncryption
Enables encrypted communication when SASL authentication is enabled. Required if Spark encryption
uses SASL authentication.
Set to TRUE.
spark.executor.cores
Indicates the number of cores that each executor process uses to run tasklets on the Spark engine.
spark.executor.instances
Indicates the number of instances that each executor process uses to run tasklets on the Spark engine.
spark.executor.memory
Indicates the amount of memory that each executor process uses to run tasklets on the Spark engine.
List of extra Java options for the Spark driver that runs inside the cluster. Required for streaming
mappings to read from or write to a Kafka cluster that uses Kerberos authentication.
List of extra Java options for the Spark executor. Required for streaming mappings to read from or write
to a Kafka cluster that uses Kerberos authentication.
When the Databricks Spark engine writes to a target, it converts null values to empty strings (""). For example, 12, AB,"",23p09udj.
The Databricks Spark engine can write the empty strings to string columns, but when it tries to write an
empty string to a non-string column, the mapping fails with a type mismatch.
To allow the Databricks Spark engine to convert the empty strings back to null values and write to the
target, configure the following advanced property in the Databricks Spark connection:
infaspark.flatfile.writer.nullValue=true
spark.hadoop.validateOutputSpecs
Validates if the HBase table exists. Required for streaming mappings to write to an HBase target in an
Amazon EMR cluster. Set the value to false.
infaspark.json.parser.mode
Specifies how the parser handles corrupt JSON records. Set the value to one of the supported parser modes.
Specifies whether the parser can read a multiline record in a JSON file. You can set the value to true or
false. Default is false. Applies only to non-native distributions that use Spark version 2.2.x and above.
infaspark.pythontx.exec
Required to run a Python transformation on the Spark engine for Big Data Management. The location of
the Python executable binary on the worker nodes in the Hadoop cluster.
infaspark.pythontx.executorEnv.PYTHONHOME
Required to run a Python transformation on the Spark engine for Big Data Management and Big Data Streaming. The location of the Python installation directory on the worker nodes in the Hadoop cluster.
If the Python installation directory on the worker nodes is in a directory such as usr/lib/python, set the
property to the following value:
infaspark.pythontx.executorEnv.PYTHONHOME=usr/lib/python
If you use the installation of Python on the Data Integration Service machine, use the location of the
Python installation directory on the Data Integration Service machine.
Required to run a Python transformation on the Spark engine for Big Data Streaming. The location of the
Python shared library in the Python installation folder on the Data Integration Service machine.
Required to run a Python transformation on the Spark engine for Big Data Streaming. The location of the
Jep package in the Python installation folder on the Data Integration Service machine.
Enables encrypted communication when authentication is enabled. Required for Spark encryption.
Set to TRUE.
spark.scheduler.maxRegisteredResourcesWaitingTime
The number of milliseconds to wait for resources to register before scheduling a task. Default is 30000.
Decrease the value to reduce delays before starting the Spark job execution. Required to improve
performance for mappings on the Spark engine.
Set to 15000.
spark.scheduler.minRegisteredResourcesRatio
The minimum ratio of registered resources to acquire before task scheduling begins. Default is 0.8.
Decrease the value to reduce any delay before starting the Spark job execution. Required to improve
performance for mappings on the Spark engine.
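The following entry shows how several of the Spark advanced properties described in this section might be combined in the text box. The numeric values are illustrative only; tune them for your cluster and workload:
spark.authenticate=TRUE&:spark.authenticate.enableSaslEncryption=TRUE&:spark.executor.cores=2&:spark.executor.instances=2&:spark.executor.memory=4G&:spark.scheduler.minRegisteredResourcesRatio=0.5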
Index

B
big data
  application services 13
  repositories 14
  integration with Informatica products 15
Blaze engine
  create a user account 24
  port requirements 17
  connection properties 164
  directories to create 24

D
Data Integration Service
  Big Data Management prerequisites 28
  configuration for MapR 129
Databricks
  authentication 143
  cloud provisioning configuration 154
  components 138
  import file 145
  import from file 146
  run-time staging directory 143

E
ephemeral clusters
  cloud provisioning connection 149

G
Google Analytics connections
  properties 161
Google BigQuery connection
  properties 161
Google Cloud Spanner connection
  properties 162
Google Cloud Storage connections
  properties 163

H
Hadoop 149
Hadoop administrator
  prerequisite tasks for Amazon EMR 37
  prerequisite tasks for Azure HDInsight 58
  prerequisite tasks for Cloudera CDH 79
  prerequisite tasks for Hortonworks HDP 99
  prerequisite tasks for MapR 120
Hadoop administrator tasks
  Amazon EMR 37
  Azure HDInsight 58
  Cloudera CDH 79
  configure *-site.files 37, 58, 79, 99, 121
  Hortonworks HDP 99
  MapR 121
Hadoop connections
  creating 186
Hadoop operating system
  on Data Integration Service 27
HBase connections
  MapR-DB properties 171
  properties 171
HDFS connections
  creating 186
  properties 169
HDFS staging directory 23
high availability
  configuration on Developer tool 132
Hive access
  for Amazon EMR 45
Hive connections
  creating 186
  properties 172
Hive pushdown
  connection properties 164
Hortonworks HDP
  Hadoop administrator tasks 99
hosts file
  Azure HDInsight 30

I
install
  Jep 29
  Python 29
  Python transformation 29
installation
  MapR client 120

J
JDBC
  Sqoop connectivity 108
JDBC connections
  properties 175

K
Kerberos authentication
  security certificate import 89, 109

M
MapR
  Hadoop administrator tasks 120, 121
  Analyst Service configuration 131
  Data Integration Service configuration 129
  Metadata Access Service configuration 130
  tickets 128
MapR client
  installing 120
Metadata Access Service
  configuration for MapR 130
Microsoft Azure 151
Microsoft Azure Data Lake Store connection
  properties 183
Microsoft Azure SQL Data Warehouse connection
  properties 184

O
overview 12

P
permissions
  Blaze engine user 24
ports
  Amazon EMR requirements 17
  Azure HDInsight requirements 17
  Blaze engine requirements 17
Prerequisite
  download Hadoop operating system 27
prerequisites
  create directories for the Blaze engine 24
  disk space 17
  Hadoop administrator tasks 37, 58, 79, 99, 120

R
reject file directory
  HDFS 25

S
S3 access policies 46
Snowflake connection
  properties 185
Spark deploy mode
  Hadoop connection properties 164
Spark engine
  connection properties 164
Spark Event Log directory
  Hadoop connection properties 164
Spark execution parameters
  Hadoop connection properties 164
Spark HDFS staging directory
  Hadoop connection properties 164
Sqoop
  JDBC drivers 108
Sqoop connection arguments
  -Dsqoop.connection.factories 178
  connect 178
  direct 178
  driver 178
staging directory
  Databricks 142
  HDFS 23
system requirements
  Databricks integration 140
  prerequisites 16

T
TDCH connection factory
  -Dsqoop.connection.factories 178

U
uninstall
  prerequisite 18
user accounts
  MapR 128

W
WASB
  Databricks access 142