Oracle Big Data Appliance - SW - Guide
July 2013
Provides an introduction to the Oracle Big Data Appliance
software and to the administrative tools and procedures.
Oracle Big Data Appliance Software User's Guide, Release 2 (2.1)
E40656-03
Copyright 2011, 2013, Oracle and/or its affiliates. All rights reserved.
This software and related documentation are provided under a license agreement containing restrictions on
use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your
license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license,
transmit, distribute, exhibit, perform, publish, or display any part, in any form, or by any means. Reverse
engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is
prohibited.
The information contained herein is subject to change without notice and is not warranted to be error-free. If
you find any errors, please report them to us in writing.
If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it
on behalf of the U.S. Government, the following notice is applicable:
U.S. GOVERNMENT END USERS: Oracle programs, including any operating system, integrated software,
any programs installed on the hardware, and/or documentation, delivered to U.S. Government end users
are "commercial computer software" pursuant to the applicable Federal Acquisition Regulation and
agency-specific supplemental regulations. As such, use, duplication, disclosure, modification, and
adaptation of the programs, including any operating system, integrated software, any programs installed on
the hardware, and/or documentation, shall be subject to license terms and license restrictions applicable to
the programs. No other rights are granted to the U.S. Government.
This software or hardware is developed for general use in a variety of information management
applications. It is not developed or intended for use in any inherently dangerous applications, including
applications that may create a risk of personal injury. If you use this software or hardware in dangerous
applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other
measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages
caused by use of this software or hardware in dangerous applications.
Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of
their respective owners.
Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks
are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD,
Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced
Micro Devices. UNIX is a registered trademark of The Open Group.
This software or hardware and documentation may provide access to or information on content, products,
and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly
disclaim all warranties of any kind with respect to third-party content, products, and services. Oracle
Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your
access to or use of third-party content, products, or services.
Cloudera, Cloudera CDH, and Cloudera Manager are registered and unregistered trademarks of Cloudera,
Inc.
Contents
About the Oracle Big Data Appliance Software ................................................................................ 2-7
Software Components ....................................................................................................................... 2-7
Logical Disk Layout ........................................................................................................................... 2-8
About the Software Services .................................................................................................................. 2-9
Monitoring the CDH Services .......................................................................................................... 2-9
Where Do the Services Run?............................................................................................................. 2-9
Automatic Failover of the NameNode......................................................................................... 2-11
Unconfigured Software .................................................................................................................. 2-12
Configuring HBase ............................................................................................................................... 2-12
Effects of Hardware on Software Availability ................................................................................ 2-13
Critical and Noncritical Nodes...................................................................................................... 2-13
First NameNode .............................................................................................................................. 2-14
Second NameNode ......................................................................................................................... 2-14
JobTracker Node.............................................................................................................................. 2-14
Noncritical Nodes ........................................................................................................................... 2-15
Collecting Diagnostic Information for Oracle Customer Support .............................................. 2-15
Security on Oracle Big Data Appliance ............................................................................................ 2-16
About Predefined Users and Groups ........................................................................................... 2-17
Port Numbers Used on Oracle Big Data Appliance................................................................... 2-18
About CDH Security Using Kerberos .......................................................................................... 2-18
About Puppet Security ................................................................................................................... 2-19
4 Configuring Oracle Exadata Database Machine for Use with Oracle Big Data Appliance
About Optimizing Communications.................................................................................................... 4-1
Prerequisites.............................................................................................................................................. 4-1
Enabling SDP on Exadata Database Nodes ........................................................................................ 4-1
Configuring a JDBC Client for SDP..................................................................................................... 4-2
Creating an SDP Listener on the InfiniBand Network..................................................................... 4-3
Configuring Oracle Exadata Database Machine to Use InfiniBand .............................................. 4-5
Glossary
Index
Preface
The Oracle Big Data Appliance Software User's Guide describes how to manage and use
the installed software.
Audience
This guide is intended for users of Oracle Big Data Appliance including:
Application developers
Data analysts
Data scientists
Database administrators
System administrators
The Oracle Big Data Appliance Software User's Guide introduces the terminology and
concepts necessary to discuss Oracle Big Data Appliance. However, you must acquire
the necessary information about administering Hadoop clusters and writing
MapReduce programs from other sources.
Documentation Accessibility
For information about Oracle's commitment to accessibility, visit the Oracle
Accessibility Program website at
http://www.oracle.com/pls/topic/lookup?ctx=acc&id=docacc.
Related Documents
For more information, see the following documents:
Oracle Big Data Appliance Owner's Guide
Oracle Big Data Connectors User's Guide
Oracle Enterprise Manager System Monitoring Plug-in Installation Guide for Oracle Big
Data Appliance
Conventions
The following text conventions are used in this document:
Convention Meaning
boldface Boldface type indicates graphical user interface elements associated
with an action, or terms defined in text or the glossary.
italic Italic type indicates book titles, emphasis, or placeholder variables for
which you supply particular values.
monospace Monospace type indicates commands within a paragraph, URLs, code
in examples, text that appears on the screen, or text that you enter.
1 Introducing Oracle Big Data Appliance
This chapter presents an overview of Oracle Big Data Appliance and describes the
software installed on the system. This chapter contains the following sections:
What Is Big Data?
The Oracle Big Data Solution
Software for Big Data
Acquiring Data for Analysis
Organizing Big Data
Analyzing and Visualizing Big Data
High Variety
Big data is derived from a variety of sources, such as:
Equipment sensors: Medical, manufacturing, transportation, and other machine
sensor transmissions
Machines: Call detail records, web logs, smart meter readings, Global Positioning
System (GPS) transmissions, and trading systems records
Social media: Data streams from social media sites such as Facebook and blogging
sites such as Twitter
Analysts can mine this data repeatedly as they devise new ways of extracting
meaningful insights. What seems irrelevant today might prove to be highly pertinent
to your business tomorrow.
Challenge: Delivering flexible systems to handle this high variety
High Complexity
As the variety of data types increases, the complexity of the system increases. The
complexity of data types also increases in big data because of its low structure.
Challenge: Finding solutions that apply across a broad range of data types.
High Volume
Social media can generate terabytes of daily data. Equipment sensors and other
machines can generate that much data in less than an hour.
Even traditional data sources for data warehouses, such as customer profiles from
customer relationship management (CRM) systems, transactional enterprise resource
planning (ERP) data, store transactions, and general ledger data, have increased
tenfold in volume over the past decade.
Challenge: Providing scalability and ease in growing the system
High Velocity
Huge numbers of sensors, web logs, and other machine sources generate data
continuously and at a much higher speed than traditional sources, such as individuals
entering orders into a transactional database.
Challenge: Handling the data at high speed without stressing the structured systems

Oracle Big Data Appliance is the platform for acquiring and organizing
big data so that the relevant portions with true business value can be analyzed in
Oracle Database.
For maximum speed and efficiency, Oracle Big Data Appliance can be connected to
Oracle Exadata Database Machine running Oracle Database. Oracle Exadata Database
Machine provides outstanding performance in hosting data warehouses and
transaction processing databases. Moreover, Oracle Exadata Database Machine can be
connected to Oracle Exalytics In-Memory Machine for the best performance of
business intelligence and planning applications. The InfiniBand connections between
these engineered systems provide high parallelism, which enables high-speed data
transfer for batch or query workloads.
Figure 1-1 shows the relationships among these engineered systems.
replicating data across multiple servers without RAID technology. It runs on top
of the Linux file system on Oracle Big Data Appliance.
MapReduce engine: The MapReduce engine provides a platform for the massively
parallel execution of algorithms written in Java.
Administrative framework: Cloudera Manager is a comprehensive administrative
tool for CDH.
CDH is written in Java, and Java is the language for application development.
However, several CDH utilities and other software available on Oracle Big Data
Appliance provide graphical, web-based, and other language interfaces for ease of use.
See Also:
For conceptual information about Hadoop technologies, refer to
this third-party publication:
Hadoop: The Definitive Guide, Third Edition by Tom White (O'Reilly
Media Inc., 2012, ISBN 978-1449311520).
For documentation about Cloudera's Distribution including
Apache Hadoop, see the Cloudera library at
http://oracle.cloudera.com/
An intelligent driver links the NoSQL database with client applications and provides
access to the requested key-value on the storage node with the lowest latency.
Oracle NoSQL Database includes hashing and balancing algorithms to ensure proper
data distribution and optimal load balancing, replication management components to
handle storage node failure and recovery, and an easy-to-use administrative interface
to monitor the state of the database.
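The key-to-node mapping behind this kind of distribution can be sketched in a few lines. This is only an illustration of hash-based partitioning, not the actual algorithm used by Oracle NoSQL Database; the hash function, key names, and partition count are all assumptions:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing (illustrative only)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Distribute 1000 sample keys across 3 hypothetical storage nodes.
keys = [f"user{i}" for i in range(1000)]
counts = [0, 0, 0]
for k in keys:
    counts[partition_for(k, 3)] += 1

print(counts)  # roughly even: each node holds about a third of the keys
```

Because the hash is deterministic, a client can always locate the node that holds a given key without consulting a central directory, which is one reason such schemes keep read latency low.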
Oracle NoSQL Database is typically used to store customer profiles and similar data
for identifying and analyzing big data. For example, you might log in to a website and
see advertisements based on your stored customer profile (a record in Oracle NoSQL
Database) and your recent activity on the site (web logs currently streaming into
HDFS).
Oracle NoSQL Database is an optional component of Oracle Big Data Appliance. It is
always installed, but might not be configured and enabled during installation of the
software.
See Also:
Oracle NoSQL Database Getting Started Guide at
http://docs.oracle.com/cd/NOSQL/html/index.html
Oracle Big Data Appliance Licensing Information
Hive
Hive is an open-source data warehouse that supports data summarization, ad hoc
querying, and analysis of data stored in HDFS. It uses a SQL-like language called
HiveQL. An interpreter generates MapReduce code from the HiveQL queries. By
storing data in Hive, you can avoid writing MapReduce programs in Java.
Hive is a component of CDH and is always installed on Oracle Big Data Appliance.
Oracle Big Data Connectors can access Hive tables.
MapReduce
The MapReduce engine provides a platform for the massively parallel execution of
algorithms written in Java. MapReduce uses a parallel programming model for
processing data on a distributed system. It can process vast amounts of data quickly
and can scale linearly. It is particularly effective as a mechanism for batch processing
of unstructured and semistructured data. MapReduce abstracts lower-level operations
into computations over a set of keys and values.
Although big data is often described as unstructured, incoming data always has some
structure. However, it does not have a fixed, predefined structure when written to
HDFS. Instead, MapReduce creates the desired structure as it reads the data for a
particular job. The same data can have many different structures imposed by different
MapReduce jobs.
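The key-value computation model described above can be sketched compactly. This single-process word count is only an analogy for what a distributed MapReduce job does; the map phase emits (key, value) pairs and the reduce phase aggregates them by key:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Emit one (word, 1) pair per word -- structure is imposed at read time."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group the pairs by key and sum the values."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

lines = ["big data big insight", "data at scale"]
print(dict(reduce_phase(map_phase(lines))))
# {'at': 1, 'big': 2, 'data': 2, 'insight': 1, 'scale': 1}
```

A different job could impose a different structure on the same lines (for example, emitting line lengths as keys), which is the point made above about structure being created as the data is read.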
Note: Oracle Big Data Appliance supports the Yet Another Resource
Negotiator (YARN) implementation of MapReduce. However, the
Mammoth utility installs and configures only classic MapReduce.
See Also:
For information about R, go to
http://www.r-project.org/
For information about Oracle R Enterprise, go to
http://docs.oracle.com/cd/E27988_01/welcome.html
See Also:
Oracle Business Intelligence website at
http://www.oracle.com/us/solutions/ent-performance-bi/business-intelligence/index.html
Data Warehousing and Business Intelligence in the Oracle
Database Documentation Library at
http://www.oracle.com/pls/db112/portal.portal_db?selected=6&frame=
This chapter provides information about the software and services installed on Oracle
Big Data Appliance. It contains these sections:
Monitoring a Cluster Using Oracle Enterprise Manager
Managing CDH Operations Using Cloudera Manager
Using Hadoop Monitoring Utilities
Using Hue to Interact With Hadoop
About the Oracle Big Data Appliance Software
About the Software Services
Configuring HBase
Effects of Hardware on Software Availability
Collecting Diagnostic Information for Oracle Customer Support
Security on Oracle Big Data Appliance
In this example, bda1 is the name of the appliance, node03 is the name of the
server, example.com is the domain, and 7180 is the default port number for
Cloudera Manager.
2. Log in with a user name and password for Cloudera Manager. Only a user with
administrative privileges can change the settings. Other Cloudera Manager users
can view the status of Oracle Big Data Appliance.
In this example, bda1 is the name of the appliance, node03 is the name of the
server, and 50030 is the default port number for Hadoop Map/Reduce
Administration.
Figure 2-3 shows part of a Hadoop Map/Reduce Administration display.
http://bda1node13.example.com:50060
In this example, bda1 is the name of the rack, node13 is the name of the server, and
50060 is the default port number for the Task Tracker Status interface.
Figure 2-4 shows the Task Tracker Status interface.
and automatically becomes an administrator. This user can create other user and
administrator accounts.
3. Use the icons across the top to open a utility.
Figure 2-5 shows the Beeswax Query Editor for entering Hive queries.
See Also: Hue Installation Guide for information about using Hue,
which is already installed and configured on Oracle Big Data
Appliance, at
http://cloudera.github.com/hue/docs-2.1.0/manual.html
Software Components
These software components are installed on all 18 servers in an Oracle Big Data
Appliance rack. Oracle Linux, required drivers, firmware, and hardware verification
utilities are factory installed. All other software is installed on site using the Mammoth
Utility. The optional software components may not be configured in your installation.
See Also: Oracle Big Data Appliance Owner's Guide for information
about the Mammoth Utility
Note: Oracle Big Data Appliance 2.0 and later releases do not
support the use of an external NFS filer for backups and do not use
NameNode federation.
Figure 2-7 shows the relationships among the processes that support automatic
failover on Oracle Big Data Appliance.
Unconfigured Software
The RPM installation files for the following tools are available on Oracle Big Data
Appliance, so you do not need to download them from the Cloudera website. However,
you must install and configure these tools before you can use them:
Flume
HBase
Mahout
Sqoop
Whirr
You can find the RPM files on the first server of each cluster in
/opt/oracle/BDAMammoth/bdarepo/RPMS/noarch.
Configuring HBase
HBase is an open-source, column-oriented database provided with CDH. HBase is not
configured automatically on Oracle Big Data Appliance. You must set up and
configure HBase before you can access it from an HBase client on another system.
To create an HBase service:
1. Open Cloudera Manager in a browser, using a URL like the following:
http://bda1node03.example.com:7180
In this example, bda1 is the name of the appliance, node03 is the name of the
server, example.com is the domain, and 7180 is the default port number for
Cloudera Manager.
2. On the All Services page, click Add a Service.
3. Select HBase from the list of services, and then click Continue.
4. Select zookeeper, and then click Continue.
In a multirack cluster, some of the critical services run on the first server of the second
rack. See "Where Do the Services Run?" on page 2-9.
Moving a critical node requires that all clients be reconfigured with the address of the
new node. The other alternative is to wait for the repair of the failed server. You must
weigh the loss of services against the inconvenience of reconfiguring the clients.
First NameNode
One instance of the NameNode initially runs on node01. If this node fails or goes
offline (such as a reboot), then the second NameNode (node02) automatically takes
over to maintain the normal activities of the cluster.
Alternatively, if the second NameNode is already active, it continues without a
backup. With only one NameNode, the cluster is vulnerable to failure. The cluster has
lost the redundancy needed for automatic failover of the active NameNode.
These functions are also disrupted:
Balancer: The balancer runs periodically to ensure that data is distributed evenly
across the cluster. Balancing is not performed when the first NameNode is down.
Puppet master: The Mammoth utilities use Puppet, and so you cannot install or
reinstall the software if, for example, you must replace a disk drive elsewhere in
the rack.
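The failover behavior described in this section can be pictured as a small state machine. This is a deliberately simplified toy model; real HDFS NameNode failover involves ZooKeeper coordination, fencing, and shared edit logs:

```python
class NameNodePair:
    """Toy model of active/standby NameNode failover (illustrative only;
    real HDFS failover involves ZooKeeper coordination and fencing)."""

    def __init__(self):
        self.nodes = {"node01": "active", "node02": "standby"}

    def fail(self, name):
        self.nodes[name] = "down"
        # If the active NameNode failed, promote a standby if one remains.
        if "active" not in self.nodes.values():
            for other, state in self.nodes.items():
                if state == "standby":
                    self.nodes[other] = "active"
                    break

    def redundant(self):
        return "standby" in self.nodes.values()

pair = NameNodePair()
pair.fail("node01")        # first NameNode goes offline
print(pair.nodes)          # {'node01': 'down', 'node02': 'active'}
print(pair.redundant())    # False: no standby left for automatic failover
```

The final state mirrors the situation described above: the cluster keeps running on the surviving NameNode, but a second failure could no longer be absorbed automatically.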
Second NameNode
One instance of the NameNode initially runs on node02. If this node fails, then the
function of the NameNode either fails over to the first NameNode (node01) or
continues there without a backup. However, the cluster has lost the redundancy
needed for automatic failover if the first NameNode also fails.
These services are also disrupted:
MySQL Master Database: Cloudera Manager, Oracle Data Integrator, Hive, and
Oozie use MySQL Database. The data is replicated automatically, but you cannot
access it when the master database server is down.
Oracle NoSQL Database KV Administration: Oracle NoSQL Database is an
optional component of Oracle Big Data Appliance, so the extent of a
disruption due to a node failure depends on whether you are using it and how
critical it is to your applications.
JobTracker Node
The JobTracker assigns MapReduce tasks to specific nodes in the CDH cluster.
Without the JobTracker node (node03), this critical function is not performed.
Noncritical Nodes
The noncritical nodes (node04 to node18) are optional in that Oracle Big Data
Appliance continues to operate with no loss of service if a failure occurs. The
NameNode automatically replicates the lost data to maintain three copies at all times.
MapReduce jobs execute on copies of the data stored elsewhere in the cluster. The only
loss is in computational power, because there are fewer servers on which to distribute
the work.
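The three-copy replication guarantee explains why a noncritical node can fail without data loss. The toy placement model below illustrates the arithmetic; the random placement policy here is a naive assumption, not the actual HDFS block placement policy:

```python
import random

random.seed(42)
nodes = [f"node{n:02d}" for n in range(4, 19)]  # noncritical nodes node04-node18

# Place each of 100 hypothetical blocks on three distinct nodes.
blocks = {b: random.sample(nodes, 3) for b in range(100)}

# Simulate losing one noncritical node.
failed = "node07"
remaining = {b: [n for n in ns if n != failed] for b, ns in blocks.items()}

# Every block still has at least two replicas elsewhere in the cluster.
print(all(len(ns) >= 2 for ns in remaining.values()))  # True
```

Since each block lives on three distinct nodes, removing any single node leaves at least two replicas, and the NameNode can re-replicate from those survivors to restore the three-copy invariant.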
The command output identifies the name and the location of the diagnostic file.
3. Go to My Oracle Support at http://support.oracle.com.
4. Open a Service Request (SR) if you have not already done so.
5. Upload the bz2 file into the SR. If the file is too large, then upload it to
ftp.oracle.com, as described in the next procedure.
To upload the diagnostics to ftp.oracle.com:
1. Open an FTP client and connect to ftp.oracle.com.
See Example 2-1 if you are using a command-line FTP client from Oracle Big Data
Appliance.
2. Log in as user anonymous and leave the password field blank.
3. In the bda/incoming directory, create a directory using the SR number for the
name, in the format SRnumber. The resulting directory structure looks like this:
bda
incoming
SRnumber
Table 2-5 identifies the operating system users and groups that are created
automatically during installation of Oracle Big Data Appliance software for use by
CDH components and other software packages.
The CDH master nodes (NameNode and JobTracker) resolve the group name so
that users cannot manipulate their group memberships.
Map tasks run under the identity of the user who submitted the job.
Authorization mechanisms in HDFS and MapReduce help control user access to
data.
3 Supporting User Access to Oracle Big Data Appliance
This chapter describes how you can support users who are running MapReduce jobs
on Oracle Big Data Appliance or using Oracle Big Data Connectors. It contains these
sections:
Providing Remote Client Access to CDH
Managing User Accounts
Recovering Deleted Files
Prerequisites
Ensure that you have met the following prerequisites:
You must have these access privileges:
Root access to the client system
Login access to Cloudera Manager
If you do not have these privileges, then contact your system administrator for
help.
The client system must run an operating system that Cloudera supports for CDH4.
For the list of supported operating systems, see "Before You Install CDH4 on a
Cluster" in the Cloudera CDH4 Installation Guide at
http://ccp.cloudera.com/display/CDH4DOC/Before+You+Install+CDH4+on+a+Cluster
The client system must run the same version of Oracle JDK as Oracle Big Data
Appliance. CDH4 requires Oracle JDK 1.6.
3. If the rpm command returns a value, then remove the existing Hadoop software:
rpm -e hadoop_rpm
4. Copy the following Linux RPMs to the database server from the first server of
Oracle Big Data Appliance. The RPMs are located in the
/opt/oracle/BDAMammoth/bdarepo/RPMS/x86_64 directory.
ed-version_number.x86_64.rpm
m4-version_number.x86_64.rpm
nc-version_number.x86_64.rpm
redhat-lsb-version_number.x86_64.rpm
5. Install the Oracle Linux RPMs from Step 4 on all database nodes. For example:
sudo yum --nogpgcheck localinstall ed-0.2-39.el5_2.x86_64.rpm
sudo yum --nogpgcheck localinstall m4-1.4.5-3.el5.1.x86_64.rpm
sudo yum --nogpgcheck localinstall nc-1.84-10.fc6.x86_64.rpm
sudo yum --nogpgcheck localinstall redhat-lsb-4.0-2.1.4.0.2.el5.x86_64.rpm
Be sure to install the Oracle Linux RPMs before installing the CDH RPMs.
6. Copy the following CDH RPMs from the
/opt/oracle/BDAMammoth/bdarepo/RPMS/noarch directory.
bigtop-utils-version_number.noarch.rpm
zookeeper-version_number.noarch.rpm
7. Copy the following CDH RPMs from the
/opt/oracle/BDAMammoth/bdarepo/RPMS/x86_64 directory.
hadoop-version_number.x86_64.rpm
bigtop-jsvc-version_number.x86_64.rpm
hadoop-hdfs-version_number.x86_64.rpm
hadoop-0.20-mapreduce-version_number.x86_64.rpm
hadoop-yarn-version_number.x86_64.rpm
hadoop-mapreduce-version_number.x86_64.rpm
hadoop-client-version_number.x86_64.rpm
8. Install the CDH RPMs in the exact order shown in Steps 6 and 7 on all database
servers. For example:
rpm -ihv bigtop-utils-0.4+502-1.cdh4.2.0.p0.12.el5.noarch.rpm
rpm -ihv zookeeper-3.4.5+14-1.cdh4.2.0.p0.12.el5.noarch.rpm
rpm -ihv hadoop-2.0.0+922-1.cdh4.2.0.p0.12.el5.x86_64.rpm
rpm -ihv bigtop-jsvc-1.0.10-1.cdh4.2.0.p0.13.el5.x86_64.rpm
rpm -ihv hadoop-hdfs-2.0.0+922-1.cdh4.2.0.p0.12.el5.x86_64.rpm
rpm -ihv hadoop-0.20-mapreduce-0.20.2+1341-1.cdh4.2.0.p0.21.el5.x86_64.rpm
rpm -ihv hadoop-yarn-2.0.0+922-1.cdh4.2.0.p0.12.el5.x86_64.rpm
rpm -ihv hadoop-mapreduce-2.0.0+922-1.cdh4.2.0.p0.12.el5.x86_64.rpm
rpm -ihv hadoop-client-2.0.0+922-1.cdh4.2.0.p0.12.el5.x86_64.rpm
2. To provide support for other components, such as Hive, Pig, or Oozie, see the
component installation instructions.
3. Configure the CDH client. See "Configuring CDH" on page 3-3.
Configuring CDH
After installing CDH, you must configure it for use with Oracle Big Data Appliance.
To configure the Hadoop client:
1. Open a browser on your client system and connect to Cloudera Manager. It runs
on the JobTracker node (node03) and listens on port 7180, as shown in this
example:
http://bda1node03.example.com:7180
2. Log in as admin.
3. On the Services tab, open the Actions menu for the cluster, and then select Client
Configuration URLs.
8. Delete the number sign (#) to uncomment the line, and then save the file.
9. Make a backup copy of the Hadoop configuration files:
# cp -r /full_path/hadoop-conf /full_path/hadoop-conf-bak
10. Overwrite the existing configuration files with the downloaded configuration files
in Step 6.
# cd /full_path/hadoop-conf
# cp * /usr/lib/hadoop/conf
11. Verify that you can access HDFS on Oracle Big Data Appliance from the client, by
entering a simple Hadoop file system command like the following:
$ hadoop fs -ls /user
Found 4 items
drwx------ - hdfs supergroup 0 2013-01-16 13:50 /user/hdfs
drwxr-xr-x - hive supergroup 0 2013-01-16 12:58 /user/hive
drwxr-xr-x - oozie hadoop 0 2013-01-16 13:01 /user/oozie
drwxr-xr-x - oracle hadoop 0 2013-01-29 12:50 /user/oracle
Check the output for HDFS users defined on Oracle Big Data Appliance, and not
on the client system. You should see the same results as you would after entering
the command directly on Oracle Big Data Appliance.
12. Validate the installation by submitting a MapReduce job. You must be logged in to
the host computer under the same user name as your HDFS user name on Oracle
Big Data Appliance.
The following example calculates the value of pi:
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.0.0-cdh4.2.0.jar pi 10 1000000
Number of Maps = 10
Samples per Map = 1000000
Wrote input for Map #0
Wrote input for Map #1
.
.
.
13/04/30 08:15:50 INFO mapred.JobClient: BYTES_READ=240
Job Finished in 12.403 seconds
Estimated value of Pi is 3.14158440000000000000
13. Use Cloudera Manager to verify that the job ran on Oracle Big Data Appliance
instead of the local system. Select mapreduce from the Activities menu for a list of
jobs.
Figure 3-1 shows the job created by the previous example.
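The hadoop jar example distributes the pi computation across the cluster, but the underlying Monte Carlo estimate itself is simple. This single-process sketch shows the same idea; the sample count and seed are illustrative:

```python
import random

def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling random points in the unit square and
    counting how many fall inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples

print(estimate_pi(1_000_000))  # close to 3.14159
```

In the MapReduce version, each map task generates its own batch of samples and the reduce step combines the counts, which is why the job scales with the number of maps.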
To create Hue users, open Hue in a browser and click the User Admin icon. See
"Using Hue to Interact With Hadoop" on page 2-6.
To create a Hadoop cluster user:
1. Open an ssh connection as the root user to a noncritical node (node04 to node18).
2. Create the user's home directory:
# sudo -u hdfs hadoop fs -mkdir /user/user_name
You use sudo because the HDFS super user is hdfs (not root).
3. Change the ownership of the directory:
# sudo -u hdfs hadoop fs -chown user_name:hadoop /user/user_name
5. Create the operating system user across all nodes in the cluster:
# dcli useradd -G hadoop,hive[,group_name...] -m user_name
In this syntax, replace group_name with an existing group and user_name with the
new name.
6. Verify that the operating system user belongs to the correct groups:
# dcli id user_name
7. Verify that the user's home directory was created on all nodes:
# dcli ls /home | grep user_name
Example 3-1 creates a user named jdoe with a primary group of hadoop and an
additional group of hive.
If the output shows either "Empty password" or "Password locked," then you must
set a password.
3. Set the password:
hash=$(echo 'password' | openssl passwd -1 -stdin); dcli "usermod --pass='$hash' user_name"
See Also:
Oracle Big Data Appliance Owner's Guide for information about
dcli.
The Linux man page for the full syntax of the useradd command.
/user/oracle/.Trash/Current/user/oracle/ontime_s.dat
2. Move or copy the file to its previous location. The following example moves
ontime_s.dat from the trash to the HDFS /user/oracle directory.
$ hadoop fs -mv .Trash/Current/user/oracle/ontime_s.dat /user/oracle/ontime_s.dat
Note: If you do not want any clients to use the trash, then you can
completely disable the trash facility. See "Completely Disabling the
Trash Facility" on page 3-9.
SQL Connector for HDFS and Oracle R Connector for Hadoop are examples of remote
clients.
To disable the trash facility for a remote HDFS client:
1. Open a connection to the system where the CDH client is installed.
2. Open /etc/hadoop/conf/hdfs-site.xml in a text editor.
3. Change the trash interval to zero:
<property>
<name>fs.trash.interval</name>
<value>0</value>
</property>
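The effect of that property change can be checked programmatically. This sketch reads fs.trash.interval from a hypothetical inline copy of hdfs-site.xml; in practice you would parse the file at /etc/hadoop/conf/hdfs-site.xml:

```python
import xml.etree.ElementTree as ET

# Hypothetical inline copy of the edited hdfs-site.xml fragment.
hdfs_site = """
<configuration>
  <property>
    <name>fs.trash.interval</name>
    <value>0</value>
  </property>
</configuration>
"""

root = ET.fromstring(hdfs_site)
interval = None
for prop in root.findall("property"):
    if prop.findtext("name") == "fs.trash.interval":
        interval = int(prop.findtext("value"))

print("trash disabled" if interval == 0 else f"trash interval: {interval} min")
# trash disabled
```

A value of 0 disables the trash facility for that client; any positive value is the number of minutes a deleted file is retained before being purged.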
Prerequisites
Oracle Big Data Appliance and Oracle Exadata Database Machine racks must be
cabled together using InfiniBand cables. The IP addresses must be unique across all
racks and use the same subnet for the InfiniBand network.
See Also:
Oracle Big Data Appliance Owner's Guide about multirack cabling
Oracle Big Data Appliance Configuration Worksheets about IP
addresses and subnets
Note: This example lists two nodes for an Oracle Exadata Database
Machine quarter rack. If you have an Oracle Exadata Database
Machine half or full rack, you must repeat node-specific lines for each
node in the cluster.
1. Edit /etc/hosts on each node in the Exadata rack to add the virtual IP addresses
for the InfiniBand network. Make sure that these IP addresses are not in use. For
example:
# Added for Listener over IB
192.168.10.21 dm01db01-ibvip.example.com dm01db01-ibvip
192.168.10.22 dm01db02-ibvip.example.com dm01db02-ibvip
2. As the root user, create a network resource on one database node for the
InfiniBand network. For example:
# /u01/app/grid/product/11.2.0.2/bin/srvctl add network -k 2 -S 192.168.10.0/255.255.255.0/bondib0
3. Verify that the network was added correctly with one of the following commands:
# /u01/app/grid/product/11.2.0.2/bin/crsctl stat res -t | grep net
ora.net1.network
ora.net2.network -- Output indicating new Network resource
or
# /u01/app/grid/product/11.2.0.2/bin/srvctl config network -k 2
Network exists: 2/192.168.10.0/255.255.255.0/bondib0, type static -- Output indicating Network resource on the 192.168.10.0 subnet
4. Add the virtual IP addresses on the network created in Step 2, for each node in the
cluster:
# srvctl add vip -n dm01db01 -A dm01db01-ibvip/255.255.255.0/bondib0 -k 2
# srvctl add vip -n dm01db02 -A dm01db02-ibvip/255.255.255.0/bondib0 -k 2
5. As the oracle user who owns the Grid Infrastructure home, add a listener for the
virtual IP addresses created in Step 4:
# srvctl add listener -l LISTENER_IB -k 2 -p TCP:1522,/SDP:1522
6. For each database that will accept connections from the middle tier, modify the
listener_networks init parameter to allow load balancing and failover across
multiple networks (Ethernet and InfiniBand). You can either enter the full
TNSNAMES syntax in the initialization parameter or create entries in tnsnames.ora
in the $ORACLE_HOME/network/admin directory. The TNSNAMES.ORA entries must
exist in GRID_HOME. The following example first updates tnsnames.ora.
Complete this step on each node in the cluster with the correct IP addresses for
that node. LISTENER_IBREMOTE should list all other nodes that are in the cluster.
DBM_IB should list all nodes in the cluster.
DBM =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = dm01-scan)(PORT = 1521))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = dbm)
))
DBM_IB =
(DESCRIPTION =
(LOAD_BALANCE=on)
(ADDRESS = (PROTOCOL = TCP)(HOST = dm01db01-ibvip)(PORT = 1522))
(ADDRESS = (PROTOCOL = TCP)(HOST = dm01db02-ibvip)(PORT = 1522))
(CONNECT_DATA =
(SERVER = DEDICATED)
(SERVICE_NAME = dbm)
))
LISTENER_IBREMOTE =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = dm01db02-ibvip.mycompany.com)(PORT = 1522))
))
LISTENER_IBLOCAL =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = dm01db01-ibvip.mycompany.com)(PORT = 1522))
(ADDRESS = (PROTOCOL = SDP)(HOST = dm01db01-ibvip.mycompany.com)(PORT = 1522))
))
LISTENER_IPLOCAL =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = dm0101-vip.mycompany.com)(PORT = 1521))
))
LISTENER_IPREMOTE =
(DESCRIPTION =
(ADDRESS_LIST =
(ADDRESS = (PROTOCOL = TCP)(HOST = dm01-scan.mycompany.com)(PORT = 1521))
))
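With the tnsnames.ora entries in place, the listener_networks initialization parameter can then be set for the database, pairing the local and remote listeners for each network. The entry names below mirror the example above; treat this as a sketch of the parameter format rather than the verbatim command for your environment:

```sql
ALTER SYSTEM SET listener_networks=
  '((NAME=network1)(LOCAL_LISTENER=LISTENER_IPLOCAL)(REMOTE_LISTENER=LISTENER_IPREMOTE))',
  '((NAME=network2)(LOCAL_LISTENER=LISTENER_IBLOCAL)(REMOTE_LISTENER=LISTENER_IBREMOTE))'
  SCOPE=BOTH;
```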
Ensure that the hosts line in /etc/nsswitch.conf does not reverse the order (dns files); if it
does, your additions to /etc/hosts are not used. Edit the file if necessary.
ASR
Oracle Auto Service Request, a software tool that monitors the health of the hardware
and automatically generates a service request if it detects a problem.
See also OASM.
Balancer
A service that ensures that all nodes in the cluster store about the same amount of
data, within a set range. Data is balanced over the nodes in the cluster, not over the
disks in a node.
CDH
Cloudera's Distribution including Apache Hadoop, the version of Apache Hadoop
and related components installed on Oracle Big Data Appliance.
cluster
A group of servers on a network that are configured to work together. A server is
either a master node or a worker node.
All servers in an Oracle Big Data Appliance rack form a cluster. Servers 1, 2, and 3 are
master nodes. Servers 4 to 18 are worker nodes.
See Hadoop.
DataNode
A server in a CDH cluster that stores data in HDFS. A DataNode performs file system
operations assigned by the NameNode.
See also HDFS; NameNode.
Flume
A distributed service in CDH for collecting and aggregating data from almost any
source into a data store such as HDFS or HBase.
See also HBase; HDFS.
Hadoop
A batch processing infrastructure that stores files and distributes work across a group
of servers. Oracle Big Data Appliance uses Cloudera's Distribution including Apache
Hadoop (CDH).
Glossary-1
HBase
An open-source, column-oriented database that provides random, read/write access
to large amounts of sparse data stored in a CDH cluster. It provides fast lookup of
values by key and can perform thousands of insert, update, and delete operations per
second.
HDFS
Hadoop Distributed File System, an open-source file system designed to store
extremely large data files (megabytes to petabytes) with streaming data access
patterns. HDFS splits these files into data blocks and distributes the blocks across a
CDH cluster.
When a data set is larger than the storage capacity of a single computer, then it must
be partitioned across several computers. A distributed file system can manage the
storage of a data set across a network of computers.
See also cluster.
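The splitting described above is simple arithmetic. In this illustrative sketch, the 64 MB block size is an assumption for the example; the actual value is controlled by the cluster configuration:

```python
import math

def hdfs_block_count(file_size_bytes, block_size_bytes=64 * 1024 * 1024):
    """Number of HDFS blocks needed to store a file of the given size."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# A 1 GB file with 64 MB blocks occupies 16 blocks, which HDFS
# distributes (and replicates) across the DataNodes in the cluster.
```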
Hive
An open-source data warehouse in CDH that supports data summarization, ad hoc
querying, and data analysis of data stored in HDFS. It uses a SQL-like language called
HiveQL. An interpreter generates MapReduce code from the HiveQL queries.
By using Hive, you can avoid writing MapReduce programs in Java.
See also Hive Thrift; HiveQL; MapReduce.
Hive Thrift
A remote procedure call (RPC) interface for remote access to CDH for Hive queries.
See also CDH; Hive.
HiveQL
A SQL-like query language used by Hive.
See also Hive.
HotSpot
A Java Virtual Machine (JVM) that is maintained and distributed by Oracle. It
automatically optimizes code that executes frequently, leading to high performance.
HotSpot is the standard JVM for the other components of the Oracle Big Data
Appliance stack.
Hue
Hadoop User Experience, a web user interface in CDH that includes several
applications, including a file browser for HDFS, a job browser, an account
management tool, a MapReduce job designer, and Hive wizards. Cloudera Manager
runs on Hue.
See also HDFS; Hive.
Glossary-2
JobTracker
A service that assigns MapReduce tasks to specific nodes in the CDH cluster,
preferably those nodes storing the data.
See also Hadoop; MapReduce.
MapReduce
A parallel programming model for processing data on a distributed system.
A MapReduce program contains these functions:
Mappers: Process the records of the data set.
Reducers: Merge the output from several mappers.
Combiners: Optionally consolidate the result sets from the mappers before
sending them to the reducers.
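The three functions can be illustrated with a word count, the canonical MapReduce example. This is plain Python standing in for a real MapReduce framework, which would run the mappers and reducers on different nodes:

```python
from collections import Counter

def mapper(line):
    """Process one record: emit a (word, 1) pair for each word."""
    return [(word, 1) for word in line.split()]

def combiner(pairs):
    """Optionally pre-aggregate one mapper's output before the shuffle."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reducer(all_pairs):
    """Merge the (combined) output from several mappers."""
    totals = Counter()
    for word, n in all_pairs:
        totals[word] += n
    return dict(totals)

lines = ["big data", "big cluster"]
combined = [pair for line in lines for pair in combiner(mapper(line))]
word_counts = reducer(combined)  # {'big': 2, 'data': 1, 'cluster': 1}
```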
MySQL Server
A SQL-based relational database management system. Cloudera Manager, Oracle Data
Integrator, Hive, and Oozie use MySQL Server as a metadata repository on Oracle Big
Data Appliance.
NameNode
A service that maintains a directory of all files in HDFS and tracks where data is stored
in the CDH cluster.
See also HDFS.
node
A server in a CDH cluster.
See also cluster.
NoSQL Database
See Oracle NoSQL Database.
OASM
Oracle Automated Service Manager, a service for monitoring the health of Oracle Sun
hardware systems. Formerly named Sun Automatic Service Manager (SASM).
Oozie
An open-source workflow and coordination service for managing data processing jobs
in CDH.
Oracle Linux
An open-source operating system. Oracle Linux 5.6 is the same version used by
Exalogic 1.1. It features the Oracle Unbreakable Enterprise Kernel.
Glossary-3
Oracle R Distribution
An Oracle-supported distribution of the R open-source language and environment for
statistical analysis and graphing.
Oracle R Enterprise
A component of the Oracle Advanced Analytics Option. It enables R users to run R
commands and scripts for statistical and graphical analyses on data stored in an
Oracle database.
Pig
An open-source platform for analyzing large data sets that consists of the following:
Pig Latin scripting language
Pig interpreter that converts Pig Latin scripts into MapReduce jobs
Pig runs as a client application.
See also MapReduce.
Puppet
A configuration management tool for deploying and configuring software components
across a cluster. The Oracle Big Data Appliance initial software installation uses
Puppet.
The Puppet tool consists of these components: puppet agents, typically just called
puppets; the puppet master server; a console; and a cloud provisioner.
See also puppet agent; puppet master.
puppet agent
A service that primarily pulls configurations from the puppet master and applies
them. Puppet agents run on every server in Oracle Big Data Appliance.
See also Puppet; puppet master.
puppet master
A service that primarily serves configurations to the puppet agents.
See also Puppet; puppet agent.
Sqoop
A command-line tool that imports and exports data between HDFS or Hive and
structured databases. The name Sqoop comes from "SQL to Hadoop." Oracle R
Connector for Hadoop uses the Sqoop executable to move data between HDFS and
Oracle Database.
table
In Hive, all files in a directory stored in HDFS.
See also HDFS.
Glossary-4
TaskTracker
A service that runs on each node and executes the tasks assigned to it by the
JobTracker service.
See also JobTracker.
ZooKeeper
A centralized coordination service for CDH distributed processes that maintains
configuration information and naming, and provides distributed synchronization and
group services.
Glossary-5