06 Mapreduce Lab
Information Management
Map/Reduce
Hands-On Lab
IBM Software
Table of Contents
1 Introduction
2 About this Lab
3 Background
4 Preparation
5 Create a BigInsights Project
6 Create a new Java Map/Reduce Program
7 Implementing the Java Map/Reduce Program
8 Running the Java Map/Reduce Program
9 Debugging the Java Map/Reduce Program Locally
10 Running the Java Map/Reduce Program on the Cluster
11 Debugging the Failed Map Task
12 Summary
1 Introduction
The BigInsights tools for Eclipse are the development tooling that comes with InfoSphere BigInsights and include wizards, code generators, context-sensitive help, and a test environment to simplify your development efforts for programs written in Jaql, Pig, Hive or Java map/reduce. In addition, you can package a program into a BigInsights application and publish it to the BigInsights console to make the program available for other users to run.
This lab focuses on the development of Java map/reduce programs and covers all steps from creating a new map/reduce program, running it locally with a small subset of data, and running it against the cluster, to debugging task attempts to find problems. For creating a BigInsights application and publishing and running it on the cluster, see the BigInsights application development lab.
The lab can be run in Eclipse on Linux or Windows. For Windows, follow the prerequisites in the install section for the BigInsights tools for Eclipse in the information center.
Among other things, you will:
- Run the Java map/reduce program on the cluster, monitor the jobs and see the log output
- Debug failed task attempts on the cluster from your Eclipse environment
3 Background
The Java map/reduce program that you will write analyzes sales data from a CSV file (comma-separated values) and counts the orders per customer id. The program is kept simple because the lab focuses on how the available tools can help you during development and debugging.
The data that the program will be analyzing is a comma-separated file with sales data that contains order and customer
information.
The goal of the program is to find loyal customers who have ordered more than 3 times. A map/reduce program is ideal for that scenario, especially for a large amount of sales data. The mapper will process the input file line by line, and since each line can be handled independently, the work can easily be spread across multiple nodes. For each line, the mapper will find the customer id, which is the third field in the file (CUST_CODE), and create an entry with the customer id as the key and the value 1 to indicate that the customer has placed an order.
The reducer will take the output of all the mappers and add up the orders by counting how many entries exist for a
particular customer id.
The file we will analyze contains standard comma-separated data of customer sales.
CUST_ORDER_NUMBER,CUST_ORDER_DATE,CUST_CODE,CUST_CC_ID,PAY_METHOD_CODE,
166517,"2006-09-07-08.43.15.017000",125412,35582,29,5,3,1,1
167083,"2006-10-28-07.50.14.017000",125412,35803,22,5,3,2,2,
167336,"2006-10-15-19.20.32.017000",125412,35903,28,5,3,2,2,
As you can see, the first column contains the customer order number, followed by the order date and the customer code. We will filter the document in the map task, splitting each row into its fields and extracting the customer id. We will then send the customer ids to the reducer. The reducer will compute a total count for each customer id and then remove all customer ids that have fewer than 4 orders. This leaves us with the repeat customers.
In total, our MapReduce program will do very much the same work as a simple SQL query.
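A rough SQL equivalent, assuming the sales data were loaded into a table named SALES (the table name and the ORDER_COUNT alias are illustrative, not part of the lab files), would be:

SELECT CUST_CODE, COUNT(*) AS ORDER_COUNT
FROM SALES
GROUP BY CUST_CODE
HAVING COUNT(*) > 3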
As explained in the presentation, map tasks normally execute per-row computations such as WHERE conditions and scalar functions, while the shuffle phase handles joins and groupings. Finally, the reducer executes tasks such as aggregations and HAVING clauses.
1. From your desktop, launch BigInsights Studio using its icon. Note: the Eclipse IDE has been pre-loaded with the BigInsights tooling for this lab.
2. Open the BigInsights perspective. Go to Window -> Open Perspective -> Other and select BigInsights.
3. To create a new project, go to File -> New -> BigInsights Project. If the BigInsights Project menu is not visible,
you can also go to File -> New -> Project and select BigInsights Project from the BigInsights folder.
Enter Sales as the project name and select Finish.
4. The project Sales will appear in the Project Explorer view. Click the arrow next to Sales to expand the project.
You will see a src folder where all your Java source files will be stored. You'll also see two library entries: JRE System Library points to the Java runtime used to compile your project files, and BigInsights v2.0.0.0 includes all the BigInsights libraries that you will need for development against the BigInsights cluster. This includes the Hadoop libraries that are needed to compile Java map/reduce programs. If you use a different version of BigInsights, the BigInsights library entry will show a version other than 2.0.0.0 and the libraries underneath will have different versions based on what is bundled with that specific version of the BigInsights platform.
1. In the BigInsights perspective, select File -> New -> Java MapReduce program.
If the Java MapReduce menu is not visible, you can also select File -> New -> Other and select Java
MapReduce program from the BigInsights folder.
2. A new wizard opens with three pages covering the details for a mapper and a reducer implementation as well as a driver class. For both the mapper and the reducer we have to specify the input and output key and value pairs.
3. For this program, the input will be a CSV file and the mapper will be called with one row of data at a time. The input key is the position of the row within the file (we will use it later to skip the header row), so we specify the input key type as LongWritable; the value is one row of text, so the input value type is Text.
The customer id will be the key of the mapper output, so the output key type is Text. The output value is the order count contributed by each record (the number 1), so it should be an IntWritable.
Therefore, fill in the following types for input and output keys/values.
Type of input keys : org.apache.hadoop.io.LongWritable
Type of input values : org.apache.hadoop.io.Text
Type of output keys : org.apache.hadoop.io.Text
Type of output values : org.apache.hadoop.io.IntWritable
4. Click Next.
5. On the next page, we have to specify the values for the reducer implementation. The package is pre-filled with the same package as the mapper, but you can change that value if you'd like.
6. The types for the input keys and values are copied from the output keys and values of the mapper class definition, because the output of the mapper is processed as input by the reducer.
For this program, the reducer aggregates the number of orders for each customer id. Hence we set the type of the output keys to Text and the type of the output values to IntWritable.
The wizard page for the reducer should look like this.
7. Click Next.
8. On the last page of the wizard we only have to select a package and specify the name of the driver class.
Set the Package and Name fields to the following values.
Package : myPackage
Name : SalesDriver
9. Click Finish.
10. After clicking Finish, the wizard creates templates for the three classes and opens them in a Java editor.
The mapper and reducer classes extend the Hadoop Mapper and Reducer base classes, and it's up to us to implement the methods.
The driver class creates a new Hadoop job configuration and sets the mapper and reducer classes accordingly.
Mapper:
Reducer:
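A rough sketch of the two generated skeletons (this assumes both classes were created in the package myPackage; the exact code produced by the wizard may differ slightly):

SalesMapper.java:

package myPackage;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SalesMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // TODO: implement the map logic here
    }
}

SalesReducer.java:

package myPackage;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SalesReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // TODO: implement the reduce logic here
    }
}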
1. SalesDriver.java:
The only change we have to make to the driver class is to read the input and output paths from the program arguments, so that we can specify these values at run time. Replace lines 29 and 31, where TODO is specified, with the following code.
job.setOutputValueClass(IntWritable.class);
// TODO: Update the input path for the location of the inputs of the job
FileInputFormat.addInputPath(job, new Path(programArgs[0]));
// TODO: Update the output path for the output directory of the job
FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));
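After this change, the complete main method might look roughly like the following sketch (the GenericOptionsParser call, which is where programArgs would come from in the generated template, is an assumption; the exact generated code may differ):

package myPackage;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class SalesDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // separate generic Hadoop options (-D, -libjars, ...) from our own program arguments
        String[] programArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

        Job job = new Job(conf, "sales analysis");
        job.setJarByClass(SalesDriver.class);
        job.setMapperClass(SalesMapper.class);
        job.setReducerClass(SalesReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // input file or directory and output directory are taken from the program arguments
        FileInputFormat.addInputPath(job, new Path(programArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}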
2. SalesMapper.java:
Enter the following code into the map method.
if (key.get() > 0) {
    String line = value.toString();
    String[] elements = line.split(",");
    // the customer id (CUST_CODE) is the third field; emit it with a count of 1
    context.write(new Text(elements[2]), new IntWritable(1));
}
This code takes each input line of text and splits it into its elements, in our case comma-separated values. Since the customer id is the third element, we retrieve it with elements[2] (arrays in Java start at 0). Finally, we output the customer id and the number 1 as input for the reducer. The shuffle phase will automatically take all output values with the same customer id, send them to the same reducer, and group all the 1 values together.
You may be wondering what happens to the first row of the data file, which contains the column names. As described in the presentation, the default TextInputFormat class splits input files into rows and uses as the key the byte offset of the start of each row from the start of the file. For the first row the key is therefore 0, and because we filter the rows with key.get() > 0, the header row is effectively ignored.
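To make this concrete, the mapper is called with key/value pairs like the following for our sample file (the byte-offset keys shown here are illustrative, not exact) and emits the pairs on the right:

(0,   CUST_ORDER_NUMBER,CUST_ORDER_DATE,...)           ->  skipped (header row, key is 0)
(60,  166517,"2006-09-07-08.43.15.017000",125412,...)   ->  emits (125412, 1)
(125, 167083,"2006-10-28-07.50.14.017000",125412,...)   ->  emits (125412, 1)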
3. SalesReducer.java:
Enter the following code into the reduce method.
int count = 0;
// each value is a 1 emitted by the mapper, so counting the values counts the orders
for (IntWritable value : values) {
    count++;
}
// only keep customers with more than 3 orders
if (count > 3) {
    context.write(key, new IntWritable(count));
}
In the reducer we receive each customer id (as the key) together with an iterable of input values, essentially a list of 1s, one for each order row with that customer id. We first count the 1s and then filter out all customer ids with fewer than 4 orders. The remaining customers are sent to the output file by writing them into the context together with the number of times they ordered. If, for example, customer 125412 placed four orders, the reducer is called with the key 125412 and the values [1, 1, 1, 1] and writes (125412, 4) to the output.
2. A new Run Configurations dialog opens. Double-click the Java MapReduce entry in the tree on the left-hand side to create a new, empty run configuration.
3. In the Main tab, select the project Sales, and select the main class myPackage.SalesDriver. You can also give the run configuration a more meaningful name in the Name field; use SalesDriver for now.
4. In the BigInsights section, we can select whether we want to run against the cluster (the default) or locally. For now we want to run on our local Eclipse machine, so we select Local.
5. Leave the Job name blank for now since the job name is an optional parameter. You can set it to override the job name that Hadoop automatically assigns when running a job.
6. Our program requires two input arguments: a value for the input data (a directory or a specific file) and a folder where the output will be stored. Since we want to run locally, specify directories in your local file system. As the input file, use the location of the file data_small.csv on your local system.
/home/biadmin/labs/mapreduce/data_small.csv /home/biadmin/labs/mapreduce/output
7. The Delete directory is set to the same value as the output directory of the program. If the output directory already exists when the program runs, Hadoop will throw an error that the directory already exists. Instead of deleting it manually every time before running the program, we can simply set the delete directory, which comes in handy when developing a new program or debugging.
/home/biadmin/labs/mapreduce/output
8. With the values filled in, the dialog should look like the figure below.
9. In addition to the settings in the Main tab, we'll also have to make a few changes in the JAR Settings tab. Click on the JAR Settings tab to open it.
10. In order to run the map/reduce program, it is handed to Hadoop as a JAR file. In the JAR Settings tab you have the option to specify an existing JAR file or to let the JAR be rebuilt every time you run the program, using the source in your BigInsights project.
Click Browse to open a file selection dialog, keep the default location (which is the location of your BigInsights project), and type in the name salesprogram. Click OK to close the Browse dialog.
11. Check the checkbox Rebuild the JAR file on each run. This ensures that the JAR is rebuilt with the latest source files every time you run the program.
12. Expand the Sales tree below it and select the src folder to include all Java files.
The page should then look like the figure below, and the error indicator next to the JAR Settings name should disappear.
13. In the Additional JAR files list, you can add additional JAR files that your program may need; they will be bundled along with the JAR and added to the classpath when Hadoop runs your program. Our sample program doesn't require any additional JAR files, so we keep the list empty.
14. Click Run to run the program. Since we are running locally, you will see a console window with the output from
Hadoop.
15. When the job is completed, you'll see a line that indicates that both map and reduce are 100% done.
16. To see the output of the program, use the file browser to navigate to the output directory that you specified in the run configuration. You'll see two files: a success file from Hadoop and a part file that contains the output of your program, one line per qualifying customer with the customer id and the order count separated by a tab. If you open the part file, you'll see that there is one customer who has ordered more than 3 times.
First we want to set a couple of breakpoints to suspend the program execution at certain points.
2. In the editor, open the file SalesReducer.java and set a breakpoint on the line
int count = 0;
3. (Optional) To only debug the reducer for a particular customer id, we can set a breakpoint property to only
suspend the program at the breakpoint if a certain condition is true. For example, we can check the input key of
the reducer (which is the customer id), and check for its value. To set breakpoint properties, right click on the
breakpoint and select Breakpoint Properties from the context menu.
4. In the dialog, check Conditional and type in a condition. You can use content assist to ensure that you type valid Java syntax. In our example, let's only stop in the reducer if our customer id is 125412:
key.toString().equals("125412");
6. To launch the program in debug mode, we can re-use the same run configuration as before but launch it in debug
mode. Click Run -> Debug Configurations and select the previously created run configuration under the Java
MapReduce entry in the tree.
9. The program will be suspended in the SalesMapper class. You will see the stack trace in the Debug view, the
variables and their values in the Variables view, and the current line in the editor window.
10. Step through the code to see how the variables change. You can use either the navigation icons in the Debug
view, or the shortcut keys F5 to F8. You will notice that the program is suspended for every row in the mapper
class.
11. Disable or remove the breakpoint in SalesMapper.java to be able to continue the program without interruption and
proceed to the reducer class.
12. Since we set the breakpoint properties in the reducer class, the program is only suspended if the key value is
125412.
Using conditional breakpoint properties is very useful when dealing with a large amount of input rows.
1. In the BigInsights Perspective, click on the Add a new server icon in the BigInsights servers view (bottom-left
corner).
2. When prompted, specify the URL for your BigInsights web console as well as user id and password. The server
name will self-populate once you enter a value into the URL field, but you can change the value:
URL : http://bigdata:8080/data/html/index.html
User ID : biadmin
Password : passw0rd
3. Click Finish to register the BigInsights server in the BigInsights Servers view.
To run on the cluster, we still need a run configuration and we can use the one we created for the local run as the
base to create a new configuration.
4. Since we will now run on the cluster, we first need to upload the data files to the HDFS file system. Switch to a terminal window.
5. Make sure you are in the mapreduce lab directory or enter it with
10. Click Run -> Run Configurations and select the previously created run configuration. Because we have to change the locations of the input and output directories to reflect the locations in the Hadoop file system, we will first create a copy of the run configuration so that we don't have to fill out everything again and can easily switch between local and cluster mode without changing our program parameters.
11. In the Run Configurations dialog, click the Duplicate button to create a copy of the existing run configuration and give the new run configuration a new name.
12. In the duplicated configuration, select Cluster as the execution mode and select the BigInsights server that you just registered.
13. In the job arguments section, we need to update the input and output locations to paths in HDFS. This time we want to use the input file data_error.csv. Your user id needs write access to the output directory in HDFS, which biadmin has. Use the same value for the delete directory as for the output directory.
Be careful with the paths: the user directory on HDFS is /user/biadmin, which is different from /home/biadmin on the Linux file system. Also make sure to remove the labs directory from the paths.
15. When the program is run on the cluster, the JAR of the program is copied to the cluster and run there using the Hadoop command. A dialog with progress information will pop up in Eclipse, and once the program has been successfully submitted to the server, you will see the job id information in the Console window of your Eclipse workspace:
To see the details of your job, you need to go to the BigInsights console.
16. Open a web browser and go to the URL of the BigInsights console. Log on with your user ID and password.
17. Click on the Application Status page and then click on the sub-link Jobs.
The jobs page shows the job status information for any Hadoop job and is the starting point to drill down into more
job details.
In our case, the job failed and we want to investigate what caused the job to fail.
18. To drill down into the job details, click on the row of the job. At the bottom of the page, the breakdown of setup, map, reduce and cleanup tasks is shown in a table, and you can see that the map task failed and was not successful.
19. Click on the map row to drill down into more details. Keep clicking on the task attempts until you reach the page
with the log files. To have more space to see the log file content, click the Window icon on the right:
In the log file page, you'll see the stdout, stderr and syslog output from the job.
In our case, you can see that there was an ArrayIndexOutOfBoundsException in the SalesMapper.java class
in line 20.
We could now set up a remote debugging session that uses the IsolationRunner helper, but that would go beyond the scope of this lab. If we look at the line in the code, we can guess what the problem is: if any line in the data file has fewer than 3 fields, that line will cause the mapper to fail. We should therefore make our program more robust by checking for these eventualities, either by ignoring incomplete rows (for example by catching the ArrayIndexOutOfBoundsException) or by shutting down gracefully.
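A minimal sketch of such a defensive map method, if we simply want to skip incomplete rows (this is one possible fix, not the code generated by the lab):

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    if (key.get() == 0) {
        return; // header row
    }
    String[] elements = value.toString().split(",");
    // skip rows that do not contain at least the first three fields
    if (elements.length < 3 || elements[2].isEmpty()) {
        return;
    }
    context.write(new Text(elements[2]), new IntWritable(1));
}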
10 Summary
To simplify the development of Big Data programs, BigInsights Enterprise Edition provides Eclipse tooling that features wizards, code generators, context-sensitive help, and a test environment. This lab provided a quick tour of the development of Java map/reduce programs and showed you how you can create, run and debug these programs in your local Eclipse environment or on the cluster.
IBM Canada
8200 Warden Avenue
Markham, ON
L6G 1C7
Canada
IBM, the IBM logo, ibm.com and Tivoli are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at ibm.com/legal/copytrade.shtml
Product data has been reviewed for accuracy as of the date of initial publication. Product data is subject to change without notice. Any statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.