Design and Implementation of High-Level Compute On Android Systems


Hung-Shuen Chen†, Jr-Yuan Chiou†, Cheng-Yan Yang‡,
Yi-jui Wu‡, Wei-chung Hwang‡, Hao-Chien Hung†, Shih-wei Liao†*
† National Taiwan University, Taipei, Taiwan
‡ Industrial Technology Research Institute, Hsinchu, Taiwan
* Corresponding author: [email protected]

978-1-4799-1284-1/13/$31.00 © 2013 IEEE

Abstract—As Android devices come with various CPU and GPU cores, the demand for effective parallel computing across the board increases. In response, Google has released Renderscript to leverage parallel computing while maintaining portability. However, adoption has been slow: we hardly see any Renderscript apps on Google Play. In the meantime, the proliferation of heterogeneous cores inside a single device calls for a higher-level, more developer-friendly parallel language. Since most Android developers already use Java, we develop the first Java-based compute system on Android, called Android-Aparapi. Android-Aparapi facilitates programmers' adoption of compute by obviating the need to learn a new language like Renderscript, so that software can start catching up with the hardware trend of periodically doubling the number of cores. Furthermore, Android-Aparapi is a better-defined API than the original Aparapi: we support a comprehensive set of data types and their array forms in terms of Java objects. In addition, we propose innovative optimizations that effectively reduce the Java Native Interface (JNI) overhead. Finally, we develop an Android-Aparapi benchmark suite by extending the Rodinia benchmark to Android. The Java thread pool version of the suite runs 3 times slower than the Android-Aparapi version, which demonstrates the effectiveness of our high-level compute system. Furthermore, we compare our Android-Aparapi version with the existing lower-level OpenCL version and show that the performance is comparable (our performance is 88% of OpenCL's). In short, we achieve a higher-level abstraction without a sizable loss in performance.

I. INTRODUCTION

Today's Android devices are heterogeneous computing systems. According to Wikipedia, such systems are “electronic systems that use a variety of different types of computational units.” Compute in this context refers to computation-intensive processing. Examples are photo editing and computational photography. GPGPU (General-Purpose computing on Graphics Processing Units) is a popular form of compute: it uses heterogeneous systems such as mobile GPUs to perform compute instead of graphics rendering. Note that compute is more general than GPGPU compute: Android may use a DSP (Digital Signal Processor) or the CPU (Central Processing Unit) for compute too.

Sadly, an undeniable fact is that compute has not taken off in the mobile world. We hardly see any compute apps on Google Play. We believe existing compute languages such as OpenCL[1] and Renderscript[2] are too low-level for mobile app developers. There are few HPC (High-Performance Computing) programmers in the world of App Store or Google Play. To address this problem, we develop the first Java-based compute system on Android, called Android-Aparapi.

A. Android-Aparapi: Higher-level than Renderscript

The proliferation of heterogeneous cores inside a single device calls for a more developer-friendly parallel language. Google Renderscript is Android's official heterogeneous computing framework. Renderscript claims to be the high-performance API (Application Programming Interface) for compute on Android. Although Renderscript aims for broad support from GPU or DSP vendors, few GPUs support Renderscript thus far. By contrast, OpenCL's support among GPU vendors is becoming universal: OpenCL has broad support in both desktop and mobile GPUs. By going for the higher-level Android-Aparapi, we can avoid the embarrassing situation where developers' Renderscript programs cannot be deployed on most mobile GPUs today. Instead, our system ensures broad deployability because we translate Aparapi to the lower-level APIs available on the GPUs of the day, be they OpenCL or Renderscript APIs.

Furthermore, in Android systems, Java is already the primary development language. Thanks to our high-level Java-based compute system, app developers are not required to learn a new language such as Renderscript in order to implement GPGPU on Android.

B. Android-Aparapi's Backend: Targeting OpenCL today

Today we translate Android-Aparapi into OpenCL, since more GPUs support it. While Android developers are mostly Java programmers, OpenCL is based on the C99 programming language. We do not recommend that developers use OpenCL directly, because doing so on today's Android systems may require the Android Native Development Kit (NDK). The NDK[3] allows developers from the world of native code, such as the non-portable C and C++ languages, to write applications. Adding OpenCL complicates the programming story further.

Android-Aparapi is based on Aparapi[4], which stands for “A PARallel API.” Aparapi allows programmers to write code using only Java and execute it on GPUs via OpenCL code generated from the Java source. We successfully adapt Aparapi to Android's Dalvik, which uses dex instead of Java bytecode. In addition, we develop the Android-Aparapi version of the Rodinia[5] benchmark and conduct experiments.
The contributions of this paper are as follows:
• To satisfy the booming demand for heterogeneous computing programs, we develop Android-Aparapi on Android devices;
• We implement the Rodinia benchmark in Android-Aparapi;
• We use the enhanced Rodinia benchmark to evaluate Android-Aparapi's performance on devices in both GPU and JTP (Java thread pool) modes.
The rest of the paper is organized as follows. Section II provides an overview of the existing parallel APIs that we build upon, OpenCL and Aparapi. Section III describes the architecture of Android-Aparapi, and Section IV presents the better-defined APIs of Android-Aparapi, as compared to Aparapi. We present our important performance optimizations in Android-Aparapi in Section V. Section VI shows the performance of Android-Aparapi on Android devices. Finally, Section VII details related work, and Section VIII concludes the paper and presents some future work.

II. PARALLEL APIS: APARAPI AND OPENCL

Our system, Android-Aparapi, builds upon the Aparapi and OpenCL of today. Before presenting the architecture of Android-Aparapi, we give an overview of OpenCL and Aparapi in this section. Section II-A provides an overview of OpenCL before we describe Aparapi in Section II-B. The former paves the way for the Aparapi discussion later.

A. OpenCL

OpenCL is an open, royalty-free standard for the cross-platform parallel programming of modern processors such as GPUs and multicore CPUs. Apple Inc. initially developed the standard, and it is currently managed by the Khronos Group. Users can substantially improve the speed of a wide range of applications through OpenCL, without any proprietary hardware function call.

OpenCL consists of two distinct parts. First, the C99-based kernel (the parallel portion of the programming language) can execute on and synchronize the processing cores. A kernel is a special function that executes the parallel portions, in the same manner that a Renderscript root function does. Second, there is a host program and a runtime library. The host code submits work to devices; it contains instructions for setting up the environment and the arguments, reading back results, and managing the kernel command queue. Moreover, the primary advantage of using OpenCL is that developers are able to dynamically allocate memory by extracting low-level hardware information, such as the work-group size. This benefit allows a program to achieve consistent performance across various devices.

In OpenCL, the basic compute element of the execution model is called a “work-item,” which has private memory whose properties are similar to those of registers. Work-items are grouped into work-groups. It is important to know that the work-group size is hardware-dependent and determined per device. Each device specifies the maximal value CL_DEVICE_MAX_WORK_GROUP_SIZE, which can be queried by systems like Android-Aparapi. Thus, Android-Aparapi can achieve performance portability across devices. The NDRange size is the sum of all work-group sizes. Starting from CL_DEVICE_MAX_WORK_GROUP_SIZE, we decrease the work-group size until it evenly divides the NDRange size. The structure of OpenCL work-items and work-groups is shown in Figure 1. An essential consideration is that the hardware access speeds of global memory and local memory differ substantially: OpenCL local memory is much faster than global memory. Local memory, which is shared within a work-group, is not large, and therefore users have to manage the memory allocation carefully when executing the work-groups in parallel. Work-items in a work-group can share data through local memory, using barriers as synchronization primitives. In contrast, communication across work-groups is based on global memory.

Fig. 2. The flow of Aparapi: Convert Java bytecode to OpenCL host and kernel code, or to code that uses a Java thread pool

B. Aparapi

Aparapi (A PARallel API) is an open-source API for expressing data-parallel workloads in Java. The Aparapi translator converts the Java bytecode of a given workload into OpenCL host and kernel code for running on a GPU. Only the bytecode of kernel classes needs to be converted. The Aparapi API is at the Java level, so programmers do not need exhaustive knowledge of heterogeneous computing. In addition, Aparapi helps developers save time in setting up the program. In contrast, OpenCL programmers need to write the host program themselves. Aparapi translates Java bytecode to OpenCL code at run time and creates a Java thread pool for the Aparapi kernel class. Because the Java virtual machine is a safe, managed environment, it does not let programmers address hardware-level mechanisms directly, but Aparapi bridges Java to the OpenCL program through the Java Native Interface (JNI). JNI defines a way for managed code to interact
with native code. The primary characteristic of Aparapi is that it automatically detects the capability of the OpenCL platform and determines at run time whether to execute the code with the traditional Java thread pool (JTP mode) or on the GPU via the OpenCL interface (GPU mode). Figure 2 shows the flow of converting Java bytecode to OpenCL host and kernel code using Aparapi. When Aparapi receives parallel Java bytecode, it first converts it to an OpenCL program (unless users select the JTP mode); it then sets up the OpenCL host program and binds it with the kernel code before running it on devices. Aparapi falls back to JTP mode at execution time in two situations: first, when Aparapi cannot convert the bytecode to OpenCL kernel code because of language limitations, which are detailed in the next paragraph; second, when executing OpenCL results in errors at compile time or run time. For instance, when clGetDeviceInfo() returns errors, we fall back to JTP mode.

Fig. 1. The structure of OpenCL work-items and work-groups

To use Aparapi efficiently, it is necessary to be aware of its restrictions, which stem from the differing language features of Java and the C99 standard. Here we list the four important limitations, which require attention to prevent Aparapi from ceasing the conversion of Java bytecode to OpenCL kernel code and reverting to JTP mode. First, certain Java data types are not supported by Aparapi, which only supports boolean, byte, short, int, long, and float. The data type char is not supported. Second, Aparapi does not support multidimensional arrays. Additionally, Aparapi does not implement Java 5's extended for syntax, such as for (int i: arrayOfInt), because it will cause a shallow copy of the original array. Third, certain kinds of methods, such as static methods, overloaded methods, and methods with varargs argument lists, are not supported, because OpenCL does not allow them. Recursive calls are not supported either. In addition, Aparapi does not support exceptions (throw or catch). Finally, because of the language limitations of Java and C99-based OpenCL, only simple loops and conditions are supported. Control-flow statements such as break, switch, and continue are not supported. Additionally, new is not supported for either objects or arrays.

III. ARCHITECTURE OF ANDROID-APARAPI

Enabling GPGPU on Android presents certain challenges. Our goal is to bring the benefits of a GPU to the Android device, using the Aparapi GPU mode.

A. Android-Aparapi Frontend on a PC or Workstation

The virtual machine (VM) of the Android operating system is Dalvik. Unlike Java VMs, which are stack machines, the Dalvik VM uses a register-based architecture. A tool called dx is used to convert Java class format files into the Dalvik-compatible dex (Dalvik Executable) format. After such conversion, dex format files can run on Dalvik. Upstream Aparapi does not work on mobile: it only reads Java class format files at run time to generate OpenCL code. Specifically, the parser of Aparapi is called ClassModel, not DexModel. However, in Android, the class format files are converted to dex format files. Two solutions can resolve this problem. The first is to move the class format files into an Android raw folder and instruct ClassModel to read and translate them. The second is to implement a DexModel, which parses dex format files and generates the corresponding OpenCL code. The first solution requires repacking all the class format files that must be translated to OpenCL code. The second solution does not require repacking files into a raw folder; it can directly read dex format files and generate the corresponding OpenCL. We use the first solution in Android-Aparapi since Java, given its long history, is relatively more stable today.

Fig. 3. Architecture of Android-Aparapi on Android systems. On the PC or workstation, we package all the class format files that contain the kernel classes of Android-Aparapi. These files are passed to Android-Aparapi on mobile devices. Android-Aparapi parses these files to generate OpenCL kernels, which are executed on the mobile GPU at run time.

B. Android-Aparapi Backend on an Android Device

The next challenge is how to run OpenCL on Android devices such as Nexus 10, which is not supported by the original Aparapi. We start with the given library “libGLES_mali.so,” which contains entry points for the OpenCL functions on Nexus 10. Note that the word “mali” refers to the GPU of Nexus 10, “Mali-T604.” Because Mali-T604 has been certified Khronos conformant for OpenCL 1.1 Full Profile on Linux and Android systems, we use Nexus 10 as the development vehicle of Android-Aparapi.

The key is to build our device library (we name it “libAparapi_nexus10”) leveraging both Aparapi and “libGLES_mali.so.” To do so, we pull the given library from Nexus 10 to our desktop and use the NDK to link this library to all the Aparapi native-side code. The end result is our own “libAparapi_nexus10,” which is the native side of Android-Aparapi.

In addition to creating Android-Aparapi's device library, we need to extend the architecture-checking mechanism in Aparapi. On its Java side, Aparapi detects which architecture of the JRE is used and decides which type of Aparapi native shared library to load at run time. The original Aparapi only loads the “libAparapi_x86_64” shared library if the JRE is x86_64 (amd64), and loads “libAparapi_x86” if the JRE is x86 (i386). On Nexus 10, the OS architecture reported by Dalvik is ARMv7l. The suffix “l” stands for little-endian. We extend the architecture-checking mechanism in Android-Aparapi and load “libAparapi_nexus10” correspondingly. Note that Android-Aparapi first loads libGLES_mali.so because “libAparapi_nexus10” contains OpenCL calls.

After these two substantial changes, Android-Aparapi can use the GPU mode to run on Nexus 10. The workflow of Android-Aparapi running on Nexus 10 is shown in Figure 3.

IV. BOOSTING THE API OF ANDROID-APARAPI

Android-Aparapi is a better-defined API than Aparapi. For instance, Dalvik only supports two data types, int and long, in terms of Java objects when using sun.misc.Unsafe methods. Android-Aparapi additionally supports the short, byte, float, double, and boolean data types and their corresponding arrays.

By design, a Java application is prevented from accessing the underlying memory layout. However, the Dalvik VM needs to determine the layout in order to make JNI calls or exchange data with native code such as the OpenCL driver. Thus, when using Java objects, the Dalvik VM uses the sun.misc.Unsafe methods to obtain the object field addresses whenever needed. We refer to methods such as sun.misc.Unsafe.getFloat as layout-getters.

The layout-getters result in certain restrictions when adapting Aparapi to Android, because not all layout-getters are available on Dalvik. For instance, layout-getters for the short, byte, float, double, and boolean data types and their corresponding arrays are not implemented in the Dalvik VM. In our Android-Aparapi system, we implement all layout-getters and layout-setters.

Android-Aparapi uses out-of-heap memory, in the form of a ByteBuffer, to communicate data with the native OpenCL driver. Because a ByteBuffer lives outside the heap, the garbage collector does not manage it, so it does not add to the overhead of garbage collection. Before saving values to the ByteBuffer, Android-Aparapi establishes how many bytes each data type occupies, using the JValue. For example, an integer of four bytes is a jint; similarly, a double is eight bytes and is a jdouble. Thanks to the layout-getters and layout-setters that we implement, our ByteBuffer supports any Java objects and arrays of Java objects in the corresponding Java source. As a result, Android-Aparapi is a better-defined API than Aparapi.

V. OPTIMIZATIONS IN ANDROID-APARAPI SYSTEM

The primary aim of Android-Aparapi in using native code is to interface with the low-level hardware from within the Dalvik VM. It is ironic that some developers who resort to such native code for higher performance get bitten by the JNI overhead, so that performance actually becomes lower. Specifically, those developers find that the speedup from using GPGPU via Android-Aparapi can disappear if we do not reduce the overhead of each call from Java to native code through JNI.

Our insight is that, because of the nature of compute, Android-Aparapi typically operates on many data elements, one by one. That is, the Java object is typically used as an
array. If each Java object has n variables, the number of operations is n times the size of the array. This results in a large time penalty from the numerous JNI calls needed to reach the native side. The repeated operations present an opportunity for amortizing the JNI overhead: we can group many operations into the same JNI call. Thus, users can obtain high performance by decreasing the number of JNI calls.

In Android-Aparapi we implement a set of new APIs, such as getFloatArray and putFloatArray, on the native side, which operate on the memory of an entire array of a Java object rather than a single variable. Now, regardless of the size of the array, the penalty for going to or from native code is paid only once. To achieve this, we extend the Android Dalvik VM's native code (specifically, vm/native/sun.misc.Unsafe.cpp) and enhance Aparapi's Dalvik side, specifically UnsafeWrapper.java. In Section VI we measure the performance benefit of the optimization described in this section.

VI. EXPERIMENTAL RESULTS

The experiments focus on three topics: demonstrating the performance benefit of the optimizations in Section V, comparing the performance of the CPU and GPU, and comparing the performance of Android-Aparapi and OpenCL on the GPU.

The experiments are run on Nexus 10, the specifications of which are as follows: dual-core A15 CPU, quad-core Mali T604 GPU, and 2 GB RAM. Because we added certain code to Dalvik, we rebuild from the Android Open Source Project (AOSP). The Android version of the operating system is 4.2.2.2.2.2.2.2.2.2, as Google decided to have nine “2”s in the version number. The model number is Full AOSP on Manta, and the kernel version is 3.4.5-gaf9c307.

Our experiments begin with Rodinia, which is a benchmark suite for heterogeneous computing. Rodinia contains OpenMP[6], OpenCL, and CUDA[7] implementations. Rodinia is currently at version 2.3[8] and contains 18 benchmarks written in OpenCL. Our goal is to use Android-Aparapi to rewrite the Rodinia OpenCL benchmarks. More than two thirds of the Rodinia benchmarks are written with the work-group feature. OpenCL code that runs on a GPU is sped up if the OpenCL programmer uses OpenCL local memory appropriately. Android-Aparapi benchmarks that use local memory are also sped up.

In the interest of space, we show the results for seven representative benchmarks: Gaussian Elimination (GE), Breadth-First Search (BFS), Kmeans (KM), k-Nearest Neighbors (NN), PathFinder (PF), Needleman-Wunsch (NW), and Streamcluster (SC). Of these benchmarks, the Android-Aparapi versions of NW, SC, and PF use local memory. More details on each benchmark are available in the Rodinia benchmark suite.

The details of porting the Rodinia OpenCL benchmarks to Android-Aparapi are as follows. Because Android-Aparapi has certain limitations in writing kernels, we have to rewrite the benchmarks whenever necessary. For example, the most common problem is that we are required to change every multidimensional array to a one-dimensional array, which means rewriting the benchmarks to recalculate the array index. Next, in OpenCL, the programmer must write a great deal of host code, such as querying for the platform and devices, creating the context, managing the command queue, and creating buffers. With Android-Aparapi, however, a programmer can focus on the areas where code is supposed to run in parallel. Finally, certain code used for timing and debugging is replaced: Java contains numerous convenient libraries for programmers, and we use these libraries directly.

Our benchmarks written in Android-Aparapi contain significantly fewer source lines of code (SLOC) than those written in OpenCL in Rodinia. The SLOC of each benchmark is given in Table II. If the Rodinia benchmark contains input files, we use these files as the input to our benchmark. If the input data is random, we use the Java Math random API to generate data. Finally, we validate our output results against those of the Rodinia OpenCL benchmarks.

A. Android-Aparapi performance vs. original Aparapi performance

In this experiment we measure the benefit of the optimization from Section V. Two Rodinia benchmarks, NN and SC, contain Java objects in their kernel code. As a result, our optimization in Section V can make a difference. Because Aparapi running in JTP mode does not involve the native side, we only compare two Android native versions: the Android-Aparapi version and the original Aparapi version. The average performance of the two applications is summarized in Table I. Each runs three data sizes, which all follow the Rodinia specifications. Each time we run on the GPU on Android devices, we generate the OpenCL host and kernel program when executing the Android-Aparapi kernel for the first time. The conversion time does not affect this experiment: because our focus is the time spent running OpenCL on the device, we subtract the conversion time from the total application execution time. Our optimization results in a speedup of at least 1.35× in all configurations in Table I. In addition, as the data sizes increase, the improvement increases substantially, particularly in the SC program. For the NN program, the growth in speedup is less substantial because its data size increases only four-fold at each step (vs. 16-fold in the case of SC).

TABLE I
GPU VERSUS OPTGPU ON NN AND SC

NN nodes          GPU (ms)   OptGPU (ms)   Speedup
10690                  320           130     2.46×
42760                 1036           358     2.89×
171040                3895          1318     2.96×

SC points×dims    GPU (ms)   OptGPU (ms)   Speedup
64×256                3059          2236     1.35×
1024×256              9589          5810     1.65×
16384×256           152434         58319     2.65×

B. JTP vs. GPU mode in Android-Aparapi

We test our benchmarks, written in Android-Aparapi, using the JTP and GPU modes. The experiments require two considerations. First, each benchmark comes with a variety of data sizes, and the performance of each benchmark varies with the data size. Second, because Android-Aparapi translates Java bytecode to OpenCL at run time, it may lose performance.

Because of the number of benchmarks, for simplicity we use the BFS and SC benchmarks to show the performance of JTP and GPU. BFS represents a benchmark whose execution model contains only one work-group and does not use local memory. SC, by contrast, represents a benchmark that contains more than one work-group and uses local memory.

Figure 4 shows the BFS benchmark run in JTP and GPU modes for a variety of problem sizes. For small problem sizes, the BFS benchmark running in JTP mode is faster than in GPU mode. The reason is that Android-Aparapi generates OpenCL at run time. Android-Aparapi takes time to initialize OpenCL, which includes parsing Java bytecode and generating the OpenCL host program and kernel program. The time required to initialize OpenCL is not affected by the problem size. Android-Aparapi contains code to measure the time for initializing OpenCL. For example, when the graph contains 4,096 nodes, the BFS benchmark in GPU mode requires 308 milliseconds to initialize OpenCL, whereas the kernel code itself requires only 29 milliseconds. The total time required in GPU mode is therefore 337 milliseconds. However, the total time required by the Aparapi kernel code in JTP mode is only 50 milliseconds. For smaller problem sizes, JTP is faster than the GPU. However, for a larger BFS problem size, such as a graph containing 1 million nodes, the total time required in GPU mode is only 1,670 milliseconds, whereas JTP mode requires 3,645 milliseconds. For larger problem sizes, the GPU mode is faster than the JTP mode. In smaller cases, initializing OpenCL substantially affects the performance of the GPU. When the problem size becomes large, the initialization time for OpenCL becomes negligible.

Fig. 4. BFS benchmark run in JTP and GPU modes

Figure 5 shows the SC benchmark run in JTP and GPU modes for a variety of problem sizes. Compared with BFS, the GPU mode gains more speed in SC. Initializing OpenCL affects the performance of SC running on the GPU. Additional problems arise when code that uses local memory (with more than one work-group) runs in the JTP and GPU modes. Using local memory efficiently speeds up benchmarks that use OpenCL. However, Java does not support local memory; therefore, code using local memory is only sped up in the Android-Aparapi GPU mode. Another consideration is that, when using local memory, the programmer must use the local barrier appropriately. In JTP mode, Aparapi uses Java's barrier to simulate the local barrier in OpenCL. These barriers incur an additional cost; therefore, Android-Aparapi officially recommends using local memory only when the programmer is certain that the benchmark can run on a GPU more than 90 percent of the time. Another consideration regarding Aparapi performance on Android systems is Dalvik's garbage collection. When the benchmarks start running, the heap adjusts its size automatically, based upon the needs of the application. However, when the problem size is too large, the benchmarks running in JTP mode may trigger garbage collection. This situation typically occurs in benchmarks using local memory, and it affects performance.

Fig. 5. SC benchmark running in JTP and GPU modes

Figure 6 shows the performance of each benchmark when the problem size is small. In Figure 6, the GPU execution time includes the time to initialize OpenCL. OpenCL initialization affects the execution time of the GPU and renders the total execution time slower than that of JTP.

Figure 7 shows the performance of each benchmark when the problem size is large. The OpenCL initialization cost is negligible. Nearly every benchmark is faster in GPU mode than in JTP mode, with the exception of NN. Because the kernel code of NN only calculates an array of square roots, the computational complexity remains low, and the power of the GPU cannot be leveraged well. Additionally, PF and NW are substantially sped up in GPU mode because they make heavy use of local memory, which implies a significant performance boost on the GPU and a performance handicap on the CPU due to the barrier problem.
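
To illustrate why emulating a local barrier on CPU threads carries a cost, here is a minimal sketch of one thread per work-item synchronizing through java.util.concurrent.CyclicBarrier. It is an assumption-laden illustration, not Aparapi's JTP implementation: the class BarrierEmulationSketch, the method rotateLeft, and the shared array standing in for local memory are all hypothetical.

```java
import java.util.Arrays;
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

/** Illustrative emulation of an OpenCL-style local barrier with plain Java threads. */
final class BarrierEmulationSketch {

    /**
     * Each "work-item" writes id * 10 into a shared array, waits at the barrier,
     * and then reads the slot written by its right-hand neighbour.
     */
    static int[] rotateLeft(int groupSize) {
        final int[] local = new int[groupSize]; // plays the role of local memory
        final int[] out = new int[groupSize];
        final CyclicBarrier barrier = new CyclicBarrier(groupSize);
        Thread[] workers = new Thread[groupSize];

        for (int i = 0; i < groupSize; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                local[id] = id * 10;
                try {
                    // No thread proceeds until every thread has written its slot.
                    barrier.await();
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
                out[id] = local[(id + 1) % groupSize];
            });
            workers[i].start();
        }

        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(rotateLeft(4))); // [10, 20, 30, 0]
    }
}
```

Every barrier forces all threads of the group to block and be rescheduled, which is cheap in GPU hardware but comparatively expensive for operating-system threads; this is consistent with the slowdown the text reports for barrier-heavy benchmarks in JTP mode.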

101
Fig. 6. Small problem size of each benchmark Fig. 8. Benchmark performance between our Android-Aparapi version with
the existing lower-level OpenCL version running on GPU

VII. R ELATED W ORK

Our work focuses on computing frameworks in Java on


Android devices and is based on AMD’s Aparapi project.
Although several computing framework projects leverage the
Java language, none have been adapted to Android Dalvik
Virtual Machine, or have conducted detailed performance
evaluation on a mobile GPU. Below we discuss several related
works in the field of Java-based GPGPU APIs.
There are several Java language bindings for com-
Fig. 7. Large problem size of each benchmark
puting framework (JCUDA[9], JCudaMP[10], jocl[11] and JavaCL[12]), but these projects still require developers to manage primitive arrays manually and to write kernel code in another language. Peter Calvert's Java-GPU[13] is a Java compute framework that automatically offloads parallel "for" loops to NVIDIA CUDA. Unlike Java-GPU, which targets only loops, Rootbeer[14] is a GPU compiler that lets programmers use NVIDIA CUDA from within Java. Rootbeer automatically converts complex graphs of objects into arrays of primitive types and aims at better performance via a Java optimization framework called Soot[15]. Hence, both of them eliminate many manual steps. The primary difference between Aparapi and these two projects is that Java-GPU and Rootbeer do not provide a fallback mechanism at run time, whereas Aparapi does.

"OpenCL in Action", by Matt Scarpino[16], runs OpenCL image filtering. It is a pure JDK project. It does not use Java-based GPGPU APIs and runs only on Nexus 10. Rahul Garg[17] demonstrates how to load .so files for the base code, using the dlsym approach on various OpenCL driver libraries. The aopencl project[18], designed by Mahadevan GSS, is similar to our work. For simplicity, aopencl ports an older version of Aparapi to Nexus 4 and focuses on handling Qualcomm SDK[19] challenges. The aopencl project also encounters the sun.misc.Unsafe method problems. However, aopencl does not enhance the APIs as we do. Furthermore, aopencl cannot adequately allocate local memory because of the old Aparapi version it uses. Finally, we implement key optimizations and compare the performance of running on Java threads with that of running on GPUs.

The benchmarks running in the JTP mode are faster than those running in the GPU mode when the problem size is small, because benchmarks running in the GPU mode need time to initialize OpenCL and to communicate between Java and OpenCL. When the problem size is large, the GPU is more powerful and runs faster than the JTP mode.

Table II summarizes our current benchmarks for large problem sizes. The number of kernels ranges from 1 to 2 and the number of OpenCL barriers ranges from 0 to 12. Our Android-Aparapi SLOC counts only the compute parts; the parts that use the Android framework are not included.

C. Android-Aparapi performance vs. original OpenCL performance on GPU

We compare the performance of our Android-Aparapi version with that of the existing lower-level OpenCL version running on the GPU. The performance of each sample is shown in Figure 8. Because Android-Aparapi needs time to initialize OpenCL, it is slower than the OpenCL version. For almost all benchmarks, the execution times of Android-Aparapi and OpenCL are quite close. Only PF is much slower with Android-Aparapi than with OpenCL: our performance is 58% of OpenCL's. Overall, the geometric mean of our performance vs. OpenCL's is 88%.
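As a rough illustration of how a summary figure such as the 88% geometric mean can be derived, the sketch below computes per-benchmark relative performance and then its geometric mean. It assumes that "performance relative to OpenCL" means OpenCL execution time divided by Android-Aparapi execution time; the paper does not spell out this definition. The class name, method names and the sample ratios in `main` are hypothetical and are not the measured per-benchmark values.

```java
public final class RelativePerformance {

    /** Relative performance, assumed here to be OpenCL time / Android-Aparapi time. */
    public static double relativePerformance(double openclSeconds, double aparapiSeconds) {
        if (openclSeconds <= 0.0 || aparapiSeconds <= 0.0) {
            throw new IllegalArgumentException("execution times must be positive");
        }
        return openclSeconds / aparapiSeconds;
    }

    /** Geometric mean of positive ratios, computed in log space to avoid overflow. */
    public static double geometricMean(double[] ratios) {
        if (ratios.length == 0) {
            throw new IllegalArgumentException("need at least one ratio");
        }
        double logSum = 0.0;
        for (double r : ratios) {
            if (r <= 0.0) {
                throw new IllegalArgumentException("ratios must be positive");
            }
            logSum += Math.log(r);
        }
        return Math.exp(logSum / ratios.length);
    }

    public static void main(String[] args) {
        // Hypothetical ratios for illustration only; only the 0.58 (PF) figure
        // is taken from the text, the others are placeholders.
        double[] ratios = {0.95, 0.97, 0.58, 0.99};
        System.out.printf("geometric mean = %.2f%n", geometricMean(ratios));
    }
}
```

The geometric mean is the usual choice for averaging ratios because a single outlier, such as PF here, pulls it down less than it would pull down an arithmetic mean of execution times.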

TABLE II
SUMMARY INFORMATION OF OUR CURRENT BENCHMARK FOR LARGE PROBLEM SIZES

                          GE              BFS             NN              KM              PF              NW              SC
Kernels                   2               2               1               2               1               2               1
Barriers                  0               0               0               0               3               12              1
Problem Size              1024×1024       1000000         171040          494020 points   100×500         1024×1024       16384 points
                          data points     nodes           nodes           35 features     data points     data points     256 dimensions
JTP Execution Time        122.26 s        3.64 s          0.22 s          120.54 s        87.57 s         76.12 s         283.75 s
GPU Execution Time        69.96 s         1.67 s          12.3 s          34.18 s         1.21 s          2.528 s         58.88 s
Original SLOC (OpenCL)    525             349             358             686             671             592             2999
SLOC (Android-Aparapi)    182             212             130             228             328             411             850
Initialize OpenCL         263 ms          308 ms          160 ms          443 ms          501 ms          373 ms          559 ms
GPU Speedup               1.75x           2.18x           0.02x           2.79x           72.37x          30.11x          4.82x
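The GPU Speedup row appears to be the JTP execution time divided by the GPU execution time. The sketch below recomputes that ratio for three columns of Table II (GE, PF and NW) under this assumption; the class and method names are hypothetical and not part of Android-Aparapi.

```java
public final class GpuSpeedup {

    /** Speedup of GPU mode over JTP mode: values below 1.0 mean the GPU is slower. */
    public static double speedup(double jtpSeconds, double gpuSeconds) {
        if (jtpSeconds <= 0.0 || gpuSeconds <= 0.0) {
            throw new IllegalArgumentException("execution times must be positive");
        }
        return jtpSeconds / gpuSeconds;
    }

    public static void main(String[] args) {
        // Execution times in seconds, copied from Table II.
        String[] names = {"GE", "PF", "NW"};
        double[] jtp = {122.26, 87.57, 76.12};
        double[] gpu = {69.96, 1.21, 2.528};
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%s: %.2fx%n", names[i], speedup(jtp[i], gpu[i]));
        }
    }
}
```

A ratio below 1.0, as reported for NN, indicates that the GPU mode is slower than the JTP mode for that benchmark even at the large problem size.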

VIII. CONCLUSION AND FUTURE WORK

Compute is in higher demand in more domains these days, and the use of GPU devices has become a priority. As demonstrated in this paper, developers can use Android-Aparapi to write heterogeneous computing programs in Java and launch OpenCL kernels on Android devices with ease. App developers no longer need to struggle with writing native code to bind with the OpenCL library. Furthermore, Android-Aparapi is a better defined API than the original Aparapi. In addition, our system reduces the large overheads of JNI calls: we modify both Aparapi and the Dalvik VM to group many operations into a single JNI call. Finally, we adapt the popular Rodinia benchmark to Android-Aparapi, and the evaluation results demonstrate the effect of our optimizations for Android-Aparapi. We also present detailed comparisons between Android-Aparapi's JTP and GPU modes.

In the future, we will rewrite the Rodinia benchmark further to use Java without resorting to JTP mode. Today we experiment with the Mali T604 on Nexus 10, but we plan to support more OpenCL libraries next. Finally, we will compare our performance with that of Android's official computing framework, Renderscript, and with the NDK. The translated results from O2render[20] will be included in the comparison too.

ACKNOWLEDGMENT

This work was supported by the Industrial Technology Research Institute, Hsinchu, Taiwan. We sincerely thank Logan Chien and Deryu Tsai for helping us implement low-level memory access on Android systems. We acknowledge Jian-Min Liou and Kuan-Yu Lin, who contributed certain benchmarks on Android-Aparapi.

REFERENCES

[1] Khronos Group, "OpenCL," http://www.khronos.org/opencl/.
[2] Google, "Renderscript," http://developer.android.com/guide/topics/renderscript/index.html.
[3] "Android NDK Document," 2012. [Online]. Available: http://developer.android.com/sdk/ndk/index.html
[4] "Aparapi," https://code.google.com/p/aparapi/.
[5] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in Workload Characterization, 2009. IISWC 2009. IEEE International Symposium on. IEEE, 2009, pp. 44–54.
[6] S. Lee, S.-J. Min, and R. Eigenmann, "OpenMP to GPGPU: a compiler framework for automatic translation and optimization," in Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, ser. PPoPP '09. New York, NY, USA: ACM, 2009, pp. 101–110. [Online]. Available: http://doi.acm.org/10.1145/1504176.1504194
[7] NVIDIA, "CUDA Toolkit," http://developer.nvidia.com/cuda-toolkit.
[8] K. Skadron, "Rodinia: Accelerating Compute-Intensive Applications with Accelerators," http://lava.cs.virginia.edu/Rodinia/.
[9] Y. Yan, M. Grossman, and V. Sarkar, "Jcuda: A programmer-friendly interface for accelerating java programs with cuda," in Euro-Par 2009 Parallel Processing. Springer, 2009, pp. 887–899.
[10] G. Dotzler, R. Veldema, and M. Klemm, "Jcudamp: Openmp/java on cuda," in Proceedings of the 3rd International Workshop on Multicore Software Engineering. ACM, 2010, pp. 10–17.
[11] "Jocl-java bindings for opencl," http://www.jocl.org/.
[12] "JavaCL," https://code.google.com/p/javacl/.
[13] P. Calvert, "Parallelisation of java for graphics processors," Final-year dissertation at University of Cambridge Computer Laboratory. Available from http://www.cl.cam.ac.uk/prc33, 2010.
[14] P. C. Pratt-Szeliga, J. W. Fawcett, and R. D. Welch, "Rootbeer: Seamlessly using gpus from java," in High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems (HPCC-ICESS), 2012 IEEE 14th International Conference on. IEEE, 2012, pp. 375–380.
[15] R. Vallee-Rai, P. Co, E. Gagnon, L. Hendren, P. Lam, and V. Sundaresan, "Soot: A java bytecode optimization framework," in CASCON First Decade High Impact Papers. IBM Corp., 2010, pp. 214–224.
[16] "MattScarpino," http://www.openclblog.com/2013/03/opencl-image-filtering-on-nexus-10.html.
[17] "Rahul Garg," https://bitbucket.org/codedivine/testcln10/src.
[18] M. GSS, "aopencl," https://code.google.com/p/aopencl/.
[19] "QualComm SDK," https://developer.qualcomm.com/discover/mobile-platforms/android/.
[20] C.-y. Yang, Y.-j. Wu, and S. Liao, "O2render: An opencl-to-renderscript translator for porting across various gpus or cpus," in Embedded Systems for Real-time Multimedia (ESTIMedia), 2012 IEEE 10th Symposium on. IEEE, 2012, pp. 67–74.
