Design and Implementation of High-Level Compute On Android Systems
Hung-Shuen Chen†, Jr-Yuan Chiou†, Cheng-Yan Yang‡,
Yi-jui Wu‡, Wei-chung Hwang‡, Hao-Chien Hung†, Shih-wei Liao†*
† National Taiwan University, Taipei, Taiwan
‡ Industrial Technology Research Institute, Hsinchu, Taiwan
* Corresponding author: [email protected]
Abstract—As Android devices come with various CPU and GPU cores, the demand for effective parallel computing across the board increases. In response, Google has released Renderscript to leverage parallel computing while maintaining portability. However, adoption has been slow: we hardly see any Renderscript apps on Google Play. In the meantime, the proliferation of heterogeneous cores inside a single device calls for a higher-level, more developer-friendly parallel language. Since most Android developers already use Java, we develop the first Java-based compute system on Android, called Android-Aparapi. Android-Aparapi facilitates programmers’ adoption of compute by obviating the need to learn a new language such as Renderscript, so that software can start catching up with the hardware trend of periodically doubling the number of cores. Furthermore, Android-Aparapi is a better-defined API than the original Aparapi: we support a comprehensive set of data types and their array forms in terms of Java objects. In addition, we propose optimizations that effectively reduce the Java Native Interface (JNI) overheads. Finally, we develop an Android-Aparapi benchmark suite by extending the Rodinia benchmark suite to Android. The Java thread pool version of the suite runs 3 times slower than the Android-Aparapi version, which demonstrates the effectiveness of our high-level compute system. Furthermore, we compare our Android-Aparapi version with the existing lower-level OpenCL version and show that the performance is comparable (our performance is at 88% of OpenCL’s). In short, we achieve a higher-level abstraction without a sizable loss in performance.

I. INTRODUCTION

Today’s Android devices are heterogeneous computing systems. According to Wikipedia, such systems are “electronic systems that use a variety of different types of computational units.” Compute in this context refers to computation-intensive processing; examples are photo editing and computational photography. GPGPU (General-Purpose Graphics Processing Unit) is a popular form of compute: it uses heterogeneous systems such as mobile GPUs to perform compute instead of graphics rendering. Note that compute is more general than GPGPU compute: Android may use a DSP (Digital Signal Processor) or the CPU (Central Processing Unit) for compute too.

Unfortunately, compute has not taken off in the mobile world. We hardly see any compute apps on Google Play. We believe existing compute languages such as OpenCL[1] and Renderscript[2] are too low-level for mobile app developers. There are few HPC (High-Performance Computing) programmers in the world of the App Store or Google Play. To address this problem, we develop the first Java-based compute system on Android, called Android-Aparapi.

A. Android-Aparapi: Higher-level than Renderscript

The proliferation of heterogeneous cores inside a single device calls for a more developer-friendly parallel language. Google Renderscript is Android’s official heterogeneous computing framework. Renderscript claims to be the high-performance API (Application Programming Interface) for compute on Android. Although Renderscript aims for broad support from GPU and DSP vendors, few GPUs support Renderscript thus far. By contrast, OpenCL support among GPU vendors is becoming universal, on both desktop and mobile GPUs. By going for the higher-level Android-Aparapi, we avoid the embarrassing situation where developers’ Renderscript programs cannot be deployed on most mobile GPUs today. Instead, our system ensures broad deployability because we translate Aparapi to the lower-level API available on the GPUs of the day, be it OpenCL or Renderscript.

Furthermore, Java is already the primary development language on Android systems. Thanks to our high-level Java-based compute system, app developers are not required to learn a new language such as Renderscript in order to implement GPGPU on Android.

B. Android-Aparapi’s Backend: Targeting OpenCL today

Today we translate Android-Aparapi into OpenCL, since more GPUs support it. While Android developers are mostly Java programmers, OpenCL is based on the C99 programming language. We do not recommend that developers use OpenCL directly, because on Android systems today doing so may require the Android Native Development Kit (NDK). The NDK[3] allows developers from the world of native code, such as non-portable C and C++, to write applications. Adding OpenCL complicates the programming story further.

Android-Aparapi is based on Aparapi[4], which stands for “A PARallel API.” Aparapi allows programmers to write code using only Java and execute it on GPUs via OpenCL code generated from the Java source. We successfully adapt Aparapi to Android’s Dalvik that uses Dex instead of Java
Fig. 1. The structure of OpenCL work-items and work-groups
with native code. The primary characteristic of Aparapi is that it automatically detects the capability of the OpenCL platform and determines at run time whether to execute the code with the traditional Java thread pool (JTP mode) or on the GPU via the OpenCL interface (GPU mode). Figure 2 shows the flow of converting Java bytecode to OpenCL host and kernel code using Aparapi. When Aparapi receives parallel Java bytecode, it first converts it to an OpenCL program (unless the user selects JTP mode); it then sets up the OpenCL host program and binds it with the kernel code before running it on the device. Aparapi falls back to JTP mode at execution time in two situations: first, when Aparapi cannot convert the bytecode to OpenCL kernel code because of language limitations, which are detailed in the next paragraph; second, when executing OpenCL results in errors at compile time or run time. For instance, when clGetDeviceInfo() returns an error, we fall back to JTP mode.

To use Aparapi efficiently, it is necessary to be aware of its restrictions, which stem from the differences in language features between Java and the C99 standard. Here we list the four important limitations; violating any of them causes Aparapi to stop converting Java bytecode to OpenCL kernel code and revert to JTP mode. First, certain Java data types are not supported by Aparapi, which only supports boolean, byte, short, int, long, and float. The data type char is not supported. Second, Aparapi does not support multidimensional arrays. Additionally, Aparapi does not implement Java 5’s extended for syntax, such as for (int i: arrayOfInt), because it would cause a shallow copy of the original array. Third, static methods, overloaded methods, and methods with varargs argument lists are not supported, because OpenCL does not allow them. Recursive calls are not supported either. Other restrictions are that Aparapi does not support exception, throw, or catch. Finally, because of the language limitations of Java and C99-based OpenCL, only simple loops and conditions are supported. Control-flow statements such as break, switch, and continue are not supported. Additionally, new is not supported for either objects or arrays.

III. ARCHITECTURE OF ANDROID-APARAPI

Enabling GPGPU on Android presents certain challenges. Our goal is to bring the benefits of a GPU to the Android device, using the Aparapi GPU mode.

A. Android-Aparapi Frontend on a PC or Workstation

The virtual machine (VM) of the Android operating system is Dalvik. Unlike Java VMs, which are stack machines, the Dalvik VM uses a register-based architecture. A tool called dx converts Java class format files into the Dalvik-compatible dex (Dalvik Executable) format. After this conversion, dex format files can run on Dalvik. Upstream Aparapi does not work on mobile: it only reads Java class format files at run time to generate OpenCL code. Specifically, the parser of Aparapi is called ClassModel, not DexModel. However, in Android, the class format files are converted to dex format files. Two solutions can resolve this problem. The first is to move the class format files into an Android raw folder and instruct ClassModel to read and translate them. The second is to implement a DexModel, which parses dex format files and generates the corresponding OpenCL code. The first solution requires repacking all the class format files that must be translated to OpenCL code. The second solution does not require repacking files into a raw folder; it can read dex format files directly and generate the corresponding OpenCL code. We use the first solution in Android-Aparapi since Java, given its long history, is relatively more stable today.
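To illustrate why a parser written for class format files cannot be pointed at dex format files, the following minimal sketch distinguishes the two formats by their leading bytes. Class files start with the magic number 0xCAFEBABE, whereas dex files start with the ASCII bytes "dex\n" followed by a version string. The class name BytecodeFormat and its detect method are hypothetical helpers for illustration only; they are not part of Aparapi or Android-Aparapi.

```java
import java.nio.charset.StandardCharsets;

public class BytecodeFormat {
    enum Kind { JAVA_CLASS, DEX, UNKNOWN }

    // Hypothetical helper: classify a file by its first four bytes.
    static Kind detect(byte[] header) {
        if (header.length < 4) {
            return Kind.UNKNOWN;
        }
        // Java class files begin with the magic number 0xCAFEBABE.
        if ((header[0] & 0xFF) == 0xCA && (header[1] & 0xFF) == 0xFE
                && (header[2] & 0xFF) == 0xBA && (header[3] & 0xFF) == 0xBE) {
            return Kind.JAVA_CLASS;
        }
        // Dex files begin with the ASCII bytes "dex\n", followed by a version.
        if (header[0] == 'd' && header[1] == 'e'
                && header[2] == 'x' && header[3] == '\n') {
            return Kind.DEX;
        }
        return Kind.UNKNOWN;
    }

    public static void main(String[] args) {
        byte[] classHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        byte[] dexHeader = "dex\n035\0".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(classHeader)); // JAVA_CLASS
        System.out.println(detect(dexHeader));   // DEX
    }
}
```

Under the first solution above, the files placed in the raw folder would be of the JAVA_CLASS kind, which is the only kind a class-format parser can consume.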
Fig. 3. Architecture of Android-Aparapi on Android systems. On the PC or workstation, we package all the class format files that contain the kernel classes
of Android-Aparapi. These files are passed to Android-Aparapi on mobile devices. Android-Aparapi parses these files to generate OpenCL kernels, which are
executed on the mobile GPU at run time.
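Whether a kernel actually runs on the GPU at run time depends on the fallback rule described for Aparapi: GPU mode is tried first, and JTP mode is used when GPU execution fails. The sketch below illustrates that control flow only. The GpuBackend interface and the ModeFallback class are hypothetical stand-ins, not Aparapi classes, and the thread pool here simply runs one task per work-item.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;

public class ModeFallback {
    enum Mode { GPU, JTP }

    /** Hypothetical stand-in for an OpenCL back end; not an Aparapi class. */
    interface GpuBackend {
        void run(IntConsumer kernel, int range) throws Exception;
    }

    /** Try the GPU first; on any failure, run every work-item on a Java thread pool. */
    static Mode execute(IntConsumer kernel, int range, GpuBackend gpu)
            throws InterruptedException {
        try {
            gpu.run(kernel, range);
            return Mode.GPU;
        } catch (Exception e) {
            // Fallback path: one task per work-item id, executed by the pool.
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            for (int i = 0; i < range; i++) {
                final int id = i;
                pool.execute(() -> kernel.accept(id));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            return Mode.JTP;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int[] out = new int[8];
        IntConsumer square = id -> out[id] = id * id;
        // Simulate a device on which OpenCL is unavailable.
        GpuBackend broken = (k, r) -> {
            throw new IllegalStateException("no OpenCL device");
        };
        Mode used = execute(square, out.length, broken);
        System.out.println(used + " " + out[3]); // JTP 9
    }
}
```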
B. Android-Aparapi Backend on an Android Device

The next challenge is how to run OpenCL on Android devices such as the Nexus 10, which is not supported by the original Aparapi. We start with the given library “libGLES” that contains entry points for the OpenCL functions on the Nexus 10. Note that the word “mali” stands for the GPU of the Nexus 10, the “Mali-T604.” Because the Mali-T604 has been certified Khronos conformant for OpenCL 1.1 Full Profile on Linux and Android systems, we use the Nexus 10 as the development vehicle of Android-Aparapi.

The key is to build our device library (we name it “libAparapi_nexus10”), leveraging both Aparapi and “libGLES”. To do so, we pull the given library from the Nexus 10 to our desktop and use the NDK to link this library to all the Aparapi native-side code. The end result is our own “libAparapi_nexus10,” which is the native side of Android-Aparapi.

In addition to creating Android-Aparapi’s device library, we need to extend the architecture-checking mechanism in Aparapi. On its Java side, Aparapi detects which JRE architecture is in use and decides which Aparapi native shared library to load at run time. The original Aparapi only loads the “libAparapi_x86_64” shared library if the JRE is x86_64 (amd64), and loads “libAparapi_x86” if the JRE is x86 (i386). On the Nexus 10, the OS architecture of Dalvik is ARMv7l; the suffix “l” stands for little-endian. We extend the architecture-checking mechanism in Android-Aparapi to load “libAparapi_nexus10” correspondingly. Note that Android-Aparapi first loads libGLES because “libAparapi_nexus10” contains OpenCL calls.

After these two substantial changes, Android-Aparapi can use GPU mode on the Nexus 10. The workflow of Android-Aparapi running on the Nexus 10 is shown in Figure 3.

IV. BOOSTING THE API OF ANDROID-APARAPI

Android-Aparapi is a better-defined API than Aparapi. For instance, Dalvik only supports two data types, int and long, in terms of Java objects when using sun.misc.Unsafe methods. Android-Aparapi supports short, byte, float, double, and boolean data types and their corresponding arrays.

By design, a Java application is prevented from accessing the underlying memory layout. However, the Dalvik VM needs to determine the layout in order to make JNI calls or exchange data with native code such as the OpenCL driver. Thus, when using Java objects, the Dalvik VM uses the sun.misc.Unsafe methods to obtain object field addresses whenever needed. We refer to methods such as sun.misc.Unsafe.getFloat as layout-getters.

The layout-getters impose certain restrictions when adapting Aparapi to Android, because not all layout-getters are available on Dalvik. For instance, layout-getters for the short, byte, float, double, and boolean data types and their corresponding arrays are not implemented in the Dalvik VM. In our Android-Aparapi system, we implement all layout-getters and layout-setters.

Android-Aparapi uses out-of-heap memory, called ByteBuffer, to communicate data with the native OpenCL driver. Because it resides outside the heap, the ByteBuffer cannot be accessed by the garbage collector and thus does not increase the overhead of garbage collection. Before saving values to the ByteBuffer, Android-Aparapi establishes how many bytes each data type occupies, using the JValue. For example, a four-byte integer is a jint; similarly, an eight-byte double is a jdouble. Thanks to the layout-getters and layout-setters that we implement, our ByteBuffer supports any Java objects and arrays of Java objects in the corresponding Java source. As a result, Android-Aparapi is a better-defined API than Aparapi.

The primary aim of Android-Aparapi in using native code is to interface with the low-level hardware from within the Dalvik VM. Ironically, some developers who resort to native code for higher performance are bitten by the JNI overhead, and performance actually becomes lower. Specifically, those developers find that the speedup from using GPGPU via Android-Aparapi can disappear if we do not reduce the overhead of each call from Java to native code through JNI.

Our insight is that because of the nature of compute, Android-Aparapi typically operates on many data elements, one by one. That is, the Java object is typically used as an
array. If each Java object has n variables, the number of operations is n times the size of the array. This results in a large time penalty from the numerous JNI calls needed to interface with the native side. The repeated operations present an opportunity for amortizing the JNI overhead: we can group many operations into the same JNI call. Thus, users obtain high performance by decreasing the number of JNI calls.

In Android-Aparapi we implement a set of new APIs, such as getFloatArray and putFloatArray, on the native side, which operate on the memory of an entire array of a Java object rather than a single variable. Now, regardless of the size of the array, the penalty for going to or from native code is paid only once. To achieve this, we extend the Android Dalvik VM’s native code (specifically, vm/native/sun.misc.Unsafe.cpp) and enhance Aparapi’s Dalvik side. In Section VI we measure the performance benefit of the optimization in this section.

The experiments focus on three topics: demonstrating the performance benefit from the optimizations in Section V, comparing the performance of the CPU and GPU, and comparing the performance of Android-Aparapi and OpenCL on the GPU.

The experiments are run on a Nexus 10, the specifications of which are as follows: dual-core A15 CPU, quad-core Mali-T604 GPU, and 2 GB RAM. Because we added code to Dalvik, we rebuild from the Android Open Source Project (AOSP). The operating system of the Android version is, as Google decided to have nine “2”s in the version number. The model number is Full AOSP on Manta, and the kernel version is 3.4.5-gaf9c307.

Our experiments begin with Rodinia, which is a benchmark suite for heterogeneous computing. Rodinia contains OpenMP[6], OpenCL, and CUDA[7] implementations. Rodinia is currently at version 2.3[8] and contains 18 benchmarks written in OpenCL. Our goal is to use Android-Aparapi to rewrite the Rodinia OpenCL benchmarks. More than two thirds of the Rodinia benchmarks are written with the work-group feature. OpenCL code that runs on a GPU is sped up if the programmer uses OpenCL local memory appropriately; Android-Aparapi benchmarks that use local memory are sped up as well.

In the interest of space, we show the results for seven representative benchmarks: Gaussian Elimination (GE), Breadth-First Search (BFS), Kmeans (KM), k-Nearest Neighbors (NN), PathFinder (PF), Needleman-Wunsch (NW), and Streamcluster (SC). Of these, NW, SC, and PF are written to use local memory in Android-Aparapi. More details on each benchmark are available in the Rodinia benchmark suite.

The details of porting Rodinia OpenCL benchmarks to Android-Aparapi are as follows. Because Android-Aparapi has certain limitations in writing kernels, we have to rewrite the benchmarks whenever necessary. For example, the most common problem is that we are required to change every multidimensional array to a one-dimensional array and recalculate the array index accordingly. Next, in OpenCL, the programmer must write a substantial amount of host code, such as querying for the platform and devices, creating the context, managing the command queue, and creating buffers. With Android-Aparapi, however, a programmer can focus on the code that is supposed to run in parallel. Finally, certain code used to measure elapsed time for debugging is replaced: Java contains numerous convenient libraries for programmers, and we use these libraries directly.

Our benchmarks written in Android-Aparapi contain significantly fewer source lines of code (SLOC) than those written in OpenCL in Rodinia. The SLOC of each benchmark is listed in Table II. If a Rodinia benchmark comes with input files, we use these files as the input to our benchmark. If the input data is random, we use the Java Math random API to generate data. Finally, we validate our output against that of the Rodinia OpenCL benchmarks.

A. Android-Aparapi performance vs. original Aparapi performance

In this experiment we measure the benefit of the optimization from Section V. Two Rodinia benchmarks, NN and SC, contain Java objects in their kernel code; as a result, our optimization in Section V can make a difference. Because Aparapi running in JTP mode does not involve the native side, we only compare two Android native versions: the Android-Aparapi version and the original Aparapi version. The average performance of the two applications is summarized in Table I. Each runs three data sizes, all of which follow the Rodinia specifications. Each time we run on the GPU on Android devices, we generate the OpenCL host and kernel programs when executing the Android-Aparapi kernel for the first time. The conversion time does not affect this experiment, because our focus is the time spent running OpenCL on the device; we therefore subtract the conversion time from the total application execution time. Our optimization results in a speedup of at least 1.35× in all configurations in Table I. In addition, as the data size increases, the improvement increases substantially, particularly for the SC program. For the NN program, the increase is less substantial because its data size grows four-fold per step (vs. 16-fold in the case of SC).

TABLE I. GPU VERSUS OPTGPU ON NN AND SC

NN Nodes         GPU (ms)   OptGPU (ms)   Speedup
10690                 320           130     2.46×
42760                1036           358     2.89×
171040               3895          1318     2.96×

SC Points×Dims   GPU (ms)   OptGPU (ms)   Speedup
64×256               3059          2236     1.35×
1024×256             9589          5810     1.65×
16384×256          152434         58319     2.65×
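The optimization measured in this experiment replaces per-element transfers across the Java-native boundary with one bulk transfer per array. The sketch below makes the difference in the number of boundary crossings explicit. It is an illustration under stated assumptions: FakeNative is a hypothetical stand-in that counts calls instead of going through JNI, the method signatures are ours and not those of the actual getFloatArray implementation, and each object is assumed to have a single float field. With n fields per object, the per-element count would scale to n times the array size, while the bulk count would scale to n.

```java
public class JniBatching {
    /** Hypothetical stand-in for the native side; counts crossings, no real JNI. */
    static class FakeNative {
        int crossings = 0;
        private final float[] memory;

        FakeNative(float[] memory) {
            this.memory = memory;
        }

        // One crossing per element, in the style of a single-value layout-getter.
        float getFloat(int index) {
            crossings++;
            return memory[index];
        }

        // One crossing for the whole array, in the style of a bulk getter.
        float[] getFloatArray(int offset, int length) {
            crossings++;
            float[] copy = new float[length];
            System.arraycopy(memory, offset, copy, 0, length);
            return copy;
        }

        int length() {
            return memory.length;
        }
    }

    static float sumPerElement(FakeNative n) {
        float sum = 0f;
        for (int i = 0; i < n.length(); i++) {
            sum += n.getFloat(i); // crosses the boundary on every iteration
        }
        return sum;
    }

    static float sumBulk(FakeNative n) {
        float[] all = n.getFloatArray(0, n.length()); // crosses the boundary once
        float sum = 0f;
        for (int i = 0; i < all.length; i++) {
            sum += all[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] data = new float[1024];
        java.util.Arrays.fill(data, 1.0f);

        FakeNative perElement = new FakeNative(data);
        sumPerElement(perElement);
        System.out.println("per-element crossings: " + perElement.crossings); // 1024

        FakeNative bulk = new FakeNative(data);
        sumBulk(bulk);
        System.out.println("bulk crossings: " + bulk.crossings); // 1
    }
}
```

The fixed cost per crossing is thus paid once rather than once per element, which is consistent with the speedup in Table I growing with the data size.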
Fig. 4. BFS benchmark run in JTP and GPU modes
Fig. 5. SC benchmark running in JTP and GPU modes
Fig. 6. Small problem size of each benchmark
Fig. 8. Benchmark performance of our Android-Aparapi version compared with the existing lower-level OpenCL version running on GPU
Kernels          2          2        1        2          1         2         1
Barriers         0          0        0        0          3         12        1
JTP Execution    122.26 s   3.64 s   0.22 s   120.54 s   87.57 s   76.12 s   283.75 s
GPU Execution    69.96 s    1.67 s   12.3 s   34.18 s    1.21 s    2.528 s   58.88 s
Original SLOC    525        349      358      686        671       592       2999
(Android-        182        212      130      228        328       411       850
                 263 ms     308 ms   160 ms   443 ms     501 ms    373 ms    559 ms
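Part of the porting effort reflected in the SLOC figures is the conversion of every multidimensional array into a one-dimensional array with a recalculated index. A minimal sketch of the row-major conversion follows. The class FlattenIndex and its methods are hypothetical illustrations, not code from our benchmarks, and the loops use plain indexed for statements because the extended for syntax is not supported in kernels.

```java
public class FlattenIndex {
    /** Copy a rectangular 2-D array into a 1-D array in row-major order. */
    static float[] flatten(float[][] m) {
        int rows = m.length;
        int cols = m[0].length;
        float[] flat = new float[rows * cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                flat[r * cols + c] = m[r][c];
            }
        }
        return flat;
    }

    /** Read element (r, c) from the flattened array; cols is the row length. */
    static float get(float[] flat, int cols, int r, int c) {
        return flat[r * cols + c];
    }

    public static void main(String[] args) {
        float[][] m = {{1f, 2f, 3f}, {4f, 5f, 6f}};
        float[] flat = flatten(m);
        System.out.println(get(flat, 3, 1, 2)); // 6.0
    }
}
```

Inside a kernel, every access of the form m[r][c] would then be written as flat[r * cols + c], with cols passed in as an ordinary int.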
