Decompiling Java Bytecode: Problems, Traps and Pitfalls
1 Introduction
all variables, but instead inserts spurious type casts to “fix up” code that has
unknown type.
We solve a more difficult problem, that of decompiling arbitrary, verifiable
bytecode. In addition to handling arbitrary bytecode, we also try to ensure that
the decompiled code can be compiled by a Java compiler and that the code
does not contain extraneous type casts or spurious control structures. Such a
decompiler can be used to decompile bytecode that comes from many sources
including: (1) bytecode from javac; (2) bytecode that has been produced by
compilers for other languages, including Ada, ML, Eiffel and Scheme; or (3)
bytecode that has been produced by bytecode optimizers. Code from these last
two categories may cause decompilers to fail because they were designed to
work specifically with bytecode produced by javac and cannot handle bytecode
that does not fit specific patterns.
To achieve our goal, we are developing a decompiler called Dava, based on
the Soot bytecode optimization framework. In this paper we outline the major
problems that we faced while developing the decompiler. We present many of the
major difficulties, discuss what makes the problems difficult, and demonstrate
that other commonly used decompilers fail to handle these problems properly.
Section 2 of this paper describes the problems in decompiling variables, types,
literals, expressions and simple statements. Section 3 introduces the problem of
converting arbitrary control flow found in bytecode to the control flow constructs
available in Java. Section 4 discusses the basic control flow constructions, while
the specific problems due to exceptions and synchronized blocks are examined
in more detail in Section 5. Related work and conclusions are given in Section 6.
In the Java source each variable has a name and a static type which is valid for
all uses and definitions of that variable. In the bytecode there are only untyped
locations — in method f there are 4 stack locations and 5 local locations. The
stack locations are used for the expression stack, while the local locations are used
to store parameters and local variables. In this particular example, the javac
compiler has mapped the parameter i to local 0, and the four local variables c,
r, d and is_fat are mapped to locals 1, 2, 3 and 4 respectively. The mapping
of offsets to variable names and the types of variables must be inferred by the
decompiler.
Another complicating factor in decompiling bytecode is that while Java sup-
ports several integral data types, including boolean, char, short and int, at the
bytecode level the distinction between these types is only made in the signatures
for methods and fields. Otherwise, bytecode instructions consider these types as
integers. For example, at Label2 in Figure 1(b) the instruction iload 4 loads
an integer value for is_fat from line 16 in Figure 1(a), which is a boolean value
in the Java program. This mismatch between many integral types in Java and
the single integer type in bytecode provides several challenges for decompiling.
These difficulties are illustrated by the result of applying several commonly
used decompilers. Figure 2 shows the output from three popular decompil-
ers, plus the output from our decompiler, Dava. Jasmine (also known as the
SourceTec Java Decompiler) is an improved version of Mocha, probably the
first publicly available decompiler[10,7]. Jad is a decompiler that is free for
non-commercial use whose decompilation module has been integrated into sev-
eral graphical user interfaces including FrontEnd Plus, Decafe Pro, DJ Java
Decompiler and Cavaj[6]. Wingdis is a commercial product sold by Wing-
Soft [16]. In our later examples we also include results from SourceAgain, a
commercial product that has a web-based demo version[14].1 Our tests used the
most current releases of the software available at the time of writing this pa-
per, namely Jasmine version 1.10, Jad version 1.5.8, Wingdis version 2.16, and
SourceAgain version 1.1.
Each of the results illustrates a different approach to typing local variables. In
all cases the variables with types boolean, Circle and Rectangle are correct.
The major difficulty is in inferring the type for variable d in the original program,
which should have type Drawable. The basic problem is that on one control path
d is assigned an object of type Circle, whereas on the other, d is assigned an
object of type Rectangle. The decompiler must find a type that is consistent
with both assignments, and with the use of d in the statement d.draw();. The
simplest approach is to always choose the type Object in the case of different
constraints. Figure 2(a) shows that Jasmine uses this approach. This produces
incorrect Java in the final line where the variable object needs to be cast to
a Drawable. Jad correctly inserted this cast in Figure 2(c). Wingdis exhibits a
bug on this example, producing no variable for the original d, and incorrectly
emitting a static call Drawable.draw();.
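The typing choice for d can be reproduced in a few lines of Java. The sketch below is only an illustration: the class bodies are hypothetical stand-ins for the classes of the running example, and draw() returns a String (an assumption made here so that the result can be inspected).

```java
// Hypothetical stand-ins for the classes of the running example. draw() returns
// a String here (an assumption) so that the result can be inspected.
interface Drawable {
    String draw();
}

class Circle implements Drawable {
    private final short radius;
    Circle(short radius) { this.radius = radius; }
    public String draw() { return "circle " + radius; }
}

class Rectangle implements Drawable {
    private final short width, height;
    Rectangle(short width, short height) { this.width = width; this.height = height; }
    public String draw() { return "rectangle " + width + "x" + height; }
}

class TypingDemo {
    // d typed as Object: both assignments are legal, but the call needs a cast.
    static String viaObject(short i) {
        Object d;
        if (i > 10) { d = new Rectangle(i, i); } else { d = new Circle(i); }
        // d.draw();                  // rejected by javac: Object has no draw()
        return ((Drawable) d).draw();
    }

    // d typed as Drawable: assignments and call are legal without any cast.
    static String viaDrawable(short i) {
        Drawable d;
        if (i > 10) { d = new Rectangle(i, i); } else { d = new Circle(i); }
        return d.draw();
    }

    public static void main(String[] args) {
        System.out.println(viaObject((short) 5));
        System.out.println(viaDrawable((short) 20));
    }
}
```

Both methods behave identically at run time; the difference is only whether the emitted source needs the cast to compile.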
1
The demo version does not support typing across several class files, so it is not
included in our first figure.
114 Jerome Miecznikowski and Laurie Hendren
As shown in Figure 2(d), our decompiler correctly types all the variables and
does not require a spurious cast to Drawable. The complete typing algorithm
is presented in our paper entitled “Efficient Inference of Static Types for Java
Bytecode”[5]. The basic idea is to construct a graph encoding type constraints.
The graph contains hard nodes representing the types of classes, interfaces, and
the base types; and soft nodes representing the variables. Edges in the graph are
inserted for all constraints that must be satisfied by a legal typing. For example,
the statement d.draw(); would insert an edge from the soft node for d to the
hard node for Drawable. Once the graph has been created, typing is performed
by collapsing nodes in the graph until all soft nodes have been associated with
hard nodes. In this case the soft node for d would be collapsed into the hard
node for Drawable. There do exist bytecode programs that cannot be statically
typed, and for those programs we resort to assigning types that are too general
and inserting down casts where necessary. However, we have found very few cases
where such casts need to be inserted, and in general our approach leads to many
fewer casts than simpler typing algorithms.

public static void f(short word0)
{   Object obj;
    boolean flag;

    if (word0 > 10)
    {   Rectangle rectangle =
            new Rectangle(word0, word0);
        flag = rectangle.isFat();
        obj = rectangle;
    }
    else
    {   Circle circle =
            new Circle(word0);
        flag = circle.isFat();
        obj = circle;
    }
    if(!flag)
        ((Drawable) (obj)).draw();
}
(c) Jad

public static void f(short s0)
{   boolean z0;
    Rectangle r0;
    Drawable r1;
    Circle r2;

    if (s0 <= 10)
    {   r2 = new Circle(s0);
        z0 = r2.isFat();
        r1 = r2;
    }
    else
    {   r0 = new Rectangle(s0, s0);
        z0 = r0.isFat();
        r1 = r0;
    }
    if (z0 == false)
        r1.draw();
    return;
}
(d) Dava
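The effect of the type constraints on d can be sketched with a much simpler procedure than the node-collapsing algorithm of [5]. The code below is a brute-force illustration under stated assumptions: the hierarchy table, the method names and the split into "assigned" and "required" types are all hypothetical, and the search simply tests every known type against the constraints.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TypeConstraintSketch {
    // Hard nodes: each known type mapped to its direct supertypes (hypothetical hierarchy).
    static final Map<String, List<String>> SUPERTYPES = new LinkedHashMap<>();
    static {
        SUPERTYPES.put("Object", List.of());
        SUPERTYPES.put("Drawable", List.of("Object"));
        SUPERTYPES.put("Circle", List.of("Object", "Drawable"));
        SUPERTYPES.put("Rectangle", List.of("Object", "Drawable"));
    }

    // True if 'sub' is 'sup' or reaches it by following supertype edges.
    static boolean isSubtype(String sub, String sup) {
        if (sub.equals(sup)) return true;
        for (String parent : SUPERTYPES.get(sub)) {
            if (isSubtype(parent, sup)) return true;
        }
        return false;
    }

    // Soft node for one variable: every assigned type must be a subtype of the
    // chosen type, and the chosen type must be a subtype of every required type.
    static List<String> solve(Set<String> assigned, Set<String> required) {
        List<String> candidates = new ArrayList<>();
        for (String t : SUPERTYPES.keySet()) {
            boolean ok = true;
            for (String a : assigned) ok &= isSubtype(a, t);
            for (String r : required) ok &= isSubtype(t, r);
            if (ok) candidates.add(t);
        }
        return candidates;
    }

    public static void main(String[] args) {
        // d is assigned a Circle and a Rectangle, and d.draw() requires Drawable.
        System.out.println(solve(Set.of("Circle", "Rectangle"), Set.of("Drawable")));
        // Without the use constraint, Object would also be acceptable.
        System.out.println(solve(Set.of("Circle", "Rectangle"), Set.of()));
    }
}
```

With the use constraint from d.draw() included, Drawable is the only remaining candidate, which is the typing that avoids the cast.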
The decompiled code produced by Wingdis, Figure 2(b), demonstrates the
difficulties produced by different integral types. This decompiler inserts spurious
typecasts for all uses of the variable short. Furthermore, constants as well as
variables must be assigned the correct integral type. For example, a call to
method f with a constant value must be made as f((short) 10); in order to
avoid a type conflict between the type of the argument (int) and the type of the
parameter (short).
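The need for the cast on constants follows from Java's rules for method invocation, which can be checked with a small example. The class and method names below are hypothetical; the overloaded g is added here to show that a wrongly typed constant can even select a different method.

```java
class ShortArgDemo {
    static String f(short s) { return "f(short) " + s; }

    // Overloads: the static type of the argument selects the method.
    static String g(short s) { return "g(short)"; }
    static String g(int i)   { return "g(int)"; }

    public static void main(String[] args) {
        // f(10);                          // rejected by javac: 10 has type int
        System.out.println(f((short) 10)); // the cast makes the constant a short
        short s = 10;                      // narrowing a constant is allowed in an assignment
        System.out.println(f(s));
        System.out.println(g(10));         // selects g(int)
        System.out.println(g((short) 10)); // selects g(short)
    }
}
```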
on the stack, the isFat method is invoked, which pops the object reference and
pushes isFat’s return value, and finally the return value is popped from the
stack and stored in local 4. The expression stack had height 0 at the beginning
of the statement and height 0 at the end of the statement.
This straightforward code generation strategy makes it fairly simple for a
decompiler to rebuild the statement. However, many other bytecode sequences
could express the same computations. Consider the example in Figure 3. Fig-
ure 3(a) gives the original bytecode as produced by javac, whereas Figure 3(b)
gives an optimized version of the bytecode. The optimized version uses 5 fewer
instructions and 3 fewer locals.2 An example of a simple optimization is found
at line 7. At this point the second iload 0 instruction has been replaced with a
dup instruction. A more complex optimization makes use of the expression stack
to save the values. For example, rather than storing the result of line 7 and then
reloading it at line 8, the value is just left on the stack. Furthermore, since this
same value is needed later, its value is duplicated (third dup at line 7). Line 8
demonstrates that the return value from the call to isFat can just be left on the
stack. The swap instruction at line 8 exchanges the boolean value on top of the
stack with the object reference just below it. Line 9 stores the object reference
from the top of the stack and Line 12 uses the boolean value that is now on top
of the stack for the ifne test.
When the optimized code from Figure 3(b) is given to the other decompilers,
they all fail because the bytecode does not correspond to patterns they expect
(see Figure 4, page 118). Jasmine and Jad emit error messages saying that the
control flow analysis fails and emit code that is clearly not Java. Wingdis emits
code that resembles Java but is clearly not correct as the calls to the method
isFat have been completely missed, and the type for the left operand of == is an
object rather than a boolean. SourceAgain also produces something that looks
like Java, but it is also incorrect since it allocates too many objects and has lost
the boolean variable.
Our Dava decompiler produces exactly the same Java code as for the unopti-
mized class file, except for the names of the local variables. Figure 2(d) contains
no variables starting with $, whereas in Figure 4(e) three variables do start
with $. In our generated code we prefix variables with $ to indicate variables
corresponding to stack locations in the bytecode.
Dava is insensitive to the input bytecode because it is built on top of the Soot
framework which transforms the bytecode into an intermediate representation
called Grimp[13,15]. Soot begins by reading bytecode and converting it to simple
three address statements (this intermediate form is called Jimple). When gen-
erating Jimple the stack locations become specially named variables. Soot then
uses U-D webs to separate different variables that may share the same local offset
in bytecode, and finally performs simple code cleanup and the typing algorithm.
2
It should be noted that this is not a contrived example; it merely illustrates the
problems we encountered when applying other decompilers to bytecode produced by
Java bytecode optimizers (even very simple peephole optimizers) and to bytecode
produced by compilers for other languages.
Given the typed Jimple, an aggregation step rebuilds expressions and produces
Grimp. Grimp is the starting point for our restructuring algorithms described in
the next section.
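The idea of turning stack locations into specially named variables can be illustrated with a symbolic stack. The sketch below is not Soot's API or its actual Jimple construction: the five-instruction mini language, the class name and the $s naming scheme are assumptions made for this illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class StackToThreeAddress {
    // Translates a tiny, hypothetical stack language into three-address statements.
    static List<String> translate(List<String> code) {
        List<String> out = new ArrayList<>();
        Deque<String> stack = new ArrayDeque<>();   // symbolic stack of variable names
        int next = 0;                               // counter for fresh stack variables
        for (String insn : code) {
            String[] p = insn.split(" ");
            switch (p[0]) {
                case "load": {                      // push a copy of a local
                    String t = "$s" + next++;
                    out.add(t + " = " + p[1]);
                    stack.push(t);
                    break;
                }
                case "dup": {                       // duplicate the top of the stack
                    String t = "$s" + next++;
                    out.add(t + " = " + stack.peek());
                    stack.push(t);
                    break;
                }
                case "swap": {                      // exchange the two top entries
                    String a = stack.pop(), b = stack.pop();
                    stack.push(a);
                    stack.push(b);
                    break;
                }
                case "invoke": {                    // pop the receiver, push the result
                    String recv = stack.pop();
                    String t = "$s" + next++;
                    out.add(t + " = " + recv + "." + p[1] + "()");
                    stack.push(t);
                    break;
                }
                case "store":                       // pop into a local
                    out.add(p[1] + " = " + stack.pop());
                    break;
                default:
                    throw new IllegalArgumentException(insn);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> code = List.of("load r", "dup", "invoke isFat", "swap", "store d", "store flag");
        translate(code).forEach(System.out::println);
    }
}
```

Because swap only reorders names on the symbolic stack, it emits no statement at all, and the dup and swap tricks of an optimizer reduce to ordinary copies that a later cleanup and aggregation step can fold away.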
The last major phase of our decompiler recovers a structured representation for
a method’s control flow. There may be more than one structured representation
for any given control flow graph (CFG), so in Dava, we focused on producing
a correct restructuring that would be easy to understand. Other goals, such as
fast restructuring or representing control flow with a restricted set of control
flow statements, are possible but not explored in Dava.
For correctness, we used a graph-theoretic approach and focused on the capabilities
of the Java grammar. For us, the key question was: “For any given set
of control flow features in the CFG, can we represent it with pure Java?” When
answering this question we must consider the following:
1. Every control flow statement in Java has exactly one entry point, and one
or more exit points.
2. Java provides labeled blocks, labeled control flow statements, and labeled
breaks and continues. With these, it is possible to represent any CFG that
forms a directed acyclic graph (DAG) in pure Java. Consider the following.
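As an illustration of point 2, the sketch below encodes a small DAG that is not a plain if-else, using only labeled blocks and labeled breaks. It is an example constructed for this point and is not taken from the paper; the node names A to D and the method name are hypothetical. The edges are A to B, A to C, B to C, B to D and C to D.

```java
class LabeledDagDemo {
    // Returns the sequence of CFG nodes visited for the given branch outcomes.
    static String run(boolean p, boolean q) {
        StringBuilder trace = new StringBuilder("A");   // node A
        toD: {
            toC: {
                if (p) {
                    trace.append("B");                  // node B
                    if (q) break toC;                   // edge B -> C
                    break toD;                          // edge B -> D
                }
                // edge A -> C: fall out of the inner block
            }
            trace.append("C");                          // node C, then edge C -> D
        }
        trace.append("D");                              // node D
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(false, false));
        System.out.println(run(true, true));
        System.out.println(run(true, false));
    }
}
```

Each labeled block ends exactly where its target node begins, so every forward edge becomes either a fall-through or a break to the label of the block that closes at the target.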
The only novel feature of this CFG is that it distinguishes edges representing
normal control flow from those representing the throwing of an exception.
The SET is built in 6 phases. A more complete description can be found in
our paper entitled “Decompiling Java Using Staged Encapsulation”[9]; here we
provide a brief overview. Each phase searches for a specific type of feature in the
CFG and produces structured Java language statements that can represent that
feature. The Java statement is then bundled with the set of nodes (wrapped
Grimp statements) from the CFG that would correspond to its body. Since
every structured Java statement has only one entry point, we can usually use
dominance to determine the body. For example, a while statement would consist
of the appropriate condition expression plus those statements from the CFG that
the condition dominates, minus those statements reachable by the control flow
from the condition that escapes the loop. The structured bundle is then nested
in the SET such that the set of statements in the bundle is a subset of those
in its parent node and a superset of those in its children nodes. In this way
the SET can be built up in any arbitrary order of node insertion. Note also
that the properties searched for in the CFG (i.e., dominance and reachability) are
transitive, which guarantees us that the superset/subset relations between SET
bundles and their children will always hold.
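The subset-based nesting described above can be sketched as a small tree whose insertion routine depends only on set containment. This is a simplified illustration and not Dava's implementation: the class names, the integer node ids and the show() helper are assumptions, and the sketch assumes that any two bundles are either nested or disjoint.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

class SetNode {
    final String label;                       // kind of Java construct, e.g. "while"
    final Set<Integer> body;                  // ids of the CFG nodes in this bundle
    final List<SetNode> children = new ArrayList<>();

    SetNode(String label, Set<Integer> body) {
        this.label = label;
        this.body = body;
    }

    // Inserts n somewhere below this node; assumes n.body is a subset of this.body.
    void insert(SetNode n) {
        for (SetNode c : children) {
            if (c.body.containsAll(n.body)) {  // n fits inside an existing child
                c.insert(n);
                return;
            }
        }
        // n may enclose children that were inserted earlier: move them under n.
        Iterator<SetNode> it = children.iterator();
        while (it.hasNext()) {
            SetNode c = it.next();
            if (n.body.containsAll(c.body)) {
                n.children.add(c);
                it.remove();
            }
        }
        children.add(n);
    }

    // Renders the tree as label(child,child,...).
    String show() {
        StringBuilder sb = new StringBuilder(label);
        if (!children.isEmpty()) {
            sb.append("(");
            for (int i = 0; i < children.size(); i++) {
                if (i > 0) sb.append(",");
                sb.append(children.get(i).show());
            }
            sb.append(")");
        }
        return sb.toString();
    }
}

class SetDemo {
    // Builds the same tree with two different insertion orders.
    static String build(boolean loopFirst) {
        SetNode root = new SetNode("method", Set.of(1, 2, 3, 4, 5, 6));
        SetNode loop = new SetNode("while", Set.of(2, 3, 4, 5));
        SetNode cond = new SetNode("if", Set.of(3, 4));
        if (loopFirst) { root.insert(loop); root.insert(cond); }
        else           { root.insert(cond); root.insert(loop); }
        return root.show();
    }

    public static void main(String[] args) {
        System.out.println(build(true));
        System.out.println(build(false));
    }
}
```

Both insertion orders produce the same nesting, which is the property that lets each phase add its bundles without regard to what earlier phases found.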
A decompiler must be able to find if, switch, while, and do-while statements,
labeled blocks, and labeled breaks and continues.
Many decompilers use reduction-based restructuring. These work by searching
the CFG for local patterns that directly correspond to those produced by
Java grammar productions. When a pattern is found it is reduced to a single
node in the CFG and the search is repeated. This process is iterated until no
more reductions can be found. In general this approach is difficult because the
library of patterns that are matched against does not cover all possible patterns
in the CFG. At some point, one may not find any more reductions, but still have
not reduced the program to a single structured statement.
In contrast, Dava searches for features in the control flow graph in order
of how flexibly they can be treated. For example, strongly connected components
must be represented by loops, which is an inflexible requirement. Accordingly,
the conditions of loops are to be found before the conditions of if statements.
4.1 Loops
The most general way to characterize cyclic behavior in the CFG is to begin by
searching for the strongly connected components (SCC). For each SCC, we build
a Java loop. By examining the properties of the entry and exit points in the SCC
we can determine which type of Java loop (while, do-while or while(true))
is suitable for the structured representation. Once we know the type of loop, we
know which statement in the CFG yields us the conditional expression (if any)
for the structured loop, and we can find the loop body.
We know that for every iteration of a Java loop, if the loop is conditional,
the condition expression must be evaluated, or if the loop is unconditional, the
entry point statement must be executed. To find nested loops, we simply remove
the condition statement, or the entry point statement, from the CFG and re-
evaluate to see if any SCCs remain. This process is iterated until no more SCCs
are found.
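The iteration described above can be sketched as follows. This is an illustrative simplification and not Dava's code: SCCs are found by mutual reachability rather than by a linear-time algorithm, each loop is assumed to have a single entry point, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class LoopFinder {
    // Nodes reachable from 'start' by at least one edge, ignoring removed nodes.
    static Set<Integer> reach(Map<Integer, List<Integer>> g, int start, Set<Integer> removed) {
        Set<Integer> seen = new TreeSet<>();
        Deque<Integer> work = new ArrayDeque<>();
        work.push(start);
        while (!work.isEmpty()) {
            for (int s : g.getOrDefault(work.pop(), List.of())) {
                if (!removed.contains(s) && seen.add(s)) work.push(s);
            }
        }
        return seen;
    }

    // SCCs that contain a cycle, found by mutual reachability (quadratic, for brevity).
    static List<Set<Integer>> sccs(Map<Integer, List<Integer>> g, Set<Integer> removed) {
        List<Set<Integer>> result = new ArrayList<>();
        Set<Integer> assigned = new HashSet<>();
        for (int n : new TreeSet<>(g.keySet())) {
            if (removed.contains(n) || assigned.contains(n)) continue;
            Set<Integer> fromN = reach(g, n, removed);
            if (!fromN.contains(n)) continue;            // n is not on any cycle
            Set<Integer> scc = new TreeSet<>();
            for (int m : fromN) {
                if (reach(g, m, removed).contains(n)) scc.add(m);
            }
            assigned.addAll(scc);
            result.add(scc);
        }
        return result;
    }

    // Entry point: the SCC member with a predecessor outside the SCC (assumed unique).
    static int entryOf(Map<Integer, List<Integer>> g, Set<Integer> scc) {
        for (Map.Entry<Integer, List<Integer>> e : g.entrySet()) {
            if (scc.contains(e.getKey())) continue;
            for (int s : e.getValue()) {
                if (scc.contains(s)) return s;
            }
        }
        return scc.iterator().next();                    // fallback: smallest member
    }

    // Collects loop bodies, outermost first, by repeatedly removing entry points.
    static List<Set<Integer>> allLoops(Map<Integer, List<Integer>> g) {
        List<Set<Integer>> loops = new ArrayList<>();
        Set<Integer> removed = new HashSet<>();
        List<Set<Integer>> current = sccs(g, removed);
        while (!current.isEmpty()) {
            loops.addAll(current);
            for (Set<Integer> scc : current) removed.add(entryOf(g, scc));
            current = sccs(g, removed);
        }
        return loops;
    }

    public static void main(String[] args) {
        // 1 heads an outer loop {1,2,3,4}; 2 heads an inner loop {2,3}; 5 is the exit.
        Map<Integer, List<Integer>> g = Map.of(
            0, List.of(1), 1, List.of(2, 5), 2, List.of(3), 3, List.of(2, 4), 4, List.of(1));
        System.out.println(allLoops(g));
    }
}
```

Removing node 1 breaks the outer cycle but leaves the cycle between 2 and 3 intact, so the second round reports the nested loop.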
This process seems to be more robust than reduction-based techniques. Consider
the small, if somewhat contrived, example in Figure 5, page 122. Method
foo() has no real purpose other than to illustrate the performance of a restruc-
turer on difficult, loop based control flow. The original Java source was compiled
with javac and the resulting bytecode class was not modified in any way. This
example has two interesting components, (1) the outer loop only executes if an
exception is thrown, and (2) if the inner loop exits normally, the next statement
that affects program state is the return.
We can see that only Dava produces correct, recompilable code, though it
does not greatly resemble the original program. Jad alone produces code that is
reminiscent of the original, but unfortunately it is neither correct nor recompil-
able.
We may encounter multi-entry point SCCs. Here the input does not directly
correspond to a Java structured program, so all decompilers will output ugly
Java code. There are several solutions, but all involve transforming the CFG.
Our solution converts the multi-entry point SCC to a single entry point SCC
by breaking the control flow to the original entry points and rerouting it to
a dispatch statement. This dispatch then acts as the single entry point and
redirects control to the appropriate destination.
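The dispatch transformation can be pictured with a two-node cycle between X and Y that may be entered at either node. The sketch below is a hand-written illustration of the idea, not output produced by Dava; the variable target, the method name and the rounds parameter (the number of times Y runs) are assumptions.

```java
class DispatchDemo {
    // Original CFG: a cycle X -> Y -> X that can be entered at X or at Y.
    static String run(boolean enterAtY, int rounds) {
        StringBuilder trace = new StringBuilder();
        int target = enterAtY ? 1 : 0;   // set on every edge into an original entry point
        while (true) {
            // Dispatch statement: the single entry point of the rewritten loop.
            if (target == 0) {
                trace.append("X");       // original entry point X, falls through to Y
            }
            trace.append("Y");           // original entry point Y
            if (--rounds <= 0) break;    // exit edge out of the loop
            target = 0;                  // back edge Y -> X, rerouted through the dispatch
        }
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(run(false, 2));
        System.out.println(run(true, 2));
    }
}
```

The extra variable and test are the price of the transformation: the loop is now expressible in Java, but it no longer looks like anything a programmer would have written.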
the thrown exception matches the table entry’s exception class, then control is
transferred to that entry’s handler instruction.
In bytecode, regular control flow imposes few restrictions on exception han-
dling. Control flow may enter or exit at any instruction within a table entry’s
area of protection, and does not have to remain constantly within that area once
it enters. Multiple control flow paths may enter a single area of protection at
different points, and different areas of protection may overlap arbitrarily. The
handler instruction may be anywhere within the class file, limited by the con-
straints of bytecode verification, including within the table entry’s own area of
protection. Finally, more than one exception table entry may share the same
exception handler. In short, exception handling in Java bytecode is mostly un-
structured.
By contrast, exception handling in the Java language uses the try, catch and
finally grammar productions and is highly structured. There is only one entry
point to a try statement, control flow within it is contiguous, and each of these
Java statements nests properly. There is no way to make try statements partially
overlap each other. Also, each try must be immediately followed by a catch
and/or a finally statement. There may be any number of catch statements
but no more than one finally.
If an exception is thrown and is not caught in a catch statement, then the
method in which this occurs must declare that it throws that exception. Method
declarations must agree between subclasses and superclasses. Therefore, if some
method m1 declares a throws and overrides or is overridden by another method
m2 , then m2 must also declare the throws.
There is a complication to the throws declaration rule. Object locking is pro-
vided in Java with the synchronized() statement. If a thrown exception causes
control to leave a synchronized() statement, the Java language specification
requires that the object lock be released. This is accomplished in the bytecode
by catching the exception, releasing the lock in the exception handler and finally
rethrowing the exception. This exception handling should not be translated into
try catch statements, but remains masked by the synchronized() statement.
Consequently, throws that are to be implied by a synchronized() statement’s
exception handling are not explicitly put in the Java language representation,
and therefore are also ignored in the method declaration.
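The language-level behaviour that this hidden handler implements can be observed directly. The sketch below uses hypothetical class and field names; it shows that the lock is released when an exception leaves the synchronized() statement, and that the enclosing method needs no throws clause for the implicit catch and rethrow.

```java
class SyncReleaseDemo {
    static boolean heldInside;
    static boolean heldAfter;

    // No throws clause is needed, although the bytecode catches and rethrows.
    static void work(Object lock) {
        synchronized (lock) {
            heldInside = Thread.holdsLock(lock);
            throw new IllegalStateException("leaving the block abruptly");
        }
    }

    static void run() {
        Object lock = new Object();
        try {
            work(lock);
        } catch (IllegalStateException e) {
            heldAfter = Thread.holdsLock(lock);   // the lock has already been released
        }
    }

    public static void main(String[] args) {
        run();
        System.out.println("held inside: " + heldInside + ", held after: " + heldAfter);
    }
}
```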
There are numerous consequences from this “semantic gap” in exception
handling. An area of protection must be represented by a try statement, and
handlers by a catch or finally. However, a try statement has only one entry
point. So, an area of protection with more than one entry point must be split into
as many parts as there are entry points. Each of these new areas of protection
shares the same handler, but a catch statement can only be immediately preceded
by a single try. To reconcile this, the handler statement (at least) must be
duplicated for each area of protection. If two areas of protection overlap but
neither fully encapsulates the other, we must break up at least one of the areas
to allow the resulting try statements to either be disjoint or nest each other
properly.
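The splitting of partially overlapping areas of protection amounts to cutting one range at the boundary of the other. The sketch below is a minimal illustration under stated assumptions: areas are modelled as half-open ranges of instruction offsets, and the Area class, its fields and the method splitAgainst are hypothetical names rather than part of any decompiler.

```java
import java.util.List;

// Half-open range [start, end) of instruction offsets guarded by one handler.
class Area {
    final int start, end;
    final String handler;

    Area(int start, int end, String handler) {
        this.start = start;
        this.end = end;
        this.handler = handler;
    }

    boolean contains(Area o) { return start <= o.start && o.end <= end; }
    boolean disjoint(Area o) { return end <= o.start || o.end <= start; }

    @Override
    public String toString() { return handler + "[" + start + "," + end + ")"; }
}

class ProtectionAreas {
    // If a and b overlap only partially, cut b at the boundary of a that lies
    // inside b, so that each piece of b is either nested in a or disjoint from it.
    static List<Area> splitAgainst(Area a, Area b) {
        if (a.disjoint(b) || a.contains(b) || b.contains(a)) {
            return List.of(b);                       // already expressible as try statements
        }
        int cut = (b.start < a.start) ? a.start : a.end;
        return List.of(new Area(b.start, cut, b.handler),
                       new Area(cut, b.end, b.handler));
    }

    public static void main(String[] args) {
        System.out.println(splitAgainst(new Area(0, 10, "h1"), new Area(5, 15, "h2")));
    }
}
```

Both pieces keep the handler of the area they came from, which is why the handler code itself has to be duplicated when the pieces become separate try statements.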
synchronized() blocks is both complex and specific, it turns out that the occur-
rence of the proper feature set is almost always the result of a synchronized()
block in the bytecode’s source. As such, it is already in a form that is easily re-
structured and an aggressive approach provides little improvement over simple
pattern matching.
To our knowledge there are few papers on the complete problem of decompiling
arbitrary bytecode to Java. There are many tools including the decompilers we
tested in this paper; however, there is very little written about the design and
implementation of those tools.
The implementation of the Krakatoa decompiler has been described in the
research literature[11]; however, we were unable to test this decompiler because
it is not publicly available. Krakatoa uses an extended version of Ramshaw’s
goto-elimination technique [12], which produces legal, though somewhat convo-
luted, Java structures by introducing loops and multi-level breaks. Krakatoa then
applies a series of rewrite rules to this structured representation where each rule
attempts to replace a program substructure with a more “natural” one. Such a
relatively strong restructurer may be able to handle complicated loops. While
it is not clear from the paper how the typing and expression building works,
Krakatoa appears to use the same approach as the decompilers we tested. All
program examples come from bytecode generated from javac. This approach
does not address the problems with exceptions and synchronization.
There has been related work on restructuring Java and other high-level lan-
guages. Research on restructuring can usually be divided into restructuring
with gotos, versus eliminating gotos. The independent works of Baker[2] and
Cifuentes[3] are prominent examples of the first category while Erosa[4] and Z.
Ammarguellat[1] are good examples of the second. These are general approaches
and would require modifications to deal with the special requirements of Java,
such as dealing with synchronization and exceptions.
Knoblock and Rehof[8] have worked on finding static types for Java programs.
Their approach differs from ours in that it works on an SSA intermediate
representation and may change the type hierarchy when types conflict due to
interfaces.
This paper has presented some of the problems, traps and pitfalls encoun-
tered when decompiling arbitrary, verifiable Java bytecode. We demonstrated
the problems in dealing with variables, literals and types, and showed how ex-
isting decompilers deal with the typing problem by inserting spurious type casts
(or by producing incorrect code). We showed that bytecode that has been opti-
mized is not correctly decompiled by any of the four decompilers we tested. This
demonstrates that such decompilers target bytecode that has been produced by
a known compilation strategy, such as that used by javac. We discussed the
overall problem of control flow structuring and showed that even control flow
produced by javac can be difficult to handle. Finally, we demonstrated that bytecode
allows for more general use of exceptions and synchronization than what
is produced from Java. In all cases our Dava decompiler was able to produce a
correct Java program.
Now that we have a robust decompiler, we will begin to concentrate on a
postprocessor that converts control flow constructs into idioms likely to be used
by a programmer, and on mechanisms for choosing readable variable names for
parameters and local variables. We will also continue to stress test the decom-
piler by decompiling class files from a variety of sources. The decompiler will be
released as part of the Soot framework, and will be publicly available. Currently,
interested parties can contact the first author for a “preview version” of
the software.
References
1. Z. Ammarguellat. A control-flow normalization algorithm and its complexity. IEEE
Transactions on Software Engineering, 18(3):237–250, March 1992.
2. B. S. Baker. An algorithm for structuring flowgraphs. Journal of the Association
for Computing Machinery, pages 98–120, January 1977.
3. C. Cifuentes. Reverse Compilation Techniques. PhD thesis, Queensland University
of Technology, July 1994.
4. A. M. Erosa and L. J. Hendren. Taming control flow: A structured approach to
eliminating goto statements. In Proceedings of the 1994 International Conference
on Computer Languages, pages 229–240, May 1994.
5. E. M. Gagnon, L. J. Hendren, and G. Marceau. Efficient inference of static types
for Java bytecode. In Static Analysis Symposium 2000, Lecture Notes in Computer
Science, pages 199–219, Santa Barbara, June 2000.
6. Jad - the fast JAva Decompiler.
http://www.geocities.com/SiliconValley/Bridge/8617/jad.html.
7. SourceTec Java Decompiler. http://www.srctec.com/decompiler/.
8. T. Knoblock and J. Rehof. Type elaboration and subtype completion for Java
bytecode. In Proceedings 27th ACM SIGPLAN-SIGACT Symposium on Principles
of Programming Languages, 2000.
9. J. Miecznikowski and L. Hendren. Decompiling Java using staged encapsulation.
In Proceedings of the Working Conference on Reverse Engineering, pages 368–374,
October 2001.
10. Mocha, the Java Decompiler.
http://www.brouhaha.com/~eric/computers/mocha.html.
11. T. A. Proebsting and S. A. Watterson. Krakatoa: Decompilation in Java (Does
bytecode reveal source?). In 3rd USENIX Conference on Object-Oriented
Technologies and Systems (COOTS’97), pages 185–197, June 1997.
12. L. Ramshaw. Eliminating go to’s while preserving program structure. Journal of
the Association for Computing Machinery, 35(4):893–920, October 1988.
13. Soot - a Java Optimization Framework. http://www.sable.mcgill.ca/soot/.
14. Source Again - A Java Decompiler. http://www.ahpah.com/.
15. R. Vallée-Rai, E. Gagnon, L. Hendren, P. Lam, P. Pominville, and V. Sundaresan.
Optimizing Java bytecode using the Soot framework: Is it feasible? In D. A.
Watt, editor, Compiler Construction, 9th International Conference, volume 1781
of Lecture Notes in Computer Science, pages 18–34, Berlin, Germany, March 2000.
Springer.
16. WingDis - A Java Decompiler. http://www.wingsoft.com/wingdis.html.