Interrupted

Unordered thoughts about programming, engineering and dealing with the people in the process.

Decompiling Clojure I

| Comments

This is the first in a series of articles about decompiling Clojure, that is, going from JVM bytecode created by the Clojure compiler, to some kind of higher level language, not necessarily Clojure.

This article was written in the scope of a larger project, building a better Clojure debugger, which I’ll probably blog about in the future.

These articles are going to build form the ground up, so you may skip forward if you find some of the stuff obvious.

Clojure targets the JVM

To be more precise, there is a Clojure compiler targeting the JVM, there’s also one targeting Javascript, one for the CLR and there are some less known projects targeting lua or even C.

But the official clojure core efforts are mainly on the JVM, which stands for Java Virtual Machine.

That means when you write some clojure code:

1
2
3
(ns hello-world)

(println "Hello World")

You won’t get a native binary, for instance a x86 PE or ELF file, although it’s entirely possible to write a compiler to do it.

When you target a particular runtime though, you usually get a different set of functions to interact with the host, there’s a lot of language primitives just to deal with Java inter operation which do not migrate easily to other runtimes or virtual machines.

The JVM is about Java, or is it?

This doesn’t mean that the JVM can only run programs written in Java.

In fact, Clojure doesn’t use Java as an intermediate language before compiling, the Clojure compiler for the JVM generates JVM bytecode directly using the ASM library.

So, what does it mean the JVM is about Java if you can compile directly to bytecode without a mandatory visit to the kingdom of nouns?

Besides its name, the JVM was designed by James Gosling in 1992 to support the Oak Programming Language, before evolving into its current form.

Its main responsibility is to achieve independence from hardware and operating system, and like a real machine, it has an instruction set and manipulates memory at runtime, and the truth is the JVM knows nothing about the Java Programming Language, it only knows of a particularly binary format, the class file format, which contains bytecode and other information.

So, any programming language with features that can be expressed in terms of a valid class file, can by hosted on the JVM.

But the truth is that the class file format, maintains a lot of resemblance with concepts appearing in Java, or any other OO programming language as a matter of fact, to name a few:

  • A class file corresponds to a Class
  • A class file has members
  • A class file has methods
  • Methods of the class file can be static or instance methods
  • There are primitive types and reference types, that can be stored in variables
  • Exceptions are an instance or subclass of Throwable
  • etc

So, we can say the JVM is not agnostic regarding the concepts supported by the language, as the LISP machines were not agnostic either.

Clojure compiles to bytecode

So we have a language like Clojure, with many concepts not easily mapped to the JVM spec, but that it was mapped none the less, how?

Namespaces do not exist

Maybe you think Clojure namespaces correspond to a class, and each method in the namespace is mapped to a method in the class.

Well, that is not the case.

Namespaces were criticized before for being tough, and the truth is they’re used for proper modularity, but do not map to an entity in the JVM. They’re equivalent to java packages or modules in other languages.

Each function is a class

Each function in your namespace will get compiled to a complete different class. That’s something you can easily confirm listing the files under target/classes in a leiningen project directory.

1
2
3
4
5
6
7
8
9
10
11
12
git:(master)  ls target/classes/

config$fn__292.class
routes__init.class
config$loading__4910__auto__.class
config$read_environment$fn__300.class
config$read_environment.class
config$read_properties$iter__304__308$fn__309$fn__310.class
server
server$_main.class
server$_main$fn__4006.class
server$fn__3939.class

You will find a .class file for each function you have defined, namespace$function.class being the standard syntax.

Each anonymous function is also a class

As you saw in the previous listing, there are many functions with numbers like config$fn__292.class.

Those correspond to anonymous functions that get their own class when compiled, so if you have this code:

1
(map #(+ 34 %) (range 10))

You should expect a .class file for the anonymous function #(+ 34 %).

class files don’t need to be on disk

Many times you’ll find the class files on disk, but it doesn’t have to be that way.

In many circumstances we’re going to be modifying the class structure on runtime, or creating new class structures to be run, entirely on memory. Even the compiler can eval some code compiling to memory without creating a class file on disk.

What does bytecode look like?

For the first example, I selected a real simple clojure function

1
2
3
4
5
6
7
(defn test-multi-let
  []
  (let [a 1
        b 2
        c 3
        b 4]
    9))

To explore the bytecode we will use javap, simple, but does the job:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
javap -c target/classes/debugee/test\$test_multi_let.class
...
public static {};
    Code:
       0: lconst_1
       1: invokestatic  #19                 // Method java/lang/Long.valueOf:(J)Ljava/lang/Long;
       4: putstatic     #21                 // Field const__0:Ljava/lang/Object;
       7: ldc2_w        #22                 // long 2l
      10: invokestatic  #19                 // Method java/lang/Long.valueOf:(J)Ljava/lang/Long;
      13: putstatic     #25                 // Field const__1:Ljava/lang/Object;
      16: ldc2_w        #26                 // long 3l
      19: invokestatic  #19                 // Method java/lang/Long.valueOf:(J)Ljava/lang/Long;
      22: putstatic     #29                 // Field const__2:Ljava/lang/Object;
      25: ldc2_w        #30                 // long 4l
      28: invokestatic  #19                 // Method java/lang/Long.valueOf:(J)Ljava/lang/Long;
      31: putstatic     #33                 // Field const__3:Ljava/lang/Object;
      34: ldc2_w        #34                 // long 9l
      37: invokestatic  #19                 // Method java/lang/Long.valueOf:(J)Ljava/lang/Long;
      40: putstatic     #37                 // Field const__4:Ljava/lang/Object;
      43: return

I’ve removed some extra information such as variable tables, we’re going to be visiting those later.

What you see here are JVM assembly instructions, just a subset of the JVM instruction set, generated by the Clojure compiler when feed with the sample function above.

Before we get into more details, let me show you how that code looks after a basic decompiler pass:

1
2
3
4
5
0:a = 1
2:b = 2
6:c = 3
11:b = 4
16:RETURN 9

Prettier uh?

That is until you decompile this:

1
2
3
4
5
(defn test-if
  []
  (if (= (inc 1) 2)
    8
    9))

And get this:

1
2
3
0:IF (2 != clojure.lang.Numbers.inc(1)) GOTO 11 ELSE GOTO 18
11:RETURN 9
18:RETURN 8

Who was the moron that put a BASIC in my Clojure!

Ain’t it?

Keep reading… there’s more to be seen ahead.

The operand stack

I won’t dwell into many details about each JVM instruction and how that translates to something resembling Clojure, or Basic for that matter, but there’s one thing worth of mention, and that is the operand stack.

Frames

A new Frame is created each time a method is invoked and destroyed when the method completes, whether that information is normal or abrupt (throws an uncaught exception), frames are allocated in the JVM stack and have its own array of local variables and its own operand stack.

If you set a breakpoint on your code, each different entry in your thread callstack, is a frame.

Stack

The operand stack is a last-in-first-out (LIFO) stack and its empty when the Frame that contains it is created, and the JVM provides instructions for loading constants or variables into the operand stack, and to put values from the operand stack in variables.

The operand stack is usually used to prepare parameters to be passed to methods and to receive method results, as opposed to using registers to do it.

So you should expect something along these lines:

1
2
3
4
5
6
7
8
f(a, b, c)

=> compiling to

push a
push b
push c
call f

So looking again at the bytecode of the previous function:

1
2
3
4
5
(defn test-if
  []
  (if (= (inc 1) 2)
    8
    9))

Here:

1
2
3
4
5
6
7
8
9
10
11
12
public java.lang.Object invoke();
    Code:
       0: lconst_1
       1: invokestatic  #63                 // Method clojure/lang/Numbers.inc:(J)J
       4: ldc2_w        #42                 // long 2l
       7: lcmp
       8: ifne          18
      11: getstatic     #49                 // Field const__4:Ljava/lang/Object;
      14: goto          21
      17: pop
      18: getstatic     #53                 // Field const__5:Ljava/lang/Object;
      21: areturn

We can make ourselves an interpretation about what’s going on…

lconst_1 is pushing the constant value 1 into the stack, then calling a static method with invokestatic, as you’ve already guessed that’s the clojure.lang.Numbers.inc(1) we saw on the basic decompiler earlier.

Then ld2_w loads the value 2 into the stack and lcmp will compare it against the function result, ifne tests for non equality and jumps to line 18 if values differ.

One thing to consider here is that each entry on the operand stack can hold a value of any JVM type, and those must be operated in ways appropriate to their types, so many operations have a different operation code according to the type they’re handling.

So looking at this example from the JVM specification, we see the operations are prefixed with a d since they operate on double values.

1
2
3
4
5
Method double doubleLocals(double,double)
0 dload_1 // First argument in local variables 1 and 2
1 dload_3 // Second argument in local variables 3 and 4
2 dadd
3 dreturn

Which as you may have guessed, is adding double values 1 and 3.

JVM auxiliary information

The JVM class format has support for some extra information that can be used for debugging purposes, some of which you can get rid from your files if you want.

Among those we find the LineNumberTable attribute and the the LocalVariableTable attribute, which may be used by debuggers to determine the value of a given local variable during the execution of a method.

According to the jvm spec, the table has the following structure inside the class file format

1
2
3
4
5
6
7
8
9
10
11
LocalVariableTable_attribute {
       u2 attribute_name_index;
       u4 attribute_length;
       u2 local_variable_table_length;
       {   u2 start_pc;
           u2 length;
           u2 name_index;
           u2 descriptor_index;
           u2 index;
       } local_variable_table[local_variable_table_length];
   }

Basically it says which variable starts at which instruction: start_pc and lasts for how long: length.

If we look at that table for our let example:

1
2
3
4
5
6
7
(defn test-multi-let
  []
  (let [a 1
        b 2
        c 3
        b 4]
    9))

We see how each variable is referenced against program counter(pc) line numbers (do not get confused with source file line numbers).

1
2
3
4
5
6
7
LocalVariableTable:
      Start  Length  Slot  Name   Signature
             2      17     1     a   J
             6      13     3     b   J
            11       8     5     c   J
            16       3     7     b   J
             0      19     0  this   Ljava/lang/Object;

One interesting thing though, is the LineNumberTable

1
2
3
  public debugee.test$test_multi_let();
    LineNumberTable:
      line 204: 0

Which has only one line number reference, even if our function was 7 lines long, obviously that cannot be good for a debugger expecting to step over each line!

Next post I’ll blog about the Clojure compiler and how it ends up creating that bytecode, before visiting again the decompiling process.

I’m guilespi on Twitter, get in touch!

Comments