Latest Posts »
Latest Comments »
Popular Posts »

Typesafe DSLs in Java: Part 1 — Typesafe Bytecode

Written by Jevgeni Kabanov on March 24, 2008 – 12:25 am

Domain Specific Languages (DSLs) have been brought to Java under the name of Fluent Interface. However most of them utilize a lot of strings and untyped behavior to make the interface fluent enough. It turns out that using Java 5 and a bag of tricks we can have the compiler to check a lot more. In this post we'll check out how to write Java bytecode using ASM in a typesafe way.

Domain Specific Languages

Domain specific language usually refers to a small sublangauge that has very little overhead when expressing domain specific data and behaviour. DSL is a broad term and can refer both to a fully implemented language and a specialized API that looks like a sublanguage, but still written using some general-purpose language. Such DSLs in the latter meaning have been introduced (perhaps independently?) by both the functional and dynamic languages community. Both these communities (and especially functional) took advantage of function composition and operator overloading to build combinator-based languages that look nothing like the original. The functional communities also strongly support the notion of type-safety, therefore the DSLs they create are usually statically typed.

In Java the DSLs are also becoming more popular. The first examples like jMock and Hibernate Criteria were coined as Fluent Interface. However most of these DSLs are not typesafe and will allow wrong statements to be compiled. In this post we introduce some patterns that help making Java DSLs safer.

Java bytecode engineering

Java bytecode is a relatively simple stack-based language. All the code is contained in methods, class structure is pretty much preserved from source (fields and methods, both can be static, constructors and static initializers are turned into methods named "<init>" and "<clinit>" correspondingly). Inside the method we have the variables, referred by an index with 0 being "this", 1 being the first parameter and so on, local variables starting after parameters. We can load and store variables. We also have the stack, where we can push, pop and duplicate values. We have a number of basic operations on the stack (a la add and multiply) as well as method invocation. When invoking the methods parameters are gathered from the stack with last parameter being on top of the stack. Finally we have some flow control, namely conditional (and unconditional) jumps. That pretty much sums it up.

One of the best libraries for working with Java bytecode is ASM. It provides both a lightweight fast visitor-based interface and a more comfortable tree-based object-oriented interface. Unfortunately both of them (and especially visitor-based one) are completely untyped and debugging the wrong bytecode created using them is a huge pain in the neck. 'Nuff to say that Java bytecode verifier will tell you there's a problem, but will not tell you where exactly it is.

Further we will examine in details the bytecode of the following example:

JAVA:
  1. public class HelloWorld {
  2.   public static void main(String[] args) {
  3.     System.out.println("Hello, World!");
  4.   }
  5. }

If it looks simple to you -- it should. However the Java bytecode is a bit more complex, we have a default constructor that compiler will generate for you and we can do only one operation per line (more correctly per instruction). Next you see the HelloWorld example in Java bytecode (you can get a similar using "javah -c ClassName", although Bytecode outline Eclipse plugin is more comfortable).

JAVA:
  1. public class HelloWorld {
  2.   // Default constructor
  3.   public <init>()V
  4.     ALOAD 0
  5.     INVOKESPECIAL java/lang/Object.<init>()V
  6.     RETURN
  7.  
  8.   // System.out.println("Hello, World!");
  9.   public static main([Ljava/lang/String;)V
  10.     GETSTATIC java/lang/System.out : Ljava/io/PrintStream;
  11.     LDC "Hello, World!"
  12.     INVOKEVIRTUAL java/io/PrintStream.println(Ljava/lang/String;)V
  13.     RETURN
  14. }

Here's a summary of the instructions in the example:

  • ALOAD loads the local variables to the top of the stack, index 0 is "this"
  • INVOKESPECIAL in this case invokes super method, it also consumes and Object from the stack.
  • GETSTATIC retrieves the value of the static fields and puts it on the stack (in this case a PrintStream)
  • LDC pushes a constant value to the stack
  • INVOKEVIRTUAL invokes a usual (virtual) method, consuming the parameters and returning the result
  • RETURN returns from the method

If we want to write the same code from Java using ASM it would be almost the same:

JAVA:
  1. ClassWriter cw = new ClassWriter(0);
  2. MethodVisitor mv;
  3.  
  4. cw.visit(V1_6, ACC_PUBLIC + ACC_SUPER, "HelloWorld", null, "java/lang/Object", null);
  5.  
  6. {
  7. mv = cw.visitMethod(ACC_PUBLIC, "<init>", "()V", null, null);
  8. mv.visitCode();
  9. mv.visitVarInsn(ALOAD, 0);
  10. mv.visitMethodInsn(INVOKESPECIAL, "java/lang/Object", "<init>", "()V");
  11. mv.visitInsn(RETURN);
  12. mv.visitMaxs(1, 1);
  13. mv.visitEnd();
  14. }
  15. {
  16. mv = cw.visitMethod(ACC_PUBLIC + ACC_STATIC, "main", "([Ljava/lang/String;)V", null, null);
  17. mv.visitCode();
  18. mv.visitFieldInsn(GETSTATIC, "java/lang/System", "out", "Ljava/io/PrintStream;");
  19. mv.visitLdcInsn("Hello, World!");
  20. mv.visitMethodInsn(INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");
  21. mv.visitInsn(RETURN);
  22. mv.visitMaxs(2, 1);
  23. mv.visitEnd();
  24. }
  25. cw.visitEnd();

Typesafe ASM frontend

Take a look at the mv.visitMethodInsn(INVOKEVIRTUAL, "java/io/PrintStream", "println", "(Ljava/lang/String;)V");. It requires two elements on the stack -- an instance of PrintStream on which the method is called and a String. What we want to do is for compiler to issue an error when the stack in fact does not contain such elements. We propose the following DSL as the basis:

JAVA:
  1. new ClassBuilder(cw, V1_4, ACC_PUBLIC, "HelloWorld", "java/lang/Object", null)   
  2.     .beginStaticMethod(ACC_PUBLIC | ACC_STATIC, "main", void.class, String[].class)
  3.     .getStatic(System.class, "out", PrintStream.class)
  4.     .push("Hello, World!")
  5.      //Here String.class refers to the type of the first parameter
  6.     .invokeVirtualVoid(INVOKEVIRTUAL, PrintStream.class, "println", String.class)
  7.     .returnVoid()
  8.     .endMethod();

Note that now all the types are written as class literals instead of strings. In Java 5, class literals are generified to Class<C>, where C refers to the actual underlying type. This already allows for less mistakes, since the classes are no longer written as strings. However we want more!

The first thing we want to do is to track the stack size. We wouldn't want to allow to pop() off an empty stack. And we would want to ensure that when you invoke a method all the parameters are there. To do that we introduce a simple idiom:

The type you return from your DSL method should allow exactly those operations that are possible with the current state.

What does it mean in our context? It means that when we build a method we don't have just a MethodBuilder class, but MethodBuilderS0, MethodBuilderS1, MethodBuilderS2 and so on, where the S* refers to the stack size. And each of them has only the methods possible with the current stack size. Now the method for pop() will not even show up in autocompletion if the stack is not large enough.

We can apply the same trick to tracking variables, allowing to add only one variable at a time and providing methods loadVar*/stroreVar*. This way our class becomes MethodBuilderS*V* where S stands for stack size and V stands for variable count.

Of course we want more. We want to actually track the stack and variable types. To do that we introduce the following idiom:

The DSL type should be parametrized by all of the accumulated types as should be it's methods.

What does that mean? Let's return to our example and see what happens if I replace the string "Hello, World!" with an integer 11?

asm-error-1.PNG

As we can see from the screenshot the compiler inferred that the argument invokeVirtualVoid() will receive is an Integer (generics do not work with primitive types), whereas we have stated that it should receive a String as a parameter. Another thing we can see is that invokeVirtualVoid() method belongs to the MethodBuilderS2V1<Self, PrintStream, Integer, String[]> class.

As we know the name MethodBuilderS2V1 refers to a class that tracks two stack slots and one variable. The types are encoded accordingly in <Self, PrintStream, Integer, String[]> -- the two stack types are PrintStream, Integer and the one variable type us String[]. Self is not important for now.

To understand how the tracking is implemented let's see the push() and pop() implementation (the rest are omitted).

JAVA:
  1. public class MethodBuilderS2V1 <O, S0, S1, V0> {
  2.   public <S> MethodBuilderS3V1<O, S0, S1, S, V0> push(S value) {
  3.     mv.visitLdcInsn(value);
  4.     return new MethodBuilderS3V1<O, S0, S1, S, V0>(cb, mv);
  5.   }
  6.  
  7.   public MethodBuilderS1V1<O, S0, V0> pop() {
  8.     mv.visitInsn(POP);
  9.     return new MethodBuilderS1V1<O, S0, V0>(cb, mv);
  10.   }

So as you can see the class itself is parametrized by all of the accumulated type parameters (which were one by one added by previously invoked DSL methods). We can see that push() returns an instance of MethodBuilderS3V1. That class tracks exactly one more stack slot, which is passed to it as inferred from the argument (S). pop() on the other hand just discards one stack slot and returns an instance of MethodBuilderS1V1.

So what does the invokeVirtualVoid() look like? First of all, it needs to consume two stack variables, therefore we need at least MethodBuilderS2V* class to call it. Therefore all classes with less stack variables will not have this method. Secondly we need to check that the types in the stack are fitting, an must also allow for some leniency due to subtyping:

JAVA:
  1. public class MethodBuilderS2V1<O, S0, S1, V0> {
  2.     public MethodBuilderS0V1<O, V0> invokeVirtualVoid(int kind, Class<? super S0> owner, String name, Class<? super S1> parameter1) {
  3.     mv.visitMethodInsn(
  4.         kind,
  5.         owner.getName().replace('.', '/'),
  6.         name,
  7.         Type.getMethodDescriptor(Type.VOID_TYPE, new Type[] {Type.getType(parameter1)}));
  8.    
  9.     return new MethodBuilderS0V1<O, V0>(cb, mv);
  10.   }
  11. }

As you can see it indeed consumes two stack values returning MethodBuilderS0V1. If our method would also return a result we would need to take the result type as well, and return a class with stack depth only one less. The expression "? super S0" means that we require the actual parameter type to be a superclass of the stack type.

Problem 1: The world isn't perfect

One of the problems with the DSL we proposed here is that it assumes all the types exist. However it is very often the case that some of the types (most prominently the class currently being created) do not. You can somewhat alleviate the problem by introducing a special placeholder Self type and use it instead of the current class name, but it won't solve the problem of other classes still awaiting construction.

A different (but connected) problem is that you are not always constructing the full method, instead you could be just creating a prelude for a particular method or replacing one instruction with a series of your own. To solve we add this idiom:

We should allow escaping from the rigid typesafe world by having unsafe operations, which issue compiler warnings

This means that instead of missing pop() from a type with no stack slots we should just deprecate it or otherwise issue a warning. This also means that we should have invoke* methods that take strings as parameter types, similarly deprecated.

However, since we know it's not a perfect world we'd like to at least protect ourselves a bit better. Therefore we add another idiom:

Allow users to document their assumptions.

This means that when we need to write just a fragment of bytecode we want to document what stack values and variables will it need. For that we introduce methods assumePush()/assumePop() and assumeVar*(), which do not push any values, but just add the corresponding type variables:

JAVA:
  1. public class MethodBuilderS2V1 <O, S0, S1, V0> {
  2.   public <S> MethodBuilderS3V1<O, S0, S1, S, V0> assumePush(Class<S> type) {
  3.     return new MethodBuilderS3V1<O, S0, S1, S, V0>(cb, mv);
  4.   }
  5.  
  6.   public MethodBuilderS1V1<O, S0, V0> assumePop() {
  7.     return new MethodBuilderS1V1<O, S0, V0>(cb, mv);
  8.   }
  9.  
  10.   public <V> MethodBuilderS2V2<O, S0, S1, V0, V> assumeVar1(Class<V> type) {
  11.     return new MethodBuilderS2V2<O, S0, S1, V0, V>(cb, mv);
  12.   }

Using them we can document our expectations and let the compiler validate them. The following is an example of how we can use the assumptions to document that we expect PrintStream to be on stack and String[] to be the variable 0.

JAVA:
  1. private static <O> void genSayHello(MethodBuilderS0V0<O> mb) {
  2.     mb.assumeVar0(String[].class)
  3.     .assumePush(PrintStream.class)
  4.     .loadVar0(String[].class)
  5.     .push(0)
  6.     .arrayLoad(String[].class, Integer.class, String.class)
  7.     .invokeVirtualVoid(INVOKEVIRTUAL, PrintStream.class, "println", String.class);
  8.   }

Problem 2: Control flow and reuse

The next problem is how to mix the DSL with the general-purpose control flow and method calls. The problem here is that (although it is hidden from the user) we return a different type every time. Therefore if we add a conditional operation, we would need to save the type in a variable, which would be clumsy, since it contains a lot of inferred types. It also wouldn't cope with changes, since the inferred types would change every time. To solve this we invoke the next idiom:

Hide control flow in closures.

In fact we introduce two types of closures -- the strongly typed and weakly typed. The weakly typed one loses all accumulated type information and is meant first of all for reusable methods, since they need to take one based type as an argument:

JAVA:
  1. public interface Closure {
  2.   public <O> void apply(MethodBuilderS0V0<O> mb);
  3. }

The strongly typed closures are generated with the class and look like that:

JAVA:
  1. public class MethodBuilderS2V1 <O, S0, S1, V0> {
  2.   public MethodBuilderS2V1<O, S0, S1, V0> closure(ClosureS2V1 closure) {
  3.     closure.apply(this);
  4.     return this;
  5.   }
  6.  
  7.   public MethodBuilderS2V1<O, S0, S1, V0> closure(Closure closure) {
  8.     closure.apply(new MethodBuilderS0V0<O>(cb, mv));
  9.     return this;
  10.   }
  11.  
  12.   interface ClosureS2V1 {
  13.     <O, S0, S1, V0> void apply(MethodBuilderS2V1<O, S0, S1, V0> mb);
  14.   }
  15. }

We illustrating using the weakly typed closures to call the method genSayHello() introduced previously:

JAVA:
  1. .beginStaticMethod(ACC_PUBLIC | ACC_STATIC, "main", void.class, String[].class)
  2.     .getStatic(System.class, "out", PrintStream.class)
  3.     .closure(new Closure() {
  4.       public <O> void apply(MethodBuilderS0V0<O> mb) {
  5.         genSayHello(mb);
  6.       }
  7.     })
  8.  
  9.     .returnVoid()
  10.     .endMethod();

Of course with the introduction of closures in Java 7 this idiom would be much shorter

Problem 3: Productivity and performance

One of the problems with having classes for each stack depth and variable size is that we have a lot of them ((max stack depth + 1) times (max variable count + 1)). And we need to have a reasonably big number as maximum before deprecating the methods and losing track of some stack slots and variables. Writing them by hand would be exteremely unefficient, therefore we propose to use a generator.

The other problem is that having such a great number of classes and instances at runtime could cause performance problems. The way we could solve it is providing also a source-compatible unsafe library that would provide exactly the same interface, but always return "this" and track no types to allow good perfromance. It cannot be made binary compatible, since every method has a different signature (even though inference hides it from the user), but since those signatures are not present in source it would be a matter of recompiling the application for production, after insuring in development that the code is typesafe.

Problem 4: Primitive types and operations

The final problem is one of the most annoying and the one we tackled the least. Since generic types cannot be parametrized by primitive types we have to track boxed types instead of primitives (Integer v/s int). This means that we cannot distinguish among them and they can still cause a runtime error. What's worse we need to somehow indicate to a method that the parameter is a primitive type, but still need to pass to it a boxed one to satisfy the inference. There are several possibilities here, but all of them are relatively ugly.

A similar problem, which is yet to find an elegant solution is that double and long take two stack slots, whereas everything else takes one. And of course some coercions will not work with the current prototype.

If you have any ideas, please let us know.

But is it real?

The code is available in a Google Code repository. The code is just a prototype to show off the typesafe DSLs for our upcoming paper and is yet nowhere ready to even consider its usage.

The generator is written in PHP and driven by a BASH script. It may sound crazy, but it allowed me to prototype the code very quickly.

UPDATE: I also uploaded a downloadable distribution with 10x20 classes pregenerated for your pleasure. Check out src/Test.java for the example and gen/* for the generated classes. The asm-3.0.jar must be in the classpath.


Tags: , , ,
Posted in Featured, creative | 6 Comments »

6 Comments to “Typesafe DSLs in Java: Part 1 — Typesafe Bytecode”

  1. Christian Vest Hansen Says:

    This is certainly interesting. Perhaps you could create a number of methods designed specifically to work on the primitive methods? Like you have invokeVirtualVoid, you’d also have invokeVirtualInt and likewise the other primitive types.

    As for the performance concern of the numerous instances, could you not create just a single instance of each of these MethodBuilders, perhaps even lazily, per instance of ClassBuilder, and then pass a StateObject around that contains everything they need to know at each particular call?

    I also think Closures are the right way to go about control flow.

    Finally, I’d like to say, that some people might want to do some very crazy things with ASM – things your API might not anticipate or allow, so it should provide support for dropping down to “guru level” whenever needed, and give direct access to the Method and ClassWriters.

    Oh, and do you have a mailing list, or is the RSS feed for this blog th best way to “stay posted”?

  2. Jevgeni Kabanov Says:

    >Like you have invokeVirtualVoid, you’d also
    >have invokeVirtualInt and likewise the other
    >primitive types.

    invokeVirtualInt wont work, since ints can appear anywhere in the descriptor, both as arguments and as result. What you can do is declare a Placeholder parametric class, which could take an enum or somesuch and overload the invoke* methods to take either Class or that. You can also just allow giving it as string. Both approaches are viable, but not pretty.

    >As for the performance concern of the numerous
    >instances…

    No I can’t, since the methods must be polymorphic (parametrized) and the only way to do it in Java is to have the particular instance parametrized. If I have the StateObject I’ll lose the type safety (unfortunately). It is possible to make all the methods static (static methods are not polymorphic to the instance), but then I’d need to pass the instance and apply them in reverse order, which again sucks.

    >it should provide support for dropping down
    >to “guru level” whenever needed

    Of course. That’s partially what assume* exist for.

  3. Jevgeni Kabanov Says:

    Oh, and about staying up-to-date — it’s not a true versioned project at the moment, so this blog is most likely the best place to follow for now.

  4. Sp3w » Blog Archive » Linkage 2008.03.26 PM Says:

    [...] Typesafe DSLs in Java: Part 1 – Typesafe Bytecode [...]

  5. Alex Popescu Says:

    I don’t think usage of class literals will really work when defining classes, as you may need to refer classes that haven’t been loaded.

    cheers,

    ./alex

    .w( the_mindstorm )p.
    Alexandru Popescu

  6. Jevgeni Kabanov Says:

    Alex: I discuss that in “The world isn’t perfect”. Most classes you access are likely to be present and some classes you can access using special placeholders (e.g. Self for the class being constructed). In general case you just have to make some calls not type safe. It will still provide a great deal of benefit to many projects.

Leave a Comment

Additional comments powered by BackType