Posts Tagged: java


10
Nov 08

The Performance Cutoff

A few days ago I had a small epiphany on a simple yet important issue. I was trying to squeeze those last few percents of performance out of JavaRebel and it came to the point where I started optimizing individual hotspots method-by-method.

One of the most common ways to improve execution time of a specific method is the cutoff. For many complicated enough methods there are some inputs for which you can return immediately and cut off the main execution path. To explain let me use the following example:
[java]
Output doSomething(Input input) {
// Do something for 200 ms
return output;
}
[/java]
This method could do anything as long as the complexity is constant, we only care how much time it takes on average. I could also have taken some algorithm (e.g. I started with matrix multiplication), but then we’d have to bring in complexity estimates and I’d like to keep it simple for now.

Let’s assume that for some inputs we could calculate the output in 50ms instead of 200ms. We introduce a check and the code is now:
[java]
Output doSomething(Input input) {
if (isSimpleInput(input)) {
// Do something for 50 ms
return output;
}

// Do something for 200 ms
return output;
}
[/java]
This is a common way to optimize the method execution time. However the question is if this actually optimized anything? It seems like a stupid question, but let’s estimate the new average execution time.

To do that we need to know two things. The time t it takes to do the isSimpleInput() check and the proportion p of method calls that are “simple”. Let’s assume that t is 20 ms and 10% of the calls are simple. The average execution time can then be calculated using the expected value formula (EV = p * v1 + (1 – p) * v2):
EV = 0.1 * (20 + 50) + (1 – 0.1) * (20 + 200)
= 20 + 0.1 * 50 + 0.9 * 200
= 20 + 5 + 180 = 205

This calculation shows that with this values we have actually increased the average method execution time by 5 ms. Note that we are still winning every time the cutoff occurs (20 + 50 << 200), but we are losing on the average.

The same equation can also be applied to measuring several checks or measuring execution time that is dependent on the input, however it gets increasingly more complicated. For me the main value of this is realizing that every cutoff has an inherent cost which I will now pay in mind.

Before you think this is all just math, the epiphany that came to me was not to add a check, it was to remove one. I realized that a particular check I’m making is rare and expensive enough to directly influence the method execution time and lo and behold, removing it gave a 30% improvement for the benchmark I was using.


3
Nov 08

When protected isn’t protected

For all those spec believers out there: what would you think if I told you that all of the JVM implementations consistently violate the spec for at least one particular instance?

Our story begins on a cold and lonely evening when I discover a bug in JavaRebel in conjunction with Xerces XMLEntityManager. The problem was with the following method:
[java]
public String[] getRecognizedProperties() {
return (String[])(RPROPERTIES.clone());
}
[/java]
However when I looked at the class bytecode I saw a different picture:
[java]
{
GETSTATIC XMLEntityManager.RPROPERTIES : String[]
INVOKEVIRTUAL Object.clone() : Object[]
CHECKCAST String[]
ARETURN
}
[/java]
For those of you who do not spend hours staring at the Java bytecode, it does the following:

  1. Retrieve the RPROPERTIES static field value (which is of type String[]) and put it on the top of the stack.
  2. Call Object.clone() using the top of the stack as target object (in our case the array). The target is popped off the stack and the result is placed on top of the stack instead of it.
  3. Cast the top of the stack to a String[]
  4. Return the top of the stack as method result.

Whooph. Now that is all fine except for one tiny thing. This is not legal. Object.clone() is a protected method and the JVM spec (see 2.7.4, INVOKEVIRTUAL) allows calls to protected method only if the following is true:

  1. The accessing class is a subclass of the target class or is the same class (check, XMLEntityManager is a subclass of Object, so’s everything else).
  2. The class of the target object (the actual runtime class of the object, not the target class we use in the signature) is either a subclass of the accessing class or is the same class.

The second check may sound crazy, but it restricts the calls to protected methods only to the classes inheriting from the target class, as you would expect. It can also only be enforced during runtime and is the cause of the IllegalAccessError you may see once in a while.

And obviously in our case String[] that is the actual type of the target object is in no way a subclass of the XMLEntityManager. This call should have cause an error in the JVM, but for some inexplicable reason it doesn’t. Further tests show that this is only limited (AFAIK) to the Object.clone() call, which is public as far as JVM cares.

A little googling turned up two JVM bugs that were discussing the same issue (bug 1, bug 2). Turns out the Sun javac compiler has been emitting wrong code at least until 1.4.2 and the JVMs have accommodated accordingly right until today.

Does this have any practical importance? Not for most of you, but if you handle Java bytecode in any way be prepared to handle this exception as well. To me it proves that there is more to Java than just the spec.


28
Oct 08

What do we really know about non-blocking concurrency in Java?

If my yesterday’s post taught me anything it’s this: not that much. This post is a result of a lengthy chat with Heinz Kabutz and Kirk Pepperdine (who really do know something about non-blocking concurrency) as well as comments on the post and the Reddit entry.

When I put together the yesterday’s post (When System.currentTimeMillis() is too slow…) I deliberately left the code unsynchronized in any way. Not because I didn’t know that volatile will fix everything, but because I wasn’t clear on how and why it will fix things and if anything was even broken. I got a lot of replies to the post and although everyone was sure that the answer is to use volatile (or atomics), not many could explain what will really happen if we don’t use them and what impact on performance do they have if any.

First thing most people don’t realize is that reading/writing is atomic on all primitive values, except long and double (this includes references no matter whether they are 32 or 64 bit). This is a requirement in the JVM spec and is independent of volatile flag. Volatile additionally imposes atomicity on long and double value reads/writes. See JVM spec section 17.7 for details.

If the reading/writing of the value is atomic (which it is on ints that I used in the end), then no thread can ever see corrupted data. The only thing we have to deal with is inconsistency. The reason why I even introduced this “error” to begin with is that consistency is not that important for the particular problem as long as it’s not permanently inconsistent. By CPU measurement half a second is a lot of time, so if I can assume that the the data is synced at that horizon, I can safely leave this code in.

The JVM memory model specification stipulates that the JVM/JIT is free to keep a thread-local copy of any values that are not declared as volatile and is only obliged to sync them when entering/exiting a monitor (a monitor is taken when entering a synchronized block and released in the end). This is needed above all to allow for the aggressive memory caching in multicore processors, which is more likely to be present on high-end server hardware (like Sun Sparc) than on desktop Intel/AMD chips.

Hardware keeps the memory values in the L2 cache, which is bound to be flushed relatively regularly, so I’m still not entirely convinced that this would be a real problem even on high-end servers. However Heinz Kabutz has demonstrated that in some cases JVM will cache the values even on a single core processor in the Java Specialists newsletter issue 159. Although the caching seems to be quite limited and we couldn’t reproduce it on the problem at hand it does demonstrate that permanent inconsistency is indeed a possibility one has to consider.

Interestingly enough if we only have to worry about hardware caching, we can use a different trick to ensure the consistency among threads. This was proposed by Kirk Pepperdine and neither he nor I are entirely sure if this will work on all JVM implementations/hardware platforms. However the same trick seems to be used in Cliff Clicks high-scale-lib, and he is one of the few people who really does understand the non-blocking concurrency. Before he confirms it though, I have to consider it just another crazy idea.

The trick is that if reading a volatile will flush the whole cache, who said we have to read the same volatile? We just declare a different variable volatile just for the purpose of flushing the cache and leave the original one still non-volatile:
[java]
public class HeartBeatThread extends Thread {
public static int counter = 0;
public static volatile int cacheFlush = 0;

public HeartBeatThread() {
setDaemon(true);
}

static {
new HeartBeatThread().start();
}

public void run() {
while (true) {
try {
Thread.sleep(500);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}

counter++;
cacheFlush++;
}
}
}
[/java]

To be fair this is not the code that I’m going to put in JavaRebel. Now that I understand the issues behind not using a volatile well enough, I will just slap one on and be content. My tests have shown that making a field volatile does not impose any noticeable overhead at least on the microbenchmarks I was using. I can just hope that this post has perhaps helped you to understand some of the issues better as well.

P.S. To everyone who was suggesting to use AtomicLong/AtomicInteger — if you check the implementation of those classes the field they wrap is declared as volatile, so there is no possible benefit in using atomics instead of volatile values unless you need compareAndSwap() (considering that reads/writes to volatile long are anyway atomic).


27
Oct 08

When System.currentTimeMillis() is too slow…

At the moment I am working on reducing the performance overhead of JavaRebel. Once I got rid of all the obvious bottlenecks and optimizations I fired up the profiler and started searching and destroying ad-hoc hotspots. After some time I got to the point, when the bottleneck was in System.currentTimeMillis().

This deserves some explanation. Since JavaRebel reloads code after the class changes, it needs to poll the actual class file for changes. The way we do it is quite lazy and the class itself causes the resource to be polled. However often there will be a lot of polling involved (more or less any method call on the class will cause a check), so we need to filter those calls. What we did is to check if some amount of time (e.g. 500 ms) has passed since the last check:

if (lastCheck + 500  > System.currentTimeMillis())
  return;
lastCheck = System.currentTimeMillis();
// Resource check code

After some optimization everything else got fast enough that this call started to weigh the performance down. System.currentTimeMillis() causes a system call and often a hardware poll, which is comparatively expensive. I needed some way to eliminate it.

After some thought I decided to cache the call. It may sound crazy to cache the current time, but I only needed the resolution of about half a second, so it was OK for me if the time was lagging behind about that much. What I did is put up a separate thread that would cache the current time and update it every half a sec.

public class TimeCacheThread extends Thread {
  public static volatile long currentTimeMillis = 
    System.currentTimeMillis();
 
  public TimeCacheThread() { 
    setDaemon(true);
  }
 
  static {
    new TimeCacheThread().start();
  }
 
  public void run() {    
    while (true) {
      currentTimeMillis = System.currentTimeMillis();
 
      try {
        Thread.sleep(500);
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
    }
  }
}

Then the check became:

if (lastCheck + 500  > TimeCacheThread.currentTimeMillis)
  return;
lastCheck = TimeCacheThread.currentTimeMillis;
// Resource check code

This check is amazingly fast, because instead of doing a system call we are just making a memory read.

[EDIT] The original code did not have the field declared volatile with the following explanation:

You may also notice the complete lack of any synchronization (or volatile keyword) on the currentTimeMillis static field. Since the type is primitive we can only deal with stale data here and I cared about performance more than I cared about missing a couple of checks.

There is a follow-up post (What do we really know about non-blocking concurrency in Java?) that explains some more of the reasoning behind initially omitting the volatile and then still putting it in. [/EDIT]

However once I got that far I decided I could do even better. The currentTimeMillis is of type long and we are still doing some arithmetic on it. In fact the only thing we want to know is if a quantum of time has passed since the last check. And we can do that by using a very simple heartbeat counter:

public class HeartBeatThread extends Thread {  
  public static volatile int counter = 0;  
 
  public HeartBeatThread() {
    setDaemon(true);
  }
 
  static {
    new HeartBeatThread().start();
  }
 
  public void run() {    
    while (true) {      
      try {
        Thread.sleep(500);
      } catch (InterruptedException e) {
        throw new RuntimeException(e);
      }
 
      counter++;
    }
  }
}

Now the checking code becomes just:

if (counter == HeartBeatThread.counter)
  return;
counter = HeartBeatThread.counter;

The counter field is now of int type and we only do an equals check on a DWORD, which should translate directly to 3 very fast hardware instructions (two memory reads and one conditional jump). Hopefully if you ever find yourself polling something a lot this post will come of help. As for me, it improved JavaRebel performance by 70% on some benchmarks.


8
Oct 08

SpringSource is not the devil?

SpringSource is not the devil? — apparently the community outcry was strong enough to revoke the most outrageous part of the new policy: now the latest stable branch will have maintenance releases even after the three month stoppoint. The funny part is that now the “three months” don’t have any value whatsoever, as the space between Spring branch releases is at least half a year. I get that SpringSource needs to shake the money out of the enterprise, but if JBoss somehow managed to do that without resorting to the outright extortion, so should they.