Posts Tagged ‘performance’
The Performance Cutoff
Written by Jevgeni Kabanov on November 10, 2008 – 12:36 pmA few days ago I had a small epiphany on a simple yet important issue. I was trying to squeeze those last few percents of performance out of JavaRebel and it came to the point where I started optimizing individual hotspots method-by-method.
One of the most common ways to improve execution time of a specific method is the cutoff. For many complicated enough methods there are some inputs for which you can return immediately and cut off the main execution path. To explain let me use the following example:
-
Output doSomething(Input input) {
-
// Do something for 200 ms
-
return output;
-
}
This method could do anything as long as the complexity is constant, we only care how much time it takes on average. I could also have taken some algorithm (e.g. I started with matrix multiplication), but then we’d have to bring in complexity estimates and I’d like to keep it simple for now.
Let’s assume that for some inputs we could calculate the output in 50ms instead of 200ms. We introduce a check and the code is now:
-
Output doSomething(Input input) {
-
if (isSimpleInput(input)) {
-
// Do something for 50 ms
-
return output;
-
}
-
-
// Do something for 200 ms
-
return output;
-
}
This is a common way to optimize the method execution time. However the question is if this actually optimized anything? It seems like a stupid question, but let’s estimate the new average execution time.
To do that we need to know two things. The time t it takes to do the isSimpleInput() check and the proportion p of method calls that are “simple”. Let’s assume that t is 20 ms and 10% of the calls are simple. The average execution time can then be calculated using the expected value formula (EV = p * v1 + (1 – p) * v2):
EV = 0.1 * (20 + 50) + (1 – 0.1) * (20 + 200)
= 20 + 0.1 * 50 + 0.9 * 200
= 20 + 5 + 180 = 205
This calculation shows that with this values we have actually increased the average method execution time by 5 ms. Note that we are still winning every time the cutoff occurs (20 + 50 << 200), but we are losing on the average.
The same equation can also be applied to measuring several checks or measuring execution time that is dependent on the input, however it gets increasingly more complicated. For me the main value of this is realizing that every cutoff has an inherent cost which I will now pay in mind.
Before you think this is all just math, the epiphany that came to me was not to add a check, it was to remove one. I realized that a particular check I’m making is rare and expensive enough to directly influence the method execution time and lo and behold, removing it gave a 30% improvement for the benchmark I was using.
Tags: java, performance
Posted in Featured, creative | No Comments »
What do we really know about non-blocking concurrency in Java?
Written by Jevgeni Kabanov on October 28, 2008 – 3:22 pmIf my yesterday’s post taught me anything it’s this: not that much. This post is a result of a lengthy chat with Heinz Kabutz and Kirk Pepperdine (who really do know something about non-blocking concurrency) as well as comments on the post and the Reddit entry.
When I put together the yesterday’s post (When System.currentTimeMillis() is too slow…) I deliberately left the code unsynchronized in any way. Not because I didn’t know that volatile will fix everything, but because I wasn’t clear on how and why it will fix things and if anything was even broken. I got a lot of replies to the post and although everyone was sure that the answer is to use volatile (or atomics), not many could explain what will really happen if we don’t use them and what impact on performance do they have if any.
First thing most people don’t realize is that reading/writing is atomic on all primitive values, except long and double (this includes references no matter whether they are 32 or 64 bit). This is a requirement in the JVM spec and is independent of volatile flag. Volatile additionally imposes atomicity on long and double value reads/writes. See JVM spec section 17.7 for details.
If the reading/writing of the value is atomic (which it is on ints that I used in the end), then no thread can ever see corrupted data. The only thing we have to deal with is inconsistency. The reason why I even introduced this “error” to begin with is that consistency is not that important for the particular problem as long as it’s not permanently inconsistent. By CPU measurement half a second is a lot of time, so if I can assume that the the data is synced at that horizon, I can safely leave this code in.
The JVM memory model specification stipulates that the JVM/JIT is free to keep a thread-local copy of any values that are not declared as volatile and is only obliged to sync them when entering/exiting a monitor (a monitor is taken when entering a synchronized block and released in the end). This is needed above all to allow for the aggressive memory caching in multicore processors, which is more likely to be present on high-end server hardware (like Sun Sparc) than on desktop Intel/AMD chips.
Hardware keeps the memory values in the L2 cache, which is bound to be flushed relatively regularly, so I’m still not entirely convinced that this would be a real problem even on high-end servers. However Heinz Kabutz has demonstrated that in some cases JVM will cache the values even on a single core processor in the Java Specialists newsletter issue 159. Although the caching seems to be quite limited and we couldn’t reproduce it on the problem at hand it does demonstrate that permanent inconsistency is indeed a possibility one has to consider.
Interestingly enough if we only have to worry about hardware caching, we can use a different trick to ensure the consistency among threads. This was proposed by Kirk Pepperdine and neither he nor I are entirely sure if this will work on all JVM implementations/hardware platforms. However the same trick seems to be used in Cliff Clicks high-scale-lib, and he is one of the few people who really does understand the non-blocking concurrency. Before he confirms it though, I have to consider it just another crazy idea.
The trick is that if reading a volatile will flush the whole cache, who said we have to read the same volatile? We just declare a different variable volatile just for the purpose of flushing the cache and leave the original one still non-volatile:
-
public static int counter = 0;
-
public static volatile int cacheFlush = 0;
-
-
public HeartBeatThread() {
-
setDaemon(true);
-
}
-
-
static {
-
new HeartBeatThread().start();
-
}
-
-
public void run() {
-
while (true) {
-
try {
-
}
-
-
counter++;
-
cacheFlush++;
-
}
-
}
-
}
To be fair this is not the code that I’m going to put in JavaRebel. Now that I understand the issues behind not using a volatile well enough, I will just slap one on and be content. My tests have shown that making a field volatile does not impose any noticeable overhead at least on the microbenchmarks I was using. I can just hope that this post has perhaps helped you to understand some of the issues better as well.
P.S. To everyone who was suggesting to use AtomicLong/AtomicInteger — if you check the implementation of those classes the field they wrap is declared as volatile, so there is no possible benefit in using atomics instead of volatile values unless you need compareAndSwap() (considering that reads/writes to volatile long are anyway atomic).
Tags: concurrency, java, performance
Posted in creative | 9 Comments »