When System.currentTimeMillis() is too slow…
Written by Jevgeni Kabanov on October 27, 2008 – 2:09 pmAt the moment I am working on reducing the performance overhead of JavaRebel. Once I got rid of all the obvious bottlenecks and optimizations I fired up the profiler and started searching and destroying ad-hoc hotspots. After some time I got to the point, when the bottleneck was in System.currentTimeMillis().
This deserves some explanation. Since JavaRebel reloads code after the class changes, it needs to poll the actual class file for changes. The way we do it is quite lazy and the class itself causes the resource to be polled. However often there will be a lot of polling involved (more or less any method call on the class will cause a check), so we need to filter those calls. What we did is to check if some amount of time (e.g. 500 ms) has passed since the last check:
After some optimization everything else got fast enough that this call started to weigh the performance down. System.currentTimeMillis() causes a system call and often a hardware poll, which is comparatively expensive. I needed some way to eliminate it.
After some thought I decided to cache the call. It may sound crazy to cache the current time, but I only needed the resolution of about half a second, so it was OK for me if the time was lagging behind about that much. What I did is put up a separate thread that would cache the current time and update it every half a sec.
-
public static volatile long currentTimeMillis =
-
-
public TimeCacheThread() {
-
setDaemon(true);
-
}
-
-
static {
-
new TimeCacheThread().start();
-
}
-
-
public void run() {
-
while (true) {
-
-
try {
-
}
-
}
-
}
-
}
Then the check became:
-
if (lastCheck + 500 > TimeCacheThread.currentTimeMillis)
-
return;
-
lastCheck = TimeCacheThread.currentTimeMillis;
-
// Resource check code
This check is amazingly fast, because instead of doing a system call we are just making a memory read.
[edit] The original code did not have the field declared volatile with the following explanation:
You may also notice the complete lack of any synchronization (or volatile keyword) on the
currentTimeMillisstatic field. Since the type is primitive we can only deal with stale data here and I cared about performance more than I cared about missing a couple of checks.
There is a follow-up post (What do we really know about non-blocking concurrency in Java?) that explains some more of the reasoning behind initially omitting the volatile and then still putting it in. [/edit]
However once I got that far I decided I could do even better. The currentTimeMillis is of type long and we are still doing some arithmetic on it. In fact the only thing we want to know is if a quantum of time has passed since the last check. And we can do that by using a very simple heartbeat counter:
-
public static volatile int counter = 0;
-
-
public HeartBeatThread() {
-
setDaemon(true);
-
}
-
-
static {
-
new HeartBeatThread().start();
-
}
-
-
public void run() {
-
while (true) {
-
try {
-
}
-
-
counter++;
-
}
-
}
-
}
Now the checking code becomes just:
-
if (counter == HeartBeatThread.counter)
-
return;
-
counter = HeartBeatThread.counter;
The counter field is now of int type and we only do an equals check on a DWORD, which should translate directly to 3 very fast hardware instructions (two memory reads and one conditional jump). Hopefully if you ever find yourself polling something a lot this post will come of help. As for me, it improved JavaRebel performance by 70% on some benchmarks.
Tags: java, javarebel
Posted in Featured, creative | 54 Comments »
54 Comments to “When System.currentTimeMillis() is too slow…”
Leave a Comment
Additional comments powered by BackType
October 27th, 2008 at 2:47 pm
Nice writeup!
October 27th, 2008 at 3:25 pm
> I cared about performance more than I cared about missing a couple of checks.
You are not only risking missing a couple of checks, you are risking missing *all* of them, forever. The java memory model doesn’t require modifications made in one thread to *ever* be visible to an other thread if you don’t use volatile or synchronized.
If the JIT decides to cache the variable in a register, or if you use a multi-processor machine, it gets into the processor cache, or like that, you most probably will actually encounter this problem.
Use an AtomicInteger. That’s it all. Or use volatile. They don’t have any performance problems at all, except for those that you can’t avoid without losing correctness (which you did).
October 27th, 2008 at 3:44 pm
@Eugene
I was sure that someone will jump out and shout “Crazy!” at this. I know that volatile or Atomics will fix this for sure and I know about the lack of guarantees. However implementation-wise processor cache is flushed quite often and there’s not enough registers :) Although theoretically you are right, practically every implementation of Java will work.
October 27th, 2008 at 4:35 pm
Hi Jevgeni,
I just want to add to your solution: You can’t rely that your caching-thread is called every 500 ms. It is supposed to do but if the OS is on high load the thread may “sleeps” 600 ms or more. So your time-slices are not reliable.
Btw. nice solution ;)
-markus
October 27th, 2008 at 4:38 pm
Hi Jevgeni,
Very clever optimization. I would suggest that Eugene’s concerns are valid. What you are relying on is the chip level memory model to do the *right* thing. Volatile is quite quick. You may even want consider using CAS though I’m not sure that will buy you anything over volatile. Oh, it will obfuscate your code ;-)
Kirk
October 27th, 2008 at 4:45 pm
@Markus
I know that, and this isn’t really that important for this problem.
October 27th, 2008 at 4:47 pm
@Kirk
I’m not saying you convinced me, but you know way more about processor models than me, so I give you a very large benefit of doubt :) I’ll test if the volatile or CAS will impact performance much, and if it’s negligible I will of course put them in just to be on the safe side.
October 27th, 2008 at 4:51 pm
I think you overestimate how much time it would take to do this the right way. Have you actually profiled using an Atomic or a volatile flag? I mean, I think you are right that this should almost always work, but it is going to be fun times figuring out what happened if it ever stops, though. It would be silly to introduce incorrect and unpredictable code without justifying it first.
October 27th, 2008 at 4:56 pm
Your two assumptions about synchronisation are wrong.
1) Just because your accessing a long does not mean the setting of its value is atomic or that reading it mid way through will provide any form of consistent value. You either have as other have mentioned the potential to never get a read or get a totally corrupt read. Volatile/AtomicLong cost almost nothing use them.
2) That you can skip synchronisation on your final solution, you can’t it suffers similar problems to the caching.
Interesting solution, but this is very badly written concurrent code and it definitely won’t work on every operating system and processor.
However I am surprised that a function that can be called 308488 times a millisecond without any warm up is a performance bottleneck! I wrote a quick test program and this was its output:
Calls:308488450 for time 1000 millis
Calls per millisecond: 308488.45 ops/ms
And the code (this is going to format very poorly):
public class MilisTest {
/**
* @param args
*/
public static void main(String[] args) {
long current = System.currentTimeMillis();
long start = current;
long end=current+1000; //plus 1 second
long iterations = 0;
while(end>current) {
current = System.currentTimeMillis();
current = System.currentTimeMillis();
current = System.currentTimeMillis();
current = System.currentTimeMillis();
current = System.currentTimeMillis();
current = System.currentTimeMillis();
current = System.currentTimeMillis();
current = System.currentTimeMillis();
current = System.currentTimeMillis();
current = System.currentTimeMillis();
iterations+=10;
}
System.out.println(“Calls:” + iterations +” for time ” + (current-start) + ” millis”);
System.out.println(“Calls per millisecond: ” + callsPerMillisecond(iterations,current-start) + ” ops/ms”);
}
public static double callsPerMillisecond(long calls, long milliseconds) {
return (double)calls / (double)milliseconds;
}
}
October 27th, 2008 at 5:14 pm
See Java Concurrency in Practice. It’s an awesome book and addresses the long thread safety issue. IIRC (don’t have the book with me) volatile int is threadsafe, however volatile long isn’t, at least on 32 bit CPUs. Volatile just guarantees that every change to that variable will be seen by any other thread accessing that variable. However a long on a 32 bit cpu requires 2 writes, one for the high 32 bits and one for the low 32 bits.
As others have suggested, AtomicLong is the safe way to do this and has a very low impact on performance.
October 27th, 2008 at 5:42 pm
if (lastCheck + 500 > System.currentTimeMillis())
return;
lastCheck = System.currentTimeMillis();
// Resource check code
“System.currentTimeMillis() causes a system call and often a hardware poll, which is comparatively expensive. I needed some way to eliminate it.”
Eh, why call it twice, then?
now = System.currentTimeMillis();
if (lastCheck + 500 > now)
return;
lastCheck = now;
October 27th, 2008 at 5:49 pm
The cost hit in the usage of volatile is typically the write operation which can be twice the cost of a read.
The cost of read on a volatile field is nearly (1-2 nanosecond delta) equivalent to the read on a non-volatile field on most platforms but considering that this operation is executed less frequently in the last example I cannot see the reason for not using it.
An int is suffice for your needs as all you are concerned about is a delta detection.
The cost of System. currentTimeMillis() varies across platforms but typically costs 0.2 microseconds on Unix platforms and on MS Windows 1.0-2.0 ** microseconds ** with a very poor clock resolution of between 10-50 milliseconds in older Java runtimes.
William
October 27th, 2008 at 5:52 pm
Nice one Jevgeni!
I was most impressed by the clean, concise and direct code that you have presented. A real pleasure to read indeed (this after reading countless blogs with thoughtlessly complicated and inefficient spaghetti code reflecting highly illogical thought processes). Looking forward to reading your other posts. Cheers!
z0ltan.
October 27th, 2008 at 5:54 pm
By the way when you get down to this level of micro-tuning you should use a different technique for profiling that is not based on the wall clock time or cpu time.
I was actually writing a blog entry on this detailing how we profile our own code without profiling it the way most engineers do today which is using timers.
William
October 27th, 2008 at 5:57 pm
@Paul
The benchmark was making tens of millions calls a second. Welcome to the unreal world :)
You are right about long, but it’s an int in the final solution, which is atomic on all hardware I’m aware of.
October 27th, 2008 at 5:59 pm
@Chad
The second call isn’t important as it happends only twice a second.
October 27th, 2008 at 6:01 pm
@William
Thanks for the info! It will be very interesting to see the blog entry. Kirk recommended me to use hardware-specific profilers like VTune at that level.
October 27th, 2008 at 6:04 pm
There are two ways to solve every performance bottleneck:
1) Improve the performance of the function your calling with a better algorithm.
2) Call the function less.
Since the accuracy you are looking for is 500 milliseconds why not just make a call in the code somewhere (not necessarily here) to Thread.sleep or user a Timer and run only on a more reasonable schedule. Its not like the users of JavaRebel honestly expect an instant change in their code, the loading the classes probably takes longer than anything else when there is actually a change.
This just feels like the wrong thing to optimise as calling something that will change just 500 times in 500ms 10+ million times feels like a (2) style solution not a (1).
October 27th, 2008 at 6:17 pm
@Paul
First of all I am trying to use the function less, that’s what I’m building here — a filter :)
Second of all the problem is that for different reasons we can use very little actual synchronization in JavaRebel. Therefore a separate notification thread (Timer) will cause more issues than it solves.
October 27th, 2008 at 7:18 pm
Hardware counters sounds like a very hardware specific solution. Counters yes but not hardware counters unless you are targeting a specific platform (very non-Java like) or writing a runtime yourself. Counters without context are meaningless.
We have a number of low level probe meter extensions (weaved at loadtime) that count the number of field read/writes and method calls/executions for critical call paths in our probes runtime. These are then mapped as resource meters are metered by instrumented probes.
William
October 27th, 2008 at 8:35 pm
@William
Hmm, doesn’t sound half as exciting anymore. I count the method calls while profiling anyway.
Also performance doesn’t abstract well, so if you’re working on something that is bounded by CPU, not IO, you have to got to the hardware level.
October 27th, 2008 at 8:40 pm
@William,
The danger here is this is a micro-benchmark. As such I think you need to do a lot of things to validate (such as using the debug JVM to dump assembler generated by the interpreter and JIT).
As good as your stuff is, and I do believe it’s quite good, at this level I think you should be looking at hardware based counters. If this is what your probes weave in… pretty cool.
For anyone that believes you can’t write thread safe code without the use of the synchronized keyword, you should take a good look at Cliff Click’s non blocking concurrent algorithms. The don’t use synchronization and they are perfectly thread safe. The collections post date Brian’s book or I’m sure they’d be explained in there.
Also, as Jevgeni has mentioned, he’s using int’s so he’s not subject to word shearing on platforms that require a double fetch for a 64 bit data structure.
Kirk
October 27th, 2008 at 8:49 pm
Oh but it is. You just cannot see the power and simplicity of a multi-meter model that is activity based with a associated cost model that allows you to create a solution that is applicable at any level of abstraction (machine, cluster, cloud).
Ah well back to the release cycle maybe some day others might get it and see beyond simple system metrics.
October 27th, 2008 at 9:03 pm
@William
Not saying that it won’t be a great tool for other people, but it’s not of great use to me. I need low-level solutions.
October 27th, 2008 at 9:06 pm
Kirk the first book I read on algorithms introduced an abstract machine instruction set with each instruction having a corresponding execution cost. This is a proven technique and has allowed me to be productive during microtuning sessions (which is rare for most developers). For nearly each line I have a unit cost along with a probability weighting that is parameterised based on execution profiles from in the wild.
October 27th, 2008 at 10:05 pm
Jevgeni “I need low-level solutions.:
You need low level solutions that go beyond wait times, wait counts, block counts, block times, cpu time, high res wall clock times, field set counts, field write counts (volatile or not), object alloc counts, object alloc sizes, call counts, execution counts,…….
What are you building and how are you going to make sense of such low level hardware counters that for the large part are processor specific (not necessarily thread)?
Such information is only useful when it can be combined with existing knowledge (application execution patterns). There has to be an explicit cause-effect chain for a developer. How does a developer go from a line of code in Java and see the chain and understand the effects in has on various low level counters and also understand the relationship whether it do good and bad and what side effects any proposed change has on other counters or aspects of the execution. This is like asking an operations person with hardly any Java knowledge to change some deployment options in a J2EE application to reduce monitor contention.
Before we get to hardware counters we need to get the JVM vendors to expose some additional low level counters that have a much more comprehensible traceability to the code being executed.
October 27th, 2008 at 10:16 pm
@William
Yes.
October 27th, 2008 at 10:17 pm
Why?
October 27th, 2008 at 10:28 pm
The problem is with the JIT, the unit of instruction, the byte code, isn’t necessarily cost constant. That said, it’s cost constant often enough that a constant cost model would be helpful. That said, there are far less complex ways of achieving the same thing. And before you slam using simple system metrics you might want to consider that these counts are the only ones that don’t lie! This is become especially more true now that CPUs are internalizing thread scheduling. This not only blows away the profilers that you’ve been rallying against, it blows away a number of other techniques we’d use to measure. Maybe yours would survive, I don’t know. I do know that hardware counts work with little complications.
Regards,
Kirk
October 27th, 2008 at 10:36 pm
@William
Because I have those metrics and they are not telling me everything I need to know. I need to optimize on the level of individual real machine instructions (and not high-level bytecode ones). Hardware becomes important unfortunately. As I said, your tool may be useful to others, my situation is quite unique. There are no universal tools.
October 27th, 2008 at 10:57 pm
“I need to optimize on the level of individual real machine instructions (and not high-level bytecode ones).”
Please provide me with a real use case for this assuming we are still talking about Java here. I have developed the fastest Java profiling solution on the market that out performs the competition by 20x-100x and I never needed to drop to this level so what exactly are you building and how are you going to manage and maintain this across platforms whilst still keeping an eye on the big picture which I think you have lost sight of a long time ago.
“There are no universal tools”.
You are looking from the wrong end. There are universal processes and models that have unlimited extensibility that can be supported by one tool. I did say multi-metering model. Meters can be mapped to any counter even those that are not tied specifically to a thread but what is the point unless one can use this information and take corrective action in a managed way (you know what you are doing and why it works so).
Lets cut to the chase and provide me with a concrete example to work with and then lets see how we fair.
October 27th, 2008 at 11:05 pm
@William
OK, I needed to know recently which of the virtual method calls are inlined/inline cached and which are not.
October 27th, 2008 at 11:17 pm
“And before you slam using simple system metrics you might want to consider that these counts are the only ones that don’t lie!”
Lie. Did I say that?
System metrics are useful when you have not got the time or tool to understand both the software and system execution models. But do they tell you the complete story? No. Even less than a readers digest.
Without the ability to see the activity and relate (multi-)resource usage to such activities you just see green symbols on a black background falling down some computer screen instead of the complete picture made up of actors, actions, places, events….
There must be causality / attribution and its understanding.
If a hardware counter is not naturally related to an activity (thread, transaction, workflow, operations, request, method invocation) then one must use statistical correlation across various system metrics and even then you still need to expose the activity to the correlation engine.
By the way our meters can be mapped to any counter global or thread specific , native or non-native. You can even map a meter to an actually metering of another meter in the context of a particular activity. But there must be a point……
October 27th, 2008 at 11:24 pm
This is one of the VM counters I requested.
I pushed this with the BEA JRockit years ago and Henrik and team refused to make this and other counters available even though I was able to see some of these counters.
At JavaOne I also talked to Cliff @ Azul about exposing more of the metrics. We did recently get 6 new counters from the Azul team which we support natively on their platform. Kirk I believe you were present when I asked for them. Strange that you are not talking about them when at the time it seemed a hard sell for me to those present.
October 27th, 2008 at 11:32 pm
@William
Which comes to the effect that you don’t support it at the moment? Even then I’d prefer the hardware stats, as logical counters can lie.
October 27th, 2008 at 11:49 pm
We do not support counters that we cannot see (legally) and a counter is only useful in the context of an activity. I am assuming we are talking about a production system here that does not have one single activity or thread.
I am all for more counters – the more the merrier. If they add measurable value at an acceptable cost which is what most of the VM engineers will reply to such requests.
There is no point in monitoring something unless you can change it. I cannot see how you can does this effectively from were you are standing and expect it to remain constant and manageable.
I will leave you to chop down the trees trying to see the forest.
October 28th, 2008 at 12:11 am
Once the JVM has warmed the time taken to call currentTimeMillis() takes about 50 nano-seconds.
While this is an interesting idea, there is probably other ways to avoid busy waiting and a scheduling tasks instead.
October 28th, 2008 at 12:26 am
Maybe the problem is that the lazy polling is done every time the class is accessed. Instead of checking to see if a timer has expired you could just set a volatile boolean flag indicating that the class has been accessed. Then your background thread could wake up periodically and check to see which flags were set. If a flag is set, check to see if the corresponding class file has changed and reset the flag. Setting the flag would be efficient and wouldn’t involve any conditionals.
October 28th, 2008 at 12:41 am
@Peter
Depends on the hardware and OS. It’s about 2 microseconds on my machine.
@Derek
Yes, I considered this. However consider also that classes are changed by the programmer in between requests, when the methods aren’t called at all. This approach guarantees that on next request after the pause the updates will be propagated lazily, yet immediately *before* the next method call.
October 28th, 2008 at 1:03 am
I do think it’s quite important for your example code to be correct, as someone is going to see this and think it’s the right way to do things.
If there is no synchronization, no atomic reference, and no volatile declaration of this field that you’re sharing across threads, then there’s no reason for you to assume that values from it will be visible to different threads.
I’ve seen this same mechanism used in an application server as the main way it tracked time throughout itself. I discovered it because the thread would crash under some circumstance. When the thread crashed, time stopped for the app server and all kinds of havoc was wreaked.
Any code you share will be someone else’s reference material. If it has a bug in it, I don’t think it’s too much to ask for you to save a generation of concurrent programs and make a small change.
October 28th, 2008 at 1:05 am
> Depends on the hardware and OS. It’s about 2 microseconds on my machine.
It also depends on the version of JVM you have and how you test the timing. Try running this.
{code}
static int RUNS = 10 * 1000 * 1000;
public static void main(String… args) {
doTest(false);
doTest(true);
}
private static void doTest(boolean print) {
long start = System.nanoTime();
// total is used to ensure the loop is not optimised to nothing.
long total = 0;
for (int i = 0; i < RUNS; i++)
total += System.currentTimeMillis();
long average = (System.nanoTime() – start) / RUNS;
if (print)
System.out.println(“average= ” + average + ” ns, total=” + total);
}
{code}
October 28th, 2008 at 1:50 am
@Dustin
No worries, I will update this post tomorrow, when I post the follow up.
@Peter
After I implemented this optimization I got a 70% improvement on my generic benchmark. I think it doesn’t matter how much time does it take, it was clearly a bottleneck for me. Of course for 99.9% of uses it is fast enough.
October 28th, 2008 at 2:11 am
@Peter
Tried your code. It is about 50 ns on my machine with JVM 1.6u10. However replacing it with the TimeCacheThread brings it down to 3 ns and the HeartBeatThread makes it ~1 ns. That is 50 times faster than the usual version and the truth is that it did influence my code.
October 28th, 2008 at 2:17 am
Why not use a volatile boolean in your timerclass?
private volatile boolean ping = true;
public boolean changed() {
boolean pong = ping;
ping = false;
return pong;
}
public ThreadThingy() {
while (true) {
TimeUnit.MILLISECOND.sleep(500);
ping = true;
}
}
October 28th, 2008 at 2:20 am
That is … if it’s acceptable that one test is skipped every now and never.
Doesn’t pre work?
October 28th, 2008 at 2:24 am
@Pete
1) You for some reason assume there is one client of the ThreadThingy, there can be arbitrary many.
2) There are no advantages to using a boolean over using an int performance-wise. However with the boolean you might skip the update.
October 28th, 2008 at 2:31 am
Ah, I see. Didn’t get point 1 till now.
October 28th, 2008 at 7:14 am
@William
You get green symbols unless you use something like VTune or AMD’s equiv. There you get Java aware profiling supported by counters implemented in the hardware.
Regards,
Kirk
October 28th, 2008 at 10:05 am
I am sure I am missing the point but you can do this for any number of tasks and still be performant.
int PROCESSORS = ManagementFactory.getOperatingSystemMXBean().getAvailableProcessors();
ScheduledExecutorService ses = Executors.newScheduledThreadPool(PROCESSORS);
// or
ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
// any number of Runnable can be added.
ses.schedule(command, 500, TimeUnit.MILLISECONDS);
ses.scheduleAtFixedRate(command, 500, 500, TimeUnit.MILLISECONDS);
October 28th, 2008 at 12:06 pm
>I pushed this with the BEA JRockit years ago
>and Henrik and team refused to make this and
>other counters available even though I was
>able to see some of these counters.
Will – this is not an entirely correct statement as you well know. We didn’t want to expose the proprietary counters since the current API was experimental and many of the counters were untested (and didn’t work!). We didn’t want 3rd parties relying on them since that would put us in a situation where we couldn’t change our API without breaking your tools. I told you as much at the time.
Jevgeni – You may want to use nanotime when timing instead of currenttimemillis. It is very cheap on Windows at least.
For really low-level profiling, you may want to look at platform dependent tools like Intel VTune or AMD’s CodeAnalyst. Many of them have Java support.
And if you happen to be an Oracle customer, you can look at JRockit Mission Control. Doesn’t have wall-clock method profiling, but a lot of other goodies and it piggybacks on the JVM hotspot detector etc so no overhead from profiling.
Cheers — Henrik, JRockit team
October 28th, 2008 at 12:45 pm
“Will – this is not an entirely correct statement as you well know. We didn’t want to expose the proprietary counters since the current API was experimental and many of the counters were untested (and didn’t work!). We didn’t want 3rd parties relying on them since that would put us in a situation where we couldn’t change our API without breaking your tools. I told you as much at the time.”
What did I say differently. “We didn’t want” == “refused”.
Come on it does not take a genius to figure out a way to expose counters in a generic fashion and still allow you to add and remove across releases even fix ones that did not work. We do this ourselves today with all our counters and meters being dynamic and optional.
I asked for this a few times for this well before you even joined the BEA JRockit team when the team were looking for a tools architect in 2003.
William
October 29th, 2008 at 3:52 am
Before doing any optimization you should do a benchmark first, so at the end you will know the performance is actually improved
October 29th, 2008 at 12:26 pm
>What did I say differently. “We didn’t want”
> == “refused”.
We did not forbid you from using the data, nor is there anything in our license etc that would stop that. We only recommended against it and warned that it was unsupported and experimental (see my email to you dated Dec 4 2007 for reeferance). Your choice of words here seems to imply something else, but if that’s not the case then we are in agreement.
>Come on it does not take a genius to figure
>out a way to expose counters in a generic
>fashion and still allow you to add and remove
>across releases even fix ones that did not
>work. We do this ourselves today with all our
>counters and meters being dynamic and optional.
I’m glad to hear that you have find a simple and cheap way of doing this in your product. But even if the cost is low it still has to be prioritized against other features and it has not yet popped up to the top of that list for us.
Henrik
May 14th, 2009 at 11:45 pm
[...] de ZeroTurnaround – Le blog de Toomas et de Jevgeni – Jevgeni sur Twitter – Toomas sur Twitter – Exemple d’article de Jevgeni sur la lenteur de System.currentTime… RSS Classé dans: Java Tags: [...]