Context Is King

Don’t forget to check the context of a particular performance benchmark, especially a comparison between X and Y tasks. Just because your test reveals that X is faster than Y doesn’t mean that the conclusion “X is faster than Y” is actually relevant.

For example, let’s say a performance test reveals that X runs 10,000,000 operations per second, and Y runs at 8,000,000 operations per second. You could claim that Y is 20% slower than X, and you’d be mathematically correct, but your assertion doesn’t hold as much water as you’d think.

Let’s think about the results more critically: 10,000,000 operations per second is 10,000 operations per millisecond, and 10 operations per microsecond. In other words, a single operation takes 0.1 microseconds, or 100 nanoseconds. It’s hard to fathom just how small 100ns is, but for comparison, it’s often cited that the human eye isn’t generally capable of distinguishing anything less than 100ms, which is one million times slower than the 100ns speed of the X operation.
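The unit conversion above is easy to check mechanically. Here is a minimal sketch; the helper name nsPerOp and the variable names are made up for illustration and are not part of any library.

```javascript
// Hypothetical helper: convert a throughput figure (operations per second)
// into the time a single operation takes, in nanoseconds.
function nsPerOp(opsPerSecond) {
  var NS_PER_SECOND = 1e9;
  return NS_PER_SECOND / opsPerSecond;
}

var xNs = nsPerOp(10000000);      // 100 (nanoseconds per operation for X)

// The often-cited 100ms perception threshold, expressed in nanoseconds.
var thresholdNs = 100 * 1e6;

// How many X operations fit inside that threshold.
var ratio = thresholdNs / xNs;    // 1000000

console.log(xNs, ratio);
```

The same helper works for any throughput number a benchmark reports, which makes it quick to turn an "operations per second" headline into a per-operation time.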

Even recent scientific studies suggesting that the brain may be able to process things as quickly as 13ms (about 8x faster than previously asserted) would mean that X is still running 125,000 times faster than the human brain can perceive a distinct thing happening. X is going really, really fast.

But more importantly, let’s talk about the difference between X and Y, the 2,000,000 operations per second difference. If X takes 100ns, and Y (at 8,000,000 operations per second) takes 125ns, the difference is 25ns, which in the best case is still one 520-thousandth of the interval the human brain can perceive.

What’s my point? None of this performance difference matters, at all!

But wait, what if this operation is going to happen a whole bunch of times in a row? Then the difference could add up, right?

OK, so what we’re asking then is, how likely is it that operation X is going to be run over and over again, one right after the other, and that this has to happen 520,000 times just to get a sliver of a hope the human brain could perceive it? More likely, it’d have to happen 4,000,000 to 10,000,000 times together in a tight loop to even approach relevance.

While the computer scientist in you might protest that this is possible, the louder voice of realism in you should sanity check just how likely or unlikely that really is. Even if it is relevant on rare occasions, it’s irrelevant in most situations.

The vast majority of your benchmark results on tiny operations — like the ++x vs x++ myth — are just totally bogus for supporting the conclusion that X should be favored over Y on a performance basis.

Engine Optimizations

You simply cannot reliably extrapolate that because X was 10 microseconds faster than Y in your isolated test, X is always faster than Y and should always be used. That’s not how performance works. It’s vastly more complicated.

For example, let’s imagine (purely hypothetical) that you test some microperformance behavior such as comparing:

  var twelve = "12";
  var foo = "foo";

  // test 1
  var X1 = parseInt( twelve );
  var X2 = parseInt( foo );

  // test 2
  var Y1 = Number( twelve );
  var Y2 = Number( foo );

If you understand what parseInt(..) does compared to Number(..), you might intuit that parseInt(..) potentially has “more work” to do, especially in the foo case. Or you might intuit that they should have the same amount of work to do in the foo case, as both should be able to stop at the first character "f".
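For reference, here is how the two functions treat a few inputs. The "12px" string is an extra input that is not part of the test above; it is added only because it shows where the two differ: parseInt(..) reads leading digits and stops at the first character it cannot use, while Number(..) requires the whole string to be numeric. The explicit radix of 10 is a common defensive habit and is likewise an addition here.

```javascript
var twelve = "12";
var foo = "foo";
var mixed = "12px"; // extra illustrative input, not part of the test above

console.log( parseInt( twelve, 10 ) );  // 12
console.log( Number( twelve ) );        // 12

console.log( parseInt( foo, 10 ) );     // NaN
console.log( Number( foo ) );           // NaN

console.log( parseInt( mixed, 10 ) );   // 12
console.log( Number( mixed ) );         // NaN
```

Note that this only shows what the functions return. It says nothing about how much work an engine does to get there, which is the question the intuitions above are guessing at.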

Which intuition is correct? I honestly don’t know. But I’ll make the case it doesn’t matter what your intuition is. What might the results be when you test it? Again, I’m making up a pure hypothetical here; I haven’t actually tried it, nor do I care.

Let’s pretend the test comes back that X and Y are statistically identical. Have you then confirmed your intuition about the "f" character thing? Nope.

It’s possible in our hypothetical that the engine might recognize that the variables twelve and foo are only being used in one place in each test, and so it might decide to inline those values. Then it may realize that Number( "12" ) can just be replaced by 12. And maybe it comes to the same conclusion with parseInt(..), or maybe not.

Or an engine’s dead-code removal heuristic could kick in, and it could realize that the variables X1, X2, Y1, and Y2 aren’t being used, so declaring them is irrelevant, so it doesn’t end up doing anything at all in either test.
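To make the dead-code concern concrete, here is a hedged sketch of two loops that look like they do the same work. The function names are hypothetical, and whether any given engine actually removes the first loop's work is an assumption, not something this code can show.

```javascript
var twelve = "12";

// The results are never read, so an optimizing engine is free
// (but not obliged) to skip the conversions entirely.
function resultsDiscarded(iterations) {
  for (var i = 0; i < iterations; i++) {
    var unused = Number( twelve );
  }
}

// The results feed into the return value, so the work is observable
// and the engine has less room to throw it away.
function resultsConsumed(iterations) {
  var total = 0;
  for (var i = 0; i < iterations; i++) {
    total += Number( twelve );
  }
  return total;
}

resultsDiscarded( 1000 );
console.log( resultsConsumed( 1000 ) ); // 12000
```

Even the second version is no guarantee: an engine could still fold the constant input into a precomputed value. It only narrows the set of shortcuts available.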

And all of that is just reasoning from assumptions about a single test run. Modern engines are fantastically more complicated than what we’re intuiting here. They do all sorts of tricks, like tracing and tracking how a piece of code behaves over a short period of time, or with a particularly constrained set of inputs.

What if the engine optimizes a certain way because of the fixed input, but in your real program you give more varied input and the optimization decisions shake out differently (or not at all!)? Or what if the engine kicks in optimizations because it sees the code being run tens of thousands of times by the benchmarking utility, but in your real program it will only run a hundred times in near proximity, and under those conditions the engine determines the optimizations are not worth it?
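The two scenarios in that paragraph can be set side by side. This is only a sketch under stated assumptions: timeIt is a hypothetical helper, the input lists and iteration counts are arbitrary, and the timings it prints will vary by engine, machine, and run.

```javascript
// Use performance.now() where available, falling back to Date.now().
var now = (typeof performance !== "undefined" && typeof performance.now === "function")
  ? function () { return performance.now(); }
  : function () { return Date.now(); };

// Hypothetical helper: call fn on each input in turn, `iterations` times in total.
function timeIt(fn, inputs, iterations) {
  var sink = 0; // accumulate the results so they are actually used
  var start = now();
  for (var i = 0; i < iterations; i++) {
    sink += fn( inputs[i % inputs.length] ) || 0; // NaN counts as 0
  }
  return { elapsedMs: now() - start, sink: sink };
}

// Benchmark-style run: one fixed input, many iterations.
var benchLike = timeIt( Number, [ "12" ], 100000 );

// Closer to a real program: varied inputs, only a hundred calls.
var appLike = timeIt( Number, [ "12", "foo", "3.5", " 42 ", "" ], 100 );

console.log( "per call (fixed, 100000 runs):", benchLike.elapsedMs / 100000, "ms" );
console.log( "per call (varied, 100 runs):", appLike.elapsedMs / 100, "ms" );
```

The per-call figures may or may not differ, and a hundred calls is likely below the useful resolution of the timer. That uncertainty is the point: the number from the first run does not tell you what the second situation costs.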

And all those optimizations we just hypothesized about might happen in our constrained test but maybe the engine wouldn’t do them in a more complex program (for various reasons). Or it could be reversed — the engine might not optimize such trivial code but may be more inclined to optimize it more aggressively when the system is already more taxed by a more sophisticated program.

The point I’m trying to make is that you really don’t know for sure exactly what’s going on under the covers. All the guesses and hypotheses you can muster amount to hardly anything concrete for really making such decisions.

Does that mean you can’t really do any useful testing? Definitely not!

What this boils down to is that testing not real code gives you not real results. Insofar as is possible and practical, you should test actual, non-trivial snippets of your code, under the most realistic conditions you can manage. Only then will the results you get have a chance to approximate reality.

Microbenchmarks like ++x vs x++ are so incredibly likely to be bogus, we might as well just flatly assume them as such.