Benchmarking

OK, time to start dispelling some misconceptions. I’d wager the vast majority of JS developers, if asked to benchmark the speed (execution time) of a certain operation, would initially go about it something like this:

  var start = (new Date()).getTime(); // or `Date.now()`

  // do some operation

  var end = (new Date()).getTime();

  console.log( "Duration:", (end - start) );

Raise your hand if that’s roughly what came to your mind. Yep, I thought so. There’s a lot wrong with this approach, but don’t feel bad; we’ve all been there.

What did that measurement tell you, exactly? Understanding what it does and doesn’t say about the execution time of the operation in question is key to learning how to appropriately benchmark performance in JavaScript.

If the duration reported is 0, you may be tempted to believe that it took less than a millisecond. But that’s not very accurate. Some platforms don’t have single-millisecond precision, but instead only update the timer in larger increments. For example, older versions of Windows (and thus IE) had only 15ms precision, which means the operation has to take at least that long for anything other than 0 to be reported!
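If you're curious about the timer granularity on a given platform, one rough (purely illustrative) way to probe it is to spin until the reported time actually changes. This is just a sketch, not part of any benchmarking API:

  // illustrative only: spin until `Date.now()` ticks over, to estimate the
  // smallest increment this platform's timer will ever report
  function timerResolution() {
      var t1 = Date.now(), t2;
      do {
          t2 = Date.now();
      } while (t2 === t1);
      return t2 - t1;     // in ms; larger values mean a coarser timer
  }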

Moreover, whatever duration is reported, the only thing you really know is that the operation took approximately that long on that exact single run. You have near-zero confidence that it will always run at that speed. You have no idea if the engine or system had some sort of interference at that exact moment, and that at other times the operation could run faster.

What if the duration reported is 4? Are you more sure it took about four milliseconds? Nope. It might have taken less time, and there may have been some other delay in getting either start or end timestamps.

More troublingly, you also don’t know that the circumstances of this operation test aren’t overly optimistic. It’s possible that the JS engine figured out a way to optimize your isolated test case, but in a more real program such optimization would be diluted or impossible, such that the operation would run slower than your test.

So… what do we know? Unfortunately, with those realizations stated, we know very little. Something of such low confidence isn’t even remotely good enough to build your determinations on. Your “benchmark” is basically useless. And worse, it’s dangerous in that it implies false confidence, not just to you but also to others who don’t think critically about the conditions that led to those results.

Repetition

“OK,” you now say, “Just put a loop around it so the whole test takes longer.” If you repeat an operation 100 times, and that whole loop reportedly takes a total of 137ms, then you can just divide by 100 and get an average duration of 1.37ms for each operation, right?

Well, not exactly.

A straight mathematical average by itself is definitely not sufficient for making judgments about performance which you plan to extrapolate to the breadth of your entire application. With a hundred iterations, even a couple of outliers (high or low) can skew the average, and then when you apply that conclusion repeatedly, you even further inflate the skew beyond credulity.

Instead of just running for a fixed number of iterations, you can instead choose to run the loop of tests until a certain amount of time has passed. That might be more reliable, but how do you decide how long to run? You might guess that it should be some multiple of how long your operation should take to run once. Wrong.
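Just to make the shape of that idea concrete, a naive time-boxed loop might look something like the following sketch (don't actually trust this for real measurements; it's only here to illustrate the structure):

  // naive sketch: repeat a test until a time budget has elapsed, then report
  // a rough average (not a substitute for a real benchmarking tool)
  function timeBoxedAverage(fn, runForMs) {
      var iterations = 0;
      var start = Date.now();
      while (Date.now() - start < runForMs) {
          fn();
          iterations++;
      }
      return (Date.now() - start) / iterations;   // rough ms per operation
  }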

Actually, the length of time to repeat across should be based on the accuracy of the timer you’re using, specifically to minimize the chances of inaccuracy. The less precise your timer, the longer you need to run to make sure you’ve minimized the error percentage. A 15ms timer is pretty bad for accurate benchmarking; to minimize its uncertainty (aka “error rate”) to less than 1%, you need to run each cycle of your test iterations for 750ms. A 1ms timer only needs a cycle to run for 50ms to get the same confidence.
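One rough way to see where those numbers come from (this is back-of-the-envelope reasoning, not a formula from any particular spec): if each timer reading can be off by up to half the timer's resolution, then keeping that uncertainty under a 1% error rate means each cycle needs to run for at least roughly resolution / 2 / 0.01 milliseconds:

  // back-of-the-envelope minimum cycle length, assuming each timer reading
  // can be off by up to half the timer's resolution
  function minCycleTime(timerResolutionMs, errorRate) {
      return timerResolutionMs / 2 / errorRate;
  }

  minCycleTime( 15, 0.01 );   // 750 (ms)
  minCycleTime( 1, 0.01 );    // 50 (ms)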

But then, that’s just a single sample. To be sure you’re factoring out the skew, you’ll want lots of samples to average across. You’ll also want to understand something about just how slow the worst sample is, how fast the best sample is, how far apart those best and worst cases were, and so on. You’ll want to know not just a number that tells you how fast something ran, but also to have some quantifiable measure of how trustable that number is.

Also, you probably want to combine these different techniques (as well as others), so that you get the best balance of all the possible approaches.

That’s all bare minimum just to get started. If you’ve been approaching performance benchmarking with anything less serious than what I just glossed over, well… “you don’t know: proper benchmarking.”

Benchmark.js

Any relevant and reliable benchmark should be based on statistically sound practices. I am not going to write a chapter on statistics here, so I’ll hand wave around some terms: standard deviation, variance, margin of error. If you don’t know what those terms really mean — I took a stats class back in college and I’m still a little fuzzy on them — you are not actually qualified to write your own benchmarking logic.

Luckily, smart folks like John-David Dalton and Mathias Bynens do understand these concepts, and wrote a statistically sound benchmarking tool called Benchmark.js (http://benchmarkjs.com/). So I can end the suspense by simply saying: “just use that tool.”

I won’t repeat their whole documentation for how Benchmark.js works; they have fantastic API Docs (http://benchmarkjs.com/docs) you should read. Also there are some great (http://calendar.perfplanet.com/2010/bulletproof-javascript-benchmarks/) writeups (http://monsur.hossa.in/2012/12/11/benchmarkjs.html) on more of the details and methodology.

But just for quick illustration purposes, here’s how you could use Benchmark.js to run a quick performance test:

  function foo() {
      // operation(s) to test
  }

  var bench = new Benchmark(
      "foo test",             // test name
      foo,                    // function to test (just contents)
      {
          // ..               // optional extra options (see docs)
      }
  );

  bench.run();                // run the benchmark to collect samples

  bench.hz;                   // number of operations per second
  bench.stats.moe;            // margin of error
  bench.stats.variance;       // variance across samples
  // ..

There’s lots more to learn about using Benchmark.js besides this glance I’m including here. But the point is that it’s handling all of the complexities of setting up a fair, reliable, and valid performance benchmark for a given piece of JavaScript code. If you’re going to try to test and benchmark your code, this library is the first place you should turn.

We’re showing here the usage to test a single operation like X, but it’s fairly common that you want to compare X to Y. This is easy to do by simply setting up two different tests in a “Suite” (a Benchmark.js organizational feature). Then, you run them head-to-head, and compare the statistics to conclude whether X or Y was faster.
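For example, a head-to-head comparison using a Suite might look roughly like this (following the patterns in the Benchmark.js docs; the test names and bodies here are just placeholders):

  // compare two candidate operations head-to-head
  var suite = new Benchmark.Suite();

  suite
      .add( "X test", function(){
          // operation X
      } )
      .add( "Y test", function(){
          // operation Y
      } )
      .on( "cycle", function(evt){
          console.log( String( evt.target ) );    // per-test results as they finish
      } )
      .on( "complete", function(){
          console.log( "Fastest is: " + this.filter( "fastest" ).map( "name" ) );
      } )
      .run();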

Benchmark.js can of course be used to test JavaScript in a browser (see the “jsPerf.com” section later in this chapter), but it can also run in non-browser environments (Node.js, etc.).

One largely untapped potential use-case for Benchmark.js is to use it in your Dev or QA environments to run automated performance regression tests against critical path parts of your application’s JavaScript. Similar to how you might run unit test suites before deployment, you can also compare the performance against previous benchmarks to monitor if you are improving or degrading application performance.
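As a hedged sketch of what that could look like in a Node.js script (the baseline file, threshold, and names here are hypothetical; this is not a built-in Benchmark.js feature):

  // hypothetical regression check: compare current ops/sec against a saved baseline
  var Benchmark = require( "benchmark" );
  var baseline = require( "./baseline.json" );    // hypothetical, e.g. { "foo test": 12345 }

  function foo() {
      // critical-path operation(s) to test
  }

  var bench = new Benchmark( "foo test", foo );
  bench.run();

  // fail the check if we've slowed down more than 10% (threshold is arbitrary)
  if (bench.hz < baseline["foo test"] * 0.9) {
      console.error( "Performance regression: " + bench.hz.toFixed( 0 ) + " ops/sec" );
      process.exit( 1 );
  }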

Setup/Teardown

In the previous code snippet, we glossed over the “extra options” { .. } object. But there are two options we should discuss: setup and teardown.

These two options let you define functions to be called before and after your test case runs.

It’s incredibly important to understand that your setup and teardown code does not run for each test iteration. The best way to think about it is that there’s an outer loop (repeating cycles), and an inner loop (repeating test iterations). setup and teardown are run at the beginning and end of each outer loop (aka cycle) iteration, but not inside the inner loop.
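In code, those hook points are just functions on the options object we glossed over earlier (the bodies here are placeholders):

  var bench = new Benchmark(
      "foo test",
      function(){
          // operation(s) to test (the inner loop)
      },
      {
          setup: function(){
              // runs once before each cycle (outer loop), not before each iteration
          },
          teardown: function(){
              // runs once after each cycle (outer loop), not after each iteration
          }
      }
  );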

Why does this matter? Let’s imagine you have a test case that looks like this:

  a = a + "w";
  b = a.charAt( 1 );

Then, you define your test setup as follows:

  var a = "x";

Your temptation is probably to believe that a is starting out as "x" for each test iteration.

But it’s not! It’s starting a at "x" for each test cycle, and then your repeated + "w" concatenations will be making a larger and larger a value, even though you’re only ever accessing the character "w" at the 1 position.

Where this most commonly bites you is when you make side-effect changes to something like the DOM, such as appending a child element. You may think your parent element is reset to empty for each test iteration, but it’s actually accumulating lots of elements, and that can significantly sway the results of your tests.
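For instance, a hypothetical DOM test along these lines gets bitten in exactly that way (the element and id are made up for illustration):

  // test case (inner loop): appends a new child on every iteration
  parentEl.appendChild( document.createElement( "div" ) );

And the setup:

  // setup (outer loop only): parentEl is NOT reset for each iteration,
  // so children pile up across the whole cycle
  var parentEl = document.getElementById( "test-container" );
  parentEl.innerHTML = "";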