Closed Bug 580532 Opened 15 years ago Closed 3 years ago

Improve visibility of small wins in JS engine (reduce SunSpider noise)

Categories

(Core :: JavaScript Engine, defect)

x86
Linux
normal

Tracking


RESOLVED INACTIVE

People

(Reporter: paul.biggar, Unassigned)

References

Details

Attachments

(2 files, 11 obsolete files)

2.66 KB, text/plain
Details
1.55 KB, patch
Details | Diff | Splinter Review
Wins of less than 1-2% are hard to see: on my machine, I see 2-3% variance between runs of SunSpider. This bug throws together a few experiments to decrease that variance.
This reads the test file, and _then_ starts a timer, then evals the code. Note that calling eval changes the test results significantly (see bug 580529), so instead I call a 'snarl' function that does the same thing; I'll attach it in the next patch. This is a patch against sunspider 63804.
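The shape of that change can be sketched like this (the function below is mine, not the patch itself; the actual patch used 'snarl' rather than eval() to dodge the eval slowdown from bug 580529):

```javascript
// Read the test source *before* starting the timer, so file IO is not
// part of the measurement. Only the eval of the already-loaded source
// is timed.
function timeScript(source) {
  var start = Date.now();   // timer starts once the source is in memory
  eval(source);
  return Date.now() - start;
}

var elapsed = timeScript("for (var i = 0; i < 1e5; i++) {}");
```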
Attached patch (good) The snarl function (obsolete) — Splinter Review
Does the same as eval, except without a massive slowdown from bug 580529.
The sunspider script does not give a useful result for variance in the test cases. I would guess that it gives _overall_ variance in the test cases, meaning that +1ms in one test cancels out -1ms in another. calc_benchmark_variance.js works out the stddev of each test individually. It also averages them to give an overall number, though I think this would earn a look of disapproval from someone who knows anything about stats. These numbers give a better indication of the variance of one or more test runs. This is primarily useful for seeing if some change reduces variability in the test, which is the point of this bug.
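A minimal sketch of what calc_benchmark_variance.js computes (the data shape and names here are illustrative, not the actual script):

```javascript
// Per-test standard deviation, plus a crude overall number that is just
// the mean of the per-test stddevs (the "look of disapproval" average).
function stddev(xs) {
  var mean = xs.reduce(function (a, b) { return a + b; }, 0) / xs.length;
  var sq = xs.reduce(function (a, x) { return a + (x - mean) * (x - mean); }, 0);
  return Math.sqrt(sq / xs.length);
}

// Hypothetical results: one array of times per sub-test, one entry per run.
var results = { "3d-cube": [7, 8, 7, 9], "crypto-md5": [3, 3, 4, 3] };
var perTest = {};
for (var name in results)
  perTest[name] = stddev(results[name]);

var names = Object.keys(perTest);
var overall = names.reduce(function (a, n) { return a + perTest[n]; }, 0) / names.length;
```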
(In reply to comment #1) > This reads the test file, and _then_ starts a timer, then evals the code. This reduces variability significantly in my tests (from .8 to .5) using the attachment 458949 [details].
v8 and sunspider both have some randomness in them, so it may help to seed the tests to make them reproducible:

tests/v8-v4/v8-crypto.js
1411:  while(rng_pptr < rng_psize) {  // extract some randomness from Math.random()
1412:    t = Math.floor(65536 * Math.random());
1468: // PKCS#1 (type 2, random) pad input string s to n bytes, and return a bigint
1480:  while(n > 2) { // random non-zero pad
1542: // Undo PKCS#1 (type 2, random) padding and, if valid, return the plaintext
1585: // Generate a new random private key B bits long, using public expt E

tests/v8-v4/v8-earley-boyer.js
1949:   (peephole (hole 1 "Math.floor(Math.random()*" 'n ")")))
1951: function sc_random(n) {
1952:   return Math.floor(Math.random()*n);

tests/v8-v4/v8-splay.js
60: // The benchmark framework guarantees that Math.random is
62:   return Math.random();

tests/sunspider-0.9.1/string-validate-input.js
67:  var l = Math.floor(26*Math.random());
78:  var l = Math.floor(9*Math.random());

tests/sunspider-0.9.1/string-base64.js
122:  str += String.fromCharCode( (25 * Math.random()) + 97 );
This has no effect (or maybe a slight negative effect).
I checked out what v8 does, to see if we can learn from them.
- They do a few more warm-up runs.
- They use this Math.random implementation:

// To make the benchmark results predictable, we replace Math.random
// with a 100% deterministic alternative.
Math.random = (function() {
  var seed = 49734321;
  return function() {
    // Robert Jenkins' 32 bit integer hash function.
    seed = ((seed + 0x7ed55d16) + (seed << 12))  & 0xffffffff;
    seed = ((seed ^ 0xc761c23c) ^ (seed >>> 19)) & 0xffffffff;
    seed = ((seed + 0x165667b1) + (seed << 5))   & 0xffffffff;
    seed = ((seed + 0xd3a2646c) ^ (seed << 9))   & 0xffffffff;
    seed = ((seed + 0xfd7046c5) + (seed << 3))   & 0xffffffff;
    seed = ((seed ^ 0xb55a4f09) ^ (seed >>> 16)) & 0xffffffff;
    return (seed & 0xfffffff) / 0x10000000;
  };
})();
This has no effect. In v8-the-engine's own test harness, the tests are run multiple times without restarting the benchmark, so more warm-up is probably better. However, we use JSC's test scripts, which means the tests are run in separate shell invocations. I think changing the sunspider tests to duplicate this would move the goalposts, so it's probably a bad idea.
Attachment #459277 - Attachment description: Bad: Run sunspider tests multiple times → (bad) Run sunspider tests multiple times
Attachment #458939 - Attachment description: Remove IO overhead from sunspider → (good) Remove IO overhead from sunspider
Attachment #458945 - Attachment description: The snarl function → (good) The snarl function
Attachment #458949 - Attachment description: Script to calculate variance of sunspider results → (good) Script to calculate variance of sunspider results
Attachment #459270 - Attachment description: Seed the shell's random number generator. → (bad) Seed the shell's random number generator.
This comes from v8, but it doesn't measurably help variability.
Disabling address-space randomization seems to help a lot. Combined with the other randomness fixes above, I'm getting results which barely change in their overall running time. (That said, I'm still getting the same level of overall variance from the calc_benchmark_variance script, so this might just be luck.) On Linux:

setarch i686 -Rl bash
Nice-ing the process doesn't help, and actually seems to hinder, though I may be running this wrong. I tried both of:

nice -20 sunsp
sudo /usr/bin/nice -20 sunsp

(sunsp is a bash script I use to run sunspider.)
Attachment #459281 - Attachment description: (bad) Replace Math.random with a deterministic function → (neutral) Replace Math.random with a deterministic function
Attachment #459277 - Attachment description: (bad) Run sunspider tests multiple times → (neutral) Run sunspider tests multiple times
Attachment #459270 - Attachment description: (bad) Seed the shell's random number generator. → (neutral) Seed the js shell's random number generator.
Date.now() truncates the time to a whole number of milliseconds. This patch keeps the fractional (double) value instead, which makes really short tests significantly more stable.
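The idea can be sketched as follows. (Treating 'dateNow' as a shell intrinsic with fractional-millisecond resolution is an assumption here; fall back to Date.now() where it doesn't exist.)

```javascript
// Date.now() returns whole milliseconds, so a test that runs in ~2 ms can
// only ever read 1, 2, or 3. Keeping the fractional value gives short
// tests far more headroom. 'dateNow' is assumed, not standard.
var now = (typeof dateNow === "function") ? dateNow : Date.now;

var start = now();
for (var i = 0; i < 1e5; i++) {}
var elapsed = now() - start;   // may be fractional, e.g. 0.317, in the js shell
```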
Relevant literature:
- "Statistically Rigorous Java Performance Evaluation" by Andy Georges et al.
- "Producing Wrong Data Without Doing Anything Obviously Wrong!" by Todd Mytkowicz et al. (aka the most mind-blowing paper of all time).
Had a chat with dmandelin about some rudimentary stats. It seems that sunspider is applying tools meant for the normal distribution to a distribution which is probably not normal, which is probably why the "significant/not significant" results are not very good. Anyway, he supports the made-up variability metric I'm using in this script, which just sums the difference between each sub-test's mean and its actual result:

sum = 0
foreach i in 1..n
  calculate mean across the 60 runs of x_i
  foreach j in 1..60
    sum += abs(mean[i] - x_i[j])
variability = sum / 60
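As JavaScript, that pseudocode comes out roughly like this (the data shape is hypothetical: runs[i][j] is the time of sub-test i on run j):

```javascript
// Ad-hoc "variability" metric: for each sub-test, sum the absolute
// deviations of its runs from that sub-test's mean, then divide the
// grand total by the number of runs.
function variability(runs) {
  var nRuns = runs[0].length;
  var sum = 0;
  for (var i = 0; i < runs.length; i++) {
    var mean = runs[i].reduce(function (a, b) { return a + b; }, 0) / nRuns;
    for (var j = 0; j < nRuns; j++)
      sum += Math.abs(mean - runs[i][j]);
  }
  return sum / nRuns;
}
```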
Attachment #458949 - Attachment is obsolete: true
This is great stuff! Bug 534547 is related. Paul, feel free to mark that as a dup of this bug if you think it's appropriate; you've certainly dug deeper into this than I have. Are you planning to file these changes as bugs against SunSpider? Seems like the best thing to do. Measuring V8 with the SS harness is kind of bogus, apparently V8 proper ignores some setup/teardown code. Eg. see bug 562553 where I did an optimization that helps v8-splay.js enormously when measured via the SS harness, but I think it may not have helped when measuring V8 proper. (Follow the link to the V8 bug for full info.)
BTW, I regularly use Cachegrind to compare SS instruction counts for changes to TM. The variance is *much* less than you see when timing SS. So I suspect most of the variance is due to hard-to-handle stuff like cache and branch predictor behaviour changes. On the other hand, V8 has significant variance even under Cachegrind (at least, V8 when you run the tests as is, without using the V8 harness).
(In reply to comment #17)
> BTW, I regularly use Cachegrind to compare SS instruction counts for changes
> to TM.

In a previous project (phpcompiler.org), I used Cachegrind for this. However, instruction count doesn't really cut it: there are cache misses and branch mispredictions at the least. Of course, that makes it much harder to get a single "is this better or worse" number. It could work if we say "a branch misprediction is X cycles, an L2 cache miss is Y cycles", etc. What do you think?

> On the other hand, V8 has significant variance even under Cachegrind (at
> least, V8 when you run the tests as is, without using the V8 harness).

Does comment 8 help here?
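The idea floated here — folding several Cachegrind event counts into one cycle estimate — can be sketched like this. The per-event costs below are invented for illustration; real penalties vary wildly by microarchitecture.

```javascript
// Weighted cost model over Cachegrind-style event counts. The event
// names follow Cachegrind's conventions (Ir = instructions, D1mr = L1
// data read misses, DLmr = last-level data read misses, Bcm = conditional
// branch mispredictions); the cycle weights are assumptions, not measured.
var COST = { Ir: 1, D1mr: 10, DLmr: 200, Bcm: 15 };

function estimatedCycles(events) {
  var total = 0;
  for (var ev in COST)
    total += (events[ev] || 0) * COST[ev];
  return total;
}
```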
(In reply to comment #16)
> Are you planning to file these changes as bugs against SunSpider? Seems like
> the best thing to do.

I don't know; I'll look in the future. Fixing it may be too big a job, and would invalidate past results, i.e. those on arewefastyet.com, which may or may not be a big deal.
(In reply to comment #18)
> However, instruction count doesn't really cut it: there are cache misses and
> branch mispredictions at the least. Of course, that makes it much harder to
> get a single "is this better or worse" number. It could work if we say "a
> branch misprediction is X cycles, an L2 cache miss is Y cycles", etc. What do
> you think?

That's a swamp, believe me. (I started a PhD on such a topic, and quickly switched to a different topic when I saw how awful it was.) I know that instruction count is only part of the picture, but it turns out, for the things I'm doing, it's actually really useful. One explanation is that an imperfect metric measured precisely can often be more useful than an imperfect metric measured imprecisely. Another is that most of my changes are attempts to execute less code, and instruction count differences are great for that. Occasionally I make a change that aims to improve cache or branch prediction behaviour (e.g. get rid of a big switch) and Cachegrind gives good (simulated) numbers for that too. Finally, everyone else measures time, so I find it useful to be the one guy who measures instruction counts; it just gives me a different view on things.

> > On the other hand, V8 has significant variance even under Cachegrind (at
> > least, V8 when you run the tests as is, without using the V8 harness).
>
> Does comment 8 help here?

I know that it helps v8-crypto a lot at least; I've previously hacked it to use a deterministic Math.random().
(In reply to comment #19)
> (In reply to comment #16)
> > Are you planning to file these changes as bugs against SunSpider? Seems
> > like the best thing to do.
>
> I don't know; I'll look in the future. Fixing it may be too big a job, and
> would invalidate past results, i.e. those on arewefastyet.com, which may or
> may not be a big deal.

Apple can keep hosting 0.9 or whatever they host forever, but new versions should come up under separate version numbers and URLs. All benchmarks should be annual editions (or every six months, maybe). If you are willing to make the case in a bugs.webkit.org bug, I'll help get you air support.

/be
OK, I've cooked up a proper test. I ran sunspider with --runs=1000 for each change, and measured the "variability" using the algorithm from comment 15. Each change builds on the previous one.

Baseline:
  Total time: 584319
  Total variance: 13660.900000001668
  Variance / #scripts: 13.661
  Variance * 1000000000 / #scripts / total time: 23379.182

Better date resolution:
  Total time: 583266.9450683594
  Total variance: 7989.476169433345
  Variance / #scripts: 7.989
  Variance * 1000000000 / #scripts / total time: 13697.804

Use eval to avoid IO:
  Total time: 579124.4201660156
  Total variance: 6005.46807714848
  Variance / #scripts: 6.005
  Variance * 1000000000 / #scripts / total time: 10369.910

Seed random number generator:
  Total time: 578326.2463378906
  Total variance: 6097.8916064452715
  Variance / #scripts: 6.098
  Variance * 1000000000 / #scripts / total time: 10544.034

Use deterministic Math.random:
  Total time: 584234.15625
  Total variance: 5991.305539062612
  Variance / #scripts: 5.991
  Variance * 1000000000 / #scripts / total time: 10254.973

Warm up "properly":
  Total time: 580214.2788085938
  Total variance: 6612.213340820201
  Variance / #scripts: 6.612
  Variance * 1000000000 / #scripts / total time: 11396.158

Address-space randomization:
  Total time: 579880.6350097656
  Total variance: 6310.679111327958
  Variance / #scripts: 6.311
  Variance * 1000000000 / #scripts / total time: 10882.721

It seems that seeding the random number generator and warming up "properly" are not useful, so I'll remove them and run it again. The biggest wins are using better date resolution (by far!), using eval (big), and address-space randomization (not bad). If you find fault with the numbers, let me know.
I should try it with 30 runs too, since that's what people use. I could try finding a sweet spot in the number of runs, but I think that will vary from machine to machine.
OK, new run, having taken out the patches from above that didn't provide benefits.

Baseline:
  Total time: 584454
  Total variance: 13775.409999999318
  Variance / #scripts: 13.775
  Variance * 1000000000 / #scripts / total time: 23569.708

Better date resolution:
  Total time: 583279.1579589844
  Total variance: 8011.607755859456
  Variance / #scripts: 8.012
  Variance * 1000000000 / #scripts / total time: 13735.460

Use eval to avoid IO:
  Total time: 578957.7702636719
  Total variance: 5925.116317871156
  Variance / #scripts: 5.925
  Variance * 1000000000 / #scripts / total time: 10234.108

Use deterministic Math.random:
  Total time: 585289.6984863281
  Total variance: 5637.047147460895
  Variance / #scripts: 5.637
  Variance * 1000000000 / #scripts / total time: 9631.209

Address-space randomization:
  Total time: 585399.4216308594
  Total variance: 5990.405707519354
  Variance / #scripts: 5.990
  Variance * 1000000000 / #scripts / total time: 10233.023

Looks like address-space randomization isn't so useful; everything else looks good.
njn: I had a look at bug 534547, and I'll fold that into the work I'm doing here. Bug 534547 notes variance due to the tests using Date functions to add randomness, which obviously makes them completely unreproducible (actually worse than a lot of what I was measuring here). The tests which use Date in some fashion are:

3d-raytrace.js
crypto-aes.js
date-format-tofte.js
date-format-xparb.js
math-cordic.js
string-tagcloud.js

v8-v4 also uses Date in v8-earley-boyer.
And if we go fixing sunspider, then using the v8 setup/teardown code will fix bug 562553.
Attachment #459281 - Attachment description: (neutral) Replace Math.random with a deterministic function → (good) Replace Math.random with a deterministic function
Apply this patch to significantly increase the consistency of your sunspider runs. You don't need to do anything else. Note that it contains all of SunSpider.
Assignee: general → pbiggar
Attachment #458939 - Attachment is obsolete: true
Attachment #458945 - Attachment is obsolete: true
Attachment #459270 - Attachment is obsolete: true
Attachment #459277 - Attachment is obsolete: true
Attachment #459281 - Attachment is obsolete: true
Attachment #459554 - Attachment is obsolete: true
Status: NEW → UNCONFIRMED
Ever confirmed: false
Depends on: 585001
I've been filing related bugs to SunSpider. The meta bug is: https://bugs.webkit.org/show_bug.cgi?id=43253
Attachment #463636 - Attachment is patch: false
Attachment #463636 - Attachment mime type: text/plain → application/bzip2
Fixes a bug caused by rebasing.
Attachment #463636 - Attachment is obsolete: true
Patch to apply to a checkout to make SunSpider much less random. This assumes you have a SunSpider directory in js/src. (The patch no longer contains a full copy of SunSpider, since that bothered people. Most of the other functionality has moved into SpiderMonkey, so it is no longer required here.)
Attachment #463665 - Attachment is obsolete: true
I'm finding some significant outliers, and other anomalies. In all cases I ran the benchmark 1000 times.

access-binary-trees has odd pathological cases. 990 times it takes 14-16ms, then it gets weird:

18.072021484375,
40.85498046875,
40.9990234375,
41.1201171875,
41.3681640625,
42.1279296875,
44.419921875,
45.10595703125,
45.52197265625,

This costs me 6 ms when I average it all up. Same with bitops-bitwise-and. For 970 cases it takes 2-3ms, then BOOM:

"bitops-bitwise-and": 6.81591796875,
"bitops-bitwise-and": 7.798095703125,
"bitops-bitwise-and": 7.9970703125,
"bitops-bitwise-and": 8.06005859375,
"bitops-bitwise-and": 8.075927734375,
"bitops-bitwise-and": 8.2060546875,
"bitops-bitwise-and": 8.875,

crypto-aes and string-tagcloud get much faster when address-space randomization is turned on. Run time (both mean and median) goes from 26ms to 22ms in the former case, and 53 to 47ms in the latter. There aren't bizarre outliers in either case, like in access-binary-trees: every instance speeds up.

Meanwhile, access-fannkuch, math-partial-sums, math-spectral-norm and string-unpack-code go nuts (become very, very noisy) when ASR is turned on. access-binary-trees, bitops-bitwise-and, access-nsieve and bitops-nsieve-bits go a bit nuts, but nothing like the others. I'm not sure exactly what would cause this.
eval() can be used again now (bug 580529), which means the patch is usable on shells too (so long as they support read()).
Attachment #465599 - Attachment is obsolete: true
Summary: Improve visibility of small wins in JS engine → Improve visibility of small wins in JS engine (reduce SunSpider noise)
Now that SunSpider uses a run function, we no longer need to use eval(), and no longer need the submillisecond timing, as they are both subsumed by run(). This updates to apply to trunk SunSpider as of today (Feb 11, 2011).
Attachment #469028 - Attachment is obsolete: true
Assignee: paul.biggar → general
Assignee: general → nobody
Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Resolution: --- → INACTIVE