Benchmarking Guide
Micro-benchmarks are notoriously inaccurate in any system.
Here are some guidelines you should read carefully before trying
to construct an accurate benchmark in the Strongtalk system. This
is especially important because there is one big 'gotcha'
associated with running benchmarks from a "do it" in Strongtalk:
- Put your benchmark in a real method. As
mentioned in the tour, to get compiled performance
results in Strongtalk, the primary computation (the code
where your benchmark spends most of its time) needs
to be in an actual method, not in a "do it"
from a workspace. This is because the current version of
the VM doesn't use the optimized method until the next
time it is called after compilation, and a "do
it" method by definition is never called more than
once. (In a real program or normal "do it"
this effect is never an issue; only micro-benchmarks have
loops that iterate zillions of times with the loop itself
in the "do it".) This is not a fundamental
limitation of the technology, but we hadn't implemented
"on-stack-replacement" in the Smalltalk system
at the time of release (we did implement it for Java).
Note that this does not mean that the code your
"do it" invokes won't be optimized and used the
first time around; it will. But the big performance gains
for micro-benchmarks come from inlining all the
called methods directly into the performance-critical
benchmark loop, and if that loop is literally in the
"do it", that isn't possible.
A good way to run your benchmark is to create a method
in the Test class (which is there for this kind of thing)
that runs for at least 100 milliseconds, and then call
that method a number of times until it becomes optimized
(see the sketch after this list). The Test>benchmark:
method will do this for you and report the fastest time.
A good rule of thumb for telling whether your code is
running enough: if your method doesn't get faster and
then stabilize at some speed, it's not being run enough
times.
- Know how to choose a benchmark. Micro-benchmarks
are notorious for producing misleading results in all
systems, which is why real benchmarks are bigger
programs that, as much as possible, use the same code on
both systems. If you insist on writing a micro-benchmark,
keep these issues in mind:
- Your code should spend its time in
Smalltalk, not down in rarely-used
system primitives or C-callouts. For example,
'factorial' spends almost all of its time in the
LargeInteger multiplication primitive, not in
Smalltalk code.
- Use library methods that are commonly
used in real performance-critical code. Take
factorial as an example: when was the last time
your program was performance-bound on
LargeInteger multiplication?
- Use code that is like normal Smalltalk
code (use of core data structures, allocation,
message sending in a normal pattern, instance
variable access, blocks). This is the
biggest reason most micro-benchmarks aren't
accurate. Real code is broken up into many
methods, with lots of message sends, instance
variable reads, boolean operations, SmallInteger
operations, temporary allocations, and Array
accesses, all mixed together. These are the
things that Strongtalk is designed to optimize.
- Use the same code and input data on both
systems. Running a highly
implementation-dependent operation like
"compile all methods" is not a good
benchmark, because the set of methods is totally
different and the bytecode compilers are
implemented completely differently. (Also, the
bytecode compiler is not a performance-critical
routine in applications, so it has not been tuned
at all in Strongtalk. When was the last time your
users were twiddling their thumbs waiting for the
bytecode compiler?)
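To make these guidelines concrete, here is a minimal sketch of a
micro-benchmark structured the right way. The loopBenchmark
selector and its body are illustrative assumptions, not code from
the released image; only the Test class and Test>benchmark: are
named above, and the exact signature of benchmark: may differ in
your image. The hot loop lives in a real method and exercises
ordinary Smalltalk operations (message sends, SmallInteger
arithmetic, Array access, blocks) rather than a rarely-used
primitive:

    loopBenchmark
        "Hypothetical micro-benchmark installed as a method in the
        Test class. The hot loop is inside the method, not in a
        'do it', so the compiler can inline the called methods
        into it. Adjust the counts so the method runs for at
        least 100 milliseconds."
        | a sum |
        a := (1 to: 100) asArray.
        sum := 0.
        10000 timesRepeat: [
            1 to: 100 do: [ :j | sum := sum + (a at: j) ] ].
        ^sum

Calling this method several times (for example via Test>benchmark:,
or by hand as sketched near the end of this guide) should show the
time drop and then stabilize once the optimized code is being used.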
How We Did Benchmarking
When we benchmarked the system ourselves, we assembled a large
suite of accepted OO benchmarks, such as Richards, DeltaBlue (a
constraint solver), the Stanford benchmarks, Slopstones, and
Smopstones. These benchmarks are already in the image if you
want to run them. Try evaluating "VMSuite
runBenchmarks" and look at the code it runs. If you want a
real performance comparison, run them on other VMs as well.
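To run the built-in suite here, evaluate the expression quoted
above in a workspace (the output format depends on the image):

    VMSuite runBenchmarks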
As an example, I put a couple of very small micro-benchmarks,
run the right way, in the system tour (the code is in the
Test class). You can try running them on other Smalltalks as a
start.
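If you want to time your own code the same way on two systems, a
simple portable harness like the sketch below works. It assumes
only standard Smalltalk-80 protocol (Time millisecondsToRun: and
the Transcript), which Strongtalk and most other Smalltalks
provide; loopBenchmark is the hypothetical method from the sketch
earlier in this guide:

    "Run the benchmark several times and report the best time.
    In Strongtalk the later runs use the optimized compiled code."
    | best ms |
    best := nil.
    5 timesRepeat: [
        ms := Time millisecondsToRun: [ Test new loopBenchmark ].
        Transcript show: ms printString; cr.
        (best isNil or: [ ms < best ]) ifTrue: [ best := ms ] ].
    Transcript show: 'best: ', best printString, ' ms'; cr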
Other benchmarking problems people have been having