Micro-benchmarks are notoriously inaccurate, in any system. Here are some guidelines you should read carefully before trying to construct an accurate benchmark in the Strongtalk system.
A good way to run your benchmark is to create a method in the Test class (which is there for this kind of thing) that runs for at least 100 milliseconds, and then call that method repeatedly until it becomes optimized. The Test>benchmark: method will do this for you and report the fastest time. A good rule of thumb for telling whether your code has run enough: if your method doesn't get faster and then stabilize at some speed, it hasn't been run enough times to be optimized.
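For example (a sketch only: whether Test>benchmark: takes a selector or a block, and whether it lives on the class or instance side, are assumptions to verify in the browser), you might add a method like this to the Test class:

    sumLoop
        "A benchmark body: a plain loop of SmallInteger arithmetic, sized
        so that a single call takes well over 100 milliseconds."
        | sum |
        sum := 0.
        1 to: 5000000 do: [ :i | sum := sum + (i \\ 7) ].
        ^sum

and then evaluate something along the lines of "Test benchmark: #sumLoop" in a workspace, which calls the method over and over and reports the fastest time once the speed has stabilized.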
Know how to choose a benchmark. Because micro-benchmarks produce misleading results in all systems, credible benchmarks are larger programs that, as much as possible, run the same code on both systems. If you insist on writing a micro-benchmark, keep these issues in mind:
Your code should spend its time in Smalltalk, not down in rarely-used system primitives or C-callouts. For example, 'factorial' spends almost all of its time in the LargeInteger multiplication primitive, not Smalltalk code.
Use library methods that are commonly used in real performance-critical code. Take factorial as an example: when was the last time your program was performance-bound on LargeInteger multiplication?
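To make that concrete, a micro-benchmark built around an expression like the one below is dominated by a single VM primitive, so it says very little about how ordinary compiled Smalltalk code will run:

    "Anti-example: nearly all of the time goes to the LargeInteger
    multiplication primitive rather than to compiled Smalltalk code."
    1000 factorial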
Use code that looks like normal Smalltalk code: core data structures, allocation, message sending in a normal pattern, instance variable access, blocks. Failing to do this is the biggest reason most micro-benchmarks aren't accurate. Real code is broken up into many methods, with lots of message sends, instance variable reads, boolean operations, SmallInteger operations, temporary allocations, and Array accesses, all mixed together. These are the things that Strongtalk is designed to optimize.
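For contrast with factorial, here is one hypothetical shape such a benchmark body might take, mixing allocation, message sends, blocks, boolean tests, SmallInteger arithmetic, and Array access (the class names are the usual Smalltalk ones; adjust to the library in your image):

    mixedWork
        "Hypothetical benchmark body: exercises the kinds of operations
        the compiler is designed to optimize, rather than one primitive."
        | points total |
        points := OrderedCollection new.
        1 to: 1000 do: [ :i | points add: (Array with: i with: i * 2) ].
        total := 0.
        points do: [ :each |
            (each at: 1) even
                ifTrue: [ total := total + (each at: 2) ]
                ifFalse: [ total := total - (each at: 1) ] ].
        ^total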
Use the same code and input data on both systems. Running a highly implementation-dependent operation like "compile all methods" is not a good benchmark, because the set of methods is totally different and the bytecode compilers are implemented completely differently. (Also, the bytecode compiler is not a performance-critical routine in applications, so it has not been tuned at all in Strongtalk. When was the last time your users were twiddling their thumbs waiting for the bytecode compiler?)
When we benchmarked the system ourselves, we assembled a large suite of accepted OO benchmarks, such as Richards, DeltaBlue (a constraint solver), the Stanford benchmarks, Slopstones and Smopstones. These benchmarks are already in the image, if you want to run them. Try evaluating "VMSuite run" and look at the code it runs. If you want a real performance comparison, run these on other VMs.
As an example, I put a couple of very small micro-benchmarks, run the right way, into the system tour (the code is in the Test class). You can try running them on other Smalltalks as a starting point.
Finally: if you run into a crash while benchmarking, read the troubleshooting section.