Benchmark Report: wasmz vs wasmi vs wasm3 vs wamr
- Date: 2026-04-16 09:54
- OS: Linux 6.17.0-1010-azure x86_64
- Runs per benchmark: 20 (warmup: 5)
Versions
| Runtime | Version |
|---|---|
| wasmz | dev (ReleaseFast) |
| wasmi | wasmi 2.0.0-beta.2 |
| wasm3 | Wasm3 v0.5.1 on x86_64 |
| wamr | iwasm 2.4.3 |
Binary Size
| Runtime | Size |
|---|---|
| wasmz | 892.6 KB |
| wasmi | 7.0 MB |
| wasm3 | 466.3 KB |
| wamr | 344.8 KB |
Execution Time (median ms) — lower is better
fib(30) — pure C compiled to WASM
| Runtime | Median (ms) | ± stddev |
|---|---|---|
| wasmz | 37.0 | ± 1.5 |
| wasmi | 38.4 | ± 0.6 |
| wasm3 | 39.7 | ± 1.0 |
| wamr | 49.6 | ± 1.7 |
QuickJS fib(25) — JS engine running inside WASM (1.4 MB module)
| Runtime | Median (ms) | ± stddev |
|---|---|---|
| wasmz | 174.9 | ± 3.4 |
| wasmi | 184.4 | ± 2.9 |
| wasm3 | 217.4 | ± 8.2 |
| wamr | 242.9 | ± 4.5 |
esbuild — JS bundler running inside WASM (19 MB module)
Note: wamr is excluded from the esbuild tests because it does not support stdin input and hits a stack overflow on large workloads.
| Runtime | Median (ms) | ± stddev |
|---|---|---|
| wasmz | 909.5 | ± 12.5 |
| wasmi | 918.2 | ± 19.9 |
| wasm3 | 2215.0 | ± 19.5 |
Peak RSS (memory) — lower is better
Peak RSS = highest resident set size seen at any point during the run. Avg RSS = time-weighted mean RSS sampled every 100 ms during one run (reflects actual memory consumption over the process lifetime, not just the spike).
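As a rough illustration of the two metrics defined above, the following sketch computes both from a list of RSS samples. The function name and the sample values are made up for the example, and the peak is taken over the samples for simplicity; the benchmark harness itself may read the peak from the kernel's high-water mark instead.

```python
def rss_stats(samples_kb, interval_ms=100):
    """Return (peak, time-weighted mean) for RSS samples taken at a fixed interval."""
    if not samples_kb:
        raise ValueError("need at least one sample")
    peak = max(samples_kb)
    # Every sample covers the same time span, so the time-weighted mean
    # reduces to the arithmetic mean of the samples.
    total_time_ms = interval_ms * len(samples_kb)
    avg = sum(s * interval_ms for s in samples_kb) / total_time_ms
    return peak, avg


# Hypothetical samples (KB): a short spike dominates the peak but not the mean.
peak, avg = rss_stats([1000, 3000, 8000, 4000])
```

A process that spikes briefly and then settles shows a large gap between the two numbers, which is why the tables report both.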
fib(30)
| Runtime | Peak RSS | Avg RSS |
|---|---|---|
| wasmz | 17.3 MB | 8.7 MB |
| wasmi | 21.9 MB | 11.0 MB |
| wasm3 | 18.9 MB | 9.5 MB |
| wamr | 11.0 MB | 5.4 MB |
QuickJS fib(25)
| Runtime | Peak RSS | Avg RSS |
|---|---|---|
| wasmz | 1.8 MB | 1.2 MB |
| wasmi | 1.8 MB | 1.2 MB |
| wasm3 | 1.8 MB | 1.2 MB |
| wamr | 1.8 MB | 1.4 MB |
esbuild bundling
Note: wamr is excluded due to stdin/stack limitations.
| Runtime | Peak RSS | Avg RSS |
|---|---|---|
| wasmz | 1.8 MB | 1.6 MB |
| wasmi | 1.8 MB | 1.6 MB |
| wasm3 | 1.8 MB | 1.7 MB |
Performance Optimizations
The benchmark results above are the product of a series of targeted optimizations applied to wasmz’s compiler and runtime. This section documents each technique.
Register-Based IR (Stack-to-Register Lowering)
WebAssembly is a stack machine. wasmz translates every function’s stack bytecode into a flat array of typed register-IR ops during compilation. Each op carries explicit source and destination slot indices (16-bit unsigned), eliminating the push/pop bookkeeping that stack interpreters must perform at runtime.
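The lowering described above can be sketched with a compile-time stack that holds slot indices instead of values. This is a minimal illustration, not wasmz's actual compiler: the Op layout, the op names, and the choice to copy locals into fresh temporaries are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str          # hypothetical op name, e.g. "copy", "const", "i32.add"
    dst: int           # destination slot index
    srcs: tuple = ()   # source slot indices
    imm: int = 0       # immediate value, used only by "const"


def lower(stack_code, num_locals):
    """Translate stack bytecode into register ops with explicit slots."""
    ops = []
    stack = []               # compile-time stack of slot indices, not values
    next_slot = num_locals   # locals occupy slots 0..num_locals-1

    def fresh():
        nonlocal next_slot
        slot = next_slot
        next_slot += 1
        assert slot < (1 << 16), "slot index must fit in 16 bits"
        return slot

    for instr in stack_code:
        kind = instr[0]
        if kind == "local.get":
            dst = fresh()
            ops.append(Op("copy", dst, (instr[1],)))
            stack.append(dst)
        elif kind == "i32.const":
            dst = fresh()
            ops.append(Op("const", dst, imm=instr[1]))
            stack.append(dst)
        elif kind == "i32.add":
            rhs = stack.pop()
            lhs = stack.pop()
            dst = fresh()
            ops.append(Op("i32.add", dst, (lhs, rhs)))
            stack.append(dst)
        elif kind == "local.set":
            ops.append(Op("copy", instr[1], (stack.pop(),)))
        else:
            raise NotImplementedError(kind)
    return ops, next_slot


# local0 = local0 + 1
code = [("local.get", 0), ("i32.const", 1), ("i32.add",), ("local.set", 0)]
ops, slots_len = lower(code, num_locals=1)
```

All push/pop work happens inside lower(); the emitted ops only name slots, so nothing in the output needs a runtime stack pointer.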
Direct Threaded Code Dispatch
Instead of a traditional switch/case bytecode loop, wasmz uses direct threaded code (inspired by Marr et al., 2023). Each encoded instruction consists of an 8-byte handler pointer followed by its operands. At the end of each handler, the next() macro tail-calls the next instruction's handler through that pointer, which eliminates the overhead of:
- An explicit loop-condition check at the top of the bytecode loop
- A single branch-prediction site (which saturates on complex control flow)
This approach spreads branch prediction across multiple dispatch points, improving predictor accuracy on modern CPUs.
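The shape of this dispatch scheme can be sketched as follows. The code array stores a handler reference followed by its operands, and every handler ends by jumping to the next handler itself, so there is no central loop. The handler names and operand layout are assumptions for illustration. Python has no guaranteed tail calls, so each dispatch here grows the Python call stack; the sketch only suits short programs and says nothing about performance.

```python
def next_op(code, ip, slots):
    # Stand-in for the next() macro: jump straight to the handler stored at ip.
    return code[ip](code, ip, slots)


def op_const(code, ip, slots):
    dst, imm = code[ip + 1], code[ip + 2]
    slots[dst] = imm
    return next_op(code, ip + 3, slots)   # this handler's own dispatch site


def op_add(code, ip, slots):
    dst, lhs, rhs = code[ip + 1], code[ip + 2], code[ip + 3]
    slots[dst] = (slots[lhs] + slots[rhs]) & 0xFFFFFFFF   # wrap like i32.add
    return next_op(code, ip + 4, slots)   # a second, separate dispatch site


def op_ret(code, ip, slots):
    return slots[code[ip + 1]]            # ends the chain, no further dispatch


# Flat encoding: handler reference followed by its operands.
program = [
    op_const, 0, 40,
    op_const, 1, 2,
    op_add, 2, 0, 1,
    op_ret, 2,
]
slots = [0] * 3
result = program[0](program, 0, slots)
```

Because each handler carries its own dispatch call, a native implementation gets one indirect branch per handler, which is the property the section relies on.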
r0 and fp0 Accumulator Registers
Inspired by the Wasm3 M3 architecture, wasmz maintains two accumulator registers:
- r0 — holds the most recent i32/i64 result
- fp0 — holds the most recent f32/f64 result (f32 values are bit-cast to f64)
Handlers that produce a numeric result write it to both the accumulator and the destination slot. This lets the CPU keep the most recent result in a real hardware register across instruction boundaries, so back-to-back arithmetic instructions avoid a slot load. The *_imm_r variants and other fusions take advantage of this by reading from r0 implicitly.
Superinstructions (Instruction Fusion)
The compiler performs a single forward pass over the IR and merges common multi-op patterns into one instruction. This reduces the total number of dispatched operations and removes redundant slot reads/writes.
The fused families currently implemented are:
| Label | Pattern | Fused Op |
|---|---|---|
| C | const + binop | binop_imm — immediate rhs embedded in the instruction |
| D | binop + local_set | binop_to_local — result written directly to a local slot |
| E | const + binop + local_set | binop_imm_to_local |
| F | compare + jump_if_z | compare_jump_if_false — one dispatch for test-and-branch |
| G | const + compare + br_if | compare_imm_jump_if_false |
| H | local_get + binop_imm + local_set (same local) | local_inplace — mutates local in-place, no temp slot |
| I | binop + ret | binop_ret — compute and return in one dispatch |
| J | compare_jump_if_false + jump | compare_jump_if_true |
| K | copy + jump_if_nz | copy_jump_if_nz — essential for br_if with a result value |
Additional local-slot specialized fusions:
- binop_tee_local: writes the result to both a stack slot and a local (local.tee pattern)
- cmp_to_local: comparison result written directly to a local slot
- const_to_local: constant written directly to a local slot
- imm_to_local: superinstruction combining a constant-to-temp with a copy-to-local, preserving the source slot for downstream use
- load_to_local: i32/i64 memory load result written directly to a local
- global_get_to_local: global read result written directly to a local
- call_to_local: direct call result written directly to a local slot (saves one dispatch vs call + local_set)
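A single forward fusion pass of this kind can be sketched as below, covering only families C, D and E from the table. The tuple layouts are hypothetical. The sketch also assumes that a temporary slot has exactly one consumer, so dropping the intermediate write is safe; a real pass would have to verify that.

```python
# Hypothetical op layouts used in this sketch:
#   ("const", dst, imm)
#   ("binop", name, dst, lhs, rhs)
#   ("local_set", local, src)
# Fused results:
#   ("binop_imm", name, dst, lhs, imm)
#   ("binop_to_local", name, local, lhs, rhs)
#   ("binop_imm_to_local", name, local, lhs, imm)

def fuse(ops):
    """One forward pass that merges each op into the last emitted op when possible."""
    out = []
    for op in ops:
        prev = out[-1] if out else None
        if (op[0] == "binop" and prev is not None
                and prev[0] == "const" and prev[1] == op[4]):
            # C: the constant feeds the rhs, so embed it as an immediate.
            _, name, dst, lhs, _rhs = op
            out[-1] = ("binop_imm", name, dst, lhs, prev[2])
        elif (op[0] == "local_set" and prev is not None
                and prev[0] in ("binop", "binop_imm") and prev[2] == op[2]):
            # D / E: retarget the previous result straight into the local.
            out[-1] = (prev[0] + "_to_local", prev[1], op[1]) + prev[3:]
        else:
            out.append(op)
    return out


# local0 = local0 + 1, expressed as three unfused ops
unfused = [
    ("const", 3, 1),
    ("binop", "i32.add", 4, 0, 3),
    ("local_set", 0, 4),
]
fused = fuse(unfused)
```

Because every fusion looks only at the most recently emitted op, family E falls out of applying C and then D, and the three input ops collapse into a single dispatch.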
r0 Accumulator Variants
For long chains of const + binop_imm sequences (common in tight loops), the compiler tracks an accumulator register r0. When the previous instruction’s destination matches the next instruction’s source, the lhs slot field is elided from the encoding, producing *_imm_r variants. This shrinks the instruction and saves one memory load per dispatch.
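The variant selection described above can be sketched as a pass that remembers which slot's value currently sits in r0. The tuple layout and op names are assumptions for the example, and the sketch handles straight-line code only; a real compiler would also have to forget the tracked slot at branch targets.

```python
def select_r_variants(ops):
    """ops: list of (name, dst, lhs, imm) binop_imm instructions in program order."""
    out = []
    in_r0 = None   # slot whose value is known to be in the accumulator
    for name, dst, lhs, imm in ops:
        if lhs == in_r0:
            # lhs is already in r0: drop the slot field from the encoding.
            out.append((name + "_r", dst, imm))
        else:
            out.append((name, dst, lhs, imm))
        in_r0 = dst   # every handler leaves its result in r0
    return out


chain = [
    ("i32.add_imm", 5, 0, 1),     # reads slot 0, nothing tracked yet
    ("i32.mul_imm", 6, 5, 3),     # lhs is the previous dst
    ("i32.and_imm", 7, 6, 255),   # lhs is the previous dst
]
encoded = select_r_variants(chain)
```

In this chain only the first instruction keeps its lhs field; the other two read the accumulator.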
call_leaf Superinstruction
When a direct call site provably targets a leaf function (one that makes no calls itself) and the call's result is not used (a void call), the compiler emits call_leaf instead of call. The VM handler skips result-slot setup and the return-value copy path entirely, reducing overhead on hot void calls.
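The selection rule above can be sketched as a small analysis over function bodies. The Func record, the set of opcode names treated as calls, and the result_used flag are assumptions for the example, not wasmz's data structures.

```python
from dataclasses import dataclass


@dataclass
class Func:
    body: list   # list of op tuples; op[0] is the opcode name


# Hypothetical opcode names that count as "makes a further call".
CALL_OPS = {"call", "call_indirect", "call_leaf", "call_to_local"}


def is_leaf(func):
    """A leaf function contains no call ops of any kind."""
    return all(op[0] not in CALL_OPS for op in func.body)


def select_call(funcs, callee_index, result_used):
    """Pick the call opcode for a direct call site."""
    if not result_used and is_leaf(funcs[callee_index]):
        return ("call_leaf", callee_index)
    return ("call", callee_index)


funcs = [
    Func(body=[("binop_imm", "i32.add", 1, 0, 1)]),   # leaf
    Func(body=[("call", 0)]),                         # calls function 0
]
leaf_site = select_call(funcs, 0, result_used=False)
value_site = select_call(funcs, 0, result_used=True)
nonleaf_site = select_call(funcs, 1, result_used=False)
```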
Slot Recycling
During lowering, temporary slots created for intermediate values are recycled once their last use is seen. This reduces the total slots_len per compiled function, which directly shrinks the value-stack frame allocated at call time.
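Slot recycling of this kind is commonly implemented with a free list. The sketch below assumes that approach; the class and method names are hypothetical, and the source does not say which data structure wasmz uses.

```python
class SlotAllocator:
    """Hands out temp slots above the locals and reuses released ones."""

    def __init__(self, num_locals):
        self.next_slot = num_locals   # first slot index never handed out so far
        self.free = []                # released temps available for reuse

    def alloc(self):
        if self.free:
            return self.free.pop()
        slot = self.next_slot
        self.next_slot += 1
        return slot

    def release(self, slot):
        # Called once the last use of a temporary has been seen.
        self.free.append(slot)

    @property
    def slots_len(self):
        return self.next_slot


# Lowering (a + b) + (c + d) with four locals in slots 0..3:
slots = SlotAllocator(num_locals=4)
t0 = slots.alloc()    # holds a + b
t1 = slots.alloc()    # holds c + d
slots.release(t0)     # the final add is the last use of both temps
slots.release(t1)
t2 = slots.alloc()    # result of the final add reuses a released slot
```

Without recycling the frame would need seven slots for this expression; with it, six are enough.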
Lazy Compilation
Functions are compiled on their first invocation rather than all at once at Module.compile() time. For large modules (esbuild is 19 MB), this makes startup near-instant and amortises compilation cost over actual execution. The --eager-compile flag or Config.eager_compile = true opts into up-front compilation when predictable latency matters more than startup time.
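The compile-on-first-call behaviour can be sketched as a wrapper that holds the raw body until it is first invoked. LazyFunction, load_functions and toy_compile are hypothetical names; only the eager_compile option mirrors a setting named in the text, and its real semantics may differ in detail.

```python
class LazyFunction:
    """Holds raw bytecode until the first call, then swaps in compiled code."""

    def __init__(self, body, compile_fn):
        self._body = body
        self._compile_fn = compile_fn
        self._compiled = None

    def ensure_compiled(self):
        if self._compiled is None:
            self._compiled = self._compile_fn(self._body)
            self._body = None   # the raw body is no longer needed
        return self._compiled

    def __call__(self, *args):
        return self.ensure_compiled()(*args)


def load_functions(bodies, compile_fn, eager_compile=False):
    funcs = [LazyFunction(body, compile_fn) for body in bodies]
    if eager_compile:
        for func in funcs:      # pay the whole compilation cost up front
            func.ensure_compiled()
    return funcs


compile_log = []


def toy_compile(body):
    # Stand-in for the real compiler: records the call and returns a callable.
    compile_log.append(body)
    return lambda x: x + len(body)


funcs = load_functions([b"\x00\x01", b"\x02"], toy_compile)
first = funcs[0](10)    # compiles function 0 only
second = funcs[0](20)   # reuses the compiled code
```

After these two calls only one function has been compiled; the second body is never touched unless something calls it.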
mmap-Based Memory
Two mmap optimizations reduce peak RSS:
- File mapping: the .wasm file is memory-mapped rather than heap-copied. Pending (uncompiled) function bodies borrow slices directly from the mapped region, so the bytecode is never duplicated in the heap until compilation.
- Virtual reservation for linear memory: when allocating WebAssembly linear memory, wasmz reserves a large virtual address range with mmap(PROT_NONE) and then commits pages with mprotect as the module calls memory.grow. This avoids the RSS spike that realloc produces when doubling a backing buffer.
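The file-mapping half of this can be sketched with Python's standard mmap module, using memoryview slices as the borrowed function bodies. The file contents and byte offsets are invented for the demo. The PROT_NONE reservation and mprotect commit step are not shown, because Python's mmap module does not expose mprotect.

```python
import mmap
import os
import tempfile


def map_module(path):
    """Map a file read-only; length 0 maps the whole file."""
    with open(path, "rb") as f:
        # The mapping stays valid after the file object is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


# Build a small stand-in file: the 8-byte wasm preamble plus two fake bodies.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00asm\x01\x00\x00\x00" + b"BODY0" + b"BODY1")
    path = tmp.name

mapping = map_module(path)
view = memoryview(mapping)

# Pending function bodies are slices of the mapping: no bytes are copied here.
body0 = view[8:13]
body1 = view[13:18]

# Copies are made only when a body is actually consumed (here, for checking).
magic = bytes(view[:4])
body0_bytes = bytes(body0)
body1_bytes = bytes(body1)

# Release every borrowed view before closing the mapping.
body0.release()
body1.release()
view.release()
mapping.close()
os.unlink(path)
```

The slices must not outlive the mapping, which is the same lifetime constraint a native implementation has to respect when function bodies borrow from the mapped file.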
Lazy GC Heap Initialization
The GC heap inside Store is not allocated until the first GC-typed value is actually created. Modules that use only numeric types (MVP, no GC proposal) never touch the allocator, keeping RSS minimal.
Lazy WASI Initialization
The WASI Preview 1 host is only instantiated when the module’s import table contains at least one wasi_snapshot_preview1 import. Pure compute modules pay no initialization overhead.
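Both lazy-initialization checks (GC heap and WASI host) follow the same pattern, sketched below. The Store class here is a toy stand-in with hypothetical method names, and the imports are modelled as plain (module, field) pairs.

```python
WASI_MODULE = "wasi_snapshot_preview1"


def needs_wasi(imports):
    """imports: iterable of (module_name, field_name) pairs from the import table."""
    return any(module == WASI_MODULE for module, _field in imports)


class Store:
    def __init__(self):
        self._gc_heap = None   # nothing allocated up front

    def gc_alloc(self, value):
        if self._gc_heap is None:
            self._gc_heap = []   # stand-in for the real heap setup
        self._gc_heap.append(value)
        return len(self._gc_heap) - 1   # index used as a toy reference

    @property
    def gc_heap_initialized(self):
        return self._gc_heap is not None


pure_imports = [("env", "log")]
wasi_imports = [("wasi_snapshot_preview1", "fd_write"), ("env", "log")]
store = Store()
```

A module with only pure_imports never triggers WASI setup, and a Store that never sees gc_alloc never creates its heap.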
Slot Width Reduction (u32 → u16)
All slot indices were narrowed from 32-bit to 16-bit integers. This halves the per-instruction slot storage for the most common instruction layouts, improving cache utilization in the hot interpreter loop.
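The size effect can be checked with a quick calculation. It assumes a three-operand layout (destination, lhs, rhs) after the 8-byte handler pointer; the actual wasmz instruction layouts are not given in this report.

```python
import struct

# "<" disables alignment padding, so the sizes reflect raw field widths.
HANDLER_PTR = "Q"                                     # 8-byte handler pointer
wide = struct.calcsize("<" + HANDLER_PTR + "III")     # three u32 slot indices
narrow = struct.calcsize("<" + HANDLER_PTR + "HHH")   # three u16 slot indices

slot_bytes_wide = wide - 8
slot_bytes_narrow = narrow - 8
max_slot_index = (1 << 16) - 1   # largest index a u16 slot field can hold
```

Under this assumed layout the slot fields shrink from 12 to 6 bytes and the whole instruction from 20 to 14 bytes, at the cost of capping a function at 65,536 slots.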
Handler Ordering (Future Work)
Recent research has shown that the order of bytecode handler definitions in memory affects CPU branch-prediction performance, with potential speedups of 7–23% on specific benchmarks. Genetic algorithms can search for near-optimal orderings tailored to specific workloads and CPU architectures.
wasmz does not yet implement handler reordering, but the architecture (direct threaded code with multiple dispatch sites per handler) is well-suited for such optimization. The decision to prioritize other techniques first (superinstructions, accumulator registers, lazy compilation, mmap) reflects a pragmatic tradeoff: the gains from reducing dispatch count via fusion exceed what handler ordering alone typically provides, and fusion benefits apply uniformly across all workloads.