Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Benchmark Report: wasmz vs wasmi vs wasm3 vs wamr

Date: 2026-04-16 09:54 OS: Linux 6.17.0-1010-azure x86_64 Runs per benchmark: 20 (warmup: 5)

Versions

RuntimeVersion
wasmzdev (ReleaseFast)
wasmiwasmi 2.0.0-beta.2
wasm3Wasm3 v0.5.1 on x86_64
wamriwasm 2.4.3

Binary Size

RuntimeSize
wasmz892.6 KB
wasmi7.0 MB
wasm3466.3 KB
wamr344.8 KB

Execution Time (median ms) — lower is better

fib(30) — pure C compiled to WASM

RuntimeMedian (ms)± stddev
wasmz37.0± 1.5
wasmi38.4± 0.6
wasm339.7± 1.0
wamr49.6± 1.7

QuickJS fib(25) — JS engine running inside WASM (1.4 MB module)

RuntimeMedian (ms)± stddev
wasmz174.9± 3.4
wasmi184.4± 2.9
wasm3217.4± 8.2
wamr242.9± 4.5

esbuild — JS bundler running inside WASM (19 MB module)

Note: wamr is excluded from esbuild tests because it does not support stdin input and causes stack overflow with large workloads.

RuntimeMedian (ms)± stddev
wasmz909.5± 12.5
wasmi918.2± 19.9
wasm32215.0± 19.5

Peak RSS (memory) — lower is better

Peak RSS = highest resident set size seen at any point during the run. Avg RSS = time-weighted mean RSS sampled every 100 ms during one run (reflects actual memory consumption over the process lifetime, not just the spike).

fib(30)

RuntimePeak RSSAvg RSS
wasmz17.3 MB8.7 MB
wasmi21.9 MB11.0 MB
wasm318.9 MB9.5 MB
wamr11.0 MB5.4 MB

QuickJS fib(25)

RuntimePeak RSSAvg RSS
wasmz1.8 MB1.2 MB
wasmi1.8 MB1.2 MB
wasm31.8 MB1.2 MB
wamr1.8 MB1.4 MB

esbuild bundling

Note: wamr is excluded due to stdin/stack limitations.

RuntimePeak RSSAvg RSS
wasmz1.8 MB1.6 MB
wasmi1.8 MB1.6 MB
wasm31.8 MB1.7 MB

Performance Optimizations

The benchmark results above are the product of a series of targeted optimizations applied to wasmz’s compiler and runtime. This section documents each technique.

Register-Based IR (Stack-to-Register Lowering)

WebAssembly is a stack machine. wasmz translates every function’s stack bytecode into a flat array of typed register-IR ops during compilation. Each op carries explicit source and destination slot indices (16-bit unsigned), eliminating the push/pop bookkeeping that stack interpreters must perform at runtime.

Direct Threaded Code Dispatch

Rather than a traditional switch/case bytecode loop typical of high-level interpreters, wasmz uses direct threaded code (inspired by Marr et al., 2023). Each encoded instruction contains an 8-byte handler pointer followed by its operands. After executing each handler, the next() macro issues a tail call directly to the next handler via its pointer, eliminating the overhead of:

  • An explicit loop-condition check at the top of the bytecode loop
  • A single branch-prediction site (which saturates on complex control flow)

This approach spreads branch prediction across multiple dispatch points, improving predictor accuracy on modern CPUs.

r0 and fp0 Accumulator Registers

Inspired by the Wasm3 M3 architecture, wasmz maintains two accumulator registers:

  • r0 — holds the most recent i32/i64 result
  • fp0 — holds the most recent f32/f64 result (f32 values are bit-cast to f64)

Handlers that produce a numeric result write it to both the accumulator and the destination slot. This allows the CPU to keep the top-of-stack value in a real hardware register across instruction boundaries, avoiding a slot load on every back-to-back arithmetic instruction. The *_imm_r variants and other fusions leverage this by reading from r0 implicitly.

Superinstructions (Instruction Fusion)

The compiler performs a single forward pass over the IR and merges common multi-op patterns into one instruction. This reduces the total number of dispatched operations and removes redundant slot reads/writes.

The fused families currently implemented are:

LabelPatternFused Op
Cconst + binopbinop_imm — immediate rhs embedded in the instruction
Dbinop + local_setbinop_to_local — result written directly to a local slot
Econst + binop + local_setbinop_imm_to_local
Fcompare + jump_if_zcompare_jump_if_false — one dispatch for test-and-branch
Gconst + compare + br_ifcompare_imm_jump_if_false
Hlocal_get + binop_imm + local_set (same local)local_inplace — mutates local in-place, no temp slot
Ibinop + retbinop_ret — compute and return in one dispatch
Jcompare_jump_if_false + jumpcompare_jump_if_true
Kcopy + jump_if_nzcopy_jump_if_nz — essential for br_if with a result value

Additional local-slot specialized fusions:

  • binop_tee_local — writes the result to both a stack slot and a local (local.tee pattern)
  • cmp_to_local — comparison result written directly to a local slot
  • const_to_local — constant written directly to a local slot
  • imm_to_local — superinstruction combining a constant-to-temp with a copy-to-local, preserving the source slot for downstream use
  • load_to_local — i32/i64 memory load result written directly to a local
  • global_get_to_local — global read result written directly to a local
  • call_to_local — direct call result written directly to a local slot (saves one dispatch vs call + local_set)

r0 Accumulator Variants

For long chains of const + binop_imm sequences (common in tight loops), the compiler tracks an accumulator register r0. When the previous instruction’s destination matches the next instruction’s source, the lhs slot field is elided from the encoding, producing *_imm_r variants. This shrinks the instruction and saves one memory load per dispatch.

call_leaf Superinstruction

When a direct-call site is proven to target a leaf function (a function that makes no further calls itself) whose result is not used (void call), the compiler emits call_leaf instead of call. The VM handler skips result-slot setup and the return-value copy path entirely, reducing call overhead on hot void-dispatches.

Slot Recycling

During lowering, temporary slots created for intermediate values are recycled once their last use is seen. This reduces the total slots_len per compiled function, which directly shrinks the value-stack frame allocated at call time.

Lazy Compilation

Functions are compiled on their first invocation rather than all at once at Module.compile() time. For large modules (esbuild is 19 MB), this makes startup near-instant and amortises compilation cost over actual execution. The --eager-compile flag or Config.eager_compile = true opts into up-front compilation when predictable latency matters more than startup time.

mmap-Based Memory

Two mmap optimizations reduce peak RSS:

  1. File mapping — the .wasm file is memory-mapped rather than heap-copied. Pending (uncompiled) function bodies borrow slices directly from the mapped region, so the bytecode is never duplicated in the heap until compilation.

  2. Virtual reservation for linear memory — when allocating WebAssembly linear memory, wasmz reserves a large virtual address range with mmap(PROT_NONE) and then commits pages with mprotect as the module calls memory.grow. This avoids the RSS spike that realloc produces when doubling a backing buffer.

Lazy GC Heap Initialization

The GC heap inside Store is not allocated until the first GC-typed value is actually created. Modules that use only numeric types (MVP, no GC proposal) never touch the allocator, keeping RSS minimal.

Lazy WASI Initialization

The WASI Preview 1 host is only instantiated when the module’s import table contains at least one wasi_snapshot_preview1 import. Pure compute modules pay no initialization overhead.

Slot Width Reduction (u32 → u16)

All slot indices were narrowed from 32-bit to 16-bit integers. This halves the per-instruction slot storage for the most common instruction layouts, improving cache utilization in the hot interpreter loop.

Handler Ordering (Future Work)

Recent research has shown that the order of bytecode handler definitions in memory affects CPU branch-prediction performance, with potential speedups of 7–23% on specific benchmarks. Genetic algorithms can search for near-optimal orderings tailored to specific workloads and CPU architectures.

wasmz does not yet implement handler reordering, but the architecture (direct threaded code with multiple dispatch sites per handler) is well-suited for such optimization. The decision to prioritize other techniques first (superinstructions, accumulator registers, lazy compilation, mmap) reflects a pragmatic tradeoff: the gains from reducing dispatch count via fusion exceed what handler ordering alone typically provides, and fusion benefits apply uniformly across all workloads.