CMOC vs VBCC vs GCC6809
How well do 6809 C compilers optimize?
A head-to-head measurement of code size and execution speed for the three C compilers in common use for the Motorola 6809 — the CPU inside the Vectrex, the TRS-80 Color Computer, and the Dragon. The question: given the same C source, how tight and how fast is the machine code each one produces?
All numbers below were produced on an Apple-Silicon Mac. Everything is reproducible (see the last section).
The contestants
cmoc — a purpose-built 6809 C compiler (Pierre Sarrazin). Version 0.1.67, -O2 (its maximum). Small, portable, easy to build; a light optimizer.
vbcc — a retargetable ISO-C compiler (Volker Barthelmann), using the vbcchc12 6809 backend at -O=255. Modern, genuinely optimizing, builds cleanly from source today.
gcc6809 — a port of GCC to the 6809, based on GCC 4.3.6 (dftools), at -O2. A real optimizing GCC backend: strength reduction, tail calls, etc. From 2008 and painful to build; here it runs as a prebuilt x86_64 binary under Rosetta.
On all three: int is 16-bit, long is 32-bit, pointers are 16-bit.
Methodology
The benchmarks
Eleven small kernels modelled on the kind of code a Vectrex game actually runs. They are deliberately freestanding and dependency-free: each uses only local typedefs for the integer widths (no headers, no BIOS, no libc), so the identical source text compiles on all three compilers and the comparison is of code generation, not libraries.
objmove — move 16 sprites by velocity, wrap the screen (struct-array indexing, 16-bit add/compare).
collide — O(n²) AABB overlap over 12 sprites (nested loops, 8-bit math, branches).
fixmul — scale 16 values by a Q8.8 factor (32-bit multiply, shifts).
rng — xorshift-16 PRNG, fill 32 bytes (16-bit shifts/xor).
memops — hand-rolled memcpy + memset (pointer post-increment loops).
strupr — copy + upper-case a string (byte loop, range compare).
checksum — sum + rolling-hash over 64 bytes (16-bit accumulate, 8-bit rotate).
isort — insertion sort of 16 bytes (nested loop, array shifting).
statem — switch-based game state machine (switch lowering: jump table vs if-chain).
bcdscore — add to a packed-BCD score (nibble math, carry propagation).
clamp — clamp a vector list to signed-8 range (signed min/max branches on 16-bit).
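To make "freestanding and dependency-free" concrete, here is a minimal sketch of what a kernel in the style of rng could look like. It is not the actual source from the benchmark repository: the names rng_seed, rng_next and rng_fill, the seed handling, and the (7, 9, 8) shift triple are all assumptions chosen for illustration.

```c
/* Local typedefs stand in for headers, as the text describes.
   These widths hold on the three 6809 compilers and on typical hosts. */
typedef unsigned char  u8;
typedef unsigned short u16;

#define RNG_BYTES 32

u8 rng_out[RNG_BYTES];        /* global output keeps the work observable */
static u16 rng_state = 1;     /* hypothetical default seed */

/* Set the generator state; zero would lock an xorshift generator at zero. */
void rng_seed(u16 seed)
{
    rng_state = seed ? seed : 1;
}

/* One xorshift step. The (7, 9, 8) shift triple is an assumption. */
u16 rng_next(void)
{
    u16 x = rng_state;
    x ^= (u16)(x << 7);
    x ^= (u16)(x >> 9);
    x ^= (u16)(x << 8);
    rng_state = x;
    return x;
}

/* Fill the buffer with the low byte of successive states. */
void rng_fill(void)
{
    u8 i;
    for (i = 0; i < RNG_BYTES; i++)
        rng_out[i] = (u8)rng_next();
}
```

Because nothing outside the file is referenced, the same text can be handed to all three compilers unchanged, which is the property the comparison relies on.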
Measuring size
Each kernel is compiled and assembled; the byte count of the generated code section is read directly from the object/listing. Runtime-helper bodies (multiply routines, etc.) live in libraries and are not counted — but the call sites that invoke them are, so a compiler that calls out instead of inlining still pays for the call.
Measuring speed
Speed is a dynamic cycle count, not an estimate. Each compiler’s machine code is linked to a flat image and executed in a cycle-accurate 6809 simulator until it returns to a sentinel address; the elapsed 6809 cycles are reported.
To isolate the kernel from the test harness, every kernel is wrapped in a no-argument function run() that initializes fixed inputs and calls the kernel. Each driver is built twice — once normally and once with the kernel call removed — and the baseline is subtracted. Because the setup code writes to global arrays (observable side effects), the optimizer can’t delete it, so the subtraction cleanly yields kernel cycles = full − baseline. Loop trip counts are compile-time constants, so each figure is a single deterministic run.
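The build-twice-and-subtract scheme can be sketched as follows. This is an illustration under stated assumptions, not the repository's driver: the BASELINE macro name, the stand-in byte-copy kernel, and the kernel_cycles helper are hypothetical.

```c
typedef unsigned char u8;

#define N 64

u8 src[N];   /* globals: writes to them are observable side effects, */
u8 dst[N];   /* so the setup loop cannot be optimized away           */

/* Stand-in kernel (hypothetical): a plain byte copy. */
static void kernel(void)
{
    u8 i;
    for (i = 0; i < N; i++)
        dst[i] = src[i];
}

/* Built twice: once as-is (full run), once with BASELINE defined
   (setup only). The macro name is an assumption. */
void run(void)
{
    u8 i;
    for (i = 0; i < N; i++) {   /* fixed inputs */
        src[i] = (u8)(i * 3);
        dst[i] = 0;
    }
#ifndef BASELINE
    kernel();
#endif
}

/* The subtraction applied to the two simulator cycle counts. */
unsigned long kernel_cycles(unsigned long full, unsigned long baseline)
{
    return full - baseline;
}
```

Both builds execute exactly the same setup code, so whatever the setup costs cancels out and only the kernel call and its body remain in the difference.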
Results
Code size — bytes (lower is better)
kernel cmoc vbcc gcc6809
objmove 148 119 80
collide 206 203 131
fixmul 112 116 89
rng 118 93 64
memops 58 56 41
strupr 54 49 46
checksum 90 101 59
isort 117 111 59
statem 149 132 132
bcdscore 159 105 108
clamp 122 103 139
------------------------------------
TOTAL 1333 1188 948
Normalized (gcc6809 = 1.00): cmoc 1.41 · vbcc 1.25 · gcc6809 1.00.
gcc6809 produces the smallest code on 8 kernels outright and ties vbcc on statem; vbcc wins bcdscore and clamp; CMOC wins none. CMOC is the largest on 8 of the 11 kernels (the exceptions are fixmul, checksum and clamp), primarily due to a per-function frame-pointer prologue (see below).
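The totals and the win counts can be re-derived from the table with a few lines of host-side C. This is only a checking aid; the function names size_total and outright_wins are made up for this sketch, and the data is copied from the size table above.

```c
#define KERNELS 11

/* Code-size table from the article, in bytes: {cmoc, vbcc, gcc6809}. */
static const unsigned size_bytes[KERNELS][3] = {
    {148, 119,  80},  /* objmove  */
    {206, 203, 131},  /* collide  */
    {112, 116,  89},  /* fixmul   */
    {118,  93,  64},  /* rng      */
    { 58,  56,  41},  /* memops   */
    { 54,  49,  46},  /* strupr   */
    { 90, 101,  59},  /* checksum */
    {117, 111,  59},  /* isort    */
    {149, 132, 132},  /* statem   */
    {159, 105, 108},  /* bcdscore */
    {122, 103, 139}   /* clamp    */
};

/* Sum of one column: 0 = cmoc, 1 = vbcc, 2 = gcc6809. */
unsigned size_total(int col)
{
    unsigned sum = 0;
    int k;
    for (k = 0; k < KERNELS; k++)
        sum += size_bytes[k][col];
    return sum;
}

/* Kernels where this column is strictly smaller than both others;
   a tie (as on statem) counts for nobody. */
int outright_wins(int col)
{
    int a = (col + 1) % 3, b = (col + 2) % 3;
    int k, wins = 0;
    for (k = 0; k < KERNELS; k++)
        if (size_bytes[k][col] < size_bytes[k][a] &&
            size_bytes[k][col] < size_bytes[k][b])
            wins++;
    return wins;
}
```

Dividing each column total by the gcc6809 total gives the normalized figures quoted above.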
Speed — kernel cycles (lower is better)
kernel cmoc vbcc gcc6809
objmove 3648 4300 1509
collide 20873 12110 7644
fixmul 15096 44611 n/a (see note)
rng 7360 6068 3583
memops 4152 2434 1726
strupr 1810 1197 1444
checksum 9316 9090 3657
isort 17329 8703 6553
statem 178 88 115
bcdscore 650 354 342
clamp 2695 1665 1473
Note: gcc6809 fixmul needs __mulsi3 (32-bit multiply), a libgcc routine not bundled with the prebuilt toolchain, so it can’t be linked/run. Its size (89 B, smallest) was still measured.
Aggregate over the ten runnable kernels (gcc6809 = 1.00): cmoc 2.43 · vbcc 1.64 · gcc6809 1.00. On this mix gcc6809 is roughly 2.4× faster than cmoc and 1.6× faster than vbcc. It wins 8 of the 10; VBCC takes strupr and statem. CMOC’s one win is fixmul, which sits outside the aggregate because gcc6809 could not run it.
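The aggregate only makes sense if a kernel that one compiler could not run is dropped for all three. A small host-side sketch of that rule follows; the NA marker and the function names are assumptions for illustration, and the numbers are copied from the speed table.

```c
#define KERNELS 11
#define NA 0UL   /* marks a kernel that could not be run */

/* Kernel cycles from the article: {cmoc, vbcc, gcc6809}. */
static const unsigned long cycles[KERNELS][3] = {
    { 3648,  4300, 1509},  /* objmove  */
    {20873, 12110, 7644},  /* collide  */
    {15096, 44611,   NA},  /* fixmul   */
    { 7360,  6068, 3583},  /* rng      */
    { 4152,  2434, 1726},  /* memops   */
    { 1810,  1197, 1444},  /* strupr   */
    { 9316,  9090, 3657},  /* checksum */
    {17329,  8703, 6553},  /* isort    */
    {  178,    88,  115},  /* statem   */
    {  650,   354,  342},  /* bcdscore */
    { 2695,  1665, 1473}   /* clamp    */
};

/* Sum one column over the rows where every compiler has a figure. */
unsigned long runnable_total(int col)
{
    unsigned long sum = 0;
    int k;
    for (k = 0; k < KERNELS; k++) {
        if (cycles[k][0] == NA || cycles[k][1] == NA || cycles[k][2] == NA)
            continue;
        sum += cycles[k][col];
    }
    return sum;
}

/* Ratio to gcc6809 in tenths, truncated (24 means "roughly 2.4x"). */
unsigned long ratio_x10(int col)
{
    return runnable_total(col) * 10UL / runnable_total(2);
}
```

Skipping the fixmul row for every column keeps cmoc's and vbcc's large 32-bit multiply counts from inflating a ratio that gcc6809 has no matching entry for.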
Whole-program size — a complete Pong
The kernels are tiny hot loops. To see whether the same ranking holds for a real program, pong.c is a complete Vectrex Pong: ball physics, paddle AI, paddle/ball collision, scoring, a center-court draw list, score printing, and a main game loop. It’s written the same fair way — portable C logic with the Vectrex BIOS routines as uniform extern calls (so every compiler emits the same JSR at each call site, and the BIOS bodies in ROM are not counted, exactly like the runtime-helper bodies in the kernels).
Code size of the whole program:
program cmoc vbcc gcc6809
pong 1092 874 789
Normalized (gcc6809 = 1.00): cmoc 1.38 · vbcc 1.11 · gcc6809 1.00.
The ranking is unchanged — gcc6809 smallest, CMOC largest — but the shape shifts in an instructive way: VBCC nearly catches gcc6809 (1.11×) on real program code, much closer than its 1.25× on the kernel size aggregate. The reason is that gcc6809’s largest advantages come from arithmetic-heavy inner loops (strength-reducing index multiplies, etc.), and those are a much smaller fraction of a whole game than of the micro-kernels. CMOC stays ~1.4× larger: its per-function frame-pointer prologue is a fixed tax that scales with the number of functions (Pong has ~12). There’s no speed figure for Pong — it’s a BIOS-driven infinite frame loop, not a bounded kernel.
Why the differences — at the assembly level
Three recurring code-generation decisions explain most of the spread.
1. cmoc sets up a frame pointer on every function
CMOC gives each function a U-based frame even when it isn’t needed:
_obj_update:
PSHS U ; save caller's U
LEAU ,S ; U = frame pointer
LEAS -3,S ; reserve locals
...
That’s bytes and cycles on every call. VBCC and gcc6809 address locals off S directly and omit the frame pointer (-fomit-frame-pointer is gcc’s default at -O2), which is a major contributor to CMOC’s size and speed deficit.
2. 16-bit multiply: inline vs helper-call vs eliminated
The array index &objs[i] is objs + i*6. Watch the three strategies.
CMOC inlines the small constant multiply with the hardware MUL:
LDB -1,U ; i
LDA #6 ; sizeof(Obj)
MUL ; D = i*6
LEAX D,X
VBCC calls its runtime helper for the general 16-bit multiply:
jsr (__mulint16)
Compact at the call site, but the call/return and the routine itself cost cycles, which is why VBCC is slower than CMOC on objmove despite its smaller code.
gcc6809 does strength reduction: it never multiplies inside the loop at all, keeping a running pointer and running products, and just adding each iteration:
L11: sty ,x ; store position
...
leax 6,x ; advance &objs[i] by sizeof(Obj)
leay 200,y ; next i*200 without a multiply
leau 111,u ; next i*111 without a multiply
...
jmp _obj_update ; tail call (no jsr/rts overhead)
Eliminating the multiplies and tail-calling is why gcc6809’s objmove runs in 1509 cycles versus 3648 (cmoc) and 4300 (vbcc).
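Expressed back in C, the transformation looks roughly like the pair of functions below. This is a sketch only: the Obj layout is a hypothetical 6-byte struct chosen to match the sizeof(Obj) of 6 seen in the listings, the screen wrap is left out, and forms_agree is a helper invented here to show that both forms compute the same thing.

```c
typedef signed char s8;
typedef short       s16;

/* Hypothetical 6-byte sprite record. */
typedef struct { s16 x, y; s8 vx, vy; } Obj;

#define NOBJ 16

/* Indexed form: every objs[i] implies objs + i * sizeof(Obj). */
void update_indexed(Obj *objs)
{
    int i;
    for (i = 0; i < NOBJ; i++) {
        objs[i].x = (s16)(objs[i].x + objs[i].vx);
        objs[i].y = (s16)(objs[i].y + objs[i].vy);
    }
}

/* Strength-reduced form: one pointer advanced by sizeof(Obj) per pass,
   so no multiply is left inside the loop. */
void update_pointer(Obj *objs)
{
    Obj *p   = objs;
    Obj *end = objs + NOBJ;
    for (; p != end; p++) {
        p->x = (s16)(p->x + p->vx);
        p->y = (s16)(p->y + p->vy);
    }
}

/* Returns 1 when both forms leave identical results. */
int forms_agree(void)
{
    Obj a[NOBJ], b[NOBJ];
    int i;
    for (i = 0; i < NOBJ; i++) {
        a[i].x  = b[i].x  = (s16)(i * 10);
        a[i].y  = b[i].y  = (s16)(100 - i);
        a[i].vx = b[i].vx = (s8)(i - 8);
        a[i].vy = b[i].vy = (s8)3;
    }
    update_indexed(a);
    update_pointer(b);
    for (i = 0; i < NOBJ; i++)
        if (a[i].x != b[i].x || a[i].y != b[i].y)
            return 0;
    return 1;
}
```

A compiler that performs this rewrite itself lets the source stay in the readable indexed form; with a lighter optimizer, writing the pointer form by hand is one way to get a similar effect.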
3. 32-bit multiply: the one CMOC wins
fixmul’s (s32)vin[i] * factor is a 32-bit multiply. Here CMOC’s hand-written MUL16-based long-multiply path is markedly tighter than VBCC’s __mulint32: 15096 vs 44611 cycles, so CMOC is ~3× faster. gcc6809 would likely do well too (its code is the smallest), but its 32-bit multiply helper isn’t in the prebuilt library, so it can’t be timed. The lesson: aggregate winners still have specific weak spots, and VBCC’s 32-bit math is one.
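For readers unfamiliar with Q8.8, the operation being timed is roughly the following. Only the expression (s32)vin[i] * factor comes from the article; the function names, the output array, and the final shift-and-narrow step are assumptions about how such a kernel is usually written.

```c
typedef short s16;
typedef long  s32;   /* 32-bit on the 6809 compilers; may be wider on a host */

#define NVALS 16

s16 vin[NVALS];
s16 vout[NVALS];

/* Scale one value by a Q8.8 factor, where 256 represents 1.0.
   The widening cast is what forces the 32-bit multiply.
   Note: >> on a negative value is implementation-defined in C. */
s16 q88_scale(s16 v, s16 factor)
{
    return (s16)(((s32)v * factor) >> 8);
}

/* Apply the factor to the whole input vector. */
void fixmul(s16 factor)
{
    int i;
    for (i = 0; i < NVALS; i++)
        vout[i] = q88_scale(vin[i], factor);
}
```

As an example, a factor of 384 is 1.5 in Q8.8, so an input of 512 scales to 768. Each element costs one full 32-bit multiply, which explains why the quality of the multiply routine dominates this kernel's cycle count.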
Per-compiler verdict
gcc6809 — best optimizer, worst ergonomics. Smallest code, fastest on most kernels, often by ~2×, thanks to real optimization passes (strength reduction, tail calls, good register allocation, frame-pointer omission). The catch: it’s a 2008-vintage GCC 4.3.6 that is genuinely hard to build on a modern host. On Apple Silicon, a natively built cc1 crashes on any 2-argument function, so it has to be run as a prebuilt x86_64 binary under Rosetta. If you can get it working, it produces the best code.
VBCC — the pragmatic choice. A modern, actively maintained, retargetable compiler that builds cleanly from source in minutes. ~25% larger and ~1.6× slower than gcc6809 on the kernels, but it closes most of that size gap on a whole program (1.11×), and it’s comfortably better than cmoc almost everywhere. Its notable weakness is 32-bit multiply. For most projects, this is the best effort-to-quality ratio.
CMOC — simplest, most portable, largest/slowest. Builds natively everywhere with no fuss and is easy to reason about, but it’s a light optimizer: a frame-pointer prologue on every function and helper-heavy 16-bit math make it the biggest and (usually) slowest. It does beat VBCC on objmove (inline MUL beats VBCC’s helper call) and wins fixmul outright (its 32-bit multiply is excellent).
A rough way to hold all this in your head:
size (kernels): gcc6809 < vbcc < cmoc (1.00 : 1.25 : 1.41)
size (pong) : gcc6809 < vbcc < cmoc (1.00 : 1.11 : 1.38)
speed (kernels): gcc6809 < vbcc < cmoc (1.00 : 1.64 : 2.43)
ease to build : cmoc < vbcc << gcc6809 (gcc6809 hardest)
Caveats/threats to validity
Synthetic kernels. These are small, hot-loop-style routines with compile-time-constant trip counts. They over-represent arithmetic and loops and under-represent large functions with heavy register pressure (where the 6809’s tiny register set hurts all three). Real game frames also spend most of their time in BIOS/vector-drawing code that’s identical across compilers.
One optimization setting each (cmoc -O2, vbcc -O=255, gcc -O2). A size-focused pass could shift the size numbers.
Helper bodies excluded from size, included in speed (via the call). Fair for “how good is the compiler”, but a multiply-heavy kernel’s total ROM footprint also depends on the runtime library.
gcc6809 fixmul is missing because its 32-bit-multiply library routine isn’t bundled; treat the gcc6809 speed aggregate as “10 of 11 kernels”.
gcc6809 runs under Rosetta, but that only affects the host build/run time, not the 6809 code it emits: the generated code is identical to that of a native run.
Reproduce
Find the benchmark here: https://github.com/rogerboesch/vectreC
bench/measure.sh # 3-way code-size table (+ the Pong whole-program line)
bench/measure_speed.sh # 3-way dynamic cycle-count table
Both scripts point at the toolchains and the cycle-accurate simulator; kernel sources are in bench/src/, the timing drivers in bench/drv/.
Addendum: cmoc 0.1.98 (June 2026)
A follow-up to the CMOC vs VBCC vs GCC6809 benchmark above. The main article measured cmoc 0.1.67. Since then, Pierre Sarrazin released 0.1.98 (2026-06-06), so I rebuilt it from source and re-ran the identical kernels.
Same method, same flags. Default --vectrex codegen, no -fomit-frame-pointer — exactly as in the main article, so the only thing that changed is the compiler version. (Building 0.1.98 needs a recent lwtools; 4.17 can’t assemble its stdlib, lwtools 4.24 can.)
Result: a free, across-the-board win
cmoc 0.1.98 is ~9% smaller overall and faster on 10 of 11 kernels than 0.1.67 — no source changes required.
CODE SIZE (bytes) SPEED (kernel cycles)
kernel 0.1.67 0.1.98 Δ 0.1.67 0.1.98 Δ
objmove 148 150 +1.4% 3648 3741 +2.5%
collide 206 173 -16.0% 20873 15945 -23.6%
fixmul 112 109 -2.7% 15096 14896 -1.3%
rng 118 113 -4.2% 7360 7093 -3.6%
memops 58 54 -6.9% 4152 3886 -6.4%
strupr 54 50 -7.4% 1810 1696 -6.3%
checksum 90 83 -7.8% 9316 8593 -7.8%
isort 117 99 -15.4% 17329 11536 -33.4%
statem 149 127 -14.8% 178 153 -14.0%
bcdscore 159 139 -12.6% 650 536 -17.5%
clamp 122 117 -4.1% 2695 2613 -3.0%
-----------------------------------------------------------
TOTAL 1333 1214 -8.9%
The only regression is objmove (a narrower path for one address-of-then-store pattern), and it’s marginal.
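The Δ columns are plain relative changes, (new − old) / old. A small integer-only helper for re-checking them is sketched below; the function name and the tenths-of-a-percent return convention are choices made for this example, not anything from the benchmark scripts.

```c
/* Relative change from old_val to new_val in tenths of a percent,
   truncated toward zero after a half-step nudge that matches the
   table's one-decimal figures (e.g. -160 means -16.0%).
   Assumes old_val > 0. */
long delta_tenths_pct(long old_val, long new_val)
{
    long scaled = (new_val - old_val) * 1000L;
    if (scaled >= 0)
        return (scaled + old_val / 2) / old_val;
    return -((-scaled + old_val / 2) / old_val);
}
```

For instance, collide's size going from 206 to 173 bytes yields -160, i.e. the -16.0% shown in the table, and the 1333 to 1214 total yields -89, i.e. -8.9%.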
Where it lands against the others
Re-normalizing the main article’s tables with the new cmoc column (gcc6809 = 1.00):
                      cmoc 0.1.67   cmoc 0.1.98   vbcc   gcc6809
size (kernels)        1.41          1.28          1.25   1.00
speed (10 runnable)   2.43          1.99          1.64   1.00
On size, cmoc essentially pulls level with vbcc (1.28 vs 1.25): it now even produces the smallest code on statem and beats vbcc on collide, fixmul, memops, checksum and isort. On speed it closes roughly a third of the gap to gcc6809, though vbcc and gcc6809 stay ahead. fixmul is still cmoc’s standout win (its 32-bit multiply remains ~3× faster than vbcc’s).
Why — and what didn’t change
The gains are entirely from the low-level peephole optimiser maturing across 31 releases, not any architectural change. Diffing the isort inner loop shows the concrete wins: absolute LDX #addr replacing PC-relative LEAX addr,PCR; redundant CLRA/sign-extensions stripped before ABX indexing; 8-bit values kept narrow instead of promoted to 16-bit; and CMPB #0 / BLS fused into a single branch. These hit hardest in tight loops, which is why isort (−33%), collide (−24%) and bcdscore (−18%) move the most while arithmetic-bound kernels barely budge.
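To see why byte-oriented loops benefit most, it helps to picture the kind of source involved. The following is a hypothetical insertion sort in the spirit of isort, not the benchmark's actual file; the array name and the helper functions fill_descending and is_sorted are invented for this sketch.

```c
typedef unsigned char u8;

#define NBYTES 16

u8 data[NBYTES];

/* Insertion sort where every index, key and comparison is 8-bit.
   A compiler that keeps these narrow avoids 16-bit promotion work
   on each pass through the inner loop. */
void isort(void)
{
    u8 i, j, key;
    for (i = 1; i < NBYTES; i++) {
        key = data[i];
        j = i;
        while (j > 0 && data[j - 1] > key) {
            data[j] = data[j - 1];   /* shift larger element up */
            j--;
        }
        data[j] = key;
    }
}

/* Worst-case input: 16, 15, ..., 1. */
void fill_descending(void)
{
    u8 i;
    for (i = 0; i < NBYTES; i++)
        data[i] = (u8)(NBYTES - i);
}

/* Returns 1 when data[] is in non-decreasing order. */
int is_sorted(void)
{
    u8 i;
    for (i = 1; i < NBYTES; i++)
        if (data[i - 1] > data[i])
            return 0;
    return 1;
}
```

Nearly every statement in the inner loop is an indexed byte load, a byte compare or a byte store, so small per-instruction savings of the kind listed above are paid back on every iteration.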
Crucially, the main article’s central criticism still holds: cmoc still emits the per-function frame-pointer prologue (PSHS U / LEAU ,S / LEAS -n,S) by default. So #1 of “Why the differences” is unchanged — these wins come on top of that fixed tax, not by removing it.
Sidebar: cmoc has had -fomit-frame-pointer since 0.1.82, still opt-in. Adding it to 0.1.98 saves more again (e.g. memops 54→40, statem 127→120, isort 99→94). I deliberately left it off for the table above so the comparison stays apples-to-apples with the main article’s default-flag runs.
Bottom line: if you’re on an older CMOC, upgrading to 0.1.98 is a strict improvement at the same flags. It doesn’t change the article’s overall ranking — gcc6809 still optimizes best, vbcc is still the pragmatic pick — but it meaningfully narrows cmoc’s gap, especially on code size.



It’s pretty much what I expected.
Noteworthy: the gcc that comes with Vide, and that Peer and his students use, has some bugfixes that Peer and I implemented. In addition, Vide can be configured to use additional peepholes to further enhance the generated sources.
My own comparison of cmoc versus gcc was done years ago, but with the same result. Still glad we chose gcc.
Also noteworthy: the C toolchain that Vide uses can be run completely from the command line on Mac, Linux and Windows; all binaries are included.
Peer and his students use Makefiles…
Thanks for the up to date data!