CMOC vs VBCC vs GCC6809

How well do 6809 C compilers optimize?

Jun 08, 2026

A head-to-head measurement of code size and execution speed for the three C compilers in common use for the Motorola 6809 — the CPU inside the Vectrex, the TRS-80 Color Computer, and the Dragon. The question: given the same C source, how tight and how fast is the machine code each one produces?

All numbers below were produced on an Apple-Silicon Mac. Everything is reproducible (see the last section).

The contestants

cmoc — a purpose-built 6809 C compiler (Pierre Sarrazin). Version 0.1.67, -O2 (its maximum). Small, portable, easy to build; a light optimizer.
vbcc — a retargetable ISO-C compiler (Volker Barthelmann), using the vbcchc12 6809 backend at -O=255. Modern, genuinely optimizing, builds cleanly from source today.
gcc6809 — a port of GCC to the 6809, based on GCC 4.3.6 (dftools), at -O2. A real optimizing GCC backend — strength reduction, tail calls, etc. From 2008 and painful to build; here it runs as a prebuilt x86_64 binary under Rosetta.

On all three: int is 16-bit, long is 32-bit, pointers are 16-bit.

Methodology

The benchmarks

Eleven small kernels modelled on the kind of code a Vectrex game actually runs. They are deliberately freestanding and dependency-free: each uses only local typedefs for the integer widths (no headers, no BIOS, no libc), so the identical source text compiles on all three compilers and the comparison is of code generation, not libraries.

objmove — move 16 sprites by velocity, wrap the screen (struct-array indexing, 16-bit add/compare).
collide — O(n²) AABB overlap over 12 sprites (nested loops, 8-bit math, branches).
fixmul — scale 16 values by a Q8.8 factor (32-bit multiply, shifts).
rng — xorshift-16 PRNG, fill 32 bytes (16-bit shifts/xor).
memops — hand-rolled memcpy + memset (pointer post-increment loops).
strupr — copy + upper-case a string (byte loop, range compare).
checksum — sum + rolling-hash over 64 bytes (16-bit accumulate, 8-bit rotate).
isort — insertion sort of 16 bytes (nested loop, array shifting).
statem — switch-based game state machine (switch lowering: jump table vs if-chain).
bcdscore — add to a packed-BCD score (nibble math, carry propagation).
clamp — clamp a vector list to signed-8 range (signed min/max branches on 16-bit).
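
To make the freestanding style concrete, a kernel in the spirit of objmove could look roughly like the sketch below. The field names, the 6-byte struct layout, and the wrap bounds are assumptions made for illustration; the actual benchmark source lives in bench/src/ and may differ.

```c
/* Local typedefs instead of <stdint.h>: nothing is #included. */
typedef signed char   s8;
typedef unsigned char u8;
typedef short         s16;  /* 16-bit on the 6809 compilers and on common hosts */

#define NUM_OBJS   16
#define SCREEN_MIN (-128)   /* hypothetical wrap bounds */
#define SCREEN_MAX 127

typedef struct {
    s16 x, y;               /* position */
    s8  dx, dy;             /* velocity */
} Obj;                      /* 6 bytes with this assumed layout */

Obj objs[NUM_OBJS];

/* Move every object by its velocity and wrap at the screen edges. */
void obj_update(void)
{
    u8 i;
    for (i = 0; i < NUM_OBJS; i++) {
        Obj *o = &objs[i];                  /* objs + i * sizeof(Obj) */
        o->x += o->dx;                      /* 16-bit add */
        o->y += o->dy;
        if (o->x > SCREEN_MAX)      o->x = SCREEN_MIN;   /* 16-bit compares */
        else if (o->x < SCREEN_MIN) o->x = SCREEN_MAX;
        if (o->y > SCREEN_MAX)      o->y = SCREEN_MIN;
        else if (o->y < SCREEN_MIN) o->y = SCREEN_MAX;
    }
}
```

Because the only types are local typedefs and the only storage is a global array, the same text should be acceptable to any of the three compilers without headers or a runtime.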

Measuring size

Each kernel is compiled and assembled; the byte count of the generated code section is read directly from the object/listing. Runtime-helper bodies (multiply routines, etc.) live in libraries and are not counted — but the call sites that invoke them are, so a compiler that calls out instead of inlining still pays for the call.

Measuring speed

Speed is a dynamic cycle count, not an estimate. Each compiler’s machine code is linked to a flat image and executed in a cycle-accurate 6809 simulator until it returns to a sentinel address; the elapsed 6809 cycles are reported.

To isolate the kernel from the test harness, every kernel is wrapped in a no-argument function run() that initializes fixed inputs and calls the kernel. Each driver is built twice — once normally and once with the kernel call removed — and the baseline is subtracted. Because the setup code writes to global arrays (observable side effects), the optimizer can’t delete it, so the subtraction cleanly yields kernel cycles = full − baseline. Loop trip counts are compile-time constants, so each figure is a single deterministic run.
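
A minimal sketch of such a driver is shown below. The BASELINE macro, the stand-in copy kernel, and the helper kernel_cycles() are hypothetical names chosen for this example, not taken from the benchmark's drivers in bench/drv/.

```c
typedef unsigned char u8;

#define N 32

u8 src[N];
u8 dst[N];

/* Stand-in kernel (hypothetical): a hand-rolled byte copy. */
static void kernel(u8 *d, const u8 *s, u8 n)
{
    while (n--)
        *d++ = *s++;
}

/* Built twice: once as-is ("full") and once with BASELINE defined. */
void run(void)
{
    u8 i;
    for (i = 0; i < N; i++) {       /* setup: stores to globals are observable */
        src[i] = (u8)(i * 3 + 1);
        dst[i] = 0;
    }
#ifndef BASELINE
    kernel(dst, src, N);            /* the only difference between the builds */
#endif
}

/* Host-side arithmetic on the two cycle counts read from the simulator. */
unsigned long kernel_cycles(unsigned long full, unsigned long baseline)
{
    return full - baseline;
}
```

The setup loop is identical in both builds, so whatever it costs cancels out in the subtraction.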

Results

Code size — bytes (lower is better)

kernel       cmoc    vbcc   gcc6809
objmove       148     119        80
collide       206     203       131
fixmul        112     116        89
rng           118      93        64
memops         58      56        41
strupr         54      49        46
checksum       90     101        59
isort         117     111        59
statem        149     132       132
bcdscore      159     105       108
clamp         122     103       139
------------------------------------
TOTAL        1333    1188       948

Normalized (gcc6809 = 1.00): cmoc 1.41 · vbcc 1.25 · gcc6809 1.00.

gcc6809 produces the smallest code on 8 kernels outright and ties vbcc on statem; vbcc wins bcdscore and clamp; CMOC wins none. CMOC is the largest on 8 of the 11 kernels (vbcc is larger on fixmul and checksum, gcc6809 on clamp), primarily due to a per-function frame-pointer prologue (see below).

Speed — kernel cycles (lower is better)

kernel       cmoc    vbcc   gcc6809
objmove      3648    4300      1509
collide     20873   12110      7644
fixmul      15096   44611       n/a   (see note)
rng          7360    6068      3583
memops       4152    2434      1726
strupr       1810    1197      1444
checksum     9316    9090      3657
isort       17329    8703      6553
statem        178      88       115
bcdscore      650     354       342
clamp        2695    1665      1473

Note: gcc6809 fixmul needs __mulsi3 (32-bit multiply), a libgcc routine not bundled with the prebuilt toolchain, so it can’t be linked/run. Its size (89 B, smallest) was still measured.

Aggregate over the ten runnable kernels (gcc6809 = 1.00): cmoc 2.43 · vbcc 1.64 · gcc6809 1.00. On this mix gcc6809 is roughly 2.4× faster than cmoc and 1.6× faster than vbcc. It wins 8 of the 10; VBCC takes strupr and statem. Outside the aggregate, CMOC takes fixmul, the one kernel gcc6809 could not run.
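
The aggregate can be rechecked from the table with a few lines of C. This sketch assumes the aggregate is the ratio of summed cycles over the ten kernels that all three compilers can run; the article does not spell out the formula, and the function names are made up for the example.

```c
#define RUNNABLE 10

/* Kernel cycles from the speed table, fixmul omitted (no gcc6809 figure).
   Order: objmove collide rng memops strupr checksum isort statem bcdscore clamp */
const unsigned long cmoc_cycles[RUNNABLE] =
    { 3648, 20873, 7360, 4152, 1810, 9316, 17329, 178, 650, 2695 };
const unsigned long vbcc_cycles[RUNNABLE] =
    { 4300, 12110, 6068, 2434, 1197, 9090, 8703, 88, 354, 1665 };
const unsigned long gcc_cycles[RUNNABLE] =
    { 1509, 7644, 3583, 1726, 1444, 3657, 6553, 115, 342, 1473 };

/* Sum the cycle counts of n kernels. */
unsigned long total_cycles(const unsigned long *c, int n)
{
    unsigned long sum = 0;
    int i;
    for (i = 0; i < n; i++)
        sum += c[i];
    return sum;
}

/* Summed cycles of one compiler relative to a reference compiler. */
double normalized(const unsigned long *c, const unsigned long *ref, int n)
{
    return (double)total_cycles(c, n) / (double)total_cycles(ref, n);
}
```

Under that assumption the totals are 68011, 46009 and 28046 cycles, and the ratios against gcc6809 agree with the quoted figures to within rounding.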

Whole-program size — a complete Pong

The kernels are tiny hot loops. To see whether the same ranking holds for a real program, pong.c is a complete Vectrex Pong: ball physics, paddle AI, paddle/ball collision, scoring, a center-court draw list, score printing, and a main game loop. It’s written the same fair way — portable C logic with the Vectrex BIOS routines as uniform extern calls (so every compiler emits the same JSR at each call site, and the BIOS bodies in ROM are not counted, exactly like the runtime-helper bodies in the kernels).

Code size of the whole program:

program      cmoc    vbcc   gcc6809
pong         1092     874       789

Normalized (gcc6809 = 1.00): cmoc 1.38 · vbcc 1.11 · gcc6809 1.00.

The ranking is unchanged — gcc6809 smallest, CMOC largest — but the shape shifts in an instructive way: VBCC nearly catches gcc6809 (1.11×) on real program code, much closer than its 1.25× on the kernel size aggregate. The reason is that gcc6809’s largest advantages come from arithmetic-heavy inner loops (strength-reducing index multiplies, etc.), and those are a much smaller fraction of a whole game than of the micro-kernels. CMOC stays ~1.4× larger: its per-function frame-pointer prologue is a fixed tax that scales with the number of functions (Pong has ~12). There’s no speed figure for Pong — it’s a BIOS-driven infinite frame loop, not a bounded kernel.

Why the differences — at the assembly level

Three recurring code-generation decisions explain most of the spread.

1. cmoc sets up a frame pointer on every function

CMOC gives each function a U-based frame even when it isn’t needed:

_obj_update:
        PSHS    U          ; save caller's U
        LEAU    ,S         ; U = frame pointer
        LEAS    -3,S       ; reserve locals
        ...

That’s bytes and cycles on every call. VBCC and gcc6809 address locals off S directly and omit the frame pointer (-fomit-frame-pointer is gcc’s default at -O2), which is a major contributor to CMOC’s size and speed deficit.

2. 16-bit multiply: inline vs helper-call vs eliminated

The array index &objs[i] is objs + i*6. Watch the three strategies.

CMOC inlines the small constant multiply with the hardware MUL:

        LDB   -1,U          ; i
        LDA   #6            ; sizeof(Obj)
        MUL                 ; D = i*6
        LEAX  D,X

VBCC calls its runtime helper for the general 16-bit multiply:

        jsr   (__mulint16)

Compact at the call site, but the call/return and the routine itself cost cycles. This is why VBCC is slower than CMOC on objmove (4300 vs 3648 cycles) despite being smaller (119 vs 148 bytes).

gcc6809 does strength reduction: it never multiplies inside the loop at all. It keeps a running pointer and running products, and just adds to them on each iteration:

L11:    sty   ,x           ; store position
        ...
        leax  6,x          ; advance &objs[i] by sizeof(Obj)
        leay  200,y        ; next i*200 without a multiply
        leau  111,u        ; next i*111 without a multiply
        ...
        jmp   _obj_update  ; tail call (no jsr/rts overhead)

Eliminating the multiplies and tail-calling is why gcc6809’s objmove runs in 1509 cycles versus 3648 (cmoc) and 4300 (vbcc).
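
Expressed back at the C level, the transformation looks roughly like the pair of functions below. This is a sketch under assumptions: the Obj layout is hypothetical, and the constants 200 and 111 are taken from the assembly comments above on the guess that they are per-object products stored into the position fields.

```c
typedef signed char   s8;
typedef unsigned char u8;
typedef short         s16;

#define NUM_OBJS 16

typedef struct { s16 x, y; s8 dx, dy; } Obj;   /* assumed 6-byte layout */

/* As written in the source: an index multiply and two products per iteration. */
void init_indexed(Obj *objs)
{
    u8 i;
    for (i = 0; i < NUM_OBJS; i++) {
        objs[i].x = (s16)(i * 200);
        objs[i].y = (s16)(i * 111);
    }
}

/* What strength reduction effectively produces: a running pointer and
   running products, updated by additions only. */
void init_reduced(Obj *objs)
{
    Obj *p = objs;
    s16 x = 0, y = 0;
    u8 n = NUM_OBJS;
    while (n--) {
        p->x = x;
        p->y = y;
        p++;            /* corresponds to: leax 6,x   */
        x += 200;       /* corresponds to: leay 200,y */
        y += 111;       /* corresponds to: leau 111,u */
    }
}
```

Both functions fill the array identically; the second simply never asks the CPU for a multiply, which is the form gcc6809 derives on its own from the first.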

3. 32-bit multiply: the one CMOC wins

fixmul’s (s32)vin[i] * factor is a 32-bit multiply. Here CMOC’s hand-written, MUL16-based long-multiply path is markedly tighter than VBCC’s __mulint32: 15096 vs 44611 cycles, which makes CMOC ~3× faster. gcc6809 would likely do well too (its code is the smallest), but its 32-bit multiply helper isn’t in the prebuilt library, so it can’t be timed. The lesson: a compiler that wins in aggregate can still have specific weak spots, and VBCC’s 32-bit math is one.
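
For reference, a Q8.8 scaling loop built around that expression could be sketched as follows. The function signature and array length are assumptions for illustration, and right-shifting a negative value is implementation-defined in C (arithmetic on the usual compilers).

```c
typedef unsigned char u8;
typedef short         s16;
typedef long          s32;  /* at least 32 bits; 32 on the three 6809 compilers */

#define N 16

/* Scale N values by a Q8.8 factor: 0x0100 is 1.0, 0x0180 is 1.5. */
void fixmul(s16 *vout, const s16 *vin, s16 factor)
{
    u8 i;
    for (i = 0; i < N; i++) {
        /* Widen before multiplying: with a 16-bit int the product of two
           s16 values would overflow, hence the 32-bit multiply. */
        vout[i] = (s16)(((s32)vin[i] * factor) >> 8);
    }
}
```

The cast is what forces each compiler onto its 32-bit multiply path, so this one line decides whether the kernel is cheap or expensive.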

Per-compiler verdict

gcc6809: best optimizer, worst ergonomics. Smallest code, fastest on most kernels, often by ~2×, thanks to real optimization passes (strength reduction, tail calls, good register allocation, frame-pointer omission). The catch: it’s a 2008-vintage GCC 4.3.6 that is genuinely hard to build on a modern host. On Apple Silicon, a natively built cc1 crashes on any 2-argument function, so it has to be run as a prebuilt x86_64 binary under Rosetta. If you can get it working, it produces the best code.

VBCC: the pragmatic choice. A modern, actively maintained, retargetable compiler that builds cleanly from source in minutes. ~25% larger and ~1.6× slower than gcc6809 on the kernels, but it closes most of that size gap on a whole program (1.11×), and it’s comfortably better than cmoc almost everywhere. Its notable weakness is 32-bit multiply. For most projects, this is the best effort-to-quality ratio.

CMOC: simplest, most portable, largest/slowest. Builds natively everywhere with no fuss and is easy to reason about, but it’s a light optimizer: a frame-pointer prologue on every function and helper-heavy 16-bit math make it the biggest and (usually) slowest. It does beat VBCC on objmove (inline MUL beats VBCC’s helper call) and wins fixmul (its 32-bit multiply is excellent).

A rough way to hold all this in your head:

size  (kernels):  gcc6809  <  vbcc  <  cmoc     (1.00 : 1.25 : 1.41)
size  (pong)   :  gcc6809  <  vbcc  <  cmoc     (1.00 : 1.11 : 1.38)
speed (kernels):  gcc6809  <  vbcc  <  cmoc     (1.00 : 1.64 : 2.43)
ease  to build :  cmoc     <  vbcc  <<  gcc6809 (gcc6809 hardest)

Caveats/threats to validity

Synthetic kernels. These are small, hot-loop-style routines with compile-time-constant trip counts. They over-represent arithmetic and loops and under-represent large functions with heavy register pressure (where the 6809’s tiny register set hurts all three). Real game frames also spend most of their time in BIOS/vector-drawing code that’s identical across compilers.
One optimization setting each (cmoc -O2, vbcc -O=255, gcc -O2). A size-focused pass could shift the size numbers.
Helper bodies excluded from size, included in speed (via the call). Fair for “how good is the compiler”, but a multiply-heavy kernel’s total ROM footprint also depends on the runtime library.
gcc6809 fixmul is missing because its 32-bit-multiply library routine isn’t bundled; treat the gcc6809 speed aggregate as “10 of 11 kernels”.
gcc6809 runs under Rosetta, but that only affects the host build/run time, not the 6809 code it emits — the generated code is identical to that of a native run.

Reproduce

Find the benchmark here: https://github.com/rogerboesch/vectreC

bench/measure.sh         # 3-way code-size table (+ the Pong whole-program line)
bench/measure_speed.sh   # 3-way dynamic cycle-count table

Both scripts point at the toolchains and the cycle-accurate simulator; kernel sources are in bench/src/, the timing drivers in bench/drv/.

Addendum: cmoc 0.1.98 (June 2026)

A follow-up to the CMOC vs VBCC vs GCC6809 benchmark above. The main article measured cmoc 0.1.67. Since then, Pierre Sarrazin released 0.1.98 (2026-06-06), so I rebuilt it from source and re-ran the identical kernels.

Same method, same flags. Default --vectrex codegen, no -fomit-frame-pointer — exactly as in the main article, so the only thing that changed is the compiler version. (Building 0.1.98 needs a recent lwtools; 4.17 can’t assemble its stdlib, lwtools 4.24 can.)

Result: a free, near across-the-board win

cmoc 0.1.98 is ~9% smaller overall and faster on 10 of 11 kernels than 0.1.67 — no source changes required.

            CODE SIZE (bytes)        SPEED (kernel cycles)
kernel    0.1.67  0.1.98   Δ        0.1.67   0.1.98     Δ
objmove      148     150  +1.4%       3648     3741   +2.5%
collide      206     173 -16.0%      20873    15945  -23.6%
fixmul       112     109  -2.7%      15096    14896   -1.3%
rng          118     113  -4.2%       7360     7093   -3.6%
memops        58      54  -6.9%       4152     3886   -6.4%
strupr        54      50  -7.4%       1810     1696   -6.3%
checksum      90      83  -7.8%       9316     8593   -7.8%
isort        117      99 -15.4%      17329    11536  -33.4%
statem       149     127 -14.8%        178      153  -14.0%
bcdscore     159     139 -12.6%        650      536  -17.5%
clamp        122     117  -4.1%       2695     2613   -3.0%
-----------------------------------------------------------
TOTAL       1333    1214  -8.9%

The only regression is objmove (a narrower path for one address-of-then-store pattern), and it’s marginal.
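
The Δ columns are plain percentage changes relative to 0.1.67. A small helper (the name delta_pct is made up for this sketch) makes the arithmetic explicit:

```c
/* Percentage change from an old measurement to a new one.
   Negative means smaller or faster; positive means a regression. */
double delta_pct(unsigned long before, unsigned long after)
{
    return ((double)after - (double)before) * 100.0 / (double)before;
}
```

For example, delta_pct(1333, 1214) gives the overall size change and delta_pct(3648, 3741) the objmove speed regression.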

Where it lands against the others

Re-normalizing the main article’s tables with the new cmoc column (gcc6809 = 1.00):

                       cmoc 0.1.67   cmoc 0.1.98   vbcc   gcc6809
size  (kernels)            1.41          1.28       1.25     1.00
speed (10 runnable)        2.43          1.99       1.64     1.00

On size, cmoc essentially pulls level with vbcc (1.28 vs 1.25) — it now even produces the smallest code on statem and beats vbcc on collide, fixmul, memops, checksum and isort. On speed it closes roughly a third of the gap to gcc6809, though vbcc and gcc6809 stay ahead. fixmul is still cmoc’s standout win (its 32-bit multiply remains ~3× faster than vbcc’s).

Why — and what didn’t change

The gains are entirely from the low-level peephole optimiser maturing across 31 releases, not any architectural change. Diffing the isort inner loop shows the concrete wins: absolute LDX #addr replacing PC-relative LEAX addr,PCR; redundant CLRA/sign-extensions stripped before ABX indexing; 8-bit values kept narrow instead of promoted to 16-bit; and CMPB #0 / BLS fused into a single branch. These hit hardest in tight loops, which is why isort (−33%), collide (−24%) and bcdscore (−18%) move the most while arithmetic-bound kernels barely budge.
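
For context, the kind of loop those peephole fixes act on looks like the sketch below: an insertion sort over 16 bytes in which every value and index is 8 bits wide. This is an assumed reconstruction of an isort-style kernel, not the benchmark's actual source.

```c
typedef unsigned char u8;

#define N 16

/* Insertion sort of N bytes, ascending. All values and indices are u8, so a
   compiler that keeps them narrow avoids 16-bit promotion in the inner loop. */
void isort(u8 *a)
{
    u8 i;
    for (i = 1; i < N; i++) {
        u8 key = a[i];
        u8 j = i;
        while (j > 0 && a[j - 1] > key) {   /* shift larger elements up */
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}
```

Nearly all of the run time is spent in the inner while loop, so a few bytes shaved from its compare-and-index sequence are multiplied across every shift.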

Crucially, the main article’s central criticism still holds: cmoc still emits the per-function frame-pointer prologue (PSHS U / LEAU ,S / LEAS -n,S) by default. So #1 of “Why the differences” is unchanged — these wins come on top of that fixed tax, not by removing it.

Sidebar: cmoc has had -fomit-frame-pointer since 0.1.82, still opt-in. Adding it to 0.1.98 saves more again (e.g. memops 54→40, statem 127→120, isort 99→94). I deliberately left it off for the table above so the comparison stays apples-to-apples with the main article’s default-flag runs.

Bottom line: if you’re on an older CMOC, upgrading to 0.1.98 is an almost universal improvement at the same flags, with objmove the lone, marginal regression. It doesn’t change the article’s overall ranking (gcc6809 still optimizes best, vbcc is still the pragmatic pick), but it meaningfully narrows cmoc’s gap, especially on code size.

Christopher Salomon

It’s pretty much what I expected.

Noteworthy - the gcc that comes with Vide and is used by Peer and his students - has some bugfixes that Peer and I implemented. In addition - Vide can be configured to use additional peepholes to enhance the produced sources further.

My own comparison of cmoc versus gcc was done years ago - but with the same result. Still glad we chose gcc.

Noteworthy - the C toolchain that Vide uses - can also be run completely from the command line - on Mac, Linux and Windows - all binaries are included.

Peer and his students use Makefiles…

Thanks for the up to date data!

Vector & Vertex