VRULL | Branches Cost More Than You Think

RISC‑V’s minimalist base ISA deliberately omits conditional select. This creates a class of problems where the direction of a branch determines whether the register allocator can produce optimal code. Zicond’s two instructions — czero.eqz and czero.nez — eliminate this entire problem class.

A deceptively simple function

Consider the following:

long test(long a, long b, long c) {
    return (!c ? a : b);
}

On AArch64, the compiler generates exactly what you’d expect:

cmp   x2, 0
csel  x0, x0, x1, eq
ret

Three instructions, branchless. The csel instruction selects between x0 and x1 based on the flags — no branch predictor involved, no register pressure issues, and the result is already in the return register.

On RISC‑V rv64gc (without Zicond), GCC generates this:

bne   a2, zero, .L2
mv    a1, a0
.L2:
mv    a0, a1
ret

Four instructions, one branch, and — critically — two mv instructions where one is redundant. The value a is already in a0 (the return register), yet the compiler copies it to a1 only to copy it back. This isn’t a bug in GCC. It’s a structural consequence of a missing instruction.

If GCC had chosen the opposite branch direction, the code would be:

beq   a2, zero, .L2
mv    a0, a1
.L2:
ret

Three instructions, one move. But the compiler didn’t choose this direction — and the reason it didn’t reveals a deeper problem.

With Zicond enabled (rv64gc_zicond), GCC generates:

czero.eqz  a1, a1, a2
czero.nez  a0, a0, a2
add        a0, a0, a1
ret

Four instructions, zero branches, zero redundant moves. The branch-polarity problem doesn’t arise because there is no branch.

What `czero` actually does

RISC‑V’s base ISA has seqz — a pseudoinstruction that expands to sltiu rd, rs, 1 and produces a Boolean: rd = (rs == 0) ? 1 : 0. A zero-or-one flag is useful for arithmetic, but it cannot select between two arbitrary values.

Zicond’s czero.eqz and czero.nez operate on full register values:

czero.eqz rd, rs1, rs2 — rd = (rs2 == 0) ? 0 : rs1
czero.nez rd, rs1, rs2 — rd = (rs2 != 0) ? 0 : rs1

The result is either zero or the full value of rs1 — not a Boolean, but the actual data you want to keep. This is conditional zeroing, not conditional moving.

To select between two values a and b based on a condition, you combine them:

czero.eqz  t0, a, cond    // t0 = cond ? a : 0
czero.nez  t1, b, cond    // t1 = cond ? 0 : b
add        rd, t0, t1     // rd = cond ? a : b

Since exactly one of t0 and t1 is zero, the add produces whichever value was selected. This achieves the same conditional select as AArch64’s csel — but decomposed into two-input operations that fit RISC‑V’s encoding without requiring a three-read-port register file.
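The semantics above can be modelled in plain C. This is a minimal sketch, not how any compiler implements the instructions: the helper names czero_eqz, czero_nez, select_czero and check_select are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Model of czero.eqz rd, rs1, rs2: rd = (rs2 == 0) ? 0 : rs1 */
static inline int64_t czero_eqz(int64_t rs1, int64_t rs2) {
    return rs2 == 0 ? 0 : rs1;
}

/* Model of czero.nez rd, rs1, rs2: rd = (rs2 != 0) ? 0 : rs1 */
static inline int64_t czero_nez(int64_t rs1, int64_t rs2) {
    return rs2 != 0 ? 0 : rs1;
}

/* cond ? a : b, built from the czero.eqz / czero.nez / add sequence. */
int64_t select_czero(int64_t cond, int64_t a, int64_t b) {
    int64_t t0 = czero_eqz(a, cond);  /* t0 = cond ? a : 0 */
    int64_t t1 = czero_nez(b, cond);  /* t1 = cond ? 0 : b */
    return t0 + t1;                   /* one operand is always zero */
}

/* Hypothetical self-check: any non-zero cond selects a, zero selects b. */
void check_select(void) {
    assert(select_czero(1, 10, 20) == 10);
    assert(select_czero(0, 10, 20) == 20);
    assert(select_czero(-5, 10, 20) == 10);
}
```

Because one of the two addends is always zero, the add can never overflow, whatever the values of a and b.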

Other architectures solved this with conditional moves: x86 has cmov, AArch64 has csel. RISC‑V deliberately chose a smaller primitive. Conditional zeroing is sufficient to synthesise all the same patterns, requires simpler hardware, and — on wide-issue cores — the two czero instructions are independent and can execute in parallel.

The branch-polarity problem

The suboptimal code in the opening example is not caused by any single compiler pass making a bad decision. It emerges from an interaction between multiple passes, each acting reasonably in isolation.

At the GIMPLE level, GCC’s sink2 pass rearranges the control-flow graph for legitimate optimisation reasons — improving code motion or reducing register pressure in other contexts. As a side effect, it may swap which edge of a conditional carries the empty basic block. This is semantics-preserving: the program still computes the same result.

During RTL expansion, the expander picks BNE vs BEQ based on which basic block is the fall-through successor. This is a layout decision, not a semantic one — the expander is choosing instruction encoding, not program meaning.

In the conditional execution pass (ce1), GCC finds the IF-THEN-JOIN diamond pattern that could benefit from conditional execution. But it cannot convert it: there is no conditional select instruction on the target. The pass notes the opportunity and moves on.

The register allocator is now stuck. Variable a is already in a0 (the return register), so the path on which a wins should need no work at all. With the “wrong” direction, however, that path is the fall-through, and the value that reaches the join point lives in a1. The RA must therefore insert a mv a1, a0 on the fall-through path to put a where the join expects it, and then a mv a0, a1 after the join point to move whichever value won into the return register. Compared with the opposite polarity, one of these two copies is redundant.

The key insight: on architectures without conditional select, branch direction is a register allocation constraint, not just a microarchitectural preference. The “wrong” direction forces the register allocator to copy values that were already in the right place.
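To make the cost concrete, the two listings from the opening example can be mirrored in C, with parameters standing in for a0, a1 and a2 and a counter for executed mv instructions. This is only an illustrative model; the function names and the copies counter are hypothetical and say nothing about GCC's internals.

```c
#include <assert.h>

static int copies;  /* mv instructions executed by the most recent call */

/* Mirrors the bne listing: the fall-through path (c == 0) copies a into a1. */
long select_bne(long a0, long a1, long a2) {
    copies = 0;
    if (a2 == 0) {            /* bne a2, zero, .L2 is not taken */
        a1 = a0; copies++;    /* mv a1, a0 */
    }
    a0 = a1; copies++;        /* .L2: mv a0, a1 */
    return a0;
}

/* Mirrors the beq listing: the path where a wins does no work. */
long select_beq(long a0, long a1, long a2) {
    copies = 0;
    if (a2 != 0) {            /* beq a2, zero, .L2 is not taken */
        a0 = a1; copies++;    /* mv a0, a1 */
    }
    return a0;                /* .L2: ret */
}

/* Same results on both paths, different number of executed copies. */
void check_polarity(void) {
    assert(select_bne(10, 20, 0) == 10 && copies == 2);
    assert(select_beq(10, 20, 0) == 10 && copies == 0);
    assert(select_bne(10, 20, 1) == 20 && copies == 1);
    assert(select_beq(10, 20, 1) == 20 && copies == 1);
}
```

Both shapes return the same value for every input. Only the path where a is selected differs: two copies with one polarity, none with the other.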

This is not a bug in any single pass. It is a fundamental interaction between CFG optimisation, RTL expansion, and register allocation that only manifests on targets lacking conditional select.

The broader problem class

The ternary operator is the simplest case. The same branch-polarity sensitivity affects a range of common patterns.

Min/max idioms. a < b ? a : b — branch direction determines whether the “already correct” value needs copying:

rv64gc	rv64gc_zicond
`ble a1,a0,.L5`	`sgt a5,a1,a0`
`mv a1,a0`	`czero.nez a1,a1,a5`
`.L5: mv a0,a1`	`czero.eqz a0,a0,a5`
`ret`	`add a0,a0,a1` / `ret`
4 insns, 1 branch, 2 moves	5 insns, branchless

(With Zbb, min a0,a0,a1 is a single instruction. But without Zbb, Zicond provides the branchless alternative.)

Saturating clamp. clamp(x, lo, hi) — two conditionals, both subject to polarity problems:

rv64gc	rv64gc_zicond
`bge a0,a1,.L13`	`slt a5,a0,a1`
`mv a0,a1`	`czero.nez a0,a0,a5`
`.L13: ble a0,a2,.L14`	`czero.eqz a5,a1,a5`
`mv a0,a2`	`add a5,a5,a0`
`.L14: ret`	`sgt a0,a5,a2`
	`czero.nez a5,a5,a0`
	`czero.eqz a0,a2,a0`
	`add a0,a0,a5` / `ret`
5 insns, 2 branches	9 insns, branchless

More instructions, but completely branchless. In a tight loop where x lands unpredictably below lo, inside the range, or above hi, the branch predictor sees near-random outcomes on both conditionals. Two mispredicted branches per iteration easily cost more than four extra ALU operations. This is especially true on microarchitectures that can process multiple independent ALU instructions per cycle: the two instructions in each czero.eqz/czero.nez pair share only the condition register and do not depend on each other’s results, so a dual-issue or wider core can execute them in the same cycle.
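The min and clamp listings above can be followed step by step in a C model. This is a sketch under stated assumptions: czero_eqz and czero_nez are hypothetical helpers that mimic the two instructions, and clamp_czero assumes lo <= hi.

```c
#include <assert.h>
#include <stdint.h>

/* Models of the two Zicond instructions; rs2 is the condition. */
static inline int64_t czero_eqz(int64_t rs1, int64_t rs2) { return rs2 == 0 ? 0 : rs1; }
static inline int64_t czero_nez(int64_t rs1, int64_t rs2) { return rs2 != 0 ? 0 : rs1; }

/* a < b ? a : b, following the Zicond min listing. */
int64_t min_czero(int64_t a, int64_t b) {
    int64_t gt = b > a;                          /* sgt a5,a1,a0 */
    return czero_eqz(a, gt) + czero_nez(b, gt);  /* czero pair, then add */
}

/* clamp(x, lo, hi) as two chained selects; assumes lo <= hi. */
int64_t clamp_czero(int64_t x, int64_t lo, int64_t hi) {
    int64_t below = x < lo;                                  /* slt */
    int64_t t = czero_nez(x, below) + czero_eqz(lo, below);  /* t = max(x, lo) */
    int64_t above = t > hi;                                  /* sgt */
    return czero_nez(t, above) + czero_eqz(hi, above);       /* min(t, hi) */
}

/* Hypothetical self-check covering each side of both comparisons. */
void check_min_clamp(void) {
    assert(min_czero(3, 7) == 3);
    assert(min_czero(7, 3) == 3);
    assert(clamp_czero(-4, 0, 10) == 0);
    assert(clamp_czero(5, 0, 10) == 5);
    assert(clamp_czero(42, 0, 10) == 10);
}
```

In each select, the two czero results depend only on the comparison and on their own data operand, which is why the pair has no dependency between its two halves.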

Zero-if-negative. x < 0 ? 0 : x — both versions are branchless, but Zicond is shorter and clearer:

rv64gc	rv64gc_zicond
`not a5,a0`	`slti a5,a0,0`
`srai a5,a5,63`	`czero.nez a0,a0,a5`
`and a0,a0,a5`	`ret`
`ret`
4 insns (bit manipulation trick)	3 insns (direct intent)

The base ISA version uses not/srai/and to construct a sign-derived mask — correct but obscure. Zicond expresses the programmer’s intent directly.
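The two shapes can be compared side by side in C. This is a sketch with hypothetical function names. It also assumes that >> on a negative int64_t is an arithmetic shift, which is implementation-defined in C but is the behaviour of GCC and Clang.

```c
#include <assert.h>
#include <stdint.h>

/* Base-ISA shape: not / srai / and.
   Assumes >> on a negative int64_t is an arithmetic shift. */
int64_t zero_if_negative_mask(int64_t x) {
    int64_t mask = ~x >> 63;   /* all ones when x >= 0, zero when x < 0 */
    return x & mask;
}

/* Zicond shape: slti / czero.nez. */
int64_t zero_if_negative_czero(int64_t x) {
    int64_t neg = x < 0;       /* slti a5,a0,0 */
    return neg != 0 ? 0 : x;   /* czero.nez a0,a0,a5 */
}

/* Hypothetical self-check: both shapes agree, including at the extremes. */
void check_zero_if_negative(void) {
    assert(zero_if_negative_mask(5) == 5);
    assert(zero_if_negative_mask(-5) == 0);
    assert(zero_if_negative_czero(5) == 5);
    assert(zero_if_negative_czero(-5) == 0);
    assert(zero_if_negative_mask(INT64_MIN) == zero_if_negative_czero(INT64_MIN));
    assert(zero_if_negative_mask(INT64_MAX) == zero_if_negative_czero(INT64_MAX));
}
```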

Where Zicond doesn’t help

Not every conditional pattern benefits from Zicond, and it is important to understand the limits.

Absolute value. x < 0 ? -x : x — GCC already recognises this idiom and generates the optimal branchless sequence on both targets:

srai  a5, a0, 63
xor   a0, a5, a0
sub   a0, a0, a5
ret

Three ALU instructions plus the ret, using the sign-mask/XOR/SUB identity. Zicond has nothing to contribute here: the algebraic trick is already optimal.
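For readers unfamiliar with the identity, here is a minimal C rendering of the srai/xor/sub sequence. The function names are hypothetical, and the sketch assumes an arithmetic right shift for negative values (implementation-defined in C, but what GCC and Clang do). The subtraction is done in uint64_t so that it wraps the way the hardware does rather than overflowing a signed type.

```c
#include <assert.h>
#include <stdint.h>

/* |x| via the sign mask: mask is 0 for x >= 0 and -1 for x < 0. */
int64_t abs_mask(int64_t x) {
    int64_t mask = x >> 63;                      /* srai a5,a0,63 */
    uint64_t flipped = (uint64_t)(x ^ mask);     /* xor  a0,a5,a0 */
    return (int64_t)(flipped - (uint64_t)mask);  /* sub  a0,a0,a5 */
}

/* Hypothetical self-check on values whose absolute value is representable. */
void check_abs(void) {
    assert(abs_mask(5) == 5);
    assert(abs_mask(-5) == 5);
    assert(abs_mask(0) == 0);
    assert(abs_mask(-INT64_MAX) == INT64_MAX);
}
```

When x is negative, the XOR yields ~x and subtracting -1 adds one, giving the two's-complement negation. When x is non-negative, both steps are no-ops.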

Conditional increment. cond ? x + 1 : x — GCC recognises this as x + (cond != 0):

snez  a1, a1
add   a0, a0, a1
ret

Two ALU instructions plus the ret, already optimal. No conditional select needed.
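The rewrite is easy to state in C. This is a small sketch with hypothetical function names, not the compiler's actual transformation:

```c
#include <assert.h>

/* cond ? x + 1 : x rewritten as x + (cond != 0). */
long cond_increment(long x, long cond) {
    long flag = (cond != 0);   /* snez a1,a1 */
    return x + flag;           /* add  a0,a0,a1 */
}

/* Hypothetical self-check: any non-zero cond adds exactly one. */
void check_cond_increment(void) {
    assert(cond_increment(41, 1) == 42);
    assert(cond_increment(41, 0) == 41);
    assert(cond_increment(41, -7) == 42);
}
```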

Guarded loads. p ? *p : default — the branch is necessary:

beq   a0, zero, .L
lw    a0, 0(a0)
ret
.L:
mv    a0, a1
ret

Zicond cannot help here. The branch guards a load from a potentially NULL pointer — you cannot speculatively execute a load from address zero. The conditional protects a side-effecting operation, not a selection between two already-available values.

Summary

Pattern	rv64gc	rv64gc_zicond	Improvement
Ternary select	4 insns, 1 br, 2 mv	4 insns, branchless	Eliminates branch + redundant mv
Min	4 insns, 1 br, 2 mv	5 insns, branchless	Eliminates branch + redundant mv
Max	4 insns, 1 br, 2 mv	5 insns, branchless	Eliminates branch + redundant mv
Clamp (2 conditionals)	5 insns, 2 br	9 insns, branchless	Eliminates 2 unpredictable branches
Zero-if-negative	4 insns, branchless	3 insns, branchless	1 insn shorter, clearer
Abs	4 insns, branchless	4 insns, branchless	No change (already optimal)
Guarded load	5 insns, 1 br	5 insns, 1 br	No change (branch required)
Conditional increment	3 insns, branchless	3 insns, branchless	No change (already optimal)

Zicond’s biggest wins are on patterns where the compiler cannot find a branchless algebraic identity on its own — ternary select, min, max, clamp. For patterns where such identities exist (abs, conditional increment), GCC already does well. And for patterns where the branch guards a side-effecting operation (guarded load), Zicond correctly doesn’t apply.

Why RISC‑V didn’t have conditional select originally

RISC‑V’s base ISA was designed with deliberate minimalism. Every instruction must justify its encoding space, and the original rationale for omitting conditional select was straightforward: branches are “good enough” for conditional patterns, and branch predictors handle the common case.

That reasoning underestimated the compiler-side cost. The branch-polarity problem is a static codegen quality issue. Even with perfect branch prediction, the extra moves consume issue slots, increase code size, and create unnecessary register pressure. The performance cost is paid on every execution, not just on mispredictions.

How Zicond was born

Zicond didn’t start as an ISA proposal. It started as a performance problem.

We were working on SPEC CPU2017 optimisation for a vendor of high-performance RISC‑V cores. The benchmarking quickly revealed a pattern: branchless conditional sequences were absent across the board, and the resulting branches were creating significant pressure on the branch predictor. In workloads with data-dependent conditionals, which SPEC CPU2017 has in abundance, the misprediction penalties were real and measurable.

The obvious answer would have been conditional moves, the way ARM does it with csel. But on high-performance, wide-issue cores, conditional moves have an encoding problem: a three-input instruction (two data sources plus a condition) requires either a third register-file read port or a constrained encoding format. Neither was acceptable for these designs. What was essentially free was dual-issue: two simple ALU operations in the same cycle cost almost nothing.

So we decomposed the conditional select into two conditional zeroing operations and a standard add. Each instruction reads only two source registers, fits cleanly into RISC‑V’s existing encoding, and on a dual-issue core the two czero operations execute in parallel. The total cost is one cycle for the zeroing pair plus one cycle for the add — competitive with a single csel, with no encoding compromises.

With Zicond available, the compiler story changes completely. When GCC’s conditional execution pass encounters an IF-THEN-JOIN diamond, it can emit the czero sequence instead of leaving the branch in place. No branch means no polarity problem. No polarity problem means no register allocation constraint. The result is equally good code regardless of how other passes have arranged the CFG.

Status and adoption

Zicond is ratified as a standard RISC‑V extension and is mandatory in the RVA23 profile (ratified October 2024). Any RVA23-compliant core implements Zicond. In 2024, RISC‑V International awarded Philipp Tomsich a ratification award for Zicond — recognition of the extension’s journey from a performance problem on a vendor engagement to a mandatory part of the RISC‑V application profile.

Toolchain support is mature: GCC has supported Zicond since GCC 14, and LLVM since LLVM 17.

All examples in this post were compiled with GCC trunk at -O2.

Conclusion

The branch-polarity problem illustrates how ISA design decisions ripple through the entire compiler stack. A “missing” instruction doesn’t just affect one optimisation pass — it creates a class of problems where CFG optimisation, instruction expansion, and register allocation interact in ways that no single pass can resolve. The result is code that is correct but unnecessarily slow, and the slowness is structural: it cannot be fixed by improving any individual compiler pass without adding a conditional instruction to the target.

Zicond’s two instructions solve this cleanly. They don’t just add conditional select to RISC‑V — they eliminate a fundamental tension between CFG optimisation and register allocation that the compiler cannot resolve on its own.

Two instructions, zero branches, one ratification award.
