Use as an explicit post-test optimization pass — fixing slow loops, expensive arithmetic, integer splitting or limb assembly, modular reduction, storage slot packing, or BoundedInt type bounds. Invoke after implementation and tests pass.
Resources
1Install
npx skillscat add keep-starknet-strange/starknet-agentic/cairo-optimization Install via the SkillsCat registry.
Coding Cairo
Rules and patterns for writing efficient Cairo code. Sourced from audit findings and production profiling.
Attribution: Originally authored by feltroidprime (cairo-skills). Integrated with permission into starknet-agentic.
When to Use
- Implementing arithmetic (modular, parity checks, quotient/remainder)
- Optimizing loops (slow iteration, repeated
.len()calls, index-based access) - Splitting or assembling integer limbs (u256 → u128, u32s → u128, felt252 → u96)
- Packing struct fields into storage slots
- Using
BoundedIntfor zero-overhead arithmetic with compile-time bounds - Choosing integer types (u128 vs u256, BoundedInt vs native types)
Not for: Profiling/benchmarking (use benchmarking-cairo if available)
Quick Reference — All Rules
| # | Rule | Instead of | Use |
|---|---|---|---|
| 1 | Combined quotient+remainder | x / m + x % m |
DivRem::div_rem(x, m) |
| 2 | Cheap loop conditions | while i < n |
while i != n |
| 3 | Constant powers of 2 | 2_u32.pow(k) |
match-based lookup table |
| 4 | Pointer-based iteration | *data.at(i) in index loop |
pop_front / for / multi_pop_front |
| 5 | Cache array length | .len() in loop condition |
let n = data.len(); before loop |
| 6 | Pointer-based slicing | Manual loop extraction | span.slice(start, length) |
| 7 | Cheap parity/halving | index & 1, index / 2 |
DivRem::div_rem(index, 2) |
| 8 | Smallest integer type | u256 when range < 2^128 |
u128 (type encodes constraint) |
| 9 | Storage slot packing | One slot per field | StorePacking trait |
| 10 | BoundedInt for limbs | Bitwise ops / raw u128 math | bounded_int::{div_rem, mul, add} |
| 11 | Fast 2-input Poseidon | poseidon_hash_span([x,y]) |
hades_permutation(x, y, 2) |
Always / Never Rules
1. Always use DivRem::div_rem — never separate % and /
Cairo computes quotient and remainder in a single operation. Using both % and / on the same value doubles the cost.
// BAD
let q = x / m;
let r = x % m;
// GOOD
let (q, r) = DivRem::div_rem(x, m);2. Never use < or > in while loop conditions — use !=
Equality checks are cheaper than comparisons in Cairo.
// BAD
while i < n { ... i += 1; }
// GOOD
while i != n { ... i += 1; }3. Never compute 2^k with pow() — use a lookup table
u32::pow() is expensive. Use a match lookup for known ranges.
// BAD
let p = 2_u32.pow(depth.into());
// GOOD — match-based lookup
fn pow2(n: u32) -> u32 {
match n {
0 => 1, 1 => 2, 2 => 4, 3 => 8, 4 => 16, 5 => 32,
6 => 64, 7 => 128, 8 => 256, 9 => 512, 10 => 1024,
// extend as needed
_ => core::panic_with_felt252('pow2 out of range'),
}
}4. Always iterate arrays with pop_front / for / multi_pop_front — never index-loop
Index-based access (array.at(i)) is more expensive than pointer-based iteration.
// BAD
let mut i = 0;
while i != data.len() {
let val = *data.at(i);
i += 1;
}
// GOOD — pop_front
while let Option::Some(val) = data.pop_front() { ... }
// GOOD — for loop (equivalent)
for val in data { ... }
// GOOD — batch iteration
while let Option::Some(chunk) = data.multi_pop_front::<4>() { ... }5. Never call .len() inside a loop condition — cache it
.len() recomputes every iteration. Store it once.
// BAD
while i != data.len() { ... i += 1; }
// GOOD
let n = data.len();
while i != n { ... i += 1; }6. Always use span.slice() instead of manual loop extraction
slice() manipulates pointers directly — no element-by-element copying.
// BAD
let mut result: Array = array![];
let mut i = 0;
while i != length {
result.append(*data.at(start + i));
i += 1;
}
// GOOD
let result = data.slice(start, length);7. Always use DivRem for parity checks — never use bitwise ops
Bitwise AND is more expensive than div_rem in Cairo. Use DivRem::div_rem(x, 2) to get both the halved value and parity in one operation.
// BAD
let is_odd = (index & 1) == 1;
index = index / 2;
// GOOD
let (q, r) = DivRem::div_rem(index, 2);
if r == 1 { /* odd branch */ }
index = q;8. Always use the smallest integer type that fits the value range
u128 instead of u256 when the range is known. Adds clarity, prevents intermediate overflow.
// BAD — u256 for a value known to be < 2^128
fn deposit(value: u256) { assert(value < MAX_U128, '...'); ... }
// GOOD — type encodes the constraint
fn deposit(value: u128) { ... }9. Always use StorePacking to pack small fields into one storage slot
Multiple small fields (basis points, flags, bounded amounts) can share a single felt252 slot.
use starknet::storage_access::StorePacking;
const POW_2_128: felt252 = 0x100000000000000000000000000000000;
impl MyStorePacking of StorePacking<MyStruct, felt252> {
fn pack(value: MyStruct) -> felt252 {
value.amount.into() + value.fee_bps.into() * POW_2_128
}
fn unpack(value: felt252) -> MyStruct {
let u256 { low, high } = value.into();
MyStruct { amount: low, fee_bps: high.try_into().unwrap() }
}
}10. Always use BoundedInt for byte cutting, limb assembly, and type conversions
Never use bitwise ops (&, |, shifts) or raw u128/u256 arithmetic for splitting or combining integer limbs. Use bounded_int::div_rem to extract parts and bounded_int::mul + bounded_int::add to assemble them. BoundedInt tracks bounds at compile time, eliminating overflow checks.
Assembling limbs (e.g., 4 x u32 → u128):
// BAD — direct u128 arithmetic (28,340 gas)
fn u32s_to_u128(d0: u32, d1: u32, d2: u32, d3: u32) -> u128 {
d0.into() + d1.into() * POW_2_32 + d2.into() * POW_2_64 + d3.into() * POW_2_96
}
// GOOD — BoundedInt (13,840 gas, 2x faster)
fn u32s_to_u128(d0: u32, d1: u32, d2: u32, d3: u32) -> u128 {
let d0_bi: u32_bi = upcast(d0);
let d1_bi: u32_bi = upcast(d1);
let d2_bi: u32_bi = upcast(d2);
let d3_bi: u32_bi = upcast(d3);
let r: u128_bi = add(add(add(d0_bi, mul(d1_bi, POW_32_UI)), mul(d2_bi, POW_64_UI)), mul(d3_bi, POW_96_UI));
upcast(r)
}Splitting values (e.g., felt252 → two u96 limbs):
// GOOD — div_rem to split, mul+add to reassemble
fn felt252_to_two_u96(value: felt252) -> (u96, u96) {
match u128s_from_felt252(value) {
U128sFromFelt252Result::Narrow(low) => {
let (hi32, lo96) = bounded_int::div_rem(low, NZ_POW96_TYPED);
(lo96, upcast(hi32))
},
U128sFromFelt252Result::Wide((high, low)) => {
let (lo_hi32, lo96) = bounded_int::div_rem(low, NZ_POW96_TYPED);
let hi64: BoundedInt<0, { POW64 - 1 }> = downcast(high).unwrap();
(lo96, bounded_int::add(bounded_int::mul(hi64, POW32_TYPED), lo_hi32))
},
}
}Extracting bits (e.g., building a 4-bit selector):
// GOOD — div_rem by 2 extracts LSB, quotient is right-shifted value
let (qu1, bit0) = bounded_int::div_rem(u1, TWO_NZ); // bit0 in {0,1}
let (qu2, bit1) = bounded_int::div_rem(u2, TWO_NZ);
let selector = add(bit0, mul(bit1, TWO_UI)); // selector in {0..3}See garaga/selectors.cairo and cairo-perfs-snippets for production examples.
11. Always use hades_permutation for 2-input Poseidon hashes
poseidon_hash_span has overhead for span construction. For exactly two inputs, use the permutation directly.
// BAD
let hash = poseidon_hash_span(array![x, y].span());
// GOOD
let (hash, _, _) = hades_permutation(x, y, 2);Code Quality
- DRY: Extract repeated validation into helper functions. If two functions validate-then-write the same struct, extract a shared
_set_config(). scarb fmt: Run before every commit..tool-versions: Pin Scarb (2.15.1) and Starknet Foundry (0.56.0) versions with ASDF for reproducible builds.- Keep dependencies updated: Newer Scarb/Foundry versions include gas optimizations and compiler improvements.
BoundedInt Optimization
BoundedInt<MIN, MAX> encodes value constraints in the type system, eliminating runtime overflow checks. Use the CLI tool (bounded_int_calc.py in this skill directory) to compute bounds — do NOT calculate manually.
Critical Architecture Decision: Avoid Downcast
The #1 optimization pitfall: Converting between u16/u32/u64 and BoundedInt at function boundaries.
The Problem
If your functions take u16 and return u16, you must:
downcastinput toBoundedInt(expensive — requires range check)- Do bounded arithmetic (cheap)
upcastresult back tou16(cheap but wasteful)
The downcast operation adds a range check that dominates the savings from bounded arithmetic. In profiling:
downcast: 161,280 steps (18.86%)bounded_int_div_rem: 204,288 steps (23.89%)- Total bounded approach: worse than original!
The Solution: BoundedInt Throughout
Use BoundedInt types as function inputs AND outputs. This eliminates downcast entirely.
// BAD: Converts at every call (downcast overhead kills performance)
pub fn add_mod(a: u16, b: u16) -> u16 {
let a: Zq = downcast(a).expect('overflow'); // EXPENSIVE
let b: Zq = downcast(b).expect('overflow'); // EXPENSIVE
let sum: ZqSum = add(a, b);
let (_q, rem) = bounded_int_div_rem(sum, nz_q);
upcast(rem)
}
// GOOD: BoundedInt in, BoundedInt out (no downcast)
pub fn add_mod(a: Zq, b: Zq) -> Zq {
let sum: ZqSum = add(a, b);
let (_q, rem) = bounded_int_div_rem(sum, nz_q);
rem
}Refactoring Strategy
When optimizing existing code:
- Identify the hot path — profile to find which functions use modular arithmetic heavily
- Change signatures — update function inputs/outputs to use
BoundedInttypes - Propagate types outward — callers must also use
BoundedInt - Downcast only at boundaries — convert from u16/u32 only at system entry points (e.g., deserialization)
Type Conversion Rules
| From | To | Operation | Cost |
|---|---|---|---|
u16 |
BoundedInt<0, 65535> |
upcast |
Free (superset) |
u16 |
BoundedInt<0, 12288> |
downcast |
Expensive (range check) |
BoundedInt<0, 12288> |
u16 |
upcast |
Free (subset) |
BoundedInt<A, B> |
BoundedInt<C, D> where [A,B] ⊆ [C,D] |
upcast |
Free |
BoundedInt<A, B> |
BoundedInt<C, D> where [A,B] ⊄ [C,D] |
downcast |
Expensive |
Key insight: upcast only works when target range is a superset of source range. You cannot upcast u32 to BoundedInt<0, 150994944> because u32 max (4294967295) > 150994944.
Prerequisites
# Scarb.toml
[dependencies]
corelib_imports = "0.1.2"// CORRECT imports — copy exactly
use corelib_imports::bounded_int::{
BoundedInt, upcast, downcast, bounded_int_div_rem,
AddHelper, MulHelper, DivRemHelper, UnitInt,
};
use corelib_imports::bounded_int::bounded_int::{SubHelper, add, sub, mul};Copy-Paste Template
Working example for modular arithmetic mod 100:
use corelib_imports::bounded_int::{
BoundedInt, upcast, downcast, bounded_int_div_rem,
AddHelper, MulHelper, DivRemHelper, UnitInt,
};
use corelib_imports::bounded_int::bounded_int::{SubHelper, add, sub, mul};
type Val = BoundedInt<0, 99>; // [0, 99]
type ValSum = BoundedInt<0, 198>; // [0, 198]
type ValConst = UnitInt<100>; // singleton {100}
impl AddValImpl of AddHelper<Val, Val> {
type Result = ValSum;
}
impl DivRemValImpl of DivRemHelper<ValSum, ValConst> {
type DivT = BoundedInt<0, 1>;
type RemT = Val;
}
fn add_mod_100(a: Val, b: Val) -> Val {
let sum: ValSum = add(a, b);
let nz_100: NonZero<ValConst> = 100;
let (_q, rem) = bounded_int_div_rem(sum, nz_100);
rem
}CLI Tool
Use bounded_int_calc.py in this skill directory. Always use CLI — never calculate manually.
# Addition: [a_lo, a_hi] + [b_lo, b_hi]
python3 bounded_int_calc.py add 0 12288 0 12288
# -> BoundedInt<0, 24576>
# Subtraction: [a_lo, a_hi] - [b_lo, b_hi]
python3 bounded_int_calc.py sub 0 12288 0 12288
# -> BoundedInt<-12288, 12288>
# Multiplication
python3 bounded_int_calc.py mul 0 12288 0 12288
# -> BoundedInt<0, 150994944>
# Division: quotient and remainder bounds
python3 bounded_int_calc.py div 0 24576 12289 12289
# -> DivT: BoundedInt<0, 1>, RemT: BoundedInt<0, 12288>
# Custom impl name
python3 bounded_int_calc.py mul 0 12288 0 12288 --name MulZqImplBoundedInt Bounds Quick Reference
| Operation | Formula |
|---|---|
| Add | [a_lo + b_lo, a_hi + b_hi] |
| Sub | [a_lo - b_hi, a_hi - b_lo] |
| Mul (unsigned) | [a_lo * b_lo, a_hi * b_hi] |
| Div quotient | [a_lo / b_hi, a_hi / b_lo] |
| Div remainder | [0, b_hi - 1] |
Negative Dividends: SHIFT Pattern
bounded_int_div_rem doesn't support negative lower bounds. When a subtraction produces a negative-bounded result that needs reduction, add a multiple of the modulus first:
// sub_mod: (a - b) mod Q via SHIFT
pub fn sub_mod(a: Zq, b: Zq) -> Zq {
let a_plus_q: BoundedInt<12289, 24577> = add(a, Q_CONST); // shift by +Q
let diff: BoundedInt<1, 24577> = sub(a_plus_q, b); // now non-negative
let (_q, rem) = bounded_int_div_rem(diff, nz_q());
rem
}
// fused_sub_mul_mod: a - (b*c) mod Q via large SHIFT
// OFFSET = 12288 * Q = 151007232 (smallest multiple of Q >= max product)
pub fn fused_sub_mul_mod(a: Zq, b: Zq, c: Zq) -> Zq {
let prod: ZqProd = mul(b, c);
let a_offset: BoundedInt<151007232, 151019520> = add(a, OFFSET_CONST);
let diff: BoundedInt<12288, 151019520> = sub(a_offset, prod);
let (_q, rem) = bounded_int_div_rem(diff, nz_q());
rem
}Rule: SHIFT = ceil(|min_possible_value| / modulus) * modulus. Adding SHIFT preserves the result mod Q (since SHIFT ≡ 0 mod Q) while making all values non-negative.
Common BoundedInt Mistakes
- Downcast at every function call — the biggest performance killer. Use
BoundedInttypes throughout, not just inside arithmetic functions. - Trying to upcast to a narrower type —
upcast(val: u32)toBoundedInt<0, 150994944>fails because u32 max > 150994944. - Wrong imports — use exact imports from Prerequisites section above.
- Wrong subtraction bounds — it's
[a_lo - b_hi, a_hi - b_lo], NOT[a_lo - b_lo, a_hi - b_hi]. - Negative dividend in
bounded_int_div_rem— div_rem doesn't support negative lower bounds. Add a SHIFT (multiple of modulus) before reducing. See SHIFT pattern above. - Missing intermediate types — always annotate:
let sum: ZqSum = add(a, b); - Division quotient off-by-one — integer division floors:
24576 / 12289 = 1, not 2. - Using
UnitIntvsBoundedIntfor constants — useUnitInt<N>for singleton constants like divisors. - Using
div_remvsbounded_int_div_rem— the function isbounded_int_div_rem, notdiv_rem. - Bounds exceed u128::max — BoundedInt bounds are hard-capped at 2^128. Larger values crash the Sierra specializer: 'Provided generic argument is unsupported.'