Evaluation & Quality
Security work without measurement is theater. The metrics that mean something, the ones that mislead, and the drills that actually build the muscle.
How security engineers prove their work matters
The hard question every security lead gets: "how do I know your work is actually making the protocol safer?" The temptation is to count "issues found" or "audits done." Neither is sufficient.
Defensible proof-of-value falls into three buckets:
- Leading indicators. Invariants proven, mutation kill rate, time to triage on bounty, audit findings per kLOC trending down.
- Lagging indicators. Incidents avoided (counterfactual, but you can sometimes show "the bounty caught X before it hit prod"), funds-at-risk avoided.
- Process indicators. Time from finding to fix, time from incident to comms, audit cycle duration.
In the loop, when asked "how would you measure your impact in this role?" — pick 2-3 of these, name the cadence (weekly, monthly), and acknowledge what they don't measure.
Coverage metrics that lie vs metrics that don't
Line coverage tells you which lines were executed by tests. It's the most quoted and least useful security metric — a contract can have 100% line coverage and a critical reentrancy bug because the test never set up the adversarial state.
Branch coverage is better: it tells you whether both arms of every conditional were exercised.
Path coverage is better still, but quickly becomes intractable.
Mutation kill rate is the metric a security engineer should actually quote. It measures whether your tests catch deliberate code mutations — a far stronger signal than "line touched."
Invariant coverage — how many of your stated invariants are proven (CVL) or empirically held (Foundry invariant testing). This is the metric you'd put on a slide for the protocol team's quarterly review.
"95% line coverage" looks like a quality answer in Slack but means nothing without a mutation kill rate alongside it. If asked about coverage in the loop, name the limitation before they do.
Counterexample-driven development
The most underrated workflow in DeFi security. The loop:
- State an invariant in CVL or as a Foundry invariant test.
- Run it. Solver/fuzzer produces a counterexample where the invariant fails.
- Triage: is the counterexample a real bug, a missing assumption, or solver noise?
- If a real bug: fix it. If a missing assumption: encode it (a `require` or a `preserved` block). If solver noise: tighten the spec or summarization.
- Re-run. Repeat until the property is proven (or you've documented why it can't be).
This is the actual day-to-day of FV work. It's not "write a spec and submit." It's a 30-100 round dialogue with a solver, where the solver finds gaps faster than you would.
If asked "walk me through how you'd debug a failing CVL rule," this is the answer. Naming the dialogue with the solver — and the fact that the counterexample is often a missing assumption, not a bug — signals real experience.
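The loop in miniature — a random-input fuzzer standing in for the solver, run against a hypothetical vault with a seeded bug (all names invented for illustration):

```python
import random

class ToyVault:
    """Hypothetical vault. Claimed invariant: shares always mirror assets."""
    def __init__(self):
        self.assets = 0
        self.shares = 0

    def deposit(self, amt):
        self.assets += amt
        self.shares += amt

    def withdraw(self, amt):
        self.assets -= amt   # seeded bug: shares are never burned

def invariant(v):
    return v.shares == v.assets

def find_counterexample(rounds=1000, seed=0):
    rng = random.Random(seed)
    for _ in range(rounds):
        v, trace = ToyVault(), []
        for _ in range(5):
            if v.assets == 0 or rng.random() < 0.7:
                amt = rng.randint(1, 100)
                v.deposit(amt)
                trace.append(("deposit", amt))
            else:
                amt = rng.randint(1, v.assets)
                v.withdraw(amt)
                trace.append(("withdraw", amt))
            if not invariant(v):
                return trace     # counterexample: a failing call sequence
    return None

cex = find_counterexample()
# Triage the trace: it ends in a withdraw that left shares > assets --
# a real bug (burn the shares), not a missing assumption.
```

A CVL counterexample is the same artifact — a concrete call sequence plus state — just produced by an SMT solver instead of random search, which is why the triage step (bug vs. missing assumption vs. noise) is identical.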
Mutation testing
Mutation testing seeds deliberate bugs into your code (replace + with -, flip < to <=, etc.), then runs your test suite. If the tests still pass, the mutation "survived" — meaning your tests don't catch that bug class.
Tools you should know:
- Slither's `slither-mutate` — mutation testing for Solidity, built on Slither.
- Vertigo / vertigo-rs — older mutation tooling for Solidity.
- Gambit — from Certora; mutation testing integrated with their pipeline.
- Necessist — Trail of Bits' tool that mutates the tests themselves, flagging tests that still pass with statements removed.
A useful target: kill rate > 80%. Below that, your test suite is providing false confidence.
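The mechanics fit in a few lines. A toy mutation tester in Python (the `fee` function is a hypothetical stand-in for a Solidity fee calculation):

```python
SOURCE = """
def fee(amount, bps):
    return amount * bps // 10_000
"""

# Seeded mutations: (pattern, replacement), applied one at a time.
MUTATIONS = [("*", "+"), ("//", "/"), ("10_000", "1_000")]

def kill_rate(suite):
    killed = 0
    for old, new in MUTATIONS:
        ns = {}
        exec(SOURCE.replace(old, new, 1), ns)  # build one mutant
        if not suite(ns["fee"]):               # suite fails => mutant killed
            killed += 1
    return killed / len(MUTATIONS)

def weak_suite(fee):
    try:
        assert fee(10_000, 30) == 30
        return True
    except AssertionError:
        return False

def strong_suite(fee):
    try:
        assert fee(10_000, 30) == 30
        assert fee(1, 30) == 0   # pins the rounding direction
        return True
    except AssertionError:
        return False

print(round(kill_rate(weak_suite), 2))    # 0.67: the // -> / mutant survives
print(round(kill_rate(strong_suite), 2))  # 1.0: the rounding test kills it
```

`slither-mutate` and Gambit do exactly this, but at the level of Solidity source and your real test suite — the surviving truncation mutant above is the same "rounding direction never tested" gap that shows up in share-math audits.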
```bash
slither-mutate src/ --solc-remaps "@openzeppelin=lib/openzeppelin-contracts/contracts"
# generates mutants, runs tests, reports survival rate
```

The "find the bug in 30 minutes" exercise
A common interview round: they hand you 50-300 lines of Solidity, 30-45 minutes on the clock, "what's wrong with this?" The structural moves that work:
- Read the README / spec first. Don't skip. The bug is usually a violation of the stated intent.
- List externally callable functions. Each is a potential attack entry point.
- Walk the money flow. Token in, token out. Where does the balance update? Is it before or after the external call?
- Check the math. Division before multiplication. Rounding direction. Share inflation on empty pool.
- Check access control. Who can call each function? Is there a missing modifier?
- Check reentrancy. Any external call? Any state change after it?
- Check initialization. Can it be front-run? Can it be replayed?
- Check first/last state. Empty pool, depleted pool, max value.
- Talk through your reasoning. They're scoring the process, not just the answer.
You should be drilling this against public bug-bounty disclosures (Code4rena reports, Sherlock contests, Immunefi public disclosures) before the loop.
CTFs — concepts, not credentials
The standard CTFs:
- Damn Vulnerable DeFi — a set of DeFi-flavored "levels," each exposing a real attack class.
- Ethernaut — OpenZeppelin's beginner-to-intermediate puzzles.
- Capture the Ether — older but classic.
- Paradigm CTF — run periodically, the hardest of the set.
Reference them in the loop as "I used these to develop the muscle for X attack class," not as credentials. No one cares that you completed Ethernaut. They care that you can explain a specific category of bug fluently because you've drilled it.
CTFs are stylized; production bugs are messier. The translation work is taking what you learned about reentrancy in DVD level 1 and recognizing the same pattern hidden in 800 lines of someone else's contract.
Regression testing for invariants after refactors
One of the most embarrassing classes of bug: an invariant was proven in version N, the team refactored for gas, and version N+1 silently broke it. Prevention:
- CI runs CVL on every PR. Yes, it's slow. You pay the cost in machine time; you avoid a silent regression forever.
- CI runs Foundry invariant tests on every PR. Fast enough to be free.
- Storage layout snapshot. `forge inspect storage-layout` output committed to the repo; CI fails on unexpected diff.
- Diff selectors. External function selectors should rarely change; CI alerts on changes.
- Bytecode pinning. For deployed contracts, snapshot the bytecode hash. Any change is a fork; treat it as a new deploy.
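The layout-snapshot check can also be done semantically, naming the exact variable that moved instead of relying on a raw text diff. A sketch (the JSON shape is modeled loosely on `forge inspect` output; the sample layouts are invented):

```python
def layout_index(layout):
    # Map each state variable to its (slot, offset, type) triple.
    return {v["label"]: (v["slot"], v["offset"], v["type"])
            for v in layout["storage"]}

def diff_layouts(old, new):
    old_i, new_i = layout_index(old), layout_index(new)
    problems = []
    for name, triple in old_i.items():
        if name not in new_i:
            problems.append(f"removed: {name}")
        elif new_i[name] != triple:
            problems.append(f"moved/retyped: {name} {triple} -> {new_i[name]}")
    return problems

# A gas refactor that inserts `paused` mid-layout shifts `balance` down a slot:
old = {"storage": [
    {"label": "owner",   "slot": "0", "offset": 0, "type": "t_address"},
    {"label": "balance", "slot": "1", "offset": 0, "type": "t_uint256"},
]}
new = {"storage": [
    {"label": "owner",   "slot": "0", "offset": 0, "type": "t_address"},
    {"label": "paused",  "slot": "1", "offset": 0, "type": "t_bool"},
    {"label": "balance", "slot": "2", "offset": 0, "type": "t_uint256"},
]}

print(diff_layouts(old, new))  # flags the shifted `balance` slot
```

For an upgradeable proxy, a shifted slot like this is not a style issue — it silently corrupts live state on upgrade, which is why CI should fail hard on it.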
```bash
forge inspect MyContract storage-layout > storage-layout.json
git diff --exit-code storage-layout.json  # CI fails on unexpected diff
```

Metrics you can actually report
| Metric | Cadence | What it tells you |
|---|---|---|
| Invariants proven (count) | Weekly | FV breadth |
| Invariants with stable proof (no timeouts) | Weekly | FV health |
| Mutation kill rate | Per release | Test pack quality |
| Branch coverage | Per PR | Test breadth |
| Slither high/medium findings | Per PR | Static-analysis hygiene |
| Open audit findings (by severity) | Weekly | Open-risk backlog |
| Days from finding → fix (median) | Per audit | Engineering velocity |
| Bounty triage time (initial response) | Per week | Researcher experience |
| Days since last incident | Daily | Vibes / morale |
| Funds-at-risk on outstanding criticals | Per incident | Open-risk magnitude |
The right answer in the loop, when asked "what would your dashboard look like in 90 days?": pick 4-5 of these, name them explicitly, and explain why those over the others.
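Two of the table's rows, computed from invented sample records to show the shape of the pipeline behind the dashboard:

```python
from datetime import date
from statistics import median

# Hypothetical audit-finding records (sample data for illustration).
findings = [
    {"found": date(2024, 3, 1), "fixed": date(2024, 3, 4),  "severity": "high"},
    {"found": date(2024, 3, 2), "fixed": date(2024, 3, 12), "severity": "medium"},
    {"found": date(2024, 3, 5), "fixed": date(2024, 3, 6),  "severity": "low"},
]

# Days from finding -> fix (median), per the table's velocity row.
days_to_fix = [(f["fixed"] - f["found"]).days for f in findings]
print("median days finding -> fix:", median(days_to_fix))   # 3

# Mutation kill rate, per the table's test-quality row.
mutants_total, mutants_killed = 120, 103
print("mutation kill rate:", round(mutants_killed / mutants_total, 2))  # 0.86
```

The point of the sketch: every row in the table reduces to a query over records you should already be keeping, so "building the dashboard" is mostly a matter of logging findings, mutants, and bounty timestamps consistently.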