How I debug a failing regression at 2am: my actual process
DESIGN VERIFICATION ARTICLE 04 OF 06
The test name is tc_random_stress_42891. The error is ‘X
propagation on bus_data[7]’. The deadline is tomorrow. Here’s what I actually
do.
It’s past midnight. The
regression that was supposed to be clean isn’t. You have a test name, a seed, a
cryptic error message, and a log file that is, conservatively, the length of a
short novel. The tape-out review is in nine hours.
This is the part of
verification nobody talks about in conference papers. The methodology slides
show clean flows: stimulus, checking, coverage closure. They don’t show the 2am
session where you’re staring at a waveform trying to figure out why a 32-bit bus
decided to go X on bit 7 and only bit 7.
What follows is my actual
process. Not the idealized version. The one I use when I’m tired, under
pressure, and the failure is not obvious.
Step 0: Resist the waveform
The instinct, when a simulation
fails, is to open the waveform viewer immediately. This is almost always wrong.
Waveforms are for confirming a
hypothesis, not forming one. If you open GTKWave or Verdi before you have a
theory about what went wrong, you will drown. A stress test simulation can run
for millions of cycles. The failure might be the consequence of something that
happened 200,000 cycles earlier. Without a hypothesis to navigate toward, you
will scroll randomly through signal history and learn nothing.
A waveform is a place to confirm a hypothesis, not form one.
Open it last, not first.
Before touching the waveform, I
do three things:
•
Read the full error message carefully. Not the first
line — the full message, including any UVM_ERROR and UVM_FATAL context, the
time stamp, and the hierarchy path of the component that raised it.
•
Check whether this failure is new or recurring. Has
this seed failed before? Is this a flaky test or a consistent one? A test that
fails 1-in-100 runs is a different problem from a test that has never passed.
•
Check the recent change log. What changed in the RTL or
testbench in the last 24 hours? A failure that appeared after a specific commit
is much easier to isolate than one that appeared from nowhere.
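Those three checks are scriptable. A minimal sketch, assuming a per-run regression history file (the `history.csv` name and its test,seed,result format are hypothetical; adapt them to your flow):

```shell
# Build a tiny example history file so the sketch is self-contained.
# In a real flow this would come from your regression database.
printf 'tc_random_stress,42891,FAIL\ntc_random_stress,17,PASS\ntc_random_stress,42891,FAIL\n' > history.csv

# Has this seed failed before, and how often?
prior_fails=$(grep -c ',42891,FAIL' history.csv)
echo "seed 42891 has $prior_fails prior failures on record"

# What changed recently? (shown as a comment; needs a real checkout)
# git log --since="24 hours ago" --oneline -- rtl/ tb/
```

The point is that seed history and recent diffs are each one command away, so there is no excuse for skipping them.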
Only after those three steps do
I start debugging in earnest.
Step 1: Reproduce it deterministically
Random tests are powerful
precisely because they explore stimulus space automatically. They are also
annoying to debug because the failure only appears with a specific seed. Step
one is to lock that seed down and make the failure 100% reproducible.
# Re-run with an explicit seed to reproduce. The plusarg shown is VCS's;
# the flag name varies by simulator, so check your tool's documentation.
$ ./simv +ntb_random_seed=42891 +UVM_TESTNAME=tc_random_stress
# Confirm it fails consistently before you touch anything.
# If it only fails 7 times out of 10, that's a different problem.
If the failure reproduces
cleanly with the fixed seed, you now have a controlled environment. If it
doesn’t reproduce reliably — if it’s intermittent even with the same seed — you
have a race condition, and that’s a harder problem that I’ll address separately.
When the failure is intermittent
Intermittent failures with a fixed seed usually indicate a time-zero race, a phase ordering issue, or non-determinism from a delta-cycle ordering dependency. Before debugging the logic, pin down the other sources of nondeterminism you can control (simulator threading options, DPI/PLI calls, anything seeded from wall-clock time) and check whether the failure rate changes. If it does, nondeterminism is leaking in from outside the seeded random stream.
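Whether a failure is solid or flaky with a fixed seed is itself measurable before you draw any conclusions. A minimal sketch, with `run_sim` as a stub standing in for the real simv invocation so the example is self-contained:

```shell
# Stub for illustration: a real flow would invoke simv with the fixed
# seed here and let the log land in run_$i.log.
run_sim() { echo "UVM_ERROR @ 1839440ns: X propagation on bus_data[7]"; }

fails=0
for i in 1 2 3 4 5; do
  run_sim > "run_$i.log" 2>&1
  # Count the run as a failure if the log contains any UVM error
  grep -q 'UVM_ERROR\|UVM_FATAL' "run_$i.log" && fails=$((fails + 1))
done
echo "failed $fails of 5 runs with the same seed"
```

Five out of five means you have a controlled reproduction; anything less means you are in race-condition territory.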
Step 2: Read the UVM log as a story
The UVM log is not a wall of
text. It’s a timestamped narrative of what the testbench did, in what order,
and what it thought about it. Reading it correctly is a skill that takes time
to develop, and it is one of the highest-leverage skills in verification.
My reading order:
•
Jump to the first UVM_ERROR or UVM_FATAL. Not the last
— the first. Subsequent errors are often cascading consequences of the original
failure. The first error is where the story breaks.
•
Read backward from the first error. What phase was the
simulation in? What was the last sequence item driven? What did the monitor
last observe? What did the scoreboard last compare?
•
Look at the time delta between the last successful
transaction and the failure. If the DUT was processing normally and then
suddenly went wrong, something changed — a stimulus corner case was hit, a
pipeline stage filled unexpectedly, a timeout expired.
# grep patterns I use constantly
$ grep -n 'UVM_ERROR\|UVM_FATAL' sim.log
# Find the first error and the surrounding context
$ grep -n 'UVM_ERROR\|UVM_FATAL' sim.log | head -1
$ sed -n '450,480p' sim.log # read the lines around that one
# Trace a specific signal or transaction ID through the log
$ grep 'bus_data\[7\]\|txn_id=0x2A' sim.log
# Find all scoreboard mismatches
$ grep 'SCOREBOARD\|mismatch\|expected.*got' sim.log
The first error in the log is the one that matters. Everything
after it is a consequence. Debug the cause, not the consequences.
A common mistake: engineers
grep for the signal name in the error message and scan every occurrence. This
produces a list of symptoms with no causal ordering. Instead, work forward from
the first error — or backward from it — and build a timeline.
Step 3: Bisect the failure in time and stimulus
You now know roughly when the
failure occurred. The next step is to understand what caused it — and the most
reliable method is bisection.
There are two dimensions to
bisect: simulation time and stimulus content.
Bisecting simulation time
If the failure occurs at time
1,842,000ns, the root cause is not necessarily there. The DUT might have
entered a bad state much earlier, and the consequence only manifested later. To
find the real origin:
// Add a checkpoint: force the test to stop at half the failure time
// and observe whether the DUT state is already corrupt.
// In your test or environment:
initial begin
  #921_000ns; // half of 1,842,000ns (explicit unit avoids timescale surprises)
  $display("[CHECKPOINT] dut.cache_state = %0h", dut.cache_state);
  $display("[CHECKPOINT] dut.arb_grant = %0b", dut.arb_grant);
end
// If state is already wrong at the checkpoint, the bug is earlier:
// bisect again. If state is clean, the bug is in the second half.
Bisecting stimulus
Random stress tests generate
many transactions. Most of them are irrelevant to the failure. To find the
minimal reproducing case:
•
Reduce the test length. If the failure occurs at
transaction 847, try running only 500 transactions with the same seed. If it
still fails, try 200. You’re looking for the minimum number of transactions
that trigger the bug.
•
Disable randomization categories. If you’re generating
random addresses, lengths, and burst types, try fixing the address to a simple
value. If the failure goes away, the address randomization is relevant. If it
persists, it’s not.
•
Print the specific sequence items that were generated
around the failure window. Compare them against your protocol spec manually.
Often, you’ll spot the illegal or corner-case combination immediately.
The minimal reproducing case is worth the investment
A 10-transaction sequence that reproduces a bug is infinitely more useful than a 10,000-transaction sequence that contains it. You can read 10 transactions. You can hand-trace them against the spec. You can add them to your regression as a directed test. Take the time to reduce.
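The test-length reduction above is just a binary search, and it can be automated. A minimal sketch, with `fails_with` as a stub standing in for a real run (which would pass the count through a plusarg and grep the log for errors):

```shell
# Stub: pretend the bug needs at least 217 transactions to trigger.
# A real version would run simv with +num_txns=$1 and grep the log.
fails_with() { [ "$1" -ge 217 ]; }

lo=1; hi=847                 # failure first seen at transaction 847
while [ "$lo" -lt "$hi" ]; do
  mid=$(( (lo + hi) / 2 ))
  # If it still fails at mid transactions, the minimum is at or below mid
  if fails_with "$mid"; then hi=$mid; else lo=$((mid + 1)); fi
done
echo "minimum failing transaction count: $lo"
```

Roughly ten runs replace hundreds of guesses, and the result is the directed test you will want in the regression anyway.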
Step 4: Form a hypothesis before opening the waveform
By now, you should have:
•
The exact time of the first error
•
A rough understanding of what the testbench was doing
when it failed
•
A reduced or bisected failure scenario
•
A suspicion about which component is involved — the
DUT, the driver, the monitor, or the scoreboard
Write it down. Seriously. Not
in your head — in a text file, a notepad, anywhere. Something like:
// Hypothesis (2:14am):
// - Failure occurs at t=1,842,000ns, first seen at t=1,839,440ns in scoreboard
// - Last transaction was a WRITE burst of length 256 to address 0xFFFF_FF00
// - Suspect: address wrapping at 4KB boundary not handled in write path
// - Evidence for: DUT spec section 4.3.2 mentions wrap behavior is optional
// - Evidence against: same burst length passes with address 0x0000_FF00
//
// Plan: open waveform, navigate to t=1,839,000ns, examine
// dut.write_addr_gen and dut.wrap_detect signals
This forces you to be precise
about what you’re looking for before you open the waveform. It also creates a
record: if your hypothesis is wrong, you know exactly what you tested and can
rule it out cleanly.
Step 5: Open the waveform with intention
Now you open the waveform
viewer. You have a specific time, a specific hypothesis, and a specific set of
signals you want to examine. This session should take 15–30 minutes, not three
hours.
What I look at, in order:
•
Navigate directly to the failure time minus a small
margin (e.g., t=1,839,000ns for a failure at t=1,842,000ns). Confirm the
signals I hypothesized are behaving unexpectedly.
•
Trace backward from the anomaly. If dut.wrap_detect is
wrong at t=1,839,440ns, when did it last transition? What drove that
transition? Work backward through the combinational or sequential logic until
you find the input that caused it.
•
Check the monitor capture against the raw DUT pins.
Sometimes the bug is not in the DUT at all — the monitor is misinterpreting a
valid signal. Compare what the monitor logged against what the waveform
actually shows.
•
Look for X propagation sources. An X on bus_data[7]
usually originates from an uninitialized register, a case statement without a
default, or a memory read before write. Use the ‘trace X’ feature in your
waveform tool to walk backward to the source.
// In Verdi / nWave: trace X source
// Right-click the X value → "Trace X" → follows back to origin
//
// Tcl can also drive the waveform viewer programmatically (jump to a
// time, add signals, step to the next transition). Exact command names
// vary by tool and version; check your tool's Tcl reference. Sketches:
// nWave::gotoTime 1839000 ; # jump to specific time
// nWave::addSignal {dut.wrap_detect dut.write_addr} ; # add signals
// nWave::findEdge -next -signal {dut.wrap_detect} ; # next transition
An X on a data bus is not the bug. It’s the symptom. The bug is
the uninitialized state, the missing reset, or the case statement without a
default that produced the X three hundred cycles earlier.
Step 6: Distinguish DUT bugs from testbench bugs
This step deserves its own
discussion because the consequences of getting it wrong are significant. If you
file a DUT bug that turns out to be a testbench bug, you’ve wasted a designer’s
time and lost credibility. If you fix a testbench bug when the DUT was actually
wrong, you’ve masked a real silicon issue.
Before concluding a bug is in
the DUT, confirm:
•
The monitor is capturing signals at the correct
protocol phase (setup time, hold time, valid strobe). A monitor that samples
one cycle early will see incorrect data on a correctly behaving bus.
•
The reference model handles this specific transaction
type. If this is the first time a burst of length 256 to a near-boundary
address has been generated, your reference model may never have been exercised
for this case.
•
The scoreboard comparison is correct. If the comparison
function does byte-swapping for endianness, does it handle the boundary case?
Is it comparing the right fields?
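One cheap way to run the monitor-versus-reference check is to dump both sides as text and diff them. A minimal sketch; the file names and one-transaction-per-line format are assumptions:

```shell
# Illustrative dumps; in a real flow the monitor and reference model
# would each write one transaction per line in an identical format.
printf 'WR addr=0xFFFF_FF00 len=256\nRD addr=0x0000_0040 len=4\n' > monitor.txt
printf 'WR addr=0xFFFF_FF00 len=256\nRD addr=0x0000_0040 len=4\n' > refmodel.txt

# A plain diff makes any disagreement, and its first location, obvious
if diff -u monitor.txt refmodel.txt > txn.diff; then
  verdict="agree"
else
  verdict="disagree"
fi
echo "monitor and reference model $verdict"
```

If they disagree, the first diverging line tells you which transaction to hand-trace against the spec.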
If all three check out, the bug
is in the DUT. Write it up with: the exact failing scenario, the expected
behavior per spec (with section reference), the observed behavior, the minimal
reproducing test, and the waveform screenshot with annotations.
A well-written bug report is
half of verification’s value. A vague one (“Bus goes X sometimes”) sends the
designer on their own debugging expedition and doubles the time to fix.
Step 7: Know when to stop and ask for help
There is a skill that takes
years to develop: knowing when you have been debugging the wrong thing for too
long.
I give myself a time limit. If
I have been working on a single failure for more than 90 minutes without
meaningful progress — without ruling out hypotheses or narrowing the failure
space — I stop. I write down everything I know, everything I’ve ruled out, and
the current state of my best hypothesis. Then I either sleep (if possible) or
ask for help.
Fresh eyes find bugs that tired eyes have walked past twenty
times. This is not a failure of skill. It is a failure of sleep.
When asking for help, the
write-down is essential. “Hey, this thing is broken” requires the other person
to repeat all your debugging work. “Here’s what fails, here’s what I’ve ruled
out, here’s my current hypothesis,” lets them start where you left off.
Step 8: Build the playbook entry
Once you’ve found and
understood the bug — whether it’s in the DUT or the testbench — do one more
thing before you close the ticket: write a one-paragraph entry for your
personal debug playbook.
// Debug playbook entry — 2024-03-15
// Symptom: X on data bus bits, appearing mid-burst, scoreboard mismatch
// Root cause pattern: address wrap at 4KB boundary, missing in DUT write path
// How to spot it: look for burst addr+len crossing 0x_____000 boundary
// Key signals: dut.wrap_detect, dut.write_addr_gen.next_addr
// Time to debug: 2h 20m
// Added directed test: tc_write_wrap_4k_boundary (now in regression)
A playbook is not a formal
document. It’s a searchable personal record of patterns you’ve seen. Address
wrap bugs look like address wrap bugs. Uninitialized reset bugs look like
uninitialized reset bugs. Protocol phase sampling bugs look like protocol phase
sampling bugs. After a few years, you will recognize the pattern before you
finish reading the log.
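The only tooling a playbook needs is grep, which is why plain text beats a wiki for this. A minimal sketch, with the file name and entry format as assumptions:

```shell
# A playbook entry like the one above, kept in a flat text file
printf '// Symptom: X on data bus bits, appearing mid-burst\n// Root cause pattern: address wrap at 4KB boundary\n' > playbook.txt

# Next time the symptom appears, search before you debug
hits=$(grep -ic 'x on data bus' playbook.txt)
echo "$hits prior sighting(s) of this symptom"
```

One hit at 2am can save the two hours the original debug took.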
That pattern recognition — the
ability to narrow a complex failure to a root cause class in minutes instead of
hours — is what separates a senior verification engineer from a junior one. It
is not magic. It’s accumulated, documented experience.
The 2am summary
Written out in full, the
process looks long. In practice, for most failures, steps 0 through 3 take
20–30 minutes. Step 4 takes 5 minutes. Step 5 takes 15–30 minutes. Steps 6 and
7 are judgment calls that take as long as they take.
The discipline is in the order.
Resist the waveform. Read the log. Bisect. Form a hypothesis. Then confirm.
The engineers who are fast at
debugging are not faster at staring at waveforms. They are faster at forming
correct hypotheses, which means they spend less total time in the waveform
viewer. The work is almost entirely in your head and in the log file, before
you’ve opened a single signal.
The deadline is in eight hours.
You have a hypothesis. Go confirm it.
Next in this series:
Article 05 —
Formal verification isn’t scary — you’re just using it wrong