How I debug a failing regression at 2am: my actual process

 DESIGN VERIFICATION    ARTICLE 04 OF 06


The test name is tc_random_stress_42891. The error is ‘X propagation on bus_data[7]’. The deadline is tomorrow. Here’s what I actually do.

 

It’s past midnight. The regression that was supposed to be clean isn’t. You have a test name, a seed, a cryptic error message, and a log file that is, conservatively, the length of a short novel. The tape-out review is in nine hours.

This is the part of verification nobody talks about in conference papers. The methodology slides show clean flows: stimulus, checking, coverage closure. They don’t show the 2am session where you’re staring at a waveform trying to figure out why a 32-bit bus decided to go X on bit 7 and only bit 7.

What follows is my actual process. Not the idealized version. The one I use when I’m tired, under pressure, and the failure is not obvious.

Step 0: Resist the waveform

The instinct, when a simulation fails, is to open the waveform viewer immediately. This is almost always wrong.

Waveforms are for confirming a hypothesis, not forming one. If you open GTKWave or Verdi before you have a theory about what went wrong, you will drown. A stress test simulation can run for millions of cycles. The failure might be the consequence of something that happened 200,000 cycles earlier. Without a hypothesis to navigate toward, you will scroll randomly through signal history and learn nothing.

A waveform is a place to confirm a hypothesis, not form one. Open it last, not first.

Before touching the waveform, I do three things:

        Read the full error message carefully. Not the first line — the full message, including any UVM_ERROR and UVM_FATAL context, the time stamp, and the hierarchy path of the component that raised it.

        Check whether this failure is new or recurring. Has this seed failed before? Is this a flaky test or a consistent one? A test that fails 1-in-100 runs is a different problem from a test that has never passed.

        Check the recent change log. What changed in the RTL or testbench in the last 24 hours? A failure that appeared after a specific commit is much easier to isolate than one that appeared from nowhere.

 

Only after those three steps do I start debugging in earnest.
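Checks two and three above are cheap to script. A minimal sketch, assuming archived regression logs live under a `regress_logs/` directory and the repo is git-managed (both paths are placeholders for your setup):

```shell
# Has this seed failed before? Search archived regression logs
# (regress_logs/ is a placeholder for wherever your results land).
grep -rl 'seed=42891' regress_logs/ 2>/dev/null | head -5 || true

# What changed in the last 24 hours? Restrict to RTL and testbench dirs.
git log --since='24 hours ago' --oneline -- rtl/ tb/ 2>/dev/null || true
```

Thirty seconds of this before any real debugging often answers the "new or recurring" question outright.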

Step 1: Reproduce it deterministically

Random tests are powerful precisely because they explore stimulus space automatically. They are also annoying to debug because the failure only appears with a specific seed. Step one is to lock that seed down and make the failure 100% reproducible.

# Re-run with an explicit seed to reproduce (VCS shown here;
# plusarg names differ by simulator, so check your tool's docs)
$ simv +ntb_random_seed=42891 +UVM_TESTNAME=tc_random_stress

# Confirm it fails consistently before you touch anything.
# If it only fails 7 times out of 10, that's a different problem.

 

If the failure reproduces cleanly with the fixed seed, you now have a controlled environment. If it doesn’t reproduce reliably — if it’s intermittent even with the same seed — you have a race condition, and that’s a harder problem that I’ll address separately.

When the failure is intermittent

Intermittent failures with a fixed seed usually indicate a time-zero race, a phase ordering issue, or non-determinism from a delta-cycle ordering dependency. Before debugging the logic itself, confirm the environment is actually deterministic: an unseeded $urandom call, a wall-clock dependency, or UVM random stability broken by a changed object construction order can all make the same seed behave differently from run to run.
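A quick determinism check, sketched below: run the identical seed twice and diff the logs after filtering lines that legitimately differ. The `sim.sh` wrapper and log names are placeholders; the first divergence between the two runs is where determinism broke.

```shell
# Run the identical seed twice (sim.sh stands in for your run script;
# '|| true' because a failing test exits nonzero by design)
./sim.sh +ntb_random_seed=42891 > run_a.log 2>&1 || true
./sim.sh +ntb_random_seed=42891 > run_b.log 2>&1 || true

# Strip lines that legitimately differ (wall-clock, CPU time), then
# show the first divergence; no output means the run is deterministic
diff <(grep -v 'CPU time\|Date:' run_a.log) \
     <(grep -v 'CPU time\|Date:' run_b.log) | head -20 || true
```

If the diff is empty, the failure is not a determinism problem and you can go back to debugging the logic.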

Step 2: Read the UVM log as a story

The UVM log is not a wall of text. It’s a timestamped narrative of what the testbench did, in what order, and what it thought about it. Reading it correctly is a skill that takes time to develop, and it is one of the highest-leverage skills in verification.

My reading order:

        Jump to the first UVM_ERROR or UVM_FATAL. Not the last — the first. Subsequent errors are often cascading consequences of the original failure. The first error is where the story breaks.

        Read backward from the first error. What phase was the simulation in? What was the last sequence item driven? What did the monitor last observe? What did the scoreboard last compare?

        Look at the time delta between the last successful transaction and the failure. If the DUT was processing normally and then suddenly went wrong, something changed — a stimulus corner case was hit, a pipeline stage filled unexpectedly, a timeout expired.

 

# grep patterns I use constantly

# Find all errors and their timestamps
$ grep -n 'UVM_ERROR\|UVM_FATAL' sim.log

# Find the first error, then read the lines leading up to it
$ grep -n 'UVM_ERROR\|UVM_FATAL' sim.log | head -1
$ sed -n '450,480p' sim.log   # ~30 lines around the line number it reported

# Trace a specific signal or transaction ID through the log
$ grep 'bus_data\[7\]\|txn_id=0x2A' sim.log

# Find all scoreboard mismatches
$ grep 'SCOREBOARD\|mismatch\|expected.*got' sim.log

 

The first error in the log is the one that matters. Everything after it is a consequence. Debug the cause, not the consequences.

A common mistake: engineers grep for the signal name in the error message and scan every occurrence. That produces a list of symptoms with no causal ordering. Instead, anchor on the first error and build a timeline outward from it.
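The grep-then-sed dance above can be collapsed into a single step. A sketch, assuming a UVM-style `sim.log` in the current directory:

```shell
# One shot: print 20 lines of context leading up to the FIRST error.
# Later errors are usually cascading consequences of this one.
first=$(grep -n 'UVM_ERROR\|UVM_FATAL' sim.log 2>/dev/null | head -1 | cut -d: -f1)
if [ -n "$first" ]; then
  start=$(( first > 20 ? first - 20 : 1 ))
  sed -n "${start},$(( first + 5 ))p" sim.log
fi
```

Alias it; at 2am you do not want to retype the pipeline.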

Step 3: Bisect the failure in time and stimulus

You now know roughly when the failure occurred. The next step is to understand what caused it — and the most reliable method is bisection.

There are two dimensions to bisect: simulation time and stimulus content.

Bisecting simulation time

If the failure occurs at time 1,842,000ns, the root cause is not necessarily there. The DUT might have entered a bad state much earlier, and the consequence only manifested later. To find the real origin:

// Add a checkpoint: stop the test at half the failure time
// and observe whether the DUT state is already corrupt

// In your test or environment (hierarchy names are examples):
initial begin
  #921_000;  // half of 1,842,000 ns (assumes a 1 ns time unit)
  $display("[CHECKPOINT] dut.cache_state = %0h", dut.cache_state);
  $display("[CHECKPOINT] dut.arb_grant   = %0b", dut.arb_grant);
  $finish;   // end here so the waveform dump stops at the checkpoint
end

// If state is already wrong at the checkpoint, the bug is earlier.
// Bisect again. If state is clean, the bug is in the second half.

 

Bisecting stimulus

Random stress tests generate many transactions. Most of them are irrelevant to the failure. To find the minimal reproducing case:

        Reduce the test length. If the failure occurs at transaction 847, try running only 500 transactions with the same seed. If it still fails, try 200. You’re looking for the minimum number of transactions that trigger the bug.

        Disable randomization categories. If you’re generating random addresses, lengths, and burst types, try fixing the address to a simple value. If the failure goes away, the address randomization is relevant. If it persists, it’s not.

        Print the specific sequence items that were generated around the failure window. Compare them against your protocol spec manually. Often, you’ll spot the illegal or corner-case combination immediately.

 

The minimal reproducing case is worth the investment

A 10-transaction sequence that reproduces a bug is infinitely more useful than a 10,000-transaction sequence that contains it. You can read 10 transactions. You can hand-trace them against the spec. You can add them to your regression as a directed test. Take the time to reduce.
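The length-reduction step is easy to automate. A sketch, assuming a hypothetical `+num_txns` plusarg and a `run_test.sh` wrapper that exits nonzero on failure (both are stand-ins for your environment):

```shell
# Halve the transaction count until the failure disappears; the smallest
# still-failing count brackets the trigger.
n=847
while [ "$n" -gt 1 ]; do
  if ./run_test.sh +ntb_random_seed=42891 +num_txns="$n" > /dev/null 2>&1; then
    echo "passes at $n txns; trigger lies between $n and $(( n * 2 ))"
    break
  fi
  echo "still fails at $n txns"
  n=$(( n / 2 ))
done
```

A proper bisection (narrowing between the last-failing and first-passing counts) is a few more lines, but halving alone usually shrinks the window enough to read the remaining transactions by hand.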


Step 4: Form a hypothesis before opening the waveform

By now, you should have:

        The exact time of the first error

        A rough understanding of what the testbench was doing when it failed

        A reduced or bisected failure scenario

        A suspicion about which component is involved — the DUT, the driver, the monitor, or the scoreboard

 

Write it down. Seriously. Not in your head — in a text file, a notepad, anywhere. Something like:

// Hypothesis (2:14am):
// - Failure occurs at t=1,842,000ns, first seen at t=1,839,440ns in scoreboard
// - Last transaction was a WRITE burst of length 256 to address 0xFFFF_FF00
// - Suspect: address wrapping at 4KB boundary not handled in write path
// - Evidence for: DUT spec section 4.3.2 mentions wrap behavior is optional
// - Evidence against: same burst length passes with address 0x0000_FF00
//
// Plan: open waveform, navigate to t=1,839,000ns, examine
//   dut.write_addr_gen and dut.wrap_detect signals

 

This forces you to be precise about what you’re looking for before you open the waveform. It also creates a record: if your hypothesis is wrong, you know exactly what you tested and can rule it out cleanly.

Step 5: Open the waveform with intention

Now you open the waveform viewer. You have a specific time, a specific hypothesis, and a specific set of signals you want to examine. This session should take 15–30 minutes, not three hours.

What I look at, in order:

        Navigate directly to the failure time minus a small margin (e.g., t=1,839,000ns for a failure at t=1,842,000ns). Confirm whether the signals named in the hypothesis are actually misbehaving.

        Trace backward from the anomaly. If dut.wrap_detect is wrong at t=1,839,440ns, when did it last transition? What drove that transition? Work backward through the combinational or sequential logic until you find the input that caused it.

        Check the monitor capture against the raw DUT pins. Sometimes the bug is not in the DUT at all — the monitor is misinterpreting a valid signal. Compare what the monitor logged against what the waveform actually shows.

        Look for X propagation sources. An X on bus_data[7] usually originates from an uninitialized register, a case statement without a default, or a memory read before write. Use the ‘trace X’ feature in your waveform tool to walk backward to the source.

 

// In Verdi / nWave: trace X source
// Right-click the X value → "Trace X" → follows back to origin

// Most viewers are also scriptable in Tcl for navigating large dumps;
// exact command names vary by tool and version, so check your viewer's
// Tcl reference. The operations you want scripted: jump to a specific
// time, add a named set of signals, and search for the next transition
// of a signal.

 

An X on a data bus is not the bug. It’s the symptom. The bug is the uninitialized state, the missing reset, or the case statement without a default that produced the X three hundred cycles earlier.
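For the case-without-default class specifically, a blunt textual triage can shortlist files to inspect before you fire up a lint run. A sketch, not a lint replacement; the `rtl/` path and glob are illustrative:

```shell
# Flag RTL files where 'case' appears on more lines than 'default:'.
# Crude heuristic (comments count too), but quick at 2am.
for f in rtl/*.sv; do
  [ -e "$f" ] || continue
  cases=$(grep -cEw 'case|casez|casex' "$f" || true)
  defaults=$(grep -cE 'default[[:space:]]*:' "$f" || true)
  if [ "${cases:-0}" -gt "${defaults:-0}" ]; then
    echo "$f: $cases case statement(s), $defaults default(s)"
  fi
done
```

Anything it flags still needs human eyes; a real lint or X-propagation flow is the authoritative answer.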

Step 6: Distinguish DUT bugs from testbench bugs

This step deserves its own discussion because the consequences of getting it wrong are significant. If you file a DUT bug that turns out to be a testbench bug, you’ve wasted a designer’s time and lost credibility. If you fix a testbench bug when the DUT was actually wrong, you’ve masked a real silicon issue.

Before concluding a bug is in the DUT, confirm:

        The monitor is capturing signals at the correct protocol phase (setup time, hold time, valid strobe). A monitor that samples one cycle early will see incorrect data on a correctly behaving bus.

        The reference model handles this specific transaction type. If this is the first time a burst of length 256 to a near-boundary address has been generated, your reference model may never have been exercised for this case.

        The scoreboard comparison is correct. If the comparison function does byte-swapping for endianness, does it handle the boundary case? Is it comparing the right fields?

 

If all three check out, the bug is in the DUT. Write it up with: the exact failing scenario, the expected behavior per spec (with section reference), the observed behavior, the minimal reproducing test, and the waveform screenshot with annotations.

A well-written bug report is half of verification’s value. A vague one (“Bus goes X sometimes”) sends the designer on their own debugging expedition and doubles the time to fix.

Step 7: Know when to stop and ask for help

There is a skill that takes years to develop: knowing when you have been debugging the wrong thing for too long.

I give myself a time limit. If I have been working on a single failure for more than 90 minutes without meaningful progress — without ruling out hypotheses or narrowing the failure space — I stop. I write down everything I know, everything I’ve ruled out, and the current state of my best hypothesis. Then I either sleep (if possible) or ask for help.

Fresh eyes find bugs that tired eyes have walked past twenty times. This is not a failure of skill. It is a failure of sleep.

When asking for help, the write-down is essential. “Hey, this thing is broken” requires the other person to repeat all your debugging work. “Here’s what fails, here’s what I’ve ruled out, here’s my current hypothesis,” lets them start where you left off.

Step 8: Build the playbook entry

Once you’ve found and understood the bug — whether it’s in the DUT or the testbench — do one more thing before you close the ticket: write a one-paragraph entry for your personal debug playbook.

// Debug playbook entry — 2024-03-15
// Symptom: X on data bus bits, appearing mid-burst, scoreboard mismatch
// Root cause pattern: address wrap at 4KB boundary, missing in DUT write path
// How to spot it: look for burst addr+len crossing 0x_____000 boundary
// Key signals: dut.wrap_detect, dut.write_addr_gen.next_addr
// Time to debug: 2h 20m
// Added directed test: tc_write_wrap_4k_boundary (now in regression)

 

A playbook is not a formal document. It’s a searchable personal record of patterns you’ve seen. Address wrap bugs look like address wrap bugs. Uninitialized reset bugs look like uninitialized reset bugs. Protocol phase sampling bugs look like protocol phase sampling bugs. After a few years, you will recognize the pattern before you finish reading the log.

That pattern recognition — the ability to narrow a complex failure to a root cause class in minutes instead of hours — is what separates a senior verification engineer from a junior one. It is not magic. It’s accumulated, documented experience.

The 2am summary

Written out in full, the process looks long. In practice, for most failures, steps 0 through 3 take 20–30 minutes. Step 4 takes 5 minutes. Step 5 takes 15–30 minutes. Steps 6 and 7 are judgment calls that take as long as they take.

The discipline is in the order. Resist the waveform. Read the log. Bisect. Form a hypothesis. Then confirm.

The engineers who are fast at debugging are not faster at staring at waveforms. They are faster at forming correct hypotheses, which means they spend less total time in the waveform viewer. The work is almost entirely in your head and in the log file, before you’ve opened a single signal.

The deadline is in eight hours. You have a hypothesis. Go confirm it.

 

Next in this series:

Article 05 — Formal verification isn’t scary — you’re just using it wrong
