DESIGN VERIFICATION · ARTICLE 06 OF 06
What AI tools actually change about verification — and what they don’t
AI can write a UVM agent. It cannot tell you if your coverage model is correct. That distinction is the whole game right now.
Every few months, a new wave of content arrives claiming that AI is about to transform chip design. Sometimes the claim is that AI will write RTL. Sometimes it’s that AI will generate testbenches. Sometimes it’s that AI will close coverage automatically, or debug failing regressions, or replace the verification engineer entirely.
None of this is happening at the level the headlines suggest. Some of it is happening at a narrower, more useful level that the headlines miss entirely. The gap between what AI tools actually do in a verification context and what gets written about them is wide enough to cause real confusion — both among engineers trying to decide whether to invest time in these tools, and among managers trying to assess what they’re paying for.
This is the last article in this series, and I want to end with something honest: a clear-eyed account of where AI tools are genuinely useful in verification today, where they fall short in ways that matter, and what that means for how you should think about your career and your skill development over the next several years.
I’ve been using LLM-based coding assistants in my verification work for a while. What follows is not a vendor comparison or a benchmarking exercise. It’s an account of what actually changed in my workflow — and what didn’t.
Where AI genuinely helps
Let’s start with the honest wins, because there are real ones.
Boilerplate generation and structural scaffolding
UVM has significant boilerplate. A correctly structured agent requires a driver, a monitor, a sequencer, a sequence item, a coverage collector, an agent wrapper, and a set of correctly wired analysis ports — all with factory registration, phase-correct logic, and consistent naming. Writing this from scratch takes time that is largely mechanical. It is also a surface area for small, tedious mistakes: a missing `uvm_component_utils` macro, a wrong phase function signature, a port connection that compiles but doesn’t wire correctly.
AI assistants are genuinely good at this. Given a clear description of what you want — “a UVM agent for an AXI4-Lite slave interface, with a passive monitor and an active driver, sequence item fields matching this signal list” — a current LLM will produce, in seconds, a structurally correct scaffold that would have taken 30–45 minutes to write by hand.
// Prompt pattern that works well:
// "Generate a UVM sequence item for an AXI4-Lite write transaction.
// Fields: addr (32-bit), data (32-bit), strb (4-bit), prot (3-bit).
// Include rand declarations, constraints for aligned access,
// and a do_copy / do_compare implementation."
// What you get back: a usable starting point with correct UVM macros,
// reasonable constraints, and the boilerplate copy/compare methods.
// What you still need to verify: constraint correctness against your
// specific protocol spec, field ranges, and interaction constraints.
The key word is scaffold. The generated code is a starting point, not a finished product. The constraint values need to be checked against the spec. The field names need to match your interface naming convention. The copy and compare implementations need to handle any non-trivial fields your protocol has. But the structural skeleton is correct, and that’s the part that was purely mechanical anyway.
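For concreteness, here is a trimmed sketch of the kind of scaffold such a prompt returns. Everything in it is illustrative: the class name, the field macros, and the alignment constraint are assumptions for this sketch, not the output of any particular tool.
// Illustrative scaffold only. The class name, fields, and constraint
// are assumptions for this sketch, not output from any specific tool.
class axi4l_write_item extends uvm_sequence_item;
  rand bit [31:0] addr;
  rand bit [31:0] data;
  rand bit [3:0]  strb;
  rand bit [2:0]  prot;

  // Aligned access on a 32-bit bus: the two low address bits are zero.
  constraint c_aligned { addr[1:0] == 2'b00; }

  // Field macros give you copy/compare/print plumbing for free.
  `uvm_object_utils_begin(axi4l_write_item)
    `uvm_field_int(addr, UVM_ALL_ON)
    `uvm_field_int(data, UVM_ALL_ON)
    `uvm_field_int(strb, UVM_ALL_ON)
    `uvm_field_int(prot, UVM_ALL_ON)
  `uvm_object_utils_end

  function new(string name = "axi4l_write_item");
    super.new(name);
  endfunction
endclass
If your team hand-writes do_copy and do_compare instead of using field macros (a common performance convention), that is exactly the kind of house rule the scaffold will not follow unless you tell it to.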
Generating directed test cases from written descriptions
One of the more surprising of the genuinely useful applications: describing a corner case in English and asking for a directed test sequence. This works better than many engineers expect, because the scenarios that matter most in verification — boundary conditions, protocol corner cases, error injection sequences — are well-described in natural language long before they’re translated into constrained-random sequences or directed tests.
// Prompt: "Write a UVM sequence that exercises AXI4-Lite write
// followed immediately by a read to the same address, with no
// intervening cycles. The read response should return the written
// value. Include a check that the read data matches the write data."
// The generated sequence will be structurally plausible.
// Review what needs careful checking:
// - Timing assumptions ("no intervening cycles" is protocol-specific)
// - How the check is implemented (inline vs. scoreboard notification)
// - Whether the sequence properly uses p_sequencer or is standalone
This is not magic. The LLM is pattern-matching on UVM code it has seen. It doesn’t know your specific protocol variant or your DUT’s latency characteristics. But for well-known protocols (AXI, APB, I2C, SPI), the generated sequences are usually directionally correct and save meaningful time.
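For reference, a trimmed sketch of what such a generated sequence tends to look like. The item type, its fields, and the enum values here are hypothetical; the comments flag the spots from the review list above.
// Illustrative sketch only: axi4l_item and its fields are hypothetical.
class wr_rd_same_addr_seq extends uvm_sequence #(axi4l_item);
  `uvm_object_utils(wr_rd_same_addr_seq)

  function new(string name = "wr_rd_same_addr_seq");
    super.new(name);
  endfunction

  task body();
    axi4l_item wr, rd;
    // Write to a random aligned address.
    `uvm_do_with(wr, { txn_type == WRITE; addr[1:0] == 2'b00; })
    // Read back the same address. Whether this truly means "no
    // intervening cycles" depends on your driver, not this sequence.
    `uvm_do_with(rd, { txn_type == READ; addr == wr.addr; })
    // Inline check: a scoreboard is usually the better home for this,
    // and rd.data is only valid if your driver's response path fills it.
    if (rd.data !== wr.data)
      `uvm_error("WR_RD", $sformatf("wrote 0x%0h, read back 0x%0h",
                                    wr.data, rd.data))
  endtask
endclass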
Log summarization and first-pass failure triage
A regression log from a complex stress test can be 500,000 lines. The relevant portion is almost always fewer than 200. Finding it manually — grepping for errors, tracing back to the first UVM_FATAL, extracting the transaction context — is the kind of mechanical work that AI tools handle well.
Current LLM-based tools (both general-purpose assistants and purpose-built EDA integrations) can read a truncated log, identify the first error, summarize the transaction history leading up to it, and suggest candidate root cause categories. This doesn’t replace the debug process described in Article 04 of this series. It compresses the first 20 minutes of that process into 2 minutes.
On log summarization in practice
Most LLMs have context window limits that make feeding a full simulation log impractical. The useful pattern: pipe only the first error and surrounding context window (say, 500 lines before and after the first UVM_FATAL) into the assistant. Ask for a summary of what was happening and a hypothesis list. Use this as a starting point for your own analysis, not a conclusion.
Constraint writing and refinement
Writing SystemVerilog constraints is iterative work. You write a constraint, run a few hundred simulation cycles, look at the distribution of generated values, decide the distribution is wrong, and refine. AI assistants are useful in this loop because constraint syntax is regular enough that an LLM rarely makes syntactic errors, and the constraint intent can be described naturally.
// Prompt: "Add a constraint that ensures addr is always 4KB-page-aligned
// when txn_type is WRITE and len > 128. For other cases, addr can be
// any value within the valid range."
// Generated constraint (verify the logic before using):
constraint c_aligned_long_write {
  if (txn_type == WRITE && len > 128) {
    addr[11:0] == 12'h000; // 4KB alignment
  }
  addr inside {[ADDR_MIN:ADDR_MAX]};
}
// What to verify: the alignment mask, the range bounds, whether
// the constraint interacts correctly with other active constraints.
Where AI fails — and why it matters
The useful wins above all share a common property: they involve generating syntactically correct code in a well-known style, for a well-known framework, based on a clearly specified intent. This is the domain where LLMs are strong.
The failures are also systematic, and they cluster around the things that actually determine whether your verification is correct.
Coverage model correctness
This is the critical one. An LLM asked to generate a coverage model for a given design will produce something that looks like a coverage model. It will have covergroups, coverpoints, and bins. The structure will be syntactically valid. It will not be correct in the way that matters.
Coverage model correctness — as discussed in Article 03 of this series — requires knowing which scenarios in this specific design, with this specific microarchitecture and this specific set of edge cases, are likely to harbor bugs. It requires having read the spec carefully enough to know which field interactions trigger distinct behavior. It requires understanding what “closed” means for this particular design, not for coverage models in general.
LLMs don’t have this knowledge. They can’t read your spec, reason about your DUT’s failure modes, and produce a coverage model that reflects what actually needs to be verified. What they produce is a template — a coverage model that covers the obvious fields and the obvious crosses, misses the subtle interactions, and will give you 100% closure without giving you confidence.
An AI-generated coverage model will close at 100%. It will not tell you whether 100% means anything. That judgment is yours, and it requires knowledge the tool does not have.
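To make that concrete, here is a hypothetical contrast. The first coverpoints are the template level a tool will happily produce; the cross at the bottom encodes an invented but representative piece of design knowledge that no training corpus could contain.
// Hypothetical example: signal names and the retry scenario are invented.
covergroup cg_write_txn @(posedge clk iff valid);
  // Template-level coverage: obvious fields, obvious bins.
  cp_len  : coverpoint len { bins len_short = {[1:16]};
                             bins len_long  = {[17:256]}; }
  cp_type : coverpoint txn_type;

  // Design-specific coverage: your spec (not the protocol) says a long
  // write arriving while a low-power request is pending takes a retry
  // path added late in the design. Only a spec reader knows to add this.
  cp_lp   : coverpoint lp_req_pending;
  x_retry : cross cp_len, cp_lp {
    bins long_write_during_lp =
      binsof(cp_len.len_long) && binsof(cp_lp) intersect {1};
  }
endgroup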
Constraint correctness under interaction
Individual constraints are something AI tools handle reasonably well, as shown above. The failure mode appears when constraints interact. A set of five independently correct constraints can be mutually contradictory, or can produce a stimulus distribution that systematically avoids the interesting corner cases. Detecting this requires running the constraint solver, analyzing the generated stimulus distribution, and recognizing that the distribution is skewed.
LLMs cannot do this analysis. They can write constraints. They cannot tell you that your five constraints together make a specific corner case unreachable, because they’re not running your solver and they’re not analyzing the output distribution. The constraint interaction problem is exactly the kind of reasoning that requires feedback from a running system — feedback that a text-based LLM doesn’t have.
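A minimal illustration with invented fields: each constraint below is defensible on its own, but together they starve a corner case. You discover this by inspecting the solver's output distribution, not by reading the constraint text.
// Hypothetical example: each constraint is individually reasonable.
class burst_txn;
  rand bit [31:0] addr;
  rand bit [8:0]  len; // burst length in beats, 4 bytes per beat

  constraint c_legal_len   { len inside {[1:256]}; }
  constraint c_perf_bias   { len dist { [1:16] := 9, [17:256] := 1 }; }
  constraint c_no_4k_cross { addr[11:0] + (len * 4) <= 4096; }
endclass

// Together: long bursts are already rare (c_perf_bias), and c_no_4k_cross
// pushes every long burst toward the bottom of the page. The interesting
// corner, a long burst ending exactly at the 4KB boundary, is legal but
// almost never generated. Only the stimulus distribution reveals this.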
Protocol-specific edge case reasoning
AXI4 has a spec. LLMs have been trained on AXI4 documentation. But AXI4 implementations vary: different versions of the spec, different optional features, different implementation-defined behaviors in the unspecified regions. Your DUT is one specific AXI4 implementation, and the bugs in it are in the specifics, not in the general protocol behavior that the LLM learned.
When you ask an LLM “what corner cases should I test for this AXI4 slave?”, you get back the standard list: wrap boundaries, exclusive access sequences, narrow transfers, unaligned addresses. This list is not wrong. It’s also not specific to your DUT, your microarchitecture, or your implementation choices. The corner cases that will actually find bugs in your specific design require knowledge that doesn’t exist in any training corpus — it exists in your spec, your architecture documentation, and your understanding of how the design was built.
The difference between protocol knowledge and design knowledge
LLMs know AXI4. They don’t know your AXI4 implementation. The bugs that ship in silicon are almost never in the parts of a protocol that are well-documented and widely understood. They’re in the design-specific interpretation of the unspecified regions, the interaction between protocol handling and power management, the corner case in the retry logic that the designer added at rev 2.3. No training corpus contains this. You have to find it yourself.
Testbench architecture decisions
Should this block’s scoreboard be integrated into the subsystem environment or kept separate? Should the coverage model be split by transaction type or by interface? How should phase synchronization work when the DUT has multiple independent interfaces with different latency characteristics? Should formal be used for this arbiter, or is the state space too large?
These are architecture decisions. They require understanding your specific design, your team’s reuse goals, your project schedule, and your prior experience with similar blocks. LLMs will answer these questions confidently. The answers will be generic best practices that may or may not apply to your situation. Use them as a starting-point checklist, not as a recommendation.
Debugging non-trivial failures
AI-assisted log summarization is genuinely useful for first-pass triage, as described above. For the actual debugging work — the hypothesis formation, the bisection, the root cause identification — current AI tools add limited value and can actively mislead.
The failure mode is confident wrongness. An LLM given a failing simulation log will often produce a confident, plausible-sounding explanation that is incorrect. It pattern-matches on surface features of the log and generates an explanation that fits those features, without the underlying causal reasoning that would distinguish a correct hypothesis from a plausible-but-wrong one. For an engineer who doesn’t know the design well, this confident wrong answer is worse than no answer, because it sends them down the wrong debugging path.
Use AI tools to compress the log and surface the first error quickly. Do the hypothesis formation yourself.
The skills that become more valuable
Given the genuine wins and the genuine failures above, a pattern emerges. AI tools are strong where the work is syntactically regular, stylistically consistent with a large training corpus, and correctness-checkable by running the output. They are weak where the work requires design-specific knowledge, judgment about what matters, and reasoning about correctness that can’t be verified by compiling and running.
That pattern has a direct implication for skill development. The skills that become less scarce as AI tools improve are the mechanical ones: writing UVM boilerplate, producing syntactically-correct constraint code, generating a structural scaffold from a description. These are real skills, and they still require review and validation. But the time investment required to perform them at a basic level is shrinking.
The skills that become more valuable are the ones AI tools can’t replicate:
Coverage model design
Writing a coverage model that correctly captures verification intent for a specific design is genuinely hard. It requires reading the spec with the specific goal of finding the interactions that could harbor bugs. It requires architectural judgment about which crosses are meaningful and which are combinatorial noise. It requires knowing when 100% closure is worth defending and when it isn’t.
As AI tools take over more of the boilerplate work, the value of this judgment increases. If everyone on your team can generate a UVM agent scaffold in 30 seconds, the differentiator is whether the coverage model that agent feeds into is correct. This skill is not learnable from a tutorial. It comes from reading specs carefully, building models that turn out to be wrong, and figuring out why.
Debug methodology
The process described in Article 04 — resisting the waveform, reading the log as a story, bisecting in time and stimulus, forming a written hypothesis before confirming it — is a reasoning process. It is not a pattern-matching process. It requires building a causal model of what the DUT was doing and why it failed, from incomplete and noisy evidence.
This is exactly the kind of reasoning that current AI systems do poorly. It is also the kind of reasoning that distinguishes a DV engineer who can close a complex bug in two hours from one who spends two days staring at waveforms. The gap is entirely in methodology, and methodology is built through deliberate practice, not through using better tools.
Formal property specification
Writing SVA properties that are both correct and useful — that capture genuine design intent, that aren’t vacuously true, that have assumptions tight enough to avoid spurious counterexamples but loose enough to cover the interesting cases — is a skill that requires understanding both the property language and the design being verified. AI tools can generate SVA syntax. They cannot tell you whether the property is meaningful.
As formal verification becomes more accessible (more engineers using it, better tool integration, lower entry barriers), the engineers who can write good properties will have a significant advantage over those who can only generate syntactically correct ones.
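A short sketch of that gap, using hypothetical signals inside a module or interface. Both blocks below are syntactically clean and easy to generate; the cover property is what tells you whether the assertion ever proved anything.
// Hypothetical signals: clk, rst_n, req, gnt.
// Syntactically correct and easy for a tool to generate:
property p_req_gets_gnt;
  @(posedge clk) disable iff (!rst_n)
    req |-> ##[1:4] gnt;
endproperty
a_req_gets_gnt : assert property (p_req_gets_gnt);

// The judgment a tool cannot supply: if req never rises under your
// assumptions, the assertion passes vacuously and proves nothing.
// Pairing every assertion with a cover makes vacuity visible.
c_req_gets_gnt : cover property (
  @(posedge clk) disable iff (!rst_n) req ##[1:4] gnt
);
This is the same assert-plus-cover pairing recommended in the roadmap at the end of this article: the cover is cheap to write, and it converts a silent vacuous pass into a visible hole.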
Architecture and reuse judgment
The decisions about testbench structure, environment layering, and reuse strategy that AI tools handle poorly are also the decisions that have the highest leverage on project efficiency. A well-architected testbench gets reused at the chip level. A poorly-architected one gets rewritten. That rewrite costs months.
The engineer who can look at a new block and make good decisions about agent structure, scoreboard decomposition, and phase synchronization — decisions that will pay off when the block gets integrated six months later — is providing value that no AI tool is close to providing.
The automation of mechanical work doesn’t reduce the importance of judgment. It concentrates it. When boilerplate is free, the differentiator is entirely in the decisions that require knowing what matters.
How to evaluate AI tools as a DV engineer
The semiconductor EDA market has responded to the current AI moment with a wave of AI-enhanced tool announcements. Some of these represent genuine capability improvements. Some are marketing repositioning of existing features. Evaluating them requires a framework that isn’t "does this tool generate code?" — most of them do — but "does the generated output require the same verification effort as hand-written output?"
The question that matters for any AI tool in a verification context:
• What is the failure mode when the tool is wrong? Is the error obvious (syntax error, compile failure) or subtle (structurally correct code that silently misses the verification intent)?
• Does using this tool shift work from me to the tool, or does it shift work from productive engineering to AI output review? If reviewing the output takes as long as writing it, the tool has not saved time — it has changed the nature of the work without reducing the total.
• Is the tool operating on a domain where it has been validated? An LLM trained on general UVM code is not validated on your specific protocol variant or your company’s internal naming conventions. The closer the generated output is to your actual environment, the more review it requires.
• Does the tool’s output get tested? Code that goes directly from AI generation to a running regression is higher risk than code that gets reviewed, compiled, simulated against a known-good reference, and integrated incrementally. The speed benefit of AI generation is partly offset if it requires a slower, more careful integration process.
None of these questions argue against using AI tools. They argue for using them with the same skepticism you’d apply to any new methodology: try it on a bounded scope, measure the actual time and quality impact, adjust your workflow based on what you observe rather than what the marketing materials claim.
The bet I’m making
Every prediction about AI and engineering is a bet. Mine is this:
The productivity gains from AI in verification will be real but unevenly distributed. Engineers who use AI tools to accelerate the mechanical work — boilerplate, scaffolding, first-pass log triage, constraint syntax — will get those hours back. Engineers who use AI tools as a substitute for understanding coverage model design, debug methodology, and formal property specification will produce verification environments that look correct and aren’t.
The gap between those two outcomes is entirely determined by whether you understand what the tool is doing and where it’s likely to be wrong. That understanding requires the same deep knowledge of verification methodology that it has always required. AI tools don’t reduce the importance of that knowledge. They raise the cost of not having it, because the failure modes of AI-assisted verification are less visible than the failure modes of slow, manual verification.
A testbench that takes three months to write and misses a coverage hole is a problem you can see — the coverage report tells you. A testbench that takes three weeks to generate and has a subtly incorrect coverage model gives you 100% on the dashboard and ships the bug.
AI tools make the fast path faster. They don’t change what the fast path misses. Know what the fast path misses.
How to build your AI fluency roadmap
If you want to integrate AI tools into your verification workflow in a way that captures the genuine wins without the failure modes, here is the approach I’d recommend:
• Start with boilerplate generation and measure the actual time savings over one month. Not in theory — in practice, on your specific codebase, with your specific naming conventions and style requirements. Calibrate your expectations against observed reality.
• Build a personal evaluation set: three or four verification tasks you know well, with known-correct outputs. When you try a new AI tool, run it on your evaluation set and check the outputs against your reference. This gives you a calibrated sense of where the tool is reliable and where it needs review.
• Never use AI-generated coverage models without a full correctness review against your spec. Treat the generated model as a first draft that covers the obvious cases, then add the crosses and boundary conditions that require spec knowledge. The AI gets you to 60%; you do the remaining 40% that matters.
• Use AI for log summarization to identify the first error and the surrounding context. Stop there. Do the hypothesis formation and root cause analysis yourself, using the process in Article 04.
• If you use AI to generate SVA assertions, run them in simulation first with known-good stimulus and confirm they don’t fire. Then add cover properties to confirm they’re not vacuously true. Only then run them formally.
• Keep a running log of AI tool failures: cases where the generated output was wrong in a non-obvious way. This log is your most valuable calibration data. After three months, read it before evaluating any new AI tool.
Closing the series
This series started with an argument: design verification is the hardest engineering job nobody talks about, and it deserves to be treated as a first-class discipline. The five articles between that claim and this one were an attempt to give that discipline a more concrete shape — not the textbook version, but the version that shows up at 2am when a regression fails and the tape-out is nine hours away.
UVM anti-patterns matter because they determine whether your testbench is maintainable six months from now. Coverage closure matters because 100% on a weak model is indistinguishable from 100% on a strong one unless you understand the difference. Debug methodology matters because the engineers who find root causes in two hours have a process, and the process is learnable. Formal verification matters because there is a class of bugs that simulation cannot prove absent, and formal can. And AI tools matter because they are changing the economics of the mechanical work, in ways that are real but narrower than the headlines suggest.
What none of these articles can give you is the judgment that comes from doing the work. Reading about coverage model design is not the same as writing a coverage model, watching it close at 100%, shipping the chip, finding the bug in silicon, and tracing it back to a cross you didn’t write. That experience is irreplaceable and it is the foundation of everything else.
Build the methodology. Develop the judgment. Use the tools where they actually help. And when someone asks you what design verification is, explain it clearly — because the discipline is worth explaining, and the people who do it well are worth knowing about.
This concludes the six-part design verification blog series.
Articles 01–06 cover: the discipline itself, UVM patterns, coverage closure, debug methodology, formal verification, and AI tools in verification.