The Real Cost of Prompt-Based Underwriting: Tokens, Time, and Trust

Minervian AI Research · 2026 · 6 min read

There is a version of AI-assisted underwriting that feels like progress. You drop an offering memorandum into a chat interface, paste in a rent roll, attach a trailing twelve-month income statement, and start prompting. Thirty minutes later, you have something that looks like an underwriting output.

What you also have, though you may not have tracked it, is a token bill somewhere between 500,000 and 1,000,000 tokens, a workflow that can't be audited, and a calculation that may or may not repeat itself tomorrow. This is the real cost of prompt-based underwriting. For fund managers evaluating AI infrastructure at any serious volume, it deserves a more rigorous accounting than it typically receives.

300K tokens for a single run to first output (one OM + rent roll + T12)

15–20M tokens across OM, rent roll, and code over 10–15 iterations (quadratic compounding, not linear)

$45–60K annual raw token spend at 200 deals (mostly on deals that don't close)

Token Cost

Context windows don't reset between iterations. Each run carries every prior prompt and response forward, a quadratic compounding problem that balloons a 10-run cycle from ~4M tokens to 15–20M.

Time Cost

Ten to fifteen minutes to first output sounds minor on one deal. Across a pipeline of live bid processes and simultaneous markets, it becomes a structural bottleneck that limits deal velocity and redirects analyst attention.

Audit & Reproducibility

Language models are probabilistic. The same prompt applied twice will not always follow the same calculation path. For funds with LP audit expectations or fiduciary obligations, this is not a small risk; it is a structural liability.

The Economics Are Worse Than They Appear

Token costs are easy to underestimate because they accumulate in ways that aren't immediately visible. A raw LLM workflow doesn't just read a document once; it reads it every time you extend the context. Every follow-up prompt carries the prior conversation. Every iteration reprocesses the same rent roll, the same OM, the same T12. By the time you've worked through vacancy assumptions, refined the debt service calculation, and run a few cap rate scenarios, you've likely crossed 500,000 tokens for a single deal.

A single first-output run on a typical deal consumes roughly 100–300K tokens, depending on the input format (concise prompts or expanded OM). But context windows don't reset between iterations. Each subsequent run carries the full prior context forward: every prompt, every response, every block of model-generated analysis. Run 2 is already 400–500K tokens. By run 10, a single simple iteration can exceed 2M tokens as model outputs bulk up the context window.

This is a quadratic compounding problem, not a linear one. A realistic 10–15 run underwriting cycle consumes closer to 15–20M tokens, not 3–4M.
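The compounding can be sketched with a deliberately simple model. Assume, purely for illustration, that the first run costs 300K tokens and that every later run reprocesses the full prior context plus roughly 150K tokens of new prompts and model output. The 150K growth figure is an assumption chosen to match the 400–500K second run described above, not a measured value, and the function names are hypothetical.

```python
def run_sizes(first_run_tokens: int, growth_per_run: int, runs: int) -> list[int]:
    """Tokens processed on each run when the full prior context is carried forward.

    Run i (0-based) re-reads the original documents plus everything added
    by the i runs before it, so per-run size grows linearly.
    """
    return [first_run_tokens + i * growth_per_run for i in range(runs)]


def cycle_total(first_run_tokens: int, growth_per_run: int, runs: int) -> int:
    """Total tokens billed across the whole underwriting cycle."""
    return sum(run_sizes(first_run_tokens, growth_per_run, runs))


def naive_total(first_run_tokens: int, runs: int) -> int:
    """The intuitive (and wrong) estimate: every run costs the same as the first."""
    return first_run_tokens * runs


if __name__ == "__main__":
    FIRST_RUN = 300_000   # from the article's single-run figure
    GROWTH = 150_000      # assumed context growth per iteration
    for runs in (10, 15):
        print(runs, naive_total(FIRST_RUN, runs), cycle_total(FIRST_RUN, GROWTH, runs))
```

Under these assumptions a 10-run cycle totals 9.75M tokens against a naive estimate of 3M, and a 15-run cycle totals 20.25M. Per-run size grows linearly, so the cumulative bill grows with the square of the run count. Sessions in which model outputs bulk up the context faster than this assumed rate can land higher.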

While the first run seems to get you to 80% of the work, the remaining 20% will consume roughly 95% of your token spend. And token costs are only part of the compute bill; every one of those tokens is processed by a large model running on expensive GPU infrastructure. The latency you feel waiting for each response is the compute cost made physical.

By contrast, a purpose-built underwriting platform uses an LLM for what it's actually good at: extracting data from unstructured documents and interpreting language in context. That's a few thousand tokens per document. The actual calculations, from NOI, DSCR, IRR, and cash-on-cash return to waterfall distributions, run through a backend engine that doesn't need to think. It just computes. The result is a 50 to 100× cost reduction per deal, not through any compromise in output quality, but through a cleaner division of labor.
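To make "it just computes" concrete, here is a minimal sketch of what the calculation side of that division of labor might look like. It uses the standard textbook definitions of NOI, DSCR, and cash-on-cash return; the `DealInputs` type, its field names, and the sample figures are hypothetical and do not describe any particular platform's engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DealInputs:
    """Structured values an extraction step would hand to the engine (hypothetical shape)."""
    effective_gross_income: float  # annual, after vacancy and credit loss
    operating_expenses: float      # annual
    annual_debt_service: float     # principal + interest
    equity_invested: float


def noi(d: DealInputs) -> float:
    """Net operating income: effective gross income minus operating expenses."""
    return d.effective_gross_income - d.operating_expenses


def dscr(d: DealInputs) -> float:
    """Debt service coverage ratio: NOI divided by annual debt service."""
    return noi(d) / d.annual_debt_service


def cash_on_cash(d: DealInputs) -> float:
    """Pre-tax cash flow after debt service, divided by equity invested."""
    return (noi(d) - d.annual_debt_service) / d.equity_invested


if __name__ == "__main__":
    deal = DealInputs(1_200_000, 480_000, 500_000, 2_000_000)  # illustrative numbers
    print(noi(deal), round(dscr(deal), 2), round(cash_on_cash(deal), 4))
```

None of these functions consumes a token. The model's job ends once the four input fields are populated; everything after that is ordinary, testable arithmetic.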

Pure-LLM workflow annual token spend: $45K–$60K (200 deals, excluding memo write-up or asset management)

Purpose-built cost reduction: 50–100× (per deal vs ad hoc LLM workflow)
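The annual figure can be checked with back-of-envelope arithmetic. The article does not state a token price; the sketch below assumes a blended rate of $15 per million tokens, which is simply the rate that reconciles 15–20M tokens per deal with $45K–$60K across 200 deals. Treat it as an inferred assumption, not a quoted price. The function name is hypothetical.

```python
def annual_token_spend(tokens_per_deal: int, deals_per_year: int,
                       usd_per_million_tokens: float) -> float:
    """Raw annual token spend in USD for a pure-LLM workflow."""
    return tokens_per_deal * deals_per_year * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    RATE = 15.0   # assumed blended USD per million tokens (inferred, not from a price list)
    DEALS = 200
    low = annual_token_spend(15_000_000, DEALS, RATE)
    high = annual_token_spend(20_000_000, DEALS, RATE)
    print(f"annual: ${low:,.0f} to ${high:,.0f}")
    print(f"per deal: ${low / DEALS:,.2f} to ${high / DEALS:,.2f}")
    # Applying the article's 50-100x reduction to the per-deal range
    print(f"reduced per deal: ${low / DEALS / 100:,.2f} to ${high / DEALS / 50:,.2f}")
```

At that assumed rate the per-deal cost is $225 to $300, and a 50–100× reduction would bring it to roughly $2.25 to $6.00.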

Thirty Minutes Is Not a Workflow

A first prompt may take 5–10 minutes for the LLM to produce a response. Additional inputs and tweaks can stretch that to 20–30 minutes. This problem is underappreciated. In the context of a single deal, it seems like a minor inconvenience. In the context of a pipeline, it is a workflow bottleneck with real downstream consequences.

Fund managers and acquisitions teams don't screen deals in isolation. They screen in batches: during live bid processes, at the end of a broker's marketing period, and across multiple markets simultaneously. When each screening takes half an hour to return a first usable number, the team's effective throughput drops sharply. Associates spend time shepherding prompts rather than analyzing results. Senior staff wait on outputs before they can allocate attention.

The practical difference between a 30-minute workflow and a 30-second one isn't only speed but also how teams structure their day, how many deals they can seriously evaluate in a given week, how much time they can allocate to efforts such as due diligence and relationship building, and whether AI becomes a genuine force multiplier or an elaborate bottleneck.
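The throughput gap is easy to put in numbers. The sketch below assumes one analyst screening deals sequentially; the 40-deal batch is a hypothetical size chosen for illustration, and the function names are not from any real tool.

```python
def screenings_per_hour(seconds_per_screening: float) -> float:
    """How many first-pass screenings one analyst can complete in an hour."""
    return 3600 / seconds_per_screening


def batch_hours(deals: int, seconds_per_screening: float) -> float:
    """Wall-clock hours to screen a batch one deal at a time."""
    return deals * seconds_per_screening / 3600


if __name__ == "__main__":
    BATCH = 40  # hypothetical number of deals arriving in one bid cycle
    for label, seconds in (("30-minute workflow", 30 * 60), ("30-second workflow", 30)):
        print(label, screenings_per_hour(seconds), round(batch_hours(BATCH, seconds), 2))
```

At 30 minutes per screening that is 2 deals an hour and 20 hours for the batch; at 30 seconds it is 120 deals an hour and about 20 minutes.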

A purpose-built platform running the same workflow, from document ingestion and data extraction to financial modeling, returns results in seconds, because it isn't using the model to perform calculations. It's using the model to translate inputs and then handing the math to a system designed to do math. The economics of time aren't captured in token cost alone. They're captured in analyst hours, deal velocity, and whether the technology actually changes behavior or simply replaces one slow process with a different one.

The Trust Problem Is Harder to Quantify, but More Important

Speed and cost are the easy part of this argument. The harder sell, and the more consequential one, is consistency.

When an LLM generates a calculation, you are trusting that this invocation of the model wrote the same logic as the last invocation. Language models are probabilistic systems. They do not execute code; they predict outputs. The same prompt run twice will not always return precisely the same calculation path. Subtle differences in how a model interprets a rent roll, handles a footnote, or applies a vacancy assumption can produce outputs that look identical at first glance and diverge in the detail.

For a single deal in early-stage exploration, this might be acceptable. For a fund with a fiduciary obligation, an LP base with audit expectations, and an investment committee that needs to trace how a number was produced, it is not.

The calculation must be executed, not generated. The logic is defined, tested, and locked, not inferred. This is the audit trail that prompt-based underwriting structurally cannot provide.

A dedicated underwriting platform running calculations through a versioned engine behaves identically on every deal. The same inputs, applied to the same model version, produce the same output. Every time. This isn't a criticism of the underlying models, which are genuinely impressive at document comprehension and data extraction. It's a recognition that “impressive at language” and “appropriate for financial calculation” are different standards, and conflating them creates risk that doesn't show up until it matters most.
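One way to picture "same inputs, same model version, same output" is an engine that stamps every result with its version and a fingerprint of the inputs. This is a minimal sketch under stated assumptions: the `ENGINE_VERSION` string, the input field names, and the `underwrite` function are hypothetical, and a production engine would cover far more than NOI and DSCR.

```python
import hashlib
import json

ENGINE_VERSION = "1.4.2"  # hypothetical version label for the locked calculation logic


def fingerprint(inputs: dict, engine_version: str) -> str:
    """Stable SHA-256 over the inputs and engine version.

    sort_keys makes the hash independent of dictionary key order.
    """
    payload = json.dumps({"engine": engine_version, "inputs": inputs},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def underwrite(inputs: dict) -> dict:
    """Executed, not generated: plain arithmetic plus an audit stamp."""
    noi = inputs["effective_gross_income"] - inputs["operating_expenses"]
    return {
        "noi": noi,
        "dscr": round(noi / inputs["annual_debt_service"], 4),
        "engine_version": ENGINE_VERSION,
        "fingerprint": fingerprint(inputs, ENGINE_VERSION),
    }


if __name__ == "__main__":
    deal = {"effective_gross_income": 1_200_000,
            "operating_expenses": 480_000,
            "annual_debt_service": 500_000}
    print(underwrite(deal) == underwrite(deal))  # identical on every call
```

Storing the fingerprint alongside each output gives a reviewer something concrete to check: rerun the stored inputs against the stated engine version and compare.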

What This Means for Platform Selection

The fund manager or technology decision-maker evaluating AI underwriting infrastructure should be asking a specific set of questions that raw LLM workflows cannot answer satisfactorily.

Can you reproduce this output exactly six months from now?

Can you demonstrate to a limited partner that the methodology applied to deal 47 is identical to the methodology applied to deal 12?

If the model is updated or the prompt is refined, how do you know which prior outputs are now inconsistent with your current approach?

How do you version your underwriting logic?

Purpose-built platforms are designed to answer these questions because they separate the AI layer from the calculation layer. The AI reads documents. The engine runs the math. The outputs are deterministic, traceable, and reproducible. This isn't an argument against AI in underwriting, but an argument for applying it correctly. The extraction and interpretation problems are real, and large language models solve them well. The calculation problem is not an AI problem. It is a software problem, and it has a software solution.
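The questions above become simple lookups once each output is stored with the versions that produced it. The sketch below is illustrative only: the `AuditRecord` fields and the version labels are assumptions, not a description of any specific product's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    """What gets stored next to each underwriting output (hypothetical schema)."""
    deal_id: int
    engine_version: str             # version of the calculation logic
    extraction_prompt_version: str  # version of the document-extraction prompt


def methodology(r: AuditRecord) -> tuple[str, str]:
    """The pair of versions that together define how a number was produced."""
    return (r.engine_version, r.extraction_prompt_version)


def same_methodology(a: AuditRecord, b: AuditRecord) -> bool:
    """Was deal A underwritten exactly the way deal B was?"""
    return methodology(a) == methodology(b)


def stale_deals(records: list[AuditRecord], current: tuple[str, str]) -> list[int]:
    """Deal IDs whose outputs predate the current methodology."""
    return [r.deal_id for r in records if methodology(r) != current]


if __name__ == "__main__":
    records = [AuditRecord(12, "1.4.2", "p7"),
               AuditRecord(47, "1.4.2", "p7"),
               AuditRecord(63, "1.5.0", "p8")]
    print(same_methodology(records[0], records[1]))
    print(stale_deals(records, ("1.5.0", "p8")))
```

With records like these, "is deal 47 consistent with deal 12?" and "which outputs predate the latest change?" are queries against stored data rather than matters of recollection.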

The Bottom Line

Funds that build their underwriting workflows on prompt-based AI infrastructure are making a bet that the trust problem won't surface visibly. For some, it won't, at least not soon.

But as deal volume grows, LP scrutiny increases, and regulators develop more specific expectations around AI-assisted financial decision-making, the absence of a reproducible audit trail becomes an increasingly expensive liability. The token cost and the time cost are real. They are also the smaller concerns.

The real cost of prompt-based underwriting is what happens when you need to explain, whether to an LP, to a regulator, or to yourself, exactly how you arrived at a number, and the honest answer is that a language model generated it, and you can't guarantee it would generate the same one again. The difference between using AI and deploying AI correctly isn't philosophical. It's economic. And at volume, it's decisive.

AI Underwriting · Token Cost · CRE Technology · Audit Trail · LLM in Finance · Investment Infrastructure
