There is a version of AI-assisted underwriting that feels like progress. You drop an offering memorandum into a chat interface, paste in a rent roll, attach a trailing twelve-month income statement, and start prompting. Thirty minutes later, you have something that looks like an underwriting output.
What you also have, though you may not have tracked it, is a token bill somewhere between 500,000 and 1,000,000 units, a workflow that can't be audited, and a calculation that may or may not repeat itself tomorrow. This is the real cost of prompt-based underwriting. For fund managers evaluating AI infrastructure at any serious volume, it deserves a more rigorous accounting than it typically receives.
Context windows don't reset between iterations. Each run carries every prior prompt and response forward, so token use compounds quadratically rather than linearly, ballooning a 10–15 run cycle from ~3–4M tokens to 15–20M.
A wait of 10–15 minutes for first output sounds minor on one deal. Across a pipeline of live bid processes and simultaneous markets, it becomes a structural bottleneck that limits deal velocity and redirects analyst attention.
Language models are probabilistic. The same prompt applied twice will not always follow the same calculation path. For funds with LP audit expectations or fiduciary obligations, this is not a small risk, but a structural liability.
The Economics Are Worse Than They Appear
Token costs are easy to underestimate because they accumulate in ways that aren't immediately visible. A raw LLM workflow doesn't just read a document once; it reads it every time you extend the context. Every follow-up prompt carries the prior conversation. Every iteration reprocesses the same rent roll, the same OM, the same T12. By the time you've worked through vacancy assumptions, refined the debt service calculation, and run a few cap rate scenarios, you've likely crossed 500,000 tokens for a single deal.
A single first-output run on a typical deal consumes roughly 100–300k tokens, depending on the input format (concise prompts or an expanded OM). But context windows don't reset between iterations. Each subsequent run carries the full prior context forward: every prompt, every response, every block of model-generated analysis. Run 2 is already 400–500k tokens. By run 10, a single simple iteration can exceed 2M tokens as model outputs bulk up the context window.
This is a quadratic compounding problem, not a linear one. A realistic 10–15 run underwriting cycle consumes closer to 15–20M tokens, not 3–4M.
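The arithmetic behind that claim can be sketched in a few lines. This is a rough illustration, not a measurement: it assumes every run adds the same amount of new context (200k tokens, a hypothetical midpoint of the first-run range above), and the function names are invented for the example.

```python
# Illustrative assumption: every run adds the same amount of new context.
TOKENS_PER_RUN = 200_000  # assumed midpoint of the 100-300k first-run range


def tokens_for_run(run: int, per_run: int = TOKENS_PER_RUN) -> int:
    """Tokens processed on run N when all prior context is carried forward."""
    return run * per_run


def cumulative_tokens(runs: int, per_run: int = TOKENS_PER_RUN) -> int:
    """Total over a cycle: per_run * (1 + 2 + ... + runs), which grows with runs squared."""
    return per_run * runs * (runs + 1) // 2


def linear_estimate(runs: int, per_run: int = TOKENS_PER_RUN) -> int:
    """Naive estimate that assumes the context resets on every run."""
    return runs * per_run


if __name__ == "__main__":
    for runs in (10, 12, 15):
        print(runs, linear_estimate(runs), cumulative_tokens(runs))
```

Under these assumed inputs, run 10 alone processes 2M tokens, and a 12-run cycle totals 15.6M tokens against a naive linear estimate of 2.4M. Real cycles will vary with prompt and response length, but the gap is of the same order as the one described above.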
While the first run seems to get you to 80% of the work, the remaining 20% will consume roughly 95% of your token spend. And token costs are only part of the compute bill; every one of those tokens is processed by a large model running on expensive GPU infrastructure. The latency you feel waiting for each response is the compute cost made physical.
A purpose-built underwriting platform, by contrast, uses an LLM for what it's actually good at: extracting data from unstructured documents and interpreting language in context. That's a few thousand tokens per document. The actual calculations, from NOI, DSCR, IRR, and cash-on-cash return through to waterfall distributions, run through a backend engine that doesn't need to think. It just computes. The result is a 50 to 100× cost reduction per deal, not through any compromise in output quality, but through a cleaner division of labor.
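To make the "it just computes" point concrete, here is a minimal sketch of what the calculation side might look like. The `DealInputs` schema and function names are hypothetical, and the formulas are deliberately simplified textbook definitions (no other income, reserves, or taxes), not any particular platform's methodology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DealInputs:
    """Hypothetical schema for values an extraction layer would supply."""
    gross_potential_rent: float
    vacancy_rate: float          # e.g. 0.05 for 5%
    operating_expenses: float
    annual_debt_service: float
    equity_invested: float


def noi(d: DealInputs) -> float:
    """Net operating income: effective gross income minus operating expenses."""
    effective_gross_income = d.gross_potential_rent * (1 - d.vacancy_rate)
    return effective_gross_income - d.operating_expenses


def dscr(d: DealInputs) -> float:
    """Debt service coverage ratio: NOI divided by annual debt service."""
    return noi(d) / d.annual_debt_service


def cash_on_cash(d: DealInputs) -> float:
    """Cash-on-cash return: cash flow after debt service over equity invested."""
    return (noi(d) - d.annual_debt_service) / d.equity_invested


if __name__ == "__main__":
    deal = DealInputs(1_000_000, 0.05, 350_000, 400_000, 2_000_000)
    print(noi(deal), dscr(deal), cash_on_cash(deal))
```

None of these functions touches a model. Once the five inputs are extracted, the math costs no tokens and runs in microseconds.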
Thirty Minutes Is Not a Workflow
A first prompt may take 5–10 minutes for the LLM to produce a response. Additional inputs and tweaks can take it to 20–30 minutes. This problem is underappreciated. In the context of a single deal, it seems like a minor inconvenience. In the context of a pipeline, it is a workflow bottleneck with real downstream consequences.
Fund managers and acquisitions teams don't screen deals in isolation. They screen in batches: during live bid processes, at the end of a broker's marketing period, across multiple markets simultaneously. When each screening takes half an hour to return a first usable number, the team's effective throughput drops sharply. Associates spend time shepherding prompts rather than analyzing results. Senior staff wait on outputs before they can allocate attention.
The practical difference between a 30-minute workflow and a 30-second one isn't only speed but also how teams structure their day, how many deals they can seriously evaluate in a given week, how much time they can allocate to efforts such as due diligence and relationship building, and whether AI becomes a genuine force multiplier or an elaborate bottleneck.
A purpose-built platform running the same workflow, from document ingestion and data extraction through to financial modeling, returns results in seconds, because it isn't using the model to perform calculations. It's using the model to translate inputs and then handing the math to a system designed to do math. The economics of time aren't captured in token cost alone. They're captured in analyst hours, deal velocity, and whether the technology actually changes behavior or simply replaces one slow process with a different one.
The Trust Problem Is Harder to Quantify, but More Important
Speed and cost are the easy part of this argument. The harder sell, and the more consequential one, is consistency.
When an LLM generates a calculation, you are trusting that this invocation of the model wrote the same logic as the last invocation. Language models are probabilistic systems. They do not execute code; they predict outputs. The same prompt run twice will not always return precisely the same calculation path. Subtle differences in how a model interprets a rent roll, handles a footnote, or applies a vacancy assumption can produce outputs that look identical at first glance and diverge in the detail.
For a single deal in early-stage exploration, this might be acceptable. For a fund with a fiduciary obligation, an LP base with audit expectations, and an investment committee that needs to trace how a number was produced, it is not.
The calculation must be executed, not generated. The logic is defined, tested, and locked, not inferred. This is the audit trail that prompt-based underwriting structurally cannot provide.
A dedicated underwriting platform running calculations through a versioned engine behaves identically on every deal. The same inputs, applied to the same model version, produce the same output. Every time. This isn't a criticism of the underlying models, which are genuinely impressive at document comprehension and data extraction. It's a recognition that “impressive at language” and “appropriate for financial calculation” are different standards, and conflating them creates risk that doesn't show up until it matters most.
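One way to picture "same inputs, same version, same output" is an audit record that stores a fingerprint of the inputs alongside the engine version that produced the result. The sketch below is an assumption about how such a record could be built, not a description of any real platform; `ENGINE_VERSION`, `fingerprint`, and `run_with_audit` are hypothetical names, and DSCR stands in for the full calculation set.

```python
import hashlib
import json

ENGINE_VERSION = "1.4.2"  # hypothetical version label for the calculation logic


def fingerprint(payload: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_with_audit(inputs: dict, engine_version: str = ENGINE_VERSION) -> dict:
    """Compute DSCR and return it with the inputs hash and engine version used."""
    result = {"dscr": round(inputs["noi"] / inputs["annual_debt_service"], 4)}
    return {
        "engine_version": engine_version,
        "inputs_hash": fingerprint(inputs),
        "result": result,
        "result_hash": fingerprint(result),
    }


if __name__ == "__main__":
    record = run_with_audit({"noi": 600_000, "annual_debt_service": 400_000})
    print(json.dumps(record, indent=2))
```

Re-running the same inputs against the same version yields an identical record, so reproducing an output six months later becomes a matter of comparing hashes rather than trusting a chat transcript.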
What This Means for Platform Selection
The fund manager or technology decision-maker evaluating AI underwriting infrastructure should be asking a specific set of questions that raw LLM workflows cannot answer satisfactorily.
Can you reproduce this output exactly six months from now?
Can you demonstrate to a limited partner that the methodology applied to deal 47 is identical to the methodology applied to deal 12?
If the model is updated or the prompt is refined, how do you know which prior outputs are now inconsistent with your current approach?
How do you version your underwriting logic?
Purpose-built platforms are designed to answer these questions because they separate the AI layer from the calculation layer. The AI reads documents. The engine runs the math. The outputs are deterministic, traceable, and reproducible. This isn't an argument against AI in underwriting, but an argument for applying it correctly. The extraction and interpretation problems are real, and large language models solve them well. The calculation problem is not an AI problem. It is a software problem, and it has a software solution.
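The question about which prior outputs are inconsistent after a logic change becomes a simple lookup once every output records the engine version that produced it. A minimal sketch, assuming a hypothetical `AuditRecord` with just a deal identifier and a version string:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical stored record: which engine version underwrote which deal."""
    deal_id: str
    engine_version: str


def stale_deals(records: list[AuditRecord], current_version: str) -> list[str]:
    """Return deal IDs whose outputs came from a different engine version."""
    return [r.deal_id for r in records if r.engine_version != current_version]


if __name__ == "__main__":
    history = [
        AuditRecord("deal-12", "1.3.0"),
        AuditRecord("deal-47", "1.4.2"),
    ]
    print(stale_deals(history, current_version="1.4.2"))
```

A prompt-based workflow has no equivalent field to query, because the "version" of its logic is whatever the model happened to generate in that session.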