The Hard Problems at FurtherAI
At FurtherAI, we build AI agents for the insurance industry. Having worked in the industry for years, we’ve seen it all: Excel files with 100,000 rows across 47 tabs (no exaggeration), PDFs where the critical table was on page 187 of 200, documents over 4,000 pages long, and handwriting that barely qualifies as legible.
This is the reality of enterprise data, and it’s why we’re writing the playbook for agent-first software.
The Problems
Here are some of the hard technical problems we're solving:
1. The Agent Harness Problem
A harness defines the primitives an agent interacts with—a filesystem, a set of tools, a loop, and a verification step. The filesystem gives it memory and persistence. The tools let it read documents and act on what it finds. The loop handles orchestration and self-correction. Verification decides whether the work is correct. Those same primitives are what make self-improvement possible.
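As a sketch, here is what those primitives can look like in code. The names (`run_agent`, the toy `tools` dict, the `verify` callback) are illustrative, not our actual harness API, and a dict stands in for the filesystem:

```python
def run_agent(task, tools, verify, max_steps=10):
    """Minimal harness loop: act with tools, persist to memory,
    record the trace, verify, and self-correct. Illustrative only."""
    memory = {}   # persistence: what the agent has read and produced so far
    trace = []    # the trajectory we later inspect step by step
    for _ in range(max_steps):
        for name, tool in tools.items():
            result = tool(task, memory)
            memory[name] = result
            trace.append((name, result))   # every action lands in the trace
        ok, feedback = verify(memory)      # verification decides correctness
        if ok:
            return memory, trace
        memory["feedback"] = feedback      # self-correction signal for the next pass
    return memory, trace

# Toy usage: extract a policy limit from a one-line "document".
doc = "Policy ABC-123. Limit: $1,000,000 per occurrence."
tools = {"extract_limit": lambda task, mem: doc.split("Limit: ")[1].split(" ")[0]}
verify = lambda mem: (mem.get("extract_limit") == "$1,000,000", "wrong value")
memory, trace = run_agent("find the limit", tools, verify)
```

The point of the sketch is that the trace is a first-class output, not a log side effect: everything downstream (verification, debugging, self-improvement) reads from it.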
Karpathy's autoresearch showed this: give an agent a codebase, a metric, and a loop, and it'll make local improvements all night.
For us, the same loop applies. Here are the metrics:
- Did we extract the right values?
- Can we verify the agent trace that produced them?
- Did the downstream systems get updated correctly?
The second metric is the hardest. Two agents can produce identical extractions through wildly different traces—one focused, one thrashing through retries and dead ends before stumbling onto the answer. Only one generalizes. And when results are wrong, “the agent got it wrong” isn’t actionable. You need trajectory-level visibility: did it read the wrong document? Misinterpret a table? Have the correct value at step 12 but overwrite it at step 20? — Punyaslok Pattnaik
Defining what a good trajectory looks like is itself an open problem. Some exploration is healthy but the line between exploration and thrashing shifts with document complexity. We're building the tooling to measure this.
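One crude proxy we can compute over a trace is how often the agent repeats an action it has already taken: identical repeated actions suggest thrashing, unique actions suggest focused exploration. This hypothetical `thrash_score` is a sketch of the idea, not our production metric:

```python
from collections import Counter

def thrash_score(trace):
    """Fraction of steps that repeat an earlier (tool, target) pair.
    0.0 = every action was new; higher values = more thrashing."""
    if not trace:
        return 0.0
    counts = Counter(trace)
    repeats = sum(c - 1 for c in counts.values())  # extra occurrences only
    return repeats / len(trace)

# A focused trace vs. one that keeps re-reading the same documents.
focused = [("read", "sov.xlsx"), ("parse_table", "tab3"), ("write", "limit")]
thrashing = [("read", "a.pdf"), ("read", "a.pdf"), ("read", "b.pdf"),
             ("read", "a.pdf"), ("read", "b.pdf"), ("write", "limit")]
```

A real measure has to account for legitimate re-reads (a 200-page PDF may warrant several passes), which is exactly why the exploration/thrashing line shifts with document complexity.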
While true for many industries, this is especially acute in insurance. Data varies wildly across carriers and document types. Every step in the pipeline needs to be well-specified: what to extract, which values to trust when documents disagree, and what to do with the result. Getting that specification right, across hundreds of formats and customer requirements, is the bottleneck to self-improvement.
2. Building AI that Learns From Humans Over Time
On day zero, an AI system might be ~80% accurate. By day 100, it needs to be closer to 99%.
The challenge lies in building systems that improve with every interaction, capturing each customer’s specific preferences without regressing on what already works.
We've built a memory system in our AI assistant that learns from user interactions—when an underwriter corrects the system or clarifies a preference, that gets stored and applied to future conversations.
The harder problem—the one we're actively working on—is consolidating hundreds of individual corrections into coherent, generalizable knowledge. A single correction is easy to store. But corrections can conflict, go stale, or apply only in narrow contexts. Turning a pile of one-off fixes into knowledge that reliably improves the system's responses over time is where most of the difficulty lives. — Frieda Huang
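A minimal sketch of the storage side, assuming corrections are scoped (broad vs. per-carrier) and timestamped; all names here are illustrative, not our memory system's schema:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Correction:
    scope: str        # e.g. "all" or "carrier:Acme" (hypothetical scopes)
    field_name: str   # e.g. "date_format"
    rule: str
    recorded: date

def consolidate(corrections):
    """Collapse raw corrections into one rule per (scope, field):
    within a scope the newest correction wins; narrower scopes are
    kept alongside broad ones so they can override at lookup time."""
    merged = {}
    for c in sorted(corrections, key=lambda c: c.recorded):
        merged[(c.scope, c.field_name)] = c.rule   # later corrections overwrite
    return merged

def lookup(merged, scope, field_name):
    """Prefer the narrow scope; fall back to the broad one."""
    return merged.get((scope, field_name), merged.get(("all", field_name)))

corrections = [
    Correction("all", "date_format", "MM/DD/YYYY", date(2024, 1, 5)),
    Correction("all", "date_format", "ISO 8601", date(2024, 3, 2)),      # newer, wins
    Correction("carrier:Acme", "date_format", "DD-MM-YYYY", date(2024, 2, 1)),
]
merged = consolidate(corrections)
```

Even this toy version shows where the difficulty lives: "newest wins" and "narrow beats broad" are policy decisions, and the wrong policy silently regresses behavior customers already rely on.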
3. Scaling Forward-Deployed Engineering
Forward-deployed engineering (FDE) work is still largely manual. You need to understand each customer’s data, map their requirements, and tune the pipeline until it works. Each step takes weeks of attention.
The question is whether we can invert this. Today the bottlenecks are domain understanding, evals, and iteration—and only ground truth labeling truly requires humans. Everything else can be compute. Imagine an agent with access to customer data, the workflow builder, and the eval platform, that builds, tests, and iterates autonomously. If you do that, scale is no longer a headcount problem.
4. Linking Entities Across Documents
A single entity might have 100 attributes spread across different documents, and the data is often messy or incomplete. Here’s a quick example:
One document mentions the year a property was built, another has the last renovation date, and a third lists the coverage limits. The only link among the three might be an address, and even that can be written differently across documents: “123 Main St” vs. “123 Main Street, Unit A.”
The challenge is determining whether those variations refer to the same property or to different ones—especially when the data is incomplete or messy. Match too aggressively and you collapse distinct properties into one. Too conservatively and the same building shows up three times with conflicting data. — Kshitij Jain
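A conservative matcher might normalize addresses and treat a missing unit as compatible with any unit, while keeping distinct units distinct. This sketch, with its tiny abbreviation table, is illustrative only; real matching needs far richer normalization and fuzzy comparison:

```python
import re

# Hypothetical abbreviation table; a real one covers USPS suffixes and more.
ABBREV = {"st": "street", "ave": "avenue", "rd": "road", "blvd": "boulevard"}

def normalize(address: str):
    """Lowercase, tokenize, expand abbreviations, split off a unit suffix."""
    tokens = [ABBREV.get(t, t) for t in re.findall(r"[a-z0-9]+", address.lower())]
    unit = None
    if "unit" in tokens:
        i = tokens.index("unit")
        unit = tokens[i + 1] if i + 1 < len(tokens) else None
        tokens = tokens[:i]
    return tuple(tokens), unit

def same_property(a: str, b: str) -> bool:
    """Match on the normalized street address; a missing unit is compatible
    with any unit, but two different units are different properties."""
    (base_a, unit_a), (base_b, unit_b) = normalize(a), normalize(b)
    if base_a != base_b:
        return False
    return unit_a is None or unit_b is None or unit_a == unit_b
```

Note the asymmetry baked into the unit rule: it errs toward merging when information is missing, which is itself a tunable risk, exactly the aggressive-vs.-conservative trade-off described above.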
5. The Human-in-the-Loop Aspect
It’s important to remember that AI will not be 100% accurate. And in an industry like insurance, errors compound.
With an end-to-end target of 95%+ accuracy across a multi-step pipeline, our system lets AI take the first pass while humans review, edit, and clarify.
This is why we need to build intuitive UI surfaces. We need citations that show exactly where each value came from, we need visual cues when the system is less confident, and we need simple correction tools.
Every human edit feeds back into the model and memory layer. The challenge is making this feedback loop seamless and natural—so users feel like they’re collaborating with the system, not fixing it. — Giancarlo Fissore
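The routing itself can be sketched as a confidence threshold over extracted fields; the `Extraction` shape, the citation format, and the 0.9 cutoff are assumptions for illustration:

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    field: str
    value: str
    confidence: float
    citation: str   # where the value came from, e.g. "sov.xlsx, tab 3, row 88"

def route(extractions, threshold=0.9):
    """Split AI output into auto-accepted fields and a human review queue.
    Low-confidence fields get surfaced with their citation attached."""
    accepted = [e for e in extractions if e.confidence >= threshold]
    review = [e for e in extractions if e.confidence < threshold]
    return accepted, review

extractions = [
    Extraction("year_built", "1987", 0.97, "appraisal.pdf, p. 2"),
    Extraction("roof_type", "tile", 0.62, "inspection.pdf, p. 14"),
]
accepted, review = route(extractions)
```

Carrying the citation through to the review queue is the key design choice: the reviewer sees not just the value but exactly where it came from, which is what makes correction fast.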
6. The Platform Paradox
Another problem is that clients might have completely opposite requirements for the same product. Say Customer A “always rejects California exposure” while Customer B “usually prioritizes California submissions.”
You can’t build a separate interface for each customer, because that doesn’t scale. And you can’t build a generic system with a toggle for every case, because it would collapse under its own complexity.
The problem is designing a single interface that’s expressive enough for each customer’s logic. With LLMs, we know this is possible for the first time. That’s why we are building agentic UI that adapts to what each customer actually needs. — Satvarsh Gondala
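One way to express opposite requirements in a single product is to keep per-customer logic as data and share the evaluation engine. This sketch is illustrative, not our agentic UI; in practice an LLM can translate a customer's stated preference into rules like these:

```python
def evaluate(submission, rules):
    """One shared engine, per-customer rule lists. Each rule is a
    (predicate, action) pair; the first matching rule decides."""
    for predicate, action in rules:
        if predicate(submission):
            return action
    return "review"   # default: send to a human

# Opposite requirements, same engine, same submission schema.
customer_a = [(lambda s: s["state"] == "CA", "reject")]      # rejects CA exposure
customer_b = [(lambda s: s["state"] == "CA", "prioritize")]  # prioritizes CA

submission = {"state": "CA", "line": "property"}
```

The interface stays fixed; only the rule data varies per customer, which is what keeps the platform scalable.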
7. Synthetic Data for Insurance AI
The tricky part is that there’s no ImageNet for insurance documents. There are no large-scale labeled datasets of SOVs, loss runs, or bordereaux.
We're creating synthetic data to train and evaluate models. The challenge is generating documents that capture real-world messiness—inconsistent formats, typos, missing data, conflicting information. Synthetic data that is too clean doesn't help, and data that is too messy isn't useful either. To make it work, we need the right distribution of chaos.
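A sketch of what "the right distribution of chaos" means in practice: a single messiness knob controlling how often fields go missing, change format, or pick up typos. The noise menu here is illustrative; real generators model carrier-specific quirks:

```python
import random

def mess_up(record: dict, messiness: float, rng: random.Random) -> dict:
    """Inject real-world-style noise at a controlled rate: drop fields,
    vary formats, and introduce typos. messiness=0.0 leaves data clean."""
    noisy = {}
    for key, value in record.items():
        roll = rng.random()
        if roll < messiness * 0.3:
            continue                                  # missing data
        if roll < messiness * 0.6 and isinstance(value, str):
            value = value.replace("Street", "St")     # inconsistent format
        if roll < messiness and isinstance(value, str) and len(value) > 3:
            i = rng.randrange(len(value))
            value = value[:i] + value[i] * 2 + value[i:]  # doubled-letter typo
        noisy[key] = value
    return noisy

clean = {"address": "123 Main Street", "year_built": "1987", "limit": "$1,000,000"}
```

The knob is the whole point: sweeping `messiness` lets us evaluate where models break, instead of guessing at a single noise level.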
Why This Matters
The most challenging (but also the most exciting) part is that there’s no established playbook for these problems. We’re defining engineering patterns for AI systems that must be flexible, auditable, and production-ready at the same time.
While traditional enterprise software is deterministic but rigid, and AI systems are adaptive but unpredictable, we’re here to build systems that bring together the best of both worlds.
Join Us
We're backed by Andreessen Horowitz. If these problems excite you, join us.