
Trust Broker – AI Compliance Copilot

AI/RAG · Compliance · Solo Development

From 10-20 hours per questionnaire to a few minutes: an AI system that doesn't just fill out security questionnaires, but delivers 100% accuracy at 75% lower cost.

Trust Broker - AI-powered system for automated security questionnaire responses
100% Accuracy · 75% Cost Reduction · 4 Months Solo Development

The Scenario

Picture this: It’s Thursday afternoon, and a security questionnaire lands on your desk. 150 questions. Deadline: Monday. The answers exist somewhere—scattered across PDFs, policies, wiki pages, and the minds of colleagues who are conveniently on vacation.

Question 73: “Describe your incident response process.” The answer is in a 40-page document from 2019. Somewhere. Maybe page 12. Or 28.

This is daily life for compliance teams. 10-20 hours per questionnaire. Not because the work is complex, but because it’s endless copy-paste.

My client, a B2B compliance SaaS provider, wanted to change that.


The Brief

The client came with a clear vision: a system that automatically fills out security questionnaires. Sounds like a standard AI project at first—grab an LLM, throw in some documents, done.

But it’s not that simple.

Compliance isn’t a playground for “close enough.” If you answer “Yes” to an ISO 27001 question when the correct answer is “No,” it gets expensive fast: audit consequences, lost trust, legal trouble.

So before writing a single line of code, I asked myself three questions:

Where are the hard constraints? Questionnaire platforms don’t have APIs. We need to automate real browser flows—with login, multi-page forms, and all the complexity that comes with it.

What’s compliance-critical? Answers need to be traceable. Where did the information come from? Which document? Which page? And: the system must never submit on its own—a human has to review first.

How do we measure quality? In AI projects, the demo doesn’t matter. What matters is how the system behaves after the third prompt change. Without metrics, you’re flying blind.


The Solution: A 3-Stage Pipeline

Trust Broker isn’t a ChatGPT wrapper. It’s a pipeline with clear interfaces between each step:

1. Parse: Form → Schema
2. Analyze: Sources → Answer
3. Fill: Answer → Form

Step 1: Parse
An AI-driven browser agent opens the questionnaire and reads it into a structured format. Field types, validation rules, dropdown options, multi-page navigation—everything ends up in a JSON schema. The result is a stable, testable foundation, regardless of what the UI looks like.

Step 2: Analyze
This is where the magic happens. The system searches the client’s knowledge base—first curated FAQ entries, then documents. An LLM selects relevant sources and generates a type-compliant answer with citations.

Step 3: Fill
The browser agent navigates back to the form and fills in fields appropriately—radio buttons, dropdowns, text fields. But—and this is crucial—it never clicks “Submit.” The form is pre-filled, but a human must approve it.

Why this architecture? Each step has defined inputs and outputs. This means I can test, debug, and optimize each step in isolation. If document selection isn’t good enough, I don’t need to touch the whole system—just Step 2.
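Those defined inputs and outputs can be sketched as TypeScript interfaces. This is a minimal illustration, not the production schema—the field names here are assumptions:

```typescript
// Hypothetical sketch of the contracts between the three pipeline stages.
// Field names are illustrative, not the real production schema.

interface FormField {
  id: string;
  label: string;                        // the question text
  type: "text" | "radio" | "dropdown";
  options?: string[];                   // for radio/dropdown fields
  required: boolean;
}

interface ParsedQuestionnaire {         // output of Step 1 (Parse)
  url: string;
  pages: FormField[][];                 // multi-page forms
}

interface Answer {                      // output of Step 2 (Analyze)
  fieldId: string;
  value: string;
  citations: { document: string; page?: number }[];
}

// A type-compliance check that Step 3 (Fill) could run before touching
// the browser: the answer must be a legal value for the field.
function isTypeCompliant(field: FormField, answer: Answer): boolean {
  if (field.type === "radio" || field.type === "dropdown") {
    return field.options?.includes(answer.value) ?? false;
  }
  return answer.value.length > 0;
}
```

Because each stage only sees these contracts, a failing document selection can be debugged against recorded `ParsedQuestionnaire` fixtures without ever opening a browser.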


The First Surprise: LLM Beats Embeddings

Here’s where it gets interesting.

When you think “AI-powered document search” today, you’re probably thinking RAG with vector embeddings. That’s the standard approach: convert documents to vectors, store them in a database, find the most similar vectors when a question comes in.

I made the same assumption. So I built both approaches and compared them.

The result surprised me.

Approach | F1 Score | Precision | Recall
Embedding Baseline | 0.439 | 30.8% | 84.7%
LLM (Google Gemini) | 0.568 | 71.9% | 70.7%
LLM (OpenAI GPT-4o-mini) | 0.559 | 43.3% | 92.0%

LLM-based metadata analysis was 29% better than embeddings. Not slightly better—significantly better.
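For context, these are standard retrieval metrics, computed per question and averaged over the test set. A minimal evaluation helper might look like this (the document names in the example are invented):

```typescript
// Precision, recall, and F1 for document selection on a single question.
function evalSelection(selected: Set<string>, relevant: Set<string>) {
  const truePositives = [...selected].filter((d) => relevant.has(d)).length;
  const precision = selected.size ? truePositives / selected.size : 0;
  const recall = relevant.size ? truePositives / relevant.size : 0;
  const f1 =
    precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}

// Invented example: the selector returned 4 documents, 2 of them relevant.
const result = evalSelection(
  new Set(["soc2.pdf", "dpa.pdf", "handbook.pdf", "pricing.pdf"]),
  new Set(["soc2.pdf", "dpa.pdf", "incident-response.pdf"])
);
// precision = 2/4, recall = 2/3
```

The precision/recall split in the table is telling: the embedding baseline retrieves almost everything relevant but buries it in noise, while the Gemini selector is far more precise.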

Why?

With a manageable document corpus (15-150 documents, typical for mid-sized companies), an LLM can analyze the metadata directly—title, description, category—and understand it. No need to route through semantic similarity.

The best part: no vector database needed. Less infrastructure. Less maintenance. And the LLM can explain why it selected a document—that’s gold for compliance.

Important caveat: For very large document sets (1000+ docs), I’d go back to embeddings. But for the typical use case? LLM wins.


The Second Surprise: The Most Expensive Model Isn’t the Best

Six models. 21 compliance questions. A clear test case.

Model | Accuracy | Speed | Cost/Question
GPT-4 Turbo | 100% | 3.9s | $0.044
GPT-4o | 95.2% | 3.9s | $0.011
GPT-5 | 80% | 24.2s | $0.342
Gemini 2.5 Pro | 76.2% | 5.9s | $0.040

Yes, you read that right. GPT-5 performed worse than GPT-4 Turbo. Slower. More expensive. And less accurate.

This isn’t a bug. It’s the reality of LLMs: “newer” or “bigger” doesn’t automatically mean “better for your use case.”

The practical takeaway:

  • Development: GPT-4o with 95.2% accuracy at $0.011 per question
  • Production: GPT-4 Turbo with 100% accuracy for zero-tolerance compliance

The difference between the two models comes to roughly $12,000 per year at 1,000 questions a day—with no quality loss in production.

Because I measured it—not guessed.
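The annual figure follows directly from the per-question delta in the table above:

```typescript
// Annual cost delta between GPT-4 Turbo and GPT-4o at 1,000 questions/day.
const costPerQuestionTurbo = 0.044; // USD, from the benchmark table
const costPerQuestion4o = 0.011;
const questionsPerDay = 1000;

const annualSavings =
  (costPerQuestionTurbo - costPerQuestion4o) * questionsPerDay * 365;
// ≈ $12,045 per year -- the "$12,000" figure above
```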


The Speed Trick: Knowledge Base First

Document extraction is slow. Opening a PDF, extracting text, sending it to an LLM—that takes 8-12 seconds per question.

But many compliance questions are standard questions. “Do you have a privacy policy?” appears in every other questionnaire.

The idea: Before searching documents, the system checks a curated knowledge base. If there’s an answer there, we’re done in 2-4 seconds.

Question → Knowledge Base (2-4s) → match? Yes: done, 2-4 seconds total. No: fall back to document extraction (10-16s).

The result: 50% faster for common questions. And more consistent—knowledge base answers don’t vary from run to run.
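The fallback logic itself is simple. A sketch, with the two retrieval steps passed in as stand-ins (`lookupKnowledgeBase` and `extractFromDocuments` are hypothetical names, not the real API):

```typescript
// Knowledge-base-first answering with document fallback.
type KbAnswer = { text: string; source: string };

async function answerQuestion(
  question: string,
  lookupKnowledgeBase: (q: string) => Promise<KbAnswer | null>, // fast path, ~2-4s
  extractFromDocuments: (q: string) => Promise<KbAnswer>        // slow path, ~10-16s
): Promise<KbAnswer> {
  // Try the curated FAQ entries first.
  const curated = await lookupKnowledgeBase(question);
  if (curated) return curated;
  // Only fall back to PDF extraction + LLM when nothing curated matches.
  return extractFromDocuments(question);
}
```

The consistency benefit falls out for free: the fast path returns curated text verbatim, so it cannot vary between runs the way a fresh LLM generation can.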


Browser Automation Without the Maintenance Nightmare

Now for the tricky part: the forms don’t have APIs.

The classic approach would be Selenium or Playwright with CSS selectors. The problem: questionnaire platforms change their UI constantly. The button that was #submit-btn yesterday is button.primary-action today. Every change breaks the automation.

My solution: AI-driven browser agents.

Instead of “Click on element with ID xyz,” I tell the agent: “Fill question x with answer y.” The LLM analyzes the page, finds the fields, understands the context. When the UI changes, the agent adapts.

Sounds like magic? It kind of is. But it works remarkably well.

The critical constraint: The agent must never click “Submit.” This is defined as a Critical Error. If it tries, everything stops. Compliance means: a human must give final approval.
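A guard like this is best enforced in code, independent of whatever the LLM decides. The sketch below is an assumption about how such a check could look—the action shape and pattern are invented, not the real agent protocol:

```typescript
// Hard stop before any submit-like action, regardless of the LLM's plan.
type AgentAction = {
  kind: "click" | "type" | "select";
  target: string; // accessible name or selector of the element
  value?: string;
};

class CriticalActionError extends Error {}

// Hypothetical heuristic for submit-like controls.
const SUBMIT_PATTERN = /submit|send|finali[sz]e/i;

function guardAction(action: AgentAction): AgentAction {
  if (action.kind === "click" && SUBMIT_PATTERN.test(action.target)) {
    // Abort the entire run: a human must give final approval.
    throw new CriticalActionError(`Blocked submit-like click on "${action.target}"`);
  }
  return action;
}
```

The point of putting the rule in deterministic code rather than the prompt: a prompt is a request, a thrown exception is a guarantee.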


Authorization: Four Levels, Transparent

In B2B SaaS, data separation isn’t optional. Customer A can’t see Customer B’s documents. Some documents require an NDA. Others are strictly internal.

The model:

Level | Rule
Public | Everyone can see it
Restricted | Authorized contacts only
Requires NDA | Only with signed NDA
Internal | Never accessible via API

The key insight: When the system finds relevant documents that the user can’t access, it shows this. “I found 3 relevant documents, but you need an NDA for 2 of them.”

Why? The alternative would be: “No information found.” That’s frustrating and unclear. Is it the system? Is there really nothing? With transparent feedback, the user knows an answer exists—they just need to get the right permissions. This reduces support tickets and builds trust.
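The two ideas—strict access rules plus transparent feedback—can be sketched together. The names below are illustrative, not the production model:

```typescript
// Four-level document authorization with transparent feedback.
type AccessLevel = "public" | "restricted" | "requires_nda" | "internal";

interface Viewer {
  authorized: boolean;   // is an authorized contact of this customer
  hasSignedNda: boolean;
}

function canAccess(level: AccessLevel, viewer: Viewer): boolean {
  const rules: Record<AccessLevel, (v: Viewer) => boolean> = {
    public: () => true,
    restricted: (v) => v.authorized,
    requires_nda: (v) => v.authorized && v.hasSignedNda,
    internal: () => false, // never exposed via the API
  };
  return rules[level](viewer);
}

// Instead of silently dropping blocked documents, report them.
function summarize(levels: AccessLevel[], viewer: Viewer): string {
  const blocked = levels.filter((l) => !canAccess(l, viewer)).length;
  return blocked === 0
    ? `Found ${levels.length} relevant documents.`
    : `Found ${levels.length} relevant documents, but ${blocked} require additional permissions.`;
}
```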


What I Learned About Infrastructure

Inngest Instead of a Custom Job Queue

My first approach: a custom job queue with Redis. 430 lines of code for job management, retries, cleanup.

Then it hit me: the client already uses Inngest. Why build a second system?

Result: 430 lines deleted. Durable execution out of the box. Jobs survive service restarts. Better observability.

OpenAPI → TypeScript for Cross-Service Type Safety

The browser agent runs in Python (because the library requires it). Everything else is TypeScript. How do I prevent runtime errors between services?

Solution: The Python service generates an OpenAPI spec. From that, I generate TypeScript types. If someone renames a field in Python, the TypeScript compiler catches it—not production.
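In practice this pattern often relies on a generator such as the `openapi-typescript` tool, though the source doesn't name the one used here. The snippet below sketches the effect with an invented generated type: once the TypeScript side consumes it, a rename on the Python side surfaces as a compile error after regeneration.

```typescript
// Hypothetical type as it might be generated from the Python service's
// OpenAPI spec. The field names are invented for illustration.
interface ParseJobResult {
  job_id: string;
  field_count: number;
}

// A thin typed consumer: if the Python side renamed `field_count` to,
// say, `fields_total`, regenerating the types would make this function
// fail to compile -- the error shows up in CI, not in production.
function describeResult(result: ParseJobResult): string {
  return `Job ${result.job_id} parsed ${result.field_count} fields`;
}
```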


The Final Numbers

After four months of solo development:

Quality

  • 100% semantic accuracy (GPT-4 Turbo)
  • 29% better document selection than embedding baseline
  • 100% type compliance (Boolean stays Boolean)

Performance

  • 3.9 seconds average response time
  • 50% faster with knowledge-base-first architecture
  • 10 seconds saved per 100-question batch (authorization caching)

Cost

  • $0.011 per question (GPT-4o) vs. $0.044 (GPT-4 Turbo)
  • 75% cost reduction for development
  • $12,000/year savings at 1,000 questions/day

Code

  • 6 microservices, cleanly separated
  • 430 fewer lines through Inngest migration
  • Zero runtime type errors through OpenAPI codegen

What I’d Do Differently

Evaluation from day 1. For the first two weeks, I built features without knowing how to measure them. The evaluation framework came relatively early—but even earlier would have yielded better insights.

Less perfection in the first pass. The 4-level authorization was fully implemented from day 1. In hindsight, “Public vs. Private” would have been enough for the MVP. The complexity came too early.

Prompt versioning with performance data. Prompts were in Git, but not linked to historical performance metrics. After a change, I knew what was different—but not always how accuracy shifted.

Cost tracking from day 1. I only started measuring API costs systematically after a few weeks. Surprise: the most expensive calls weren’t the ones I expected. Earlier monitoring would have influenced design decisions.

Less abstraction, more pragmatism. My first design had elegant interfaces for future extensions. I never needed most of them. Simple code that works beats flexible code that never gets extended.


Tech Stack

  • Backend: TypeScript (Node 22, strict mode), Python (FastAPI for browser agent)
  • Infrastructure: Docker Compose, Coolify for deployments
  • Database: Directus CMS (client infrastructure)
  • Job Processing: Inngest for durable execution
  • LLM: Multi-provider (OpenAI, Google, Anthropic), switchable via config
  • Testing: Vitest, 50 realistic test cases, evaluation framework

Conclusion

Trust Broker isn’t a demo project. It’s a system that runs in production, delivers measurable quality, and stays maintainable.

The most important lesson: AI projects need evaluation from day 1. Not as a nice-to-have, but as the core of the product. Without metrics, you’re flying blind.

The second-most important: Test your assumptions. LLM beats embeddings. GPT-4 beats GPT-5. Not because I read it somewhere—because I measured it.

Planning a similar project? Need AI that doesn't just feel right, but demonstrably works? Let's talk.