AloLab · Research

Building the world's best and safest
AI agents.

Q: What is trajectory intelligence in AI agents?

Trajectory intelligence evaluates every step an AI agent takes — tool calls, reasoning, retries, failures — not just the final output. AloLab's research shows this step-level analysis reveals failure modes invisible to standard pass/fail evaluations.

Q: Why do AI agents fail in production despite passing benchmarks?

Standard benchmarks measure task accuracy alone. AloLab research demonstrates that models can achieve 85% task accuracy while producing 0% usable output — the structured-output reliability gap makes benchmark scores misleading for production deployment.

Q: How does AloLab improve AI agent reliability?

AloLab is an iterative system-prompt optimizer requiring only black-box API access. It analyzes agent execution traces, identifies recurring failure patterns, and rewrites the orchestration harness — reaching 84–95% output accuracy at near-zero latency overhead.

Q: Is AloLab open source?

Yes. The AloLab framework and benchmarks are open-source at github.com/alomana-lab/alolab. The peer-reviewed paper is available at arXiv 2605.02363.

AloLab researches how to make AI agents reliable, safe, and auditable for the highest-stakes enterprise operations. Peer-reviewed research, open-source tooling, production-derived benchmarks.

Interpretability · Agent reliability · Peer-reviewed & open-source

What we believe

Four principles.
One research agenda.

Work-Alignment

Reliability first. Humans freed from process.

Most AI research optimizes for capability, what a model can do. We optimize for what it should do reliably, every time, so humans don't have to. Our systems absorb the repetitive, the tedious, the operational load, and return time, judgment, and creative capacity back to the people who lost it to process.

Human-Centricity

Technology that serves people, not the other way around.

Every system we build starts from the same question: does this make someone's day better? We design for zero learning curve, tools that meet people where they already are, speak their language, and stay invisible until the moment they're needed. The measure of a great AI system is that people forget it's there.

Harness > Model

The orchestration layer is the product. The model is a commodity.

Models get smarter every quarter. That's not a moat, it's a tide that lifts everyone. What compounds is the harness: the execution framework that turns raw capability into deterministic, auditable, production-grade output. The model is the engine. The harness is the driver.

Open Science, Private Deployment

Publish the method. Protect the data.

We open-source our framework, our benchmarks, and our findings because science without reproducibility is marketing. But open research doesn't mean open data. Your production traces, your optimized prompts, your competitive intelligence, that stays in your environment, under your control.

Models improve.
Harnesses compound.

The industry optimizes for model capability. We optimize for execution reliability. A frontier model that produces the right answer in the wrong format is as unusable as one that gets it wrong. The harness is what closes the gap.

The finding

85% accurate.
0% usable.

Without a harness, every model we tested, open and proprietary, achieved strong task accuracy but produced zero parsable output. The bottleneck was never reasoning. It was structured execution.

Model          Task accuracy   Output accuracy
Llama 3.1-8B   76.9%           0%
Gemma 2-9B     80.4%           0%
Qwen 2.5-7B    85.1%           0%
GPT-4o         84.3%           0%

GSM8K benchmark, naive prompting — no formatting instruction. Task accuracy measures mathematical correctness; output accuracy measures correctness + valid structured output.
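
The gap between the two metrics can be illustrated with a minimal Python sketch. It is not the evaluation code from the paper: the target schema (a JSON object with a single numeric `answer` field), the last-number heuristic for reading an answer out of free text, and the function names are all assumptions made for illustration.

```python
import json
import re

def task_correct(raw: str, gold: float) -> bool:
    """Task accuracy: the last number found anywhere in the text equals gold."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", raw.replace(",", ""))
    return bool(numbers) and float(numbers[-1]) == gold

def output_correct(raw: str, gold: float) -> bool:
    """Output accuracy: raw must be valid JSON with a correct numeric `answer`."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    answer = parsed.get("answer")
    if isinstance(answer, bool) or not isinstance(answer, (int, float)):
        return False
    return float(answer) == gold

def accuracies(samples):
    """Return (task accuracy, output accuracy) over (raw_output, gold) pairs."""
    n = len(samples)
    task = sum(task_correct(raw, gold) for raw, gold in samples) / n
    output = sum(output_correct(raw, gold) for raw, gold in samples) / n
    return task, output

# Hypothetical free-text responses, as a naively prompted model might give.
samples = [
    ("She has 18 eggs left, so the answer is 18.", 18.0),
    ("The total is 42", 42.0),
    ("I think it's 7", 9.0),
    ("Answer: 5", 5.0),
]
task_acc, output_acc = accuracies(samples)
print(f"task accuracy: {task_acc:.2f}, output accuracy: {output_acc:.2f}")
```

In this toy set three of four answers are mathematically right, yet none parses as the expected JSON, so output accuracy is zero while task accuracy is not.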

Trajectory intelligence

Most evals measure
the answer.

We measure the work. Every tool call, reasoning step, retry loop, and failure mode, logged and classified. Standard pass/fail scoring masks the execution pathologies that make agents unreliable in production.

Standard evals: Binary pass/fail on final output. Failure modes invisible.

Trajectory evals: Step-level traces. Failure taxonomy: loops, context collapse, hallucinated tool use, reasoning-output decoupling.

Execution trace · task-0047

step 1 · Read dataset schema, identify join keys · ✓

step 2 · Execute aggregation query · ✓

step 3 · Retry same query with modified filter, no change in result · loop

step 4 · Retry again, identical output · loop ×2

step 5 · Return answer, mathematically correct, wrong field populated · decoupled

Pass/fail result: ✗ · Trajectory result: reasoning sound, execution loop at step 3, serialisation failure at step 5
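
A step-level labeller for a trace like the one above could look roughly like the following Python sketch. The `Step` record, the `field=value` encoding of the final answer, and the rule "same action with an identical result as the previous step counts as a loop" are hypothetical simplifications. They cover only two entries of the failure taxonomy and are not AloLab's classifier.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str  # hypothetical action names: "read", "query", "answer"
    result: str  # serialised summary of what the step returned

def label_steps(steps, expected_field, gold):
    """Label each step as 'ok', 'loop', 'decoupled', or 'wrong'."""
    labels = []
    for i, step in enumerate(steps):
        prev = steps[i - 1] if i > 0 else None
        if step.action == "answer":
            # Final answers are encoded as "field=value" in this sketch.
            field, _, value = step.result.partition("=")
            if value != gold:
                labels.append("wrong")
            elif field != expected_field:
                # Right value in the wrong field: reasoning-output decoupling.
                labels.append("decoupled")
            else:
                labels.append("ok")
        elif (prev is not None and prev.action == step.action
              and prev.result == step.result):
            # Repeating an action without changing the result is a loop.
            labels.append("loop")
        else:
            labels.append("ok")
    return labels

# A trace shaped like task-0047, with made-up results.
trace = [
    Step("read", "schema: orders, customers"),
    Step("query", "total=1250"),
    Step("query", "total=1250"),
    Step("query", "total=1250"),
    Step("answer", "notes=1250"),
]
labels = label_steps(trace, expected_field="total", gold="1250")
print(labels)
```

A binary scorer would report only the final failure. The per-step labels also show that the reasoning reached the right value and where the run stalled.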

Key results

Reliability at scale,
without fine-tuning.

0 → 84–95% · output accuracy after AloLab optimization, from a 0% baseline under naive prompting

8.2× · latency overhead of constrained decoding, eliminated; AloLab runs at near-zero latency overhead

29/30 · paired McNemar comparisons significant at p < 0.05 across all models and datasets

Black-box API access only. No weights, no gradients, no fine-tuning. The same framework works across open-weight and proprietary models.
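
For readers unfamiliar with the paired McNemar comparisons mentioned above, here is a minimal Python sketch of the exact (binomial) form of the test. The source does not say which variant of the test was used, and the counts in the example are invented, so treat this as an illustration of the technique rather than a reproduction of the reported analysis.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: samples correct only under condition A (e.g. the baseline prompt)
    c: samples correct only under condition B (e.g. the optimized prompt)
    Pairs where both conditions agree do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented counts: 1 sample correct only before, 15 only after optimization.
p_value = mcnemar_exact(1, 15)
print(f"p = {p_value:.6f}, significant at 0.05: {p_value < 0.05}")
```

The test suits this setting because both prompts are scored on the same samples, so only the samples where the two conditions disagree carry information.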

How AloLab works

Iterative AI optimization
via execution traces.

01 · Solve: Query the target model via API on production-derived tasks. Black-box only: no weights, no gradients.

02 · Evaluate: Compute per-sample metrics deterministically: JSON validity, task correctness, execution trace.

03 · Analyze: Identify recurring failure and success patterns across the full execution trace. Categorize by failure taxonomy.

04 · Optimize: Rewrite the system prompt to address failures without disrupting what already works. Repeat for 4 epochs.

Best validation checkpoint selected for deployment. Optimization cost paid once — inference uses only the resulting prompt.
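
The four steps and the checkpoint selection can be sketched as a short Python loop. Everything here is a hypothetical stand-in: `solve`, `evaluate`, and `rewrite` are injected callables, the toy "model" emits JSON only when the prompt mentions it, and the rewrite step appends a fixed instruction instead of analyzing failure patterns. It shows the control flow only, not the AloLab implementation.

```python
import json
from typing import Callable, List, Tuple

def optimize_prompt(prompt: str, train: List[dict], val: List[dict],
                    solve: Callable[[str, dict], str],
                    evaluate: Callable[[str, dict], bool],
                    rewrite: Callable[[str, list], str],
                    epochs: int = 4) -> Tuple[str, float]:
    """Return the prompt with the best validation score and that score."""
    def score(p: str, tasks: List[dict]) -> float:
        return sum(evaluate(solve(p, t), t) for t in tasks) / len(tasks)

    best_prompt, best_val = prompt, score(prompt, val)
    for _ in range(epochs):
        # Solve + Evaluate: collect the failing (task, output) pairs.
        failures = []
        for task in train:
            output = solve(prompt, task)
            if not evaluate(output, task):
                failures.append((task, output))
        if not failures:
            break
        # Analyze + Optimize: rewrite the prompt from observed failures.
        prompt = rewrite(prompt, failures)
        val_score = score(prompt, val)
        if val_score > best_val:
            best_prompt, best_val = prompt, val_score
    return best_prompt, best_val

# Toy stand-ins, for illustration only.
def toy_solve(prompt: str, task: dict) -> str:
    answer = task["a"] + task["b"]
    if "JSON" in prompt:
        return json.dumps({"answer": answer})
    return f"The answer is {answer}"

def toy_evaluate(output: str, task: dict) -> bool:
    try:
        return json.loads(output).get("answer") == task["a"] + task["b"]
    except (json.JSONDecodeError, AttributeError):
        return False

def toy_rewrite(prompt: str, failures: list) -> str:
    return prompt + ' Respond only with JSON: {"answer": <number>}.'

tasks = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
best_prompt, best_val = optimize_prompt(
    "Solve the problem.", tasks, tasks, toy_solve, toy_evaluate, toy_rewrite)
print(best_val, best_prompt)
```

Because only the returned prompt is used afterwards, the loop's cost is paid once, which matches the deployment note above.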

Publications

The work,
in the open.

Interpretability · Jun 2026

Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

The direction that detects a behavior is not the direction that controls it. We measure the angle between knowing and steering — a weight-computable signature of the detection–intervention gap.

Galeone, A. Ettorre, Park, G. Ettorre, Ligorio · arXiv 2606.24952

Agent reliability · May 2026

When Correct Isn't Usable: Improving Structured Output Reliability in Small Language Models

Models can be 85% accurate yet 0% usable. AloLab, an iterative prompt optimizer, reaches 84–95% structured-output accuracy from a 0% baseline — without fine-tuning.

Galeone, Park, G. Ettorre, Ligorio · arXiv 2605.02363

More papers on the way. The framework and benchmarks are open-source at github.com/alomana-lab/alolab.

Open research

The paper, framework, and benchmarks are public. We publish because reproducibility is the standard, not the exception.

Private deployment

AloLab is the research arm of Alomana, an enterprise AI platform that runs in a private, ISO 27001-certified instance. Your production traces, models, and optimized prompts never leave your environment.

Security & privacy

ISO 27001 · GDPR · Private instance

For enterprises and model labs

Partner with
the lab.

Run AloLab on your own models and production traces. Benchmark agents against a deployed baseline. Close the reliability gap before it reaches your customers.
