Inside Agnes AI’s Proprietary Tech Stack: How a Singapore Startup Is Outperforming DeepSeek

Singapore — In the global AI arms race, most startups face a difficult choice: build on top of existing large language models from OpenAI, Google, or open-source alternatives, or invest enormous resources developing a proprietary tech stack from the ground up. Agnes AI, the Singapore-based platform that has grown to over 3 million users since July 2025, chose the harder path — and it’s paying dividends.

“We build our own models — we’re not heavily dependent on open-source or third-party APIs,” founder and CEO Bruce Yang explained in a recent interview on the Convo AI World Podcast. “That makes us more ‘sovereign’ as an AI company.”

The technical architecture behind Agnes reveals how a relatively small team can compete with — and in some cases outperform — models from far larger organizations.

The Agnes-R1 Model: SOTA in the 7B Range

At the heart of Agnes’s technology stack is Agnes-R1, a fully proprietary 7-billion-parameter model trained using the company’s own reinforcement learning framework called DSP (Dynamic Sequential Policy Optimization).

“DSP is an enhanced version of DPO,” Yang explained. “From our research team, we’ve achieved strong results on open benchmarks — in reasoning, we outperform earlier versions of DeepSeek.”
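
DSP itself has not been published, so the sketch below only shows the standard DPO objective that Yang says DSP builds on, written as a minimal PyTorch function. Variable names are illustrative, and none of this is Agnes's actual training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen_logp: torch.Tensor, pi_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor, ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Direct Preference Optimization: increase the policy's log-probability
    # margin for the chosen response over the rejected one, measured relative
    # to a frozen reference model. beta limits drift from the reference.
    chosen_margin = beta * (pi_chosen_logp - ref_chosen_logp)
    rejected_margin = beta * (pi_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```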

On multiple QA benchmarks, Agnes-R1 demonstrated a 34.1% improvement compared to DeepSeek’s GRPO framework and surpassed previous 14B models by nearly 9% on complex multi-hop reasoning tasks such as HotpotQA.

But Yang is clear-eyed about the limitations of benchmarks.

“Benchmarks don’t translate well to subjective real-world tasks — judging research quality, commenting on presentation design, participating naturally in group chats,” he noted. “There’s no exact right or wrong answer, and it’s very hard to give a score.”

The Universal Verifier: Bridging Benchmarks and Reality

To solve this problem, Agnes developed what Yang calls a Universal Verifier — an internal evaluation system that provides scores and rewards for subjective tasks.

“Otherwise the model doesn’t know which direction to improve,” Yang explained. “It doesn’t even know where to go.”

The Universal Verifier helps Agnes bridge open benchmarks and real-life problems like:

  • Writing high-quality research summaries
  • Producing professional slides
  • Acting naturally in group chats (“talk when needed, stay quiet when needed”)

“We spent a lot of effort transitioning from research to production,” Yang said. “This verifier is part of our secret sauce — we haven’t open-sourced it.”
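
Since the Universal Verifier is closed, the following is only a rough sketch of the general pattern Yang describes: a rubric-based LLM judge that turns a subjective output into a normalized reward the training loop can optimize against. The rubric text and function names are hypothetical.

```python
from typing import Callable

# Hypothetical rubric; Agnes's actual Universal Verifier is proprietary.
RUBRIC = (
    "Score the response from 0 to 10 for the given task, considering factual "
    "grounding, structure, and whether responding at all was appropriate "
    "(e.g. staying quiet in a group chat when no reply is needed). "
    "Return only the number."
)

def verify(judge: Callable[[str], str], task: str, response: str) -> float:
    prompt = f"{RUBRIC}\n\nTask:\n{task}\n\nResponse:\n{response}"
    raw = judge(prompt)                 # any LLM completion callable
    return float(raw.strip()) / 10.0    # normalized reward for the RL stage
```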

A Full Family of Proprietary Models: 7B to 30B

Agnes-R1 is just one component of a broader model ecosystem. The company has built a full family of models ranging from 7B to 30B parameters, each optimized for specific tasks.

“We don’t need to beat GPT-4 in everything,” Yang emphasized. “We just need to outperform in specific categories — research, generating PowerPoints, group chat assistance. We have specific models for each mode in our system.”

Under the hood, Agnes relies on sophisticated task orchestration. Roughly half of user traffic is routed to self-developed models optimized for specific tasks such as research, slide generation, and image or video creation.

This mixture approach allows Agnes to:

  • Deliver faster inference speeds than relying solely on external APIs
  • Maintain higher output quality for targeted use cases
  • Achieve significantly lower token costs
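
The orchestration layer itself is not public. As a rough illustration of the routing idea, here is a minimal sketch in which covered task types go to in-house specialist models and everything else falls back to a general model; the model identifiers are placeholders.

```python
from typing import Callable

# Placeholder model names, not Agnes's real identifiers.
SPECIALISTS = {
    "research": "agnes-research-7b",
    "slides": "agnes-slides-7b",
    "image": "agnes-image-gen",
    "video": "agnes-video-gen",
}

def route(task_type: str, prompt: str, call_model: Callable[[str, str], str]) -> str:
    # Specific task types are served by self-developed specialist models;
    # anything uncovered falls through to a general model or an external API.
    model = SPECIALISTS.get(task_type, "agnes-general-7b")
    return call_model(model, prompt)
```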

Code Agents: The Token Efficiency Breakthrough

One of Agnes’s most innovative technical contributions is its Code Agents system — a framework that uses pseudo-code instead of natural language for multi-agent communication.

“The idea is simple,” Yang explained. “If you run the same pseudo-code twice, you get the same result. But if two agents speak natural language, misunderstandings occur because of ambiguity.”

The results are significant:

  • 5–10% improvement in accuracy on benchmarks like HotpotQA
  • 40–70% reduction in token usage
  • Dramatically faster response times

This work has been documented in the team’s research paper CodeAgents: A Token-Efficient Framework for Codified Multi-Agent Reasoning in LLMs, which introduces the code-first prompting framework that enables structured and token-efficient planning in multi-agent environments.
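
The paper's exact schema is not reproduced here; the sketch below only illustrates the core idea of agents exchanging typed, executable steps rather than free-form prose, so the same plan always produces the same trace. Step names and the tool registry are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Step:
    op: str   # e.g. "search", "read", "answer" (invented operation names)
    arg: str

def execute(plan: List[Step], tools: Dict[str, Callable[[str, str], str]]) -> str:
    # The plan is structured code, not natural language: re-running it is
    # deterministic, and no tokens are spent re-explaining context between agents.
    state = ""
    for step in plan:
        state = tools[step.op](step.arg, state)
    return state

# Example: a planner agent emits [Step("search", q), Step("answer", q)] and the
# executor runs it against a registry of tool functions.
```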

“On mobile, users want results within 1 minute,” Yang said. “If research takes 5 minutes, they lose context and interest. So we designed our system to maximize accuracy within one minute, then cut off. This makes us the fastest in the world in our category.”

The Three-Agent System for Multimodal Generation

For image and video generation, Agnes employs a sophisticated three-agent architecture:

1. Intent Understanding Agent

“LLMs are much better at understanding human intent than diffusion models,” Yang explained. “If you give a diffusion model a vague prompt, it hallucinates. So the first agent analyzes intent.”

2. Generation Agent

Extremely precise instructions are passed to Agnes’s in-house generation models — post-trained versions of leading open-source models for image and video.

“They perform very well when given precise instructions,” Yang noted. “But if you give abstract or incomplete descriptions, they hallucinate and give lousy results.”

3. Judge Agent (LLM as Evaluator)

After generation, a judging agent evaluates:

  • Does the output meet requirements?
  • What needs improvement?

The feedback loops back to the generation step, repeating until the result satisfies the evaluation model.

“This multi-agent chain works better than any single model alone,” Yang said.
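
Agnes has not released this pipeline, so the version below is only a schematic of the intent, generate, judge loop described above; every callable is assumed, and the retry cap is an arbitrary choice.

```python
def generate_media(user_prompt: str, intent_agent, generator, judge, max_rounds: int = 3):
    spec = intent_agent(user_prompt)          # 1. LLM turns a vague prompt into precise instructions
    output = generator(spec, feedback=None)   # 2. in-house generation model follows the spec
    for _ in range(max_rounds):
        verdict = judge(spec, output)         # 3. LLM judge: requirements met? what to improve?
        if verdict["accepted"]:
            break
        output = generator(spec, feedback=verdict["feedback"])
    return output
```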

Training on Real User Data

Agnes’s technical advantage is compounded by its user base. With over 200,000 daily active users generating prompts, the company has accumulated massive amounts of real-world training data.

“We use that data — especially minority language data, spoken language, not something you find in literature — to do pre-training,” Yang explained. “Then during post-training, we use distilled data from state-of-the-art models. Finally, we apply our policy optimization reinforcement learning framework to make our models work even better.”

The approach is iterative and data-driven: user interactions improve models, better models improve user experience, and improved experience generates more training data.
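
Read as a recipe, the quoted workflow is a three-stage pipeline: continued pre-training on real user data, post-training on distilled data, then reinforcement learning against verifier rewards. A generic sketch, with the stage functions assumed to be supplied by the training framework:

```python
from typing import Callable, Sequence

def run_training_pipeline(base_model, stages: Sequence[Callable]):
    # e.g. stages = [continued_pretrain_on_user_data,
    #                finetune_on_distilled_data,
    #                rl_with_verifier_rewards]   (placeholder names)
    model = base_model
    for stage in stages:
        model = stage(model)
    return model
```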

Regional LLM Training

To further enhance localization, Agnes is continuously training regional large language models focused on Southeast Asia and Latin America.

“We want to deepen our understanding and generation of local dialects, slang, and cultural contexts,” Yang said. “Global platforms treat regional markets as an afterthought. We’re building specifically for them.”

This includes support for minority languages commonly used across Southeast Asia that major global LLMs frequently overlook — a technical investment that directly supports Agnes’s regional-first growth strategy.

Speed as a Product Principle

Throughout our conversation, Yang returned repeatedly to one theme: speed.

“We want to be the fastest AI consumer product on mobile in the world,” he stated. “Our philosophy is: try our best to get the best results in one minute. After one minute, we cut it off.”

This isn’t just an engineering preference — it’s a product strategy. Mobile users in emerging markets expect instant responses. They’re often on slower networks, older devices, and limited data plans. An AI that takes five minutes to research is an AI they’ll abandon.
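
One way to read the one-minute rule is as an anytime loop: keep refining a draft answer until the deadline, then return the best version so far. A minimal sketch, with the refine step assumed:

```python
import time
from typing import Callable, Optional

def answer_within_budget(question: str,
                         refine: Callable[[str, str], Optional[str]],
                         budget_s: float = 60.0) -> str:
    # Keep improving the draft until refine() reports it is finished (returns
    # None) or the time budget runs out, then return the best draft so far.
    deadline = time.monotonic() + budget_s
    draft = ""
    while time.monotonic() < deadline:
        improved = refine(question, draft)
        if improved is None:
            break
        draft = improved
    return draft
```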

By combining proprietary models, code agents, task orchestration, and aggressive optimization, Agnes has built a technical foundation that serves this philosophy.

The Research-Production Bridge

Agnes’s technical achievements reflect its unique position as a company bridging academic research and commercial deployment through a proprietary tech stack purpose-built for applied AI systems. Founded by a Raffles Institution alumnus and NUS AI PhD, the team includes researchers from NUS, NTU, MIT, Stanford, UC Berkeley, and UT Austin.

Their paper on “Stable and Efficient Policy Optimization for Agentic Search and Reasoning (DSPO)” has been submitted to ICLR 2025 and published on arXiv, demonstrating the team’s commitment to advancing the state of AI research — not just deploying existing models.

“Through improvements in reinforcement learning, we maintain non-collapsing training stability and convert stable reward optimization into consistent real-world generalization efficiency,” Yang explained. “By replacing closed-source SOTA models with ensembles of smaller, self-developed models, we’ve achieved excellent results in inference speed, output quality, and token cost efficiency.”

For a Singapore startup competing against the world’s best-funded AI labs, this research-first approach isn’t a luxury — it’s a necessity. And so far, the results suggest it’s working.
