Chirag Agrawal Podcast Transcript
Chirag Agrawal joins host Brian Thomas on The Digital Executive Podcast.
Brian Thomas: Welcome to Coruzant Technologies, Home of The Digital Executive Podcast.
Do you work in emerging tech, or are you working on something innovative? Maybe you're an entrepreneur? Apply to be a guest at www.coruzant.com/brand.
Welcome to The Digital Executive. Today’s guest is Chirag Agrawal. Chirag Agrawal is a seasoned technology professional with over a decade of experience building large-scale AI platforms, distributed systems, and developer tooling as a senior engineer and tech lead.
He specializes in LLM infrastructure, advanced AI-powered conversational systems, and multi-agent orchestration, including agent execution systems, prompt engineering, AI tool use, and AI memory, along with software development kits and compilers. Chirag leads cross-functional architectural initiatives driving lower latency, reliability, and scale for Alexa and its developer ecosystem.
Brian Thomas: Well, good afternoon, Chirag. Welcome to the show.
Chirag Agrawal: Thank you for having me, Brian. It’s great to be here.
Brian Thomas: Absolutely, my friend. I appreciate it. You are currently in India via Seattle, Washington, but I appreciate you making the time. It’s hard to traverse time zones in this busy world that we have, so I appreciate that.
Chirag, let’s jump into your first question. You’ve focused on the infrastructure between models and applications: runtimes, orchestration, memory, execution graphs, et cetera. How should product teams think about the gap between “we have a model” and “we have a system running it reliably in production,” and what are the most common pitfalls you see?
Chirag Agrawal: That’s a great question. I think the model should be treated as a dependency, not the main product. Product teams should think not about the model, but about the system around it. One of the common pitfalls I usually see is that teams try to build their AI agents from scratch.
And what happens is that they spend most of their time doing undifferentiated work. This is because they often overlook the complexity of the system around the model that is required to ship the agent to production. There are many concerns: retrieval, tool calling, orchestration, context management (like you said), compression, caching, and evaluation. Every team that’s building an agent has to do all of this work. It’s really undifferentiated, and it’s often already handled by frameworks or developer tooling provided by platforms.
So instead, what they should do is build the agent once from scratch for the prototype, because it does serve as a useful learning exercise, but then use a framework going forward to ship the agent in production systems. Another pitfall I usually see is teams not treating evaluation and guardrails as first-class citizens in their development life cycle.
They are often thought of as a last step: after we have built our agent, we will just evaluate it on some dataset. That is usually a mistake. And when I say evaluation, I don’t just mean the dataset. I’m also including the framework required to run the evaluation and make it repeatable for your system, so that you can quickly fine-tune your prompt or other behaviors of your agent over time without it bogging you down.
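As a rough illustration of treating evaluation as a repeatable framework rather than a one-off dataset pass, here is a minimal eval-harness sketch. All names here (`EvalCase`, `run_eval`, the toy echo agent) are hypothetical illustrations, not taken from any particular framework:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt: str    # input sent to the agent
    expected: str  # ground-truth answer used for scoring

def run_eval(agent: Callable[[str], str], dataset: list[EvalCase]) -> float:
    """Run every case through the agent and return the pass rate.

    Keeping this a plain, fast function is the point: you can re-run it
    after every prompt tweak, making evaluation part of the development
    loop instead of a final step.
    """
    passed = sum(1 for case in dataset if agent(case.prompt).strip() == case.expected)
    return passed / len(dataset)

# Toy usage: an "agent" that just echoes its input, scored on two cases.
dataset = [EvalCase("hi", "hi"), EvalCase("bye", "ciao")]
print(run_eval(lambda p: p, dataset))  # 0.5
```

In a real harness the scoring function would usually be task-specific (exact match, a rubric, or a model-graded check), but the shape, a cheap function you can run on every change, stays the same.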
Brian Thomas: Right. Thank you so much, and I appreciate you unpacking that for us, especially in this development world we live in, building agents. I took a couple of highlights away: your recommendation is to build that agent as a prototype, then stick to a framework going forward, and there shouldn’t be so much focus on the models.
You said they should be treated as a dependency, I believe, and really the application is the main focus. So again, I appreciate your insights. And Chirag, one of your focus areas is developer tooling, typed function-calling SDKs, binding model outputs to real APIs. How do you strike the balance between giving developers freedom to experiment and enforcing disciplined architecture, so the systems remain manageable at scale and don’t fragment?
Chirag Agrawal: So, developer freedom and architectural discipline can be seen as opposing forces, but they are actually not. In fact, developer tooling provides good abstractions to developers, which lowers the barrier to experimentation and speeds up development velocity. The goal of architectural discipline is to provide a safe playground where developers can move fast without breaking the larger system.
An example is, as you said, binding model outputs to real APIs. Good, well-architected developer tooling will provide abstractions so that you as a developer can design your API and bind it to model outputs the way you see fit. But concerns like schema validation, error handling, safety guardrails, or progressive context compression, all the things required to keep the user experience smooth and to manage latency and cost, are handled by the platform for you. So a developer platform provides general solutions to essential problems. It doesn’t really curb developer freedom at all.
And this type of tooling or platform can create incredible leverage for an organization. For example, a single change you make in the platform can improve latency or cost for all teams or products running on it.
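A sketch of the platform-side validation described here: the developer supplies an ordinary typed function, and the tooling checks the model’s JSON tool call against that function’s signature before invoking it. The function names and the use of Python’s `inspect` module are illustrative assumptions, not any specific SDK:

```python
import inspect
import json

def call_tool(func, model_output: str) -> dict:
    """Validate a model-produced JSON tool call against `func`'s signature,
    then invoke it. The platform owns this validation and error handling;
    the developer only writes the typed function."""
    args = json.loads(model_output)
    sig = inspect.signature(func)
    # Reject unknown or missing arguments instead of crashing downstream.
    try:
        bound = sig.bind(**args)
    except TypeError as e:
        return {"error": f"schema validation failed: {e}"}
    # Light type check against annotations, where present.
    for name, value in bound.arguments.items():
        ann = sig.parameters[name].annotation
        if ann is not inspect.Parameter.empty and not isinstance(value, ann):
            return {"error": f"argument {name!r} expected {ann.__name__}"}
    return {"result": func(*bound.args, **bound.kwargs)}

# The developer's side: a plain typed function standing in for a real API.
def get_weather(city: str, units: str = "metric") -> str:
    return f"22°C in {city}"  # stand-in for a real API call

print(call_tool(get_weather, '{"city": "Seattle"}'))
# {'result': '22°C in Seattle'}
print(call_tool(get_weather, '{"town": "Seattle"}'))
# returns an {'error': ...} dict instead of raising
```

Production SDKs typically generate a JSON Schema from the signature and hand it to the model as the tool spec, but the contract is the same: the developer designs the API, and the platform enforces the binding.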
Brian Thomas: Thank you, I appreciate that. I do believe that there is such a thing as developer freedom while remaining within that architectural discipline, right?
They can work in harmony, although sometimes it can be challenging, but over time I’ve seen a lot of improvements in the way we work on that system design lifecycle and in some of the tools we use today to keep those guardrails in place. So, I appreciate that. And Chirag, many organizations adopt AI with enthusiasm, but when you push into latency constraints, cost optimization, and real-world performance, you hit friction.
What are the operational metrics you monitor closely in agent platforms and how do you trade off quality versus cost versus speed?
Chirag Agrawal: There are three big buckets of metrics we monitor. First is latency, second is number of tokens, and third is quality. Within these three buckets, you can define the metrics depending on which part of the system you’re monitoring.
For latency, for example, you can start with how much time it took to construct the prompt, which would include all the time you spent gathering context. Another could be how much time it took from the moment you sent the prompt to the model to the time you received the first token. That tells you how much latency you’re adding to the system by virtue of your prompt, because that metric is usually related to the size of the prompt and the size of the model you’re using. Then another one you can monitor is how much time it took to render the first word, or first element, of the response to the user. Because it’s generative AI, users thankfully don’t have to wait for the entire model response to be available.
As soon as the first word of the response is ready, you can start streaming it to the user right away. Those are just a few examples of the metrics you can monitor for latency, but of course you can go much deeper depending on your problem space. For number of tokens, the obvious metrics to monitor are how many input tokens you’re sending in and how many output tokens the model is producing per user request.
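The latency breakdown just described, prompt-construction time and time to first token (TTFT), can be captured with straightforward instrumentation. `build_prompt` and `stream_model` below are hypothetical stand-ins for your context-gathering code and a streaming model client:

```python
import time

def timed_request(build_prompt, stream_model) -> dict:
    """Measure prompt-construction time and time-to-first-token (TTFT)
    for one request, returning both in milliseconds."""
    t0 = time.monotonic()
    prompt = build_prompt()  # all context gathering happens here
    prompt_build_ms = (time.monotonic() - t0) * 1000

    t1 = time.monotonic()
    ttft_ms = None
    for token in stream_model(prompt):
        if ttft_ms is None:
            # First token arrived: this is the model-side latency the
            # user actually feels before streaming begins.
            ttft_ms = (time.monotonic() - t1) * 1000
        # In a real system, each token would stream to the user here.
    return {"prompt_build_ms": prompt_build_ms, "ttft_ms": ttft_ms}

# Toy usage with a fake streaming client.
def fake_stream(prompt):
    yield "Hello,"
    yield " world"

metrics = timed_request(lambda: "a prompt", fake_stream)
print(sorted(metrics))  # ['prompt_build_ms', 'ttft_ms']
```

In practice these numbers would be emitted to your metrics system per request rather than returned, but the measurement points are the same.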
Those two things drive the cost you’re going to pay for those tokens and the latency you’re going to incur for that user request. Within those, you can also monitor the number of cached input tokens, and you can measure the number of tokens for the different parts of the prompt that are dynamic in nature.
For example, conversation history, which is built up over the course of your back and forth with the AI. Cached input tokens can be really valuable to monitor because they can guide your prompting strategies. If you design your prompt so that the top parts are unchanged over the course of the conversation and the most dynamic parts are towards the bottom, then you will end up utilizing a lot of cached input tokens, which will reduce your cost and latency. And you can do this for different parts of the prompt.
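The ordering strategy described here, most static parts at the top and most dynamic at the bottom, can be sketched as a simple prompt assembler. The section names are illustrative; the point is that the unchanged prefix is what provider-side prompt caching can reuse across turns:

```python
def assemble_prompt(system_prompt: str, tool_specs: str,
                    history: str, user_message: str) -> str:
    """Order prompt parts from most static to most dynamic so that
    prefix-based prompt caching can reuse the unchanged top of the
    prompt across turns, cutting cost and latency."""
    parts = [
        system_prompt,  # never changes within a conversation
        tool_specs,     # changes rarely
        history,        # grows each turn, but only at the end
        user_message,   # changes every turn
    ]
    return "\n\n".join(part for part in parts if part)

p1 = assemble_prompt("You are helpful.", "tools: none", "", "Hi")
p2 = assemble_prompt("You are helpful.", "tools: none",
                     "user: Hi\nai: Hello", "How are you?")
# Both turns share the same cacheable prefix.
print(p2.startswith("You are helpful.\n\ntools: none\n\n"))  # True
```

Had the history been placed above the tool specs, the prefix would change every turn and cache hits would drop to nearly zero, which is exactly what monitoring cached input tokens would reveal.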
So that really guides your prompt and context engineering. Quality, our last bucket, is much harder to define, and it is often dependent on the domain or the task you’re solving with AI. But some of the high-level metrics which I believe should be common across most agents are accuracy of tool selection, accuracy of argument filling, and truthfulness of the model relative to the context. These are often metrics you cannot monitor online, because you need ground truth for them. So this last bucket is often monitored offline through your test harness. You can also model the behavior of the agent through online reflection.
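An offline metric like tool-selection accuracy is simple to compute once you have labeled ground truth, which is exactly why it lives in the test harness rather than in online monitoring. A minimal sketch, with illustrative names:

```python
def tool_selection_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of turns where the agent chose the expected tool.
    `predicted` comes from replaying logged requests through the agent;
    `expected` is human-labeled ground truth."""
    if len(predicted) != len(expected):
        raise ValueError("prediction/ground-truth length mismatch")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)

# Toy usage: the agent picked the right tool in 2 of 3 turns.
acc = tool_selection_accuracy(
    ["search", "calculator", "search"],
    ["search", "search", "search"],
)
print(round(acc, 2))  # 0.67
```

Argument-filling accuracy follows the same pattern, just comparing the bound argument values instead of the tool names.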
But I think that spills into product quality metrics rather than operational metrics. The second part of your question, about the trade-off between these three, is quite interesting. These are arranged in a kind of triangle.
If you lean on one of them too much, you have to give up the others. For example, if you try to improve latency too aggressively, you might compromise the intelligence or quality of your product. And if you chase quality too aggressively, your token cost and number of tokens will explode, which adds to latency.
So it’s really an art. You have to constantly tune the system according to user feedback.
Brian Thomas: Great, I really appreciate that, Chirag. I know there are a lot of metrics you’d like to monitor during this whole process. Latency has always been a big one, and you talked about some of the other constraints: cost optimization, token usage, caching, and breaking out that trade-off between quality versus cost versus speed.
So, I appreciate that. Chirag, last question of the day. As you build the foundational layers of agentic AI systems, how are you thinking about ethics, bias, transparency, and interoperability, so agents from different teams and systems can coordinate? And looking ahead, what do you see as the next frontier in production AI infrastructure, whether that’s plugin marketplaces, agent networks, or open protocols?
Chirag Agrawal: Ethics, bias, and transparency: these are things that should be built into the foundational layer of the system itself. They cannot be bolted on later. All the requests that flow through these agentic platforms
should be auditable, observable, and traceable. And like I said earlier, evaluation hooks should be built into the runtime platform itself so that we can monitor the agent’s behavior in real time through reflection, and you can devise ways to prevent unsafe behavior, or at least mitigate it very quickly.
Interoperability is the other side of it. Agents built by different teams need to communicate through open and typed protocols, much like how systems are integrated through HTTP. That’s why I’m very excited about emerging standards like MCP and A2A, which define how agents discover each other, authenticate other unknown agents, and collaborate safely across system boundaries. The current scenario reminds me of the early days of Android and iOS, where we had useful, functional mobile apps, but they were not that good.
Over the course of the next decade, they became really, really good. I think the same thing is going to happen with these agents. They are in a nascent stage right now, but they’re going to improve dramatically over the next few years. So, looking ahead, I think the next frontier in production AI infrastructure is an internet of agents, or multi-agent systems, where agents built by different teams can share capabilities but still operate under their own governance.
Brian Thomas: Thank you, and I think that’s really important. We talk a lot about guardrails, governance, and ethics, and you talked about how important it is to build those into the systems and make sure these systems are auditable and observable.
And then moving into interoperability, I think it’s important that agents can interact with other agents that were built by different teams. You talked about MCP, the Model Context Protocol, which is so important today and has obviously been a hot topic for interoperability.
So, I appreciate that. And Chirag, it was such a pleasure having you on today and I look forward to speaking with you real soon.
Chirag Agrawal: Yeah, thank you for having me. It was a pleasure talking to you as well.
Brian Thomas: Bye for now.
Listen to the audio on the guest’s Podcast Page.