AIOS — AI Agent Operating System
Running many AI agents on shared infrastructure without coordination is like running programs on a computer with no operating system — chaos. This pattern applies the same ideas that made OS design successful to the problem of managing AI agents fairly, safely, and in parallel.
Why agents need an operating system
Imagine running ten different programs on a computer that has no operating system — each program directly grabs as much CPU as it wants, writes wherever it likes in memory, and calls hardware directly. It would be chaos. Operating systems were invented exactly for this: give each program its own isolated space, schedule access to shared resources fairly, and prevent one runaway process from breaking everything else.
AI agent systems face the same problem. When many agents share the same LLM infrastructure, they compete for GPU time, burn through token budgets, and can interfere with each other's context. A slow agent blocks fast ones. A runaway agent burns resources that other agents needed. Without coordination, the system degrades in unpredictable ways.
AIOS applies the proven OS model to this problem. The LLM becomes the CPU — a shared compute resource that the kernel manages on behalf of all agents. Agents are processes. Tool calls are system calls — formally declared requests that go through the kernel rather than happening directly. A scheduler, memory manager, context manager, and access manager sit between agents and the LLM, handling contention the same way an OS handles process contention. Published research shows up to 2.1× faster serving compared to raw framework chaining.
The AIOS Kernel
The kernel is the core of the architecture. Agents do not call the LLM directly — they submit requests through the kernel's interface, and the kernel decides when each request runs, which backend handles it, and whether the agent has permission to do what it is asking. Five specialized managers handle different concerns. This separation is deliberate: each manager has a focused job, which makes the system predictable and lets you replace or upgrade individual components independently.
Scheduler
Decides the order in which agent requests run. Supports multiple strategies: first-in-first-out (FIFO) for simple cases, priority queues for urgent agents, and round-robin to prevent any one agent from starving the others of compute.
Context Manager
When the scheduler pauses one agent to let another run, the context manager saves the current LLM state and restores it when the paused agent resumes — like saving your place in a book before lending it.
Memory Manager
Keeps each agent's working memory separate from every other agent's. Prevents one agent from accidentally reading or overwriting another's in-progress state, even when they share the same underlying infrastructure.
Access Manager
Before any agent accesses a shared resource or calls another agent, the access manager checks whether it holds the required permission. This stops privilege escalation: an agent cannot do more than it was designed to do.
Tool Manager
Treats every tool call as a formal request with a declared signature. Logs every invocation, enforces rate limits per agent, and mediates between the agent and the actual tool endpoint.
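The scheduler and context manager described above can be sketched together as a toy round-robin loop. This is an illustrative Python sketch under assumed names (`AgentProcess`, `KernelScheduler`, the saved-context dict), not the published AIOS implementation: each agent runs one request per turn, and its state is saved before the kernel moves on to the next agent.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentProcess:
    agent_id: str
    requests: deque                               # pending LLM requests
    context: dict = field(default_factory=dict)   # state saved on preemption

class KernelScheduler:
    """Round-robin: each agent runs one request per turn, so no single
    agent can starve the others of LLM time."""

    def __init__(self):
        self.run_queue = deque()

    def admit(self, proc: AgentProcess):
        self.run_queue.append(proc)

    def step(self):
        """Run one scheduling step; returns (agent_id, request) or None."""
        if not self.run_queue:
            return None
        proc = self.run_queue.popleft()           # pick next agent in line
        request = proc.requests.popleft()         # dispatch one of its requests
        proc.context["last_request"] = request    # context manager: save state
        if proc.requests:                         # more work? back of the queue
            self.run_queue.append(proc)
        return proc.agent_id, request

sched = KernelScheduler()
sched.admit(AgentProcess("a1", deque(["summarize", "classify"])))
sched.admit(AgentProcess("a2", deque(["plan"])))

order = [sched.step()[0] for _ in range(3)]
print(order)  # interleaved: ['a1', 'a2', 'a1']
```

Note that agent `a1` does not run its two requests back to back: after its first turn it goes to the back of the queue, which is exactly the starvation-prevention behavior the scheduler entry describes.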
Agent Syscalls
In operating system design, a system call is the only sanctioned way for a program to ask the kernel for something — memory, hardware, files. The program cannot grab these things directly; it asks, and the kernel decides whether to grant the request. This formalization is what makes scheduling, auditing, and rate-limiting possible. AIOS applies the same model to agent-to-LLM interactions. Instead of calling an LLM API directly, agents call kernel syscalls. The kernel logs each request, checks quotas, and dispatches to an available backend. Agents never need to know which model answered, which backend was used, or how the kernel scheduled the work.
llm_generate
An inference request submitted to the kernel. The scheduler picks an available LLM core, dispatches the request, and returns the result to the agent. The agent does not know or care which model or backend was used.
execute_tool
A request to invoke an external tool — web search, code execution, database query. Goes through the tool manager for logging and rate-limiting and through the access manager before execution.
add_memory
Writes an entry into the agent's memory store. Subject to per-agent quotas that prevent one agent from monopolizing the shared memory layer.
search_memory
Reads from the agent's memory store using vector similarity or keyword search. Returns only results that belong to the requesting agent — memory is always isolated.
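The four syscalls can be sketched as a single kernel-mediated interface. This is a minimal illustration with invented internals (the `Kernel` class, the `QuotaExceeded` error, the stubbed responses are all assumptions), meant only to show the shape of the contract: every request names the calling agent, memory is partitioned per agent, and tool calls leave an audit trail.

```python
class QuotaExceeded(Exception):
    pass

class Kernel:
    def __init__(self, memory_quota=3):
        self._memory = {}       # agent_id -> list of entries (isolated stores)
        self._memory_quota = memory_quota
        self._tool_log = []     # every tool invocation is logged

    def llm_generate(self, agent_id, prompt):
        # A real kernel would enqueue this for the scheduler and pick a core;
        # here we return a stub so the interface is runnable.
        return f"[response to {agent_id}: {prompt}]"

    def execute_tool(self, agent_id, tool, args):
        self._tool_log.append((agent_id, tool, args))   # audit trail
        return {"tool": tool, "status": "ok"}

    def add_memory(self, agent_id, entry):
        store = self._memory.setdefault(agent_id, [])
        if len(store) >= self._memory_quota:            # per-agent quota
            raise QuotaExceeded(agent_id)
        store.append(entry)

    def search_memory(self, agent_id, keyword):
        # Only the caller's own store is searched — memory is always isolated.
        return [e for e in self._memory.get(agent_id, []) if keyword in e]

kernel = Kernel()
kernel.add_memory("agent_a", "user prefers short answers")
kernel.add_memory("agent_b", "user prefers long answers")

matches = kernel.search_memory("agent_a", "prefers")
print(matches)  # only agent_a's entry, never agent_b's
```

The isolation guarantee falls out of the data layout: there is no syscall that takes another agent's ID as a target, so an agent cannot even express a cross-agent read.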
LLM Cores
Just as a modern CPU has multiple cores that can run programs in parallel, an AIOS deployment has multiple LLM backends that can serve agent requests simultaneously. The kernel abstracts over all of them — it knows their capabilities, current load, and cost, and routes each request to the most appropriate one. This abstraction solves a practical problem: agent code should not need to be rewritten when you switch from one model provider to another, or when you add a cheaper local model for simpler tasks. The kernel handles routing. Agents just call the kernel.
Frontier Cores
High-capability models like GPT or Claude. Used when quality and reasoning depth matter more than cost or speed — complex multi-step tasks, code generation, nuanced judgment calls.
Local Cores
Smaller, self-hosted models like Llama or Mistral. Used for cheaper, faster tasks — summarization, classification, routing decisions — where a frontier model would be overkill.
Core Dispatcher
The routing layer that assigns incoming requests to the right backend. Can optimize for cost, latency, capability class, or simply the backend with the shortest current queue.
Quota Accounting
Tracks token spend and dollar cost per agent across all backends. Enforces hard limits so a runaway agent cannot burn through the entire fleet budget on its own.
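The dispatcher and quota accounting above can be combined into one short sketch. The core names, tiers, and cost-per-token figures below are made up for illustration; the routing rule shown (shortest queue within the requested capability tier, with a hard per-agent dollar limit) is one of the strategies the dispatcher entry lists.

```python
from dataclasses import dataclass

@dataclass
class LLMCore:
    name: str
    tier: str                  # "frontier" or "local"
    cost_per_token: float
    queue_len: int = 0

class Dispatcher:
    def __init__(self, cores, budget_per_agent):
        self.cores = cores
        self.budget = budget_per_agent   # hard dollar limit per agent
        self.spend = {}                  # agent_id -> dollars spent so far

    def route(self, agent_id, tier, est_tokens):
        # Among cores of the requested tier, pick the shortest queue.
        candidates = [c for c in self.cores if c.tier == tier]
        core = min(candidates, key=lambda c: c.queue_len)
        cost = est_tokens * core.cost_per_token
        if self.spend.get(agent_id, 0.0) + cost > self.budget:
            raise RuntimeError(f"quota exhausted for {agent_id}")
        self.spend[agent_id] = self.spend.get(agent_id, 0.0) + cost
        core.queue_len += 1              # request now waiting on that core
        return core.name

cores = [
    LLMCore("frontier-1", "frontier", cost_per_token=3e-5),
    LLMCore("local-1", "local", cost_per_token=1e-7, queue_len=2),
    LLMCore("local-2", "local", cost_per_token=1e-7),
]
d = Dispatcher(cores, budget_per_agent=0.01)

first = d.route("agent_a", "local", est_tokens=500)      # local-2: shortest queue
second = d.route("agent_a", "frontier", est_tokens=200)  # frontier-1
print(first, second)
```

Because the agent only names a capability tier, swapping `local-1` from Mistral to Llama, or adding a third local core, changes nothing in agent code: the kernel's routing table absorbs the change, which is the portability argument made above.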