Agent Infrastructure as Code.

Orloj is an open-source orchestration runtime for multi-agent AI systems. Define agents, tools, policies, and workflows in YAML. Orloj schedules, executes, and governs them.

View on GitHub → | Docs

Orloj in 10 seconds

Pick your lens.

Caveman: just tell me what it does

You have AI agents. They need rules, schedules, and someone watching. Orloj is that someone.

→ Clone and run it

AI Big Brain: I build agents for a living

You know the frameworks. Orloj is what you reach for when prototypes have to become reliable production systems: YAML-defined workflows, tools running in isolated containers, fail-closed governance, retries, and observability without the glue code.

→ Read the integration docs

VC Bro: what’s the pitch?

It’s Kubernetes for AI agents. Open source. Declarative. The missing infrastructure layer between ‘demo agent’ and ‘production agent fleet.’

→ View the repository

Infra Engineer: show me the architecture

Declarative YAML manifests for agents, policies, and DAG workflows. Tools run in isolated containers with explicit image pins. CLI apply/rollback. Fail-closed governance at the execution layer. Lease-based scheduling with dead-letter handling.

apiVersion: orloj.dev/v1
kind: Agent
metadata:
  name: research-agent
spec:
  model_ref: openai-default
  prompt: You are a research assistant. Be concise.
  roles:
    - analyst-role
  tools:
    - name: web_search
    - name: code_exec
  limits:
    max_steps: 6
    timeout: 30s

→ Read the architecture docs

The Problem

Production agents need governance.

Same agent ambition. Different operational outcomes once runtime constraints are enforced as policy, not convention.

Capability	Today	With Orloj
Tool Boundaries	Agents call tools they should not touch.	Tool permissions enforced at execution time.
Cost Controls	Token spend spikes without policy limits.	Per-agent token caps and model allowlists.
Failure Handling	Retries and dead-letter handling are hand-rolled.	Lease-based retry, replay, and dead-letter primitives.
System Composition	Multi-agent wiring lives in bespoke glue code.	Declarative YAML graphs with fan-out and join gates.
Auditability	No end-to-end trace when incidents hit production.	Full task trace and message lifecycle logging.

Why Orloj

From prototype logic to production runtime guarantees.

The platform is designed for teams that need deterministic execution, policy enforcement, and safe operations under real production load.

Governance enforced at the execution layer

Policies and permissions are evaluated inline on every turn and tool call. Unauthorized actions fail closed with traceable outcomes.

AgentPolicy, AgentRole, ToolPermission
Token budget and model allowlists
Deny events with full call context
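
The fail-closed behavior described above can be illustrated with a minimal Python sketch. This is not Orloj's implementation: the `ROLE_TOOLS` table, `DenyEvent`, `ToolDenied`, and `authorize` are hypothetical names, and the role-to-tool grants are assumed for illustration. The point is only that a tool call succeeds when some role explicitly grants it and is otherwise denied with its call context attached.

```python
from dataclasses import dataclass, field


@dataclass
class DenyEvent:
    """Record of a blocked call, kept with its full context for auditing."""
    agent: str
    tool: str
    reason: str
    arguments: dict = field(default_factory=dict)


class ToolDenied(Exception):
    def __init__(self, event: DenyEvent):
        super().__init__(f"{event.agent} -> {event.tool}: {event.reason}")
        self.event = event


# Hypothetical role-to-tool grants; anything not listed here is denied.
ROLE_TOOLS = {"analyst-role": {"web_search", "vector_db"}}


def authorize(agent: str, roles: list[str], tool: str, arguments: dict) -> None:
    """Raise ToolDenied unless some role explicitly grants the tool."""
    granted = set()
    for role in roles:
        granted |= ROLE_TOOLS.get(role, set())  # unknown roles grant nothing
    if tool not in granted:
        raise ToolDenied(DenyEvent(agent, tool, "no role grants this tool", arguments))
```

Calling `authorize("research-agent", ["analyst-role"], "web_search", {})` returns without error, while any tool outside the grant set raises `ToolDenied` carrying the arguments of the rejected call.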

Agents as declarative manifests, not programs

Version-controlled manifests for agents, tools, models, and workflows. Apply once, diff in PRs, and roll back safely.

Idempotent reconcile on every apply
Schema validation before commit
Data contracts over glue code

Production reliability built into the runtime

Reliability primitives you'd otherwise hand-roll. Fan-out/fan-in and failure handling are part of the runtime, not application code you maintain.

Lease-based task ownership
Bounded retry with jitter
Dead-letter queue and cron scheduling
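
For readers who have not hand-rolled these primitives, here is a rough Python sketch of bounded retry with jitter plus a dead-letter list. It assumes exponential backoff with full jitter, which is one common choice and not necessarily what Orloj does; `backoff_delays`, `run_with_retry`, and the dead-letter record shape are hypothetical.

```python
import random
import time


def backoff_delays(max_attempts, base=0.5, cap=30.0, rng=random.random):
    """Yield one full-jitter delay (in seconds) before each retry."""
    for attempt in range(1, max_attempts):
        ceiling = min(cap, base * 2 ** (attempt - 1))
        yield rng() * ceiling  # uniform in [0, ceiling)


def run_with_retry(task, payload, dead_letters, max_attempts=3, sleep=time.sleep):
    """Try task(payload) up to max_attempts times, then dead-letter the payload."""
    delays = backoff_delays(max_attempts)
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(payload)
        except Exception as exc:  # a real runtime would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                sleep(next(delays))
    # Retries exhausted: park the payload instead of dropping it silently.
    dead_letters.append(
        {"payload": payload, "error": repr(last_error), "attempts": max_attempts}
    )
    return None
```

Bounding the attempts keeps a poisoned task from looping forever, and the jitter spreads retries out so that many failing tasks do not all retry at the same instant.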

See It

One command. Full agent system.

orlojctl apply -f ./your-system/ reconciles agents, graph, governance, and tasks in a single declarative pass.

Step 1

Define an agent

Agents declare model, tools, permissions, and execution limits as data. No bespoke orchestration code required.

agent.yaml (YAML)

apiVersion: orloj.dev/v1
kind: Agent
metadata:
  name: research-agent
spec:
  model_ref: openai-default
  prompt: |
    You are a research assistant.
    Produce concise, evidence-backed answers.
  tools:
    - web_search
    - vector_db
  roles:
    - analyst-role
  limits:
    max_steps: 6
    timeout: 30s

Step 2

Compose a workflow graph

AgentSystem resources connect specialized agents into deterministic pipelines with explicit handoffs.

agent-system.yaml (YAML)

apiVersion: orloj.dev/v1
kind: AgentSystem
metadata:
  name: report-system
spec:
  agents:
    - planner-agent
    - research-agent
    - writer-agent
  graph:
    planner-agent:
      next: research-agent
    research-agent:
      next: writer-agent
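
To make the handoff order in the manifest above concrete, the following Python sketch mirrors its `agents` and `graph` sections as plain data and derives the execution order. It is an illustration only: `execution_order` is a hypothetical helper, it handles just the linear `next` chain shown here (no fan-out or join gates), and it says nothing about how Orloj resolves graphs internally.

```python
# Mirrors the `agents` and `graph` sections of the manifest as plain data.
AGENTS = ["planner-agent", "research-agent", "writer-agent"]
GRAPH = {
    "planner-agent": {"next": "research-agent"},
    "research-agent": {"next": "writer-agent"},
}


def execution_order(agents, graph):
    """Return agents in handoff order, rejecting unknown names and cycles."""
    targets = {edge["next"] for edge in graph.values()}
    unknown = (set(graph) | targets) - set(agents)
    if unknown:
        raise ValueError(f"graph references undeclared agents: {sorted(unknown)}")
    # The entry agent is the one that no other agent hands off to.
    starts = [a for a in agents if a not in targets]
    if len(starts) != 1:
        raise ValueError("expected exactly one entry agent")
    order, current = [], starts[0]
    while current is not None:
        if current in order:
            raise ValueError(f"cycle detected at {current}")
        order.append(current)
        current = graph.get(current, {}).get("next")
    return order


print(execution_order(AGENTS, GRAPH))
```

Validating the graph against the declared agent list before running anything is the same idea as schema validation before commit: a typo in a `next` target fails early instead of at handoff time.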

Step 3

Enforce governance

Policies are runtime gates. Blocked actions return structured errors and complete audit traces.

policy.yaml (YAML)

apiVersion: orloj.dev/v1
kind: AgentPolicy
metadata:
  name: cost-and-security-policy
spec:
  apply_mode: scoped
  target_systems:
    - report-system
  max_tokens_per_run: 50000
  allowed_models:
    - gpt-4o
  blocked_tools:
    - filesystem_delete
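
As a sketch of what a runtime gate over these fields might look like, the Python below evaluates the three limits from the manifest above and returns a structured denial when one fails. The `Policy` class, `check_call`, and the error codes are assumptions made for illustration; the source does not show Orloj's actual error format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    max_tokens_per_run: int
    allowed_models: frozenset
    blocked_tools: frozenset


# Values copied from the manifest above.
POLICY = Policy(50000, frozenset({"gpt-4o"}), frozenset({"filesystem_delete"}))


def check_call(policy, model, tokens_used, tokens_requested, tool=None):
    """Return (allowed, denial); denial is a dict describing the failed check."""
    if model not in policy.allowed_models:
        return False, {"code": "model_not_allowed", "model": model}
    if tool is not None and tool in policy.blocked_tools:
        return False, {"code": "tool_blocked", "tool": tool}
    if tokens_used + tokens_requested > policy.max_tokens_per_run:
        return False, {"code": "token_budget_exceeded", "limit": policy.max_tokens_per_run}
    return True, None
```

Returning a denial record rather than a bare boolean is what lets a blocked action show up in an audit trace with the reason attached.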

Architecture

Server. Workers. Governance.

Orloj runs as a server/worker architecture that scales from a single process to distributed deployments. Governance is enforced inline at the worker layer.

Orloj runtime

Server (orlojd)

API Server: REST, watch, web console
Resource Store: in-memory or Postgres
Task Scheduler: assignment, cron, webhooks
Services: reconciliation loops per resource

The server assigns tasks to workers.

Governance (enforced inline at the worker layer)

AgentPolicy, AgentRole, ToolPermission

Workers (orlojworker)

Model Gateway: OpenAI, Anthropic, Ollama
Tool Runtime: sandboxed, container, WASM
Message Bus: in-memory or NATS JetStream
Task Worker: lease-based, concurrent

Single process. In-memory storage. Sequential execution. No external dependencies.

orlojd --embedded-worker --storage-backend=memory

Templates

Starter templates for real operational workflows.

Each template will be a ready-to-deploy Orloj manifest for a common infrastructure task. The templates below are on the roadmap, and community contributions are welcome.

Coming soon

Incident response triage

Webhook-triggered. Agents pull logs, correlate metrics, check recent deployments. Read-only tool permissions mean investigation agents can look but can't roll back infrastructure.

Coming soon

Compliance evidence collector

Pipeline agents check contracts against regulatory requirements. Model allowlists keep sensitive content off unapproved providers. Every finding is traced and auditable.

Coming soon

CVE investigation pipeline

Researcher, analyst, and editor stages in a hierarchical agent system. The researcher can query CVE databases; only the editor can write to the output. Token budgets enforced per run.

Coming soon

Secret rotation auditor

Agents scan infrastructure for stale or exposed secrets using WASM-isolated tools. Metadata-only access patterns let agents audit secrets without reading secret values.

Get Started

Running in five minutes.

Install CLI and init a project

brew tap OrlojHQ/orloj
brew install orlojctl
orlojctl init example-system

Install runtime binaries

curl -sSfL https://raw.githubusercontent.com/OrlojHQ/orloj/main/scripts/install.sh | sh

Run Orloj locally

orlojd --storage-backend=memory --embedded-worker

Deploy your agent system

orlojctl apply -f example-system

Need a full walkthrough and production setup guidance? Read the full quickstart →

Pricing

Open source at the core. Managed when you need it.

Orloj is free and open source under Apache 2.0. When you're ready to scale, we handle the infrastructure.

Open Source

Community

Free

Apache 2.0, forever

The full Orloj runtime, open source. Deploy on your own infrastructure with no limits.

Unlimited agents and workflows
DAG orchestration and fan-out/fan-in
Fail-closed governance engine
MCP tool integration
Built-in observability console
CLI and YAML-first workflow
Community support via Discord

Get Started →

Coming Soon

Cloud

Pay as you go

usage-based pricing

Managed Orloj infrastructure so your team can focus on building agents, not operating them.

Fully managed control plane
Spin up additional workers on demand
Hosted dashboard with logs and traces
Automatic runtime updates
Team workspaces
Email and chat support

Join Waitlist →

Enterprise

Custom

annual contract

For organizations that need advanced security, compliance, and dedicated support.

Everything in Cloud
SSO / SAML authentication
Role-based access control (RBAC)
Audit log export and SIEM integration
Custom SLAs and uptime guarantees
Dedicated support engineer
On-prem / VPC deployment option

Talk to the team →

FAQ

Frequently asked questions

What is Orloj agent orchestration?

Agent orchestration means coordinating multiple AI agents in production with governance, scheduling, and observability. Orloj is like Kubernetes for agents: it brings the same operational rigor you already apply to containers and databases.

How does Orloj differ from LangChain or CrewAI?

LangChain helps you build agents. CrewAI helps agents collaborate. Orloj runs agents in production, with governance, observability, and the reliability patterns you expect from infrastructure. They solve different problems rather than compete.

What does "fail-closed" mean?

Fail-closed means unauthorized actions are denied by default. An agent can only use tools you explicitly permit. Fail-open (the alternative) would allow actions unless you explicitly block them, which is a risky default in production.

Is Orloj a framework for building agents?

Orloj is an orchestration plane for running agents. You can build agents in Orloj just as you would with frameworks like LangChain, LlamaIndex, or CrewAI. Orloj then manages them at scale with governance, scheduling, and reliability.

Do I need to rewrite my existing agents for Orloj?

Not necessarily. Orloj works with agents built in any framework via standardized tool interfaces. Some refactoring may be needed for specific governance requirements, but you don’t need to rebuild from scratch.

How does Orloj handle agent failures?

Orloj includes lease-based task ownership, retry with jitter, idempotency tracking, and dead-letter handling. These patterns are designed to contain cascading failures and keep your agent fleet running through partial outages.

What about compliance and audit trails?

Orloj logs all agent actions, tool calls, and policy decisions. The structured audit trail is designed to support compliance workflows for frameworks like HIPAA, SOC 2, and the EU AI Act. Governance is enforced at the execution layer, not as an afterthought.

How does observability work?

Orloj provides structured logging, distributed tracing, metrics collection, and cost attribution. You can trace an agent’s decision path, see which tools it called, understand latency, and allocate costs by agent or workflow.

Is Orloj open source?

Yes. Orloj is Apache 2.0 licensed and developed publicly on GitHub. You can run it on-premises or in your own VPC.

What’s the learning curve?

If you’re familiar with Kubernetes, Docker, or infrastructure-as-code tools, Orloj will feel familiar. You define agents and policies in YAML manifests and deploy with a single command.

Community

Built in the open. Contribute from day one.

Orloj is Apache 2.0. The full runtime is open source: governance, orchestration, scheduling, observability.

GitHub

Star the repo, read the source, open an issue.

github.com/OrlojHQ/orloj →

Discord

Ask questions, share what you're building, join weekly community calls.

discord.gg/orloj →

Contribute

Good first issues labeled. Architecture docs available. PRs welcome.

Contributing guide →

Blog

Latest from the team.

All posts →

comparison

Orloj vs. LangGraph vs. CrewAI: 2026 Update

Six months since our original comparison. All three frameworks shipped major updates. Here's what changed and what didn't.

Jon Mandraki | Read article →

governance

Why Every Agent System Needs a Governance Layer (Not Just Guardrails)

Guardrails check outputs. A governance layer controls inputs, execution, access, and budget. They solve different problems. Most teams need both.

Jon Mandraki | Read article →

comparison

Orloj vs. Microsoft Semantic Kernel Agent Framework

Microsoft's Agent Framework brings .NET, Python, and Java support with deep Azure integration. Orloj is language-agnostic and cloud-agnostic. Different trade-offs for different teams.

Jon Mandraki | Read article →

Stop wiring. Start declaring.

Define your agents, enforce your policies, and ship to production.

Get Started → | Read the Docs
