llm.rb

Ruby's most capable AI runtime. Zero external dependencies. Runs on Ruby's standard library. Supports OpenAI, Anthropic, Google Gemini, DeepSeek, Ollama, and more.

gem install llm.rb

Zero deps · MCP · A2A · Streaming · Tools · Agents

About

llm.rb is Ruby's most capable AI runtime.

It runs on Ruby's standard library by default, loads optional pieces only when needed, and offers a single runtime for providers, agents, tools, skills, MCP, A2A (Agent2Agent), RAG (vector stores & embeddings), streaming, files, and persisted state. As a bonus, llm.rb is also available for mruby and has first-class Rails support via a separate gem.

It supports OpenAI, OpenAI-compatible endpoints, Anthropic, Google Gemini, DeepSeek, xAI, Z.ai, AWS Bedrock, Ollama, and llama.cpp. It also includes built-in ActiveRecord and Sequel support, plus concurrent tool execution through threads, tasks (via async gem), fibers, ractors, and fork (via xchan.rb gem).

Getting Started

LLM::Context

The LLM::Context object is at the heart of the runtime, and almost every other feature builds on it. It is a low-level interface to a model and requires you to manage tool execution yourself. The LLM::Agent class, covered next, is almost the same as LLM::Context but manages tool execution for you. A minimal context looks like this:

require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout)
ctx.talk "Hello world"

LLM::Agent

The LLM::Agent object is implemented on top of LLM::Context. It provides the same interface but manages tool execution for you. It also has two built-in safeguards: a loop guard that detects repeated tool call patterns, and a second guard that detects infinite tool call loops. Both guards advise the model to change course rather than raise an error:

require "llm"

llm = LLM.openai(key: ENV["KEY"])
agent = LLM::Agent.new(llm, stream: $stdout)
agent.talk "Hello world"
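
To make the repeated-call guard concrete, here is a minimal sketch in plain Ruby. It is not llm.rb's implementation: the LoopGuard class, its limit: option, and the advisory text are assumptions chosen for illustration. It remembers the most recent tool calls and returns advice, rather than raising, once the same call has been repeated too many times in a row:

```ruby
# Hypothetical stand-in for a loop guard; not part of llm.rb.
class LoopGuard
  def initialize(limit: 3)
    @limit = limit
    @recent = []
  end

  # Record a tool call. Returns an advisory String once the same
  # name/arguments pair has been seen `limit` times in a row,
  # and nil while the pattern still looks healthy.
  def check(name, arguments)
    @recent << [name, arguments]
    @recent.shift while @recent.size > @limit
    return nil unless @recent.size == @limit && @recent.uniq.size == 1
    "You have called #{name} #{@limit} times with the same arguments. Try a different approach."
  end
end

guard = LoopGuard.new(limit: 3)
advice = 3.times.map { guard.check("read-file", {path: "README.md"}) }
# The first two checks return nil; the third returns the advisory text.
```

Returning advice instead of raising keeps the conversation alive: the advisory text can be handed back to the model as the tool result so it has a chance to change course.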

Tools

The LLM::Tool class can be subclassed to implement your own tools that can extend the abilities of a model:

class ReadFile < LLM::Tool
  name "read-file"
  description "Read a file"
  parameter :path, String, "The filename or path"
  required %i[path]

  def call(path:)
    {contents: File.read(path)}
  end
end
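
Providers generally receive a tool as a JSON function definition. The sketch below shows, in plain Ruby, roughly what the ReadFile declaration above could translate to in the OpenAI-style function-calling format. The tool_definition helper is hypothetical, and how llm.rb performs this translation internally is an assumption, so treat the output as illustrative only:

```ruby
require "json"

# Hypothetical helper: builds an OpenAI-style function definition
# from a name, a description, and a hash of parameters.
def tool_definition(name:, description:, parameters:, required: [])
  properties = parameters.to_h do |param, (type, desc)|
    [param, {type: type, description: desc}]
  end
  {
    type: "function",
    function: {
      name: name,
      description: description,
      parameters: {type: "object", properties: properties, required: required.map(&:to_s)}
    }
  }
end

definition = tool_definition(
  name: "read-file",
  description: "Read a file",
  parameters: {path: ["string", "The filename or path"]},
  required: %i[path]
)
puts JSON.pretty_generate(definition)
```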

ask

LLM::Context also provides ask, a convenience interface that is compatible with RubyLLM's ask method. It accepts a prompt, an optional with: attachment path or paths, an optional stream: target, and an optional block that chunks are yielded to. It returns an LLM::Response, so use .content when you want the text directly:

require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm)

puts ctx.ask("Hello world").content
puts ctx.ask("Summarize this document.", with: "README.md").content
ctx.ask("Stream this reply.") { $stdout << _1 }

Protocols & Skills

MCP

The LLM::MCP object lets llm.rb use tools provided by an MCP server. Those tools are exposed through the same runtime as local tools, so you can pass them to either LLM::Context or LLM::Agent. In this example, the MCP server runs over stdio and LLM::Context uses the same tool loop as local tools. For stdio, mcp.session is the preferred pattern because it keeps one MCP session alive across discovery and tool calls:

require "llm"

llm = LLM.openai(key: ENV["KEY"])
mcp = LLM::MCP.stdio(argv: ["ruby", "server.rb"])

mcp.session do
  ctx = LLM::Context.new(llm, stream: $stdout, tools: mcp.tools)
  ctx.talk "Use the available tools to inspect the environment."
  ctx.talk(ctx.wait(:call)) while ctx.functions?
end
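
MCP is built on JSON-RPC 2.0, and the stdio transport exchanges newline-delimited JSON messages over the server's stdin and stdout. As a rough illustration of what travels over that pipe after the initialize handshake, the sketch below builds a discovery request and a tool call by hand. The rpc_request helper and the read-file tool name are assumptions; in practice LLM::MCP handles this framing for you:

```ruby
require "json"

# Hypothetical helper: one JSON-RPC 2.0 request, serialized as a single line.
def rpc_request(id, method, params = {})
  JSON.generate({jsonrpc: "2.0", id: id, method: method, params: params}) + "\n"
end

# Ask the server which tools it offers, then invoke one of them.
discover = rpc_request(1, "tools/list")
call = rpc_request(2, "tools/call", {name: "read-file", arguments: {path: "README.md"}})

print discover
print call
```

Keeping one session open means the handshake and discovery happen once, and later tool calls reuse the same server process.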

A2A (Agent2Agent)

The LLM::A2A object lets llm.rb use skills provided by a remote A2A agent. Those skills are exposed through the same runtime as local tools, so you can pass them to either LLM::Context or LLM::Agent.

Use remote skills as local tools:

require "llm"

a2a = LLM::A2A.rest(
  url: "https://remote-agent.example.com",
  headers: {"Authorization" => "Bearer token"}
)
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, tools: a2a.skills)
ctx.talk "Analyze this CSV and summarize the trends."
ctx.talk(ctx.wait(:call)) while ctx.functions?

Use persistent HTTP connections:

require "llm"

a2a = LLM::A2A.rest(
  url: "https://remote-agent.example.com",
  transport: LLM::Transport.net_http_persistent
)

For more on direct messaging, task operations, push notification configs, and JSON-RPC, see the LLM::A2A API docs.

Skills

Skills are reusable instructions loaded from a SKILL.md directory. They let you package behavior and tool access together, and they plug into the same runtime as tools, agents, MCP, and A2A. When a skill runs, llm.rb spawns a subagent with the skill instructions, access to only the tools listed in the skill, and recent conversation context:

---
name: release
description: Prepare a release
tools: ["search-docs", "git"]
---

## Task

Review the release state, summarize what changed, and prepare the release.

require "llm"

class ReleaseAgent < LLM::Agent
  model "gpt-5.4-mini"
  skills "./skills/release"
end

llm = LLM.openai(key: ENV["KEY"])
ReleaseAgent.new(llm, stream: $stdout).talk("Prepare the next release.")

A skill can also have its subagent inherit the parent's tools through the inherit directive. The directive covers "classic" tools (subclasses of LLM::Tool), MCP tools, and A2A tools that the parent context or agent has access to:

---
name: release
description: Prepare a release
tools: inherit
---

Advanced Usage

Streaming

By default, talk returns the full response when it's done. Streaming lets you see output as it arrives — useful for chatbots, CLI tools, and any interface that should feel responsive.

The simplest form is to pass any object that implements #<< ($stdout, StringIO, a file, etc.) as the stream: option:

require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout)
ctx.talk("Explain TCP keepalive in one paragraph.")
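
Because the only requirement is #<<, a stream target can be a small object of your own. The sketch below is a hypothetical TeeStream that forwards each chunk to several targets at once, for example the terminal and a transcript buffer. The class name is an assumption, and the chunks are fed by hand here to stand in for a model response:

```ruby
require "stringio"

# Hypothetical stream target: forwards every chunk to all wrapped targets.
class TeeStream
  def initialize(*targets)
    @targets = targets
  end

  # Return self so calls can be chained, like IO#<<.
  def <<(chunk)
    @targets.each { |target| target << chunk }
    self
  end
end

screen = StringIO.new
transcript = StringIO.new
stream = TeeStream.new(screen, transcript)

# Stand-in for chunks arriving from a model.
["TCP ", "keepalive ", "probes idle connections."].each { |chunk| stream << chunk }
```

With a real provider, an object like this should be usable as LLM::Context.new(llm, stream: TeeStream.new($stdout, transcript)), since it satisfies the same #<< contract as $stdout.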

Structured streaming with LLM::Stream

Subclass LLM::Stream when you need structured callbacks — separate hooks for content, reasoning, and tool lifecycle events. This is useful when streaming is part of your control flow, not just presentation:

require "llm"

class Stream < LLM::Stream
  def on_content(content)
    $stdout << content
  end

  def on_tool_call(tool, error)
    return queue << error if error
    queue << ctx.spawn(tool, :thread)
  end

  def on_tool_return(tool, result)
    $stdout << "Finished #{tool.name}\n"
  end
end

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: Stream.new, tools: [ReadFile])
ctx.talk("Read README.md and summarize it.")
ctx.talk(ctx.wait(:call)) while ctx.functions?

The stream queues tool work as it arrives. When you call wait, it drains the queue — so tools can be resolved while the model is still generating its response.
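
The queue-then-drain idea can be sketched without llm.rb at all. In the stand-in below, each tool call is started on a thread the moment it is seen and remembered in a queue; a later drain step waits for the threads and collects their results. WorkQueue, enqueue, and drain are hypothetical names, and the real LLM::Stream queue may differ in its details:

```ruby
# Hypothetical stand-in for a stream's work queue; not llm.rb's implementation.
class WorkQueue
  def initialize
    @pending = []
  end

  # Start the work immediately and remember the thread.
  def enqueue(name, &work)
    @pending << Thread.new { [name, work.call] }
    self
  end

  # Block until all queued work has finished, then empty the queue.
  def drain
    results = @pending.map(&:value)
    @pending.clear
    results.to_h
  end
end

queue = WorkQueue.new
# Stand-ins for tool calls arriving while a response is still streaming.
queue.enqueue("read-file") { "README contents" }
queue.enqueue("word-count") { "README contents".split.size }
results = queue.drain
```

Because the threads start in enqueue rather than in drain, slow tools overlap with whatever else is happening until the results are actually needed.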

Requiring tool confirmation

Some tools are destructive — deleting files, running shell commands, sending emails. You can require confirmation before they execute by declaring them with confirm in your agent.

When the model calls a confirmed tool, llm.rb runs on_tool_confirmation instead of executing it immediately. Your callback inspects the arguments and decides: approve it with fn.spawn(strategy).wait, or reject it with fn.cancel(reason:):

require "llm"

class Agent < LLM::Agent
  tools DeleteFile
  confirm "delete-file"

  def on_tool_confirmation(fn, strategy)
    # to_s guards against a missing path argument (nil)
    path = (fn.arguments["path"] || fn.arguments[:path]).to_s
    if path.start_with?("/tmp/")
      fn.spawn(strategy).wait
    else
      fn.cancel(reason: "Deletion requires approval")
    end
  end
end

llm = LLM.openai(key: ENV["KEY"])
Agent.new(llm, stream: $stdout).talk("Delete /tmp/example.txt.")

The callback must always return an LLM::Function::Return, either the tool result or a cancellation message. This keeps the tool loop consistent regardless of whether the call was approved.
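
A prefix check on the raw argument can be bypassed with a path such as /tmp/../etc/passwd. One way to harden a confirmation callback is to normalize the path before comparing it. The helper below is a sketch under stated assumptions: the name allowed_path? and the /tmp root are hypothetical, it assumes a POSIX-style filesystem, and it does not resolve symlinks:

```ruby
# Hypothetical helper: true only when `path` normalizes to a location
# strictly inside `root`. Symlinks are not resolved.
def allowed_path?(path, root: "/tmp")
  return false if path.nil? || path.to_s.empty?
  expanded = File.expand_path(path.to_s, root)
  expanded.start_with?(root.chomp("/") + "/")
end

allowed_path?("/tmp/example.txt")    # accepted: stays inside /tmp
allowed_path?("/tmp/../etc/passwd")  # rejected: normalizes to /etc/passwd
allowed_path?(nil)                   # rejected: no path given
```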

Running tools concurrently

When a model calls multiple tools in one turn, you can resolve them in parallel instead of one at a time. This reduces latency when tools are independent — for example, fetching weather and news simultaneously.

On LLM::Agent, set concurrency to your preferred strategy:

:call — sequential (default for Context)
:thread — parallel with Ruby threads (I/O-bound work)
:task — parallel with the async gem
:fiber — concurrent with Fiber.schedule
:ractor — parallel with ractors (CPU-bound, experimental)
:fork — parallel with forks (requires xchan.rb)

require "llm"

class Agent < LLM::Agent
  model "gpt-5.4-mini"
  tools ReadFile
  concurrency :thread
end

llm = LLM.openai(key: ENV["KEY"])
agent = Agent.new(llm, stream: $stdout)
agent.talk "Read README.md and CHANGELOG.md and compare them."

At the Context level, use ctx.wait(:thread), ctx.wait(:fiber), etc. to control concurrency per turn instead of setting a fixed strategy on the agent.

Saving and restoring context

LLM::Context holds all conversation state — messages, tool returns, usage. You can serialize this to JSON and restore it later, which makes it straightforward to persist conversations across requests, retries, or process restarts.

require "llm"

llm = LLM.openai(key: ENV["KEY"])

# Build up state
ctx1 = LLM::Context.new(llm)
ctx1.talk "Remember that my favorite language is Ruby"
string = ctx1.to_json

# Later, in another process or request
ctx2 = LLM::Context.new(llm, stream: $stdout)
ctx2.restore(string:)
ctx2.talk "What is my favorite language?"

The built-in ActiveRecord and Sequel plugins (covered below) use this same serialization under the hood. You can also save to a file with ctx.save(path:) and restore with ctx.restore(path:).
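
The save-and-restore cycle boils down to a JSON round trip. The sketch below mimics it with a plain array of messages and a temporary file, so it runs without a provider. The message shape shown here is an assumption made for illustration and is not llm.rb's actual serialized format:

```ruby
require "json"
require "tempfile"

# Stand-in for conversation state; the real serialized format may differ.
messages = [
  {"role" => "user", "content" => "Remember that my favorite language is Ruby"},
  {"role" => "assistant", "content" => "Noted."}
]

restored = Tempfile.create("context") do |file|
  # Save: write the state as JSON.
  file.write(JSON.generate({"messages" => messages}))
  file.flush
  # Restore: read it back, as a later process or request would.
  JSON.parse(File.read(file.path)).fetch("messages")
end
```

Since the state is plain JSON, it can live anywhere a string can: a file, a database column, or a cache entry.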

Providers & Schemas

Choosing a provider

llm.rb supports multiple LLM providers behind a single API surface, so you can swap backends without changing your application code. Providers are loaded lazily, and a provider's code is not required until you call one of the constructors below:

OpenAI — LLM.openai(key:)
Anthropic — LLM.anthropic(key:)
Google Gemini — LLM.google(key:)
DeepSeek — LLM.deepseek(key:)
xAI — LLM.xai(key:)
Z.ai — LLM.zai(key:)
Ollama — LLM.ollama
llama.cpp — LLM.llamacpp
AWS Bedrock — LLM.bedrock(key:)

Using OpenAI-compatible endpoints

Many providers (DeepSeek, DeepInfra, OpenRouter) expose an OpenAI-compatible API. Override host: to point anywhere. Some providers also need base_path: when the API lives under a non-standard prefix:

llm = LLM.openai(
  key: ENV["DEEPSEEK_KEY"],
  host: "api.deepseek.com"
)

llm = LLM.openai(
  key: ENV["DEEPINFRA_TOKEN"],
  host: "api.deepinfra.com",
  base_path: "/v1/openai"
)
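
To see how host: and base_path: affect the final request URL, the sketch below composes one with Ruby's URI library. It assumes the conventional OpenAI-compatible layout, where chat requests go to chat/completions under a base path that defaults to /v1. The endpoint_url helper is hypothetical, and llm.rb's internal path handling may differ:

```ruby
require "uri"

# Hypothetical helper: joins a host, a base path, and an endpoint.
def endpoint_url(host:, base_path: "/v1", endpoint: "chat/completions")
  parts = [base_path, endpoint].map { |part| part.delete_prefix("/").delete_suffix("/") }
  URI::HTTPS.build(host: host, path: "/" + parts.join("/")).to_s
end

puts endpoint_url(host: "api.deepseek.com")
puts endpoint_url(host: "api.deepinfra.com", base_path: "/v1/openai")
```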

Using the Responses API

OpenAI's Responses API is available through mode: :responses. With store: false, requests go to the Responses endpoint but no state is kept server-side. With store: true, OpenAI keeps response state server-side:

require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, mode: :responses, store: false)

ctx.talk("Your task is to answer questions", role: :developer)
res = ctx.talk("What is the capital of France?")
puts res.content

Getting structured output

When you need the model to return structured data, such as JSON with specific fields, pass an LLM::Schema. The schema defines the shape, and llm.rb adapts it to your provider's structured-output format.

Class-based schemas

Define the schema as a class with property declarations. This works well for reusable schemas that appear across your app:

class Report < LLM::Schema
  property :category, Enum["performance", "security", "outage"], required: true
  property :summary, String, "Short summary", required: true
  property :services, Array[String], "Impacted services", required: true
end

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, schema: Report)
res = ctx.talk("Structure this report: 'API latency spiked.'")
puts res.content!
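
For orientation, the Report class above describes roughly the JSON Schema shown below, written here as a plain Ruby hash. This is an illustrative assumption about the shape, not the exact document llm.rb sends to a provider. The conforms? helper is hypothetical and deliberately small; it is not a full JSON Schema validator:

```ruby
require "json"

REPORT_SCHEMA = {
  "type" => "object",
  "properties" => {
    "category" => {"type" => "string", "enum" => %w[performance security outage]},
    "summary" => {"type" => "string", "description" => "Short summary"},
    "services" => {"type" => "array", "items" => {"type" => "string"}, "description" => "Impacted services"}
  },
  "required" => %w[category summary services]
}.freeze

# Hypothetical check: required keys are present and category is within the enum.
def conforms?(data, schema)
  return false unless schema["required"].all? { |key| data.key?(key) }
  schema.dig("properties", "category", "enum").include?(data["category"])
end

# Stand-in for a structured reply from the model.
reply = JSON.parse('{"category":"performance","summary":"API latency spiked.","services":["api"]}')
valid = conforms?(reply, REPORT_SCHEMA)
```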

Fluent schemas

For one-off schemas or dynamic shapes, build the schema inline without a class:

schema = LLM::Schema.new.object(
  category: LLM::Schema.new.string.enum("a", "b", "c").required,
  summary: LLM::Schema.new.string.required
)

ctx = LLM::Context.new(llm, schema:)
res = ctx.talk("Classify this.")

Working with model reasoning

Some models (DeepSeek Reasoner, OpenAI o-series) show their thinking process separately from the visible answer. llm.rb exposes this in two ways, depending on whether you're streaming or reading the final response.

Streaming reasoning output

Use on_reasoning_content on your stream subclass to capture reasoning as it arrives — useful for surfacing it in a separate panel or log:

class Stream < LLM::Stream
  def on_content(content)
    $stdout << content
  end

  def on_reasoning_content(content)
    $stderr << content
  end
end

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm,
  model: "gpt-5.4-mini",
  mode: :responses,
  reasoning: { effort: "medium" },
  stream: Stream.new)
ctx.talk("Solve 17 * 19 and show your work.")

Reading reasoning from the response

When the provider includes reasoning in the final response, read it from res.reasoning_content after the call completes. This is useful for offline analysis or debugging:

llm = LLM.deepseek(key: ENV["KEY"])
ctx = LLM::Context.new(llm, model: "deepseek-reasoner")
res = ctx.talk("Solve 17 * 19.")
puts res.content
puts res.reasoning_content

Managing token budgets with compaction

Long-running conversations accumulate tokens. When you're approaching the model's context window, compaction summarizes older messages into a single message — keeping the conversation going without losing important context.

Configure compaction with token_threshold: (a percentage of the context window or a fixed token count) and retention_window: (how many recent messages to keep verbatim). Attach a stream to observe when compaction happens:

require "llm"

class Stream < LLM::Stream
  def on_compaction(ctx, compactor)
    puts "Compacting #{ctx.messages.size} messages..."
  end

  def on_compaction_finish(ctx, compactor)
    puts "Compacted to #{ctx.messages.size} messages."
  end
end

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(
  llm,
  stream: Stream.new,
  compactor: {
    token_threshold: "90%",
    retention_window: 8,
    model: "gpt-5.4-mini"
  }
)
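
The threshold and retention settings can be illustrated with a little arithmetic. The sketch below is an assumption about how such settings could be interpreted, not llm.rb's compactor: token_limit resolves a percentage string or a fixed integer into a token count, and split_for_compaction separates the messages to summarize from those kept verbatim. The 128,000-token window is a made-up figure:

```ruby
# Hypothetical helper: resolve "90%" or a fixed Integer into a token count.
def token_limit(threshold, context_window:)
  case threshold
  when /\A(\d+(?:\.\d+)?)%\z/
    (context_window * Regexp.last_match(1).to_f / 100).floor
  when Integer
    threshold
  else
    raise ArgumentError, "unsupported threshold: #{threshold.inspect}"
  end
end

# Hypothetical helper: older messages get summarized, recent ones are kept.
def split_for_compaction(messages, retention_window:)
  keep = [retention_window, messages.size].min
  [messages[0...(messages.size - keep)], messages.last(keep)]
end

limit = token_limit("90%", context_window: 128_000)
older, recent = split_for_compaction((1..20).to_a, retention_window: 8)
# older holds messages 1 through 12; recent holds messages 13 through 20.
```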

"90%" means compaction triggers when total token usage exceeds 90% of the model's context window. You can also use a fixed integer for token_threshold: or set message_threshold: to compact based on message count.

Persistence

Persisting contexts with ActiveRecord

llm.rb has built-in ActiveRecord support through acts_as_llm and acts_as_agent. Both store the serialized context in a single data column, so your database table doesn't need provider or model columns. Instead, you supply the provider and the context defaults through hook methods.

acts_as_llm (wraps LLM::Context)

Use acts_as_llm when you want full control over tool execution. The provider: hook resolves a provider instance, and context: injects defaults like model and mode:

require "llm"
require "active_record"
require "llm/active_record"

class Context < ApplicationRecord
  acts_as_llm provider: :set_provider, context: :set_context

  private

  def set_provider
    LLM.openai(key: ENV["OPENAI_SECRET"])
  end

  def set_context
    {model: "gpt-5.4-mini", mode: :responses, store: false}
  end
end

ctx = Context.create!
ctx.talk("Remember that my favorite language is Ruby")
puts ctx.talk("What is my favorite language?").content

acts_as_agent (wraps LLM::Agent)

Use acts_as_agent when you want automatic tool-loop execution with agent DSL features like instructions, tools, and concurrency:

require "llm"
require "active_record"
require "llm/active_record"

class Ticket < ApplicationRecord
  acts_as_agent provider: :set_provider, context: :set_context
  model "gpt-5.4-mini"
  instructions "You are a concise support assistant."
  tools SearchDocs, Escalate
  concurrency :thread

  private

  def set_provider
    LLM.openai(key: ENV["OPENAI_SECRET"])
  end

  def set_context
    {mode: :responses, store: false}
  end
end

ticket = Ticket.create!
puts ticket.talk("How do I rotate my API key?").content

Both support format: :jsonb on PostgreSQL for native JSON column storage, and tracer: for attaching observability to the provider.

Persisting contexts with Sequel

Sequel support works the same way through plugin :llm. The provider:, context:, and tracer: hooks use the same contract as ActiveRecord:

require "llm"
require "net/http/persistent"
require "sequel"
require "sequel/plugins/llm"

class Context < Sequel::Model
  plugin :llm, provider: :set_provider, context: :set_context

  private

  def set_provider
    LLM.openai(key: ENV["OPENAI_SECRET"], persistent: true)
  end

  def set_context
    {model: "gpt-5.4-mini", mode: :responses, store: false}
  end
end

ctx = Context.create
ctx.talk("Remember that my favorite language is Ruby")
puts ctx.talk("What is my favorite language?").content

Sequel also supports plugin :agent for the agent wrapper. On PostgreSQL, use format: :jsonb for native JSON columns.

Production & Operations

Sending images, audio, and files

llm.rb supports multimodal prompts that combine text with images, audio, or files. Build a prompt as an array, mixing text with helpers like ctx.image_url or ctx.local_file.

Image from a URL

ctx.talk ["Describe this image",
  ctx.image_url("https://example.com/cat.jpg")]

Local file attachment

Use ctx.ask(..., with:) to attach a file directly. The provider adapter handles the format — no need to upload first:

puts ctx.ask("Summarize this document.", with: "README.md").content

Generating audio

res = llm.audio.create_speech(input: "Hello world")
IO.copy_stream res.audio, "hello.mp3"

Generating images

res = llm.images.create(prompt: "a dog on a rocket to the moon")
IO.copy_stream res.images[0], "dogonrocket.png"

Observing requests and tool calls

Tracing is attached at the provider level. Once assigned, all requests and tool calls through that provider are instrumented — no per-context setup needed.

Logger tracer

Print request details to a stream. Useful for development and debugging:

llm.tracer = LLM::Tracer::Logger.new(llm, io: $stdout)
ctx = LLM::Context.new(llm)
ctx.talk("Hello")

Telemetry tracer

Collect spans programmatically. Useful for custom dashboards or integrating with APM tools:

llm.tracer = LLM::Tracer::Telemetry.new(llm)
ctx = LLM::Context.new(llm)
ctx.talk("Hello")
pp llm.tracer.spans

LangSmith tracer

Send traces to LangSmith for team-wide observability:

llm.tracer = LLM::Tracer::Langsmith.new(llm,
  metadata: { env: "dev" }, tags: ["chatbot"])

Production patterns

These are small switches that make llm.rb production-ready — not a second framework. Providers are meant to be shared, contexts stay isolated, and performance or cost controls layer onto the same core objects.

Thread safety

Share provider instances across threads. Create one context per thread or request — that's where the mutable state lives:

llm = LLM.openai(key: ENV["KEY"])

t1 = Thread.new do
  ctx = LLM::Context.new(llm)
  ctx.talk("Hello from thread 1")
end

t2 = Thread.new do
  ctx = LLM::Context.new(llm)
  ctx.talk("Hello from thread 2")
end

# Wait for both requests to finish before the program exits
[t1, t2].each(&:join)

Performance

Swap to a faster JSON backend (:oj) and enable persistent HTTP connections when request volume grows:

LLM.json = :oj
llm = LLM.openai(key: ENV["KEY"], persistent: true)

Cost tracking

Each context tracks its own usage. ctx.cost returns the estimated spend so far:

ctx = LLM::Context.new(llm)
ctx.talk("Hello")
puts "Estimated cost: $#{ctx.cost}"
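
Cost estimation comes down to token counts multiplied by per-token prices. The sketch below shows the arithmetic with prices quoted per million tokens. The figures and the estimate_cost helper are assumptions for illustration only; real prices vary by model and provider:

```ruby
# Hypothetical helper: prices are given in dollars per million tokens.
# Rational keeps the arithmetic exact until the value is formatted.
def estimate_cost(input_tokens:, output_tokens:, input_price:, output_price:)
  total = input_tokens * Rational(input_price) + output_tokens * Rational(output_price)
  total / 1_000_000
end

cost = estimate_cost(
  input_tokens: 12_000, output_tokens: 3_000,
  input_price: "0.50", output_price: "2.00" # made-up prices
)
puts format("Estimated cost: $%.4f", cost.to_f)
```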

Model registry

The local registry provides model capabilities, pricing, and limits without API calls — useful for making local decisions about model selection:

registry = LLM.registry_for(:openai)
info = registry.limit(model: "gpt-4.1")
puts "Context window: #{info.context} tokens"

Examples

Building a REPL

A simple interactive loop using LLM::Context directly. Each line of input becomes a turn, and streaming sends output to the terminal as it arrives:

require "llm"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout)

loop do
  print "> "
  ctx.talk(STDIN.gets || break)
  puts
end

Cancelling a request mid-stream

Need to stop a long generation? ctx.interrupt! cancels the active provider request and notifies any running tools through on_interrupt. Useful in web apps where one message may need to cancel an earlier in-flight request:

require "llm"
require "io/console"

llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout)
worker = Thread.new do
  ctx.talk("Write a very long essay about network protocols.")
rescue LLM::Interrupt
  puts "Request was interrupted!"
end

STDIN.getch
ctx.interrupt!
worker.join

More resources

The deepdive guide covers every feature with runnable examples — providers, tools, agents, schemas, MCP, A2A, compaction, tracing, and production patterns.

Relay

Relay is a production application built on llm.rb with context management, tool composition, concurrent workflows, and cost tracking.

Robert

Robert is a 2 MB native FreeBSD AI assistant built with mruby-llm.

rails-llm

rails-llm provides a Rails engine with chat UI, generators, and acts_as_agent.