llm.rb
Ruby's most capable AI runtime. Zero external dependencies. Runs on Ruby's standard library. Supports OpenAI, Anthropic, Google Gemini, DeepSeek, Ollama, and more.
gem install llm.rb
About
llm.rb is Ruby's most capable AI runtime.
It runs on Ruby's standard library by default, loads optional pieces only when needed, and offers a single runtime for providers, agents, tools, skills, MCP, A2A (Agent2Agent), RAG (vector stores & embeddings), streaming, files, and persisted state. As a bonus, llm.rb is also available for mruby and has first-class Rails support via a separate gem.
It supports OpenAI, OpenAI-compatible endpoints, Anthropic, Google Gemini, DeepSeek, xAI, Z.ai, AWS Bedrock, Ollama, and llama.cpp. It also includes built-in ActiveRecord and Sequel support, plus concurrent tool execution through threads, tasks (via async gem), fibers, ractors, and fork (via xchan.rb gem).
LLM::Context
The LLM::Context object is at the heart of the runtime. Almost all other features build on top of it. It is a low-level interface to a model, and requires tool execution to be managed manually. The LLM::Agent class is almost the same as LLM::Context but it manages tool execution for you — we'll cover agents next:
require "llm"
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout)
ctx.talk "Hello world"
LLM::Agent
The LLM::Agent object is implemented on top of LLM::Context. It provides the same interface, but manages tool execution for you. It also has built-in safeguards: a loop guard that detects repeated tool call patterns, and another guard that detects infinite tool call loops. Both guards advise the model to change course rather than raise an error:
require "llm"
llm = LLM.openai(key: ENV["KEY"])
agent = LLM::Agent.new(llm, stream: $stdout)
agent.talk "Hello world"
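To make the idea of a repeated-call guard concrete, here is a minimal sketch of the general technique in plain Ruby. It is not llm.rb's implementation: the class name RepeatGuard, the limit: option, and the rule "same tool with the same arguments N times in a row" are all assumptions chosen for illustration.

```ruby
# Hypothetical sketch of a repeated-tool-call guard (not llm.rb's code).
# It trips when the same tool is called with the same arguments
# `limit` times in a row.
class RepeatGuard
  def initialize(limit: 3)
    @limit = limit
    @last = nil
    @streak = 0
  end

  # Record one tool call; returns true when the guard trips.
  def record(name, arguments)
    call = [name, arguments]
    @streak = (call == @last) ? @streak + 1 : 1
    @last = call
    tripped?
  end

  def tripped?
    @streak >= @limit
  end
end

guard = RepeatGuard.new(limit: 3)
results = 3.times.map { guard.record("read-file", {path: "README.md"}) }
p results # => [false, false, true]
```

A guard like this would typically respond by sending the model a short advisory message instead of raising, which matches the behavior described above.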
Tools
The LLM::Tool class can be subclassed to implement your own tools that can extend the abilities of a model:
class ReadFile < LLM::Tool
  name "read-file"
  description "Read a file"
  parameter :path, String, "The filename or path"
  required %i[path]
  def call(path:)
    {contents: File.read(path)}
  end
end
ask
LLM::Context also provides ask, a convenience interface that is compatible with RubyLLM's ask method. It accepts a prompt, an optional with: attachment path or paths, an optional stream: target, and an optional block that chunks are yielded to. It returns an LLM::Response, so use .content when you want the text directly:
require "llm"
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm)
puts ctx.ask("Hello world").content
puts ctx.ask("Summarize this document.", with: "README.md").content
ctx.ask("Stream this reply.") { $stdout << _1 }
MCP
The LLM::MCP object lets llm.rb use tools provided by an MCP server. Those tools are exposed through the same runtime as local tools, so you can pass them to either LLM::Context or LLM::Agent.
In this example, the MCP server runs over stdio and LLM::Context uses the same tool loop as local tools. For stdio, mcp.session is the preferred pattern because it keeps one MCP session alive across discovery and tool calls:
require "llm"
llm = LLM.openai(key: ENV["KEY"])
mcp = LLM::MCP.stdio(argv: ["ruby", "server.rb"])
mcp.session do
  ctx = LLM::Context.new(llm, stream: $stdout, tools: mcp.tools)
  ctx.talk "Use the available tools to inspect the environment."
  ctx.talk(ctx.wait(:call)) while ctx.functions?
end
A2A (Agent2Agent)
The LLM::A2A object lets llm.rb use skills provided by a remote A2A agent. Those skills are exposed through the same runtime as local tools, so you can pass them to either LLM::Context or LLM::Agent.
Use remote skills as local tools:
require "llm"
a2a = LLM::A2A.rest(
  url: "https://remote-agent.example.com",
  headers: {"Authorization" => "Bearer token"}
)
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, tools: a2a.skills)
ctx.talk "Analyze this CSV and summarize the trends."
ctx.talk(ctx.wait(:call)) while ctx.functions?
Use persistent HTTP connections:
require "llm"
a2a = LLM::A2A.rest(
  url: "https://remote-agent.example.com",
  transport: LLM::Transport.net_http_persistent
)
For more on direct messaging, task operations, push notification configs, and JSON-RPC, see the LLM::A2A API docs.
Skills
Skills are reusable instructions loaded from a SKILL.md directory. They let
you package behavior and tool access together, and they plug into the
same runtime as tools, agents, MCP, and A2A. When a skill runs, llm.rb
spawns a subagent with the skill instructions, access to only the tools
listed in the skill, and recent conversation context:
---
name: release
description: Prepare a release
tools: ["search-docs", "git"]
---
## Task
Review the release state, summarize what changed, and prepare the release.
require "llm"
class ReleaseAgent < LLM::Agent
  model "gpt-5.4-mini"
  skills "./skills/release"
end
llm = LLM.openai(key: ENV["KEY"])
ReleaseAgent.new(llm, stream: $stdout).talk("Prepare the next release.")
A skill can also have its subagent inherit the parent's tools through the
inherit directive. The inherit directive covers the "classic"
tools (subclasses of LLM::Tool),
MCP tools, and A2A tools that a parent context or agent has access to:
---
name: release
description: Prepare a release
tools: inherit
---
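For readers who want to inspect a skill file outside the runtime, the front matter shown above is ordinary YAML between two --- lines. The following is a minimal sketch using only the standard library. The helper name parse_skill is hypothetical, and this is not how llm.rb itself loads skills; it only shows the shape of the data.

```ruby
require "yaml"

# Hypothetical helper: split a SKILL.md string into front matter and body.
def parse_skill(text)
  match = text.match(/\A---\s*\n(.*?)\n---\s*\n?(.*)\z/m)
  raise ArgumentError, "missing front matter" unless match
  {meta: YAML.safe_load(match[1]), body: match[2].strip}
end

skill = parse_skill(<<~MD)
  ---
  name: release
  description: Prepare a release
  tools: ["search-docs", "git"]
  ---
  ## Task
  Review the release state, summarize what changed, and prepare the release.
MD

p skill[:meta]["name"]  # => "release"
p skill[:meta]["tools"] # => ["search-docs", "git"]
```

With tools: inherit, the same helper would return the plain string "inherit" for the tools key, so a loader has to treat that value as a directive rather than a list.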
Streaming
By default, talk returns the full response when it's done.
Streaming lets you see output as it arrives — useful for chatbots,
CLI tools, and any interface that should feel responsive.
The simplest form is to pass any object that implements #<<
($stdout, StringIO, a file, etc.) as the stream: option:
require "llm"
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout)
ctx.talk("Explain TCP keepalive in one paragraph.")
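Because the only requirement is a #<< method, a custom sink is easy to write. The sketch below defines a hypothetical ChunkCollector class and feeds it chunks by hand to simulate a streamed reply. It makes no network call, and the class name and its text method are assumptions for illustration.

```ruby
# Hypothetical sink: anything that responds to #<< can receive chunks.
class ChunkCollector
  attr_reader :chunks

  def initialize
    @chunks = []
  end

  # Called once per streamed chunk; returns self so calls can be chained.
  def <<(chunk)
    @chunks << chunk.to_s
    self
  end

  # The full text received so far.
  def text
    @chunks.join
  end
end

sink = ChunkCollector.new
# Simulate what a streaming response would do with the sink.
["TCP ", "keepalive ", "probes idle connections."].each { |chunk| sink << chunk }
puts sink.text
```

Going by the description above, an object like this could be passed as stream: sink in place of $stdout, which is handy when you want to both display and record the output.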
Structured streaming with LLM::Stream
Subclass LLM::Stream when you need structured callbacks
— separate hooks for content, reasoning, and tool lifecycle events.
This is useful when streaming is part of your control flow, not just
presentation:
require "llm"
class Stream < LLM::Stream
  def on_content(content)
    $stdout << content
  end
  def on_tool_call(tool, error)
    return queue << error if error
    queue << ctx.spawn(tool, :thread)
  end
  def on_tool_return(tool, result)
    $stdout << "Finished #{tool.name}\n"
  end
end
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: Stream.new, tools: [ReadFile])
ctx.talk("Read README.md and summarize it.")
ctx.talk(ctx.wait(:call)) while ctx.functions?
The stream queues tool work as it arrives. When you call wait,
it drains the queue — so tools can be resolved while the model is still
generating its response.
Requiring tool confirmation
Some tools are destructive — deleting files, running shell commands,
sending emails. You can require confirmation before they execute
by declaring them with confirm in your agent.
When the model calls a confirmed tool, llm.rb runs
on_tool_confirmation instead of executing it immediately.
Your callback inspects the arguments and decides: approve it with
fn.spawn(strategy).wait, or reject it with
fn.cancel(reason:):
require "llm"
class Agent < LLM::Agent
  tools DeleteFile
  confirm "delete-file"
  def on_tool_confirmation(fn, strategy)
    path = fn.arguments["path"] || fn.arguments[:path]
    if path.start_with?("/tmp/")
      fn.spawn(strategy).wait
    else
      fn.cancel(reason: "Deletion requires approval")
    end
  end
end
llm = LLM.openai(key: ENV["KEY"])
Agent.new(llm, stream: $stdout).talk("Delete /tmp/example.txt.")
The callback must always return an LLM::Function::Return:
either the tool result or a cancellation message. This keeps the
tool loop consistent regardless of whether the call was approved.
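One caveat about the approval check in the example: a bare start_with?("/tmp/") test also accepts paths such as /tmp/../etc/passwd. A slightly stricter policy normalizes the path first. The helper below is a hypothetical sketch (the name inside_dir? is not part of llm.rb) and assumes POSIX-style paths.

```ruby
# Hypothetical helper: true only when `path` resolves to a location under `root`.
# File.expand_path collapses ".." segments but does not resolve symlinks.
def inside_dir?(path, root = "/tmp")
  resolved = File.expand_path(path.to_s)
  resolved.start_with?(File.join(File.expand_path(root), ""))
end

p inside_dir?("/tmp/example.txt")   # => true
p inside_dir?("/tmp/../etc/passwd") # => false
p inside_dir?("/tmpfoo/example.txt") # => false
```

If symlinks are a concern, File.realpath can be used instead for paths that already exist, at the cost of raising when the file is missing.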
Running tools concurrently
When a model calls multiple tools in one turn, you can resolve them in parallel instead of one at a time. This reduces latency when tools are independent — for example, fetching weather and news simultaneously.
On LLM::Agent, set concurrency to your
preferred strategy:
- :call - sequential (default for Context)
- :thread - parallel with Ruby threads (I/O-bound work)
- :task - parallel with the async gem
- :fiber - concurrent with Fiber.schedule
- :ractor - parallel with ractors (CPU-bound, experimental)
- :fork - parallel with forks (requires xchan.rb)
require "llm"
class Agent < LLM::Agent
  model "gpt-5.4-mini"
  tools ReadFile
  concurrency :thread
end
llm = LLM.openai(key: ENV["KEY"])
agent = Agent.new(llm, stream: $stdout)
agent.talk "Read README.md and CHANGELOG.md and compare them."
At the Context level, use ctx.wait(:thread),
ctx.wait(:fiber), etc. to control concurrency per turn
instead of setting a fixed strategy on the agent.
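The latency benefit of the :thread strategy for I/O-bound tools can be seen with plain Ruby threads, independent of llm.rb. In this sketch, fetch_weather and fetch_news are hypothetical stand-ins that sleep to simulate network calls; the point is only that two independent waits overlap.

```ruby
# Stand-ins for two independent, I/O-bound tools (hypothetical names).
def fetch_weather
  sleep 0.1 # simulate network latency
  {weather: "sunny"}
end

def fetch_news
  sleep 0.1 # simulate network latency
  {news: "nothing new"}
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
# Run both calls at once; Thread#value joins and returns the block's result.
results = [Thread.new { fetch_weather }, Thread.new { fetch_news }].map(&:value)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

p results
puts format("elapsed: %.2fs", elapsed)
```

Because sleeping (like most I/O) releases Ruby's global VM lock, the elapsed time should be close to the slowest call rather than the sum of both. CPU-bound work does not overlap this way under threads, which is why the list above points to ractors or forks for that case.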
Saving and restoring context
LLM::Context holds all conversation state — messages,
tool returns, usage. You can serialize this to JSON and restore it
later, which makes it straightforward to persist conversations across
requests, retries, or process restarts.
require "llm"
llm = LLM.openai(key: ENV["KEY"])
# Build up state
ctx1 = LLM::Context.new(llm)
ctx1.talk "Remember that my favorite language is Ruby"
string = ctx1.to_json
# Later, in another process or request
ctx2 = LLM::Context.new(llm, stream: $stdout)
ctx2.restore(string:)
ctx2.talk "What is my favorite language?"
The built-in ActiveRecord and Sequel plugins (covered below) use this
same serialization under the hood. You can also save to a file with
ctx.save(path:) and restore with
ctx.restore(path:).
Choosing a provider
llm.rb supports multiple LLM providers behind a single API surface. Providers are loaded lazily: a provider's code is not required until you call its constructor. This means you can swap backends without changing your application code:
- OpenAI - LLM.openai(key:)
- Anthropic - LLM.anthropic(key:)
- Google Gemini - LLM.google(key:)
- DeepSeek - LLM.deepseek(key:)
- xAI - LLM.xai(key:)
- Z.ai - LLM.zai(key:)
- Ollama - LLM.ollama
- llama.cpp - LLM.llamacpp
- AWS Bedrock - LLM.bedrock(key:)
Using OpenAI-compatible endpoints
Many providers (DeepSeek, DeepInfra, OpenRouter) expose an
OpenAI-compatible API. Override host: to point anywhere.
Some providers also need base_path: when the API lives
under a non-standard prefix:
llm = LLM.openai(
  key: ENV["DEEPSEEK_KEY"],
  host: "api.deepseek.com"
)
llm = LLM.openai(
  key: ENV["DEEPINFRA_TOKEN"],
  host: "api.deepinfra.com",
  base_path: "/v1/openai"
)
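To see why base_path: matters, it helps to picture the request URL that results from each configuration. The sketch below is an illustration only: the helper endpoint_url, the default "/v1" prefix, and the "/chat/completions" endpoint are assumptions based on the usual OpenAI-compatible layout, and this document does not show how llm.rb combines these values internally.

```ruby
require "uri"

# Hypothetical helper: join a host, a base path, and an endpoint into a URL.
# The "/v1" default and "/chat/completions" endpoint are assumptions.
def endpoint_url(host:, base_path: "/v1", endpoint: "/chat/completions")
  URI::HTTPS.build(host: host, path: File.join(base_path, endpoint)).to_s
end

puts endpoint_url(host: "api.deepseek.com")
# => https://api.deepseek.com/v1/chat/completions
puts endpoint_url(host: "api.deepinfra.com", base_path: "/v1/openai")
# => https://api.deepinfra.com/v1/openai/chat/completions
```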
Using the Responses API
OpenAI's Responses API is available through mode: :responses.
With store: false the endpoint stays stateless while using
the Responses endpoint. With store: true, OpenAI keeps
response state server-side:
require "llm"
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, mode: :responses, store: false)
ctx.talk("Your task is to answer questions", role: :developer)
res = ctx.talk("What is the capital of France?")
puts res.content
Getting structured output
When you need the model to return structured data — JSON with specific
fields — pass a LLM::Schema. The schema defines the shape,
and llm.rb adapts it to your provider's structured-output format.
Class-based schemas
Define the schema as a class with property declarations.
This works well for reusable schemas that appear across your app:
class Report < LLM::Schema
  property :category, Enum["performance", "security", "outage"], required: true
  property :summary, String, "Short summary", required: true
  property :services, Array[String], "Impacted services", required: true
end
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, schema: Report)
res = ctx.talk("Structure this report: 'API latency spiked.'")
puts res.content!
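Even with a schema in place, some applications re-check the parsed result before acting on it. The following sketch validates a hash against the same three fields as the Report schema using only the standard library. REPORT_FIELDS and invalid_fields are hypothetical names, and the sample JSON is made up for illustration rather than taken from a real response.

```ruby
require "json"

# Hypothetical shape mirroring the Report schema: field => validity check.
REPORT_FIELDS = {
  "category" => ->(v) { %w[performance security outage].include?(v) },
  "summary" => ->(v) { v.is_a?(String) },
  "services" => ->(v) { v.is_a?(Array) && v.all?(String) }
}.freeze

# Returns the names of fields that are missing or have the wrong shape.
def invalid_fields(report)
  REPORT_FIELDS.reject { |key, check| report.key?(key) && check.call(report[key]) }.keys
end

raw = '{"category":"performance","summary":"API latency spiked.","services":["api"]}'
p invalid_fields(JSON.parse(raw))                              # => []
p invalid_fields({"category" => "billing", "summary" => "x"}) # => ["category", "services"]
```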
Fluent schemas
For one-off schemas or dynamic shapes, build the schema inline without a class:
schema = LLM::Schema.new.object(
  category: LLM::Schema.new.string.enum("a", "b", "c").required,
  summary: LLM::Schema.new.string.required
)
ctx = LLM::Context.new(llm, schema:)
res = ctx.talk("Classify this.")
Working with model reasoning
Some models (DeepSeek Reasoner, OpenAI o-series) show their thinking process separately from the visible answer. llm.rb exposes this in two ways, depending on whether you're streaming or reading the final response.
Streaming reasoning output
Use on_reasoning_content on your stream subclass to
capture reasoning as it arrives — useful for surfacing it in a
separate panel or log:
class Stream < LLM::Stream
  def on_content(content)
    $stdout << content
  end
  def on_reasoning_content(content)
    $stderr << content
  end
end
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm,
  model: "gpt-5.4-mini",
  mode: :responses,
  reasoning: { effort: "medium" },
  stream: Stream.new)
ctx.talk("Solve 17 * 19 and show your work.")
Reading reasoning from the response
When the provider includes reasoning in the final response, read it
from res.reasoning_content after the call completes.
This is useful for offline analysis or debugging:
llm = LLM.deepseek(key: ENV["KEY"])
ctx = LLM::Context.new(llm, model: "deepseek-reasoner")
res = ctx.talk("Solve 17 * 19.")
puts res.content
puts res.reasoning_content
Managing token budgets with compaction
Long-running conversations accumulate tokens. When you're approaching the model's context window, compaction summarizes older messages into a single message — keeping the conversation going without losing important context.
Configure compaction with token_threshold: (a percentage
of the context window or a fixed token count) and
retention_window: (how many recent messages to keep
verbatim). Attach a stream to observe when compaction happens:
require "llm"
class Stream < LLM::Stream
  def on_compaction(ctx, compactor)
    puts "Compacting #{ctx.messages.size} messages..."
  end
  def on_compaction_finish(ctx, compactor)
    puts "Compacted to #{ctx.messages.size} messages."
  end
end
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(
  llm,
  stream: Stream.new,
  compactor: {
    token_threshold: "90%",
    retention_window: 8,
    model: "gpt-5.4-mini"
  }
)
"90%" means compaction triggers when total token usage
exceeds 90% of the model's context window. You can also use a fixed
integer for token_threshold: or set
message_threshold: to compact based on message count.
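The threshold rule is simple arithmetic, and a worked example makes the two forms of token_threshold: easier to compare. This sketch is an assumption about the rule as described, not llm.rb's code: threshold_tokens and compaction_due? are hypothetical helpers, and the 128,000-token window is an illustrative figure.

```ruby
# Hypothetical helper: resolve a threshold ("90%" or an Integer) to tokens.
def threshold_tokens(threshold, context_window)
  return threshold if threshold.is_a?(Integer)
  percent = Rational(threshold.delete_suffix("%"))
  (context_window * percent / 100).floor
end

# True once usage exceeds the resolved threshold.
def compaction_due?(used_tokens, threshold, context_window)
  used_tokens > threshold_tokens(threshold, context_window)
end

window = 128_000 # assumed context window, for illustration only
puts threshold_tokens("90%", window)          # => 115200
puts compaction_due?(115_200, "90%", window)  # => false
puts compaction_due?(115_201, "90%", window)  # => true
puts compaction_due?(60_000, 50_000, window)  # => true
```

Rational is used so that percentages are computed exactly instead of passing through floating point.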
Persisting contexts with ActiveRecord
llm.rb has built-in ActiveRecord support through acts_as_llm
and acts_as_agent. Both store the serialized context in a
single data column — your database table doesn't need
provider or model columns. Just wire them through hooks.
acts_as_llm (wraps LLM::Context)
Use acts_as_llm when you want full control over tool
execution. The provider: hook resolves a provider instance,
and context: injects defaults like model and mode:
require "llm"
require "active_record"
require "llm/active_record"
class Context < ApplicationRecord
  acts_as_llm provider: :set_provider, context: :set_context
  private
  def set_provider
    LLM.openai(key: ENV["OPENAI_SECRET"])
  end
  def set_context
    {model: "gpt-5.4-mini", mode: :responses, store: false}
  end
end
ctx = Context.create!
ctx.talk("Remember that my favorite language is Ruby")
puts ctx.talk("What is my favorite language?").content
acts_as_agent (wraps LLM::Agent)
Use acts_as_agent when you want automatic tool-loop
execution with agent DSL features like instructions, tools, and
concurrency:
require "llm"
require "active_record"
require "llm/active_record"
class Ticket < ApplicationRecord
  acts_as_agent provider: :set_provider, context: :set_context
  model "gpt-5.4-mini"
  instructions "You are a concise support assistant."
  tools SearchDocs, Escalate
  concurrency :thread
  private
  def set_provider
    LLM.openai(key: ENV["OPENAI_SECRET"])
  end
  def set_context
    {mode: :responses, store: false}
  end
end
ticket = Ticket.create!
puts ticket.talk("How do I rotate my API key?").content
Both support format: :jsonb on PostgreSQL for native JSON
column storage, and tracer: for attaching observability
to the provider.
Persisting contexts with Sequel
Sequel support works the same way through plugin :llm.
The provider:, context:, and
tracer: hooks use the same contract as ActiveRecord:
require "llm"
require "net/http/persistent"
require "sequel"
require "sequel/plugins/llm"
class Context < Sequel::Model
  plugin :llm, provider: :set_provider, context: :set_context
  private
  def set_provider
    LLM.openai(key: ENV["OPENAI_SECRET"], persistent: true)
  end
  def set_context
    {model: "gpt-5.4-mini", mode: :responses, store: false}
  end
end
ctx = Context.create
ctx.talk("Remember that my favorite language is Ruby")
puts ctx.talk("What is my favorite language?").content
Sequel also supports plugin :agent for the agent wrapper.
On PostgreSQL, use format: :jsonb for native JSON columns.
Sending images, audio, and files
llm.rb supports multimodal prompts — text plus images, audio, or
files. Build a prompt as an array, mixing text with helpers like
ctx.image_url or ctx.local_file:
Image from a URL
ctx.talk ["Describe this image",
ctx.image_url("https://example.com/cat.jpg")]
Local file attachment
Use ctx.ask(..., with:) to attach a file directly.
The provider adapter handles the format — no need to upload first:
puts ctx.ask("Summarize this document.", with: "README.md").content
Generating audio
res = llm.audio.create_speech(input: "Hello world")
IO.copy_stream res.audio, "hello.mp3"
Generating images
res = llm.images.create(prompt: "a dog on a rocket to the moon")
IO.copy_stream res.images[0], "dogonrocket.png"
Observing requests and tool calls
Tracing is attached at the provider level. Once assigned, all requests and tool calls through that provider are instrumented — no per-context setup needed.
Logger tracer
Print request details to a stream. Useful for development and debugging:
llm.tracer = LLM::Tracer::Logger.new(llm, io: $stdout)
ctx = LLM::Context.new(llm)
ctx.talk("Hello")
Telemetry tracer
Collect spans programmatically. Useful for custom dashboards or integrating with APM tools:
llm.tracer = LLM::Tracer::Telemetry.new(llm)
ctx = LLM::Context.new(llm)
ctx.talk("Hello")
pp llm.tracer.spans
Langsmith tracer
Send traces to Langsmith for team-wide observability:
llm.tracer = LLM::Tracer::Langsmith.new(llm,
metadata: { env: "dev" }, tags: ["chatbot"])
Production patterns
These are small switches that make llm.rb production-ready — not a second framework. Providers are meant to be shared, contexts stay isolated, and performance or cost controls layer onto the same core objects.
Thread safety
Share provider instances across threads. Create one context per thread or request — that's where the mutable state lives:
llm = LLM.openai(key: ENV["KEY"])
t1 = Thread.new do
  ctx = LLM::Context.new(llm)
  ctx.talk("Hello from thread 1")
end
t2 = Thread.new do
  ctx = LLM::Context.new(llm)
  ctx.talk("Hello from thread 2")
end
# Wait for both threads so the process does not exit mid-request.
[t1, t2].each(&:join)
Performance
Swap to a faster JSON backend (:oj) and enable
persistent HTTP connections when request volume grows:
LLM.json = :oj
llm = LLM.openai(key: ENV["KEY"], persistent: true)
Cost tracking
Each context tracks its own usage. ctx.cost returns
the estimated spend so far:
ctx = LLM::Context.new(llm)
ctx.talk("Hello")
puts "Estimated cost: $#{ctx.cost}"
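For intuition about what an estimate like this involves, cost is usually token counts multiplied by a per-million-token price. The sketch below shows that arithmetic with made-up numbers. PRICES and estimate_cost are hypothetical, the prices are not real pricing for any model, and llm.rb's own calculation may differ.

```ruby
# Hypothetical prices in USD per one million tokens (not real pricing).
PRICES = {input: 0.50, output: 2.00}.freeze

# Estimate spend from token counts; llm.rb's own calculation may differ.
def estimate_cost(input_tokens:, output_tokens:, prices: PRICES)
  (input_tokens * prices[:input] + output_tokens * prices[:output]) / 1_000_000
end

cost = estimate_cost(input_tokens: 12_000, output_tokens: 3_000)
puts format("$%.4f", cost) # => $0.0120
```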
Model registry
The local registry provides model capabilities, pricing, and limits without API calls — useful for making local decisions about model selection:
registry = LLM.registry_for(:openai)
info = registry.limit(model: "gpt-4.1")
puts "Context window: #{info.context} tokens"
Building a REPL
A simple interactive loop using LLM::Context directly.
Each line of input becomes a turn, and streaming sends output to the
terminal as it arrives:
require "llm"
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout)
loop do
  print "> "
  ctx.talk(STDIN.gets || break)
  puts
end
Cancelling a request mid-stream
Need to stop a long generation? ctx.interrupt! cancels
the active provider request and notifies any running tools through
on_interrupt. Useful in web apps where one message may
need to cancel an earlier in-flight request:
require "llm"
require "io/console"
llm = LLM.openai(key: ENV["KEY"])
ctx = LLM::Context.new(llm, stream: $stdout)
worker = Thread.new do
  ctx.talk("Write a very long essay about network protocols.")
rescue LLM::Interrupt
  puts "Request was interrupted!"
end
STDIN.getch
ctx.interrupt!
worker.join
More resources
The deepdive guide covers every feature with runnable examples — providers, tools, agents, schemas, MCP, A2A, compaction, tracing, and production patterns.
Relay
Relay is a production application built on llm.rb with context management, tool composition, concurrent workflows, and cost tracking.
Robert
Robert is a 2MB native FreeBSD AI assistant built with mruby-llm.
rails-llm
rails-llm provides a Rails engine with chat UI, generators, and acts_as_agent.