What Led Us to Agents
There are agentic systems everywhere today. Product demos show AI tools that can search, plan, call APIs, write code, and loop until a task is done. Two years ago, this was not how most people built with LLMs. So what changed?
Below is a simple story of how we got here: from small context windows, to RAG, to function calling, and finally to modern agents.
The early days: tiny context length
When the first popular LLMs came out, context length was small. (The original GPT-3.5 shipped with a 4k-token window, and a 16k variant came later; today many models offer 200k+.) You could only fit a short prompt and maybe a few paragraphs of reference text, so the model often lacked the facts it needed and hallucinated as a result. You had two options:
- Paste everything into the prompt (expensive, slow, often impossible), or
- Find the right pieces of information and include only those.
Option 2 led to the now-famous pattern of Retrieval-Augmented Generation (RAG).
What is RAG (high level)
RAG was introduced in 2020 by researchers at Facebook AI Research (now Meta AI) in “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” (Lewis et al., 2020). It quickly became popular because it worked around the limited context length of LLMs.

RAG is a pattern where you:
- Create embeddings for your documents and store them in a vector database. (If a document is very long, you can optionally split it into passages before embedding.)
- For each user query, use embedding search to retrieve the most relevant passages.
- Put those passages into the model’s context, along with the prompt.
- Let the model generate an answer grounded in the retrieved text (a minimal sketch of this loop follows).
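As a rough sketch, the loop might look like this. The embed, vector_db, llm, and split_into_passages helpers are placeholders for whatever embedding model, vector store, chat model, and chunker you actually use:

# Minimal RAG sketch (embed, vector_db, llm, split_into_passages are placeholders)
def index(documents):
    for doc in documents:
        for passage in split_into_passages(doc):
            vector_db.add(embedding=embed(passage), payload=passage)

def rag_answer(query, top_k=3):
    hits = vector_db.search(embedding=embed(query), top_k=top_k)
    context = "\n\n".join(hit.payload for hit in hits)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)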

Why it helped:
- Fetches up-to-date knowledge for the LLM without needing to fine-tune the model.
- Reduces hallucinations by grounding answers in your data.
- Controls cost by adding only the relevant context.
This worked well for many simple Q&A use cases and is still a core building block today.
Where RAG falls short
As people pushed RAG further, they hit limits:
- Embedding search coverage: Some queries don’t match well to chunks (long‑tail phrasing, multi‑hop reasoning, or very specific details).
# Embedding search coverage issue
query = "Compare EU and US refund cutoff rules for preorders"
results = vector_db.search(query, top_k=3)
# relies on semantic similarity only
# None of the top passages mention both "refund" and "cutoff" together.
Why embedding-only fails here:
- Compositional blur: The query implies AND constraints (refund ∧ cutoff ∧ preorder ∧ EU/US). Dense vectors reward partial matches to subsets of the meaning.
- Underweighted anchors: Rare tokens like “cutoff” and jurisdiction names (“EU”, “US”) carry precise meaning that gets diluted by broader concepts like “refund policy”.
Concrete failure (illustrative bad hits):
Query: "Compare EU and US refund cutoff rules for preorders"
Top retrieval might be:
1) "EU right of withdrawal allows consumers to cancel within 14 days for distance sales..."
# No mention of preorders or a preorder‑specific cutoff; EU‑only, no comparison.
2) "How chargeback deadlines work in the US (typically up to 120 days)..."
# Wrong policy regime; not merchant preorder refund cutoff; missing EU/US comparison.
# Semantically close due to deadline/US cues.
3) "Preorder refunds depend on retailer policies and stock availability..."
# No explicit cutoff; no EU vs US comparison; too generic for the question.
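To make the partial-match problem concrete, here is a tiny hypothetical check of how many of the query's AND-constraints each of those top hits actually covers. None of them covers refund, cutoff, preorder, EU, and US at once:

# Hypothetical coverage check over the illustrative bad hits above
import re

REQUIRED = ["refund", "cutoff", "preorder", "EU", "US"]
hits = [
    "EU right of withdrawal allows consumers to cancel within 14 days for distance sales...",
    "How chargeback deadlines work in the US (typically up to 120 days)...",
    "Preorder refunds depend on retailer policies and stock availability...",
]

for text in hits:
    covered = [kw for kw in REQUIRED if re.search(rf"\b{kw}", text, re.IGNORECASE)]
    missing = [kw for kw in REQUIRED if kw not in covered]
    print(f"covers {covered}, misses {missing}")
# Each hit matches only a subset of the constraints, yet still scores as "relevant".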
- Chunking brittleness: Important context may be split across chunks; naive chunk sizes miss structure in code, tables, or docs. Example: a function’s behavior lives across signature, comments, and tests; splitting by characters hides the full picture.
# Chunking brittleness
text = load_long_policy()
# Naive fixed-size splits may separate tables from their headers
chunks = naive_split(text, size=500)
# Better: structure-aware chunking (e.g., by headings/tables/code blocks)
chunks = split_by_structure(text, headings=True, tables=True)
- Cross‑document reasoning: Answers that need to combine facts from many places are hard with basic top‑k retrieval. Example: join a pricing table from one doc with policy exceptions from another.
# Cross-document reasoning
policy = vector_db.search("refund cutoff dates", top_k=1)
pricing = vector_db.search("preorder pricing exceptions", top_k=1)
# Basic RAG returns two unrelated passages; the model must join them.
# A graph or multi-hop retriever can explicitly fetch and combine both.
joined = join_passages([policy[0], pricing[0]])
- No memory: Classic RAG retrieves per query; it doesn’t remember what happened in previous steps or sessions. Example: it forgets the user’s team, region, or the previous failure path.
# No memory across steps (classic RAG)
def answer(query):
    ctx = retrieve(query)
    return llm(answer_prompt(query, ctx))  # forgets prior turns

# Add short-term memory by threading a scratchpad
def answer_with_memory(query, history):
    ctx = retrieve(query)
    return llm(answer_prompt(query, ctx, history=history))
- Actions vs. answers: RAG fetches context, but it doesn’t do anything by itself. Wiring multiple data sources (databases, APIs, web search) requires extra tooling and orchestration beyond naive RAG.
# Actions vs. answers
# RAG: can fetch text, but cannot act
ctx = retrieve("send preorders report to finance")
resp = llm(f"Based on: {ctx}. Should I send the email?") # text only
# Agents: call tools to act
plan = llm_plan("Send preorders report to finance", tools=[query_db, send_email])
if plan.tool_call:
    result = dispatch(plan.tool_call)  # performs the action
People started asking: how do we go beyond “find and answer” to “decide and act”?
Function calling: giving models hands
In 2023, OpenAI introduced function calling (see: https://openai.com/index/function-calling-and-other-api-updates/). Function calling lets the model return a structured payload that tells your app which tool to run and with what arguments. Instead of guessing JSON formats from plain text, the model targets a schema you define. Your runtime then executes the function (e.g., search the web, query a database, send an email) and returns the result back to the model.
You define a tool schema like this:
{
  "name": "search_docs",
  "description": "Search internal docs",
  "parameters": {
    "type": "object",
    "properties": {
      "query": { "type": "string" },
      "top_k": { "type": "integer", "minimum": 1, "maximum": 10 }
    },
    "required": ["query"]
  }
}
The model can then propose a tool call by returning a structured output like this:
{
  "tool_call": {
    "name": "search_docs",
    "arguments": { "query": "refund policy cutoff dates", "top_k": 3 }
  }
}
Your runtime executes search_docs and returns results back to the model as an observation. The model then integrates the results and either calls another tool or produces a final answer.
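Tying the two payloads together, the runtime loop might look roughly like this. The llm and dispatch helpers and the exact message shapes are placeholders; real provider APIs differ in the details:

# Sketch of a tool-calling loop (llm, dispatch, and message shapes are placeholders)
def run(user_msg, tools, max_steps=5):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = llm(messages, tools=tools)
        if reply.tool_call is None:
            return reply.content                 # final answer
        observation = dispatch(reply.tool_call)  # your code runs the tool
        messages.append({"role": "tool",
                         "name": reply.tool_call.name,
                         "content": str(observation)})
    return "Stopped after max_steps without a final answer."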
This helps with:
- Structured I/O: Fewer parsing hacks; arguments match a schema.
Without function calling:
import json

prompt = """Extract a city and date from: 'weather in Paris tomorrow'.
Return in JSON format {city, date}."""
raw = llm(prompt)
data = json.loads(raw)  # may fail or be malformed
city, date = data.get("city"), data.get("date")
With function calling:
tools = [{
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "date": {"type": "string"}},
        "required": ["city", "date"]
    }
}]
plan = llm(user_msg, tools=tools)
call = plan.tool_call     # already structured
weather = dispatch(call)  # no fragile parsing
- Grounded actions: The model proposes; your code executes safely with auth, rate limits, and validations.
# Grounded actions: enforce auth and rate limits at the tool layer
@rate_limit("query_db", per_minute=30)
def query_db(sql: str):
    return db.execute(sql)
- Lower hallucinations: The model uses tool outputs rather than inventing facts.
# Lower hallucinations: use tool output as source of truth
plan = llm("What's the weather in Paris tomorrow?", tools=[get_weather])
obs = dispatch(plan.tool_call)
answer = llm(f"Summarize this official weather data: {obs}")
- Better control: You decide which tools exist, when they can be called, and how results are filtered.
# Better control: allow-list tools and validate inputs
ALLOWED = {"search_docs", "get_file", "query_db"}
def dispatch_tool(call):
    if call.name not in ALLOWED:
        raise PermissionError("Tool not allowed")
    validate(call.arguments)
    return TOOL_REGISTRY[call.name](**call.arguments)
With function calling, an LLM can do more than answer—it can operate.
How this changed the way people build LLM systems
With function calling, larger context windows, and smarter models, people started building agents that can chain many steps. For example: look at a dataset, realize the data is incomplete, fetch an additional file, read it, query an external API for fresh values, and only then produce a result. This fixes many limits of naive RAG because the model can decide when to retrieve, from which source, and what to do next.
A small text illustration of an agent using three tools — read a file, create a plan, and execute actions:
- Tools available: read_file(path), create_plan(goal, history), execute(action, params)
- Goal: “Summarize the dataset schema and produce a short report.”
- Step 1 — Plan: The agent calls create_plan(goal, history) → Plan: (1) read data/schema.json, (2) extract key fields, (3) draft summary.
- Step 2 — Read: The agent calls read_file('data/schema.json') → gets the schema content (omitted for brevity).
- Step 3 — Update: With the schema in hand, the agent refines the plan: highlight missing fields and note data types.
- Step 4 — Execute: The agent calls execute('summarize_schema', { 'schema_path': 'data/schema.json' }) → produces a concise report.
- Step 5 — Finish: The agent returns the report as the final answer.
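In code, that loop might look something like this. read_file, create_plan, and execute stand in for whatever tool implementations you wire up, and llm_decide is a placeholder for a model call that returns either a tool call or a final answer:

# Hypothetical agent loop over the three tools above
TOOLS = {"read_file": read_file, "create_plan": create_plan, "execute": execute}

def run_agent(goal, max_steps=10):
    history = []
    for _ in range(max_steps):
        decision = llm_decide(goal, history, tools=list(TOOLS))
        if decision.final_answer:              # Step 5: finish
            return decision.final_answer
        result = TOOLS[decision.tool](**decision.arguments)
        history.append((decision.tool, decision.arguments, result))
    return "Gave up after max_steps."

report = run_agent("Summarize the dataset schema and produce a short report.")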
RAG still matters a lot. Instead of using RAG on every turn, the agent decides which tool to call and when retrieval is actually needed.
Function calling ties everything together by letting the model plan and your code execute. Memory and state give continuity across steps. The result is an agent that can gather context, make decisions, and take actions until the job is done.
Putting it all together
The path went like this:
- Small context length forced selective context → RAG.
- RAG solved grounding, but not actions or memory.
- Function calling enabled safe, structured actions.
- Better models, bigger contexts, and better tooling made orchestration practical.
- Agents emerged: planners with tools, memory, and guardrails.
That’s how we got into agentic systems.