artiebits.com

How to Build a Local AI Agent

Updated on April 3, 2026

This is a beginner-friendly guide to building a local AI agent in JavaScript: an agent you can chat with in a terminal, powered by a model running entirely on your machine.

We’ll use Gemma 3n, a strong open model for local use, and run it via llama.cpp. This setup gives us zero API costs and full privacy.

Running the model locally

Install llama.cpp with Homebrew:

brew install llama.cpp

Then start the model server:

llama-server -hf ggml-org/gemma-3n-E4B-it-GGUF

The -hf flag downloads the model from Hugging Face and starts an OpenAI-compatible HTTP server at http://127.0.0.1:8080.

In other words, the llama.cpp server exposes an API compatible with OpenAI’s Chat Completions API. So any app or tool that was built to work with OpenAI (VS Code extensions, LangChain, LlamaIndex, etc.) can connect to your local server too.
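To see what “OpenAI-compatible” means concretely, here’s a minimal sketch that talks to the server with plain fetch, no SDK at all. The helper name and the "local" model string are placeholders of my own; llama.cpp serves whatever model it loaded regardless of the name you send.

```javascript
// Build a Chat Completions request body in the shape OpenAI's API
// defines, which llama-server also accepts at /v1/chat/completions.
function buildChatRequest(messages) {
  return {
    url: "http://127.0.0.1:8080/v1/chat/completions",
    body: {
      model: "local", // llama.cpp ignores this and uses the loaded model
      messages,
    },
  }
}

// Usage (requires the llama-server from above to be running):
// const { url, body } = buildChatRequest([{ role: "user", content: "hi" }])
// const res = await fetch(url, {
//   method: "POST",
//   headers: { "Content-Type": "application/json" },
//   body: JSON.stringify(body),
// })
// console.log((await res.json()).choices[0].message.content)
```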

Setting up the project

mkdir local-agent
cd local-agent
npm init -y
npm pkg set type=module
npm install @openai/agents openai
touch index.js

We installed two packages: @openai/agents, the Agents SDK that gives us the agent loop, sessions, and tools, and openai, the client library we’ll point at the local server.

npm pkg set type=module adds "type": "module" to package.json so we can use ES module imports.

The agent

Let’s start with connecting to the local server:

import OpenAI from "openai"
import {
  Agent,
  run,
  setDefaultOpenAIClient,
  setTracingDisabled,
  OpenAIChatCompletionsModel,
  MemorySession,
} from "@openai/agents"

const client = new OpenAI({
  baseURL: "http://127.0.0.1:8080/v1",
  apiKey: "no-key",
})
setDefaultOpenAIClient(client)
setTracingDisabled(true)

We point the client at our llama.cpp server. The API key doesn’t matter here but the SDK requires something non-empty. setTracingDisabled(true) turns off OpenAI’s telemetry since we’re not using their infra.

Now the agent itself:

const agent = new Agent({
  name: "Assistant",
  instructions: "You are a helpful assistant. Answer clearly and directly.",
  model: new OpenAIChatCompletionsModel(client, "local"),
})

Agent takes a name, a system prompt, and a model. The system prompt (instructions) is your agent’s personality and behavior. Change it to make the agent more formal, more terse, focused on a specific domain, or anything else.

Memory:

let session = new MemorySession()

MemorySession keeps track of the conversation. Without it, the agent would be completely stateless, because LLMs have no built-in memory: every request is independent by default. The SDK uses the session to include the full message history in every request, which is what makes the model appear to remember what was said.
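Conceptually, a session is just an accumulating message list. This toy class (my own illustration, not the SDK’s implementation) shows the idea: each turn appends to the history, and the whole history is what gets sent to the model.

```javascript
// Naive sketch of what a session does: collect messages so every
// request can resend the full conversation so far.
class NaiveSession {
  constructor() {
    this.messages = []
  }
  add(role, content) {
    this.messages.push({ role, content })
  }
  // What the model sees on each turn: everything said so far.
  history() {
    return [...this.messages]
  }
}

const s = new NaiveSession()
s.add("user", "hi, my name is Artur")
s.add("assistant", "Hi Artur!")
s.add("user", "what's my name?")
// The model receives all three messages, which is why it can
// answer "Artur" even though each request is stateless.
```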

The chat function:

async function chat(userInput) {
  process.stdout.write("\nAssistant: ")
  const stream = await run(agent, userInput, { session, stream: true })
  stream.toTextStream({ compatibleWithNodeStreams: true }).pipe(process.stdout)
  await stream.completed
  process.stdout.write("\n")
}

run() sends the message and returns a stream. This makes the response appear word by word, as if the model is typing back to you. If you set stream to false, you wait until the model generates the full response before seeing any of it.
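The difference is easy to see with a toy stream. This sketch (fake tokens, no model involved) mimics how the SDK’s text stream yields chunks that you can print as they arrive:

```javascript
// An async generator that yields tokens one at a time, the way a
// streaming response does.
async function* fakeTokenStream() {
  for (const token of ["Hello", " ", "world", "!"]) {
    yield token
  }
}

// Streaming: handle each chunk as it arrives, so text shows up
// incrementally instead of all at once.
async function printStreamed(stream) {
  let full = ""
  for await (const chunk of stream) {
    process.stdout.write(chunk) // appears token by token
    full += chunk
  }
  return full
}

printStreamed(fakeTokenStream()).then((text) => {
  // With stream: false you would only ever see this final string.
  console.log("\nFull response:", text)
})
```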

And the CLI loop:

import readline from "node:readline"

const rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout,
  terminal: true,
  prompt: "\nYou: ",
})

async function main() {
  rl.prompt()
  rl.on("line", async (line) => {
    const input = line.trim()
    if (!input) {
      rl.prompt()
      return
    }
    rl.pause()
    try {
      await chat(input)
    } catch (err) {
      console.error("\n[Error]", err.message ?? err)
    } finally {
      rl.resume()
      rl.prompt()
    }
  })

  rl.on("close", () => process.exit(0))
}

main()

The loop reads a line, skips empty input, and calls chat(). We pause readline while waiting for a response so user input doesn’t get mixed in with the streamed output. Press Ctrl+C (or Ctrl+D) to exit.

Let’s try it

$ node index.js


You: hi, my name is Artur
Assistant: Hi Artur! Nice to meet you. What can I help you with?

You: what's my name?
Assistant: Your name is Artur.

It remembers your name because of the session.

Adding tools

Right now the agent can only talk. To make it actually useful, we give it tools — functions it can call to interact with the world.

When we pass tools to the agent, the model looks at the user’s request and decides whether to call one. The SDK handles the rest: it executes the function, sends the result back to the model, and the model decides what to do next — call another tool or respond to the user.
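That “handles the rest” part can be sketched as a loop. Here’s a stripped-down illustration with a hard-coded fake model so it’s self-contained; the tool name, the fake model, and runAgentLoop are all placeholders of mine, not SDK APIs. A real model decides dynamically whether to call a tool or answer.

```javascript
// Available tools, keyed by name.
const tools = {
  get_time: () => "12:00",
}

// Fake "model": first asks for a tool, then answers using its result.
function fakeModel(messages) {
  const last = messages[messages.length - 1]
  if (last.role === "user") {
    return { toolCall: { name: "get_time", args: {} } }
  }
  return { answer: `It is ${last.content}.` }
}

// The shape of the loop the SDK runs for us.
function runAgentLoop(userInput) {
  const messages = [{ role: "user", content: userInput }]
  while (true) {
    const step = fakeModel(messages)
    if (step.answer) return step.answer // model chose to respond
    // Model chose a tool: execute it and feed the result back,
    // then let the model decide what to do next.
    const result = tools[step.toolCall.name](step.toolCall.args)
    messages.push({ role: "tool", content: result })
  }
}
```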

We’ll use the tool() helper from @openai/agents and zod to define the parameters. Install zod first:

npm install zod

Then define the tool:

import fs from "node:fs"
import { tool } from "@openai/agents"
import { z } from "zod"

const readFileTool = tool({
  name: "read_file",
  description:
    "Read the contents of a given relative file path. Use this when you want to see what's inside a file. Do not use this with directory names.",
  parameters: z.object({
    path: z.string().describe("The relative path of a file in the working directory."),
  }),
  execute: async ({ path }) => {
    if (!fs.existsSync(path)) {
      throw new Error(`File does not exist: ${path}`)
    }
    return fs.readFileSync(path, "utf8")
  },
})

If the file doesn’t exist, we throw an error. The SDK catches it and sends the message back to the model, so it can self-correct or explain what went wrong.
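The catch-and-forward behavior described above amounts to something like this sketch (my own illustration of the idea, not the SDK’s code): the error becomes an ordinary string result instead of crashing the run.

```javascript
// Run a tool, converting any thrown error into text the model can
// read, so it can retry with a better path or explain the problem.
async function safeExecute(tool, args) {
  try {
    return await tool.execute(args)
  } catch (err) {
    return `Tool error: ${err.message}`
  }
}
```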

Then add it to the agent:

const agent = new Agent({
  name: "Assistant",
  instructions:
    "You are a helpful, concise assistant. Answer clearly and directly.",
  model: new OpenAIChatCompletionsModel(client, "local"),
  tools: [readFileTool],
})

Let’s see it in action:

$ node index.js

You: what's in index.js? be brief
Assistant: `index.js` sets up a local AI agent using the OpenAI Agents SDK pointed at a llama.cpp server. It creates a streaming chat loop with memory.

Pretty cool, right?

You can build tools for almost anything. Reading/writing files, searching the web, creating files, setting reminders.

And there are plenty of examples online. OpenClaw’s tools are open source, and Claude Code’s codebase has reportedly leaked, so you can copy tool designs from there too.

Wrapping Up

Tools like Cursor or Claude Code seem magical when you watch them work. But at their core, they are built on the same foundation: a loop that maintains conversation context, a set of tools, and a model that decides how to use them.