P26-06-07">

Ollama as the default LLM for enterprise-shaped systems

PROVIDER=ollama, granite4.1:3b, zero API keys, no Azure account. How to make a multi-agent project that demonstrates enterprise patterns run end-to-end on a laptop in 90 seconds.

Most multi-agent demos require a cloud API key. The reading audience has to either trust you with billing keys, sign up for credits, or skip the demo. None of those are healthy defaults.

This project's default is PROVIDER=ollama with OLLAMA_MODEL=granite4.1:3b. The full make all — pre-flight, install, observability stack, tests, all three working workflows, both eval harnesses, URLs — runs on a laptop in about 90 seconds with no API keys. This post is about how that's built.

The provider abstraction

providers.py is 30 lines:

def build_chat_client(settings: Settings | None = None) -> Any:
    s = settings or load_settings()

    if s.provider == "openai":
        from agent_framework.openai import OpenAIChatCompletionClient
        return OpenAIChatCompletionClient(model=s.openai_model, api_key=s.openai_api_key)

    if s.provider == "foundry":
        from agent_framework.foundry import FoundryChatClient
        from azure.identity import AzureCliCredential
        return FoundryChatClient(
            project_endpoint=s.foundry_project_endpoint,
            model=s.foundry_model,
            credential=AzureCliCredential(),
        )

    if s.provider == "ollama":
        from agent_framework.ollama import OllamaChatClient
        return OllamaChatClient(host=s.ollama_host, model=s.ollama_model)

    raise ValueError(f"Unknown provider: {s.provider}")

Every agent factory in the project takes its client from build_chat_client(). Nothing else cares which provider it is. To swap from local development to Azure Foundry production, you change two env vars; nothing else.

Why granite4.1:3b

The provider abstraction works with any Ollama model. The choice of default matters because it sets what users see the first time they run the demo. The candidates I tried, with rough quality on the test prompts:

Model Size Sequential Concurrent Multi-turn eval (judge)
llama3.2:1b 1.3 GB Hallucinated content; agent prompts mostly ignored Workers returned empty 1/3 cases passed
granite4.1:3b 2.1 GB All four agents followed instructions; critic produced exact format All three specialists ran clean 2/3 cases passed (one judge nit)
qwen3.5:latest 6.6 GB Same as granite, slightly higher quality Same 3/3 cases passed
nemotron3:33b 27.6 GB Best quality Slow Slow

granite4.1:3b is the lowest model size where the agent prompts get followed reliably enough to demonstrate the framework instead of the failure modes of a small LM. It's the sweet spot for a default — 2GB pull is acceptable for a first-run, and the speed is fine on consumer hardware.

The Makefile lets you override: make all MODEL=qwen3.5:latest. The default is just a default.

The Makefile's preflight target

preflight: ## Verify Ollama is running and the default model is pulled
    @if ! curl -sf $(OLLAMA_HEALTH_URL) >/dev/null 2>&1; then \
        echo "  ❌ Ollama is not reachable at http://localhost:11434"; \
        echo "     Install: brew install ollama (or https://ollama.com/download)"; \
        echo "     Start  : ollama serve &"; \
        exit 1; \
    fi
    @echo "  ✅ Ollama is running"
    @if ! curl -s $(OLLAMA_HEALTH_URL) | grep -q '"$(MODEL)"'; then \
        echo "  ⏬ Pulling $(MODEL) (one-time)..."; \
        ollama pull $(MODEL); \
    fi
    @echo "  ✅ Model $(MODEL) is available"

Idempotent, friendly errors, single pull when needed. The whole user-facing setup is two commands: brew install ollama and make all.

Two upstream bugs worth knowing

The agent-framework-ollama package has two surface-level issues that bit me:

1. allow_multiple_tool_calls kwarg

In MAF 1.0.0b260521, agent_framework_ollama._chat_client._stream forwards **kwargs to ollama.AsyncClient.chat(). HandoffBuilder passes allow_multiple_tool_calls=... as one of those kwargs. The ollama Python client has never accepted that parameter. Result:

ChatClientException: ("Ollama streaming chat request failed :
  AsyncClient.chat() got an unexpected keyword argument 'allow_multiple_tool_calls'", ...)

This breaks the handoff pattern on Ollama specifically. Sequential, concurrent, and custom-graph workflows are fine. The fix is upstream; for now, run handoff against OpenAI or Foundry.

2. ollama version pin

agent-framework-ollama 1.0.0b260521 pins ollama<0.5.4 in metadata but its code uses behaviour only present in ollama>=0.6.0. Pip will warn:

agent-framework-ollama 1.0.0b260521 requires ollama<0.5.4,>=0.5.3,
but you have ollama 0.6.2 which is incompatible.

…and install 0.6.2 anyway. That's the right outcome — 0.6.2 is what the framework actually needs.

OpenAI-compatible everywhere

The multi-turn eval harness uses the raw OpenAI Python SDK because it drives the agent loop explicitly (which is the whole point of multi-turn eval). The SDK works against any OpenAI-compatible base_url. Ollama exposes one at http://localhost:11434/v1:

# evals/multi_turn/_clients.py
_DEFAULT_BASE_URL = "http://localhost:11434/v1"
_DEFAULT_API_KEY = "ollama"   # Ollama ignores the key; the SDK requires one.
_DEFAULT_MODEL = "granite4.1:3b"

You point at OpenAI or Azure OpenAI by changing the env vars:

export EVAL_AGENT_BASE_URL=https://api.openai.com/v1
export EVAL_AGENT_API_KEY=sk-...
export EVAL_AGENT_MODEL=gpt-5-mini
export EVAL_JUDGE_MODEL=gpt-5.1

The harness code doesn't change. The same multi-turn eval runs against Ollama in dev and OpenAI in CI.

Defaults aligned

.env.example is the single source of truth for what the unconfigured project does:

PROVIDER=ollama
OLLAMA_HOST=http://localhost:11434
OLLAMA_MODEL=granite4.1:3b

EVAL_AGENT_BASE_URL=http://localhost:11434/v1
EVAL_AGENT_API_KEY=ollama
EVAL_AGENT_MODEL=granite4.1:3b
EVAL_JUDGE_MODEL=granite4.1:3b

OTEL_SERVICE_NAME=multi-agent-maf
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

The OpenAI and Foundry sections of .env.example are below those, clearly labelled "only if PROVIDER=…". The intent is that you don't have to look at them on first run.

The make all recipe

The Makefile target that ties everything together:

all: ## Run EVERYTHING locally: pre-flight, stack, tests, workflows, evals, URLs
    @printf "\n==> 1/8  Pre-flight: Ollama + model\n"
    @$(MAKE) -s preflight
    @printf "\n==> 2/8  Install / venv\n"
    @$(MAKE) -s install-dev
    @printf "\n==> 3/8  Boot observability stack\n"
    @$(MAKE) -s stack-up
    @printf "\n==> 4/8  Unit tests (48 cases, no LLM)\n"
    @$(MAKE) -s test-fast
    @printf "\n==> 5/8  Smoke: registry + custom_graph\n"
    @$(MAKE) -s smoke
    @printf "\n==> 6/8  Workflows: sequential + concurrent + custom_graph\n"
    @$(MAKE) -s sequential PROMPT="$(PROMPT)"
    @$(MAKE) -s concurrent PROMPT="$(PROMPT)"
    @$(MAKE) -s custom PROMPT="hello there"
    @printf "\n==> 7/8  Evals: tool-call + multi-turn\n"
    @-$(MAKE) -s eval-tool
    @-$(MAKE) -s eval-multi-fast
    @printf "\n==> 8/8  Done. Inspect the data:\n"
    @$(MAKE) -s show-urls

The eval lines have a - prefix because the tool-call eval intentionally returns a non-zero exit code on the "calc-wrong-tool" case (the scorer correctly detects a deny). For the all target that's signal, not a build failure.

The Grafana / Prometheus / Jaeger URLs are printed at the end so the next thing you read after a successful run is "go look at the data."

Why this matters

Two reasons:

  1. Reproducibility. A demo that needs an API key is unreproducible by definition — your reader can't be sure whether their failure is the framework's fault or their account's. Ollama removes that variable.
  2. Cost as a non-concern in dev. You can run make all a hundred times and the cost is electricity. The architecture lessons aren't about token economics; they're about wiring. Wiring should be free to iterate on.

The bigger picture: the multi-agent reference architecture is not specific to Azure. The framework is multi-vendor. The provider abstraction in 30 lines is what lets the lesson generalise. Ollama-by-default isn't a compromise on production — it's the right teaching surface for what's portable.

When you ship to production, you flip two env vars to point at Foundry or Azure OpenAI, run make sequential once to confirm, and the same code runs the same workflow. That symmetry is what the provider abstraction buys.