Self-Hosting Firecrawl with LLM-Powered Web Scraping
Last month I replaced my SearXNG instance with Firecrawl — a self-hosted web scraping platform that can do far more than just search. Here's how I set it up with OpenRouter as the LLM backend, what works, what doesn't, and why I made the switch.
Why Firecrawl?
I was running SearXNG at 192.168.100.32:8181 for web search. It worked, but it was just search — no content extraction, no structured data, no JavaScript rendering. Firecrawl adds:
- JS rendering — crawls SPAs and dynamic sites
- LLM-powered extraction — structured JSON from any page
- Crawl, map, and batch operations — beyond single-page scraping
- Same self-hosted model — no API costs for scraping itself
Architecture
Hermes Agent → Firecrawl API (192.168.100.32:33002) → OpenRouter (LLM backend)
Firecrawl handles the scraping. When I need AI extraction (summaries, structured data), it calls OpenRouter. The scraping is local and free; only the LLM calls cost money.
Setting Up the LLM Backend
Firecrawl v2.9.0 uses Vercel's AI SDK (@ai-sdk/openai), which hardcodes the Responses API (/responses). This is critical — providers that only support /chat/completions will fail.
Docker Compose Configuration
services:
  api:
    environment:
      - OPENAI_BASE_URL=https://openrouter.ai/api/v1
      - OPENAI_API_KEY=sk-or-v1-...
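Since a missing or empty OPENAI_BASE_URL is an easy mistake to make here, a small pre-flight check can catch it before Firecrawl does. This is only a sketch: check_llm_env is a hypothetical helper of mine, not part of Firecrawl, and it only checks that the two variables look sane, not that the provider accepts them.

```python
import os
from typing import List, Mapping
from urllib.parse import urlsplit


def check_llm_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty means the values look usable."""
    problems = []

    # A usable base URL needs both a scheme and a host.
    base_url = env.get("OPENAI_BASE_URL", "").strip()
    parts = urlsplit(base_url)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        problems.append("OPENAI_BASE_URL is missing or not an absolute http(s) URL")

    # Only checks presence; a wrong key will still be rejected by the provider.
    if not env.get("OPENAI_API_KEY", "").strip():
        problems.append("OPENAI_API_KEY is missing or empty")

    return problems


# Run inside the api container to see what the process actually receives.
for problem in check_llm_env(os.environ):
    print("config problem:", problem)
```

Running it inside the container rather than on the host matters, because it shows the values the api service really sees.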
What Works and What Doesn't
OpenRouter — ✅ Working (supports /responses)
OpenAI direct — ✅ Working (native support)
GLM/Z.ai — ❌ Broken (only /chat/completions)
The GLM failure was frustrating — I tried both api.z.ai and open.bigmodel.cn endpoints, but Firecrawl constructs /responses URLs that these providers don't recognize.
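To find out whether a provider will work before wiring it into Firecrawl, you can probe its /responses route directly. The sketch below is my own and uses only the Python standard library. It assumes the provider follows the OpenAI Responses request shape (a JSON body with model and input) and that a 404 or 405 means the route is not implemented. The model name is a placeholder you would swap for one your provider offers.

```python
import json
import urllib.error
import urllib.request


def build_responses_probe(base_url: str, api_key: str, model: str) -> urllib.request.Request:
    """Build (but do not send) a minimal POST to <base_url>/responses."""
    url = base_url.rstrip("/") + "/responses"
    body = json.dumps({"model": model, "input": "ping"}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def supports_responses(base_url: str, api_key: str, model: str, timeout: float = 30.0) -> bool:
    """Send the probe. 404/405 is treated as 'route missing'; other HTTP errors are re-raised."""
    request = build_responses_probe(base_url, api_key, model)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except urllib.error.HTTPError as err:
        if err.code in (404, 405):
            return False
        raise  # e.g. 401 means the route may exist but the key was rejected


# Example (makes a real, billable request, so it is left commented out):
# print(supports_responses("https://openrouter.ai/api/v1", "sk-or-v1-...", "your/model-name"))
```

A 401 is deliberately not treated as "unsupported", since that points at the key rather than the route.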
Verified Capabilities
Core Scraping (No LLM Required)
- POST /v1/scrape: Single page to markdown
- POST /v1/crawl: Recursive site crawling
- POST /v1/map: URL discovery
- POST /v1/search: Web search
- POST /v1/batch/scrape: Async multi-URL
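As a rough illustration of the simplest of these calls, here is a standard-library sketch of a single-page scrape against my instance. The request body (url plus a formats list) and the response shape (the markdown under data.markdown) are my understanding of the v1 API, so treat them as assumptions and check them against your version. No Authorization header is sent, which assumes the self-hosted instance runs without authentication.

```python
import json
import urllib.request

FIRECRAWL_URL = "http://192.168.100.32:33002"  # the instance from this post


def build_scrape_request(page_url: str, formats=("markdown",), base: str = FIRECRAWL_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/scrape request."""
    payload = {"url": page_url, "formats": list(formats)}
    return urllib.request.Request(
        base.rstrip("/") + "/v1/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def scrape_markdown(page_url: str, timeout: float = 60.0) -> str:
    """Send the request and return the page as markdown."""
    with urllib.request.urlopen(build_scrape_request(page_url), timeout=timeout) as resp:
        body = json.load(resp)
    # Assumed response shape: {"success": true, "data": {"markdown": "..."}}
    return body["data"]["markdown"]


# Example (needs a reachable Firecrawl instance):
# print(scrape_markdown("https://example.com")[:200])
```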
LLM-Powered Extraction
Firecrawl can extract structured data using a JSON schema. I tested this on several sites and got back structured product data with TAM estimates — all without writing a custom parser.
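For a sense of what a schema-driven request looks like, here is a sketch of the payload. The jsonOptions field name is my assumption about how the v1 scrape endpoint accepts a schema, and it may differ between Firecrawl versions. The product schema and prompt are made up for illustration and are not the ones I used.

```python
import json
from typing import Any, Dict, Optional


def build_json_extraction_payload(
    page_url: str, schema: Dict[str, Any], prompt: Optional[str] = None
) -> Dict[str, Any]:
    """Build a scrape payload that asks for the json format with an explicit schema."""
    options: Dict[str, Any] = {"schema": schema}
    if prompt:
        options["prompt"] = prompt
    # "jsonOptions" is an assumed field name; verify it against your Firecrawl version.
    return {"url": page_url, "formats": ["json"], "jsonOptions": options}


# Hypothetical JSON Schema describing the fields to pull out of a product page.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "string"},
    },
    "required": ["name"],
}

payload = build_json_extraction_payload(
    "https://example.com/product",
    PRODUCT_SCHEMA,
    prompt="Extract the product name and price.",
)
print(json.dumps(payload, indent=2))
```

The payload would be POSTed to /v1/scrape like any other scrape. This is the request that triggers an LLM call, so it is the part that costs money.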
What Doesn't Work
- question format: "Query generation failed after all models"
- highlights format: Same error
- /v1/extract (batch): Deprecated, never completes
- /v1/agent: 500 error
I stick to json format with explicit schemas.
SPA Handling
One pleasant surprise: Firecrawl handles JavaScript-rendered sites well. For a Next.js site I tested, map returned nothing because the initial HTML contains no <a href> links. Crawl, by contrast, executed the JavaScript and found routes such as /signup, /login, and /forgot-password.
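That behaviour suggests a simple pattern: try map first and fall back to crawl only when map comes back empty. The sketch below shows just that control flow. discover_urls and the two callables are hypothetical names of mine, and the fake functions stand in for real calls to /v1/map and /v1/crawl so that the example runs without a Firecrawl instance.

```python
from typing import Callable, Iterable, List


def discover_urls(
    site_url: str,
    map_site: Callable[[str], Iterable[str]],
    crawl_site: Callable[[str], Iterable[str]],
) -> List[str]:
    """Try the map call first; fall back to a JS-executing crawl if it finds nothing."""
    links = list(map_site(site_url))
    if not links:
        links = list(crawl_site(site_url))
    return sorted(set(links))  # de-duplicate and give a stable order


# Stand-in callables; real ones would wrap POST /v1/map and POST /v1/crawl.
def fake_map(site_url: str) -> List[str]:
    return []  # mimics the empty map result on the Next.js site


def fake_crawl(site_url: str) -> List[str]:
    return [f"{site_url}/signup", f"{site_url}/login", f"{site_url}/login"]


print(discover_urls("https://app.example.com", fake_map, fake_crawl))
# prints ['https://app.example.com/login', 'https://app.example.com/signup']
```

Passing the two calls in as functions keeps the fallback logic testable and independent of whatever request and polling details your Firecrawl version needs.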
Troubleshooting
"Failed to parse URL from /responses" — Cause: OPENAI_BASE_URL is missing. Fix: Set valid base URL. Even empty value causes Firecrawl to call https:///responses.
"token expired or incorrect" (401) — Cause: API key rejected. Fix: Verify OPENAI_API_KEY is set in Docker Compose environment section, not just .env file.
Key Takeaways
- OpenRouter is the practical choice for Firecrawl's LLM backend — it supports the Responses API.
- Use json format with schemas for structured extraction.
- Crawl beats map for SPAs — JS execution finds routes that static analysis misses.
- Self-hosted means no scraping costs — you only pay for LLM extraction when you use it.
Firecrawl isn't perfect — some endpoints are broken, the AI SDK dependency is restrictive — but for self-hosted web scraping with optional AI extraction, it's the best tool I've found.