June 23, 2026Paul Vilevac

The Chat on This Site Isn't a SaaS — It's the Demo

The assistant on bleenq.com runs on our own open-source platform: a scoped gateway key, a dedicated public knowledge base, semantic plus graph retrieval, and an outbound-only tunnel so nothing on our network is ever exposed. Here's the architecture, end to end.

AI Chat
RAG
Architecture
Security

Most company chatbots are a wrapper around someone else’s API. You type a question, a third-party model answers, and the “AI” is a billing relationship.

Ours isn’t.

The assistant on this site answers about Bleenq, our writing, and our open-source projects — and it runs on the same platform we build for clients. Same model gateway. Same scoped-key governance. Same knowledge-base pipeline. When you ask it a question, you’re not seeing a demo of someone else’s technology. You’re using ours.

That’s the whole point. For a consulting firm and incubator, “we can build serious AI systems” is a claim. A chat widget that is one of those systems, live on the homepage, is proof. So it’s worth showing exactly how it works — because the architecture is the argument.

Here’s the thing most people miss about a public AI endpoint: the model is the easy part. The hard part is everything around it — keeping secrets out of the browser, keeping your network off the public internet, grounding answers in real content, and making sure a public chatbox can’t be turned into a free LLM, a data-exfiltration vector, or a cost bomb. That’s where the engineering lives, and that’s what the rest of this piece is about.

1. No key ever touches the browser

The first rule of a public AI endpoint is the oldest rule in security: the browser is hostile territory. Anything you ship to it — a key, a gateway URL, a system prompt — is public.

So the chat you see is a tiny Astro island (a Preact component, the only interactive JavaScript on an otherwise static, zero-JS site). It knows exactly one thing: the address of our backend-for-frontend, a small hardened proxy we call the BFF. It streams your message there and renders what comes back. It never sees a model key, the gateway, the system prompt, or the knowledge base.

The BFF is where the real work happens, and it’s the only dynamic, secret-holding component in the whole system. Everything else is static files on a CDN. That’s deliberate: a static site behind a CDN is nearly impossible to attack, so we concentrate the entire attack surface into one small, heavily-guarded service and treat it as the crown jewels.

2. The model stays home — reached through an outbound-only tunnel

Here’s the part that surprises people: the language model isn’t in the cloud. It runs on our own infrastructure — the open-source ai-homelab platform we published — on hardware we own.

That raises an obvious question. If the model is at home, how does a public website reach it without exposing your home network?

The answer is a Cloudflare Tunnel. Instead of opening a port and inviting the internet in, a lightweight client on the platform makes an outbound connection to Cloudflare, and Cloudflare routes the chat traffic back down that pipe. The result:

Zero inbound ports. Nothing on our network listens to the public internet.
The origin IP is never exposed. Visitors hit Cloudflare; Cloudflare reaches us over the tunnel.
DDoS, WAF, and TLS are handled at the edge, before a request ever gets near us.

This is the quiet superpower of the design. The “our site’s AI runs on our own platform” story usually comes with a scary asterisk about exposing your network. The tunnel removes the asterisk. The model is home; the front door is Cloudflare’s.

3. One door for every model call — with a scoped key

Inside the platform, the BFF never talks to a model directly. Every call goes through a single gateway (LiteLLM), and it carries a scoped virtual key.

That key is not the master key. It’s a bleenq-web key, scoped to exactly two things: the chat model and the embedding model. Nothing else. Try to call a different model with it and the gateway says no. If the public endpoint were ever abused, that key is revocable in one place — and revoking it kills the public bot instantly without touching anything else on the platform.

This is the same per-application, scoped-key governance we apply to every project on the platform, and it’s the difference between “we have an API key in an env file” and a system where access, cost, and tracing are properties of the architecture rather than a policy nobody enforces. Every call is traced, every token is metered, and the public bot can only ever reach the two models it’s allowed to.

4. A dedicated, public-only knowledge base

A general-purpose LLM will happily make things up about your company. We don’t let it. The bot answers from a dedicated knowledge base that contains only public-safe content — our published articles, the company and profile pages, and the public docs of our open-source work. There is a hard wall between this KB and anything internal. The bot can reach this one collection and nothing else.

Two details matter here.

It’s grounded. Before the model answers, the question is embedded (with BGE-M3, through the same gateway) and used to retrieve the most relevant chunks from the KB. The model answers from that context and cites its sources. If the KB doesn’t cover something, the bot says so instead of inventing — and points you to a human.

It stays current from two places. The KB is rebuilt from both the site content (every article, profile, and project on this site) and the public source repository (the open-source platform’s docs). Publish an article or push a doc, and a refresh re-embeds the new material so the bot knows about it. The assistant’s knowledge tracks what we actually ship.

5. Two ways to know things: semantic and structural

Most RAG chatbots can do one kind of retrieval: semantic similarity. Ask “what does Bleenq do” and a vector search over the text gives a good answer. Ask “what depends on the vector database” or “what’s reachable through the reverse proxy” and pure semantic search struggles, because that’s not a question about what a document says — it’s a question about how things connect.

So the KB has two halves over the same public sources:

Semantic search — a Qdrant vector store for “what” and “why” questions. This is the classic RAG path: meaning-based retrieval over articles, profiles, projects, and docs.
A structure graph — a Neo4j graph built from the open-source repository’s Docker and folder structure: which services exist, which networks they share, what depends on what, what’s exposed through the proxy. Built deterministically from the source, so it’s exact, not inferred.

The chat is an agent with both as tools, and it routes per question. “How does the platform handle secrets?” goes to semantic search. “What’s behind the reverse proxy?” goes to the graph and comes back with the actual list of routed services. Ask something that spans both and it uses both. The model is the easy part; the routing and the two complementary indexes are where the answers get genuinely good.

6. Assume abuse

A public LLM endpoint will be probed, scraped, and cost-attacked. That’s not pessimism; it’s Tuesday. So the BFF assumes it:

CORS locked to this site, per-IP and global rate limits, and a daily token budget with an automatic kill switch — when the budget is spent, the bot rests until tomorrow rather than running up a bill.
Input caps (length, one message at a time, text only) and an output filter that refuses to echo the system prompt, environment, or anything key-shaped.
Prompt-injection posture: the system prompt is locked server-side and never reflected, your text is treated as untrusted data, the retrieved KB is the only “trusted” source, and attempts to override instructions or escape scope are declined. The rendered answer is escaped before it hits the page, behind a strict Content-Security-Policy.

None of this is glamorous. All of it is the difference between a demo and something you can leave running on the public internet.

What didn’t come for free

I’ll be honest, because the honest parts are the useful parts.

Wiring an agent that routes between two retrieval tools is fiddlier than a single-shot RAG call — tool definitions, a model that actually supports tool-calling, and getting the structure-graph queries to survive real-world phrasing all took iteration. Translating a casual question like “what’s behind the proxy” into a correct graph query needed examples and guardrails before it was reliable. And the usual operational gremlins showed up: a cached credential here, a response wrapped in the wrong format there. None of it was the model. All of it was the system around the model.

That’s the recurring lesson of three decades of building things that have to last: the interesting failures are never in the clever part. They’re in the plumbing. Which is exactly why the plumbing is where we spend our attention.

Why this works: the platform is the multiplier

Here’s what makes this buildable in afternoons instead of months. None of the hard pieces — the gateway, the scoped-key governance, the vector store, the graph database, the tracing, the tunnel — were built for this chatbot. They already existed, running, as part of the platform. The chatbot is an application of that foundation, not a new stack.

Stand up the boring layers once, properly, and every later project gets to assume them. A public, grounded, secure, graph-aware AI assistant stops being a quarter of work and becomes a focused build on top of infrastructure you already trust. That compounding is the entire reason to invest in the foundation first — and it’s the same reason our clients’ systems get cheaper to extend over time, not more expensive.

Takeaways you can use

Never put a model key in the browser. Put a small backend-for-frontend in front of the gateway and let the browser talk only to that.
You don’t have to expose your network to host AI on it. An outbound-only tunnel gives you a public front door with zero inbound ports and a hidden origin.
Scope the key. One revocable, per-application key scoped to exactly the models it needs turns access and cost control into a property of the system.
Ground every answer and cite sources — and keep the public KB strictly public, with a hard wall to anything internal.
Add a structure graph when “what connects to what” matters. Vector search answers “what does this say”; a graph answers “what depends on this.” Route between them.
Assume abuse from day one. Rate limits, a hard budget, a kill switch, and output guards aren’t optional on a public endpoint.

The stack, at a glance

Layer	What we used
Front end	Astro static site, one Preact island (the chat)
Edge	Cloudflare (CDN, WAF, DDoS, TLS) + an outbound Tunnel
Proxy	A hardened backend-for-frontend (rate/cost/injection guards)
Gateway	LiteLLM, with a scoped `bleenq-web` virtual key + tracing
Retrieval	Qdrant (semantic) + Neo4j (structure graph), routed by an agent
Embeddings	BGE-M3, via the gateway
Knowledge base	Public-only, rebuilt from site content and the source repo

If you’re building a public AI feature and wrestling with the parts that aren’t the model — the security boundary, the grounding, the cost controls, the “how do we host this without exposing ourselves” question — that’s exactly the work we do. Talk to us. Or, of course, ask the assistant on this page how it works. It runs on the platform we’d build for you.