Running Transformers.js in a Chrome Extension: What I Learned Building with Gemma 4

I’ve been playing with Transformers.js in browser extensions lately, and it’s a fascinating space. The Hugging Face team recently shipped a demo extension powered by Gemma 4 E2B that helps users navigate the web with local AI. I dug into the code and built something similar, and there are a few things worth talking about.

The architecture that actually works

Manifest V3 makes things harder than they need to be. Chrome can shut down an idle service worker and restart it on the next event, which is a pain when you’ve got a 2GB model loaded. But the architecture in this project handles it well.

Three entry points in manifest.json:

Background service worker (the brains)
Side panel (the chat UI)
Content script (page interaction)
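
To make that concrete, here is roughly what the three entry points could look like. This is my own sketch, not the project’s manifest: I’ve written it as a TypeScript object that a build step could serialize to manifest.json, and the extension name, file names, and permission list are placeholders I made up.

```typescript
// Hypothetical manifest, written as a TypeScript object so a build step can
// serialize it to manifest.json. Name, file names, and permissions are
// placeholders, not the project's actual values.
const manifest = {
  manifest_version: 3,
  name: "Local AI Assistant (example)",
  version: "0.1.0",
  // Background service worker: the only place models get loaded.
  background: { service_worker: "background.js", type: "module" },
  // Side panel: the chat UI.
  side_panel: { default_path: "sidepanel.html" },
  // Content script: page extraction and highlighting.
  content_scripts: [{ matches: ["<all_urls>"], js: ["content.js"] }],
  permissions: ["sidePanel"],
} as const;

// The string a build step would write out as manifest.json.
const manifestJson: string = JSON.stringify(manifest, null, 2);
```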

The key insight: keep all the heavy lifting in the background. Don’t load models in the side panel or content script. You’ll thank me later when Chrome doesn’t eat your memory.

Messaging is everything

Once you split things across runtimes, you need a solid messaging contract. The project uses typed enums for all communication, which is the right call. I’ve seen too many extensions break because someone sent a string and the receiver expected a different format.

Side panel talks to background for model initialization and text generation. Background talks to content script for page extraction and highlighting. Simple, predictable, and debuggable.
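
Here is a minimal sketch of what a typed contract for the side panel to background direction might look like. INITIALIZE_MODELS is a task name the project uses; GENERATE_TEXT, the payload shape, and the helper names are my own assumptions for illustration. The background to content script direction would follow the same pattern.

```typescript
// Task names for side panel -> background messages.
// GENERATE_TEXT and its payload are illustrative assumptions.
enum BackgroundTask {
  InitializeModels = "INITIALIZE_MODELS",
  GenerateText = "GENERATE_TEXT",
}

// Discriminated union: each task carries exactly the payload it needs.
type BackgroundMessage =
  | { type: BackgroundTask.InitializeModels }
  | { type: BackgroundTask.GenerateText; prompt: string };

// Messages cross runtimes as untyped JSON, so validate before trusting them.
function isBackgroundMessage(value: unknown): value is BackgroundMessage {
  if (typeof value !== "object" || value === null) return false;
  const msg = value as { type?: unknown; prompt?: unknown };
  switch (msg.type) {
    case BackgroundTask.InitializeModels:
      return true;
    case BackgroundTask.GenerateText:
      return typeof msg.prompt === "string";
    default:
      return false;
  }
}

// Exhaustive routing: adding an enum member without a case fails to compile.
function describeTask(msg: BackgroundMessage): string {
  switch (msg.type) {
    case BackgroundTask.InitializeModels:
      return "initialize models";
    case BackgroundTask.GenerateText:
      return `generate text for: ${msg.prompt}`;
    default: {
      const unreachable: never = msg;
      return unreachable;
    }
  }
}
```

In the extension, the guard would sit at the top of the background’s message listener, so a malformed message gets rejected in one place instead of failing somewhere deep in the generation code.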

Where models actually run

All inference happens in the background service worker. This pays off in a few ways:

Single model host for all tabs and sessions
No duplicate memory usage
Models are cached under the extension origin, not per-website
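
The "single model host" idea boils down to a lazy, shared loader. Below is a minimal sketch of that pattern, not the project’s code: createModelHost and the fake loader are names I invented, and in a real extension the loader would wrap whatever Transformers.js call creates the model.

```typescript
// Lazy, shared loader: the first caller triggers the load, and later callers
// reuse the same in-flight or completed promise.
function createModelHost<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = load().catch((err: unknown) => {
        pending = null; // allow a retry after a failed load
        throw err;
      });
    }
    return pending;
  };
}

// Stand-in loader for illustration; it only counts how often it runs.
let loadCount = 0;
const getModel = createModelHost(async () => {
  loadCount += 1;
  return { name: "fake-model" };
});
```

Every tab and every side panel request goes through getModel(), so two requests arriving at the same time still produce a single load.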

The project uses two models: Gemma 4 for reasoning and tool decisions, and MiniLM for embeddings. The split makes sense: you don’t need a multi-billion-parameter LLM just to find similar text.
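
For a sense of how cheap the "find similar text" half is once you have embeddings, here is a sketch that ranks chunks by cosine similarity. I’m assuming cosine similarity as the metric, which is the usual choice for sentence embeddings; I haven’t confirmed that the project ranks this way, and the function and field names are mine.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as unrelated
}

interface Chunk {
  text: string;
  embedding: number[];
}

// Return chunks sorted from most to least similar to the query embedding.
function rankBySimilarity(query: number[], chunks: Chunk[]): Chunk[] {
  return [...chunks].sort(
    (left, right) =>
      cosineSimilarity(query, right.embedding) -
      cosineSimilarity(query, left.embedding),
  );
}
```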

The download dance

Model downloading is where things get real. The extension explicitly checks what’s cached, estimates remaining download size, and emits progress events. This is the kind of UX detail that separates a prototype from something people actually use.
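
The bookkeeping behind that is simple enough to sketch. Assume you have a list of model files with known sizes and a set of URLs already in the cache; both inputs, and every name below, are my assumptions rather than the project’s actual data structures.

```typescript
interface ModelFile {
  url: string;
  sizeBytes: number;
}

interface DownloadPlan {
  totalBytes: number;
  cachedBytes: number;
  remainingBytes: number;
}

// Work out how much is already cached and how much still has to download.
function planDownload(
  files: ModelFile[],
  cachedUrls: ReadonlySet<string>,
): DownloadPlan {
  let totalBytes = 0;
  let cachedBytes = 0;
  for (const file of files) {
    totalBytes += file.sizeBytes;
    if (cachedUrls.has(file.url)) cachedBytes += file.sizeBytes;
  }
  return { totalBytes, cachedBytes, remainingBytes: totalBytes - cachedBytes };
}

// Overall progress as a whole percentage, counting cached bytes as done.
function overallPercent(plan: DownloadPlan, downloadedBytes: number): number {
  if (plan.totalBytes === 0) return 100;
  const done = Math.min(plan.cachedBytes + downloadedBytes, plan.totalBytes);
  return Math.round((done / plan.totalBytes) * 100);
}
```

Counting cached bytes as already done means a user who resumes a half-finished download sees the bar start in the middle instead of at zero.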

One thing that caught me: MV3 service workers can be suspended, and a suspended worker loses whatever it had in memory, loaded models included. So model state needs to be recoverable. The code handles this through explicit CHECK_MODELS and INITIALIZE_MODELS tasks, which is clean but adds complexity.
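
Here is how I’d sketch that recovery logic. The two task names come from the project; everything else (the status values, the injected isCached and load functions, the createModelLifecycle name) is my own guess at a reasonable shape, not the project’s implementation.

```typescript
type ModelTask = "CHECK_MODELS" | "INITIALIZE_MODELS";
type ModelStatus = "not_downloaded" | "cached" | "ready";

// Injected dependencies (hypothetical): a cache lookup and a model load.
interface ModelDeps {
  isCached: () => Promise<boolean>;
  load: () => Promise<void>;
}

function createModelLifecycle(deps: ModelDeps) {
  // In-memory only: this flag resets whenever the worker restarts.
  let loaded = false;
  return {
    async handle(task: ModelTask): Promise<ModelStatus> {
      // INITIALIZE_MODELS is idempotent: it only loads when nothing is loaded.
      if (task === "INITIALIZE_MODELS" && !loaded) {
        await deps.load();
        loaded = true;
      }
      if (loaded) return "ready";
      // After a restart, fall back to what the persistent cache says.
      return (await deps.isCached()) ? "cached" : "not_downloaded";
    },
  };
}
```

The side panel can send CHECK_MODELS whenever it opens. If the worker was restarted in the meantime, the answer drops from "ready" to "cached", and the UI knows to send INITIALIZE_MODELS again instead of assuming the model is still in memory.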

What I’d do differently

If I were building this from scratch, I’d probably skip the side panel and go with a popup for simpler use cases. The side panel is nice for persistent chat, but adds a lot of UI code that might not be necessary.

Also, the conversation history lives in the background. This makes sense architecturally, but it means the UI is always fetching state from the background. For a chat app, that feels backwards. I’d rather keep messages in the side panel and sync to background when needed.

The bottom line

This architecture works. It’s not perfect, but it handles the MV3 constraints well and the code is clean enough to learn from. If you’re building a local AI extension, steal the messaging patterns and the model loading lifecycle. Just be prepared for the service worker lifecycle headache.

The source code is on GitHub if you want to poke around. I’ll probably write more about the content script side of things later – that’s where the interesting DOM interaction happens.