I built a development environment with an AI coding assistant - and it works without internet.
Real terminal. Real file system. Real code execution. AI-powered help. All running in your browser.
The Problem
Nearly every AI coding tool requires API keys and a cloud connection. Claude needs Anthropic's servers. GitHub Copilot needs Microsoft's. Even local tools like Ollama require installation and setup outside the browser.
I wanted to prove that meaningful AI-assisted development could happen entirely in the browser - download once, use forever.
The Stack I'd Never Used
This project required five technologies I'd never worked with before:
XTerm.js: Terminal emulation in the browser. It renders ANSI escape codes, handles cursor positioning, and looks like a real terminal. But connecting it to an actual execution environment was non-trivial.
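As a rough sketch of the plumbing involved (these are xterm.js API calls; the echo handler is a placeholder standing in for a real backend):

```typescript
import { Terminal } from 'xterm';

// Create a terminal and attach it to a DOM node.
const term = new Terminal({ convertEol: true });
term.open(document.getElementById('terminal')!);

// onData fires for every keystroke; a real backend (a shell, a
// WebContainer process) would consume this stream. Here we just echo.
term.onData((data) => term.write(data));

// Anything passed to term.write() renders with full ANSI support.
term.write('\x1b[32mready\x1b[0m $ ');
```

The hard part isn't either end; it's everything between `onData` and `write`.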
Monaco Editor: The same editor that powers VS Code, extracted as a standalone component. It has IntelliSense, syntax highlighting, and keyboard shortcuts built in. But configuring it for a browser context required understanding its worker-based architecture.
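The worker wiring looks roughly like this (the standard pattern for bundlers with ESM worker support; the paths are the ones monaco-editor ships, but your bundler setup may differ):

```typescript
// Tell Monaco how to spawn its language workers. Without this,
// features like IntelliSense silently degrade in a bundled app.
self.MonacoEnvironment = {
  getWorker(_workerId: string, label: string): Worker {
    if (label === 'typescript' || label === 'javascript') {
      return new Worker(
        new URL('monaco-editor/esm/vs/language/typescript/ts.worker', import.meta.url),
        { type: 'module' },
      );
    }
    // The generic editor worker handles everything else.
    return new Worker(
      new URL('monaco-editor/esm/vs/editor/editor.worker', import.meta.url),
      { type: 'module' },
    );
  },
};
```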
WebContainers: This is the magic. It runs a real Node.js 18 runtime in your browser using WebAssembly. Not a simulation - actual Node execution. npm install works. package.json resolves. Scripts run.
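The core API is small. A minimal boot-and-run sketch using @webcontainer/api (the file contents here are just an example):

```typescript
import { WebContainer } from '@webcontainer/api';

// Boot the in-browser Node runtime (one instance per page).
const container = await WebContainer.boot();

// Mount an in-memory file tree.
await container.mount({
  'index.js': {
    file: { contents: 'console.log("hello from Node in the browser")' },
  },
});

// Spawn a real Node process and await its exit code.
const proc = await container.spawn('node', ['index.js']);
const exitCode = await proc.exit;
```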
WebLLM: On-device AI inference using WebGPU. It downloads model weights once and runs inference entirely locally via GPU acceleration.
IndexedDB: Persistent browser storage for both file systems and model weights.
Each technology has its own quirks. Getting them to work together took significant iteration.
The AI: Phi-3 Running Locally
I chose Microsoft's Phi-3-mini-4k-instruct model - 3.8 billion parameters, quantized to 4-bit.
Why Phi-3?
- Small enough to download (1.8GB) but capable enough to be useful
- 4-bit quantization cuts size 4x with minimal quality loss
- Trained for instruction following, which is what coding assistance needs
The WebLLM library handles the heavy lifting. It downloads model weights, caches them in IndexedDB, and runs inference via WebGPU.
The 1.8GB Problem
1.8GB is a lot to download. On slow connections, users stare at a blank screen for minutes.
I implemented detailed progress callbacks:
```typescript
initProgressCallback: (report) => {
  this.setState({
    status: "loading",
    progress: report.progress * 100,
    text: report.text,
  });
},
```
Users see: "Downloading model weights... 47%"
At least they know something's happening.
The real fix is caching. After that first download, the model persists in IndexedDB. Return visits load from local storage - nearly instant.
Web Workers for Responsiveness
Running a 3.8B-parameter model on the main thread would freeze the UI completely - inference takes 2-3 seconds for a decent response.
I moved the entire inference engine to a dedicated Web Worker:
```typescript
this.engine = await CreateWebWorkerMLCEngine(
  new Worker(new URL('./worker.ts', import.meta.url), { type: 'module' }),
  'Phi-3-mini-4k-instruct-q4f16_1-MLC',
  { initProgressCallback: ... },
);
```
The main thread stays responsive for editing and navigation. AI generates in the background. When it's done, the response appears.
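On the main-thread side, requests go through WebLLM's OpenAI-compatible chat API. A sketch of streaming a response into the UI (appendToPanel is a hypothetical UI helper, not part of the library):

```typescript
const stream = await this.engine.chat.completions.create({
  messages: [
    { role: 'system', content: 'You are a concise coding assistant.' },
    { role: 'user', content: 'Why is my fetch call returning undefined?' },
  ],
  stream: true,
});

// Tokens arrive incrementally; the panel updates as they do.
for await (const chunk of stream) {
  appendToPanel(chunk.choices[0]?.delta?.content ?? '');
}
```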
Without this architecture, every AI request would lock the browser.
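The worker side is nearly boilerplate - WebLLM ships a handler that proxies messages from the main-thread engine. This is the documented pattern, sketched from memory:

```typescript
// worker.ts - runs the actual model, off the main thread.
import { WebWorkerMLCEngineHandler } from '@mlc-ai/web-llm';

const handler = new WebWorkerMLCEngineHandler();

// Forward every message from the main thread to the engine handler.
self.onmessage = (msg: MessageEvent) => {
  handler.onmessage(msg);
};
```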
WebContainers: Real Node in the Browser
WebContainers is the technology that makes this feel like a real development environment instead of a code playground.
When you write a file, it exists in an in-memory file system. When you run node index.js, actual Node.js executes. When you run npm install express, npm downloads and installs the package.
The file system syncs to IndexedDB so your work persists across sessions. Close the tab, come back tomorrow, your files are there.
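One direction of that sync can be sketched with plain IndexedDB. persistFile here is a hypothetical helper, not the project's actual code; container.fs.readFile is the real WebContainer file-system API:

```typescript
import type { WebContainer } from '@webcontainer/api';

// Hypothetical helper: mirror one WebContainer file into IndexedDB.
async function persistFile(container: WebContainer, path: string): Promise<void> {
  const contents = await container.fs.readFile(path, 'utf-8');

  // Open (or create) the database and its object store.
  const db = await new Promise<IDBDatabase>((resolve, reject) => {
    const req = indexedDB.open('workspace', 1);
    req.onupgradeneeded = () => req.result.createObjectStore('files');
    req.onsuccess = () => resolve(req.result);
    req.onerror = () => reject(req.error);
  });

  // Write the file contents keyed by path.
  await new Promise<void>((resolve, reject) => {
    const tx = db.transaction('files', 'readwrite');
    tx.objectStore('files').put(contents, path);
    tx.oncomplete = () => resolve();
    tx.onerror = () => reject(tx.error);
  });
}
```

The reverse direction - rehydrating the container on boot - reads the same store and mounts the results.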
This isn't emulation. It's WebAssembly-compiled Node.js running in your browser's sandbox.
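Wiring a spawned process to the terminal is where the pieces meet. A sketch assuming an xterm.js Terminal instance named term and a booted WebContainer named container:

```typescript
// Run npm install inside the container and stream its output to xterm.
const install = await container.spawn('npm', ['install', 'express']);

install.output.pipeTo(
  new WritableStream({
    write(data) {
      term.write(data);  // ANSI colors and progress bars render as-is
    },
  }),
);

if ((await install.exit) !== 0) {
  term.write('\r\nnpm install failed\r\n');
}
```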
Giving Users Control
Power users kept asking: "How do I free up that 1.8GB?"
So I added an explicit cache deletion feature:
```typescript
async deleteCache() {
  if (this.engine) {
    await this.engine.unload();
  }
  if ('caches' in window) {
    const keys = await caches.keys();
    for (const key of keys) {
      if (key.includes('webllm')) {
        await caches.delete(key);
      }
    }
  }
  this.engine = null;
  this.setState({ status: "idle", progress: 0, text: "" });
}
```
One click unloads the engine and clears the cached model weights from browser storage. Users can reclaim the space and re-download when needed.
This matters for trust. People are more willing to download 1.8GB when they know they can remove it.
What I Learned
WebGPU is production-ready. I was skeptical that in-browser AI could be useful. It absolutely can. Phi-3 generates helpful code suggestions in 2-3 seconds - comparable to cloud APIs.
4-bit quantization works. The quality loss from 16-bit to 4-bit is minimal for coding tasks. The 4x size reduction is worth it.
Trust comes from control. Users accept large downloads when they understand what's happening and can undo it. Progress indicators and delete buttons build confidence.
WebContainers is underrated. Most developers don't know you can run actual Node.js in a browser. It opens possibilities for sandboxed development environments, tutorials, and collaborative coding.
What I'd Do Differently
1.8GB is still too much for casual users. I'd explore smaller models - Phi-2 at 2.7B parameters might be fast enough for most coding assistance while cutting download time.
I'd also add model selection. Power users could choose between fast/small and slow/capable models based on their needs.
And I'd implement prompt caching. Repeated context (like the system prompt) could be cached to speed up inference.
Try It
- Live Demo: terminal.sreekarreddy.com
- Project Details: /portfolio/projects/sr-terminal
- GitHub: View Source