What if you could sketch a rough UI wireframe and watch AI write the React code in real-time?
That's what I set out to build with Mirage - a sketch-to-code engine powered by Vision AI.
The Problem
Every designer-to-developer handoff has the same friction: translating visual intent into code. Existing tools either require polished Figma designs or generate unusable boilerplate.
I wanted something different. Draw boxes and labels on a canvas, and get working React components back - instantly.
The First Approach: Local Vision Models
I started with Ollama running locally. The plan was simple: use a smaller vision model, keep everything on-device, avoid API costs.
Reality hit fast.
The output was... inconsistent. I'd draw a green border around a button, and the model would output border-blue-500. Layout relationships were misunderstood - elements that should be in a row would stack vertically. Text content would sometimes be hallucinated entirely.
The fundamental problem: smaller vision models (around 7B parameters) don't have enough capacity to handle visual understanding AND code generation at the same time.
The Migration to Cloud
I switched to Ollama Cloud with Qwen3-VL 235B - a 235 billion parameter vision-language model.
The difference was immediate. Colors matched. Layouts made sense. Text was preserved exactly as drawn.
But I learned something important: raw capability isn't enough. The model needed explicit guidance on how to analyze the image.
Chain-of-Thought Vision Prompts
I discovered that vision models need structured thinking steps before generating code. Just saying "convert this to React" produces garbage.
So I designed a 4-step analysis protocol that the model follows before writing any code:
Step 1: Element Inventory
- Count every rectangle, circle, line, and text element
- Identify shapes: filled vs outlined, solid vs dashed
Step 2: Spatial Layout
- Is it a single column, row, or grid?
- How are elements positioned relative to each other?
- Are any elements overlapping?
Step 3: Color Extraction
- Map observed colors to exact Tailwind classes
- "Light/mint green borders → border-green-400"
- "Pink/coral fill → bg-pink-400 or bg-rose-400"
Step 4: Code Generation
- Only now write the JSX
- Preserve the exact layout and colors identified above
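The four steps above can be encoded directly into the system prompt. The exact wording Mirage uses isn't reproduced in this post, so this is a minimal sketch; `buildVisionPrompt` and the protocol text are illustrative, but the structure mirrors the protocol:

```typescript
// Sketch of a chain-of-thought vision prompt. The wording is illustrative;
// the four-step structure is the point.
const ANALYSIS_PROTOCOL = `
Before writing any code, analyze the sketch in four steps:

1. ELEMENT INVENTORY
   Count every rectangle, circle, line, and text element.
   Note whether shapes are filled or outlined, solid or dashed.

2. SPATIAL LAYOUT
   Decide: single column, row, or grid?
   Describe each element's position relative to its neighbors.
   Flag any overlapping elements.

3. COLOR EXTRACTION
   Map each observed color to an exact Tailwind class,
   e.g. light/mint green border -> border-green-400.

4. CODE GENERATION
   Only now write the JSX, preserving the exact layout
   and colors identified in steps 1-3.
`.trim();

// Hypothetical helper: prepend the protocol to the user's request so the
// model must walk the analysis before emitting JSX.
function buildVisionPrompt(userRequest: string): string {
  return `${ANALYSIS_PROTOCOL}\n\nTask: ${userRequest}`;
}
```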
This structured approach eliminated most hallucinations. The model knows what it's looking at before deciding how to code it.
The Color Consistency Problem
Even with chain-of-thought prompts, colors were inconsistent. I'd draw a mint green border, and sometimes get border-green-400, sometimes border-teal-400, sometimes border-emerald-300.
The fix was a color vocabulary in the system prompt - explicit mappings from observed colors to specific Tailwind classes:
- Light/mint green borders → border-green-400
- Dark green → border-green-600 or bg-green-600
- Pink/coral fill → bg-pink-400 or bg-rose-400
- Red → bg-red-500
- Blue → bg-blue-500
This grounding gave the model a consistent vocabulary to work with.
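One way to keep that vocabulary maintainable is a lookup table rendered into the system prompt. The mappings below come from the list above; the data structure and the `colorVocabularySection` helper are assumptions, not Mirage's actual code:

```typescript
// Color vocabulary as data: observed-color description -> Tailwind class.
// Entries mirror the mappings in the post; the structure is a sketch.
const COLOR_VOCABULARY: Record<string, string> = {
  "light/mint green border": "border-green-400",
  "dark green border": "border-green-600",
  "dark green fill": "bg-green-600",
  "pink/coral fill": "bg-pink-400", // or bg-rose-400
  "red fill": "bg-red-500",
  "blue fill": "bg-blue-500",
};

// Hypothetical helper: render the table into the system prompt so every
// generation sees the same color -> class grounding.
function colorVocabularySection(): string {
  const lines = Object.entries(COLOR_VOCABULARY).map(
    ([description, cls]) => `- ${description} -> ${cls}`,
  );
  return ["COLOR VOCABULARY (use these exact classes):", ...lines].join("\n");
}
```

Keeping the vocabulary as data means new color mappings are one-line additions instead of prompt surgery.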
The Temperature Setting
I experimented with generation temperature extensively:
- 0.7: Creative variations, but layouts would drift
- 0.3: More consistent, occasional creativity issues
- 0.05: Nearly deterministic, maximum consistency
I settled on 0.05. For code generation, you don't want creativity - you want accuracy. The model should reproduce exactly what it sees, not interpret it artistically.
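In practice the temperature is just one field in the generation request. This sketch follows the shape of Ollama's `/api/chat` endpoint (model name, message `images` field, and `options.temperature`); the exact request Mirage sends is an assumption:

```typescript
// Sketch of the chat request with a near-deterministic temperature.
// Request shape follows Ollama's /api/chat; the model tag is assumed.
function buildChatRequest(sketchPngBase64: string) {
  return {
    model: "qwen3-vl:235b",
    stream: false,
    messages: [
      {
        role: "user",
        content: "Convert this sketch to React.",
        images: [sketchPngBase64], // sketch attached as base64 PNG
      },
    ],
    // 0.05: reproduce what the model sees, don't interpret it artistically.
    options: { temperature: 0.05 },
  };
}
```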
WebContainers for Live Preview
Generated code is useless if you can't see it working. I integrated WebContainers to run a complete Vite dev server inside the browser.
The flow: Draw → Generate → Preview - all without leaving the page.
The preview updates as soon as generation completes. You see exactly what the AI produced, running as real React code, not just syntax-highlighted text.
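Under the hood, WebContainers boots a Node environment in the browser, mounts a file tree, and runs Vite. The mount-tree format below follows `@webcontainer/api`'s `FileSystemTree`; the file names, dependency versions, and `startPreview` wiring are illustrative assumptions:

```typescript
// In the real app this comes from `import { WebContainer } from "@webcontainer/api"`;
// declared here so the sketch stands alone outside the browser.
declare const WebContainer: { boot(): Promise<any> };

// Build the in-browser project: a minimal Vite + React app with the
// generated JSX dropped in as the root component.
function buildFileTree(generatedJsx: string) {
  return {
    "package.json": {
      file: {
        contents: JSON.stringify({
          name: "mirage-preview",
          scripts: { dev: "vite" },
          dependencies: { react: "^18.0.0", "react-dom": "^18.0.0" },
          devDependencies: { vite: "^5.0.0", "@vitejs/plugin-react": "^4.0.0" },
        }),
      },
    },
    src: {
      directory: {
        "App.jsx": { file: { contents: generatedJsx } },
      },
    },
  };
}

// Hypothetical glue: mount, install, start Vite, and hand the preview URL
// to the UI (e.g. an iframe src) once the dev server is up.
async function startPreview(generatedJsx: string, onReady: (url: string) => void) {
  const container = await WebContainer.boot();
  await container.mount(buildFileTree(generatedJsx));
  const install = await container.spawn("npm", ["install"]);
  await install.exit;
  container.on("server-ready", (_port: number, url: string) => onReady(url));
  await container.spawn("npm", ["run", "dev"]);
}
```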
Protecting the API
Qwen3-VL 235B isn't cheap to run. I needed to prevent unauthorized usage.
I implemented an access code system: users enter a passcode that's verified server-side before any AI call happens. The actual API key never leaves the server. Invalid codes get a 401 response.
This pattern keeps the API secure without requiring user accounts or complex authentication flows.
What I Learned
Temperature matters more than you'd think. For code generation, lower is better. Any creativity from the model translates to inconsistency in output.
Vision models aren't magic. They need the same structured thinking we use. Chain-of-thought isn't just for text reasoning - it's essential for visual understanding.
235B > 7B for spatial reasoning. Some tasks genuinely require larger models. Vision-to-code is one of them. The gap between "kinda works" and "actually usable" required 30x more parameters.
What I'd Do Differently
The current flow is: sketch → generate → done. But what if you could sketch, generate, see the result, and sketch on top of that result for refinements?
I'd also implement streaming code preview - showing partial JSX as the model generates, rather than waiting for the complete output. Users could watch their UI take shape in real-time.
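A streaming preview could be built on the same chat endpoint with `stream: true`, since Ollama streams one JSON object per line. This is a sketch of that idea, not shipped code; the `onPartial` callback is hypothetical:

```typescript
// Read a streaming NDJSON response and surface the accumulated JSX after
// each chunk, so the UI can render partial output as it arrives.
async function streamGeneration(
  res: Response,
  onPartial: (jsx: string) => void,
): Promise<string> {
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let jsx = "";
  let buffered = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done || !value) break;
    buffered += decoder.decode(value, { stream: true });
    const lines = buffered.split("\n");
    buffered = lines.pop()!; // keep any incomplete trailing line
    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk = JSON.parse(line);
      jsx += chunk.message?.content ?? "";
      onPartial(jsx); // caller re-renders the partial JSX here
    }
  }
  return jsx;
}
```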
Try It
- Live Demo: mirage.sreekarreddy.com
- Project Details: /portfolio/projects/mirage
- GitHub: View Source