The models
Studio is a thin, readable pipeline wrapped around six models. Each one runs locally - most as a Docker container Studio talks to over HTTP on loopback - and each is swappable. This page is the map: what each model is, where it runs, where its weights sit on disk, and how to point Studio at a different one.
What gets loaded
| Model | Job | Where it runs | Default |
|---|---|---|---|
| OmniVoice | Cloned character voices (TTS) | container omnivoice · :3900 | k2-fsa OmniVoice (zero-shot, 600+ langs) |
| Piper HAL | Robot / synthetic-AI voices | container piper · :5050 | Piper (fast neural TTS) |
| ComfyUI | Animated video masters | container comfyui · :8188 | zeroscope_v2_576w (text-to-video) |
| RIFE | Frame interpolation (24→72f) | rife-ncnn-vulkan (Vulkan binary) | rife-v4.6 |
| faster-whisper | Subtitle transcription / timing | CPU venv (.whisper-venv) | large-v3 (int8) |
| Ollama | Shot lists, card copy, SFX, dubs | container ollama · :11434 | qwen2.5:7b-instruct-q4_K_M |
Voices - OmniVoice & Piper
Dialogue is routed per speaker. Cloned human-ish voices go to OmniVoice; robot and synthetic-AI characters go to Piper(“HAL”). The mapping lives in the manifest’s voice.speaker_map, and you set it from the Audio tab’s Voices panel.
OmniVoice
A zero-shot multilingual TTS - clone a voice from a few clean seconds of reference audio, then synthesize any text in it. Studio runs it as the omnivoice container on :3900 (the OmniVoice-Studio ↗ server image, which wraps the k2-fsa OmniVoice ↗ model). Cloned profiles and the model cache live under /mnt/storage/omnivoice/. You clone new voices from the Audio tab’s Create voice button. OmniVoice’s instruct and speedparameters are covered on the Prompts page.
Piper HAL
A small, very fast neural TTS used for the show’s machine voices. Studio runs it as the piper container on :5050. Upstream: rhasspy/piper ↗.
Video - ComfyUI & zeroscope
The moving pictures come from a text-to-video diffusion model driven through ComfyUI ↗ (docs.comfy.org ↗), run as the comfyuicontainer on :8188. The default checkpoint is zeroscope_v2_576w ↗- a watermark-free drop-in for the original ModelScope text-to-video weights. Studio renders each unique character pose and b-roll once as a short animated master (24 frames, square), at the steps/cfg set in the manifest’s comfyui block.
- Weights live in ComfyUI’s models tree at /mnt/storage/comfyui/models/; the zeroscope graph uses text2video_pytorch_model.pth + open_clip_pytorch_model.bin.
- Renders land in /mnt/storage/comfyui/output/macu/, where the render service picks them up.
Interpolation - RIFE
The 24-frame masters are tripled to 72 frames with rife-ncnn-vulkan ↗, a single-binary Vulkan build of RIFE (model rife-v4.6). No CUDA wheels needed - it runs on the same GPU through Vulkan. This is what makes the cheap, short diffusion clips read as smooth motion.
Transcription - faster-whisper
Subtitles are timed by faster-whisper ↗ running the large-v3 model on CPU (int8) - it produces word-level timings, which stage 7 aligns against your actual manifest text so the captions read exactly what you wrote, timed to the audio. It runs in its own Python venv (.whisper-venv) so it doesn’t fight the GPU services for VRAM.
The local LLM - Ollama + Qwen2.5
The “generate” buttons - Generate shot list, the title-card copy writer, the sound-effect spotter, and the 48-language Localize translator - are all backed by a local large language model: Qwen2.5 7B ↗ (qwen2.5:7b-instruct-q4_K_M) served by Ollama ↗ on :11434. It uses structured (JSON-schema-constrained) outputs so the results drop straight into the manifest. Like the other heavy services it’s started on demand and stopped when idle.
Swap / bring your own model
Most of the model choices are configuration, not code. Here’s the actual surface:
Video checkpoint & workflow
The manifest’s comfyui block holds checkpoint and workflow, plus frames/width/height/steps/cfg. Point checkpointat any model you’ve dropped into ComfyUI’s models dir. Caveat: the stage-2 graph is currently hardcoded in pipeline/stage_2_masters.py for the zeroscope/ModelScope node layout, so a checkpoint that needs a different graph means editing that file (and the matching builder in the comfyui-mcp workflows.js) - not just changing the manifest string.
Subtitle transcription size
Trade accuracy for speed by changing the whisper model in pipeline/stage_6_whisper.py - large-v3 down to medium/small/base/tiny.
The LLM
Pull any Ollama model that supports structured output and set it as the default in two places (keep them in sync):
ollama pull <your-model>
# studio/backend/macu_studio/llm.py -> DEFAULT_MODEL = "<your-model>"
# pipeline/llm_ollama.py -> same model idRIFE model & subtitle font
The interpolation model is the -m rife-v4.6 flag in pipeline/stage_3_rife.py. The subtitle font is pure manifest: subtitles.font, subtitles.fontsdir, subtitles.fontsize, and the libass subtitles.force_style string.
Point at a remote / different service
Every service URL is an environment override in .env, so you can run a service on another box or a different port without touching code:
MACU_COMFY_URL=http://127.0.0.1:8188
MACU_PIPER_URL=http://127.0.0.1:5050
MACU_OMNIVOICE_URL=http://127.0.0.1:3900
# Ollama: http://127.0.0.1:11434The complete, commented list of environment knobs is in the repo’s .env.example.