Reference · Models

The models

Studio is a thin, readable pipeline wrapped around six models. Each one runs locally - most as a Docker container Studio talks to over HTTP on loopback - and each is swappable. This page is the map: what each model is, where it runs, where its weights sit on disk, and how to point Studio at a different one.

The roster

What gets loaded

ModelJobWhere it runsDefault
OmniVoiceCloned character voices (TTS)container omnivoice · :3900k2-fsa OmniVoice (zero-shot, 600+ langs)
Piper HALRobot / synthetic-AI voicescontainer piper · :5050Piper (fast neural TTS)
ComfyUIAnimated video masterscontainer comfyui · :8188zeroscope_v2_576w (text-to-video)
RIFEFrame interpolation (24→72f)rife-ncnn-vulkan (Vulkan binary)rife-v4.6
faster-whisperSubtitle transcription / timingCPU venv (.whisper-venv)large-v3 (int8)
OllamaShot lists, card copy, SFX, dubscontainer ollama · :11434qwen2.5:7b-instruct-q4_K_M
Consumer-lifecycle, not always-on.
The heavy GPU services (OmniVoice, ComfyUI, Ollama) are started when a stage needs them and stopped afterward so they don’t all hold VRAM at once on an 11 GB card. Studio drives the docker start/stopfor you; you don’t manage it by hand.
Stage 1

Voices - OmniVoice & Piper

Dialogue is routed per speaker. Cloned human-ish voices go to OmniVoice; robot and synthetic-AI characters go to Piper(“HAL”). The mapping lives in the manifest’s voice.speaker_map, and you set it from the Audio tab’s Voices panel.

OmniVoice

A zero-shot multilingual TTS - clone a voice from a few clean seconds of reference audio, then synthesize any text in it. Studio runs it as the omnivoice container on :3900 (the OmniVoice-Studio server image, which wraps the k2-fsa OmniVoice model). Cloned profiles and the model cache live under /mnt/storage/omnivoice/. You clone new voices from the Audio tab’s Create voice button. OmniVoice’s instruct and speedparameters are covered on the Prompts page.

Piper HAL

A small, very fast neural TTS used for the show’s machine voices. Studio runs it as the piper container on :5050. Upstream: rhasspy/piper.

Stage 2

Video - ComfyUI & zeroscope

The moving pictures come from a text-to-video diffusion model driven through ComfyUI (docs.comfy.org), run as the comfyuicontainer on :8188. The default checkpoint is zeroscope_v2_576w- a watermark-free drop-in for the original ModelScope text-to-video weights. Studio renders each unique character pose and b-roll once as a short animated master (24 frames, square), at the steps/cfg set in the manifest’s comfyui block.

  • Weights live in ComfyUI’s models tree at /mnt/storage/comfyui/models/; the zeroscope graph uses text2video_pytorch_model.pth + open_clip_pytorch_model.bin.
  • Renders land in /mnt/storage/comfyui/output/macu/, where the render service picks them up.
Stage 3

Interpolation - RIFE

The 24-frame masters are tripled to 72 frames with rife-ncnn-vulkan, a single-binary Vulkan build of RIFE (model rife-v4.6). No CUDA wheels needed - it runs on the same GPU through Vulkan. This is what makes the cheap, short diffusion clips read as smooth motion.

Stage 6

Transcription - faster-whisper

Subtitles are timed by faster-whisper running the large-v3 model on CPU (int8) - it produces word-level timings, which stage 7 aligns against your actual manifest text so the captions read exactly what you wrote, timed to the audio. It runs in its own Python venv (.whisper-venv) so it doesn’t fight the GPU services for VRAM.

The assistant in the box

The local LLM - Ollama + Qwen2.5

The “generate” buttons - Generate shot list, the title-card copy writer, the sound-effect spotter, and the 48-language Localize translator - are all backed by a local large language model: Qwen2.5 7B (qwen2.5:7b-instruct-q4_K_M) served by Ollama on :11434. It uses structured (JSON-schema-constrained) outputs so the results drop straight into the manifest. Like the other heavy services it’s started on demand and stopped when idle.

Translation is local and included.
The Localize dub and translated subtitles use this same local LLM (with a per-show glossary and a length budget) - there’s no cloud translation service and no API key. An offline argos-translate path is available as a fallback.
Make it yours

Swap / bring your own model

Most of the model choices are configuration, not code. Here’s the actual surface:

Video checkpoint & workflow

The manifest’s comfyui block holds checkpoint and workflow, plus frames/width/height/steps/cfg. Point checkpointat any model you’ve dropped into ComfyUI’s models dir. Caveat: the stage-2 graph is currently hardcoded in pipeline/stage_2_masters.py for the zeroscope/ModelScope node layout, so a checkpoint that needs a different graph means editing that file (and the matching builder in the comfyui-mcp workflows.js) - not just changing the manifest string.

Subtitle transcription size

Trade accuracy for speed by changing the whisper model in pipeline/stage_6_whisper.py - large-v3 down to medium/small/base/tiny.

The LLM

Pull any Ollama model that supports structured output and set it as the default in two places (keep them in sync):

ollama pull <your-model>

# studio/backend/macu_studio/llm.py   ->  DEFAULT_MODEL = "<your-model>"
# pipeline/llm_ollama.py              ->  same model id

RIFE model & subtitle font

The interpolation model is the -m rife-v4.6 flag in pipeline/stage_3_rife.py. The subtitle font is pure manifest: subtitles.font, subtitles.fontsdir, subtitles.fontsize, and the libass subtitles.force_style string.

Point at a remote / different service

Every service URL is an environment override in .env, so you can run a service on another box or a different port without touching code:

MACU_COMFY_URL=http://127.0.0.1:8188
MACU_PIPER_URL=http://127.0.0.1:5050
MACU_OMNIVOICE_URL=http://127.0.0.1:3900
# Ollama: http://127.0.0.1:11434

The complete, commented list of environment knobs is in the repo’s .env.example.