Generate standalone music tracks (songs, instrumentals, cues) at maximum audio fidelity using ACE Step 1.5 XL on the Workstation host. USE THIS SKILL when the user asks to make a music track, song, instrumental, beat, album cut, demo, jingle, score cue, or any standalone audio deliverable where music quality is the primary concern (not a background bed for a radio drama or video). Triggers on: "make a song", "generate music", "create a track", "music in the style of X", "instrumental", "lofi / jazz / classical / ambient / rock / etc. track", "album", "demo", "jingle", "music cue", "original song". This is DISTINCT from the radio-drama-production skill, which uses a faster "good enough" music pipeline optimized for dialogue beds. This skill uses the full APG + SamplerCustomAdvanced chain with the 20GB fp32 XL base model for the highest possible audio integrity: lossless FLAC, 48 kHz stereo output.
Install
npx skillscat add mechanized-scapegrace221/aeon-music-maker
Install via the SkillsCat registry.
Music Producer — ACE Step 1.5 XL, maximum fidelity
Generate standalone music tracks at the highest audio quality the system is capable of. For radio-drama music beds (fast, "good enough"), use the radio-drama-production skill instead; this one exists for tracks where the audio is the deliverable.
0. Target host + tool
- Host: `${SSH_USER}@127.0.0.1` (Workstation: RTX 5090, 64 GB RAM, Win 11 + OpenSSH)
- Tool: `${COMFYUI_ROOT}\music_tool\music_maker.py`
- Templates: `music_tool\templates\ace_step_music_apg_api.json` (APG chain) + `ace_step_music_simple_api.json` (simple KSampler)
- ComfyUI endpoint on Workstation: `http://127.0.0.1:8188`
1. Why a dedicated tool
scene_production_tool/radio_drama.py uses a simple KSampler template tuned for turbo variants — fast, clean enough to sit under dialogue, but the ceiling is the xl_base_sft merged model at CFG 3. The APG-requiring base models (xl_base fp32, xl_sft bf16) distort audibly under that template because ACE Step's full base models need SamplerCustomAdvanced + APG + CFGGuider to avoid artifacts (per NerdyRodent's v35 reference workflow and Stability's training notes).
music_maker.py here uses the proper APG chain for xl_base / xl_sft, producing clean output at true base-model quality. It also defaults to lossless FLAC output (48 kHz stereo), unlike the radio-drama pipeline which writes MP3 V0.
2. Variants — pick by quality/speed tradeoff
| Variant | UNet | Chain | Steps | CFG | Time (per 90 s) | Best for |
|---|---|---|---|---|---|---|
| xl_base (default) | acestep_v1.5_xl_base.safetensors (19.95 GB fp32) | APG | 50 | 7.0 | ~21 s | Album masters, standalone songs, hero cues |
| xl_sft | acestep_v1.5_xl_sft_bf16.safetensors | APG | 45 | 6.0 | ~18 s | Near-base quality, faster, bf16 |
| xl_base_sft | acestep_v1.5_xl_merge_base_sft_ta_0.5.safetensors | simple KSampler | 35 | 3.0 | ~21 s | Balance (shared default with radio-drama) |
| xl_turbo | acestep_v1.5_xl_turbo_bf16.safetensors | simple KSampler | 10 | 1.0 | ~12 s | Preview iterations, fast A/B |
| base_turbo | acestep_v1.5_turbo.safetensors (4.8 GB) | simple KSampler | 8 | 1.0 | ~8 s | Smallest/fastest, lowest quality |
APG variants use SamplerCustomAdvanced with:
- `APG(eta=0.7, norm_threshold=2.5, momentum=-0.75)` (v35 params)
- `CFGGuider(cfg=per-variant)`
- `KSamplerSelect("gradient_estimation")`
- `BasicScheduler("simple", steps, denoise=1.0)`
- `ModelSamplingAuraFlow(shift=3)`
- `RandomNoise(seed)`
Simple variants use a straight KSampler with euler / simple — works because those models are distilled (turbo) or merged (base+SFT).
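The variant table can be read as a lookup from variant name to sampler settings. The sketch below is only an illustration: `VARIANTS`, `TEMPLATES`, and `resolve_settings` are hypothetical names, not taken from music_maker.py, and the values are copied from the table and from the template names in section 0.

```python
# Hypothetical lookup table; values mirror the variant table above.
VARIANTS = {
    "xl_base":     {"chain": "apg",    "steps": 50, "cfg": 7.0},
    "xl_sft":      {"chain": "apg",    "steps": 45, "cfg": 6.0},
    "xl_base_sft": {"chain": "simple", "steps": 35, "cfg": 3.0},
    "xl_turbo":    {"chain": "simple", "steps": 10, "cfg": 1.0},
    "base_turbo":  {"chain": "simple", "steps": 8,  "cfg": 1.0},
}

# Template file names as listed in section 0.
TEMPLATES = {
    "apg": "ace_step_music_apg_api.json",
    "simple": "ace_step_music_simple_api.json",
}


def resolve_settings(variant="xl_base", steps=None, cfg=None):
    """Return template, steps, and CFG for a variant, honouring overrides."""
    preset = VARIANTS[variant]  # raises KeyError for an unknown variant
    return {
        "template": TEMPLATES[preset["chain"]],
        "steps": preset["steps"] if steps is None else steps,
        "cfg": preset["cfg"] if cfg is None else cfg,
    }


print(resolve_settings("xl_sft"))
print(resolve_settings("xl_base", cfg=8.0))
```

Keeping the chain type next to the step and CFG presets makes it harder to pair an APG-requiring model with the simple KSampler template by accident.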
3. Quick-start
Two ways to invoke from anywhere, then pull the result:
Direct SSH one-liner
ssh ${SSH_USER}@127.0.0.1 'cd ${COMFYUI_ROOT} && python music_tool\music_maker.py --prompt "lofi jazz, warm Rhodes, soft saxophone, brushed drums, vinyl crackle" --duration 180 --bpm 78 --key "A minor"'
From a sidecar script (recommended for longer tracks)
ssh ${SSH_USER}@127.0.0.1 'start /B python ${COMFYUI_ROOT}\music_tool\music_maker.py --prompt "..." --duration 240 --variant xl_base > ${USER_HOME}\music_maker_run.log 2>&1'
ssh ${SSH_USER}@127.0.0.1 'powershell -Command "Get-Content ${USER_HOME}\music_maker_run.log -Wait -Tail 10"'
Pull the result
scp ${SSH_USER}@127.0.0.1:${COMFYUI_ROOT}/output/music/lofi_jazz_*.flac .
4. Argument reference
python music_maker.py [options]
  --prompt STR            (required) comma-separated music descriptors
  --duration FLOAT        track length in seconds (default 120, max ~240)
  --bpm INT               tempo (default 75)
  --key STR               key/scale, e.g. "A minor", "C# major" (default "A minor")
  --lyrics STR_OR_PATH    literal lyrics OR path to .txt file (default empty = instrumental)
  --variant {xl_base|xl_sft|xl_base_sft|xl_turbo|base_turbo}   (default xl_base)
  --steps INT             override the variant's preset step count
  --cfg FLOAT             override the variant's preset CFG
  --seed INT              fixed seed for reproducibility
  --output / -o PATH      output file (.flac / .wav / .mp3); default is output/music/<slug>_<seed>.flac
  --master MODE           auto (default), off, or a preset name (see section 12)
  --target-lufs FLOAT     override the mastering preset's target loudness (see section 12)
  --keep-raw              also write the pre-master as <output>.raw.flac (see section 12)
5. Writing good prompts
ACE Step understands music the way image models understand art: the prompt is a cloud of descriptors, not a sentence. Pile on comma-separated tags across five categories:
Genre + subgenre
lofi jazz / jazz fusion / bossa nova / swing / cool jazz / bebop
ambient drone / cinematic ambient / dark ambient / space music
lofi hiphop / boom bap / trip hop / chillhop / study beats
neo-soul / R&B / funk / gospel
classical / chamber / string quartet / solo piano / minimalist / romantic
cinematic orchestral / film score / epic trailer / horror score / ghibli-style
indie rock / shoegaze / post-rock / dream pop / synthwave / vaporwave
electronic / IDM / techno / house / drum and bass / ambient techno
world / flamenco / tango / celtic / middle eastern / afrobeat / reggae
Instruments (more specific = better)
warm Rhodes piano, muted saxophone, brushed jazz drums,
upright bass walking line, vibraphone, muted trumpet,
Fender Rhodes, clean Stratocaster, nylon-string guitar,
Moog bass, analog synth pad, mellotron strings,
violin section, cello, timpani, woodwinds,
hand drums, sitar, oud, didgeridoo, koto
Production / mix character
vinyl crackle, tape hiss, analog warmth, lo-fi compression,
big reverb, long delay, spring reverb, plate reverb,
close-mic'd, room ambience, field recording,
dry and intimate, lush and wide, spectral shimmer,
sidechained pump, pumping kick, saturated bass
Mood / setting
nocturnal, rainy window, coffee shop, late-night drive,
contemplative, melancholic, uplifting, triumphant, dark foreboding,
urgent, tense, calm and measured, reverent, sacred,
morning coffee, sunrise, sunset, winter, summer, desert, forest
Rhythm / groove cues (reinforces BPM)
relaxed 4/4 swing, boom-bap groove, head-nod groove,
samba syncopation, waltz 3/4, odd meter 7/8,
driving straight 8ths, laid back behind the beat
Full example prompt
lofi jazz, mellow hip hop beat, warm Rhodes piano, soft muted saxophone,
brushed jazz drums, upright bass walking line, vinyl crackle,
rainy window atmosphere, nocturnal, study beats, relaxed 4/4 swing
Anti-patterns
- ❌ Full sentences ("A beautiful jazz song with piano"): ACE expects tags, not prose
- ❌ Requesting specific artists ("in the style of Miles Davis"): may nudge the output, but it is not reliable
- ❌ Contradictory tags ("aggressive peaceful / loud quiet"): the model averages them to mush
- ❌ Song-structure prose ("verse 1 goes like..."): use the --lyrics arg for vocals
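The tag categories and the first anti-pattern can be sketched as a small prompt helper. Both function names are hypothetical and not part of music_maker.py. The word-count threshold in `looks_like_prose` is an assumption, chosen so that five-word tags like "laid back behind the beat" pass while the six-word sentence from the anti-patterns list is flagged.

```python
def build_prompt(genre, instruments=(), production=(), mood=(), groove=()):
    """Join tag lists from the five categories into one comma-separated prompt."""
    tags, seen = [], set()
    for group in (genre, instruments, production, mood, groove):
        for tag in group:
            cleaned = tag.strip()
            # Skip empty entries and case-insensitive duplicates.
            if cleaned and cleaned.lower() not in seen:
                seen.add(cleaned.lower())
                tags.append(cleaned)
    return ", ".join(tags)


def looks_like_prose(prompt, max_words_per_tag=5):
    """Heuristic (assumption): a comma-free run longer than a typical tag is prose."""
    return any(len(chunk.split()) > max_words_per_tag for chunk in prompt.split(","))


prompt = build_prompt(
    genre=["lofi jazz", "study beats"],
    instruments=["warm Rhodes piano", "brushed jazz drums"],
    production=["vinyl crackle"],
    mood=["nocturnal"],
    groove=["relaxed 4/4 swing"],
)
print(prompt)
print(looks_like_prose("A beautiful jazz song with piano"))
```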
Writing for dynamics, feel, and punch
If your tracks sound flat / same-level / lifeless, the prompt is usually why. ACE Step mirrors the energy envelope of its tags. A "wall of sound" prompt produces a wall-of-sound track — no peaks, no valleys, no feel.
Words that CREATE dynamics (use these):
punchy, snappy, transient-rich, kick-forward, staccato, percussive,
breathy, restrained, sparse, minimal, space between notes,
quiet intro, slow build, drops to silence, sudden hit,
accent on the one, ghost note, syncopated, rhythmic tension,
call and response, rest, pause, breathing room,
rises and falls, crescendo, decrescendo, swell, taper,
loud-quiet-loud dynamics, cinematic dynamics,
sidechain pump, ducking, gated, stabbed, plucked, stabs,
muted, then big, whispered then roared
Words that KILL dynamics (avoid or use sparingly):
wall of sound, dense mix, thick, maximal, lush full arrangement,
constant energy, always moving, never stops, saturated everything,
massive, huge, overwhelming, pounding nonstop,
layered and layered, everything at once,
compressed to the max, radio-ready loud ← asks the model to pre-compress
Structural cues (for lyrics or instrumental builds):
[verse: hushed] [chorus: full] [breakdown: drums only]
[drop] [build] [silence]
[intro: solo piano] [outro: solo cello, sparse]
Tempo + dynamics: slower tempos (60–90 BPM) naturally have more room for dynamics than fast ones (140+). If a genre is dense and fast by default (dubstep, drum-and-bass), insert explicit dynamic cues ("drop to solo bass, silence, then full drop", "pause at bar 16") to force contrast.
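A quick way to apply the "KILL dynamics" list before spending a generation is to scan the prompt for those phrases. The helper below is a hypothetical sketch, not part of the tool, and it covers only a subset of the phrases listed above, using plain substring matching.

```python
# Subset of the dynamics-killing phrases listed above.
DYNAMICS_KILLERS = (
    "wall of sound",
    "dense mix",
    "maximal",
    "constant energy",
    "never stops",
    "everything at once",
    "compressed to the max",
    "radio-ready loud",
)


def find_dynamics_killers(prompt):
    """Return the dynamics-killing phrases present in the prompt, in list order."""
    lowered = prompt.lower()
    return [phrase for phrase in DYNAMICS_KILLERS if phrase in lowered]


print(find_dynamics_killers("dubstep, wall of sound, Radio-Ready Loud, drop"))
print(find_dynamics_killers("lofi jazz, sparse, minimal, quiet intro"))
```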
Diagnostic: measure your output. After --master auto the tool prints LUFS / TP / LRA / DR / crest before and after. Targets for a track that has "feeling":
| Metric | Flat (bad) | Good | Wide (excellent) |
|---|---|---|---|
| LRA | < 3 LU | 5–8 LU | 10+ LU (classical territory) |
| DR | < 8 dB | 15–25 dB | 30+ dB |
| Crest | < 4 | 5–8 | 9+ |
LRA of 1.8 means your track is squashed flat. LRA of 6–8 is what pop/EDM typically ships at. LRA of 20+ is cinematic/classical.
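The thresholds in the table translate directly into a classifier. This is a sketch with hypothetical names. The table leaves some ranges unlabeled (for example LRA between 3 and 5 LU), so the function reports those as "in between" rather than guessing.

```python
# (flat_below, good_low, good_high, wide_from), copied from the table above.
BANDS = {
    "lra":   (3.0, 5.0, 8.0, 10.0),
    "dr":    (8.0, 15.0, 25.0, 30.0),
    "crest": (4.0, 5.0, 8.0, 9.0),
}


def classify(metric, value):
    """Map a measured value onto the table's Flat / Good / Wide columns."""
    flat_below, good_low, good_high, wide_from = BANDS[metric]
    if value < flat_below:
        return "flat"
    if value >= wide_from:
        return "wide"
    if good_low <= value <= good_high:
        return "good"
    return "in between"  # the table does not label these ranges


for metric, value in (("lra", 1.8), ("lra", 6.5), ("dr", 28.5), ("crest", 6.68)):
    print(metric, value, classify(metric, value))
```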
6. Lyrics (vocal songs)
ACE Step is one of the few audio models that sings. Pass lyrics either inline:
python music_maker.py --prompt "indie folk, acoustic guitar, soft vocals, melancholic" \
--lyrics "[verse]
The road was long, the night was colder
The stars were hidden in the rain
I walked along with only shadows
Listening for your voice in vain
[chorus]
Find me, find me, in the morning
When the sun breaks through again..." \
--duration 180 --bpm 90 --key "D minor"
Or from a file:
python music_maker.py --prompt "..." --lyrics ./lyrics/my_song.txt --duration 240
Lyrics syntax
ACE Step respects structural tags in square brackets. Use these to guide the model:
[intro] instrumental intro, no vocals
[verse] verse vocals
[chorus] chorus — typically higher energy, repeated hook
[pre-chorus] build-up lines before the chorus
[bridge] contrasting section
[outro] instrumental outro
[instrumental] skip vocals for this section
[solo: saxophone] instrumental solo on the named instrument
[hook] catchy short phrase
[break] brief silence or drum-only
Keep lines short (4–8 words) and meter-consistent within sections. ACE handles English best; other languages are selected via the language field in the text encoder, which is not yet exposed as a CLI flag (edit the template if you need that).
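The tag list and the 4–8 word guideline can be checked before a run. `lint_lyrics` below is a hypothetical helper, not part of the tool. It only knows the tags listed in this section, so cues from section 5 such as [drop] would be reported as unknown.

```python
import re

# Structural tags listed in this section.
KNOWN_TAGS = {
    "intro", "verse", "chorus", "pre-chorus", "bridge",
    "outro", "instrumental", "solo", "hook", "break",
}

# Matches "[tag]" or "[tag: detail]" on a line of its own.
TAG_LINE = re.compile(r"^\[([^\]:]+)(?::\s*[^\]]+)?\]$")


def lint_lyrics(text, min_words=4, max_words=8):
    """Return (line_number, message) pairs for unknown tags and off-length lines."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        match = TAG_LINE.match(line)
        if match:
            if match.group(1).strip().lower() not in KNOWN_TAGS:
                problems.append((number, f"unknown tag {line}"))
            continue
        words = len(line.split())
        if not min_words <= words <= max_words:
            problems.append((number, f"{words} words (aim for {min_words}-{max_words})"))
    return problems


sample = "[verse]\nThe road was long, the night was colder\n[refrain]\nFind me\n[solo: saxophone]"
print(lint_lyrics(sample))
```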
7. Key and tempo guide
| Genre | Typical BPM | Typical keys |
|---|---|---|
| Lofi hiphop / study beats | 70–90 | A minor, D minor, E minor |
| Jazz / bossa nova | 80–140 | any — major for upbeat, minor for ballad |
| Ambient / drone | 60–70 or n/a | E minor, A minor, D minor (modal) |
| Classical / chamber | varies | full chromatic range |
| Cinematic orchestral | 60–100 hero, 120–160 action | D minor (dread), E♭ major (hero), C minor (tragedy) |
| Neo-soul / R&B | 70–95 | any — minor keys + 7ths for color |
| Indie / alt rock | 100–130 | E minor / G major / D major |
| EDM / techno | 120–135 | any |
| Dnb / jungle | 160–175 | minor keys |
Valid keyscales: "C major", "C minor", "C# major", "C# minor", ... through the full chromatic range + all major/minor pairs. ACE accepts modal hints too (e.g. "A dorian", "E phrygian") but fidelity to mode is best-effort.
8. Output formats
The tool's output is determined by the --output extension:
- `.flac` (default): lossless, 48 kHz stereo, typically 6–10 MB per 90 s. Use for masters.
- `.wav`: 48 kHz pcm_s24le (24-bit), larger files, no quality difference from FLAC; useful for DAW import without re-encoding.
- `.mp3`: libmp3lame at `-q:a 0` (highest VBR quality, ~245 kbps typical). Use for delivery / sharing.
FLAC is the right default. The model's native output is 48 kHz stereo; FLAC captures that losslessly.
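The extension-to-encoder mapping above can be sketched as follows. The codec names and `-q:a 0` come from the list above. That the tool shells out to ffmpeg with exactly these arguments is an assumption, and `encoder_args` is a hypothetical helper name.

```python
from pathlib import Path

# Encoder settings per output extension, as described in the list above.
ENCODER_ARGS = {
    ".flac": ["-c:a", "flac"],
    ".wav": ["-c:a", "pcm_s24le"],
    ".mp3": ["-c:a", "libmp3lame", "-q:a", "0"],
}


def encoder_args(output_path):
    """Return ffmpeg-style audio arguments for the given output file name."""
    suffix = Path(output_path).suffix.lower()
    if suffix not in ENCODER_ARGS:
        raise ValueError(f"unsupported output extension: {suffix!r}")
    # 48 kHz stereo matches the model's native output.
    return ["-ar", "48000", "-ac", "2", *ENCODER_ARGS[suffix]]


print(encoder_args("output/music/lofi_jazz_1210744748.flac"))
print(encoder_args("share/track.MP3"))
```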
9. Reproducibility + iteration workflows
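Every run prints the seed used. Save it: rerunning with the same --prompt + --seed + --variant (and otherwise identical arguments such as --duration, --bpm, and --key) should give you the exact same track, bit-for-bit, even weeks later, as long as the model files and the ComfyUI install have not changed. This is how you: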
- Iterate on a loved draft (same seed, tweak one tag)
- Bounce multiple mixes (same seed, different variant)
- Generate stems (same seed, sequential prompts emphasizing different instruments)
# You loved this one; now try a version with stronger sax
python music_maker.py --prompt "lofi jazz, warm Rhodes, PROMINENT soft saxophone, brushed drums..." \
--seed 1210744748 --duration 90 --bpm 78 --key "A minor"
Mastering-preset A/B on the same generation
When you have a great draft and want to hear it through two different mastering chains without re-generating the 2-3 minute track, use --keep-raw to stash the pre-master:
# First pass: generate + master with auto-detect
python music_maker.py --prompt "..." --seed 259461068 --duration 150 --keep-raw -o track.flac
# → produces track.flac (mastered) + track.raw.flac (pre-master)
# Second pass: re-master the raw with a different preset
python scene_production_tool/music_mastering.py track.raw.flac --preset edm -o track_edm.flac
python scene_production_tool/music_mastering.py track.raw.flac --preset default --target-lufs -14 -o track_hifi.flac
python scene_production_tool/music_mastering.py track.raw.flac --preset orchestral -o track_transparent.flac
# Compare all three side by side
This is the fastest way to find the right sonic character for a track: generate once, master many.
Sweep seeds cheaply, master the winner
When a prompt might go many different directions, do a fast preview sweep with xl_turbo --master off, pick the best seed, then re-generate at xl_base quality with mastering on:
# Sweep 5 seeds at turbo speed, raw output (no mastering yet)
for i in 1 2 3 4 5; do
  seed=$RANDOM  # keep the seed in the file name so the winner can be re-generated
  python music_maker.py --prompt "..." --duration 30 --variant xl_turbo --master off \
    --output "preview_${seed}.flac" --seed "$seed"
done
# Pick the one you like, read its seed from the file name, then:
python music_maker.py --prompt "..." --duration 180 --variant xl_base --seed <winning_seed>
Remaster an existing track in-place
# If you already have a .flac from a previous run and want to try the mastering chain:
python scene_production_tool/music_mastering.py \
output/music/some_older_track.flac --preset jazz \
--output output/music/some_older_track_mastered.flac
9b. Diagnosing a flat-sounding track
If a rendered track feels lifeless, the --master auto run prints before/after LRA / DR / crest. Decision tree:
LRA (raw) < 3.0 LU?
├── YES → the GENERATOR produced a flat track. Mastering can't fix this.
│         FIX: rewrite prompt with dynamics vocabulary (section 5).
│              Add [intro: sparse] / [drop] / [breakdown] structural cues.
│              Drop dynamics-killing words ("wall of sound", "dense", "massive").
│
└── NO (LRA ≥ 3.0) → generator is fine.
    └── LRA (after) << LRA (before)?
        ├── YES → mastering chain is compressing too hard.
        │         CHECK: did you force a preset that adds a Compressor?
        │         (Default presets have NO compressor; only `use_compressor: True`
        │         presets compress. None of the named presets set this.)
        │
        └── NO → dynamics intact. If track still feels "dull":
                 - Check saturation_db in preset (raise for more color)
                 - Check presence_db / high_shelf_db (raise for more clarity)
                 - Try a different preset: chill → default → edm (ascending color)
Rule of thumb target LRA by genre:
| Style | Target LRA |
|---|---|
| Pop / commercial EDM | 5–8 LU |
| Hip-hop / trap | 4–7 LU |
| Rock / indie | 6–10 LU |
| Jazz | 8–14 LU |
| Film score / cinematic | 12–20 LU |
| Classical | 15–25 LU |
If your prompt is for "cinematic orchestral" and the track measures LRA 4, the prompt lost the fight — rewrite to emphasize dynamics (section 5).
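The decision tree and the per-genre targets can be combined into one check. The sketch below uses hypothetical names. The tree only says "LRA (after) << LRA (before)", so treating a drop of more than 1 LU as "compressing too hard" is an assumption borrowed from the Validation note in section 12.

```python
# (low, high) target LRA in LU per style, copied from the table above.
TARGET_LRA = {
    "pop": (5, 8),
    "hip-hop": (4, 7),
    "rock": (6, 10),
    "jazz": (8, 14),
    "film score": (12, 20),
    "classical": (15, 25),
}


def diagnose(lra_raw, lra_mastered, style=None, max_drop=1.0):
    """Walk the decision tree for one track's before/after LRA readings."""
    if lra_raw < 3.0:
        return "generator flat: rewrite the prompt with dynamics vocabulary"
    if lra_raw - lra_mastered > max_drop:
        return "mastering too aggressive: check the preset for a compressor"
    if style is not None and lra_raw < TARGET_LRA[style][0]:
        return f"below the {style} target: emphasize dynamics in the prompt"
    return "dynamics intact: adjust saturation, presence, or preset for color"


print(diagnose(1.8, 1.8))
print(diagnose(4.0, 3.9, style="film score"))
print(diagnose(23.1, 22.9, style="film score"))
```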
10. Failure modes
| Symptom | Cause | Fix |
|---|---|---|
| Distorted / tinny / hollow audio | You're using simple KSampler with xl_base or xl_sft via a non-APG template | Use `music_maker.py` (this tool), NOT `radio_drama.py --stage music`; this one uses APG |
| Truncated audio | `--duration` > ~240 s exceeds model coherence window | Split into 2–3 tracks, crossfade in DAW |
| Lyrics not sung | Model ignored the lyrics tag | Ensure the `--lyrics` arg is set and the prompt includes vocal tags ("soft vocals", "sung", "male/female voice") |
| Timing wrong for BPM | Prompt contradicts `--bpm` (e.g. BPM 80 but tags say "uptempo") | Either tighten tags or change `--bpm` to match |
| `GatedRepoError` or missing model | Model file not on disk | Confirm with `ssh ${SSH_USER}@127.0.0.1 'dir ${COMFYUI_ROOT}\models\diffusion_models\acestep_*'` |
| Out-of-memory | Happens occasionally with xl_base fp32 + long durations + concurrent ComfyUI work | Set `--variant xl_sft` (bf16, ~10 GB), or wait for other jobs to finish |
| `[Errno 22]` on ComfyUI load | Windows mmap bug | Verify `comfy/utils.py` line 41 is `DISABLE_MMAP = True` on Workstation |
11. Recipes
Lofi jazz bed (a solid starting point)
python music_maker.py \
--prompt "lofi jazz, mellow hip hop beat, warm Rhodes piano, soft muted saxophone, brushed jazz drums, upright bass walking line, vinyl crackle, rainy window atmosphere, nocturnal, study beats, relaxed 4/4 swing" \
--duration 180 --bpm 78 --key "A minor" --variant xl_base
Cinematic hero cue (for a trailer)
python music_maker.py \
--prompt "cinematic orchestral, epic trailer, powerful strings rising, heroic horns, thunderous timpani, choir swell, uplifting resolution, dolby atmos wide" \
--duration 90 --bpm 100 --key "E♭ major" --variant xl_base --cfg 8.0
Dark ambient drone (atmosphere bed)
python music_maker.py \
--prompt "dark ambient drone, sub bass sustain, granular synthesis pad, distant wind chimes, long reverb tail, cave-like space, slow evolving, no rhythm" \
--duration 240 --bpm 60 --key "D minor" --variant xl_base --steps 70
Indie folk demo with lyrics
python music_maker.py \
--prompt "indie folk, intimate acoustic guitar fingerpicking, soft female vocals, close-mic'd, cold morning, minimal reverb" \
--lyrics "[verse]
I left my heart up on the ridge
Between the pine and snow
You said you'd come and find me there
But winter just would not let go
[chorus]
Find me, find me, find me before the thaw
Where the silence holds the note" \
--duration 160 --bpm 88 --key "G major" --variant xl_base
Preview iteration (fast, disposable)
python music_maker.py --prompt "..." --duration 30 --variant xl_turbo
12. Mastering: post-generation dynamics-preserving chain
Raw ACE output benefits from a quick mastering pass: subtle EQ for presence and air, a whisper of tape warmth, a gain match to a sensible LUFS target, and a safety ceiling to catch stray peaks. The tool does this automatically unless you opt out.
What's in the chain
raw.flac
|
|-- HighpassFilter(20–30 Hz) remove sub-rumble
|-- LowShelf(-1 dB @ 150–200 Hz) tame mud
|-- PeakFilter(+1..2 dB @ 2.5–4.5 kHz, Q 0.6–0.9) presence
|-- HighShelf(+1.5..3 dB @ 10–13 kHz) air
|-- Distortion(0..4 dB) light tape warmth (skipped at 0)
|-- Gain match to target LUFS CAPPED at +6 dB to avoid boosting silence
|-- Clipping ceiling (−0.8..−2.0 dBFS) brick-wall safety (rarely engages)
|
mastered.flac
Key design choice: no compressor in the default chain and no loudnorm LRA=X, because both flatten dynamics. Instead the chain relies on additive EQ + light saturation to add perceived "life" without squashing transients, then gain-matches to LUFS via ebur128 measurement.
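The gain-match and ceiling stages are simple arithmetic, sketched below with hypothetical function names. The sketch ignores the small level changes from the EQ and saturation stages, and it assumes the ceiling is −0.8 dBFS, the loudest value in the range quoted in the chain.

```python
def gain_to_target(measured_lufs, target_lufs, max_boost_db=6.0):
    """Gain in dB that moves the measured loudness to the target, boost capped."""
    return min(target_lufs - measured_lufs, max_boost_db)


def predicted_peak(true_peak_db, gain_db, ceiling_db=-0.8):
    """Approximate peak after gain, limited by the clipping ceiling."""
    return min(true_peak_db + gain_db, ceiling_db)


# A loud raw track mastered to a quiet target: gain is negative, peaks drop with it.
gain = gain_to_target(measured_lufs=-11.9, target_lufs=-18.0)
print(round(gain, 2), round(predicted_peak(0.30, gain), 2))

# A very quiet raw track: the boost stops at +6 dB instead of chasing the target.
print(gain_to_target(measured_lufs=-30.0, target_lufs=-14.0))
```

The first call uses the BEFORE readings from the Validation example later in this section (−11.9 LUFS, 0.30 dBTP) with the orchestral target of −18 LUFS.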
Presets
| Preset | Target LUFS | Saturation | Best for |
|---|---|---|---|
| default | −14 | 2.5 dB | Unknown / mixed genre |
| edm | −12 | 3.5 dB | Dubstep, dnb, trance, psy, house, DMT-flash |
| trap | −12 | 3.0 dB | 808 rap, drill, hip-hop |
| chill | −16 | 1.5 dB | Lofi, ambient, chillhop |
| orchestral | −18 | 0 dB | Classical, film score (fully transparent) |
| jazz | −15 | 1.0 dB | Jazz, bossa, bebop, smooth jazz |
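Preset auto-detection from prompt keywords might look roughly like the sketch below. The preset values are copied from the table. The keyword lists are drawn from the "Best for" column. The priority order, the substring matching, and the names `PRESETS`, `KEYWORDS`, and `pick_preset` are all assumptions, not the tool's actual logic.

```python
# Target LUFS and saturation per preset, copied from the table above.
PRESETS = {
    "default":    {"target_lufs": -14.0, "saturation_db": 2.5},
    "edm":        {"target_lufs": -12.0, "saturation_db": 3.5},
    "trap":       {"target_lufs": -12.0, "saturation_db": 3.0},
    "chill":      {"target_lufs": -16.0, "saturation_db": 1.5},
    "orchestral": {"target_lufs": -18.0, "saturation_db": 0.0},
    "jazz":       {"target_lufs": -15.0, "saturation_db": 1.0},
}

# Checked top to bottom; the first preset with a matching keyword wins (assumed order).
KEYWORDS = [
    ("orchestral", ("classical", "film score")),
    ("edm", ("dubstep", "dnb", "trance", "psy", "house")),
    ("trap", ("808", "drill", "hip-hop")),
    ("jazz", ("jazz", "bossa", "bebop")),
    ("chill", ("lofi", "ambient", "chillhop")),
]


def pick_preset(prompt):
    """Return the first preset whose keywords appear in the prompt, else 'default'."""
    lowered = prompt.lower()
    for name, words in KEYWORDS:
        if any(word in lowered for word in words):
            return name
    return "default"


print(pick_preset("lofi jazz, Rhodes, saxophone, brushed drums"))
print(pick_preset("indie folk, acoustic guitar"))
```

Putting jazz ahead of chill is a deliberate choice here so that a "lofi jazz" prompt resolves to the jazz preset, which matches the auto-detect example in the Usage notes.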
Usage
Auto-detect (default, recommended):
python music_maker.py --prompt "lofi jazz, Rhodes, saxophone, brushed drums" --duration 180
# auto-picks "jazz" preset from prompt keywords
Force a preset:
python music_maker.py --prompt "..." --master orchestral
python music_maker.py --prompt "..." --master edm --target-lufs -11 # louder than default
Opt out (raw ACE output, no mastering):
python music_maker.py --prompt "..." --master off
Keep both raw and mastered:
python music_maker.py --prompt "..." --keep-raw
# writes <output>.flac (mastered) AND <output>.raw.flac (pre-master)
Standalone mastering (on any audio file)
The chain is also callable directly for mastering pre-existing tracks:
python scene_production_tool/music_mastering.py input.flac --preset orchestral
python scene_production_tool/music_mastering.py input.flac --preset edm -o out.wav --target-lufs -11
Validation
The tool prints a before/after table:
[BEFORE] epic_orchestral_xl_base.flac
LUFS -11.9 TP 0.30 LRA 23.1 DR 28.5 crest 6.68
[AFTER] epic_orchestral_xl_base_orchestral.flac (preset='orchestral')
LUFS -18.0 TP -5.82 LRA 22.9 DR 27.9 crest 7.47
Δ LRA -0.2 Δ DR -0.6 Δ crest +0.79
What to expect: LUFS hits target within ±0.2 LU. LRA should change by less than 1 LU (preferably +0 or better). Crest factor may drop slightly on tracks whose source was clipping (peaks get cleaned up) or may rise on clean sources (EQ adds punch). TP should land in the −1 to −3 dBTP range (safe for streaming) at the louder targets; quiet targets such as orchestral (−18 LUFS) land lower, as with the −5.82 dBTP above.
What NOT to expect: Mastering cannot restore dynamics the generator never produced. A track that comes out of ACE with LRA 1.8 stays at LRA 1.8 after mastering — you get back what was there, with more color and safer peaks. To get more dynamics, fix the prompt (see section 5: Writing for dynamics).
13. When to use this vs. radio-drama music
Use this (music_maker.py) when:
- The music is the deliverable (song, single, album cut, demo, jingle)
- You care about audio fidelity — lossless masters, full dynamic range, no mp3 artifacts
- Tracks will be listened to front-and-center, not under dialogue
- You want full base-model quality via APG
Use scene_production_tool/radio_drama.py --stage music when:
- The music sits UNDER dialogue with sidechain ducking
- 30–60 s cues per scene, MP3 output fine
- Part of a larger radio-drama production
- Speed matters more than ceiling quality
Different templates, different defaults, different output formats. They don't conflict — both exist side-by-side.