Heartmula — HeartMuLa: Suno-like song generation from lyrics + tags
Heartmula
HeartMuLa: Suno-like song generation from lyrics + tags.
Skill metadata
| Source | Bundled (installed by default) |
| Path | skills/media/heartmula |
| Version | 1.0.0 |
| Tags | music, audio, generation, ai, heartmula, heartcodec, lyrics, songs |
| Related skills | audiocraft |
Reference: full SKILL.md
Info: The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
HeartMuLa - Open-Source Music Generation
Overview
HeartMuLa is a family of open-source music foundation models (Apache-2.0) that generate full songs conditioned on lyrics and tags, with multilingual support. It fills a role comparable to Suno's in the open-source space. The family includes:
- HeartMuLa - Music language model (3B/7B) for generation from lyrics + tags
- HeartCodec - 12.5Hz music codec for high-fidelity audio reconstruction
- HeartTranscriptor - Whisper-based lyrics transcription
- HeartCLAP - Audio-text alignment model
When to Use
- User wants to generate music/songs from text descriptions
- User wants an open-source Suno alternative
- User wants local/offline music generation
- User asks about HeartMuLa, heartlib, or AI music generation
Hardware Requirements
- Minimum: 8GB VRAM with `--lazy_load true` (loads/unloads models sequentially)
- Recommended: 16GB+ VRAM for comfortable single-GPU usage
- Multi-GPU: Use `--mula_device cuda:0 --codec_device cuda:1` to split across GPUs
- 3B model with lazy_load peaks at ~6.2GB VRAM
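
The thresholds above can be turned into a small flag-selection helper. This is a minimal sketch, not part of heartlib: the function name `suggest_flags` and the decision policy (split whenever two GPUs are present, lazy-load below 16 GB) are assumptions made for illustration. Only the flag names and values come from this skill.

```python
def suggest_flags(gpu_vram_gb):
    """Map detected GPU memory (GB, one entry per GPU) to CLI flags.

    Hypothetical helper: the policy below is an assumption based on the
    hardware notes in this skill, not logic taken from heartlib itself.
    """
    if not gpu_vram_gb:
        # No GPU detected: fall back to CPU (expect very slow generation).
        return ["--mula_device", "cpu", "--codec_device", "cpu"]
    if len(gpu_vram_gb) >= 2:
        # Two or more GPUs: put HeartMuLa and HeartCodec on separate devices.
        return ["--mula_device", "cuda:0", "--codec_device", "cuda:1"]
    flags = ["--mula_device", "cuda", "--codec_device", "cuda"]
    if gpu_vram_gb[0] < 16:
        # Below the recommended 16 GB: load/unload models sequentially.
        # Cards under 8 GB are below the stated minimum and may still fail.
        flags += ["--lazy_load", "true"]
    return flags
```

For example, `suggest_flags([8.0])` returns the single-GPU flags plus `--lazy_load true`, while `suggest_flags([24.0, 24.0])` returns the two-device split.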
Installation Steps
1. Clone Repository

    cd ~/  # or desired directory
    git clone https://github.com/HeartMuLa/heartlib.git
    cd heartlib

2. Create Virtual Environment (Python 3.10 required)

    uv venv --python 3.10 .venv
    . .venv/bin/activate
    uv pip install -e .

3. Fix Dependency Compatibility Issues
IMPORTANT: As of Feb 2026, the pinned dependencies conflict with newer packages. Apply these fixes:

    # Upgrade datasets (old version incompatible with current pyarrow)
    uv pip install --upgrade datasets

    # Upgrade transformers (needed for huggingface-hub 1.x compatibility)
    uv pip install --upgrade transformers

4. Patch Source Code (Required for transformers 5.x)
Patch 1 - RoPE cache fix in src/heartlib/heartmula/modeling_heartmula.py:
In the setup_caches method of the HeartMuLa class, add RoPE reinitialization after the reset_caches try/except block and before the with device: block:

    # Re-initialize RoPE caches that were skipped during meta-device loading
    from torchtune.models.llama3_1._position_embeddings import Llama3ScaledRoPE
    for module in self.modules():
        if isinstance(module, Llama3ScaledRoPE) and not module.is_cache_built:
            module.rope_init()
            module.to(device)

Why: from_pretrained creates the model on the meta device first; Llama3ScaledRoPE.rope_init() skips cache building on meta tensors, and the cache is never rebuilt after the weights are loaded onto the real device.
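
The failure mode behind Patch 1 can be reproduced without PyTorch. The toy below is only an illustration of the pattern: `ToyRoPE` and `rebuild_missing_caches` are hypothetical names, devices are plain strings, and none of this is torchtune or heartlib code.

```python
class ToyRoPE:
    """Stand-in for a rotary-embedding module with a lazily built cache."""

    def __init__(self, device):
        self.device = device
        self.is_cache_built = False
        self.cache = None
        self.rope_init()

    def rope_init(self):
        # Nothing can be computed on the "meta" device, so the cache is skipped.
        if self.device == "meta":
            return
        self.cache = [0.0, 0.1, 0.2]  # placeholder values, not real frequencies
        self.is_cache_built = True

    def to(self, device):
        # Moving the module changes its device but does NOT rebuild the cache.
        self.device = device
        return self


def rebuild_missing_caches(modules, device):
    """Same loop shape as Patch 1: re-run rope_init where the cache is missing."""
    rebuilt = 0
    for module in modules:
        if isinstance(module, ToyRoPE) and not module.is_cache_built:
            module.rope_init()
            module.to(device)
            rebuilt += 1
    return rebuilt


# Simulate loading: construct on "meta", then materialize on a real device.
layers = [ToyRoPE("meta"), ToyRoPE("meta")]
for layer in layers:
    layer.to("cpu")
caches_before = [layer.is_cache_built for layer in layers]  # still unbuilt
rebuilt_count = rebuild_missing_caches(layers, "cuda")
```

After the loading step the caches are still missing even though the modules sit on a real device. That gap is what the explicit rebuild loop closes.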
Patch 2 - HeartCodec loading fix in src/heartlib/pipelines/music_generation.py:
Add ignore_mismatched_sizes=True to ALL HeartCodec.from_pretrained() calls (there are 2: the eager load in __init__ and the lazy load in the codec property).
Why: The VQ codebook initted buffers have shape [1] in the checkpoint vs [] in the model. The data is the same; the only difference is a one-element 1-D tensor vs a 0-d (scalar) tensor, so the mismatch is safe to ignore.
5. Download Model Checkpoints

    cd heartlib  # project root
    hf download --local-dir './ckpt' 'HeartMuLa/HeartMuLaGen'
    hf download --local-dir './ckpt/HeartMuLa-oss-3B' 'HeartMuLa/HeartMuLa-oss-3B-happy-new-year'
    hf download --local-dir './ckpt/HeartCodec-oss' 'HeartMuLa/HeartCodec-oss-20260123'

All 3 can be downloaded in parallel. Total size is several GB.
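
One way to run the three downloads in parallel is a small thread pool around the same `hf download` commands. This is a sketch under assumptions: `download_command`, `download_all`, and the `run` parameter are hypothetical helpers, and it presumes the `hf` CLI is on PATH and that it is run from the project root.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# (local directory, Hugging Face repo id) pairs, copied from the commands above.
CHECKPOINTS = [
    ("./ckpt", "HeartMuLa/HeartMuLaGen"),
    ("./ckpt/HeartMuLa-oss-3B", "HeartMuLa/HeartMuLa-oss-3B-happy-new-year"),
    ("./ckpt/HeartCodec-oss", "HeartMuLa/HeartCodec-oss-20260123"),
]


def download_command(local_dir, repo_id):
    """Build the argv list for a single `hf download` call."""
    return ["hf", "download", "--local-dir", local_dir, repo_id]


def download_all(checkpoints=CHECKPOINTS, run=subprocess.run):
    """Start one download per checkpoint and wait for all of them.

    `run` is injectable so the function can be exercised without network access.
    """
    with ThreadPoolExecutor(max_workers=len(checkpoints)) as pool:
        futures = [
            pool.submit(run, download_command(local_dir, repo_id), check=True)
            for local_dir, repo_id in checkpoints
        ]
        for future in futures:
            future.result()  # re-raises if a download failed
```

Calling `download_all()` from the heartlib directory starts all three transfers at once; with `check=True`, a failed download surfaces as `subprocess.CalledProcessError`.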
GPU / CUDA
HeartMuLa uses CUDA by default (`--mula_device cuda --codec_device cuda`). No extra setup is needed if the user has an NVIDIA GPU with PyTorch CUDA support installed.
- The installed `torch==2.4.1` includes CUDA 12.1 support out of the box
- `torchtune` may report version `0.4.0+cpu`; this is just package metadata, and it still uses CUDA via PyTorch
- To verify the GPU is being used, look for “CUDA memory” lines in the output (e.g. “CUDA memory before unloading: 6.20 GB”)
- No GPU? You can run on CPU with `--mula_device cpu --codec_device cpu`, but expect generation to be extremely slow (potentially 30-60+ minutes for a single song vs ~4 minutes on GPU). CPU mode also requires significant RAM (~12GB+ free). If the user has no NVIDIA GPU, recommend using a cloud GPU service (Google Colab free tier with T4, Lambda Labs, etc.) or the online demo at https://heartmula.github.io/ instead.
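
The GPU check can be automated by scanning captured output for those “CUDA memory” lines. This is a minimal sketch: the helper name `cuda_memory_readings` is hypothetical, and the regular expression assumes every such line follows the single example quoted above, which is the only format known here.

```python
import re

# Assumed format, based on the one quoted example:
#   "CUDA memory before unloading: 6.20 GB"
CUDA_MEMORY_RE = re.compile(r"CUDA memory[^:\n]*:\s*([0-9]+(?:\.[0-9]+)?)\s*GB")


def cuda_memory_readings(output):
    """Return every reported CUDA memory figure (in GB) found in `output`."""
    return [float(match.group(1)) for match in CUDA_MEMORY_RE.finditer(output)]
```

An empty list for a full generation log suggests the run did not report any GPU usage and is worth checking against the device flags.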
Basic Generation

    cd heartlib
    . .venv/bin/activate
    python ./examples/run_music_generation.py \
        --model_path=./ckpt \
        --version="3B" \
        --lyrics="./assets/lyrics.txt" \
        --tags="./assets/tags.txt" \
        --save_path="./assets/output.mp3" \
        --lazy_load true

Input Formatting
Tags (comma-separated, no spaces):

    piano,happy,wedding,synthesizer,romantic

or

    rock,energetic,guitar,drums,male-vocal

Lyrics (use bracketed structural tags):

    [Intro]

    [Verse]
    Your lyrics here...

    [Chorus]
    Chorus lyrics...

    [Bridge]
    Bridge lyrics...

    [Outro]

Key Parameters
| Parameter | Default | Description |
|---|---|---|
| `--max_audio_length_ms` | 240000 | Max length in ms (240s = 4 min) |
| `--topk` | 50 | Top-k sampling |
| `--temperature` | 1.0 | Sampling temperature |
| `--cfg_scale` | 1.5 | Classifier-free guidance scale |
| `--lazy_load` | false | Load/unload models on demand (saves VRAM) |
| `--mula_dtype` | bfloat16 | Dtype for HeartMuLa (bf16 recommended) |
| `--codec_dtype` | float32 | Dtype for HeartCodec (fp32 recommended for quality) |
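
The input formats and parameters above can be assembled programmatically. The helpers below are a hedged sketch: `format_tags`, `format_lyrics`, and `build_command` are hypothetical names, and passing the sampling flags in `--name=value` form is an assumption extrapolated from how `--model_path` is passed in the Basic Generation command.

```python
def format_tags(tags):
    """Join tags with commas and no spaces, as the tags file expects."""
    cleaned = []
    for tag in tags:
        tag = tag.strip()
        if not tag:
            continue
        if "," in tag or any(ch.isspace() for ch in tag):
            raise ValueError(f"tag must not contain commas or spaces: {tag!r}")
        cleaned.append(tag)
    return ",".join(cleaned)


def format_lyrics(sections):
    """Render (section_name, text) pairs with bracketed structural tags."""
    blocks = []
    for name, text in sections:
        block = f"[{name}]"
        if text.strip():
            block += "\n" + text.strip()
        blocks.append(block)
    return "\n\n".join(blocks) + "\n"


def build_command(lyrics_path, tags_path, save_path, model_path="./ckpt",
                  version="3B", max_audio_length_ms=240000, topk=50,
                  temperature=1.0, cfg_scale=1.5, lazy_load=True):
    """Build the argv list for examples/run_music_generation.py.

    Defaults mirror the parameter table, except lazy_load, which follows
    the Basic Generation example. The --name=value form is an assumption.
    """
    return [
        "python", "./examples/run_music_generation.py",
        f"--model_path={model_path}",
        f"--version={version}",
        f"--lyrics={lyrics_path}",
        f"--tags={tags_path}",
        f"--save_path={save_path}",
        f"--max_audio_length_ms={max_audio_length_ms}",
        f"--topk={topk}",
        f"--temperature={temperature}",
        f"--cfg_scale={cfg_scale}",
        "--lazy_load", "true" if lazy_load else "false",
    ]
```

A typical flow writes the results of `format_tags` and `format_lyrics` to `./assets/tags.txt` and `./assets/lyrics.txt`, then passes the list from `build_command` to `subprocess.run(..., check=True)` inside the activated virtual environment.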
Performance
- RTF (Real-Time Factor) ≈ 1.0: a 4-minute song takes ~4 minutes to generate
- Output: MP3, 48kHz stereo, 128kbps
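
These figures allow a quick back-of-the-envelope estimate of wall-clock time and file size. The sketch below treats RTF as generation time divided by audio duration, which matches the 4-minute example, and computes MP3 size from the nominal bitrate only. Both function names are hypothetical, and container overhead and metadata are ignored.

```python
def estimate_generation_seconds(audio_ms, rtf=1.0):
    """Wall-clock estimate: RTF times the audio duration in seconds."""
    return rtf * audio_ms / 1000.0


def estimate_mp3_bytes(audio_ms, bitrate_kbps=128):
    """Approximate file size for constant-bitrate MP3 (audio payload only)."""
    return int(audio_ms / 1000.0 * bitrate_kbps * 1000 / 8)
```

For the default `--max_audio_length_ms` of 240000, this gives about 240 seconds of generation time and roughly 3.84 MB of audio data.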
Pitfalls
- Do NOT use bf16 for HeartCodec: it degrades audio quality. Use fp32 (default).
- Tags may be ignored: known issue (#90). Lyrics tend to dominate; experiment with tag ordering.
- Triton is not available on macOS: GPU acceleration is Linux/CUDA only.
- RTX 5080 incompatibility reported in upstream issues.
- The dependency pin conflicts require the manual upgrades and patches described above.
- Repo: https://github.com/HeartMuLa/heartlib
- Models: https://huggingface.co/HeartMuLa
- Paper: https://arxiv.org/abs/2601.10547
- License: Apache-2.0