Audiocraft Audio Generation — AudioCraft: MusicGen text-to-music, AudioGen text-to-sound
Audiocraft Audio Generation
Section titled “Audiocraft Audio Generation”AudioCraft: MusicGen text-to-music, AudioGen text-to-sound.
Skill metadata
Section titled “Skill metadata”| Source | Bundled (installed by default) |
| Path | skills/mlops/models/audiocraft |
| Version | 1.0.0 |
| Author | Orchestra Research |
| License | MIT |
| Dependencies | audiocraft, torch>=2.0.0, transformers>=4.30.0 |
| Tags | Multimodal, Audio Generation, Text-to-Music, Text-to-Audio, MusicGen |
Reference: full SKILL.md
Section titled “Reference: full SKILL.md”Info The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.
AudioCraft: Audio Generation
Section titled “AudioCraft: Audio Generation”Comprehensive guide to using Meta’s AudioCraft for text-to-music and text-to-audio generation with MusicGen, AudioGen, and EnCodec.
When to use AudioCraft
Section titled “When to use AudioCraft”Use AudioCraft when:
- Need to generate music from text descriptions
- Creating sound effects and environmental audio
- Building music generation applications
- Need melody-conditioned music generation
- Want stereo audio output
- Require controllable music generation with style transfer
Key features:
- MusicGen: Text-to-music generation with melody conditioning
- AudioGen: Text-to-sound effects generation
- EnCodec: High-fidelity neural audio codec
- Multiple model sizes: Small (300M) to Large (3.3B)
- Stereo support: Full stereo audio generation
- Style conditioning: MusicGen-Style for reference-based generation
Use alternatives instead:
- Stable Audio: For longer commercial music generation
- Bark: For text-to-speech with music/sound effects
- Riffusion: For spectogram-based music generation
- OpenAI Jukebox: For raw audio generation with lyrics
Quick start
Section titled “Quick start”Installation
Section titled “Installation”# From PyPIpip install audiocraft
# From GitHub (latest)pip install git+https://github.com/facebookresearch/audiocraft.git
# Or use HuggingFace Transformerspip install transformers torch torchaudioBasic text-to-music (AudioCraft)
Section titled “Basic text-to-music (AudioCraft)”import torchaudiofrom audiocraft.models import MusicGen
# Load modelmodel = MusicGen.get_pretrained('facebook/musicgen-small')
# Set generation parametersmodel.set_generation_params( duration=8, # seconds top_k=250, temperature=1.0)
# Generate from textdescriptions = ["happy upbeat electronic dance music with synths"]wav = model.generate(descriptions)
# Save audiotorchaudio.save("output.wav", wav[0].cpu(), sample_rate=32000)Using HuggingFace Transformers
Section titled “Using HuggingFace Transformers”from transformers import AutoProcessor, MusicgenForConditionalGenerationimport scipy
# Load model and processorprocessor = AutoProcessor.from_pretrained("facebook/musicgen-small")model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")model.to("cuda")
# Generate musicinputs = processor( text=["80s pop track with bassy drums and synth"], padding=True, return_tensors="pt").to("cuda")
audio_values = model.generate( **inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
# Savesampling_rate = model.config.audio_encoder.sampling_ratescipy.io.wavfile.write("output.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())Text-to-sound with AudioGen
Section titled “Text-to-sound with AudioGen”from audiocraft.models import AudioGen
# Load AudioGenmodel = AudioGen.get_pretrained('facebook/audiogen-medium')
model.set_generation_params(duration=5)
# Generate sound effectsdescriptions = ["dog barking in a park with birds chirping"]wav = model.generate(descriptions)
torchaudio.save("sound.wav", wav[0].cpu(), sample_rate=16000)Core concepts
Section titled “Core concepts”Architecture overview
Section titled “Architecture overview”AudioCraft Architecture:┌──────────────────────────────────────────────────────────────┐│ Text Encoder (T5) ││ │ ││ Text Embeddings │└────────────────────────┬─────────────────────────────────────┘ │┌────────────────────────▼─────────────────────────────────────┐│ Transformer Decoder (LM) ││ Auto-regressively generates audio tokens ││ Using efficient token interleaving patterns │└────────────────────────┬─────────────────────────────────────┘ │┌────────────────────────▼─────────────────────────────────────┐│ EnCodec Audio Decoder ││ Converts tokens back to audio waveform │└──────────────────────────────────────────────────────────────┘Model variants
Section titled “Model variants”| Model | Size | Description | Use Case |
|---|---|---|---|
musicgen-small | 300M | Text-to-music | Quick generation |
musicgen-medium | 1.5B | Text-to-music | Balanced |
musicgen-large | 3.3B | Text-to-music | Best quality |
musicgen-melody | 1.5B | Text + melody | Melody conditioning |
musicgen-melody-large | 3.3B | Text + melody | Best melody |
musicgen-stereo-* | Varies | Stereo output | Stereo generation |
musicgen-style | 1.5B | Style transfer | Reference-based |
audiogen-medium | 1.5B | Text-to-sound | Sound effects |
Generation parameters
Section titled “Generation parameters”| Parameter | Default | Description |
|---|---|---|
duration | 8.0 | Length in seconds (1-120) |
top_k | 250 | Top-k sampling |
top_p | 0.0 | Nucleus sampling (0 = disabled) |
temperature | 1.0 | Sampling temperature |
cfg_coef | 3.0 | Classifier-free guidance |
MusicGen usage
Section titled “MusicGen usage”Text-to-music generation
Section titled “Text-to-music generation”from audiocraft.models import MusicGenimport torchaudio
model = MusicGen.get_pretrained('facebook/musicgen-medium')
# Configure generationmodel.set_generation_params( duration=30, # Up to 30 seconds top_k=250, # Sampling diversity top_p=0.0, # 0 = use top_k only temperature=1.0, # Creativity (higher = more varied) cfg_coef=3.0 # Text adherence (higher = stricter))
# Generate multiple samplesdescriptions = [ "epic orchestral soundtrack with strings and brass", "chill lo-fi hip hop beat with jazzy piano", "energetic rock song with electric guitar"]
# Generate (returns [batch, channels, samples])wav = model.generate(descriptions)
# Save eachfor i, audio in enumerate(wav): torchaudio.save(f"music_{i}.wav", audio.cpu(), sample_rate=32000)Melody-conditioned generation
Section titled “Melody-conditioned generation”from audiocraft.models import MusicGenimport torchaudio
# Load melody modelmodel = MusicGen.get_pretrained('facebook/musicgen-melody')model.set_generation_params(duration=30)
# Load melody audiomelody, sr = torchaudio.load("melody.wav")
# Generate with melody conditioningdescriptions = ["acoustic guitar folk song"]wav = model.generate_with_chroma(descriptions, melody, sr)
torchaudio.save("melody_conditioned.wav", wav[0].cpu(), sample_rate=32000)Stereo generation
Section titled “Stereo generation”from audiocraft.models import MusicGen
# Load stereo modelmodel = MusicGen.get_pretrained('facebook/musicgen-stereo-medium')model.set_generation_params(duration=15)
descriptions = ["ambient electronic music with wide stereo panning"]wav = model.generate(descriptions)
# wav shape: [batch, 2, samples] for stereoprint(f"Stereo shape: {wav.shape}") # [1, 2, 480000]torchaudio.save("stereo.wav", wav[0].cpu(), sample_rate=32000)Audio continuation
Section titled “Audio continuation”from transformers import AutoProcessor, MusicgenForConditionalGeneration
processor = AutoProcessor.from_pretrained("facebook/musicgen-medium")model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-medium")
# Load audio to continueimport torchaudioaudio, sr = torchaudio.load("intro.wav")
# Process with text and audioinputs = processor( audio=audio.squeeze().numpy(), sampling_rate=sr, text=["continue with a epic chorus"], padding=True, return_tensors="pt")
# Generate continuationaudio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=512)MusicGen-Style usage
Section titled “MusicGen-Style usage”Style-conditioned generation
Section titled “Style-conditioned generation”from audiocraft.models import MusicGen
# Load style modelmodel = MusicGen.get_pretrained('facebook/musicgen-style')
# Configure generation with stylemodel.set_generation_params( duration=30, cfg_coef=3.0, cfg_coef_beta=5.0 # Style influence)
# Configure style conditionermodel.set_style_conditioner_params( eval_q=3, # RVQ quantizers (1-6) excerpt_length=3.0 # Style excerpt length)
# Load style referencestyle_audio, sr = torchaudio.load("reference_style.wav")
# Generate with text + styledescriptions = ["upbeat dance track"]wav = model.generate_with_style(descriptions, style_audio, sr)Style-only generation (no text)
Section titled “Style-only generation (no text)”# Generate matching style without text promptmodel.set_generation_params( duration=30, cfg_coef=3.0, cfg_coef_beta=None # Disable double CFG for style-only)
wav = model.generate_with_style([None], style_audio, sr)AudioGen usage
Section titled “AudioGen usage”Sound effect generation
Section titled “Sound effect generation”from audiocraft.models import AudioGenimport torchaudio
model = AudioGen.get_pretrained('facebook/audiogen-medium')model.set_generation_params(duration=10)
# Generate various soundsdescriptions = [ "thunderstorm with heavy rain and lightning", "busy city traffic with car horns", "ocean waves crashing on rocks", "crackling campfire in forest"]
wav = model.generate(descriptions)
for i, audio in enumerate(wav): torchaudio.save(f"sound_{i}.wav", audio.cpu(), sample_rate=16000)EnCodec usage
Section titled “EnCodec usage”Audio compression
Section titled “Audio compression”from audiocraft.models import CompressionModelimport torchimport torchaudio
# Load EnCodecmodel = CompressionModel.get_pretrained('facebook/encodec_32khz')
# Load audiowav, sr = torchaudio.load("audio.wav")
# Ensure correct sample rateif sr != 32000: resampler = torchaudio.transforms.Resample(sr, 32000) wav = resampler(wav)
# Encode to tokenswith torch.no_grad(): encoded = model.encode(wav.unsqueeze(0)) codes = encoded[0] # Audio codes
# Decode back to audiowith torch.no_grad(): decoded = model.decode(codes)
torchaudio.save("reconstructed.wav", decoded[0].cpu(), sample_rate=32000)Common workflows
Section titled “Common workflows”Workflow 1: Music generation pipeline
Section titled “Workflow 1: Music generation pipeline”import torchimport torchaudiofrom audiocraft.models import MusicGen
class MusicGenerator: def __init__(self, model_name="facebook/musicgen-medium"): self.model = MusicGen.get_pretrained(model_name) self.sample_rate = 32000
def generate(self, prompt, duration=30, temperature=1.0, cfg=3.0): self.model.set_generation_params( duration=duration, top_k=250, temperature=temperature, cfg_coef=cfg )
with torch.no_grad(): wav = self.model.generate([prompt])
return wav[0].cpu()
def generate_batch(self, prompts, duration=30): self.model.set_generation_params(duration=duration)
with torch.no_grad(): wav = self.model.generate(prompts)
return wav.cpu()
def save(self, audio, path): torchaudio.save(path, audio, sample_rate=self.sample_rate)
# Usagegenerator = MusicGenerator()audio = generator.generate( "epic cinematic orchestral music", duration=30, temperature=1.0)generator.save(audio, "epic_music.wav")Workflow 2: Sound design batch processing
Section titled “Workflow 2: Sound design batch processing”import jsonfrom pathlib import Pathfrom audiocraft.models import AudioGenimport torchaudio
def batch_generate_sounds(sound_specs, output_dir): """ Generate multiple sounds from specifications.
Args: sound_specs: list of {"name": str, "description": str, "duration": float} output_dir: output directory path """ model = AudioGen.get_pretrained('facebook/audiogen-medium') output_dir = Path(output_dir) output_dir.mkdir(exist_ok=True)
results = []
for spec in sound_specs: model.set_generation_params(duration=spec.get("duration", 5))
wav = model.generate([spec["description"]])
output_path = output_dir / f"{spec['name']}.wav" torchaudio.save(str(output_path), wav[0].cpu(), sample_rate=16000)
results.append({ "name": spec["name"], "path": str(output_path), "description": spec["description"] })
return results
# Usagesounds = [ {"name": "explosion", "description": "massive explosion with debris", "duration": 3}, {"name": "footsteps", "description": "footsteps on wooden floor", "duration": 5}, {"name": "door", "description": "wooden door creaking and closing", "duration": 2}]
results = batch_generate_sounds(sounds, "sound_effects/")Workflow 3: Gradio demo
Section titled “Workflow 3: Gradio demo”import gradio as grimport torchimport torchaudiofrom audiocraft.models import MusicGen
model = MusicGen.get_pretrained('facebook/musicgen-small')
def generate_music(prompt, duration, temperature, cfg_coef): model.set_generation_params( duration=duration, temperature=temperature, cfg_coef=cfg_coef )
with torch.no_grad(): wav = model.generate([prompt])
# Save to temp file path = "temp_output.wav" torchaudio.save(path, wav[0].cpu(), sample_rate=32000) return path
demo = gr.Interface( fn=generate_music, inputs=[ gr.Textbox(label="Music Description", placeholder="upbeat electronic dance music"), gr.Slider(1, 30, value=8, label="Duration (seconds)"), gr.Slider(0.5, 2.0, value=1.0, label="Temperature"), gr.Slider(1.0, 10.0, value=3.0, label="CFG Coefficient") ], outputs=gr.Audio(label="Generated Music"), title="MusicGen Demo")
demo.launch()Performance optimization
Section titled “Performance optimization”Memory optimization
Section titled “Memory optimization”# Use smaller modelmodel = MusicGen.get_pretrained('facebook/musicgen-small')
# Clear cache between generationstorch.cuda.empty_cache()
# Generate shorter durationsmodel.set_generation_params(duration=10) # Instead of 30
# Use half precisionmodel = model.half()Batch processing efficiency
Section titled “Batch processing efficiency”# Process multiple prompts at once (more efficient)descriptions = ["prompt1", "prompt2", "prompt3", "prompt4"]wav = model.generate(descriptions) # Single batch
# Instead offor desc in descriptions: wav = model.generate([desc]) # Multiple batches (slower)GPU memory requirements
Section titled “GPU memory requirements”| Model | FP32 VRAM | FP16 VRAM |
|---|---|---|
| musicgen-small | ~4GB | ~2GB |
| musicgen-medium | ~8GB | ~4GB |
| musicgen-large | ~16GB | ~8GB |
Common issues
Section titled “Common issues”| Issue | Solution |
|---|---|
| CUDA OOM | Use smaller model, reduce duration |
| Poor quality | Increase cfg_coef, better prompts |
| Generation too short | Check max duration setting |
| Audio artifacts | Try different temperature |
| Stereo not working | Use stereo model variant |
References
Section titled “References”- Advanced Usage - Training, fine-tuning, deployment
- Troubleshooting - Common issues and solutions
Resources
Section titled “Resources”- GitHub: https://github.com/facebookresearch/audiocraft
- Paper (MusicGen): https://arxiv.org/abs/2306.05284
- Paper (AudioGen): https://arxiv.org/abs/2209.15352
- HuggingFace: https://huggingface.co/facebook/musicgen-small
- Demo: https://huggingface.co/spaces/facebook/MusicGen