musicgen — synthetic music dataset generator

A Python library and CLI for generating reproducible, fully-annotated synthetic music datasets for ML/MIR research. Each sample is a complete training example: mixed audio, per-layer stems, per-layer MIDI, and a rich JSON annotation with every musical and synthesis parameter.

Suitable for training models that learn music tagging, source separation, beat/tempo/downbeat detection, and audio→MIDI transcription at the 1k–10k sample scale.

Versions

Version	What shipped
v0.8	Soundfont license audit (default SF2 → FluidR3_GM MIT); sharded layout `--shard-width 3` for 100k+ datasets; SF2 pool expansion (GeneralUserGS/MuseScoreGeneral/SGM-V2); `rock` genre; beat-pattern coverage for all time sigs across all genres; `chord_type_hard_filter` set for classical/pop; `measures_per_part_override`; `create_genre.py` wizard
v0.7	Dataset export (`musicgen export`/`stats`); quality pipeline (`musicgen score`/`filter`); eval CLI (`musicgen eval reliability`/`validity`); neural tests
v0.6	Asset downloader — `musicgen download-assets` bootstraps SF2 soundfonts and MIDI corpora from open sources; `assets.toml` registry with checksums and license metadata
v0.5	ML-assisted generators — LSTM chord/melody models trained on self-generated corpus; `extract-sequences` + `train` CLI
v0.4	Sample composition — real audio samples alongside/substituting FluidSynth layers; `musicality` standalone package
v0.3	Higher-order Markov — 2nd-order chords, two-layer quality gate, calibration harness
v0.2	Genre system — 8 built-in genres, `GenreSpec` composition engine, extended chord vocabulary
v0.1	Initial release — single-sample API, parallel batch, full CLI, determinism contract

Quick start

# 1. Clone and install
git clone https://github.com/dobidu/layered_music_gen.git
cd layered_music_gen
python -m venv .venv && source .venv/bin/activate
pip install -e '.[dev]'

# 2. Install FluidSynth (system dependency)
sudo apt-get install fluidsynth      # Ubuntu/Debian
# brew install fluidsynth            # macOS

# 3. Download default soundfonts  ← required before generating (FluidR3_GM, MIT, ~141 MB)
musicgen download-assets --sf2

# 4. Generate a dataset
musicgen generate --seed 42 --count 10 --out ./dataset

# 5. Explore the output
ls dataset/000000/
# mix.wav  sample.json  stems/  midi/

The dataset root and each sample directory are laid out as follows:

dataset/
├── manifest.jsonl
└── 000000/
    ├── sample.json      # full annotation (written last — completion sentinel)
    ├── mix.wav          # final mixed audio
    ├── stems/           # post-FX per-layer WAV stems
    │   ├── beat.wav
    │   ├── melody.wav
    │   ├── harmony.wav
    │   └── bassline.wav
    └── midi/            # per-layer MIDI (concatenated across all song parts)
        ├── beat.mid
        ├── melody.mid
        ├── harmony.mid
        └── bassline.mid

For 100k+ samples use --shard-width 3 (see Sharded layout).

Installation

Base (required)

pip install -e .

Requires Python ≥ 3.10 and FluidSynth on PATH:

sudo apt-get install fluidsynth   # Ubuntu/Debian
brew install fluidsynth           # macOS
# Windows: https://www.fluidsynth.org/

Optional extras

pip install -e '.[samples]'   # real audio sample composition (v0.4)
pip install -e '.[neural]'    # LSTM chord/melody backends (v0.5)
pip install -e '.[dev]'       # test suite

Asset management (v0.6)

musicgen ships an asset registry (assets.toml) and a downloader that bootstraps soundfonts and MIDI corpora from free/open sources. No manual file hunting.

Soundfonts

# Download default SF2 (FluidR3_GM, 141 MB, MIT — placed in all sf/<layer>/ dirs)
musicgen download-assets --sf2

# List all available sources
musicgen download-assets --list

# Download a specific opt-in source by name
musicgen download-assets --name GeneralUserGS    # ~31 MB, melody/harmony layers
musicgen download-assets --name TimGM6mb         # ~5.7 MB, all layers (GPL-2.0, see note)

# Re-download (overwrite)
musicgen download-assets --sf2 --force

Default SF2 sources (included in --sf2):

Name	Size	Layers	License	Dataset-safe?
FluidR3_GM	~141 MB	all	MIT	Yes

Opt-in SF2 sources (use --name):

Name	Size	Layers	License	Dataset-safe?
GeneralUserGS	~31 MB	all	custom-permissive	Yes
MuseScoreGeneral	~206 MB	all	MIT	Yes
SGM-V2	~236 MB	all	CC-BY 3.0	Yes — with attribution
TimGM6mb	~5.7 MB	all	GPL-2.0	Only if dataset is GPL-compatible

GPL notice: Audio rendered with TimGM6mb is a derivative work under GPL-2.0. Do not use it when building datasets intended for public distribution unless your dataset license is GPL-compatible. Use FluidR3_GM (default, MIT) instead. See LICENSES.soundfonts.md for full details.

MIDI corpora

Downloaded MIDI files land in midi_assets/<layer>/ and can be indexed with musicgen index-midi.

# Download default MIDI corpora (GrooveMIDI + FreeMidiChords)
musicgen download-assets --midi

# Opt-in large corpus
musicgen download-assets --name LakhCleanMIDI    # 223 MB, CC-BY-4.0

Default MIDI sources (included in --midi):

Name	Files	Layers	License
GrooveMIDI	1,150 drum files	beat	CC-BY-4.0
FreeMidiChords	11,400 progressions	harmony	MIT

Opt-in MIDI sources:

Name	Contents	Layers	License
FannonChords	~400 chord shapes	harmony	MIT
FannonScales	~300 scale patterns	melody	MIT
LakhCleanMIDI	17k multi-track	melody, harmony, bassline	CC-BY-4.0

Auto-download

When auto_download_sf2 = true (default), musicgen silently downloads default SF2 sources on the first generate call if any layer pool is empty. Disable with:

export MUSICGEN_AUTO_DOWNLOAD_SF2=false

Adding custom sources

To add a source, append an entry to assets.toml. Fill in url and compute the checksum with sha256sum <file>:

[[sf2]]
name         = "MySF2"
description  = "My custom soundfont"
url          = "https://example.com/my.sf2"
sha256       = "abc123..."          # sha256sum my.sf2
filename     = "my.sf2"
size_hint_mb = 20
license      = "MIT"
license_url  = "https://example.com/LICENSE"
layers       = ["melody", "harmony"]
default      = false
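
On systems without `sha256sum`, the same checksum can be computed with Python's standard library. This is a minimal sketch; the helper name `file_sha256` is illustrative and not part of musicgen.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Paste the returned string into the `sha256` field of the new entry.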

CLI reference

`musicgen generate`

musicgen generate --seed SEED [options]

Option	Default	Description
`--seed / -s`	(required)	Global RNG seed.
`--count / -n`	`1`	Number of samples to generate.
`--out / -o`	`./dataset`	Output directory.
`--workers / -w`	all cores	Parallel workers.
`--output-mode / -m`	`full`	`full` | `mix-only` | `stems-only` | `midi-only`
`--genre / -g`	None	Genre preset (repeatable for composition).
`--genres-dir`	`<repo>/genres`	Custom genres directory.
`--min-musicality-score`	—	Via `Config`; see Musicality scoring.
`--verbose / -v`	—	DEBUG logging.
`--quiet / -q`	—	ERROR-only logging.

Sample composition options (requires musicgen[samples]):

Option	Default	Description
`--sample-db PATH`	None	SampleManager JSON library. Enables sample composition.
`--sample-beat MODE`	`alongside`	`alongside` | `substitution` | `adlib` | `off`
`--sample-bassline MODE`	`alongside`	Same modes.
`--sample-melody MODE`	`off`	Same modes.
`--sample-harmony MODE`	`off`	Same modes.
`--sample-gain DB`	`-3.0`	Gain applied to all sample layers.
`--sample-min-score FLOAT`	`0.0`	Min musicality score for sample selection.

Neural backend options (requires musicgen[neural]):

Option	Default	Description
`--chord-backend`	`markov`	`markov` | `neural`
`--melody-backend`	`markov`	`markov` | `neural`
`--models-dir PATH`	`<repo>/models`	Directory with trained `.pt` files.

Layout options (v0.8):

Option	Default	Description
`--shard-width INT`	`0`	Shard prefix length. `0` = flat; `3` = `<root>/000/000042/` for ≤1M samples.

`musicgen export` / `musicgen stats`

Export a generated dataset to JSONL, CSV, Parquet, or HuggingFace AudioFolder.

musicgen export ./dataset --out dataset.jsonl           # JSONL (default)
musicgen export ./dataset --out dataset.csv --fmt csv
musicgen export ./dataset --out dataset.parquet --fmt parquet   # requires musicgen[export]
musicgen export ./dataset --out hf/ --fmt hf            # HuggingFace audiofolder + metadata.jsonl
musicgen export ./dataset --out dataset.jsonl --relative  # paths relative to dataset root

musicgen stats ./dataset                                # distribution summary (text)
musicgen stats ./dataset --fmt json                     # machine-readable

Option	Default	Description
`--out / -o`	(required)	Output path (file for JSONL/CSV/Parquet, dir for HF).
`--fmt`	`jsonl`	`jsonl` | `csv` | `parquet` | `hf`
`--relative / -r`	`False`	Store paths relative to dataset root.
`--no-midi`	`False`	Omit MIDI path columns from output.

Both commands auto-detect flat and sharded dataset layouts.

`musicgen score` / `musicgen filter`

Post-hoc quality scoring and filtering for generated datasets.

# Score all unscored samples; updates sample.json atomically
musicgen score ./dataset

# Re-score even already-scored samples
musicgen score ./dataset --force

# Filter: move samples below threshold to a reject directory
musicgen filter ./dataset --min-score 0.6
musicgen filter ./dataset --min-score 0.6 --reject-dir ./bad
musicgen filter ./dataset --min-score 0.6 --dry-run     # preview without moving

Option	Default	Description
`--force`	`False`	(`score`) Re-score even samples that already have a score.
`--min-score FLOAT`	(required)	(`filter`) Samples below this are moved to reject dir.
`--reject-dir PATH`	`<dataset>/rejected`	(`filter`) Destination for rejected samples.
`--dry-run`	`False`	(`filter`) Report what would happen without moving anything.

`musicgen eval`

Measure scorer reliability and construct validity.

musicgen eval reliability --type det     # determinism: same input → same score
musicgen eval reliability --type rinv    # rank-invariance: monotone transform preserves order
musicgen eval reliability --type seed    # seed stability: score variance across seeds

musicgen eval validity --mode both       # AUROC ≥ 0.80 criterion (8 pathologies)
musicgen eval validity --mode good       # good-set statistics only

Option	Default	Description
`--type`	`det`	(`reliability`) `det` | `rinv` | `seed`
`--n-samples`	`10`	Samples per test.
`--mode`	`both`	(`validity`) `good` | `bad` | `both`
`--bootstrap-n`	`200`	Bootstrap iterations for AUROC CI.
`--output PATH`	None	Write JSON result to file.

Both commands exit non-zero when the criterion fails.

`musicgen download-assets`

musicgen download-assets [--sf2] [--midi] [--all] [--name NAME] [--list] [--force]

Option	Description
`--sf2`	Download all default SF2 sources.
`--midi`	Download all default MIDI corpus sources.
`--all`	Download all default sources (SF2 + MIDI).
`--name NAME`	Download a specific source by name (ignores `default` flag).
`--list / -l`	List all sources with URLs, layers, and license.
`--force / -f`	Re-download even if files already exist.

`musicgen samples build`

Build a SampleManager library from a directory of audio files.

musicgen samples build --dir ./drums --output drums.json --musicality
musicgen samples build --dir ./loops --output loops.json \
    --category bass --genre electronic --recursive

Option	Default	Description
`--dir / -d`	(required)	Audio files directory (WAV/FLAC/OGG/AIF).
`--output / -o`	(required)	Output SampleManager JSON path.
`--category`	auto	Force category: `beat` | `bass` | `melody` | `harmony`.
`--genre TAG`	None	Genre tag applied to all samples (repeatable).
`--mood TAG`	None	Mood tag (repeatable).
`--tag TAG`	None	Extra tag (repeatable).
`--musicality`	`False`	Score samples with `musicality.explain()`.
`--recursive / -r`	`False`	Walk subdirectories.

Category is inferred from filename keywords when --category is not set:

Category	Keywords
`beat`	beat, kick, hat, snare, drum, perc, clap, hh, hihat
`bass`	bass, sub
`harmony`	pad, chord, harm, atmo, ambient, strings, vox, choir, keys, piano, organ
`melody`	lead, melody, lick, riff, synth, arp, melo, hook (default fallback)
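
The keyword table above can be read as a simple lookup. The sketch below is one possible interpretation, not the actual `sample_builder` code: substring matching on the lowercased file stem and the beat, bass, harmony, melody checking order are both assumptions.

```python
from pathlib import Path

# Keywords copied from the table above; the checking order is an assumption.
CATEGORY_KEYWORDS = {
    "beat": ["beat", "kick", "hat", "snare", "drum", "perc", "clap", "hh", "hihat"],
    "bass": ["bass", "sub"],
    "harmony": ["pad", "chord", "harm", "atmo", "ambient", "strings", "vox",
                "choir", "keys", "piano", "organ"],
    "melody": ["lead", "melody", "lick", "riff", "synth", "arp", "melo", "hook"],
}


def infer_category(filename):
    """Guess a category from substrings of the lowercased file stem."""
    stem = Path(filename).stem.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in stem for keyword in keywords):
            return category
    return "melody"  # documented default fallback


print(infer_category("Kick_01.wav"))   # beat
print(infer_category("warm_pad.wav"))  # harmony
```

Substring matching is permissive (a short keyword like `hat` can match unrelated names), so `--category` remains the safer choice for ambiguous file names.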

`musicgen index-midi`

Index generated MIDI files into a MidiManager database (requires midi_file_manager).

musicgen index-midi --dataset ./dataset --out ./midi_db.json [--csv ./midi_db.csv]

Option	Default	Description
`--dataset / -d`	(required)	musicgen dataset root.
`--out / -o`	`./midi_db.json`	Output database path.
`--midi-dir`	None	Base dir for relative MIDI paths in the db.
`--csv`	None	Also export a CSV.

`musicgen index-audio`

Index generated WAV stems into a SampleManager database (requires audio_sample_manager).

musicgen index-audio --dataset ./dataset --out ./audio_db.json [--csv ./audio_db.csv]

Option	Default	Description
`--dataset / -d`	(required)	musicgen dataset root.
`--out / -o`	`./audio_db.json`	Output database path.
`--samples-dir`	None	Base dir for relative WAV paths in the db.
`--csv`	None	Also export a CSV.

Other commands

musicgen list-genres [--genres-dir DIR]   # list available genre presets
musicgen calibrate [-v]                   # measure FluidSynth pre-roll offset (run once per machine)
musicgen clean --failed [--out DIR]       # remove partial sample directories

Genre system (v0.2)

musicgen ships 9 built-in genre presets that constrain generation parameters. Genres are composable — specify multiple to merge their constraints.

musicgen generate --seed 42 --genre jazz
musicgen generate --seed 42 --genre jazz --genre latin   # composition
musicgen list-genres

Genre	Tempo	Swing	Time sigs	Style
`jazz`	80–200 BPM	0.60–0.75	4/4, 3/4, 6/8, 12/8	Swing-heavy, maj7/m7 chords
`hip-hop`	70–110 BPM	0.50–0.65	4/4 dominant	Heavy kick-snare, minor-key bias
`blues`	60–140 BPM	0.55–0.70	4/4, 6/8, 12/8	Dominant 7ths, shuffle feel
`pop`	90–140 BPM	0.50–0.55	4/4 dominant	Clean patterns, major-key bias
`electronic`	110–160 BPM	0.50–0.55	4/4 dominant	Four-on-floor, synth layers
`latin`	90–140 BPM	0.50–0.60	4/4, 3/4, 6/8	Clave syncopation, conga patterns
`reggae`	60–90 BPM	0.50–0.58	4/4 dominant	One-drop + steppers patterns, bass-heavy
`classical`	50–160 BPM	0.50–0.52	4/4, 3/4, 2/4, 5/4, 6/8, 12/8	Wide dynamics, orchestral timbres
`rock`	70–180 BPM	0.50–0.57	4/4, 3/4, 6/8, 12/8	Strong backbeat, power chords, guitar-driven

Genre constraints applied per parameter type:

Tempo/swing — hard bounds: drawn value clamped to [min, max]
Time signature — soft weights: shifts draw probabilities
Key/scale, chord type, inversions — soft weight dicts + optional hard filter
Chord type hard filter — when set, restricts the allowed chord vocabulary entirely (e.g. classical blocks sus/add9; jazz blocks plain triads)
Drum patterns — per-time-sig patterns_*.txt files; each genre ships patterns for every time sig it uses
FX profile — multiplies effect probabilities
Soundfonts — per-layer tag overrides when SoundfontManager is active
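
The three constraint styles listed above (hard bounds, soft weights, hard filter) can be sketched in a few lines. This is an illustration only: the function names and the example weights are hypothetical, and the real logic lives in `sampler.py` and `genre.py`.

```python
import random


def clamp(value, low, high):
    """Hard bound: pull a drawn value back into [low, high]."""
    return max(low, min(high, value))


def draw_weighted(rng, weights):
    """Soft weights: bias the draw without forbidding any listed option."""
    options = sorted(weights)  # sorted so the order is stable across runs
    return rng.choices(options, weights=[weights[o] for o in options], k=1)[0]


def hard_filter(vocabulary, blocked):
    """Hard filter: remove blocked items from the vocabulary entirely."""
    return [item for item in vocabulary if item not in blocked]


rng = random.Random(42)
tempo = clamp(rng.uniform(40.0, 240.0), 80.0, 200.0)  # jazz tempo bounds from the table
time_sig = draw_weighted(rng, {"4/4": 0.6, "3/4": 0.2, "6/8": 0.1, "12/8": 0.1})  # made-up weights
chords = hard_filter(["maj", "min", "sus4", "add9"], {"sus4", "add9"})  # e.g. classical
```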

Genre wizard

create_genre.py is an interactive terminal wizard for authoring new genre configurations:

python create_genre.py                   # guided wizard, start fresh
python create_genre.py --from rock       # clone an existing genre as defaults
python create_genre.py --list            # list all genres with their files
python create_genre.py --midi            # MIDI drum note reference table

The wizard walks through all spec.json fields, auto-normalizes weight dicts, generates starter beat-pattern files (choose a style: backbeat / swing / electronic / one-drop / minimal), and optionally installs a chord-transition Markov matrix.

See genres/README.md for the full spec.json format and how to write custom genres.

Output format

Directory layout

Flat layout (default, --shard-width 0):

<dataset_root>/
├── manifest.jsonl                  # one append-per-sample log
└── 000042/
    ├── sample.json                 # full annotation — written LAST (completion sentinel)
    ├── mix.wav
    ├── stems/
    │   ├── beat.wav
    │   ├── melody.wav
    │   ├── harmony.wav
    │   └── bassline.wav
    └── midi/
        ├── beat.mid
        ├── melody.mid
        ├── harmony.mid
        └── bassline.mid

Sharded layout (--shard-width 3, recommended for 100k+ samples):

<dataset_root>/
├── manifest.jsonl
├── 000/
│   ├── 000000/          # shard prefix = first 3 chars of zero-padded index
│   │   ├── sample.json
│   │   └── ...
│   └── 000042/
│       └── ...
└── 001/
    └── 001000/
        └── ...

Use sample_dir_path(dataset_root, index, shard_width) from config.py to compute paths. Export, quality pipeline, and manifest traversal all auto-detect the layout.
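
For readers who want to see the path arithmetic without importing the project, here is a stand-alone sketch of what such a helper might compute. It is not the real `sample_dir_path`; the name `sketch_sample_dir_path` and the six-digit zero padding (inferred from the trees above) are assumptions.

```python
from pathlib import PurePosixPath


def sketch_sample_dir_path(dataset_root, index, shard_width=0, index_digits=6):
    """Build <root>/<index> (flat) or <root>/<prefix>/<index> (sharded)."""
    name = f"{index:0{index_digits}d}"      # e.g. 42 -> "000042"
    root = PurePosixPath(dataset_root)
    if shard_width <= 0:
        return root / name
    return root / name[:shard_width] / name  # prefix = leading characters of the name


print(sketch_sample_dir_path("dataset", 42))       # dataset/000042
print(sketch_sample_dir_path("dataset", 1000, 3))  # dataset/001/001000
```

With a three-character prefix and six-digit names, each shard directory holds at most 1,000 samples.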

sample.json is always written last. Its presence means the sample is complete. Re-running generate() with the same (global_seed, sample_index) skips work when this sentinel exists.

`sample.json` schema

Every sample carries:

Identity: seed, musicgen_version, fluidsynth_version
Musical params: key, mode, tempo_bpm, time_signature, swing, duration_seconds
Structure: song_arrangement ([{part, start_seconds, end_seconds}])
Per-part: chord_progression, active_layers, soundfonts, fx_params, time_signatures_per_part, measures_per_part
Annotations: beat_times, downbeat_times (seconds, swing-aware from MIDI ticks)
Quality: musicality_score (tempo 30%, harmony 30%, rhythm 25%, noise 15%, with render-integrity penalty)
Routing: split (train / valid / test, deterministic from seed)
Paths: mix, stems.*, midi.* (relative to sample dir)
Sample composition: used_samples (when --sample-db is active)

sample.json is serialized with sort_keys=True — byte-identical re-runs are detectable via SHA-256 without parsing.
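
The effect of `sort_keys=True` can be seen with the standard library alone. In this sketch the dictionaries are made-up fragments, and any other serialization settings musicgen uses (indentation, separators) are not modeled.

```python
import hashlib
import json


def canonical_sha256(annotation):
    """Hash a dict after serializing it with sorted keys."""
    payload = json.dumps(annotation, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


a = {"tempo_bpm": 120, "key": "A", "mode": "minor"}
b = {"mode": "minor", "key": "A", "tempo_bpm": 120}  # same content, different insertion order

print(canonical_sha256(a) == canonical_sha256(b))  # True
```

Without `sort_keys=True`, the two dictionaries would serialize to different byte strings and therefore hash differently.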

`manifest.jsonl`

One JSON object per sample: sample_index, seed, status (ok/failed), split, path, musicality_score, duration_seconds, attempt, wrote_at.
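
Because the manifest is plain JSONL, it can be summarized without musicgen. The sketch below counts statuses and splits; the example records are invented and use only a subset of the fields listed above. It does not deduplicate repeated entries for the same `sample_index`.

```python
import json
from collections import Counter


def summarize_manifest(lines):
    """Count statuses, and splits of successful samples, from JSONL lines."""
    status_counts = Counter()
    split_counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        status_counts[record["status"]] += 1
        if record["status"] == "ok":
            split_counts[record["split"]] += 1
    return status_counts, split_counts


example_lines = [
    '{"sample_index": 0, "seed": 42, "status": "ok", "split": "train"}',
    '{"sample_index": 1, "seed": 42, "status": "failed", "split": "valid"}',
    '{"sample_index": 2, "seed": 42, "status": "ok", "split": "test"}',
]
statuses, splits = summarize_manifest(example_lines)
```

For a real dataset, pass an open file handle instead of the list, for example `summarize_manifest(open("dataset/manifest.jsonl"))`.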

Musicality scoring and quality gate (v0.3)

The quality gate rejects samples that would contaminate a training distribution. It is not designed to rank good music against very good music.

Two-layer architecture

Layer 1 — symbolic (pre-render, < 5 ms). check_midi_quality(midi_paths, key) runs hard checks (empty layer, stuck pitch > 80%, extreme pitch range > 36 semitones) and soft metrics on the melody (Krumhansl–Schmuckler key-profile correlation, scale adherence, melodic step fraction, n-gram entropy, LZ compression ratio). Failing hard checks → score 0.0, no render.
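
The hard checks can be illustrated on a plain list of MIDI pitch numbers. This is a sketch, not `check_midi_quality` itself: the function name is hypothetical, and reading "stuck pitch" as "the most common pitch accounts for more than 80% of notes" is an assumption.

```python
from collections import Counter


def hard_check(pitches, max_stuck_fraction=0.80, max_range_semitones=36):
    """Return the name of the first failed hard check, or None if all pass."""
    if not pitches:
        return "empty_layer"
    most_common_count = Counter(pitches).most_common(1)[0][1]
    if most_common_count / len(pitches) > max_stuck_fraction:
        return "stuck_pitch"
    if max(pitches) - min(pitches) > max_range_semitones:
        return "extreme_range"
    return None


print(hard_check([60, 62, 64, 65, 67]))  # None
print(hard_check([60] * 9 + [62]))       # stuck_pitch
```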

Layer 2 — audio integrity (post-render). get_musicality_score(filename) applies a render-integrity penalty (clipping, silence, DC offset) to a weighted musical analysis (tempo stability/clarity 30%, harmony KS correlation 30%, rhythm regularity/strength 25%, noise/spectral 15%).
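
The weighting can be written out as a short calculation. The weights below come from the text; treating the render-integrity penalty as a multiplicative factor in [0, 1] is an assumption of this sketch, and `combine_scores` is a hypothetical name.

```python
WEIGHTS = {"tempo": 0.30, "harmony": 0.30, "rhythm": 0.25, "noise": 0.15}


def combine_scores(components, integrity_penalty=1.0):
    """Weighted sum of component scores in [0, 1], scaled by a penalty factor."""
    weighted = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return max(0.0, min(1.0, weighted * integrity_penalty))


example = {"tempo": 0.8, "harmony": 0.6, "rhythm": 0.4, "noise": 1.0}
score = combine_scores(example)                 # 0.24 + 0.18 + 0.10 + 0.15
penalized = combine_scores(example, 0.5)        # half of the score above
```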

Quality-gate loop

from musicgen import generate, Config

result = generate(Config(
    global_seed=42,
    sample_index=0,
    dataset_root="./dataset",
    min_musicality_score=0.6,   # reject below 0.6; 0.0 = disabled
    max_attempts=3,             # re-roll up to 3x with distinct seeds
))
print(result.attempt)           # which attempt was accepted (1, 2, or 3)
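
Conceptually, the gate is a bounded retry loop. The sketch below shows the control flow with a stand-in scorer. How musicgen actually derives per-attempt seeds and what it reports when every attempt fails are not specified here, so `attempt_seed` and the `(None, best)` return value are assumptions.

```python
import hashlib


def attempt_seed(sample_seed, attempt):
    """Derive a distinct seed per attempt (hypothetical derivation)."""
    digest = hashlib.sha256(f"{sample_seed}:{attempt}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def quality_gate(sample_seed, score_fn, min_score=0.6, max_attempts=3):
    """Return (attempt, score) for the first accepted attempt, else (None, best)."""
    best = 0.0
    for attempt in range(1, max_attempts + 1):
        score = score_fn(attempt_seed(sample_seed, attempt))
        best = max(best, score)
        if score >= min_score:
            return attempt, score
    return None, best
```

Here `score_fn` stands in for "generate with this seed, then score the result".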

Standalone musicality package

pip install -e '.[samples]'   # musicality is bundled in src/musicality/

musicality score  ./mix.wav
musicality explain ./mix.wav
musicality batch  ./dataset/**/*.wav --output scores.csv

See docs/musicality-scoring.md for metric derivations and literature references.

Neural backends (v0.5)

Replace Markov matrices with small LSTMs trained on a self-generated corpus.

Install

pip install -e '.[neural]'   # requires torch >= 2.0

Workflow

# 1. Generate a training corpus (MIDI-only is fast)
musicgen generate --count 500 --seed 1 --out ./corpus --output-mode midi-only

# 2. Extract chord/melody sequences
musicgen extract-sequences --dataset ./corpus --output sequences.json

# 3. Train models
musicgen train --sequences sequences.json --layer chord --output-dir ./models
musicgen train --sequences sequences.json --layer melody --output-dir ./models

# Genre-specific models take precedence at inference
musicgen train --sequences sequences.json --layer chord --genre jazz --output-dir ./models

# 4. Generate with neural backends
musicgen generate --count 32 --seed 1 --out ./dataset \
    --chord-backend neural --melody-backend neural \
    --models-dir ./models

models_dir lookup order: chord_{genre}.pt → chord.pt. Missing file → Markov fallback with warning.
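
The lookup order can be sketched as follows. The function name is hypothetical, and applying the same `{layer}_{genre}.pt` naming to the melody layer is an assumption; returning `None` stands in for the Markov fallback.

```python
from pathlib import Path


def resolve_model_path(models_dir, layer, genre=None):
    """Return the first existing checkpoint, or None to signal Markov fallback."""
    candidates = []
    if genre:
        candidates.append(Path(models_dir) / f"{layer}_{genre}.pt")  # genre-specific first
    candidates.append(Path(models_dir) / f"{layer}.pt")              # then the generic model
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```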

Model sizes

Model	Params	Architecture
ChordLSTM	~35 K	2-layer LSTM, hidden=64, genre one-hot conditioning
MelodyLSTM	~10 K	2-layer LSTM, hidden=32

Determinism

The determinism contract is preserved: the forward pass is a pure function (fixed weights and a fixed input always yield the same logits), and sampling uses rng.choices(tokens, weights=softmax(logits)) on the same seeded random.Random instance used by the Markov path.
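
The sampling step can be reproduced with the standard library. In this sketch the token list and logits are made-up stand-ins for a model's output; only the `rng.choices(tokens, weights=softmax(logits))` pattern comes from the text.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(seed, tokens, logits, length=8):
    """Draw tokens with a dedicated seeded RNG; global random state is untouched."""
    rng = random.Random(seed)
    return [rng.choices(tokens, weights=softmax(logits))[0] for _ in range(length)]


tokens = ["I", "IV", "V", "vi"]
logits = [2.0, 1.0, 1.5, 0.5]  # stand-in for a model's output
print(sample_sequence(42, tokens, logits) == sample_sequence(42, tokens, logits))  # True
```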

See docs/neural-generators.md for model architecture, sequences.json schema, and training hyperparameters.

Sample composition (v0.4)

Mix real audio samples alongside or instead of FluidSynth-rendered layers.

Install

pip install -e '.[samples]'   # audio-sample-manager, soundfile, rubberband-stretch

Workflow

# 1. Build a sample library
musicgen samples build --dir ./my_drums --output drums.json --musicality
musicgen samples build --dir ./bass_loops --output bass.json \
    --category bass --genre electronic --recursive

# 2. Generate with sample composition
musicgen generate --seed 42 --count 10 --out ./dataset \
    --sample-db drums.json \
    --sample-beat alongside \
    --sample-bassline alongside \
    --sample-gain -6 \
    --sample-min-score 0.65

Mixing modes

Mode	Behaviour
`alongside`	Sample overlaid on the FluidSynth-rendered mix (additive).
`substitution`	Sample replaces the FluidSynth stem before mixing.
`adlib`	One-shot placed at a specific beat offset. Requires `oneshot_at_beat` in Python API.
`off`	FluidSynth only (default for melody and harmony).
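
The difference between the modes comes down to simple sample arithmetic. The sketch below works on plain lists of floats and applies the rule per layer, which simplifies what `sample_mixer.py` does (no stretching, tiling, or key shifting); `adlib` is omitted and the function names are hypothetical.

```python
def db_to_gain(db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def mix_layer(synth_stem, sample, mode, gain_db=-3.0):
    """Combine a rendered stem with a sample according to the mixing mode."""
    scaled = [db_to_gain(gain_db) * s for s in sample]
    if mode == "off":
        return list(synth_stem)                              # FluidSynth only
    if mode == "substitution":
        return scaled                                        # sample replaces the stem
    if mode == "alongside":
        return [a + b for a, b in zip(synth_stem, scaled)]   # additive overlay
    raise ValueError(f"unsupported mode: {mode}")
```

A gain of -6 dB roughly halves the sample amplitude before it is added.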

Python API

from musicgen import generate, Config
from musicgen.sample_composition import SampleLayerRule, SampleCompositionConfig

cfg = Config(
    global_seed=42,
    sample_index=0,
    dataset_root="./dataset",
    sample_composition=SampleCompositionConfig(
        sample_db_path="./library.json",
        layer_rules={
            "beat": SampleLayerRule(
                layer="beat",
                mode="alongside",
                gain_db=-6.0,
                max_bpm_stretch_pct=15.0,
                min_musicality_score=0.65,
                genre=["hip-hop"],
            ),
            "bassline": SampleLayerRule(
                layer="bassline",
                mode="substitution",
                gain_db=-3.0,
            ),
        },
        global_min_musicality=0.50,
        allow_transposition=True,
        allow_time_stretching=True,
    ),
)
result = generate(cfg)

See docs/sample-composition.md for full reference.

Optional integrations (v0.2)

All three integrations are opt-in with zero new hard dependencies. Each package is lazy-imported; a clear ImportError with an install hint is raised when absent.

SoundfontManager — tag-based soundfont selection

pip install git+https://github.com/dobidu/soundfont_manager

Replaces blind rng.choice(os.listdir(...)) with metadata-aware tag-based selection from a SoundfontManager JSON database.

from musicgen import generate, Config

result = generate(Config(
    global_seed=42,
    sample_index=0,
    dataset_root="./dataset",
    soundfont_manager_db="/path/to/soundfonts.json",
    soundfont_manager_sf_dir="/path/to/sf2/files",
))

Layer → tag mapping: beat → ["drums", "percussion"], melody → ["melody", "lead", "piano", "strings"], harmony → ["harmony", "chords", "pads"], bassline → ["bass"].

Fallback: any error or empty tag result → sorted directory scan.

MIDI indexer

pip install git+https://github.com/dobidu/midi_file_manager

musicgen index-midi --dataset ./dataset --out ./midi_db.json

Indexes all generated MIDI files into a MidiManager database with ground-truth musicgen metadata (tempo_bpm, key, time_signature, split, musicality_score).

Audio indexer

pip install git+https://github.com/dobidu/audio_sample_manager

musicgen index-audio --dataset ./dataset --out ./audio_db.json

Indexes generated WAV stems into a SampleManager database alongside external audio libraries — enables unified cross-library queries (e.g., "all bass stems at 90 BPM in A minor").

Library API

from musicgen import generate, generate_batch, Config, SampleResult, BatchResult

# Single sample
result = generate(Config(global_seed=42, sample_index=0, dataset_root="./dataset"))
print(result.sample_dir)        # "./dataset/000000"
print(result.split)             # "train" | "valid" | "test"
print(result.musicality_score)  # float
print(result.status)            # "ok" | "failed"

# Batch
result = generate_batch(Config(global_seed=1, count=32, dataset_root="./dataset", workers=4))
print(result.succeeded, result.failed, result.skipped)

Re-running with the same (global_seed, sample_index) short-circuits when sample.json exists — batches are idempotent.

Configuration reference

Config is a @dataclass with three precedence layers: CLI args > env vars > defaults.
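
The precedence rule can be pictured as a three-step lookup. This sketch is not the `Config` implementation; `resolve_setting` is a hypothetical helper, and only the env var name is taken from the tables below.

```python
import os


def resolve_setting(cli_value, env_name, default, env=None, cast=str):
    """Pick a value with CLI > environment > default precedence."""
    env = os.environ if env is None else env
    if cli_value is not None:
        return cli_value            # an explicit CLI argument wins
    if env_name in env:
        return cast(env[env_name])  # then the environment variable
    return default                  # finally the built-in default


fake_env = {"MUSICGEN_COUNT": "10"}
print(resolve_setting(5, "MUSICGEN_COUNT", 1, env=fake_env, cast=int))     # 5
print(resolve_setting(None, "MUSICGEN_COUNT", 1, env=fake_env, cast=int))  # 10
```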

Core fields

Field	Default	Env var	Notes
`global_seed`	None	—	Required at generate time.
`sample_index`	`0`	—	Per-sample identity within dataset.
`dataset_root`	`<repo>/dataset`	`MUSICGEN_DATASET_ROOT`	Output directory.
`count`	`1`	`MUSICGEN_COUNT`	Samples per `generate_batch`.
`workers`	None (all cores)	—	`generate_batch` parallelism.
`output_mode`	`"full"`	`MUSICGEN_OUTPUT_MODE`	`full` / `mix-only` / `stems-only` / `midi-only`
`split_ratios`	`(0.8, 0.1, 0.1)`	—	Train/valid/test split.
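
One way a seed-deterministic split could work with `split_ratios` is shown below. This is a guess at the idea, not the actual `assign_split` in `seeds.py`; the function name and the single-draw scheme are assumptions.

```python
import random


def sketch_assign_split(sample_seed, ratios=(0.8, 0.1, 0.1)):
    """Map a seed to train/valid/test using one draw from a private RNG."""
    draw = random.Random(sample_seed).random()  # value in [0, 1)
    if draw < ratios[0]:
        return "train"
    if draw < ratios[0] + ratios[1]:
        return "valid"
    return "test"
```

Because the draw depends only on the seed, re-running a dataset never moves a sample between splits.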

Quality gate

Field	Default	Env var	Notes
`min_musicality_score`	`0.0`	`MUSICGEN_MIN_MUSICALITY_SCORE`	`0.0` = disabled.
`max_attempts`	`1`	`MUSICGEN_MAX_ATTEMPTS`	Max re-roll attempts per sample.

Soundfonts and assets

Field	Default	Env var	Notes
`sf_dir`	`<repo>/sf`	`MUSICGEN_SF_DIR`	Root directory for `sf/<layer>/` subdirs.
`auto_download_sf2`	`True`	`MUSICGEN_AUTO_DOWNLOAD_SF2`	Download default SF2 on empty pool.
`assets_toml`	`<repo>/assets.toml`	—	Asset registry path.
`soundfont_manager_db`	None	`MUSICGEN_SOUNDFONT_MANAGER_DB`	Activates tag-based soundfont selection.
`soundfont_manager_sf_dir`	None	`MUSICGEN_SOUNDFONT_MANAGER_SF_DIR`	Base dir for relative SF2 paths in SM db.

Genre

Field	Default	Env var	Notes
`genre`	None	`MUSICGEN_GENRE` (comma-separated)	Genre name(s) for constrained generation.
`genres_dir`	`<repo>/genres`	`MUSICGEN_GENRES_DIR`	Root dir for genre spec files.

Neural backends

Field	Default	Notes
`chord_backend`	`"markov"`	`"markov"` or `"neural"`. Falls back to Markov when model is absent.
`melody_backend`	`"markov"`	Same.
`models_dir`	`<repo>/models`	Directory with `.pt` checkpoint files.

Layout (v0.8)

Field	Default	Notes
`shard_width`	`0`	Shard prefix length. `0` = flat (`<root>/000042/`); `3` = `<root>/000/000042/`. Range 0–5.
`measures_per_part_override`	`None`	Dict overriding per-part measure counts after time-sig scaling. E.g. `{"intro": 4, "verse": 8, "chorus": 8, "bridge": 4, "outro": 4}` for short listening demos.

Domain-specific config files

File	Purpose
`song_structures.json`	Song arrangements (intro/verse/chorus/bridge/outro).
`chord_patterns.txt`	Chord progressions per song part.
`beat_roll_patterns_<sig>.txt`	Drum patterns per time signature.
`inst_probabilities.json`	Per-layer inclusion probabilities.
`levels.json`	Per-layer gain and pan.
`*_fx.json`	FX chain parameter ranges per layer.

Determinism

Same global_seed + same sample_index → bit-identical MIDI + bit-identical canonical sample.json regardless of PYTHONHASHSEED. WAV bit-identity holds when the FluidSynth binary version matches.

Each sample uses five named random.Random instances, derived deterministically from the sample seed:

sample_seed = derive_sample_seed(global_seed, sample_index)   # sha256[:8]
rngs = make_rngs(sample_seed)
# params, generators, soundfonts, fx, mix — each seeded with seed ^ offset

Zero bare random.* calls anywhere in src/musicgen/ — enforced by an AST static guard. Global random state is never touched.
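
A stand-alone sketch of the seeding scheme follows. It mirrors the comments above (`sha256[:8]`, `seed ^ offset`) but is not the code in `seeds.py`: the input encoding, reading `[:8]` as the first eight digest bytes, and the offset values are all assumptions.

```python
import hashlib
import random

RNG_NAMES = ("params", "generators", "soundfonts", "fx", "mix")


def sketch_derive_sample_seed(global_seed, sample_index):
    """Hash (global_seed, sample_index) and keep the first 8 bytes as an int."""
    digest = hashlib.sha256(f"{global_seed}:{sample_index}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def sketch_make_rngs(sample_seed):
    """One independent random.Random per concern, each seeded with seed ^ offset."""
    return {
        name: random.Random(sample_seed ^ offset)
        for offset, name in enumerate(RNG_NAMES, start=1)
    }


rngs = sketch_make_rngs(sketch_derive_sample_seed(42, 0))
tempo_draw = rngs["params"].uniform(60, 180)
```

Separate streams mean that adding a draw in one stage (say, `fx`) does not shift the values produced in another (say, `mix`).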

Regression tests (tests/test_determinism_golden.py):

TestSameProcessStability — fast, no FluidSynth — runs generate() twice and asserts sha256(sample.json) matches.
TestDeterminismGoldens — @pytest.mark.slow — compares SHA-256 artifacts for mix.wav + MIDIs + sample.json across separate process invocations.

Architecture

src/musicgen/
├── __init__.py           # public exports: generate, generate_batch, Config, SampleResult, BatchResult
├── api.py                # generate(Config) — composition root; resolve_genre_spec
├── batch.py              # generate_batch(Config) → BatchResult via ProcessPoolExecutor
├── cli.py                # typer app — all CLI commands
├── config.py (root)      # Config dataclass with CLI > env > defaults precedence
├── asset_downloader.py   # download SF2/MIDI from assets.toml; auto-trigger on empty pool
├── calibrate.py          # FluidSynth pre-roll measurement + .musicgen/ cache
├── seeds.py              # derive_sample_seed, make_rngs, save_random_state, assign_split
├── genre.py              # GenreSpec, load_genre, merge_genres, resolve_genres
├── sampler.py            # SongParams + genre-constrained draws
├── generators/
│   ├── chord.py          # Markov/neural chord generation; extended chord vocab
│   ├── melody.py         # Markov/neural melody; scale-degree path
│   ├── bassline.py       # Bassline generation (keyed to chords + melody)
│   └── beat.py           # Drum patterns + swing; genre pattern union
├── neural/               # optional — requires musicgen[neural]
│   ├── model.py          # ChordLSTM, MelodyLSTM, NeuralSampler
│   ├── trainer.py        # train(), save_model(), load_model()
│   └── sampler.py        # sample_chord_neural(), sample_melody_neural()
├── corpus_extractor.py   # extract_sequences() — dataset → sequences.json
├── renderer.py           # FluidSynth wrapper; ThreadPoolExecutor stem rendering; soundfont selection
├── mixer.py              # FX (pedalboard), pydub overlay, layer mask, part concat
├── beats.py              # MIDI-tick beat/downbeat extraction (mido), swing-aware
├── annotator.py          # pure-function sample.json assembler
├── musicality.py         # Layer 1 MIDI quality + Layer 2 audio integrity scorer
├── writer.py             # atomic sample dir, sum-of-stems assertion, output_mode routing
├── manifest.py           # ManifestWriter (JSONL, append-under-lock)
├── quality.py            # score_dataset(), filter_dataset(), quality_report() — batch quality pipeline
├── exporter.py           # collect_samples(), export_dataset() — JSONL/CSV/Parquet/HF export
├── sample_composition.py # SampleLayerRule, SampleCompositionConfig
├── sample_mixer.py       # BPM stretch, key shift, loop tiling, alongside/substitution
├── sample_builder.py     # build_library() — WAV dir → SampleManager JSON
├── midi_indexer.py       # index_midi_dataset() — indexes MIDI into MidiManager db
└── audio_indexer.py      # index_audio_dataset() — indexes WAV into SampleManager db

Pipeline:

resolve_genre_spec → sampler (genre-constrained draws)
  → generators (chord: LSTM or Markov; melody: LSTM or Markov; bassline/beat: Markov)
  → check_midi_quality (Layer 1: hard + soft symbolic checks, < 5 ms)
  → [re-roll up to max_attempts if score < min_musicality_score]
  → renderer (FluidSynth parallel stems; genre soundfont tags; auto-download on empty pool)
  → mixer (FX + overlay + concat; genre FX profile)
  → beats (MIDI-tick extraction)
  → get_musicality_score (Layer 2: audio integrity + musical analysis)
  → annotator (sample.json dict + pre-roll offset)
  → writer (atomic sample dir + sum-of-stems + output_mode routing)
  → manifest (JSONL append)
  → SampleResult

generate_batch wraps generate in a ProcessPoolExecutor (spawn context) and returns BatchResult.

Try in the cloud

Platform	How
Google Colab	Click a badge at the top. Each notebook has a setup cell that `apt install`s FluidSynth + `fluid-soundfont-gm` and pip-installs musicgen. Demo · Sample composition · Neural generators
mybinder.org	JupyterLab in the browser, all deps pre-wired. Cold build ~5–10 min. Launch
HuggingFace Spaces	Gradio web UI wrapping `musicgen.generate()`. Source under `hf_space/` (Dockerfile + `app.py`). See `hf_space/README.md`.

Tests

pytest -m "not slow"    # fast suite — 1648 tests, ~12 s
pytest -m slow          # requires FluidSynth binary + populated sf/ pools
pytest                  # everything

Coverage target: ≥ 80% on pure functions (samplers, generators, annotator, beats, validators).

Contributing

PRs welcome. Run pytest -m "not slow" before submitting. Project planning lives under .planning/.

License

See LICENSE.

Acknowledgments

music21 for music theory primitives
FluidSynth for soundfont synthesis
pedalboard for audio effects
mido for MIDI manipulation
librosa for audio analysis
