Milestone 5: MusicBrainz MBID canonicalizer

Give tracks a source-agnostic identity so the same song from different
sources no longer replays in a loop.

- Canonicalizer resolves (artist, title) to a MusicBrainz recording MBID
  (no API key; ~1 req/s, descriptive User-Agent, best-effort). Hits and
  confirmed misses are cached in SQLite; transient errors are not.
- Track.key becomes mbid:<id> when resolved, else a normalized
  name:<artist>|<title> fallback — still source-agnostic.
- Scheduler now owns the authoritative anti-repeat on the canonical key,
  canonicalizing the drawn track with a bounded retry; providers keep a
  cheap recent-locator filter to limit retries.
- db: canonical_cache table, history.locator column with migration for
  existing databases, recent_locators().
- Canonicalization can be turned off via RADIEO_CANONICAL_ENABLED=0.

Verified: MBID hit/cache/miss, cross-source key collapse, scheduler
dodging a recent play, schema migration, and full stack (Navidrome +
yt-dlp) with zero Python tracebacks and a valid 192 kbps MP3 stream.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
nemunaire 2026-07-02 18:46:30 +08:00
commit 7e0f08b863
11 changed files with 292 additions and 33 deletions

View file

@ -67,12 +67,18 @@ source); the file being absent also disables yt-dlp.
## Current status
**Milestone 4 — yt-dlp provider: done.**
**Milestone 5 — MBID canonicalizer: done.**
- Two playback sources feed a weighted scheduler: a Navidrome/OpenSubsonic
playlist and a hand-maintained list of yt-dlp URLs (`config/urls.txt`).
Container URLs (playlist/album/label/artist) are expanded and one track is
drawn at random, honouring the anti-repeat window.
drawn at random.
- Each track is canonicalized to a MusicBrainz recording MBID (no API key
needed; ~1 req/s, best-effort, results cached in SQLite). This gives a
source-agnostic identity, so the same song from two sources collapses to one;
when no confident match is found it falls back to a normalized
`(artist, title)` key. The scheduler uses this canonical key for anti-repeat,
with the providers applying a cheap locator filter first.
- Each source has its own fetcher (Subsonic stream / yt-dlp download); files are
cached ahead of playback (prefetch buffer) and decoded by Liquidsoap.
- Play history and LRU retention are tracked in a SQLite database under
@ -86,7 +92,9 @@ source); the file being absent also disables yt-dlp.
- HTTP stream served at `http://localhost:8000/radio.mp3` (MP3, 192 kbps),
multiple simultaneous listeners supported.
The ListenBrainz suggestion feed comes next.
The ListenBrainz suggestion feed comes next. (Known cosmetic quirk: at startup
the fallback logs a few harmless ffmpeg "Invalid data" warnings while probing
non-audio files such as `.gitkeep`; to be quieted in the polish milestone.)
## Roadmap
@ -97,7 +105,7 @@ The ListenBrainz suggestion feed comes next.
LRU retention and play history.
4. ✅ **yt-dlp provider** — fetch tracks from a maintained URL/artist list;
weighted mixing between sources.
5. **Canonicalizer** — ListenBrainz MBID lookup for source-agnostic
5. **Canonicalizer** — MusicBrainz MBID lookup for source-agnostic
de-duplication.
6. **ListenBrainz provider** — parse the RSS suggestions feed and resolve each
one to Navidrome or yt-dlp.