Milestone 5: MusicBrainz MBID canonicalizer

Give tracks a source-agnostic identity so the same song from different sources no longer replays in a loop. - Canonicalizer resolves (artist, title) to a MusicBrainz recording MBID (no API key; ~1 req/s, descriptive User-Agent, best-effort). Hits and confirmed misses are cached in SQLite; transient errors are not. - Track.key becomes mbid:<id> when resolved, else a normalized name:<artist>|<title> fallback — still source-agnostic. - Scheduler now owns the authoritative anti-repeat on the canonical key, canonicalizing the drawn track with a bounded retry; providers keep a cheap recent-locator filter to limit retries. - db: canonical_cache table, history.locator column with migration for existing databases, recent_locators(). - Canonicalization can be turned off via RADIEO_CANONICAL_ENABLED=0. Verified: MBID hit/cache/miss, cross-source key collapse, scheduler dodging a recent play, schema migration, and full stack (Navidrome + yt-dlp) with zero Python tracebacks and a valid 192 kbps MP3 stream. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-07-02 18:46:30 +08:00 · 2026-07-02 18:46:30 +08:00 · 7e0f08b863
commit 7e0f08b863
parent 8774f5c2a1
11 changed files with 292 additions and 33 deletions
--- a/.env.example
+++ b/.env.example
@ -19,6 +19,13 @@ RADIEO_NAVIDROME_PLAYLIST=Radio
 RADIEO_WEIGHT_NAVIDROME=3
 RADIEO_WEIGHT_YTDLP=1
 # --- Canonicalizer MBID (optionnel) ---
 # Résout (artiste, titre) -> MBID MusicBrainz pour dédupliquer entre sources.
 # Aucune clé requise. Mettre 0 pour désactiver (clé = (artiste, titre)).
 RADIEO_CANONICAL_ENABLED=1
 # User-Agent envoyé à MusicBrainz (qui en exige un, descriptif).
 RADIEO_USER_AGENT=radieo/0.1 (personal music radio)
 # --- Rétention du cache (optionnel) ---
 # Nombre de morceaux joués conservés sur disque avant éviction (LRU).
 RADIEO_RETENTION_KEEP=20
--- a/README.md
+++ b/README.md
@ -67,12 +67,18 @@ source); the file being absent also disables yt-dlp.
 ## Current status
-**Milestone 4 — yt-dlp provider: done.**
+**Milestone 5 — MBID canonicalizer: done.**
 - Two playback sources feed a weighted scheduler: a Navidrome/OpenSubsonic
  playlist and a hand-maintained list of yt-dlp URLs (`config/urls.txt`).
  Container URLs (playlist/album/label/artist) are expanded and one track is
-  drawn at random, honouring the anti-repeat window.
+  drawn at random.
 - Each track is canonicalized to a MusicBrainz recording MBID (no API key
  needed; ~1 req/s, best-effort, results cached in SQLite). This gives a
  source-agnostic identity, so the same song from two sources collapses to one;
  when no confident match is found it falls back to a normalized
  `(artist, title)` key. The scheduler uses this canonical key for anti-repeat,
  with the providers applying a cheap locator filter first.
 - Each source has its own fetcher (Subsonic stream / yt-dlp download); files are
  cached ahead of playback (prefetch buffer) and decoded by Liquidsoap.
 - Play history and LRU retention are tracked in a SQLite database under
@ -86,7 +92,9 @@ source); the file being absent also disables yt-dlp.
 - HTTP stream served at `http://localhost:8000/radio.mp3` (MP3, 192 kbps),
  multiple simultaneous listeners supported.
-The ListenBrainz suggestion feed comes next.
+The ListenBrainz suggestion feed comes next. (Known cosmetic quirk: at startup
 the fallback logs a few harmless ffmpeg "Invalid data" warnings while probing
 non-audio files such as `.gitkeep`; to be quieted in the polish milestone.)
 ## Roadmap
@ -97,7 +105,7 @@ The ListenBrainz suggestion feed comes next.
   LRU retention and play history.
 4. ✅ **yt-dlp provider** — fetch tracks from a maintained URL/artist list;
   weighted mixing between sources.
-5. **Canonicalizer** — ListenBrainz MBID lookup for source-agnostic
+5. ✅ **Canonicalizer** — MusicBrainz MBID lookup for source-agnostic
   de-duplication.
 6. **ListenBrainz provider** — parse the RSS suggestions feed and resolve each
   one to Navidrome or yt-dlp.
--- a/docker-compose.yml
+++ b/docker-compose.yml
@ -22,6 +22,9 @@ services:
      # Dosage du mix entre les sources (0 désactive).
      - RADIEO_WEIGHT_NAVIDROME=${RADIEO_WEIGHT_NAVIDROME:-3}
      - RADIEO_WEIGHT_YTDLP=${RADIEO_WEIGHT_YTDLP:-1}
      # Canonicalizer MusicBrainz (identité MBID inter-sources ; sans clé).
      - RADIEO_CANONICAL_ENABLED=${RADIEO_CANONICAL_ENABLED:-1}
      - RADIEO_USER_AGENT=${RADIEO_USER_AGENT:-radieo/0.1 (personal music radio)}
    restart: unless-stopped
  stream:
--- a/ingest/radieo/main.py
+++ b/ingest/radieo/main.py
@ -11,8 +11,18 @@ from .scheduler import Scheduler
 log = logging.getLogger("radieo")
 class _NullCanonicalizer:
    """Used when canonicalization is disabled: leaves the Track untouched."""
    def canonicalize(self, track):
        return track
    def close(self):
        pass
 def _build_pipeline(db: Database):
-    """Return (scheduler, fetchers). Assembles whichever sources are enabled;
+    """Return (providers, fetchers). Assembles whichever sources are enabled;
    when none is, the scheduler yields nothing and the stream plays its local
    cache fallback."""
    providers = []  # list[(provider, weight)]
@ -58,7 +68,7 @@ def _build_pipeline(db: Database):
    if not providers:
        log.warning("no source active: the stream plays its local cache only.")
-    return Scheduler(providers), fetchers
+    return providers, fetchers
 def _sweep_temp_files() -> None:
@ -94,7 +104,19 @@ def main() -> None:
    _sweep_temp_files()
    db = Database(config.STATE_DB)
-    scheduler, fetchers = _build_pipeline(db)
+
    if config.CANONICAL_ENABLED:
        from .canonicalizer import Canonicalizer
        canonicalizer = Canonicalizer(db)
        log.info("Canonicalizer enabled (MusicBrainz, min_score=%d)",
                 config.CANONICAL_MIN_SCORE)
    else:
        canonicalizer = _NullCanonicalizer()
        log.info("Canonicalizer disabled: tracks keyed by (artist, title).")
    providers, fetchers = _build_pipeline(db)
    scheduler = Scheduler(providers, canonicalizer, db)
    queue = TrackQueue(scheduler, fetchers, db)
    queue.start()
@ -113,6 +135,7 @@ def main() -> None:
    finally:
        queue.stop()
        server.server_close()
        canonicalizer.close()
        db.close()
--- a/ingest/radieo/canonicalizer.py
+++ b/ingest/radieo/canonicalizer.py
@ -0,0 +1,94 @@
 """Canonicalizer: resolve a Track to a MusicBrainz recording MBID.
 Given ``(artist, title)`` it queries the MusicBrainz recording search and keeps
 the best match above a score threshold. Results — hits *and* confirmed misses —
 are cached in SQLite so the network is hit at most once per distinct track.
 Genuine no-matches are cached; transient network errors are not, so they get
 retried later.
 MusicBrainz asks anonymous clients to stay under ~1 request/second and to send a
 descriptive User-Agent; both are honoured here. The whole thing is best-effort:
 any failure just leaves ``mbid`` unset and the Track falls back to its
 name-based key.
 """
 import logging
 import threading
 import time
 from dataclasses import replace
 import httpx
 from . import config
 from .models import Track, norm_name
 log = logging.getLogger("radieo.canonicalizer")
 class Canonicalizer:
    def __init__(self, db):
        self._db = db
        self._http = httpx.Client(
            timeout=httpx.Timeout(connect=10.0, read=30.0, write=10.0, pool=10.0),
            headers={"User-Agent": config.USER_AGENT},
            follow_redirects=True,
        )
        self._rate_lock = threading.Lock()
        self._last_call = 0.0
    def canonicalize(self, track: Track) -> Track:
        """Return the Track with ``mbid`` filled when resolvable, else unchanged."""
        artist_norm = norm_name(track.artist)
        title_norm = norm_name(track.title)
        if not artist_norm or not title_norm:
            return track
        cached, mbid = self._db.get_canonical(artist_norm, title_norm)
        if not cached:
            ok, mbid = self._lookup(track.artist, track.title)
            if ok:  # cache hits and genuine misses; skip transient errors
                self._db.put_canonical(artist_norm, title_norm, mbid)
        return replace(track, mbid=mbid) if mbid else track
    # --- MusicBrainz ------------------------------------------------------
    def _lookup(self, artist: str, title: str) -> tuple[bool, str | None]:
        """Return (ok, mbid). ``ok`` False on a transient error (do not cache)."""
        query = f'artist:"{_escape(artist)}" AND recording:"{_escape(title)}"'
        self._throttle()
        try:
            resp = self._http.get(
                config.MUSICBRAINZ_URL,
                params={"query": query, "fmt": "json", "limit": 3},
            )
            resp.raise_for_status()
            data = resp.json()
        except (httpx.HTTPError, ValueError) as exc:
            log.warning("MusicBrainz lookup failed for %s — %s: %s", artist, title, exc)
            return False, None
        recordings = data.get("recordings") or []
        if recordings:
            best = recordings[0]
            if int(best.get("score", 0)) >= config.CANONICAL_MIN_SCORE:
                mbid = best.get("id")
                log.info("MBID %s for %s — %s", mbid, artist, title)
                return True, mbid
        log.info("no confident MBID for %s — %s", artist, title)
        return True, None  # genuine miss: cache it
    def _throttle(self) -> None:
        with self._rate_lock:
            wait = config.CANONICAL_RATE_INTERVAL - (time.time() - self._last_call)
            if wait > 0:
                time.sleep(wait)
            self._last_call = time.time()
    def close(self) -> None:
        self._http.close()
 # Lucene special characters we quote around; escape the ones that would break a
 # quoted phrase.
 def _escape(value: str) -> str:
    return value.replace("\\", "\\\\").replace('"', '\\"')
--- a/ingest/radieo/config.py
+++ b/ingest/radieo/config.py
@ -27,6 +27,23 @@ PREFETCH_INTERVAL = float(os.environ.get("RADIEO_PREFETCH_INTERVAL", "2.0"))
 RETENTION_KEEP = int(os.environ.get("RADIEO_RETENTION_KEEP", "20"))
 # Do not replay a track seen among the last N plays, when avoidable.
 ANTIREPEAT_WINDOW = int(os.environ.get("RADIEO_ANTIREPEAT_WINDOW", "50"))
 # How many draws the scheduler tries to dodge a recent repeat before giving up.
 SCHEDULER_MAX_TRIES = int(os.environ.get("RADIEO_SCHEDULER_MAX_TRIES", "5"))
 # --- Canonicalizer (MusicBrainz MBID lookup) ---
 # Resolves (artist, title) -> recording MBID for source-agnostic de-dup.
 CANONICAL_ENABLED = os.environ.get("RADIEO_CANONICAL_ENABLED", "1") != "0"
 MUSICBRAINZ_URL = os.environ.get(
    "RADIEO_MUSICBRAINZ_URL", "https://musicbrainz.org/ws/2/recording"
 )
 # Minimum MusicBrainz match score (0-100) to accept a recording as canonical.
 CANONICAL_MIN_SCORE = int(os.environ.get("RADIEO_CANONICAL_MIN_SCORE", "90"))
 # Minimum seconds between MusicBrainz requests (anonymous limit ~1 req/s).
 CANONICAL_RATE_INTERVAL = float(os.environ.get("RADIEO_CANONICAL_RATE_INTERVAL", "1.1"))
 # Sent to MusicBrainz (which requires a descriptive User-Agent) and yt-dlp/others.
 USER_AGENT = os.environ.get(
    "RADIEO_USER_AGENT", "radieo/0.1 (personal music radio)"
 )
 # --- Navidrome / OpenSubsonic source ---
 # Left empty means the provider is disabled (the stream then plays its own
--- a/ingest/radieo/db.py
+++ b/ingest/radieo/db.py
@ -1,13 +1,17 @@
-"""SQLite state: play history (anti-repeat + stats) and cache-file retention.
+"""SQLite state: play history, cache-file retention and MBID canonical cache.
-Two concerns, two tables:
+Three concerns, three tables:
- ``history`` is append-only. It drives anti-repeat (recently played track
+- ``history`` is append-only. It drives anti-repeat (recently played canonical
-  keys) and survives cache eviction, so a track can stay "recently played" even
+  keys, and raw locators for the providers' cheap local filter) and survives
-  after its file is deleted.
+  cache eviction, so a track can stay "recently played" even after its file is
  deleted.
 - ``cache_files`` tracks downloaded files so we can keep only the N most
  recently *played* ones (LRU retention). Files not yet played are never
  evicted.
 - ``canonical_cache`` memoizes ``(artist, title) -> MBID`` lookups (a NULL mbid
  means a confirmed no-match) so the Canonicalizer hits the network at most once
  per distinct track.
 """
 import sqlite3
@ -21,6 +25,7 @@ _SCHEMA = """
 CREATE TABLE IF NOT EXISTS history (
    id        INTEGER PRIMARY KEY,
    track_key TEXT NOT NULL,
    locator   TEXT,
    artist    TEXT,
    title     TEXT,
    origin    TEXT,
@ -33,6 +38,14 @@ CREATE TABLE IF NOT EXISTS cache_files (
    track_key TEXT,
    played_at REAL           -- NULL until the file has been played
 );
 CREATE TABLE IF NOT EXISTS canonical_cache (
    artist_norm TEXT NOT NULL,
    title_norm  TEXT NOT NULL,
    mbid        TEXT,         -- NULL means a confirmed no-match
    resolved_at REAL NOT NULL,
    PRIMARY KEY (artist_norm, title_norm)
 );
 """
@ -45,6 +58,16 @@ class Database:
        )
        self._conn.row_factory = sqlite3.Row
        self._conn.executescript(_SCHEMA)
        self._migrate()
    def _migrate(self) -> None:
        # Add columns introduced after a DB may have been created (milestone 5).
        cols = {
            r["name"]
            for r in self._conn.execute("PRAGMA table_info(history)").fetchall()
        }
        if "locator" not in cols:
            self._conn.execute("ALTER TABLE history ADD COLUMN locator TEXT")
    # --- anti-repeat / history -------------------------------------------
@ -56,12 +79,56 @@ class Database:
            ).fetchall()
        return {r["track_key"] for r in rows}
    def recent_locators(self, limit: int) -> set[str]:
        """Raw backend locators recently played (providers' cheap local filter)."""
        with self._lock:
            rows = self._conn.execute(
                "SELECT locator FROM history WHERE locator IS NOT NULL"
                " ORDER BY played_at DESC LIMIT ?",
                (limit,),
            ).fetchall()
        return {r["locator"] for r in rows}
    def record_play(self, track: Track) -> None:
        with self._lock:
            self._conn.execute(
-                "INSERT INTO history (track_key, artist, title, origin, played_at)"
+                "INSERT INTO history"
-                " VALUES (?, ?, ?, ?, ?)",
+                " (track_key, locator, artist, title, origin, played_at)"
-                (track.key, track.artist, track.title, track.origin, time.time()),
+                " VALUES (?, ?, ?, ?, ?, ?)",
                (
                    track.key,
                    track.locator,
                    track.artist,
                    track.title,
                    track.origin,
                    time.time(),
                ),
            )
    # --- MBID canonical cache --------------------------------------------
    def get_canonical(self, artist_norm: str, title_norm: str):
        """Return (cached: bool, mbid: str | None). ``cached`` False means the
        pair was never looked up; a cached row with mbid None is a known miss."""
        with self._lock:
            row = self._conn.execute(
                "SELECT mbid FROM canonical_cache"
                " WHERE artist_norm = ? AND title_norm = ?",
                (artist_norm, title_norm),
            ).fetchone()
        if row is None:
            return False, None
        return True, row["mbid"]
    def put_canonical(
        self, artist_norm: str, title_norm: str, mbid: str | None
    ) -> None:
        with self._lock:
            self._conn.execute(
                "INSERT OR REPLACE INTO canonical_cache"
                " (artist_norm, title_norm, mbid, resolved_at)"
                " VALUES (?, ?, ?, ?)",
                (artist_norm, title_norm, mbid, time.time()),
            )
    # --- cache-file retention --------------------------------------------
--- a/ingest/radieo/models.py
+++ b/ingest/radieo/models.py
@ -6,8 +6,18 @@ it into a local file; the queue and the state database use ``key`` for
 de-duplication and anti-repeat.
 """
 import re
 from dataclasses import dataclass
 _WS = re.compile(r"\s+")
 def norm_name(value: str) -> str:
    """Normalize an artist/title for stable keying and cache lookups:
    case-fold and collapse whitespace. Kept deliberately light (no accent
    stripping) so distinct titles never collapse together."""
    return _WS.sub(" ", value.strip()).casefold()
@dataclass(frozen=True)
 class Track:
@ -22,14 +32,16 @@ class Track:
    @property
    def key(self) -> str:
-        """Stable identity for de-duplication and anti-repeat.
+        """Stable, source-agnostic identity for de-duplication and anti-repeat.
-        Until the Canonicalizer (milestone 5) fills ``mbid``, we key on the
+        The Canonicalizer fills ``mbid`` when it can, giving a truly
-        backend locator, which is unique within a source.
+        cross-source identity. When it can't, we fall back to the normalized
        ``(artist, title)`` — still source-agnostic, so the same track fetched
        from two backends collapses to one key.
        """
        if self.mbid:
            return f"mbid:{self.mbid}"
-        return f"{self.backend}:{self.locator}"
+        return f"name:{norm_name(self.artist)}|{norm_name(self.title)}"
    def __str__(self) -> str:
        return f"{self.artist} — {self.title} [{self.origin}]"
--- a/ingest/radieo/providers/navidrome.py
+++ b/ingest/radieo/providers/navidrome.py
@ -1,9 +1,10 @@
 """NavidromeProvider: picks tracks from an OpenSubsonic playlist.
 Emits ``subsonic`` tracks (locator = song id). The playlist is cached in
-memory and refreshed periodically. Anti-repeat is applied by filtering out
+memory and refreshed periodically. A cheap local anti-repeat filters out songs
-tracks whose key is among the recently played ones; if that empties the pool
+whose id was played recently; if that empties the pool (short playlist), the
-(short playlist), the filter is dropped so playback never stalls.
+filter is dropped so playback never stalls. The authoritative, source-agnostic
 anti-repeat lives in the Scheduler (on the canonical key).
 """
 import logging
@ -53,9 +54,9 @@ class NavidromeProvider:
        if not self._songs:
            return None
-        recent = self._db.recent_keys(config.ANTIREPEAT_WINDOW)
+        recent = self._db.recent_locators(config.ANTIREPEAT_WINDOW)
        candidates = [
-            s for s in self._songs if f"subsonic:{s['id']}" not in recent
+            s for s in self._songs if str(s["id"]) not in recent
        ] or self._songs
        song = random.choice(candidates)
        return Track(
--- a/ingest/radieo/providers/ytdlp.py
+++ b/ingest/radieo/providers/ytdlp.py
@ -63,7 +63,7 @@ class YtdlpProvider:
        self._load_urls()
        if not self._urls:
            return None
-        recent = self._db.recent_keys(config.ANTIREPEAT_WINDOW)
+        recent = self._db.recent_locators(config.ANTIREPEAT_WINDOW)
        # Try source lines in random order until one yields a usable track.
        candidates = list(self._urls)
        random.shuffle(candidates)
@ -81,7 +81,7 @@ class YtdlpProvider:
            return None
        if not entries:
            return None
-        pool = [e for e in entries if f"ytdlp:{e['url']}" not in recent] or entries
+        pool = [e for e in entries if e["url"] not in recent] or entries
        entry = random.choice(pool)
        return Track(
            backend="ytdlp",
--- a/ingest/radieo/scheduler.py
+++ b/ingest/radieo/scheduler.py
@ -1,10 +1,15 @@
-"""Weighted scheduler mixing several providers.
+"""Weighted scheduler mixing several providers, with canonical anti-repeat.
-Picks a provider at random, weighted by ``SOURCE_WEIGHTS``, and asks it for a
+Two responsibilities:
-track. If the chosen provider has nothing right now (empty list, unreachable
+
-source…), the remaining providers are tried in weighted-random order, so a
+- **Source mix**: pick a provider at random, weighted by ``SOURCE_WEIGHTS``. If
-temporarily-idle source never stalls playback. Returns ``None`` only when every
+  the chosen one has nothing right now, try the rest in weighted-random order,
-provider is exhausted.
+  so a temporarily-idle source never stalls playback.
 - **Authoritative anti-repeat**: canonicalize the picked track (fill its MBID,
  cached/best-effort) and reject it if its canonical key was played recently,
  retrying a bounded number of times. This is source-agnostic: the same track
  coming from two backends collapses to one key. Providers still apply a cheap
  locator-based filter first, which keeps the number of retries low.
 The weights live in one place (``config.SOURCE_WEIGHTS``) so the mix can later
 move to a config file without touching this logic.
@ -13,15 +18,37 @@ move to a config file without touching this logic.
 import logging
 import random
 from . import config
 log = logging.getLogger("radieo.scheduler")
 class Scheduler:
-    def __init__(self, entries):
+    def __init__(self, entries, canonicalizer, db):
        # entries: list[(provider, weight)]; drop non-positive weights.
        self._entries = [(p, w) for p, w in entries if w > 0]
        self._canonicalizer = canonicalizer
        self._db = db
    def next(self):
        if not self._entries:
            return None
        recent = self._db.recent_keys(config.ANTIREPEAT_WINDOW)
        last = None
        for _ in range(config.SCHEDULER_MAX_TRIES):
            track = self._pick()
            if track is None:
                return None  # no source has anything right now
            track = self._canonicalizer.canonicalize(track)
            if track.key not in recent:
                return track
            last = track  # recently played; try another
            log.debug("skipping recent %s", track)
        # Everything drawn was recent (e.g. tiny library): play the last anyway.
        return last
    def _pick(self):
        """Weighted provider draw, falling through to the others when empty."""
        pool = list(self._entries)
        while pool:
            weights = [w for _, w in pool]