
DCASE 2025 Challenge: Language-based Audio Retrieval

· 3 min read

TL;DR. We built a system that finds the right sound clip for a short text like “a fast train passes and a bell rings.” Our model learns to place audio and text in the same “meaning space,” so matching pairs end up close together. We competed in DCASE 2025 Task 6 and placed 5th overall.

What is this task about?

Language-based audio retrieval means: you type a short caption, and the system ranks a large pool of audio clips by how well they match it. The official task uses the Clotho v2.1 dataset for development and an evaluation set derived from DCASE 2024, and it allows multiple correct audio files per query. The main metrics are Recall@K and mean Average Precision (mAP).

In plain words:

  • You write a sentence about a sound.
  • The system returns the most relevant sound clips.

Example:

Query: “two dogs bark, then a door closes”
Good results: clips with dogs barking and a door click near the end.

What we built (simple view)

We used a dual-encoder design:

  • an audio encoder turns each sound clip into a vector
  • a text encoder turns each caption into a vector
  • a contrastive loss trains both encoders so that matching audio and text get close, and mismatched pairs move apart

Why this works: once audio and text live in the same space, retrieval is just nearest-neighbor search.
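To make that concrete, here is a minimal PyTorch sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired audio and caption embeddings, plus retrieval as cosine-similarity nearest-neighbor search. It is illustrative only: the encoders are assumed to exist elsewhere, and the temperature value and function names are placeholders, not the settings from our report.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over an in-batch similarity matrix.

    audio_emb, text_emb: (batch, dim) embeddings from the two encoders;
    row i of each tensor is assumed to be a matching audio-caption pair.
    Temperature is a placeholder value, not our reported setting.
    """
    # L2-normalize so the dot product equals cosine similarity
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matching pairs
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Pull matching pairs together and push in-batch mismatches apart,
    # in both directions (audio-to-text and text-to-audio)
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

def retrieve(text_emb, audio_bank, k=10):
    """Retrieval as nearest-neighbor search: rank the audio bank by
    cosine similarity to each query caption embedding."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_bank = F.normalize(audio_bank, dim=-1)
    scores = text_emb @ audio_bank.t()     # (num_queries, num_clips)
    return scores.topk(k, dim=-1).indices  # indices of the top-k clips
```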

We explored several practical choices:

  • Which audio backbone to fine-tune, and how much of it to fine-tune
  • How to pool features over time
  • Projection heads and normalization choices (a sketch of both follows this list)
  • Negative sampling strategies and hard-negative mining
  • Data cleaning for captions and light text augmentation
  • Ensembling a few variants at the end
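As a rough illustration of the pooling and projection-head choices above, here is a small PyTorch module that mean-pools frame-level backbone features over time, applies a two-layer projection head, and L2-normalizes the result. Dimensions, layer sizes, and names are placeholders, not the configuration we actually used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolProject(nn.Module):
    """Mean-pool frame-level audio features over time, then map them into
    the shared embedding space with a small projection head.

    All dimensions below are placeholder values for illustration.
    """
    def __init__(self, feat_dim=768, embed_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, frames):           # frames: (batch, time, feat_dim)
        pooled = frames.mean(dim=1)      # temporal mean pooling
        emb = self.proj(pooled)          # projection head
        return F.normalize(emb, dim=-1)  # unit-length embeddings

# Example: 16 clips, 250 frames each, 768-dim backbone features -> (16, 512)
emb = PoolProject()(torch.randn(16, 250, 768))
```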

See the Technical Report for the full recipe, ablations, and exact settings.

Data we used

  • Development: Clotho v2.1 (6974 audio clips, each with 5 captions, 15–30 s). Splits follow multi-label stratification by unique caption words.
  • Evaluation: 1000 queries with multiple relevant audio clips per query, which makes mAP a natural metric.

How results are measured

  • Recall@K (R@1, R@5, R@10): the fraction of queries for which a correct clip appears in the top K results
  • mAP@10 and mAP@16: reward ranking every relevant clip highly when a query has more than one correct answer (see the sketch after this list)
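For reference, here is a small NumPy sketch of how Recall@K and mAP@K can be computed from ranked results. It follows the usual definitions (average precision normalized by the number of relevant clips, capped at K) rather than the official evaluation code, so treat it as illustrative.

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of queries with at least one relevant clip in the top k.

    ranked: list of ranked clip-id lists, one per query
    relevant: list of sets of relevant clip ids, one per query
    """
    hits = [len(set(r[:k]) & rel) > 0 for r, rel in zip(ranked, relevant)]
    return float(np.mean(hits))

def map_at_k(ranked, relevant, k):
    """Mean Average Precision over the top k: every relevant clip counts,
    and it counts more the higher it is ranked."""
    ap_scores = []
    for r, rel in zip(ranked, relevant):
        hits, precisions = 0, []
        for rank, clip in enumerate(r[:k], start=1):
            if clip in rel:
                hits += 1
                precisions.append(hits / rank)  # precision at this rank
        denom = min(len(rel), k)
        ap_scores.append(sum(precisions) / denom if denom else 0.0)
    return float(np.mean(ap_scores))
```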

The official baseline for the dev-test split reports: R@1 0.233, R@5 0.522, R@10 0.648, mAP@10 0.352, mAP@16 0.406. These numbers help you judge whether a system is “above baseline.”

Our outcome

  • Challenge result: 5th place overall on the DCASE 2025 Task 6 leaderboard.
  • Takeaway: careful fine-tuning of the audio backbone, solid text cleaning, and strong negatives helped most. Ensembling a small number of diverse checkpoints added a small but steady gain.

For exact scores, model configs, and ablations, see our Technical Report.