
DCASE 2025 Challenge: Language-based Audio Retrieval

· 3 min read

TL;DR. We built a system that finds the right sound clip for a short text like “a fast train passes and a bell rings.” Our model learns to place audio and text in the same “meaning space,” so matching pairs end up close together. We competed in DCASE 2025 Task 6 and placed 5th overall.

What is this task about?

Language-based audio retrieval means: you type a short caption, and the system ranks a large pool of audio clips by how well they match it. The official task uses the Clotho v2.1 dataset for development and an evaluation set derived from DCASE 2024, and it allows multiple correct audio files per query. The main metrics are Recall@K and mean Average Precision (mAP).

In plain words:

  • You write a sentence about a sound.
  • The system returns the most relevant sound clips.

Example:

Query: “two dogs bark, then a door closes”
Good results: clips with dogs barking and a door click near the end.

What we built (simple view)

We used a dual-encoder design:

  • an audio encoder turns each sound clip into a vector
  • a text encoder turns each caption into a vector
  • a contrastive loss trains both encoders so that matching audio and text get close, and mismatched pairs move apart

Why this works: once audio and text live in the same space, retrieval is just nearest-neighbor search.
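To make that concrete, here is a minimal PyTorch sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired audio and caption embeddings, plus retrieval as cosine-similarity nearest-neighbor search. It is illustrative only: the encoders are assumed to exist elsewhere, and the temperature value and function names are placeholders, not the settings from our report.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over an in-batch similarity matrix.

    audio_emb, text_emb: (batch, dim) embeddings from the two encoders;
    row i of each tensor is assumed to be a matching audio-caption pair.
    Temperature is a placeholder value, not our reported setting.
    """
    # L2-normalize so the dot product equals cosine similarity
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds matching pairs
    logits = audio_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Pull matching pairs together and push in-batch mismatches apart,
    # in both directions (audio-to-text and text-to-audio)
    loss_a2t = F.cross_entropy(logits, targets)
    loss_t2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2t + loss_t2a)

def retrieve(text_emb, audio_bank, k=10):
    """Retrieval as nearest-neighbor search: rank the audio bank by
    cosine similarity to each query caption embedding."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_bank = F.normalize(audio_bank, dim=-1)
    scores = text_emb @ audio_bank.t()     # (num_queries, num_clips)
    return scores.topk(k, dim=-1).indices  # indices of the top-k clips
```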

We explored several practical choices:

  • Which audio backbone to fine-tune, and how much of it to fine-tune
  • How to pool features over time
  • Projection heads and normalization choices (a sketch of both follows this list)
  • Negative sampling strategies and hard-negative mining
  • Data cleaning for captions and light text augmentation
  • Ensembling a few variants at the end
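As a rough illustration of the pooling and projection-head choices above, here is a small PyTorch module that mean-pools frame-level backbone features over time, applies a two-layer projection head, and L2-normalizes the result. Dimensions, layer sizes, and names are placeholders, not the configuration we actually used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolProject(nn.Module):
    """Mean-pool frame-level audio features over time, then map them into
    the shared embedding space with a small projection head.

    All dimensions below are placeholder values for illustration.
    """
    def __init__(self, feat_dim=768, embed_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, frames):           # frames: (batch, time, feat_dim)
        pooled = frames.mean(dim=1)      # temporal mean pooling
        emb = self.proj(pooled)          # projection head
        return F.normalize(emb, dim=-1)  # unit-length embeddings

# Example: 16 clips, 250 frames each, 768-dim backbone features -> (16, 512)
emb = PoolProject()(torch.randn(16, 250, 768))
```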

See the Technical Report for the full recipe, ablations, and exact settings.

Data we used

  • Development: Clotho v2.1 (6974 audio clips, each with 5 captions, 15–30 s). Splits follow multi-label stratification by unique caption words.
  • Evaluation: 1000 queries with multiple relevant audio clips per query, which makes mAP a natural metric.

How results are measured

  • Recall@K (R@1, R@5, R@10): the fraction of queries for which a correct clip appears in the top K results
  • mAP@10 and mAP@16: reward ranking every relevant clip highly when a query has more than one correct answer (see the sketch after this list)
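For reference, here is a small NumPy sketch of how Recall@K and mAP@K can be computed from ranked results. It follows the usual definitions (average precision normalized by the number of relevant clips, capped at K) rather than the official evaluation code, so treat it as illustrative.

```python
import numpy as np

def recall_at_k(ranked, relevant, k):
    """Fraction of queries with at least one relevant clip in the top k.

    ranked: list of ranked clip-id lists, one per query
    relevant: list of sets of relevant clip ids, one per query
    """
    hits = [len(set(r[:k]) & rel) > 0 for r, rel in zip(ranked, relevant)]
    return float(np.mean(hits))

def map_at_k(ranked, relevant, k):
    """Mean Average Precision over the top k: every relevant clip counts,
    and it counts more the higher it is ranked."""
    ap_scores = []
    for r, rel in zip(ranked, relevant):
        hits, precisions = 0, []
        for rank, clip in enumerate(r[:k], start=1):
            if clip in rel:
                hits += 1
                precisions.append(hits / rank)  # precision at this rank
        denom = min(len(rel), k)
        ap_scores.append(sum(precisions) / denom if denom else 0.0)
    return float(np.mean(ap_scores))
```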

The official baseline for the dev-test split reports: R@1 0.233, R@5 0.522, R@10 0.648, mAP@10 0.352, mAP@16 0.406. These numbers help you judge whether a system is “above baseline.”

Our outcome

  • Challenge result: 5th place overall on the DCASE 2025 Task 6 leaderboard.
  • Takeaway: careful fine-tuning of the audio backbone, solid text cleaning, and strong negatives helped most. Ensembling a small number of diverse checkpoints added a small but steady gain.

For exact scores, model configs, and ablations, see our Technical Report.