FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation
Accepted at ACM MultiMedia 2025
Yuxuan Jiang1,2,*,
Zehua Chen1,2,*,
Zeqian Ju3,*,
Chang Li2,3,
Weibei Dou1,
Jun Zhu1,✉
1Tsinghua University
2Shengshu AI
3University of Science and Technology of China
✉ dcszj AT mail.tsinghua.edu.cn
Abstract
Text-to-audio (T2A) generation has achieved promising results with the recent progress of generative models. However, because of the limited quality and quantity of temporally aligned audio-text data pairs, existing T2A methods struggle with complex text prompts that contain precise timing control, e.g., owl hooted at 2.4s–5.2s. Recent works have explored data augmentation techniques or taken timing conditions as model input to enable timing-conditioned 10-second T2A generation, but their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., owl hooted at 2.4s–5.2s and crickets chirping at 0s–24s. Specifically, we first employ an LLM for timing-to-window planning, decomposing text prompts with complex timing control into multiple time windows of varying lengths. We then introduce: 1) decoupling and aggregating attention control for precise timing control; 2) contextual latent composition and reference guidance for coherent long-form generation. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to state-of-the-art training-based methods; 2) FreeAudio demonstrates long-form generation quality comparable to the training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis.
Contribution
Figure 1: Left: Planning Stage, where the LLM parses the text prompt and timing prompts into a sequence of non-overlapping time windows, each associated with a recaptioned prompt. Right: Generation Stage, where the Decoupling & Aggregating Attention Control aligns each recaptioned prompt with its corresponding time window, enabling precise timing control in attention layers.
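The caption above says that each recaptioned prompt is aligned with its own time window inside the attention layers. The page does not give the exact formulation, so the following is only a toy sketch of one plausible reading: the attention output is computed once per sub-prompt (decoupling), and a per-window frame mask selects which output is kept at each latent frame (aggregating). The function names, the uniform seconds-to-frames mapping, and the use of scalar outputs per frame are all assumptions made for illustration.

```python
def window_frame_masks(windows, total_seconds, n_frames):
    """Return one 0/1 mask per (start, end) window over n_frames latent frames.

    Assumption: latent frames are spread uniformly over the clip duration.
    """
    masks = []
    for start, end in windows:
        lo = round(start / total_seconds * n_frames)
        hi = round(end / total_seconds * n_frames)
        masks.append([1 if lo <= f < hi else 0 for f in range(n_frames)])
    return masks


def aggregate(per_prompt_outputs, masks):
    """Per frame, keep the output computed with the sub-prompt whose window covers it.

    Each entry of per_prompt_outputs holds one value per frame; real attention
    outputs would be vectors, scalars are used here to keep the sketch short.
    """
    n_frames = len(masks[0])
    return [
        sum(m[f] * o[f] for m, o in zip(masks, per_prompt_outputs))
        for f in range(n_frames)
    ]


# Two non-overlapping windows over a 10-second clip with 10 latent frames.
masks = window_frame_masks([(0, 6), (6, 10)], total_seconds=10, n_frames=10)
# Stand-ins for the attention output obtained with each sub-prompt.
outputs = [[1.0] * 10, [2.0] * 10]
combined = aggregate(outputs, masks)
```

Because the windows do not overlap, every frame receives the output of exactly one sub-prompt, which is the property the planning stage is meant to guarantee.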
Timing-controlled Long-Form Audio Generation
Text prompt: Birds chirp in the background as a man speaks, followed by a dog barking. Later, a car speeds by, and a child laughs in the distance.
Timing prompt:
1. 0s - 10s. Birds chirping in the background.
2. 0s - 6s. Man speaking.
3. 6s - 10s. Dog barking.
4. 10s - 14s. A car speeds by on a nearby road.
5. 14s - 20s. A child laughs in the distance.
[Audio sample]
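The timing prompt above contains overlapping events (birds at 0s - 10s, a man speaking at 0s - 6s), while the planning stage is described as producing non-overlapping time windows. In FreeAudio this step is performed by an LLM, which also recaptions each window. The sketch below is a deterministic stand-in, not the authors' planner: it cuts the timeline at every event boundary and simply joins the descriptions of the events active in each window. The function name `plan_windows` and the `"; "` join are assumptions.

```python
def plan_windows(events):
    """Split overlapping (start, end, description) events into non-overlapping windows."""
    # Every event start or end becomes a window boundary.
    bounds = sorted({t for start, end, _ in events for t in (start, end)})
    windows = []
    for lo, hi in zip(bounds, bounds[1:]):
        # An event is active in [lo, hi] if it fully covers that interval.
        active = [d for start, end, d in events if start <= lo and hi <= end]
        if active:
            windows.append((lo, hi, "; ".join(active)))
    return windows


events = [
    (0, 10, "Birds chirping in the background"),
    (0, 6, "Man speaking"),
    (6, 10, "Dog barking"),
    (10, 14, "A car speeds by on a nearby road"),
    (14, 20, "A child laughs in the distance"),
]
windows = plan_windows(events)
for lo, hi, caption in windows:
    print(f"{lo}s - {hi}s: {caption}")
```

For this example the boundaries are 0, 6, 10, 14, and 20 seconds, giving four windows. An LLM planner would additionally rewrite each joined description into a fluent caption.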
Text prompt: In a quiet forest, the wind blows softly as birds chirp and a fire crackles nearby. Footsteps rustle through dry leaves, crickets chirp continuously, an owl calls briefly, and a stream flows steadily in the distance.
Timing prompt:
1. 0s - 10s. Forest wind blowing.
2. 0s - 4s. Birds chirping.
3. 4s - 6s. Wood burning.
4. 6s - 16s. Animal footsteps on dry leaves.
5. 10s - 16s. Crickets chirping.
6. 16s - 19s. Owl chirping.
7. 17s - 26s. Stream water flowing.
[Audio sample]
Text prompt: A country music song with a strong, expressive male vocal leading the melody, accompanied by soft acoustic guitar and minimal instrumentation.
Timing prompt:
1. 0s - 8s. Soft acoustic guitar strumming sets the rhythm.
2. 8s - 16s. The male vocal enters, expressively leading the melody alongside the guitar.
3. 16s - 22s. The vocals rise emotionally while the acoustic guitar maintains a steady background.
4. 22s - 26s. The song softens slightly, with the gentle guitar continuing to play.
[Audio sample]
Timing-Controlled Audio Generation
[Four sets of audio samples, each comparing: FreeAudio(Ours), AudioLDM2, Tango, Stable Audio*, Ground Truth]
Long-Form Audio Generation
FreeAudio(Base) denotes generation in which the text prompt is fed in directly, without going through the LLM planning process.
Prompt: An adult male is speaking and an audience is laughing.
Planning Sub-Prompt:
1. 0–7s: An adult male speaks calmly while audience members whisper softly in the background.
2. 7–11s: An adult male speaks with energy, accompanied by quiet murmurs and occasional giggles from the audience.
3. 11–15s: An adult male makes a statement as the audience begins to chuckle audibly in the background.
4. 15–19s: An adult male speaks as the audience bursts into loud laughter and lively reactions.
5. 19–26s: An adult male speaks with confidence while the audience chatters and laughs in the background.
[Audio samples for: FreeAudio(Ours), FreeAudio(Base), AudioLDM2, AudioGen, Stable Audio Open]
Prompt: Food sizzles as a man speaks with music playing.
Planning Sub-Prompt:
1. 0–7s: Food sizzles gently in a quiet kitchen environment.
2. 7–10s: A man speaks in a steady voice with sizzling sounds nearby.
3. 10–21s: Soft background music plays while food sizzles in the foreground.
4. 21–26s: An immersive blend of food sizzling, male speech, and melodic background music.
[Audio samples for: FreeAudio(Ours), FreeAudio(Base), AudioLDM2, AudioGen, Stable Audio Open]
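The planning sub-prompts in the example above are windows of varying lengths that follow one another without gaps. As a small illustration of handling planner output in this textual form, the sketch below parses items written as `N. start–ends: caption` into structured windows and checks that they tile the timeline from 0s. The regular expression, the function names, and the assumption that a planner returns exactly this text layout are hypothetical; the page does not specify the planner's output format.

```python
import re

# Matches an item header such as "2. 7–10s: " (en dash or hyphen between times).
ITEM = re.compile(r"(\d+)\.\s*(\d+(?:\.\d+)?)\s*[–-]\s*(\d+(?:\.\d+)?)s:\s*")


def parse_sub_prompts(text):
    """Parse 'N. start–ends: caption' items into (start, end, caption) tuples."""
    matches = list(ITEM.finditer(text))
    windows = []
    for i, m in enumerate(matches):
        # The caption runs until the next item header or the end of the text.
        stop = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        caption = text[m.end():stop].strip()
        windows.append((float(m.group(2)), float(m.group(3)), caption))
    return windows


def tiles_timeline(windows):
    """True if windows start at 0 and each begins where the previous one ends."""
    if not windows or windows[0][0] != 0.0:
        return False
    return all(prev[1] == cur[0] for prev, cur in zip(windows, windows[1:]))


plan = (
    "1. 0–7s: Food sizzles gently in a quiet kitchen environment. "
    "2. 7–10s: A man speaks in a steady voice with sizzling sounds nearby. "
    "3. 10–21s: Soft background music plays while food sizzles in the foreground. "
    "4. 21–26s: An immersive blend of food sizzling, male speech, and melodic "
    "background music."
)
parsed = parse_sub_prompts(plan)
ok = tiles_timeline(parsed)
```

A check like `tiles_timeline` is a cheap way to reject a plan with gaps or overlaps before any audio is generated.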
Prompt: A man shouting as another man talks in the background while a series of gunshots fire and footsteps running on concrete followed by guns cocking and a dog growling.
Planning Sub-Prompt:
1. 0–6s: A man is shouting loudly in an open area, his voice echoing with urgency.
2. 6–11s: Another man murmurs softly in the background, his voice nearly drowned out by the surrounding noise.
3. 11–15s: Gunshots fire sharply through the air, creating a tense and chaotic soundscape.
4. 15–19s: Footsteps rush across a concrete surface with rapid, uneven pacing.
5. 19–26s: Guns cock mechanically as dogs growl deeply nearby.
[Audio samples for: FreeAudio(Ours), FreeAudio(Base), AudioLDM2, AudioGen, Stable Audio Open]
Prompt: The low quality recording features a lullaby that consists of soft bells melody. It sounds like a music box melody and it is relaxing, passionate and mellow.
Planning Sub-Prompt:
1. 0–6s: A soft lullaby plays with bell-like tones in a low-fidelity recording.
2. 6–10s: Gentle music box melodies ring out with delicate bell chimes.
3. 10–12s: A mellow and relaxing bell-driven tune flows in a warm acoustic space.
4. 12–20s: Passionate and soothing lullaby melodies emerge through lo-fi textures.
5. 20–26s: A calming atmosphere of softly ringing bells evokes a dreamy, peaceful mood.
[Audio samples for: FreeAudio(Ours), FreeAudio(Base), AudioLDM2, MusicGen, Stable Audio Open]
Prompt: This is an instrumental cover of a ballad piece. The harp is gently playing the melody with improvisational touches differing from the original version. There is a calming and soothing atmosphere to this piece. This music could be playing in the background at a hotel or a spa resort.
Planning Sub-Prompt:
1. 0–6s: A solo harp plays a soft and flowing melody, evoking a gentle ballad style.
2. 6–10s: The harp introduces subtle improvisations with delicate rhythmic changes.
3. 10–16s: Expressive harp flourishes add a personal touch to the familiar ballad theme.
4. 16–20s: Warm harp tones resonate with relaxed timing and spontaneous melodic turns.
5. 20–26s: Ambient harp phrases unfold in a peaceful space, suggesting a live spa or hotel performance.
[Audio samples for: FreeAudio(Ours), FreeAudio(Base), AudioLDM2, MusicGen, Stable Audio Open]