FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation

Accepted at ACM Multimedia 2025

Yuxuan Jiang¹,²,*, Zehua Chen¹,²,*, Zeqian Ju³,*, Chang Li²,³, Weibei Dou¹, Jun Zhu¹,✉

¹Tsinghua University ²Shengshu AI ³University of Science and Technology of China

✉ dcszj AT mail.tsinghua.edu.cn

Abstract

Text-to-audio (T2A) generation has achieved promising results with the recent progress of generative models. However, because temporally aligned audio-text pairs are limited in both quality and quantity, existing T2A methods struggle with complex text prompts that specify precise timing, e.g., "owl hooted at 2.4s–5.2s". Recent works have explored data augmentation techniques or taken timing conditions as model input to enable timing-conditioned 10-second T2A generation, but their synthesis quality remains limited. In this work, we propose FreeAudio, a novel training-free timing-controlled T2A framework that makes the first attempt at timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s–5.2s and crickets chirping at 0s–24s". Specifically, we first employ an LLM for timing-to-window planning, decomposing text prompts with complex timing control into multiple time windows of varying lengths. We then introduce: 1) decoupling and aggregating attention control for precise timing control; and 2) contextual latent composition and reference guidance for coherent long-form generation. Extensive experiments show that: 1) FreeAudio sets a new record for timing-conditioned T2A synthesis quality among training-free methods and is comparable to state-of-the-art training-based methods; 2) FreeAudio achieves long-form generation quality comparable to the training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis.

Contributions

We propose FreeAudio, the first training-free text-to-audio framework that enables both precise temporal control and coherent long-form generation from complex natural language prompts.

To achieve this without training, we introduce Decoupling & Aggregating Attention Control for precise timing alignment, and Contextual Latent Composition together with Reference Guidance for coherent, globally consistent long-form generation.

Despite requiring no training or supervision, FreeAudio delivers competitive or superior performance in both temporal precision and structural coherence across extensive evaluations.

**Figure 1:** *Left*: Planning Stage, where the LLM parses the text prompt and timing prompts into a sequence of non-overlapping time windows, each associated with a recaptioned prompt. *Right*: Generation Stage, where the Decoupling & Aggregating Attention Control aligns each recaptioned prompt with its corresponding time window, enabling precise timing control in attention layers.
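To make the idea of non-overlapping time windows concrete, the following is a minimal sketch of the interval-splitting step only. It is not the authors' planner: in FreeAudio an LLM performs the planning and writes the recaptioned prompt for each window, whereas this illustration cuts the timeline at every event boundary and simply lists the events active in each window. The function name `plan_windows` and the `(start, end, description)` tuple layout are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical layout for one timed event: (start_s, end_s, description).
Event = Tuple[float, float, str]
Window = Tuple[float, float, List[str]]


def plan_windows(events: List[Event]) -> List[Window]:
    """Split possibly overlapping timed events into non-overlapping windows.

    Illustrative stand-in for the LLM planning stage: every event boundary
    becomes a window edge, and each window lists the events covering it.
    """
    # Collect every distinct start/end time; these are the candidate cuts.
    cuts = sorted({t for start, end, _ in events for t in (start, end)})

    windows: List[Window] = []
    for left, right in zip(cuts, cuts[1:]):
        # An event is active in a window if it spans the whole window.
        active = [text for start, end, text in events
                  if start <= left and right <= end]
        if active:  # windows with no active event (silent gaps) are skipped
            windows.append((left, right, active))
    return windows


if __name__ == "__main__":
    events = [
        (0.0, 10.0, "Birds chirping"),
        (0.0, 6.0, "Man speaking"),
        (6.0, 10.0, "Dog barking"),
    ]
    for left, right, active in plan_windows(events):
        # Joining the descriptions is a crude placeholder for recaptioning.
        print(f"{left:g}s - {right:g}s: {' and '.join(active)}")
```

With the three events above, the sketch yields two windows: 0s to 6s (birds and man) and 6s to 10s (birds and dog). A real planner would additionally rewrite each window's event list into a fluent recaptioned prompt.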

Timing-controlled Long-Form Audio Generation

Text prompt: Birds chirp in the background as a man speaks, followed by a dog barking. Later, a car speeds by, and a child laughs in the distance.

Timing prompt:
1. 0s - 10s. Birds chirping in the background.
2. 0s - 6s. Man speaking.
3. 6s - 10s. Dog barking.
4. 10s - 14s. A car speeds by on a nearby road.
5. 14s - 20s. A child laughs in the distance.

Text prompt: In a quiet forest, the wind blows softly as birds chirp and a fire crackles nearby. Footsteps rustle through dry leaves, crickets chirp continuously, an owl calls briefly, and a stream flows steadily in the distance.

Timing prompt:
1. 0s - 10s. Forest wind blowing.
2. 0s - 4s. Birds chirping.
3. 4s - 6s. Wood burning.
4. 6s - 16s. Animal footsteps on dry leaves.
5. 10s - 16s. Crickets chirping.
6. 16s - 19s. Owl chirping.
7. 17s - 26s. Stream water flowing.

Text prompt: A country music song with a strong, expressive male vocal leading the melody, accompanied by soft acoustic guitar and minimal instrumentation.

Timing prompt:
1. 0s - 8s. Soft acoustic guitar strumming sets the rhythm.
2. 8s - 16s. The male vocal enters, expressively leading the melody alongside the guitar.
3. 16s - 22s. The vocals rise emotionally while the acoustic guitar maintains a steady background.
4. 22s - 26s. The song softens slightly, with the gentle guitar continuing to play.
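The timing prompts above follow a regular "index. start - end. description" pattern, so they can be turned into structured events with a small parser. The sketch below is only an illustration of that format: `TimedEvent` and `parse_timing_prompt` are hypothetical names, and the regular expression assumes times are written in seconds with an `s` suffix and an ASCII hyphen, as in the examples on this page.

```python
import re
from typing import List, NamedTuple


class TimedEvent(NamedTuple):
    """One entry of a timing prompt (hypothetical container for this sketch)."""
    start: float  # seconds
    end: float    # seconds
    text: str


# Matches "<index>. <start>s - <end>s. <description>", where the description
# runs until the next "<index>. <start>s -" item or the end of the string.
_ITEM = re.compile(
    r"(\d+)\.\s*(\d+(?:\.\d+)?)s\s*-\s*(\d+(?:\.\d+)?)s\.\s*"
    r"(.*?)(?=\s*\d+\.\s*\d+(?:\.\d+)?s\s*-|\s*$)",
    re.S,
)


def parse_timing_prompt(prompt: str) -> List[TimedEvent]:
    """Parse a timing prompt written on one line or as one item per line."""
    events: List[TimedEvent] = []
    for match in _ITEM.finditer(prompt):
        start, end = float(match.group(2)), float(match.group(3))
        if end <= start:
            raise ValueError(f"empty or reversed interval: {match.group(0)!r}")
        # Drop surrounding whitespace and the trailing period of the sentence.
        text = match.group(4).strip().rstrip(".")
        events.append(TimedEvent(start, end, text))
    return events


if __name__ == "__main__":
    prompt = ("1. 0s - 10s. Birds chirping in the background. "
              "2. 0s - 6s. Man speaking. 3. 6s - 10s. Dog barking.")
    for event in parse_timing_prompt(prompt):
        print(event)
```

Structured events of this kind make it easy to check a prompt before generation, for example to confirm that every interval is non-empty or to compute the total duration as the largest end time.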