We propose DreamFoley, an autoregressive multimodal Video-to-Audio generation model. Given an input video and an optional text prompt, the model synthesizes high-fidelity audio that is semantically aligned and temporally synchronized with the video content, including sound effects and background music.
We construct an efficient data production pipeline with three main components: data standardization, quality assessment, and caption annotation.
Comparison with state-of-the-art methods on VGGSound-Test.
Comparison with state-of-the-art methods on Kling-Eval.
The sound of a black cat meowing in a lush green outdoor setting.
The sound of a tiger growling behind bars in an enclosure.
The sound of the digital battle scene with intense storm-like sound effects.
The sound of cows in the grassy outdoor setting.
The sound of a dog barking in the outdoor enclosure.
The sound of a car door being repeatedly opened and closed.
The sound of objects colliding as hands manipulate vibrant blue and purple materials.
The sound of a game protagonist jumping, swinging, and shouting in a swamp-like environment.
The sound of cookie crumbles and ice cubes being added to the blender cups and then poured into the plastic cups.
The sound of firecrackers and fireworks bursting in the sky during a vibrant display.
The sound of a pen writing on a grid-lined notebook.
The sound of machine gun fire amidst background noise.
The sound of a woodworking tool shaping a piece of wood.
The sound of a person preparing two beverages by pouring liquids, adding ice, and stirring.
The sound of a Minecraft player's actions, including surface contact and generic impacts, along with background noise.
The sound of an M13 (300 Conversion) firearm being fired and reloaded.
The sound of a motorcycle engine accelerating and revving as it navigates a wet, winding road.
The sound of a person flipping through a newspaper.
The sound of a crowd cheering at the squash match.
The sound of a person sharpening a sword with a sharpening stone.
The sound of a large yellow excavator's mechanisms and bucket striking a two-story white house during demolition.
The sound of a person sawing logs in a forested area.
The sound of an engine running as a person uses a string trimmer to maintain a garden.
The sound of a person typing on a vintage-style mechanical keyboard.
The sound of hands typing on a keyboard with black keys and pink accents.
The sound of a train moving along the tracks as it departs from the station.
The sound of a helicopter's rotors spinning up as it prepares for takeoff.
The sound of cars passing by and a car horn honking during a scenic drive.
The sound of a Porsche 911 GT3 Cup engine accelerating and revving as it navigates a racetrack.
The sound of a steam locomotive's engine as it pulls train cars along the track.
The sound of a person crunching and chewing while eating fried snacks on skewers.
The sound of a person chewing while eating custard apple.
The sound of footsteps on the snow in a serene winter forest.
The sound of fingernails scratching the floor.
The sound of a knife scraping a pink, translucent block.
The sound of tweezers picking up and moving colorful square-shaped candies.
The sound of a shimmering tambourine and repetitive chime bells melody.
The sound of a cat-shaped cutting tool slicing through a white foam tube.
The sound of a person gently squeezing a heart-shaped object during an ASMR activity.
The sound of a person playing a trumpet with clear, melodic, harmonious music.
The sound of a person playing the bassoon with piano in the background.
The sound of a violin playing a continuous melody indoors.
The sound of a person playing three conga drums indoors.
The sound of a person playing electric guitar, demonstrating delay effect with a guitar solo.
The sound of an orchestra performing a grand classical piece with brass instruments.