Diffusion models are widely used in image, audio, and speech generation. While diffusion models for image editing have been studied extensively, their application to music editing has received comparatively little attention. Existing research has several shortcomings: 1) lack of support for 44.1 kHz stereo music, 2) generation of non-music content during editing, and 3) editing algorithms that do not fully exploit the existing music, resulting in unnatural edits. To address these limitations, we propose the LDMME model. We first construct a high-quality 44.1 kHz stereo audio dataset, from which non-music data are excluded, to train LDMME. In addition, we enhance the quality of the generated music by strengthening the detailed modeling capability of LDMME. We then improve the existing editing algorithm so that it takes the original music information into account during editing, making the edited music more natural. On both music generation and editing tasks, LDMME outperforms AudioLDM and MusicLDM across a range of subjective and objective metrics. Samples are available on this website: https://runchuanye.github.io/LDMME-Latent-Diffusion-Model-for-Music-Editing/.
The model architecture of the proposed LDMME is shown in Fig. 1. Our LDMME implementation is based on the code of Stable Audio [7]. In particular, we use a pre-trained DAC [13] as the audio codec and a pre-trained CLAP [24] text encoder to encode text conditions. Drawing inspiration from Stable Audio [7], the U-Net is developed based on Mousai [21] and integrates Seconds Total and Seconds Start encoders, enabling training and inference on music of varying lengths. It is important to note that our model takes waveform audio as both input and output, rather than symbolic representations.
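To make the conditioning scheme concrete, the following is a minimal PyTorch sketch of how a CLAP text embedding and the Seconds Start / Seconds Total values could be assembled into a cross-attention context for the diffusion U-Net. The names (TimingEncoder, build_condition) and the 512-dimensional embedding size are illustrative assumptions, not the actual LDMME implementation.

```python
import torch
import torch.nn as nn

class TimingEncoder(nn.Module):
    """Maps a scalar timing value (Seconds Start or Seconds Total) to an embedding.
    The MLP design and dimension are assumptions for illustration."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, seconds: torch.Tensor) -> torch.Tensor:
        return self.proj(seconds.unsqueeze(-1))  # (B,) -> (B, dim)

def build_condition(clap_text_emb, seconds_start, seconds_total,
                    start_enc: TimingEncoder, total_enc: TimingEncoder):
    """Stack the text and timing embeddings into one cross-attention context."""
    return torch.stack([clap_text_emb,
                        start_enc(seconds_start),
                        total_enc(seconds_total)], dim=1)  # (B, 3, dim)

# Example: batch of 2, 512-dim CLAP embedding (assumed dimension).
clap_emb = torch.randn(2, 512)
cond = build_condition(clap_emb,
                       torch.tensor([0.0, 10.0]),      # seconds_start
                       torch.tensor([30.0, 47.5]),     # seconds_total
                       TimingEncoder(), TimingEncoder())
print(cond.shape)  # torch.Size([2, 3, 512])
```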
Smooth Concatenation of Music Segments: In this task, the model accepts two music segments and splices them into a single piece: the region around the junction is treated as the portion to be regenerated, while the remaining portions are kept unchanged. Samples are available in Table 2: Smooth Concatenation of Music Segments.
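One plausible way to set this task up is to concatenate the two latent segments and mark a window around the seam as the editable region. The sketch below illustrates such a mask construction; the latent shapes and the junction width are placeholder assumptions, not values from the paper.

```python
import torch

def concat_with_edit_mask(latent_a, latent_b, junction_frames: int = 64):
    """Concatenate two latent segments along time and mark a window around the
    seam as editable (mask = 1 means 'regenerate this region')."""
    latent = torch.cat([latent_a, latent_b], dim=-1)        # (B, C, T_a + T_b)
    mask = torch.zeros_like(latent)
    seam = latent_a.shape[-1]
    lo = max(0, seam - junction_frames // 2)
    hi = min(latent.shape[-1], seam + junction_frames // 2)
    mask[..., lo:hi] = 1.0
    return latent, mask

# Toy example: two latent segments with placeholder shapes.
a, b = torch.randn(1, 64, 256), torch.randn(1, 64, 320)
latent, mask = concat_with_edit_mask(a, b)
print(latent.shape, int(mask.sum()))  # torch.Size([1, 64, 576]) 4096
```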
Style Transfer of Music Segments Based on Text Descriptions: In this task, the model takes a piece of audio together with the start and end times of the segment to be altered, and modifies the designated segment according to the provided text description. The part of the music outside this range is kept unchanged. Samples are available in Table 3: Style Transfer of Music Segments Based on Text Descriptions.
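Both editing tasks can be cast as masked regeneration in latent space: only the frames inside the specified time range are resampled, while the rest is re-injected from the original latent at every denoising step so that the original music information is preserved. The sketch below shows one such inpainting-style sampling loop; the denoiser interface, the latent frame rate, and the sigma schedule are assumptions for illustration and do not describe the authors' exact algorithm.

```python
import torch

def time_range_mask(latent, start_s: float, end_s: float, latent_rate: float = 21.5):
    """Mark latent frames in [start_s, end_s) as editable.
    The latent frame rate depends on the codec and is an assumed value here."""
    mask = torch.zeros_like(latent)
    lo, hi = int(start_s * latent_rate), int(end_s * latent_rate)
    mask[..., lo:hi] = 1.0
    return mask

@torch.no_grad()
def masked_text_edit(denoiser, latent, mask, text_cond, sigmas):
    """Inpainting-style edit: regenerate only the masked range while re-injecting
    a noised copy of the original latent elsewhere at every step.
    `denoiser(x, sigma, cond)` is a hypothetical one-step denoiser interface."""
    x = torch.randn_like(latent) * sigmas[0]
    for i, sigma in enumerate(sigmas):
        # Keep the unmasked region consistent with the original music.
        noised_orig = latent + torch.randn_like(latent) * sigma
        x = mask * x + (1.0 - mask) * noised_orig
        denoised = denoiser(x, sigma, text_cond)
        sigma_next = sigmas[i + 1] if i + 1 < len(sigmas) else 0.0
        # Simple Euler step toward the denoised estimate.
        x = denoised + (x - denoised) * (sigma_next / sigma)
    # Paste the untouched original back outside the edited range.
    return mask * x + (1.0 - mask) * latent

# Toy usage with a stand-in denoiser (the real model would be the LDMME U-Net).
latent = torch.randn(1, 64, 512)
mask = time_range_mask(latent, start_s=6.0, end_s=12.0)
sigmas = torch.linspace(1.0, 0.02, 10).tolist()
edited = masked_text_edit(lambda x, s, c: x * 0.0, latent, mask, None, sigmas)
print(edited.shape)  # torch.Size([1, 64, 512])
```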