LDMME: Latent Diffusion Model for Music Editing

Runchuan Ye1, 3, Shiyin Kang2, Zhiyong Wu3
1 Harbin Institute of Technology (Shenzhen), China
2 Skywork AI PTE. LTD., Beijing, China
3 Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
National Conference on Man-Machine Speech Communication (NCMMSC2024)

Abstract

Diffusion models are widely used in image, audio, and speech generation. While diffusion models for image editing have been studied extensively, their application to music editing has received comparatively little attention. Existing research has several shortcomings: 1) lack of support for 44.1 kHz stereo music, 2) generation of non-music content during editing, and 3) editing algorithms that do not fully exploit the existing music, resulting in unnatural edits. To address these limitations, we propose the LDMME model. We first construct a high-quality 44.1 kHz stereo audio dataset, from which non-music data are excluded, to train LDMME. We then improve the quality of the generated music by strengthening LDMME's ability to model fine detail. Finally, we improve the existing editing algorithm by taking the original music into account during editing, which makes the edited music sound more natural. On both music generation and music editing tasks, LDMME outperforms AudioLDM and MusicLDM according to a range of subjective and objective metrics. Samples are available at: https://runchuanye.github.io/LDMME-Latent-Diffusion-Model-for-Music-Editing/.

Model Architecture

The model architecture of the proposed LDMME is shown in Fig. 1. Our implementation of LDMME is based on the Stable Audio codebase [7]. In particular, we use a pre-trained DAC [13] as the audio codec and a pre-trained CLAP text encoder [24] to encode the text conditions. Following Stable Audio [7], the U-Net is built on Moûsai [21] and integrates Seconds Total and Seconds Start encoders so that the model can be trained on, and generate, music of varying lengths. Note that the model takes waveform signals as both input and output, rather than symbolic representations.
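As a rough illustration of how such timing conditioning can be wired up, the PyTorch sketch below embeds the Seconds Start / Seconds Total values and indicates how they are combined with the text condition. Every module and parameter name here is illustrative and does not correspond to the actual Stable Audio or LDMME code.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Embeds seconds_start / seconds_total so the U-Net knows which window
    of a longer track it is generating (illustrative sketch only)."""
    def __init__(self, dim: int = 768, max_seconds: float = 300.0):
        super().__init__()
        self.max_seconds = max_seconds
        self.start_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.total_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))

    def forward(self, seconds_start: torch.Tensor, seconds_total: torch.Tensor) -> torch.Tensor:
        # Normalize absolute times to [0, 1] before embedding.
        start = (seconds_start / self.max_seconds).unsqueeze(-1)
        total = (seconds_total / self.max_seconds).unsqueeze(-1)
        return self.start_mlp(start) + self.total_mlp(total)

# During training and inference, the CLAP text embedding and this timing
# embedding are combined (e.g. concatenated along the sequence dimension)
# and fed to the U-Net as cross-attention conditioning.
```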


[7] Evans, Z., Carr, C., Taylor, J., Hawley, S.H., Pons, J.: Fast timing-conditioned latent audio diffusion. arXiv preprint arXiv:2402.04825 (2024)

[13] Kumar, R., Seetharaman, P., Luebs, A., Kumar, I., Kumar, K.: High-fidelity audio compression with improved RVQGAN. Advances in Neural Information Processing Systems 36 (2024)

[21] Schneider, F., Kamal, O., Jin, Z., Schölkopf, B.: Moûsai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757 (2023)

[24] Wu, Y., Chen, K., Zhang, T., Hui, Y., Berg-Kirkpatrick, T., Dubnov, S.: Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In: ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 1–5. IEEE (2023)

Music Editing Task Definitions

Smooth Concatenation of Music Segments: In this task, the model takes two music segments and splices them into a single piece. The region around the junction is treated as the portion to be regenerated, while the remaining portions stay unchanged. Samples are available in Table 2: Smooth Concatenation of Music Segments.
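A common way to realize this kind of edit with a latent diffusion model is mask-based inpainting of the latents, where the unedited regions are re-imposed at every denoising step. The sketch below illustrates that idea; it is a generic, RePaint-style outline written against a diffusers-style scheduler interface, not the exact LDMME algorithm, and the names denoiser, scheduler, z_known, and cond are placeholders.

```python
import torch

@torch.no_grad()
def edit_with_mask(denoiser, scheduler, z_known, mask, cond):
    """RePaint-style masked editing sketch (not the exact LDMME algorithm).
    Only the masked region is regenerated; the unmasked region is re-imposed
    from the original latents at every denoising step, so existing content
    is preserved and the junction is filled in coherently.

    z_known: latents of the original / concatenated audio, shape (B, C, T)
    mask:    1 where new content may be generated, 0 where z_known must be kept
    cond:    conditioning (e.g. CLAP text embedding plus timing embedding)
    Assumes a diffusers-style scheduler with set_timesteps already called.
    """
    z = torch.randn_like(z_known)
    for t in scheduler.timesteps:
        # Noise the known latents to the current timestep.
        z_known_t = scheduler.add_noise(z_known, torch.randn_like(z_known), t)
        # Keep known regions; let the model regenerate only the masked span.
        z = mask * z + (1.0 - mask) * z_known_t
        eps = denoiser(z, t, cond)
        z = scheduler.step(eps, t, z).prev_sample
    # Final composite: regenerated region stitched back into the original.
    return mask * z + (1.0 - mask) * z_known
```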

Style Transfer of Music Segments Based on Text Descriptions: In this task, the model takes an audio piece along with the start and end times of the segment to be altered, and modifies that segment according to the provided text description. The portion of the music outside this segment remains unchanged. Samples are available in Table 3: Style Transfer of Music Segments Based on Text Descriptions.
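For the style-transfer task the only difference, in this sketch, is how the mask and conditioning are built: the mask marks the user-specified time span, and the conditioning comes from the CLAP embedding of the new text description. The helper below is illustrative; time_span_mask and the reuse of edit_with_mask from the previous sketch are assumptions, not the paper's implementation.

```python
import torch

def time_span_mask(num_frames: int, total_seconds: float,
                   edit_start: float, edit_end: float) -> torch.Tensor:
    """Builds a (1, 1, num_frames) mask that is 1 inside [edit_start, edit_end)
    seconds and 0 elsewhere, at latent-frame resolution (illustrative helper)."""
    frames_per_second = num_frames / total_seconds
    lo = int(edit_start * frames_per_second)
    hi = int(edit_end * frames_per_second)
    mask = torch.zeros(1, 1, num_frames)
    mask[..., lo:hi] = 1.0
    return mask

# Example: regenerate 2.5 s to 7.5 s of a 10 s clip according to a new text
# prompt, reusing the masked-editing loop sketched above.
# mask = time_span_mask(num_frames=z_known.shape[-1], total_seconds=10.0,
#                       edit_start=2.5, edit_end=7.5)
# z_edited = edit_with_mask(denoiser, scheduler, z_known, mask, cond=clap_text_embedding)
```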

Table of Contents

Here, we show examples from music generation and music editing tasks. Tables 1 and 4 are examples of music generation, while Tables 2 and 3 are examples of music editing.


Table 1 shows music generation based on text descriptions from the test dataset. Table 4 shows music generation based on text descriptions generated by GPT.
Table 2 illustrates the Smooth Concatenation of Music Segments task, where the content from 2.5 to 7.5 seconds of "Concatenation" is modified by AudioLDM, MusicLDM, or LDMME (Ours), while the rest remains unchanged.
Table 3 illustrates the Style Transfer of Music Segments Based on Text Descriptions task, where the content from 2.5 to 7.5 seconds of "GT" is modified by AudioLDM, MusicLDM, or LDMME (Ours) based on text conditions, while the rest remains unchanged.