Recently, I’ve started experimenting with RAVE, a “variational autoencoder for fast and high-quality neural audio synthesis” created by Antoine Caillon and Philippe Esling of the Artificial Creative Intelligence and Data Science (ACIDS) group at IRCAM, Paris.
What it is
Put simply, a variational autoencoder is an artificial neural network architecture in which a given input is compressed by an encoder into a latent space and then passed through a decoder to generate output. Encoder and decoder are trained together in a process of representation learning.
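To make this a bit more concrete, here’s a minimal VAE sketch in PyTorch. The fully connected layers and their sizes are illustrative assumptions on my part, not RAVE’s actual architecture (which works on raw waveforms with convolutional encoders and decoders), but the encode–sample–decode structure is the same.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Illustrative VAE: input -> latent distribution -> reconstruction."""
    def __init__(self, input_dim=1024, latent_dim=16):
        super().__init__()
        self.encoder = nn.Linear(input_dim, 2 * latent_dim)  # outputs mean and log-variance
        self.decoder = nn.Linear(latent_dim, input_dim)

    def forward(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        # Reparameterization trick: sample z while keeping gradients flowing.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.decoder(z), mu, logvar

def vae_loss(x, recon, mu, logvar):
    # Reconstruction term plus a KL divergence that shapes the latent space.
    recon_loss = F.mse_loss(recon, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl

model = TinyVAE()
x = torch.randn(8, 1024)  # a batch of toy inputs
recon, mu, logvar = model(x)
loss = vae_loss(x, recon, mu, logvar)
```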
With RAVE, Caillon and Esling developed a two-phase approach: phase one performs representation learning on the given dataset, followed by adversarial fine-tuning in the second phase of training. According to their paper, this lets RAVE produce models that deliver both high-fidelity reconstruction and fast, up to real-time, processing. Achieving both has been difficult with earlier machine and deep learning approaches, which either require large amounts of computational resources or have to trade off fidelity, which may suffice for narrow-spectrum audio (e.g. speech) but is limiting for broader-spectrum material like music.
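Schematically, the two phases could look like the loop below, reusing the TinyVAE and vae_loss from the sketch above. This is a heavily simplified assumption of mine, not the authors’ implementation: RAVE itself uses multiscale spectral losses and a more elaborate discriminator, but the split into “train the VAE, then freeze the encoder and fine-tune the decoder adversarially” is the key idea.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vae = TinyVAE()  # from the sketch above
disc = nn.Sequential(nn.Linear(1024, 64), nn.ReLU(), nn.Linear(64, 1))  # toy discriminator
opt_vae = torch.optim.Adam(vae.parameters(), lr=1e-4)
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-4)
data = [torch.randn(8, 1024) for _ in range(4)]  # stand-in dataset

# Phase 1: representation learning with the plain VAE objective.
for x in data:
    recon, mu, logvar = vae(x)
    loss = vae_loss(x, recon, mu, logvar)
    opt_vae.zero_grad()
    loss.backward()
    opt_vae.step()

# Phase 2: freeze the encoder, fine-tune only the decoder so that
# reconstructions additionally have to fool a discriminator.
for p in vae.encoder.parameters():
    p.requires_grad_(False)
opt_dec = torch.optim.Adam(vae.decoder.parameters(), lr=1e-4)
for x in data:
    recon, _, _ = vae(x)
    # Discriminator step: real inputs vs. (detached) reconstructions.
    real = torch.ones(x.size(0), 1)
    fake = torch.zeros(x.size(0), 1)
    d_loss = (F.binary_cross_entropy_with_logits(disc(x), real)
              + F.binary_cross_entropy_with_logits(disc(recon.detach()), fake))
    opt_disc.zero_grad()
    d_loss.backward()
    opt_disc.step()
    # Decoder step: make reconstructions look real to the discriminator.
    g_loss = F.binary_cross_entropy_with_logits(disc(recon), real)
    opt_dec.zero_grad()
    g_loss.backward()
    opt_dec.step()
```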
What it does
Models trained with RAVE make it possible to transfer the audio characteristics, or timbre, of a given dataset onto similar input in real time via nn~, an object for Max/MSP and Pure Data, as well as a VST for other DAWs.
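Outside of Max or Pure Data, an exported RAVE model is a TorchScript file and can also be driven from Python. Here’s a minimal sketch; the file name, input shape, and sample rate are placeholders, and the encode/decode methods follow the exported-model interface described in the RAVE repository, so treat the details as assumptions.

```python
import torch

# Load a trained model exported to TorchScript (file name is hypothetical).
model = torch.jit.load("rave_model.ts")
model.eval()

# One second of mono audio at 48 kHz, shaped (batch, channels, samples).
x = torch.randn(1, 1, 48000)

with torch.no_grad():
    z = model.encode(x)   # compress the input into the latent space
    y = model.decode(z)   # resynthesize audio from the latents
    y2 = model(x)         # forward() = encode + decode, i.e. timbre transfer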
How it’s done
For training models with RAVE, it is suggested that the input dataset be sufficiently large (3 h or more), homogeneous to an extent that similarities can be detected, and of high quality (up to 48 kHz). Technically, smaller and more heterogeneous datasets can also lead to interesting and surprising results; as always, it largely comes down to the intended creative use case.
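Before committing to a multi-day training run, it can help to check whether a dataset roughly meets those suggestions. A small sketch using the soundfile library; the directory path is a placeholder:

```python
import pathlib
import soundfile as sf

dataset_dir = pathlib.Path("dataset")  # placeholder: folder of raw audio
files = sorted(dataset_dir.rglob("*.wav"))

total_seconds = 0.0
rates = set()
for f in files:
    info = sf.info(str(f))
    total_seconds += info.duration
    rates.add(info.samplerate)

print(f"{len(files)} files, {total_seconds / 3600:.2f} h total")
print(f"sample rates found: {sorted(rates)}")  # ideally a single rate, e.g. 48000
```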
The training itself can be performed either on a local machine with sufficient GPU resources or on cloud services such as Google Colab or Kaggle. The length of the process depends mainly on the size of the training data and the desired outcome, and it can take several days.
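As a rough sketch of what a local run looks like when driven from Python: the commands below follow the RAVE repository’s command-line interface as I recall it from the v2 README, but the exact flags, paths, and the model name are assumptions and placeholders, so check them against the current documentation.

```python
import subprocess

# Resample and pack the raw audio into RAVE's training format.
subprocess.run(["rave", "preprocess",
                "--input_path", "audio/",      # placeholder: raw wav files
                "--output_path", "dataset/"],  # placeholder: preprocessed output
               check=True)

# Train with the v2 configuration; this is the step that can take days.
subprocess.run(["rave", "train",
                "--config", "v2",
                "--db_path", "dataset/",
                "--name", "my_model"],         # placeholder model name
               check=True)

# Export the trained model to TorchScript for use with nn~ or the VST.
subprocess.run(["rave", "export", "--run", "runs/my_model"],  # placeholder run path
               check=True)
```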
How it sounds
I’m presenting two model variants to showcase two use case scenarios. Model one was trained on roughly 1.3 GB of Amen Break samples (split and unsplit, various processings, generally homogeneous) using RAVE V2 and the spherical encoder configuration; the second model was trained on a dataset of Martsman material from before 2010 (subjectively heterogeneous) using RAVE V2 and the ELBO encoder.
Additional resources
For training on Colab or Kaggle, I’ve created two notebooks with a bit of additional documentation here and here.