David Cuny
Grand Poobah, Royal Order of WordBuilders
Someone posted a link to a similar non-commercial project from 2020 that scraped audio data from the web. It had a training dataset of roughly 1.5 million songs, of which roughly half were in English. Given what Udio produces when you prompt it to sing atonally - and the almost certainly small number of such recordings in any training set - I wonder how massive the training data actually needs to be.
Keep in mind that neural networks are classifiers. That means that once a network has good coverage of a domain, "learning" something new only requires determining the features that make that new thing unique.
For example, a neural network that has already been trained on a large variety of voices has built up a large "latent space" of voice characteristics. Learning a new voice is then a matter of determining where in that space the voice falls - that is, what combination of existing characteristics best describes it.
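Here's a toy numpy sketch of that idea. Everything in it is hypothetical - the "latent space" is just random vectors standing in for a real model's learned voice embeddings - but it shows how a new voice can be represented as a combination of characteristics the model already knows, rather than learned from scratch:

```python
import numpy as np

# Stand-in for a pretrained model: each known voice is a point in a
# 16-dimensional latent space (random vectors for illustration only).
rng = np.random.default_rng(0)
latent_dim = 16
known_voices = rng.normal(size=(100, latent_dim))  # 100 voices already "learned"

# A "new" voice: secretly a blend of three known voices, plus a little noise.
true_weights = np.zeros(100)
true_weights[[3, 17, 42]] = [0.5, 0.3, 0.2]
new_voice = true_weights @ known_voices + 0.01 * rng.normal(size=latent_dim)

# "Learning" the new voice = finding the mix of existing characteristics
# that best reconstructs it (least squares), not training from nothing.
weights, *_ = np.linalg.lstsq(known_voices.T, new_voice, rcond=None)
reconstruction = weights @ known_voices
error = np.linalg.norm(reconstruction - new_voice) / np.linalg.norm(new_voice)
print(f"relative reconstruction error: {error:.4f}")
```

The point of the toy: once the space has good coverage, a handful of numbers (the weights) is enough to pin down a new voice, which is why adding one more voice takes far less data than building the space in the first place.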
It's like if you were a sketch artist - once you've drawn enough people, you've got a collection of "go to" features you can pull from. So new people are mostly variations on what you already know.
But building coverage for that initial knowledge base can require a lot of data.