Made with AI: Suno, Udio and others

Given what Udio produces when prompted to sing atonally, and the almost certainly fairly small number of such recordings available to train on, I wonder how massive the training data actually needs to be.
Someone posted a link to a similar non-commercial project from 2020 that scraped audio data from the web. It had a training dataset of roughly 1.5 million songs, of which roughly half were in English.

Keep in mind that neural networks are classifiers. That means that once a network has good coverage of something, when it "learns" about something new it only has to determine the features that make the new thing unique.

For example, a neural network that has already been trained on a large variety of voices has built up a large "latent space" of voice characteristics. Learning a new voice is then a matter of determining which set of existing characteristics the new voice can be classified with.

It's like being a sketch artist - once you've drawn enough people, you've got a collection of "go-to" features you can pull from. So new people are mostly variations on what you already know.

But building coverage for that initial knowledge base can require a lot of data.
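As a rough illustration of that idea, here is a minimal toy sketch (every name, number, and dimension below is invented for illustration, not taken from any real system) of what "classifying a new voice against an existing latent space" might look like - the new voice is simply described by its nearest neighbours among embeddings the network already knows:

```python
import numpy as np

# Toy latent space: each row is a learned voice embedding.
# In a real system these would come from a trained audio encoder.
rng = np.random.default_rng(0)
known_voices = rng.normal(size=(1000, 64))  # 1000 voices, 64-dim latent space
known_voices /= np.linalg.norm(known_voices, axis=1, keepdims=True)

def describe_new_voice(new_embedding, k=5):
    """Describe a new voice via its k nearest known voices (cosine similarity)."""
    v = new_embedding / np.linalg.norm(new_embedding)
    sims = known_voices @ v                   # similarity to every known voice
    nearest = np.argsort(sims)[-k:][::-1]     # indices of the k closest
    return list(zip(nearest.tolist(), sims[nearest]))

new_voice = rng.normal(size=64)               # stand-in for an embedded new voice
print(describe_new_voice(new_voice))          # the "existing characteristics" it maps to
```

The expensive part is building `known_voices` in the first place; placing one more voice inside that space is cheap, which is the point being made above.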
 
Udio is coming up with a fantastic variety of ways to butcher my lyrics, lol. The lack of control is really galling. Though it certainly does suggest possible deliveries I hadn't thought of (... usually for good reason...).
Suno vs Udio:

Suno nails it, Udio messes a lot of it up with over-"interpretation" and embellishments. (I'm reminded of Hamlet: "Speak the speech, I pray you, as I pronounced it to you, trippingly on the tongue: but if you mouth it, as many of your players do, I had as lief the town-crier spoke my lines.") And those are the best of many generations...

Here's another example of how Udio messes up the opening lyrics:

Those dramatic pauses it adds... ugh. Like a bad "Shakespearean actor".
 
Udio is indeed currently better than most AI music generators. However, it is not a threat to composers who can compose at an advanced level.
Unfortunately that's true, but only at that level, I'm afraid. Actually, it can eventually become a threat even there: at what point does the director on a budget stop being a paying customer, once free plagiarized music is good enough?
I have a good idea of what will happen and I am already preparing for the near future. But for me, it is certain that any hobby composer who composes classical/new-age/film-music with only chord progressions and a melody will be sidelined.
110%, bingo. The next generation, born with a device in their hand and hearing scores from games and movies, couldn't care less how it got there... "HZ or JW or AI, who cares Bruh as long as it's lit?" :cautious:

The libraries that want to survive will have to use filters to reject AI music, or else they will be inundated with superficial music. Although many libraries already survive by commercializing music without actually selling it.
Actually, we have no idea where this will go. Your assessment is right as far as we can see now, but what will really be going on ten years from now? I'm not sure I want to know. There has to be a way to think ahead about how to adapt to where it's going; I just can't see it yet. All I know is, I don't want to be like the VP at Kodak who, shown the project for a digital sensor, replied that it was a nice pet project but he wasn't going to support a technology that would compete with their film products. Not an urban myth, unfortunately for them: Kodak was largely responsible for inventing the first digital sensor and completely missed the train. I hate AI, but I don't want to be Kodak.

If we can annihilate AI from the film composing world, I promise never to complain again about chunky legatos and SINE not having MIDI assignable faders!! :grin:
 
I wonder if it would be possible one day to feed AI some type of mockup or the notated part and have it generate playback audio for use in a project.
This is the way forward.

Text to audio by itself provides so little control that using these services is less like composing music and much more like randomly requesting a book from Borges' Library of Babel where an infinity of tomes reside on an infinity of shelves in an infinity of rooms.

On the other hand, audio-to-audio combined with prompting could be much more like working alongside history's finest orchestrator/producer/studio system all in one.

If you look at the image generation area, Stable Diffusion has this well advanced with ControlNet, which allows extremely fine control when generating imagery. Because Stable Diffusion is the open-source image generator, tools like these grew like weeds in that ecosystem, while Midjourney, which has superior image quality, lags behind in this area. Maybe it will be the same for music, with closed-source services like Udio lacking the tooling ecosystem required to make them really useful and less toy-like.

I'm pretty sure that ControlNet could readily be ported from the image domain to the audio domain, given that these large models all use the same basic methods. Don't be surprised if audio-to-music or MIDI-to-music emerges in the next year, or the next week.
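For anyone curious about the mechanism that would be ported: the core ControlNet trick is to freeze the base generator and let a small side network, whose output layers start at zero, inject the control signal as residuals. Here is a toy sketch of that idea in PyTorch (this is not real ControlNet code, and the audio application is purely hypothetical):

```python
import torch
import torch.nn as nn

class TinyControlNet(nn.Module):
    """Toy version of the ControlNet idea: a frozen generator plus a
    trainable side branch that injects a control signal as residuals."""
    def __init__(self, dim=64):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        for p in self.generator.parameters():
            p.requires_grad = False            # base model stays frozen
        self.control_encoder = nn.Linear(dim, dim)
        self.zero_proj = nn.Linear(dim, dim)   # zero-initialized, as in ControlNet
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, x, control):
        residual = self.zero_proj(self.control_encoder(control))
        return self.generator(x + residual)    # control nudges the generation

x = torch.randn(1, 64)        # stand-in for a noisy latent (image or audio)
control = torch.randn(1, 64)  # stand-in for an encoded control signal (e.g. MIDI)
print(TinyControlNet()(x, control).shape)
```

Because the projection starts at zero, the frozen model's behaviour is initially untouched; training gradually lets the control signal steer it, which is why the technique doesn't wreck the base model.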
 
Also, why are these AIs' names so damn uncreative? Probably not named by a human anyway.

UDIO
SORA
SUNO
BOMZ

The last one doesn't exist, but who cares - I can't keep track of all these 4-letter monsters.
 
It seems I'm out of the game, then. Luckily, a "hobby composer" is just that: he doesn't have to make money from composing, unlike advanced professionals who don't care about melodies but can come up with masterpieces made of clusters and expertly crafted sound designs. I'm sure these professionals will never be replaced by an algorithm!
As if professional composers don't care about melodies. You're being silly.
 
Someone posted a link to a similar non-commercial project from 2020 that scraped audio data from the web. It had a training dataset of roughly 1.5 million songs, of which roughly half were in English.

Keep in mind that neural networks are classifiers. That means that once a network has good coverage of something, when it "learns" about something new it only has to determine the features that make the new thing unique.

For example, a neural network that has already been trained on a large variety of voices has built up a large "latent space" of voice characteristics. Learning a new voice is then a matter of determining which set of existing characteristics the new voice can be classified with.

It's like being a sketch artist - once you've drawn enough people, you've got a collection of "go-to" features you can pull from. So new people are mostly variations on what you already know.

But building coverage for that initial knowledge base can require a lot of data.
This was very interesting. Thank you. I'll keep it in mind.
 
This is the way forward.

Text to audio by itself provides so little control that using these services is less like composing music and much more like randomly requesting a book from Borges' Library of Babel where an infinity of tomes reside on an infinity of shelves in an infinity of rooms.

On the other hand, audio-to-audio combined with prompting could be much more like working alongside history's finest orchestrator/producer/studio system all in one.

If you look at the image generation area, Stable Diffusion has this well advanced with ControlNet, which allows extremely fine control when generating imagery. Because Stable Diffusion is the open-source image generator, tools like these grew like weeds in that ecosystem, while Midjourney, which has superior image quality, lags behind in this area. Maybe it will be the same for music, with closed-source services like Udio lacking the tooling ecosystem required to make them really useful and less toy-like.

I'm pretty sure that ControlNet could readily be ported from the image domain to the audio domain, given that these large models all use the same basic methods. Don't be surprised if audio-to-music or MIDI-to-music emerges in the next year, or the next week.
Stable Audio 2 (which it says is trained only on licensed data from AudioSparx) already has audio-to-audio with prompting. There's also Dance Diffusion, which is a bit older now, where you train on your own data to create your own models (which I've done), and which supports inpainting. So there's no "ControlNet" yet, but the other Stable Diffusion-style tools are already here for audio - it's no doubt possible.

As an aside, I'd also point out that one of the biggest changes in generative AI in the past year, at least for LLMs, has been efficiency - it turns out that, contrary to Sam Altman, smaller high-quality datasets perform well against large training sets, to the point that if you have a decent gaming card or an M3 with lots of RAM, you can run surprisingly capable language models at home without the need for server farms. We'll see if the same applies to audio in the coming years.
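For anyone who wants to try that at home, here is a minimal sketch of running a small open model locally with the Hugging Face transformers library ("gpt2" is only a tiny placeholder to keep the example short; the surprisingly capable models referred to above would be larger, often quantized, ones):

```python
# Minimal local text generation; no server farm involved.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # swap in any small local model
out = generator("The future of AI music tools is", max_new_tokens=40)
print(out[0]["generated_text"])
```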
 
Has anyone tried taking lyrics from a real song, putting it into Udio and prompting it with the name of the original artist?

Might be an interesting experiment.
 
Has anyone tried taking lyrics from a real song, putting it into Udio and prompting it with the name of the original artist?

Might be an interesting experiment.
You can't... I tried with Frank Sinatra and the first lyrics from My Way... I get "Moderation Error"
 
It's incredible: it "composed" the imitation too...
https://www.udio.com/songs/1uMmd3hgpXN4kTUQphna9f
This sounds shockingly good.

During my experiments, I found that there is still a huge surprise factor in whether the outcome sounds good or not, especially with realistic orchestral/acoustic instruments. Sometimes the audio quality is just terrible, or I got staccato sounds that seemed to be... reversed? All kinds of strange errors that make the illusion obvious.
Other times it's just shockingly good. At least for a limited time span.

What I also noticed is that there is no real understanding of melodic development involved. The AI understands harmony, and it understands rhythm, very well - but continuous development of a solid leitmotif? That I haven't heard yet. Not yet.

I guess it makes sense if these AIs work similarly to ChatGPT and others - I have no deep understanding of the algorithms here. What I mean is the ability to look ahead within its own creation, making connections across a whole piece of "art". ChatGPT doesn't know how its sentence is going to end while writing it. I feel like Suno and Udio don't know how to end a melodic theme either, let alone develop a theme over time with complex variations. They rather understand what sounds good right now, at any given moment - which is why they work far better for harmonic than for melodic material.
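That intuition matches how autoregressive models actually generate: one token at a time, conditioned only on what already exists, with no plan for how the phrase ends. Here is a minimal sketch with a small public text model (gpt2 purely as a stand-in - the internals of Suno and Udio aren't public):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The melody begins softly and", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):
        logits = model(ids).logits[:, -1, :]           # only the *next* token is scored
        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy pick, no lookahead
        ids = torch.cat([ids, next_id], dim=-1)        # the "plan" is just the text so far
print(tok.decode(ids[0]))
```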
 
As if professional composers don't care about melodies. You're being silly.
I didn't say that. Of course that's not the case. But there is a lot of music made by professionals that is just like what I described, and that kind of music is a much stronger candidate for AI generation than the compositions of "any hobby composer who composes classical/new-age/film-music with only chord progressions and a melody". In addition, an amateur composer can't possibly have his career as a composer threatened by AI, because he doesn't have such a career, obviously. So why mention hobby composers in this context?
 