I haven't played with Suno yet, but if the Model is not very responsive to prompting, the problem likely lies in the training data.
This is a little speculative. I guess the training data consists of songs/tracks and their descriptions, based on tags and the description/comment/title that the author provided. However, most of the time that will be something like "Epic, Orchestral, Emotional, Soundtrack". This is very limited, and many tracks will share nearly identical descriptions. But imagine if tracks were described like: "The piece starts with a slow rhythm played by the taiko drums. Next the violins start to play an ostinato pattern in D minor, harmonized by cellos and doubled by basses. At bar 20 the french horn starts playing an emotional melody, with the low brass and second violins playing slow chords."
The more detail in the description, the better the Model can isolate the individual elements of the music. If the training data reached this level of descriptive detail, the Model would generate more variety and respond better to prompts. Music is very hard to describe and quantify. Actually, the best description of music we have right now is classical music notation. With a full score you can easily see which instruments are playing, along with the rhythm, harmony, structure, and overall emotion. Notation is more of a graphical language than a text description, but it is a very good description nonetheless.
What is holding back the quality of current Text-to-Music Models is the lack (as far as I know) of good Music-to-Text Models. Current Text-to-Image and Image-to-Text Models can train one another: you generate a random prompt, then generate an image from that prompt, then generate a description of that image. Finally you compare the prompt with the description; if they match, you keep the Model parameters, and if not, you update the parameters and start again. After you "kickstart" the training with real-world data, at some point you can continue training only on synthesized data. And one can argue that after several iterations of this method, there is no real-world (and Copyrighted) data inside the Models anymore.
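That generate-describe-compare loop can be sketched in code. This is a toy illustration under loud assumptions: the "models" below are stand-in functions (a lossy word-dropper and a word-joiner), not real neural networks, and "updating parameters" is reduced to shrinking a single dropout value whenever the round-trip similarity falls below a threshold. The shape of the loop, not the internals, is the point.

```python
import random

# Hypothetical stand-ins for the two models. In reality both would be
# neural networks and the update would be gradient descent, not this.

def text_to_image(prompt, params):
    # Pretend the "image" is just a lossy encoding of the prompt words:
    # the higher the dropout parameter, the more information is lost.
    return [w for w in prompt.split() if random.random() > params["dropout"]]

def image_to_text(image):
    # Stand-in Image-to-Text model: recover a description from the "image".
    return " ".join(image)

def similarity(a, b):
    # Jaccard similarity over words: 1.0 means a perfect round trip.
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / max(len(sa | sb), 1)

def training_step(prompt, params, threshold=0.8, lr=0.05):
    image = text_to_image(prompt, params)          # prompt -> image
    description = image_to_text(image)             # image -> description
    score = similarity(prompt, description)        # compare prompt vs description
    if score < threshold:
        # "Update the parameters": make the generator less lossy
        # so the description matches the prompt more closely.
        params["dropout"] = max(0.0, params["dropout"] - lr)
    return params, score

random.seed(0)
params = {"dropout": 0.9}
for step in range(200):
    params, score = training_step("epic orchestral emotional soundtrack", params)
print(params["dropout"], round(score, 2))
```

After enough iterations the loop drives the toy "model" toward lossless round trips, which is the self-training idea from the paragraph above: the two directions supervise each other without any further real-world data.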
The moment a good Music-to-Text Model appears, all Text-to-Music Models will skyrocket.