So, there are two main ways we perceive directionality: level difference and time difference. A basic example: a mono sound can be made to sound more left by decreasing the level of the right channel, which would be using level difference.
This is generally true, and I don't mean to be pedantic, but the two primary localization cues our brains actually use to determine the source of a sound are time delay and frequency response.
The latter is a result of what's called the auditory shadow: frequencies above roughly 2 kHz get blocked by your head. So if a sound is coming from 90° to your left, it will reach your right ear at about the same volume, minus the frequencies blocked by that shadow.
You can test this by taking a mono signal, splitting it into two channels, and sending them out of your left and right speakers at equal volume. Then roll off the frequencies above 2 kHz on the right channel, and it will sound very much like the sound source is on your left.
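Here's a minimal sketch of that test in Python, assuming a 16-bit mono WAV called source.wav (the filename, filter order, and use of SciPy are just illustrative choices):

```python
# Head-shadow demo: duplicate a mono signal to two channels, then roll off
# everything above ~2 kHz on the right channel only.
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

rate, mono = wavfile.read("source.wav")   # placeholder: 16-bit mono WAV
mono = mono.astype(np.float64)

# A 4th-order Butterworth low-pass at 2 kHz stands in for the auditory shadow.
sos = butter(4, 2000, btype="lowpass", fs=rate, output="sos")

left = mono                  # full-bandwidth channel
right = sosfilt(sos, mono)   # highs rolled off, as if shadowed by the head

stereo = np.stack([left, right], axis=1).astype(np.int16)
wavfile.write("shadow_demo.wav", rate, stereo)
```

Play the result back on speakers or headphones and the image should pull noticeably to the left, even though both channels are at identical levels.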
The reason is that the human head is on average about 5 to 6 inches wide. Frequencies whose wavelengths are shorter than the head is wide get blocked, whereas longer wavelengths bend around your head. Sound waves in the 2 kHz to 2.5 kHz range have wavelengths comparable to that width or smaller, so they get blocked.
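The arithmetic is easy to check: wavelength is the speed of sound divided by frequency. A quick sketch, assuming ~343 m/s for sound in air at room temperature:

```python
# Wavelength check: lambda = c / f, with c ~ 343 m/s in air at 20 °C.
SPEED_OF_SOUND = 343.0                  # m/s
for freq in (2000.0, 2500.0):           # Hz
    wavelength = SPEED_OF_SOUND / freq  # metres
    print(f"{freq:.0f} Hz -> {wavelength / 0.0254:.1f} in")
# 2000 Hz -> 6.8 in
# 2500 Hz -> 5.4 in
```

So the crossover really does sit right around head width, which is why ~2 kHz keeps coming up as the shadow threshold.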
The other primary localization cue is the time delay between your two ears, as you already mentioned. But again, a difference in level is not required for localization to occur.
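You can demo this one too. A sketch using the same placeholder source.wav, shifting the image purely with time delay; the ~0.6 ms figure is roughly the maximum natural interaural delay for a human head:

```python
# ITD demo: identical levels in both channels, but the right channel is
# delayed by ~0.6 ms, so the image pulls to the left with no level change.
import numpy as np
from scipy.io import wavfile

rate, mono = wavfile.read("source.wav")   # placeholder: 16-bit mono WAV
mono = mono.astype(np.float64)

delay = int(round(0.0006 * rate))         # ~0.6 ms expressed in samples
pad = np.zeros(delay)
left = np.concatenate([mono, pad])        # arrives first
right = np.concatenate([pad, mono])       # arrives ~0.6 ms later

stereo = np.stack([left, right], axis=1).astype(np.int16)
wavfile.write("itd_demo.wav", rate, stereo)
```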
Indeed, it takes an extreme difference in level for localization to occur. If you take your split mono source, running to two separate channels on a mixer, and merely pull the volume down on one channel, localization only begins to occur when the difference is quite drastic.
But what is happening when you turn down, say, the right channel is that you're really just allowing the other localization cues to take effect.
After all, if the sound is only coming from your left speaker, then you are not simulating localization cues; you have a genuine sound source on your left side. So the time delay and the frequency roll-off are allowed to have their effect.
When you have a mono sound source running to two speakers, each speaker has its own set of localization cues, time delay and auditory shadow, but those cues are masked by the other speaker. So it sounds like it's coming from the middle, sort of.
So lowering the level of one side is actually just removing this masking effect.
This all might seem a bit academic, but it has real-world consequences. When a mono source comes out of two spatially separated speakers, each ear hears a delayed version of the opposite speaker combined with the non-delayed speaker on its own side. This creates a kind of fuzziness that we've all just gotten used to. If you want to hear a mix without that fuzziness, listen to some of the old Beatles records where everything was panned hard left or hard right.
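One way to picture that fuzziness: summing a signal with a short-delayed copy of itself produces comb filtering, a series of deep notches in the frequency response. A sketch, using an assumed ~0.25 ms extra path from the far speaker to each ear (the exact figure depends on your head and speaker geometry):

```python
# Comb-filter notches from summing x(t) with x(t - d): cancellation occurs
# wherever the delay equals half a period, i.e. f = (2k + 1) / (2d).
delay = 0.00025   # ~0.25 ms extra path to the far ear (assumed geometry)
for k in range(4):
    notch = (2 * k + 1) / (2 * delay)
    print(f"notch {k + 1}: {notch:,.0f} Hz")
# notch 1: 2,000 Hz
# notch 2: 6,000 Hz
# notch 3: 10,000 Hz
# notch 4: 14,000 Hz
```

Hard panning sidesteps all of this, because each ear only ever gets one arrival per source.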
We all find this panning method a bit novel these days, but it was actually responsible for making those Beatles tracks sound incredibly punchy. When you're in a room and the bass is coming out of one speaker and one speaker only, clarity ensues.