It's totally GPU. SLI lets you put, say, two PCIe cards in one box that share the graphics processing load. However, unless I'm mistaken, their jobs require synchronization between them, since GPU processing is not discrete: the work doesn't split into fully independent pieces. So the PCIe bus has to pass sync data between them. Therefore, having two PCIe cards doesn't net you exactly twice the performance. It's faster, but not twice as fast.
But I'm not the expert in this arena, for sure. I think vs is the Cache heavyweight when it comes to gaming hardware. He described one of his systems and I began muttering like Beaker from The Muppet Show.
Yeah, but like you guessed, vsloathe is talking about gaming and I'm talking about CUDA. Last time I checked, SLI wasn't supported very well, and everybody I asked recommended writing your own task dispatch system because all cores are viewed as individuals. I don't know enough to say whether the PCIe bus slows it down, but at least for now I don't see why it would matter much if you are sending 1+1 to core 1, 2+2 to core 2, and so on. With very large processing pieces, though, syncing might become a bit of an issue. However, I think that, at least with my CUDA coding skills, the bottleneck is my code and not the hardware.
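
For what it's worth, here is a rough sketch of what "write your own task dispatch" can look like with the CUDA runtime API. It is only an illustration under my own assumptions, not anyone's recommended design: the kernel double_elements, the CHECK macro, the array size and the even chunk split are all made up for the example. The host picks each device in turn with cudaSetDevice, queues that device's chunk, and only waits once every device has work.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "CUDA error: %s\n",                  \
                         cudaGetErrorString(err_));                   \
            std::exit(1);                                             \
        }                                                             \
    } while (0)

// Hypothetical kernel: doubles every element of the chunk it is given.
__global__ void double_elements(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int device_count = 0;
    CHECK(cudaGetDeviceCount(&device_count));
    if (device_count == 0) {
        std::printf("no CUDA device found\n");
        return 1;
    }

    const int total = 1 << 20;  // assumed problem size
    const size_t bytes = total * sizeof(float);

    // Pinned, portable host memory so async copies can overlap across devices.
    float *host = nullptr;
    CHECK(cudaHostAlloc((void **)&host, bytes, cudaHostAllocPortable));
    for (int i = 0; i < total; ++i) host[i] = 1.0f;

    // Even split: each device gets one contiguous chunk.
    const int chunk = (total + device_count - 1) / device_count;
    std::vector<float *> dev_ptrs(device_count, nullptr);

    // Phase 1: queue copy-in, kernel and copy-out on every device without waiting.
    for (int d = 0; d < device_count; ++d) {
        int offset = d * chunk;
        int n = total - offset < chunk ? total - offset : chunk;
        if (n <= 0) continue;
        CHECK(cudaSetDevice(d));  // later calls target device d
        CHECK(cudaMalloc((void **)&dev_ptrs[d], n * sizeof(float)));
        CHECK(cudaMemcpyAsync(dev_ptrs[d], host + offset, n * sizeof(float),
                              cudaMemcpyHostToDevice));
        double_elements<<<(n + 255) / 256, 256>>>(dev_ptrs[d], n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(host + offset, dev_ptrs[d], n * sizeof(float),
                              cudaMemcpyDeviceToHost));
    }

    // Phase 2: wait for each device, then release its buffer.
    for (int d = 0; d < device_count; ++d) {
        CHECK(cudaSetDevice(d));
        CHECK(cudaDeviceSynchronize());
        CHECK(cudaFree(dev_ptrs[d]));
    }

    // Check the combined result on the host.
    bool ok = true;
    for (int i = 0; i < total; ++i) ok = ok && (host[i] == 2.0f);
    std::printf("devices used: %d, result %s\n", device_count,
                ok ? "correct" : "wrong");

    CHECK(cudaFreeHost(host));
    return ok ? 0 : 1;
}
```

Each chunk here is independent, so the only traffic over PCIe is the copy in and the copy out. If the chunks needed to exchange intermediate results, that is where the syncing cost would show up. On a single-GPU box the loops just run once.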

But this new GigaÜberThread system will most likely be added to CUDA as well, so it might make CUDA even more powerful by default.