Artificial Intelligence and deep learning are constantly in the headlines these days, whether it's ChatGPT generating poor advice, self-driving cars, artists being accused of using AI, or medical advice coming from AI. Most of these tools rely on complex servers with lots of hardware for training, but using the trained network via inference can be done on your PC, using its graphics card. But how fast are consumer GPUs at AI inference?
We’ve benchmarked Stable Diffusion, a popular AI image creator, on the latest Nvidia, AMD, and even Intel GPUs to see how they stack up. If you’ve by chance tried to get Stable Diffusion up and running on your own PC, you may have some inkling of how complex — or simple! — that can be. The short summary is that Nvidia’s GPUs rule the roost, with most software designed using CUDA and other Nvidia toolsets. But that doesn’t mean you can’t get Stable Diffusion running on the other GPUs.
We ended up using three different Stable Diffusion projects for our testing, mostly because no single package worked on every GPU. For Nvidia, we opted for Automatic 1111's webui version; it performed best, had more options, and was easy to get running. AMD GPUs were tested using Nod.ai's Shark version, and we also ran that project (in Vulkan mode) on the Nvidia GPUs; more on those results below. Getting Intel's Arc GPUs running was a bit more difficult, due to lack of software support, but Stable Diffusion OpenVINO gave us some very basic functionality.
Disclaimers are in order. We didn't code any of these tools, but we did look for software that was easy to get running (under Windows) and also seemed reasonably optimized. We're relatively confident that the Nvidia 30-series tests do a good job of extracting close to optimal performance, particularly when xformers is enabled, which provides an additional ~20% boost in performance (though at reduced precision that may affect quality). RTX 40-series results were initially lower, but George SV8ARJ provided this fix, where replacing the PyTorch CUDA DLLs gave a healthy boost to performance.
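If you're trying to reproduce that setup, a quick sanity check helps confirm the environment. The snippet below is our own sketch, not code from any of the projects we tested: it just prints which CUDA/cuDNN build PyTorch actually loads after swapping the DLLs, and whether xformers is importable.

```python
# Sanity-check sketch (not part of any tested project): verify the PyTorch
# CUDA/cuDNN build in use and that xformers is available.
import torch

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0))

try:
    import xformers  # Automatic 1111 uses this when launched with --xformers
    print("xformers:", xformers.__version__)
except ImportError:
    print("xformers not installed; launch the webui without --xformers")
```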
The AMD results are also a bit of a mixed bag: RDNA 3 GPUs perform quite well, while the RDNA 2 GPUs seem rather mediocre. Nod.ai let us know they're still working on 'tuned' models for RDNA 2, which should boost performance quite a bit (potentially doubling it) once they're available. Finally, on Intel GPUs, even though the ultimate performance seems to line up decently with the AMD options, in practice the time to render is substantially longer: it takes 5–10 seconds before the actual generation task even kicks off, and a lot of extra background work is probably happening that slows things down.
We’re also using different Stable Diffusion models, due to the choice of software projects. Nod.ai’s Shark version uses SD2.1, while Automatic 1111 and OpenVINO use SD1.4 (though it’s possible to enable SD2.1 on Automatic 1111). Again, if you have some inside knowledge of Stable Diffusion and want to recommend different open source projects that may run better than what we used, let us know in the comments (or just email Jarred).
Our testing parameters are the same for all GPUs, though there’s no negative prompt option on the Intel version (at least, not that we could find). The above gallery was generated using Automatic 1111’s webui on Nvidia GPUs, with higher resolution outputs (that take much, much longer to complete). It uses the same prompts but targets 2048×1152 instead of the 512×512 we used for our benchmarks. Note that the settings we chose were selected to work on all three SD projects; some options that can improve throughput are only available on Automatic 1111’s build, but more on that later. Here are the pertinent settings (a rough script-level equivalent follows after the list):
Positive Prompt: postapocalyptic steampunk city, exploration, cinematic, realistic, hyper detailed, photorealistic maximum detail, volumetric light, (((focus))), wide-angle, (((brightly lit))), (((vegetation))), lightning, vines, destruction, devastation, wartorn, ruins
Negative Prompt: (((blurry))), ((foggy)), (((dark))), ((monochrome)), sun, (((depth of field)))
Steps: 100
Classifier Free Guidance: 15.0
Sampling Algorithm: Some Euler variant (Ancestral on Automatic 1111, Shark Euler Discrete on AMD)
The sampling algorithm doesn’t appear to have a major effect on performance, though it can affect the output. Automatic 1111 provides the most options, while the Intel OpenVINO build doesn’t give you any choice.
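For reference, here's roughly what those settings look like in a standalone script built on Hugging Face's diffusers library. This is an illustrative sketch only: none of the three projects we tested actually runs this code, the output filename is made up, and Automatic 1111's (((parentheses))) prompt-weighting syntax is simply treated as literal text by diffusers.

```python
# Rough diffusers equivalent of the benchmark settings (illustrative sketch only).
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

# SD1.4 checkpoint, matching what Automatic 1111 and the OpenVINO build use.
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# Euler Ancestral sampler, matching our Automatic 1111 setting.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe(
    prompt=(
        "postapocalyptic steampunk city, exploration, cinematic, realistic, "
        "hyper detailed, photorealistic maximum detail, volumetric light, "
        "(((focus))), wide-angle, (((brightly lit))), (((vegetation))), "
        "lightning, vines, destruction, devastation, wartorn, ruins"
    ),
    negative_prompt=(
        "(((blurry))), ((foggy)), (((dark))), ((monochrome)), sun, "
        "(((depth of field)))"
    ),
    num_inference_steps=100,  # Steps: 100
    guidance_scale=15.0,      # Classifier Free Guidance: 15.0
    height=512,               # 512x512 benchmark resolution
    width=512,
).images[0]
image.save("steampunk_city.png")
```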
Here are the results from our testing of the AMD RX 7000/6000-series, Nvidia RTX 40/30-series, and Intel Arc A-series GPUs. Note that each Nvidia GPU has two results, one using the default computational model (slower and in black) and a second using the faster “xformers” library from Facebook (faster and in green).
As expected, Nvidia’s GPUs deliver far better performance than anything from AMD or Intel, sometimes by massive margins. With the DLL fix for Torch in place, the RTX 4090 delivers 50% more performance than the RTX 3090 Ti with xformers, and 43% better performance without xformers. It takes just over three seconds to generate each image, and even the RTX 4070 Ti is able to squeak past the 3090 Ti (but not if you disable xformers).
Performance on the Nvidia cards falls off in a pretty consistent fashion from the 3090 down to the 3050. Meanwhile, AMD’s RX 7900 cards only land at about the same level as an RTX 3080, and all the RTX 30-series cards end up beating AMD’s RX 6000-series parts (for now). Finally, the Intel Arc GPUs come in nearly last, with only the A770 managing to outpace the RX 6600. Let’s talk a bit more about the oddities.
Proper optimizations could double the performance on the RX 6000-series cards. Nod.ai says it should have tuned models in the coming days, at which point the overall standing should start to correlate better with the theoretical performance. Speaking of Nod.ai, we also did some testing of some Nvidia GPUs using that project, and with the Vulkan models the Nvidia cards were substantially slower than with Automatic 1111’s build (15.52 it/s on the 4090, 13.31 on the 4080, 11.41 on the 3090 Ti, and 10.76 on the 3090 — we couldn’t test the other cards as they need to be enabled first).
Based on the performance of the 7900 cards using tuned models, we also suspect most of the Nvidia cards aren’t using Tensor cores at all — more on that in a moment. If that’s true, fully utilizing Tensor cores could give a massive boost to Nvidia. That same logic applies to the Intel cards.
Intel’s Arc GPUs currently deliver very disappointing results, especially since they support XMX (matrix) operations that should deliver up to 4X the throughput of regular FP32 computations. We suspect the current Stable Diffusion OpenVINO project that we used also leaves a lot of room for improvement. Incidentally, if you want to try running SD on an Arc GPU, note that you have to edit the ‘stable_diffusion_engine.py’ file and change “CPU” to “GPU”; otherwise it won’t use the graphics card for the calculations, and image generation takes substantially longer.
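We won't reproduce the project's code here, but conceptually that edit boils down to which device string gets passed to OpenVINO when the models are compiled. Here's a minimal sketch using OpenVINO's runtime API; the helper function is our own invention, as the real file simply hard-codes the device string where the models get compiled.

```python
# Sketch of routing OpenVINO inference to an Arc GPU. The helper name is
# hypothetical; stable_diffusion_engine.py structures this differently and
# hard-codes the device string instead.
from openvino.runtime import Core

def compile_for_device(model_path: str, device: str = "GPU"):
    """Compile an OpenVINO IR model for the given device ("CPU" or "GPU")."""
    core = Core()
    model = core.read_model(model_path)
    return core.compile_model(model, device)

# e.g. unet = compile_for_device("unet.xml", "GPU")  # the project ships with "CPU"
```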
Overall then, using the specified versions, Nvidia’s RTX cards are generally the fastest choice, especially for the top models (3080 and above). AMD’s RX 7000-series cards also do great, but the RX 6000-series underperforms, and Arc GPUs look generally poor. Things could change radically with updated software, and given the popularity of AI we expect it’s only a matter of time before we see better tuning (or find the right project that’s already tuned to deliver better performance).
Again, it’s not clear exactly how optimized any of these projects are. It’s also not clear whether these projects are leveraging things like Nvidia’s Tensor cores or Intel’s XMX cores (definitely not on the latter, maybe on Nvidia). As such, we thought it would be interesting to look at the maximum theoretical performance (TFLOPS) of the various GPUs. The following chart shows the theoretical FP16 performance for each GPU, using tensor/matrix cores where applicable.
Those Tensor cores on Nvidia clearly pack a punch, and obviously our Stable Diffusion testing doesn’t match up exactly with these figures. For example, on paper the RTX 4090 (using FP16) is up to 106% faster than the RTX 3090 Ti, while in our tests it was 43% faster without xformers, and 50% faster with xformers. Note also that we’re assuming the Stable Diffusion project we used (Automatic 1111) doesn’t even attempt to leverage the new FP8 instructions on Ada Lovelace GPUs, which could potentially double the performance on RTX 40-series again.
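As a quick check on that 106% figure, the math is just a ratio of theoretical dense FP16 Tensor-core throughput. The spec numbers in the sketch below are approximate published figures (dense, no sparsity), not anything we measured.

```python
# Back-of-the-envelope check on the "106% faster on paper" claim, using
# approximate published dense FP16 Tensor-core throughput.
fp16_tflops = {
    "RTX 4090": 330.0,
    "RTX 3090 Ti": 160.0,
}

theoretical_gain = fp16_tflops["RTX 4090"] / fp16_tflops["RTX 3090 Ti"] - 1
print(f"Theoretical FP16 advantage: {theoretical_gain:.0%}")  # -> ~106%
# Measured Stable Diffusion advantage at 512x512: 43% (no xformers), 50% (xformers)
```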
Meanwhile, look at the Arc GPUs. Their matrix cores should provide similar performance to the RTX 3060 Ti and RX 7900 XTX, give or take, with the A380 down around the RX 6800. In practice, Arc GPUs are nowhere near those marks. The fastest A770 GPUs land between the RX 6600 and RX 6600 XT, the A750 falls just behind the RX 6600, and the A380 is about one fourth the speed of the A750. So they’re all about a quarter of the expected performance, which would make sense if the XMX cores aren’t being used.
The internal ratios on Arc do look about right, though. Theoretical compute performance on the A380 is about one-fourth the A750, and that’s where it lands in terms of Stable Diffusion performance right now. Most likely, the GPUs are using shaders for the computations, in full precision FP32 mode, and missing out on some additional optimizations.
The other thing to notice is that theoretical compute on AMD’s RX 7900 XTX/XT improved a lot compared to the RX 6000-series. We’ll have to see whether the tuned 6000-series models close that gap. Memory bandwidth wasn’t a critical factor, at least for the 512×512 target resolution we used; the 3080 10GB and 12GB models land relatively close together. It’s a bit odd that the 7900 XT performs nearly as well as the XTX, though, since raw compute should favor the XTX by about 19% rather than the 3% advantage we measured.
Ultimately, this is more of a snapshot in time of Stable Diffusion performance on AMD, Intel, and Nvidia GPUs rather than a true statement of performance. With full optimizations, the performance should look more like the theoretical TFLOPS chart, and certainly newer RTX 40-series cards shouldn’t fall behind existing RTX 30-series parts.
Which brings us to one final chart, where we did some higher resolution testing. We haven’t tested the new AMD GPUs at this resolution yet, as we had to use Linux for the AMD RX 6000-series cards we tested. But check out the RTX 40-series results, with the Torch DLLs replaced. The RTX 4090 is now 72% faster than the 3090 Ti without xformers, and a whopping 134% faster with xformers. The 4080 also beats the 3090 Ti by 55%/18% with/without xformers. The 4070 Ti, interestingly, was 22% slower than the 3090 Ti without xformers, but 20% faster with xformers.
It looks like the more complex target resolution of 2048×1152 starts to take better advantage of the potential compute resources, and perhaps the longer run times mean the Tensor cores can flex their muscle. (FWIW, I’m still not clear on whether or not Tensor cores are being used, or if the various SD projects are just using GPU shaders to do FP16.) Will we see similar improvements with AMD’s new GPUs, and what about Intel? We’ll see about revisiting this topic more in the coming year, hopefully with better optimized code for all the various GPUs.