“I’ve read that F1 cars are faster than those we drive on the streets… why don’t people use F1 cars then?” Well… the answer to this question is simple: F1 cars can’t brake or turn as well as most street cars can (the slowest street car could beat an F1 car in that scenario). The case of GPUs is very similar: they are good at following a straight line of processing, but they are not so good when it comes to choosing between different processing paths.
A program executed on the GPU makes sense when it must be executed many times in parallel, for instance when you have to blend all the pixels from Texture A with the pixels from Texture B and write them all to Texture C. This task, when executed on a CPU, would be processed as something like this:
for (int i = 0; i < nPixelCount; i++) TexC[i] = TexA[i] + TexB[i];
But this is slow when you have to process a lot of pixels, so instead of using the loop above, the GPU uses just the body:
TexC[i] = TexA[i] + TexB[i];
and then it populates all the cores with this program (essentially copying the program to each core), assigning a value of i to each one. This is where the GPU’s magic happens: all cores execute the program at the same time, completing the whole set of operations much faster than the linear CPU program could.
This way of working is fine when you have to process a huge number of small inputs in the same way, but it is really bad when the program contains conditional branching. So now let’s see what the CPU does when it comes to a condition check:
- 1: Execute the program until the first logical operation
- 2: Evaluate
- 3: Continue executing from the memory address result of the comparison (as with a JNZ asm instruction)
This is about as fast for the CPU as setting an index, but for the GPU to do the same is a lot more complicated. Because the GPU’s power comes from executing the same instruction at the same time (its cores are SIMD), the cores must stay synchronized to take advantage of the chip architecture. Preparing the GPU to deal with branches implies more or less:
- 1: Make a version of the program that follows only branch A, and populate all cores with this code.
- 2: Execute the program until the first logical operation
- 3: Evaluate all elements
- 4: Continue processing all elements that took branch A, and enqueue all processes that chose path B (for which there is no program in the core!). All the cores that chose path B are now IDLE!! The worst case is a single core executing while every other core just waits.
- 5: Once all the A elements have finished processing, activate the branch B version of the program (by copying it from the memory buffers into the cores’ small local memory).
- 6: Execute branch B.
- 7: If required, blend/merge both results.
This method may vary based on a lot of things (e.g. some very small branches can run without the need for this distinction), but now you can see why branching is an issue. GPU caches are very small; you can’t simply execute a program from VRAM in a linear way, so the GPU has to copy small blocks of instructions into the cores to be executed. With enough branches, your GPU will spend more time stalled than executing code, which makes no sense for a program that mostly follows a single path, as most programs do, even when running in multiple threads.
Compared to the F1 example, this would be like having to open braking parachutes in every corner, then getting out of the car to pack them back in until the next corner you want to take, or until you find a red traffic light (most likely the next corner).
Then of course there is the problem that other architectures are already so good at logical operations: far cheaper, more reliable, standardized, better known, more power-efficient, and so on. Newer video cards are hardly compatible with older ones without software emulation; they use different instruction sets even between cards from the same manufacturer. Also, for the time being, most computer applications do not require this type of parallel architecture, and even if they need it, they can use it through standard APIs such as OpenCL (as mentioned by eBusiness) or through the graphics APIs.
In a few decades we may have GPUs that can replace CPUs, but I don’t think it will happen any time soon.
I recommend the AMD APP documentation, which explains a lot about their GPU architecture; I also read about the NVIDIA architectures in the CUDA manuals, which helped me a lot in understanding this. I still don’t understand some things and I may be mistaken; hopefully someone who knows more can either confirm or correct my statements, which would be great for us all.
GPUs are very good at parallel tasks. Which is great… if you’re running parallel tasks.
Games are about the least parallelizable kind of application. Think about the main game loop. The AI (let’s assume the player is handled as a special case of the AI) needs to respond to collisions detected by the physics, so it must run afterwards. Or at the very least, the physics would have to call AI routines from within the boundary of the physics system (which is generally not a good idea for many reasons). Graphics can’t run until physics has run, because physics is what updates the positions of objects. Of course, AI needs to run before rendering as well, since AI can spawn new objects. Sound needs to run after the AI and player controls.
In general, games can thread themselves in very few ways. Graphics can be spun off in a thread; the game loop can shove a bunch of data at the graphics thread and say: render this. It can do some basic interpolation, so that the main game loop doesn’t have to be in sync with the graphics. Sound is another thread; the game loop says “play this”, and it is played.
After that, it all starts to get painful. If you have complex pathing algorithms (such as for RTS’s), you can thread those. It may take a few frames for the algorithms to complete, but they’ll be concurrent at least. Beyond that, it’s pretty hard.
So you’re looking at 4 threads: game, graphics, sound, and possibly long-term AI processing. That’s not much. And that’s not nearly enough for GPUs, which can have literally hundreds of threads in flight at once. That’s what gives GPUs their performance: being able to utilize all of those threads at once. And games simply can’t do that.
Now, perhaps you might be able to go “wide” for some operations. AIs, for instance, are usually independent of one another. So you could process several dozen AIs at once. Right up until you actually need to make them dependent on each other. Then you’re in trouble. Physics objects are similarly independent… unless there’s a constraint between them and/or they collide with something. Then they become very dependent.
Plus, there’s the fact that the GPU simply doesn’t have access to user input, which as I understand is kind of important to games. So that would have to be provided. It also doesn’t have direct file access or any real method of talking to the OS; so again, there would have to be some kind of way to provide this. Oh, and all that sound processing? GPUs don’t emit sounds. So those have to go back to the CPU and then out to the sound chip.
Oh, and coding for GPUs is terrible. It’s hard to get right, and what is “right” for one GPU architecture can be very, very wrong for another. And that’s not even just switching from AMD to NVIDIA; that could be switching from a GeForce 250 to a GeForce 450. That’s a change in the basic architecture. And it could easily make your code not run well. C++ and even C aren’t allowed; the best you get is OpenCL, which is sort of like C but without some of the niceties. Like recursion. That’s right: no recursion on GPUs.
Debugging? Oh, I hope you don’t like your IDE’s debugging features, because those certainly won’t be available. Even if you’re using GDB, kiss that goodbye. You’ll have to resort to printf debugging… wait, there’s no printf on GPUs. So you’ll have to write to memory locations and have your CPU stub program read them back. That’s right: manual debugging. Good luck with that.
Also, those helpful libraries you use in C/C++? Or perhaps you’re more of a .NET guy, using XNA and so forth. Or whatever. It doesn’t matter, since you can’t use any of them on the GPU. You must code everything from scratch. And if you have an already existing codebase, tough: time to rewrite all of that code.
So yeah. It’s horrible to actually do for any complex kind of game. And it wouldn’t even work, because games just aren’t parallel enough for it to help.
“Why” is not so easy to answer — it’s important to note that GPUs are specialized processors which are not really intended for generalized use like a regular CPU. Because of this specialization, it’s not surprising that a GPU can outperform a CPU for the things it was specifically designed (and optimized) for, but that doesn’t necessarily mean it can replace the full functionality and performance of a generalized CPU.
I suspect that developers don’t do this for a variety of reasons, including:
They want the graphics to be as fast and as high-quality as possible, and using valuable GPU resources for other work could interfere with this.
GPU-specific code may have to be written, and this will likely introduce additional complexity to the overall programming of the game (or application) at hand.
A GPU normally doesn’t have access to resources like network cards, keyboards, mice, and joysticks, so it’s not possible for it to handle every aspect of the game anyway.
In answer to the second part of your question: Yes, there are other uses. For example, projects like SETI@Home (and probably other BOINC projects) are using GPUs (such as those by nVidia) for high-speed complex calculations:
Run SETI@home on your NVIDIA GPU
(I like your question because it poses an interesting idea.)