
Are there any benefits for using the CPU instead of the GPU?


“I’ve read that F1 cars are faster than those we drive on the streets… why don’t people use F1 cars then?” Well, the answer is simple: F1 cars can’t brake or turn as quickly as most road cars can (the slowest road car could beat an F1 car in that respect). GPUs are very similar: they are good at following a straight line of processing, but not so good when it comes to choosing between different processing paths.

A program executed on the GPU makes sense when it must be executed many times in parallel, for instance when you have to blend all the pixels from Texture A with the pixels from Texture B and put the result in Texture C. On a CPU, this task would be processed as something like this:

for (int i = 0; i < nPixelCount; i++)
     TexC[i] = TexA[i] + TexB[i];

But this is slow when you have to process a lot of pixels, so instead of using the code above, the GPU just uses the loop body:

     TexC[i] = TexA[i] + TexB[i];

and then it populates all the cores with this program (essentially copying the program to each core), assigning a value of i to each. That is where the magic of the GPU comes in: all cores execute the program at the same time, performing many operations much faster than the linear CPU program could.

This way of working is fine when you have to process a great many small inputs in the same way, but it is really bad when the program contains conditional branching. So let’s see what the CPU does when it hits a condition check:

  • 1: Execute the program until the first logical operation
  • 2: Evaluate
  • 3: Continue executing from the memory address given by the result of the comparison (as with a JNZ asm instruction)

For the CPU this is as fast as setting an instruction pointer, but for the GPU doing the same is a lot more complicated. Because the power of the GPU comes from executing the same instruction on many cores at the same time (they are SIMD cores), the cores must stay synchronized to take advantage of the chip architecture. Preparing the GPU to deal with branches implies, more or less:

  • 1: Make a version of the program that follows only branch A, populate this code in all cores.
  • 2: Execute the program until the first logical operation
  • 3: Evaluate all elements
  • 4: Continue processing all elements that follow branch A, and enqueue all elements that chose path B (for which there is no program in the core!). All the cores whose elements chose path B will now be idle, the worst case being a single core executing while every other core just waits.
  • 5: Once all As are finished processing, activate the branch B version of the program (by copying it from the memory buffers to some small core memory).
  • 6: Execute branch B.
  • 7: If required, blend/merge both results.

This method may vary depending on many things (e.g. some very small branches can run without this distinction), but now you can see why branching is an issue. GPU caches are very small; you can’t simply execute a program from VRAM in a linear way. The GPU has to copy small blocks of instructions to the cores for execution, and if there are enough branches your GPU will spend more time stalled than executing code. That makes no sense for a program that only follows one branch at a time, as most programs do, even when running in multiple threads.
Compared with the F1 example, this would be like having to deploy braking parachutes in every corner, then get out of the car and pack them back in until the next corner you want to take, or the next red light (most likely the next corner).

Then of course there is the fact that other architectures are already so good at logical operations: far cheaper, more reliable, standardized, better understood, and more power-efficient. Newer video cards are hardly compatible with older ones without software emulation; they use different asm instructions even between cards from the same manufacturer. And for the time being most computer applications do not require this type of parallel architecture; even when they do need it, they can use it through standard APIs such as OpenCL (as mentioned by eBusiness) or through the graphics APIs.
Perhaps in a few decades we will have GPUs that can replace CPUs, but I don’t think it will happen any time soon.

I recommend the AMD APP documentation, which explains a lot about their GPU architecture; I also read about NVIDIA’s in the CUDA manuals, which helped me a lot in understanding this. I still don’t understand some things and I may be mistaken; probably someone who knows more can either confirm or deny my statements, which would be great for us all.

GPUs are very good at parallel tasks. Which is great… if you’re running parallel tasks.

Games are about the least parallelizable kind of application. Think about the main game loop. The AI (let’s assume the player is handled as a special case of the AI) needs to respond to collisions detected by the physics, so it must run afterwards. Or at the very least, the physics needs to call AI routines within the boundary of the physics system (which is generally not a good idea, for many reasons). Graphics can’t run until physics has run, because physics is what updates the positions of objects. Of course, AI needs to run before rendering as well, since AI can spawn new objects. Sound needs to run after the AI and player input, too.

In general, games can thread themselves in very few ways. Graphics can be spun off in a thread; the game loop can shove a bunch of data at the graphics thread and say: render this. It can do some basic interpolation, so that the main game loop doesn’t have to be in sync with the graphics. Sound is another thread; the game loop says “play this”, and it is played.

After that, it all starts to get painful. If you have complex pathing algorithms (such as for RTSs), you can thread those. It may take a few frames for the algorithms to complete, but at least they’ll run concurrently. Beyond that, it’s pretty hard.

So you’re looking at 4 threads: game, graphics, sound, and possibly long-term AI processing. That’s not much. And that’s not nearly enough for GPUs, which can have literally hundreds of threads in flight at once. That’s what gives GPUs their performance: being able to utilize all of those threads at once. And games simply can’t do that.

Now, perhaps you might be able to go “wide” for some operations. AIs, for instance, are usually independent of one another. So you could process several dozen AIs at once. Right up until you actually need to make them dependent on each other. Then you’re in trouble. Physics objects are similarly independent… unless there’s a constraint between them and/or they collide with something. Then they become very dependent.

Plus, there’s the fact that the GPU simply doesn’t have access to user input, which as I understand is kind of important to games. So that would have to be provided. It also doesn’t have direct file access or any real method of talking to the OS; so again, there would have to be some kind of way to provide this. Oh, and all that sound processing? GPUs don’t emit sounds. So those have to go back to the CPU and then out to the sound chip.

Oh, and coding for GPUs is terrible. It’s hard to get right, and what is “right” for one GPU architecture can be very, very wrong for another. And that’s not even just switching from AMD to NVIDIA; that could be switching from a GeForce 250 to a GeForce 450. That’s a change in the basic architecture. And it could easily make your code not run well. C++ and even C aren’t allowed; the best you get is OpenCL, which is sort of like C but without some of the niceties. Like recursion. That’s right: no recursion on GPUs.

Debugging? Oh, I hope you don’t like your IDE’s debugging features, because those certainly won’t be available. Even if you’re using GDB, kiss that goodbye. You’ll have to resort to printf debugging… wait, there’s no printf on GPUs. So you’ll have to write to memory locations and have your CPU stub program read them back.

That’s right: manual debugging. Good luck with that.

Also, those helpful libraries you use in C/C++? Or perhaps you’re more of a .NET guy, using XNA and so forth. Or whatever. It doesn’t matter, since you can’t use any of them on the GPU. You must code everything from scratch. And if you have an already existing codebase, tough: time to rewrite all of that code.

So yeah. It’s horrible to actually do for any complex kind of game. And it wouldn’t even work, because games just aren’t parallel enough for it to help.

The “why” is not so easy to answer. It’s important to note that GPUs are specialized processors that are not really intended for generalized use like a regular CPU. Because of this specialization, it’s not surprising that a GPU can outperform a CPU at the things it was specifically designed (and optimized) for, but that doesn’t necessarily mean it can replace the full functionality and performance of a general-purpose CPU.

I suspect that developers don’t do this for a variety of reasons, including:

  • They want the graphics to be as fast and as high-quality as possible, and using valuable GPU resources for other work could interfere with this.

  • GPU-specific code may have to be written, and this will likely introduce additional complexity to the overall programming of the game (or application) at hand.

  • A GPU normally doesn’t have access to resources like network cards, keyboards, mice, and joysticks, so it’s not possible for it to handle every aspect of the game anyway.

In answer to the second part of your question: Yes, there are other uses. For example, projects like SETI@Home (and probably other BOINC projects) are using GPUs (such as those by nVidia) for high-speed complex calculations:

  Run SETI@home on your NVIDIA GPU

(I like your question because it poses an interesting idea.)
