Home » C++ low-level optimization tips [closed]

C++ low-level optimization tips [closed]


Optimise your data layout! (This applies to more languages than just C++)

You can go pretty deep making this specifically tuned for your data, your processor, handling multi-core nicely, etc. But the basic concept is this:

When you are processing things in a tight loop, you want to make the data for each iteration as small as possible, and as close together as possible in memory. That means the ideal is an array or vector of objects (not pointers) that contain only the data necessary for the calculation.

This way, when the CPU fetches the data for the first iteration of your loop, the next several iterations worth of data will get loaded into the cache with it.

Really the CPU is fast and the compiler is good. There’s not really much you can do with using fewer and faster instructions. Cache coherence is where it’s at (that’s a random article I Googled – it contains a good example of getting cache coherency for an algorithm that doesn’t simply run through data linearly).

A very, very low-level tip, but one that can come in handy:

Most compilers support some form of explicit conditional hinting. GCC has a function called __builtin_expect which lets you inform the compiler what the value of a result probably is. GCC can use that data to optimize conditionals to perform as quickly as possible in the expected case, with slightly slower execution in the unexpected case.

if(__builtin_expect(entity->extremely_unlikely_flag, 0)) {
  // code that is rarely run

I’ve seen a 10-20% speedup with proper use of this.

The first thing you need to understand is the hardware you’re running on. How does it handle branching? What about caching? Does it have a SIMD instruction set? How many processors can it use? Does it have to share processor time with anything else?

You may solve the same problem in very different ways – even your choice of algorithm should be dependent on the hardware. In some cases O(N) can run slower than O(NlogN) (depending on implementation).

As a crude overview of optimisation, the first thing I would do is look at exactly what problems and what data you are trying to solve for. Then optimise for that. If you want extreme performance then forget about generic solutions – you can special case everything that doesn’t match your most used case.

Then profile. Profile, profile, profile. Look at memory usage, look at branching penalties, Look at function call overhead, look at pipeline utilisation. Work out what is slowing down your code. It’s probably data access (I wrote an article called “The Latency Elephant” about the overhead of data access – google it. I can’t post 2 links here as I don’t have enough “reputation”), so closely examine that and then optimise your data layout (nice big flat homogeneous arrays are awesome) and data access (prefetch where possible).

Once you’ve minimised the overhead of the memory subsystem, try and determine if instructions are now the bottleneck (hopefully they are), then look at SIMD implementations of your algorithm – Structure-of-Arrays (SoA) implementations can be very data and instruction cache efficient. If SIMD isn’t a good match for your problem then intrinsics and assembler level coding may be needed.

If you still need more speed then go parallel. If you have the benefit of running on a PS3 then the SPUs are your friends. Use them, love them. If you’ve already written a SIMD solution then you’ll get a massive benefit moving to SPU.

And then, profile some more. Test in game scenarios – is this code still the bottleneck? Can you change the way this code is used at a higher level to minimise its usage (actually, this should be your first step)? Can you defer calculations over multiple frames?

Whatever platform you’re on, learn as much as you can about the hardware and the profilers available. Don’t assume that you know what the bottleneck is – find it with your profiler. And make sure you have a heuristic to determine if you have actually made your game go faster.

And then profile it again.

Related Solutions

Extract file from docker image?

You can extract files from an image with the following commands: docker create $image # returns container ID docker cp $container_id:$source_path $destination_path docker rm $container_id According to the docker create documentation, this doesn't run the...

Transfer files using scp: permission denied

Your commands are trying to put the new Document to the root (/) of your machine. What you want to do is to transfer them to your home directory (since you have no permissions to write to /). If path to your home is something like /home/erez try the following:...

What’s the purpose of DH Parameters?

What exactly is the purpose of these DH Parameters? These parameters define how OpenSSL performs the Diffie-Hellman (DH) key-exchange. As you stated correctly they include a field prime p and a generator g. The purpose of the availability to customize these...

How to rsync multiple source folders

You can pass multiple source arguments. rsync -a /etc/fstab /home/user/download bkp This creates bkp/fstab and bkp/download, like the separate commands you gave. It may be desirable to preserve the source structure instead. To do this, use / as the source and...

Benefits of Structured Logging vs basic logging

There are two fundamental advances with the structured approach that can't be emulated using text logs without (sometimes extreme levels of) additional effort. Event Types When you write two events with log4net like: log.Debug("Disk quota {0} exceeded by user...

Interfaces vs Types in TypeScript

2019 Update The current answers and the official documentation are outdated. And for those new to TypeScript, the terminology used isn't clear without examples. Below is a list of up-to-date differences. 1. Objects / Functions Both can be used to describe the...

Get total as you type with added column (append) using jQuery

One issue if that the newly-added column id's are missing the id number. If you look at the id, it only shows "price-", when it should probably be "price-2-1", since the original ones are "price-1", and the original ones should probably be something like...

Determining if a file is a hard link or symbolic link?

Jim's answer explains how to test for a symlink: by using test's -L test. But testing for a "hard link" is, well, strictly speaking not what you want. Hard links work because of how Unix handles files: each file is represented by a single inode. Then a single...

How to restrict a Google search to results of a specific language?

You can do that using the advanced search options: http://www.googleguide.com/sharpening_queries.html I also found this, which might work for you: http://www.searchenginejournal.com/how-to-see-google-search-results-for-other-locations/25203/ Just wanted to add...

Random map generation

Among the many other related questions on the site, there's an often linked article for map generation: Polygonal Map Generation for Games you can glean some good strategies from that article, but it can't really be used as is. While not a tutorial, there's an...

How to prettyprint a JSON file?

The json module already implements some basic pretty printing in the dump and dumps functions, with the indent parameter that specifies how many spaces to indent by: >>> import json >>> >>> your_json = '["foo", {"bar":["baz", null,...

How can I avoid the battery charging when connected via USB?

I have an Android 4.0.3 phone without root access so can't test any of this but let me point you to /sys/class/power_supply/battery/ which gives some info/control over charging issues. In particular there is charging_enabled which gives the current state (0 not...

How to transform given dataset in python? [closed]

From your expected result, it appears that each "group" is based on contiguous id values. For this, you can use the compare-cumsum-groupby pattern, and then use agg to get the min and max values. # Sample data. df = pd.DataFrame( {'id': [1, 2, 2, 2, 2, 2, 1, 1,...

Output of the following C++ Program [closed]

It works exactly like this non-recursive translation: int func_0() { return 2; } int func_1() { return 3; } int func_2() { return func_1() + func_0(); } // Returns 3 + 2 = 5 int func_3() { return func_2() + func_1(); } // Returns 5 + 3 = 8 int func_4() { return...

Making a circle out of . (periods) [closed]

Here's the maths and even an example program in C: http://pixwiki.bafsoft.com/mags/5/articles/circle/sincos.htm (link no longer exists). And position: absolute, left and top will let you draw: http://www.w3.org/TR/CSS2/visuren.html#choose-position Any further...

Should I use a code converter (Python to C++)?

Generally it's an awful way to write code, and does not guarantee that it will be any faster. Things which are simple and fast in one language can be complex and slow in another. You're better off either learning how to write fast Python code or learning C++...

tkinter: cannot concatenate ‘str’ and ‘float’ objects

This one line is more than enough to cause the problem: text="რეგულარი >> "+2.23+ 'GEL' 2.23 is a floating-point value; 'GEL' is a string. What does it mean to add an arithmetic value and a string of letters? If you want the string label 'რეგულარი...