
Disproving Euler proposition by brute force in C

Solutions:


Note: At some point, this review drifted into the realm of assembler and GMP. An actual review is at the end of this post, whereas the first section discusses the runtime problems concerning pow, wrong data types and arbitrarily large integers.

No life time for run time

Would there be any way (on my current machine) to get this to run in my lifetime?

There’s a great saying in carpentry: measure twice, cut once. It concerns cutting wood or other material, where you have to throw away your resources if you accidentally cut at the wrong place.

A similar saying is there for software engineers: you can’t optimize what you can’t measure. There are several ways to measure your code, e.g. benchmarking, profiling, or looking at the generated assembler to see how many instructions a certain part of your code will take.

Here, we will take the latter route: start with the assembler, work through the considerations step by step, and see where we end up.

A study in assembly

Let’s have a look at your code. Well, not yours, but the assembler the compiler generates. You can use gcc -S -O3. On my platform, this results in the following “hot” section in main:

.L6:
        add     rbx, 1
        cmp     rbx, 500000
        je      .L18
.L8:
        mov     rax, QWORD PTR .LC0[rip]
        movsd   xmm0, QWORD PTR [rsp+40]
        movq    xmm1, rax
        call    pow                        ; (1)
        mov     rax, QWORD PTR .LC0[rip]
        movsd   QWORD PTR [rsp+8], xmm0
        movsd   xmm0, QWORD PTR [rsp+48]
        movq    xmm1, rax
        call    pow                        ; (2)
        mov     rax, QWORD PTR .LC0[rip]
        movsd   QWORD PTR [rsp+16], xmm0
        movsd   xmm0, QWORD PTR [rsp+32]
        movq    xmm1, rax
        call    pow                        ; (3)
        mov     rax, QWORD PTR .LC0[rip]
        movsd   QWORD PTR [rsp+24], xmm0
        pxor    xmm0, xmm0
        cvtsi2sd        xmm0, rbx
        movq    xmm1, rax
        call    pow                        ; (4)
        movsd   xmm2, QWORD PTR [rsp+8]
        addsd   xmm2, QWORD PTR [rsp+16]
        movapd  xmm1, xmm0
        movsd   xmm0, QWORD PTR [rsp+24]
        addsd   xmm0, xmm2
        ucomisd xmm0, xmm1
        jp      .L6
        jne     .L6

Even though you might not know assembler, you can see those four calls to pow. The first thing to know is that a call is slow compared to the other operations here, and those four calls happen in the innermost loop. The compiler removed the call to prop and inlined its body instead (that’s faster).

mov* assigns values to registers, add* adds values, and so on. The xmm* registers are double-precision registers, meant for double variables. So we’re essentially calling pow with the right values and then adding and comparing our little double values.

Double trouble

But wait a second. We’re trying to solve a completely integral problem! Why does our generated program use those registers at all?

This should raise a red flag. And indeed, if we remember pow’s signature, it should be clear that it’s not the right tool: it takes a double base and exponent, which makes it suitable for terms like $15.12151^{3.1415926}$. That’s total overkill for your problem.

Using proper functions

So let’s use another pow version instead:

long int pow4(long int x){
    return x * x * x * x;
}

Ideally, your compiler turns that into something like this:

movq    %rdi, %rax
imulq   %rdi, %rax
imulq   %rax, %rax
ret

but if your compiler doesn’t recognize this potential (micro) optimization, you can use

long int pow4(long int x){
    const long int t = x * x;
    return t * t;
}

instead.

We also need to change prop:

int prop(long int A, long int B, long int C, long int D) {
  return (pow4(A) + pow4(B) + pow4(C) == pow4(D));
}

Alright. Now, before I show the times of the new program, let’s check the output of your old one:

a = 1, b = 1, c = 1000, time = 114.156000s

That’s when I hit ^C. How does the one using pow4 hold up?

a = 1, b = 1, c = 1000, time = 0.296000s
a = 1, b = 1, c = 2000, time = 0.578000s
a = 1, b = 1, c = 3000, time = 0.859000s
a = 1, b = 1, c = 4000, time = 1.140000s
a = 1, b = 1, c = 5000, time = 1.421000s
a = 1, b = 1, c = 6000, time = 1.703000s
a = 1, b = 1, c = 7000, time = 1.984000s
a = 1, b = 1, c = 8000, time = 2.265000s
a = 1, b = 1, c = 9000, time = 2.546000s
a = 1, b = 1, c = 10000, time = 2.828000s
a = 1, b = 1, c = 11000, time = 3.109000s
a = 1, b = 1, c = 12000, time = 3.390000s
a = 1, b = 1, c = 13000, time = 3.687000s
a = 1, b = 1, c = 14000, time = 3.968000s
a = 1, b = 1, c = 15000, time = 4.250000s
a = 1, b = 1, c = 16000, time = 4.531000s

That’s roughly 0.25% of your original time, or a speedup of nearly 400x.

However, this comes at a cost: pow4(500000) is too large for an int64_t, since $\log_2(500000^4) \approx 76$. The greatest number you could check with a uint64_t is $65535 = 2^{16}-1$, which shouldn’t be very surprising. As the standard does not provide int128_t or similar, you should make sure that your numbers don’t exceed those bounds.

You can either write your own large integer logic for this, or use GMP.
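If your compiler is GCC or Clang, there is a middle ground before reaching for GMP: the non-standard 128-bit integer extension. The sketch below assumes that extension (`unsigned __int128` is not ISO C), and the hand-rolled printing is needed because printf has no conversion specifier for 128-bit integers:

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: unsigned __int128 is a GCC/Clang extension, not ISO C. */
typedef unsigned __int128 u128;

/* 500000^4 needs about 76 bits, so it overflows uint64_t
   but fits comfortably in 128 bits. */
static u128 pow4_128(uint64_t x) {
    u128 t = (u128)x * x;   /* x^2: exact for any 64-bit x */
    return t * t;           /* x^4 */
}

/* printf cannot format __int128, so convert the digits by hand. */
static void print_u128(u128 v) {
    char buf[40];
    char *p = buf + 39;
    *p = '\0';
    do { *--p = (char)('0' + (int)(v % 10)); v /= 10; } while (v);
    puts(p);
}
```

With this, print_u128(pow4_128(500000)) prints a value no 64-bit type can hold, and the equality test in prop stays exact.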

Proper bounds and parameter estimation

Next up, you can raise the lower bounds of b and c, so that $a \le b \le c$. And for d: given a, b and c, there is at most one d that works, and we can search for it directly with binary search.

The binary search makes a $\mathcal{O}(n^3 \log n)$ algorithm from your current $\mathcal{O}(n^4)$ one, which should provide a lot more speed than the previous speedup.

Even better, using the appropriate bounds for a, b and c, we can bound d by

$$d^4 = a^4 + b^4 + c^4 \le c^4 + c^4 + c^4 = 3c^4$$

and therefore get

$$c \le d \le \sqrt[4]{3}\,c.$$

With the proper binary search algorithm, you can finish the first a=1, b=1 case quickly:

…
a = 1, b = 1, c = 481000, time = 0.031000s
a = 1, b = 1, c = 482000, time = 0.031000s
a = 1, b = 1, c = 483000, time = 0.031000s
a = 1, b = 1, c = 484000, time = 0.031000s
a = 1, b = 1, c = 485000, time = 0.031000s
a = 1, b = 1, c = 486000, time = 0.031000s
a = 1, b = 1, c = 487000, time = 0.031000s
a = 1, b = 1, c = 488000, time = 0.031000s
a = 1, b = 1, c = 489000, time = 0.031000s
a = 1, b = 1, c = 490000, time = 0.031000s
a = 1, b = 1, c = 491000, time = 0.031000s
a = 1, b = 1, c = 492000, time = 0.031000s
a = 1, b = 1, c = 493000, time = 0.031000s
a = 1, b = 1, c = 494000, time = 0.031000s
a = 1, b = 1, c = 495000, time = 0.031000s
a = 1, b = 1, c = 496000, time = 0.031000s
a = 1, b = 1, c = 497000, time = 0.031000s
a = 1, b = 1, c = 498000, time = 0.031000s
a = 1, b = 1, c = 499000, time = 0.031000s
a = 1, b = 1, time = 0.031000s

Which brings us back into the realm of your lifetime.

Exercise

Write a function that, given a, b and c, checks whether there exists a d such that your property holds. It should return -1 if no such d exists, and d otherwise.

Use that function in your code. Make sure that you need roughly $\log d_{\text{max}}$ iterations in that function.
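For reference, here is one way that exercise could be sketched: a binary search over $[c, 2c]$, which safely contains the $\sqrt[4]{3}\,c$ upper bound derived above. The 128-bit type is a GCC/Clang extension, assumed here so the comparison cannot overflow:

```c
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* Exact x^4 in 128 bits. */
static u128 pow4(uint64_t x) {
    u128 t = (u128)x * x;
    return t * t;
}

/* Binary search for d with a^4 + b^4 + c^4 == d^4 in [c, 2c].
   (2c > 3^(1/4) * c, so the upper bound is safe.)
   Returns d if it exists, -1 otherwise; about log d_max steps. */
static int64_t find_d(uint64_t a, uint64_t b, uint64_t c) {
    u128 target = pow4(a) + pow4(b) + pow4(c);
    uint64_t lo = c, hi = 2 * c;
    while (lo <= hi) {
        uint64_t mid = lo + (hi - lo) / 2;
        u128 m4 = pow4(mid);
        if (m4 == target) return (int64_t)mid;
        if (m4 < target)  lo = mid + 1;
        else              hi = mid - 1;
    }
    return -1;
}
```

As a sanity check, find_d(95800, 217519, 414560) returns 422481, the known counterexample, after only a handful of iterations.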

Important remark about integer sizes

Keep in mind that long int is usually a 64-bit integer, which means that the largest value you can store is $2^{63}-1$. Integer types with more bits have greater bounds, but are platform-specific. Also, multiplication can be a tad slower, since multiplying 128-bit numbers isn’t as easy as multiplying 64-bit numbers.

See the next section for how to cut the number of multiplications down.

An actual review

Our pow4 is now essentially two multiplications. However, we’re still calling pow4 too often. After all, we don’t need to recalculate $a^4$ in every iteration; the compiler happily does so anyway, since it doesn’t optimize aggressively enough.

Which brings us to the actual review: your code is cleanly written, easy to read and to understand. Unfortunately, well-written, modular code often doesn’t squeeze the last bit (heh) out of your hardware, unless your compiler/runtime is very smart (and thus often expensive).

So let’s get back to the drawing board for a final review of your code:

Includes

#include <stdio.h>
#include <time.h>
#include <math.h>

I would sort them by name, but that’s fine. You don’t include anything that’s not necessary, nor did you forget something (and got saved by a non-standard compliant compiler).

Declaration

int main() {
    long int a, b, c, d;
    clock_t t;
    t = clock();

Depending on whether you write ANSI C or C99, I would defer declaration of variables as long as possible. As it stands, for example, it’s easy to accidentally change c to some bogus value, or to forget a { and accidentally check the prop only after the loops:

for (a = 1; a < 100000; a++) 
    for (b = 1; b < 300000; b++) 
        for (c = 1; c < 500000; c++) 
            for (d = 1; d < 500000; d++) 
                printf("inner loop");
                if (prop(a,b,c,d))
                    printf("FOUND IT!\na = %ld\nb = %ld\nc = %ld\nd = %ld\n", a, b, c, d);

Whoops. The if runs only once, after all the loops have finished, and you don’t get a warning (in older compiler versions; newer ones warn about misleading indentation). If you declare your variables later (e.g. C99-style), errors like that cannot happen (although it introduces possible shadowing):

for (long int a = 1; a < 100000; a++) 
    for (long int b = 1; b < 300000; b++) 
        for (long int c = 1; c < 500000; c++) 
            for (long int d = 1; d < 500000; d++) 
                printf("inner loop");
                if (prop(a,b,c,d))
                    printf("FOUND IT!\na = %ld\nb = %ld\nc = %ld\nd = %ld\n", a, b, c, d);

This will now lead to a compiler error, since a, b and so on are out of scope. Either way, that depends on the language standard you want to use. Some people prefer one way, others the other one. Choose yours.

Types

Given that all values should be strictly greater than zero, long int is not the appropriate type, as it can be negative. We should accommodate that. However, instead of using long unsigned int throughout our code, let’s use a type synonym in case we want to change it later to a type with a greater range:

typedef long unsigned int Number;

You can probably come up with a better name.

Cache results (by hand)

One thing that strikes me most is that you recalculate $a^4$ and so on every time. We can easily fix this with a few more variables (using your declaration style):

int main() {
    long int a, b, c, d;
    long int a4, b4, c4, d4; // new variables
    clock_t t;
    t = clock();

    for (a = 1; a < 100000; a++) {
        a4 = pow4(a);                             // remember
        for (b = a; b < 300000; b++) {
            b4 = pow4(b);                         // remember
            for (c = b; c < 500000; c++) {
                c4 = pow4(c);                     // the fourth power
                for (d = c; d < 500000; d++) {
                    d4 = pow4(d);                 // of this member
                    if (a4 + b4 + c4 == d4)
                        printf("FOUND IT!\na = %ld\nb = %ld\nc = %ld\nd = %ld\n", a, b, c, d);
…

Remember how I said that nicely written, modular code often isn’t optimal? This is one of those unfortunate examples where you have to help the compiler (unless you know exactly which optimization flags to use, or your compiler is unusually aggressive). The prop is gone; the calls to pow4 are now in your loop.

But the compiler cannot make a mistake here anymore: it’s now very clear that a4 doesn’t need to be recalculated 300000*500000*500000 times.

That being said, we should apply the other suggestions like the type synonym and the late declaration:

typedef long unsigned int Number;

int main() {
    clock_t t;
    t = clock();

    for (Number a = 1; a < 100000; a++) {
        const Number a4 = pow4(a);                         // remember
        for (Number b = a; b < 300000; b++) {
            const Number b4 = pow4(b);                     // remember
            for (Number c = b; c < 500000; c++) {
                const Number c4 = pow4(c);                 // the fourth power
                for (Number d = c; d < 500000; d++) {
                    const Number d4 = pow4(d);             // of this member
                    if (a4 + b4 + c4 == d4)
                        printf("FOUND IT!\na = %lu\nb = %lu\nc = %lu\nd = %lu\n", a, b, c, d);
…

While const isn’t necessary here, it will make sure that we don’t change our cached results accidentally.

The time has come

Although our code is now more verbose, there is one small piece of code that repeats itself three times throughout your main:

((double)(clock() - t))/CLOCKS_PER_SEC

That’s quite hard to read, isn’t it? It’s a perfect candidate for a function:

static inline double seconds_since(clock_t t){
    return ((double)(clock() - t))/CLOCKS_PER_SEC;
}

This changes your printf from

printf("a = %ld, b = %ld, c = %ld, time = %fs\n", a, b, c, ((double)(clock() - t))/CLOCKS_PER_SEC);

to

printf("a = %ld, b = %ld, c = %ld, time = %fs\n", a, b, c, seconds_since(t));

Ah. Much easier to read. That’s what inline functions are for. Note that any sophisticated compiler should inline that function anyway, so you can also drop inline if you don’t want to require C99.

Well, let’s start with this:

for (a = 1; a < 100000; a++) {
    for (b = 1; b < 300000; b++) {
        for (c = 1; c < 500000; c++) {

let’s ignore d for now. What do you do here? You check 1, 1, 1, then 1, 1, 2, then 1, 1, 3, … up to 1, 1, 499999. Then you start over at 1, 2, 1. But you already checked 1, 1, 2, so why are you checking 1, 2, 1? You could go straight to 1, 2, 2. That doesn’t save you much for these low numbers, but believe me, when you get to big numbers it adds up.

In short: a, b, c should be nondecreasing. We can achieve that by starting b at a, and starting c at b, so b is never smaller than a, and c is never smaller than b. So immediately you can get rid of about half the work you’re doing with

for (a = 1; a < 100000; a++) {
    for (b = a; b < 300000; b++) {
        for (c = b; c < 500000; c++) {

Next, consider d.

Of course we can immediately make the same optimization we just made for a, b and c, since d will never be smaller than any of them, and c is always the largest.

Also, once d4 is larger than the sum, we can stop incrementing d because it’s only going to get bigger.

So that will save a lot of time right there. But we can do way better than that.
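Those two observations about d might be sketched like this (a sketch only; the 128-bit pow4 relies on a GCC/Clang extension so the comparison cannot overflow):

```c
#include <stdint.h>

typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* Exact x^4 in 128 bits. */
static u128 pow4(uint64_t x) { u128 t = (u128)x * x; return t * t; }

/* For fixed a <= b <= c, scan d upward from c and stop as soon as
   d^4 overshoots the target sum, since it only grows from there.
   Returns d if found, -1 otherwise. */
static int64_t scan_d(uint64_t a, uint64_t b, uint64_t c) {
    u128 target = pow4(a) + pow4(b) + pow4(c);
    for (uint64_t d = c; ; d++) {
        u128 d4 = pow4(d);
        if (d4 == target) return (int64_t)d;
        if (d4 > target)  return -1;   /* overshot: no such d */
    }
}
```

Starting d at c and breaking on overshoot keeps the inner loop short, because the sum is never far above $c^4$.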

The question you are asking is “do these four numbers have the sum property?” but the question you should be asking is “does a4 + b4 + c4 equal any fourth power?” If it does, then you can easily compute d much faster than trying all possible fourth powers. So, can you write a fast method that tells you if a particular sum is a fourth power or not?

If you know how to take a square root, what you can do is take the sum, take the square root twice, and then square the result twice. If you get back the original sum, then it was a fourth power, and if you don’t, then it wasn’t.

If you don’t know how to take a square root, you can do the following logic: we have sum; is $128^4$ equal to sum? No. Is it bigger? Yes. Then next try $64^4$. Is it equal to sum? No. Is it smaller? Yes. Then try $96^4$, and so on. Binary search for the result until you find it, or until there are no more numbers in your range to check. That does only a tiny handful of computations, compared to trying thousands upon thousands of possible fourth powers.

I’m not able to comment on vnp’s solution, but vnp was right the first time: you can brute force it in $O(n^2\log n)$ time and $O(n)$ space. You don’t need $O(n^2)$ space because you don’t have to store the whole list of $a^4+b^4$ or $d^4-c^4$ upfront. Instead you only need to be able to list the values of $a^4+b^4$ and $d^4-c^4$ in ascending order and then match up the two lists.

To list the values of $a^4+b^4$, you can maintain, for each $a$, the smallest value of $b$, call it $b_a$, such that $a^4+b_a^4 \ge k$, where $k$ is the previous value of $a^4+b^4$ that you read off. To read off the next value, you need to find the value of $a$ for which $a^4+b_a^4$ is smallest, which you can do in $O(\log n)$ time using a heap. Then you increment $b_a$ and you are ready to read off the next value of $a^4+b^4$.

This is an example C++11 program (using a GNU extension to get 128 bit integers) that finds the solution in about 7 hours on my PC. (This is just to illustrate the method – it is not very efficient in other ways.)

#include <cstdio>
#include <cmath>
#include <cstdlib>
#include <queue>
#include <vector>
using std::priority_queue;
using std::vector;

//typedef long long int bigint;
typedef __int128 bigint;

// Look for a^4+b^4+c^4 = d^4 by comparing
// ascending list of a^4+b^4 with ascending list of d^4-c^4

vector<bigint> list;
vector<int> diffptr,sumptr;

bool sumcmp(int a0,int a1){
  int b0=sumptr[a0];
  int b1=sumptr[a1];
  return list[a0]+list[b0]>list[a1]+list[b1];
}

bool diffcmp(int c0,int c1){
  int d0=diffptr[c0];
  int d1=diffptr[c1];
  return list[d0]-list[c0]>list[d1]-list[c1];
}

int main(int ac,char**av){
  int a,c,i,n;

  n=500000;
  printf("Using n = %d\n",n);
  if(4*log(n)/log(2)+2>sizeof(bigint)*8){fprintf(stderr,"Error: %lu-bit type not large enough\n",sizeof(bigint)*8);exit(1);}

  bigint sumval,diffval;

  list.resize(n);
  diffptr.resize(n);
  sumptr.resize(n);

  priority_queue<int,vector<int>,decltype(&sumcmp)> sumnext(sumcmp);
  priority_queue<int,vector<int>,decltype(&diffcmp)> diffnext(diffcmp);

  for(i=0;i<n;i++)list[i]=bigint(i)*i*i*i;
  for(a=1;a<n;a++){sumptr[a]=a;sumnext.push(a);}
  for(c=0;c<n-1;c++){diffptr[c]=c+1;diffnext.push(c);}

  while(!sumnext.empty()&&!diffnext.empty()){
    a=sumnext.top();sumval=list[sumptr[a]]+list[a];
    c=diffnext.top();diffval=list[diffptr[c]]-list[c];
    if(sumval==diffval){printf("BINGO %d^4+%d^4+%d^4=%d^4\n",a,sumptr[a],c,diffptr[c]);fflush(stdout);}
    if(sumval<=diffval){
      sumnext.pop();
      sumptr[a]++;if(sumptr[a]<n)sumnext.push(a);
    }else{
      diffnext.pop();
      diffptr[c]++;if(diffptr[c]<n)diffnext.push(c);
    }
  }

}
