Home » Load fifty million integers as quickly as possible in Java

Load fifty million integers as quickly as possible in Java

Solutons:


Your code is a bit messy, and having the file name hard-coded in to your function is not great. Also, the “Hungarian Notation” (using things like arg to prefix your function parameters – note, it’s a parameter, not an argument, by the way)… is not conventional.

On the other hand, I understand this is an exercise to test performance…. and it’s not about the resusability (yet) of the code.

Still, I know from experience that reusability will happen, and your code will need some changes.

I also know, from experience, that memory-mapped IO is much faster in Java than other IO forms, so I figured I would have a kick at this problem.

I took my prime generator I had reveiewed here on Code Review previously ( Thread Safe Prime Generator ) and I generated the first 50,000,000 primes in to a file on my own system, then I tested your code against it (and after changing the file name to a parameter, it worked).

Out of interest, generating and writing to file of the primes took about 5 minutes on my computer – 311 seconds – much of which was probably IO time too.

Then, I converted your function to be:

public static int[] loadByQuantity(int argNumPrimes, String fname) {
    int numPrimes = Math.min(argNumPrimes, 50_000_000);
    int[] primes = new int[numPrimes];
    try (DataInputStream in = new DataInputStream(new FileInputStream(fname))) {
        for (int i = 0; i < primes.length; ++i) {
            primes[i] = in.readInt();
        }
    } catch (IOException ex) {
        throw new RuntimeException(ex);
    }
    return primes;
}

The only difference is the file-name is now a parameter.

Then I wrote a little test system to time how long your code took for me:

public static void time(Supplier<int[]> task, String name) {
    long nanos = System.nanoTime();
    int[] primes = task.get();
    System.out.printf("Task %s took %.3fmsn", name, (System.nanoTime() - nanos) / 1000000.0);
    System.out.printf("Count %dnFirst %dnLast %dn", primes.length, primes[0], primes[primes.length - 1]);
    System.out.println("----");
}

public static void main(String[] args) {
    int count = 50000000;
    time(() -> loadByQuantity(count, "primes.dat"), "OP");
}

I then implemented the same functionality using an NIO mechanism specifically using memory-mapped IO: http://docs.oracle.com/javase/8/docs/api/java/nio/MappedByteBuffer.html

This type of IO is designed to significantly reduce the amount of memory copies are made of the input file. It also means that Java reads the content out of the OS, without first copying it in to Java’s memory space.

For the types of sequential IO this problem has, I expected the performance improvements to be significant.

Further, the MappedByteBuffer has methods getInt() which also decodes 4 bytes in to an int for you: http://docs.oracle.com/javase/8/docs/api/java/nio/ByteBuffer.html#getInt–

Here’s the code I came up with. Note, I have used the same exception handling that you use, and also the same array initialization. I believe you should be throwing exceptions from these methods, and not just wrapping them in runtime exceptions:

public static int[] loadByNIO(int argNumPrimes, String fname) {
    int numPrimes = Math.min(argNumPrimes, 50_000_000);
    int[] primes = new int[numPrimes];
    try (FileChannel fc = FileChannel.open(Paths.get(fname))) {
        MappedByteBuffer mbb = fc.map(MapMode.READ_ONLY, 0, numPrimes * 4l);
        for (int i = 0; i < numPrimes; i++) {
            primes[i] = mbb.getInt();
        }
    } catch (IOException ex) {
        throw new RuntimeException(ex);
    }
    return primes;
}

I then ran the process a couple of times in the main method, and compared the results:

public static void main(String[] args) {
    int count = 50_000_000;
    time(() -> loadByQuantity(count, "primes.dat"), "OP");
    time(() -> loadByNIO(count, "primes.dat"), "NIO");
    time(() -> loadByQuantity(count, "primes.dat"), "OP");
    time(() -> loadByNIO(count, "primes.dat"), "NIO");
}

The NIO operation is, as expected, significantly faster… 1500 times faster

Task OP took 214163.250ms
Count 50000000
First 2
Last 982451653
----
Task NIO took 141.511ms
Count 50000000
First 2
Last 982451653
----
Task OP took 214633.128ms
Count 50000000
First 2
Last 982451653
----
Task NIO took 159.571ms
Count 50000000
First 2
Last 982451653
----

I admit, it is far faster than I was expecting too…. but, the results are correct.

Now, I encourage you to experiment with how to put this concept behind a Java stream, instead of populating the full 50,000,000 in to an array. By streaming the results you can start your “real” computation sooner, without the latency of having to read all the primes from the file. For example, consider what you could do with logic like:

primes.stream().....

where primes was reading the values on-demand from the file.

Here’s the full code I have been running:

package prperf;

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.file.Paths;
import java.util.function.Supplier;

public class PrimeReader {

    public static int[] loadByQuantity(int argNumPrimes, String fname) {
        int numPrimes = Math.min(argNumPrimes, 50_000_000);
        int[] primes = new int[numPrimes];
        try (DataInputStream in = new DataInputStream(new FileInputStream(fname))) {
            for (int i = 0; i < primes.length; ++i) {
                primes[i] = in.readInt();
            }
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
        return primes;
    }

    public static int[] loadByNIO(int argNumPrimes, String fname) {
        int numPrimes = Math.min(argNumPrimes, 50_000_000);
        int[] primes = new int[numPrimes];
        try (FileChannel fc = FileChannel.open(Paths.get(fname))) {
            MappedByteBuffer mbb = fc.map(MapMode.READ_ONLY, 0, numPrimes * 4l);
            for (int i = 0; i < numPrimes; i++) {
                primes[i] = mbb.getInt();
            }
        } catch (IOException ex) {
            throw new RuntimeException(ex);
        }
        return primes;
    }

    public static void time(Supplier<int[]> task, String name) {
        long nanos = System.nanoTime();
        int[] primes = task.get();
        System.out.printf("Task %s took %.3fmsn", name, (System.nanoTime() - nanos) / 1000000.0);
        System.out.printf("Count %dnFirst %dnLast %dn", primes.length, primes[0], primes[primes.length - 1]);
        System.out.println("----");
    }

    public static void main(String[] args) {
        int count = 50_000_000;
        time(() -> loadByQuantity(count, "primes.dat"), "OP");
        time(() -> loadByNIO(count, "primes.dat"), "NIO");
        time(() -> loadByQuantity(count, "primes.dat"), "OP");
        time(() -> loadByNIO(count, "primes.dat"), "NIO");
    }

}

Using BufferedInputStream would be a quick fix with lesser modification of current code.

The reason why readX methods of
DataInputStream/FileInputStream are slow is that they ask for IO every time. BufferedInputStream simply loads a chunk of file to reduce the number of IO operations needed. Another solution is to use read() method to load the (whole or part of ) file into an byte array and then retrieve those integers from byte array.

Consider memory mapped files

Disclaimer: This is also for me the first time I’m playing with memory mapped files in Java. The way I’m doing it might be suboptimal or even worse.

I think that Java’s DataInputStream has high overhead for portability and error handling that is probably not needed in your situation. The fastest way to achieve your problem that I can think of is to simply mmap() the file into memory, assuming it contains a packed array of 32 bit integers in the host’s native byte order.

This is the solution I’ve come up with.

static int[] load(final File file, final int n) throws IOException {
    try (final FileChannel channel = FileChannel.open(
            file.toPath(),
            StandardOpenOption.READ
    )) {
        final MappedByteBuffer mapping = channel.map(
            FileChannel.MapMode.READ_ONLY,
            0,                 // offset
            n * Integer.BYTES  // length
        );
        mapping.order(ByteOrder.nativeOrder());
        final IntBuffer integers = mapping.asIntBuffer();
        final int[] array = new int[n];
        integers.get(array);
        return array;
    }
}

In order to write the data, I’ve used the following function.

static void store(final File file, final int[] array) throws IOException {
    try (final FileChannel channel = FileChannel.open(
            file.toPath(),
            StandardOpenOption.READ,
            StandardOpenOption.WRITE,
            StandardOpenOption.CREATE,
            StandardOpenOption.TRUNCATE_EXISTING
    )) {
        final MappedByteBuffer mapping = channel.map(
            FileChannel.MapMode.READ_WRITE,
            0,                            // offset
            array.length * Integer.BYTES  // length
        );
        mapping.order(ByteOrder.nativeOrder());
        final IntBuffer integers = mapping.asIntBuffer();
        integers.put(array);
    }
}

I’ve benchmarked the functions using an array of 50M random integers. After loading the array, I’ve also computed its sum and printed it out to make sure the load operation is not optimized away. The results were obtained using the time shell facility and therefore also include the generation of the random data, JVM startup and any other overhead. The results are still very clear.

implementation     store      load

data stream         5:54      1:31
memory mapping      0:01    < 0:01

Separate concerns

I’m confident that I’m not telling you anything new here but for record’s sake: Decouple the logic for reading an array of integers from a file from the logic that interprets these integers (as primes). Also don’t hardcode the file name or the size of the array into the function that loads them.

Don’t camouflage exceptions

Whether or not Java’s “handle it or declare it” philosophy regarding exceptions was a good idea is open to debate. However, given that it is as it is, I don’t think that working against the system is helpful. We gotta live with what we have now.

Beware the rules

I’ve only looked briefly into Project Euler and don’t know their rules but I wouldn’t be surprised if loading data from external resources would be considered cheating. Of course, if you’re only doing this for your own amusement, you are free to set your own rules.

Related Solutions

Calculate the sum with minimum usage of numbers

Here's a hint: 23 : 11 + 11+ 1 ( 3 magic numbers) 120: 110+ 10 (2 magic numbers) The highest digit in the target number is the answer, since you need exactly k magic numbers (all having 1 in the relevant position) in order for the sum to contain the digit k. So...

Why not drop the “auto” keyword? [duplicate]

Your proposal would be rejected on the basis of backward compatibility alone. But let's say for the sake of argument that the standards committee like your idea. You don't take into account the numerous ways you can initialize a variable widget w; // (a) widget...

Recursive to iterative using a systematic method [closed]

So, to restate the question. We have a function f, in our case fac. def fac(n): if n==0: return 1 else: return n*fac(n-1) It is implemented recursively. We want to implement a function facOpt that does the same thing but iteratively. fac is written almost in...

How can I match values in one file to ranges from another?

if the data file sizes are not huge, there is a simpler way $ join input1 input2 | awk '$5<$4 && $3<$5 {print $2, $5-$3+1}' B100002 32 B100043 15 B123465 3 This Perl code seems to solve your problem It is a common idiom: to load the entire...

Javascript difference between “=” and “===” [duplicate]

You need to use == or === for equality checking. = is the assignment operator. You can read about assignment operators here on MDN. As a quick reference as you are learning JS: = assignment operator == equal to === equal value and equal type != not equal !==...

Compiler complains about misplaced else [closed]

Your compiler complains about an misplaced else because, well, there is an else without a preceding if: // ... for (j=1; j<n-i; j++) { if(a[j]<=a[j+1]) { // ... } // END OF IF } // END OF FOR else { continue; } // ... The else in your code does not follow...

Bootstrap – custom alerts with progress bar

/* !important are just used to overide the bootstrap css in the snippet */ .alertContainer { border-radius: 0 !important; border-width: 0 !important; padding: 0 !important; height: auto !important; position: absolute !important; bottom: 15px !important; left:...

How to Garbage Collect an external Javascript load?

Yes, s.onload = null is useful and will garbage collect! As of 2019, it is not possible to explicitly or programmatically trigger garbage collection in JavaScript. That means it collects when it wants. Although there is cases where setting to null may do a GC...

Math programming with python

At first, what you are looking for is the modulo operator and the function math.floor() Modulo from wikipedia: In computing, the modulo operation finds the remainder after division of one number by another (sometimes called modulus). for example: 12%12=0...

Android slide over letters to create a word [closed]

Here some advice you can use: First for each cell you can create an object that represents the state of that cell: class Cell { char mChar; int row,column; boolean isSelected; } then you can create a 2D array of your cells Cell[][] mTable = ... For views you...

Sum two integers in Java

You reused the x and y variable names (hence the variable x is already defined in method main error), and forgot to assign the ints read from the Scanner to the x and y variables. Besides, there's no need to create two Scanner objects. public static void...

Extend three classes that implements an interface in Java

Using this simplified implementation of the library, using method() instead of M(): interface IFC { void method(); } class A implements IFC { public void method() { System.out.println("method in A"); }; } As akuzminykh mentions in their comment You'd write a...

How to set the stream content in PHPExcel? [closed]

Okey, First thing first PHPExcel_Worksheet_MemoryDrawing() can't solve your problem if you insist to use stream content and pass that to your worksheet your PDF will not render your image. But you can use `PHPExcel_Worksheet_Drawing()' if you want to render...

How to remove all files from a directory?

Linux does not use extensions. It is up to the creator of the file to decide whether the name should have an extension. Linux looks at the first few bytes to figure out what kind of file it is dealing with. To remove all non-hidden files* in a directory use: rm...