
Generate random numbers without repetitions

Solutions:


Updated answer in response to bounty: See Is that your final answer? at the end, and other changes – basically answer is significantly rewritten.


To break your problem down into requirements:

  • you need a set of random numbers
  • the numbers need to be unique
  • the order of the returned numbers needs to be random

Your current code indicates that the range of random numbers is specified by Random.Next(), which returns values in the [0 .. Int32.MaxValue) range (note, it excludes Int32.MaxValue). This is significant for the purpose of this question, because other answers have assumed that the range is configurable, and ‘small’.
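For reference, a minimal check of that contract (a sketch; the class and variable names are mine):

using System;

public class RangeCheck
{
    public static void Main()
    {
        Random random = new Random();
        int n = random.Next();                            // 0 <= n < Int32.MaxValue
        Console.WriteLine(n >= 0 && n < Int32.MaxValue);  // always True
    }
}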

If the range should be configurable, then the recommended algorithm would potentially be much larger.

Based on those assumptions, let’s do a code review…

Code Style

do … while

The most glaring problem here is the un-braced do-while loop. You already knew it, but this code is ugly:

    do number = random.Next();
    while (randomNumbers.Contains(number));

It really should be braced:

    do
    {
        number = random.Next();
    } while (randomNumbers.Contains(number));

This makes the statement clear, and significantly reduces confusion. Always use braces for 1-liners.

List Construction

The List class allows an initial capacity to be specified. Since the final size is known to be count, it makes sense to initialize the list with that capacity:

List<int> randomNumbers = new List<int>(count);

Current Algorithm

This is where the most interesting observations can be made. Let’s analyze your current algorithm:

  • Create a container for the results
  • repeat until you have selected N values:
    • Select a random value
    • check if it has been previously selected
    • if it is ‘new’, then add it to the container

This algorithm will produce random values, in a random order, and with good random characteristics (no skews, biases, gaps, etc.).

In other words, your results are good.

The problem is with performance….

There are two performance concerns here, one small, the other large:

  1. the do-while loop to avoid collisions
  2. the List container

do-while performance

The do-while has a very low impact on performance… almost negligible. This is hotly debated, but the reality is that you would need a very, very large count before this becomes a problem. The reasoning is as follows:

Collisions happen when the random value was previously selected. For the specified range of [0 .. Int32.MaxValue), you would need a very large count before collisions actually happened. For example, count would have to be about 65,000 before there was better than a 50% chance that there was even a single collision.

In a general sense, given a range of $N$ values, select $M$ numbers. If $M < \sqrt{N}$ then the probability of a single collision is < 50%. Since the range is very large, the probability is small.

Obviously, if the range was small, then the probabilities would be significantly affected. But the range is fixed at Int32.MaxValue, so that’s OK.

Additionally, if the count was large, then the probabilities would also be affected. How large would "very large" be? Well, you would be running into very large arrays before you ran into significant problems… I would venture you are hitting close to $\frac{Int32.MaxValue}{2}$ before you run into a significant issue with performance.
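To make those probabilities concrete, here is a minimal sketch (my own, not from the original code) that computes the exact birthday-style collision probability for a few values of count over the [0 .. Int32.MaxValue) range:

using System;

public class CollisionEstimate
{
    public static void Main()
    {
        // P(no collision among M draws from N values) = (1 - 0/N)(1 - 1/N)...(1 - (M-1)/N)
        double n = Int32.MaxValue;
        foreach (int m in new[] { 1000, 10000, 65000, 100000 })
        {
            double pNone = 1.0;
            for (int i = 0; i < m; i++)
            {
                pNone *= 1.0 - i / n;
            }
            Console.WriteLine("count = " + m + ": P(at least one collision) = " + (1 - pNone));
        }
    }
}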

List performance

This is without doubt your largest concern. You use the randomNumbers.Contains(number) call to determine whether a value was previously selected. This requires a scan of all previously selected values. As mentioned, the check will almost always return false, and will thus have to scan the entire list.

As the count value increases, each Contains call takes longer, and the total time grows at a quadratic rate: $O(n^2)$, where $n$ is count.

This performance problem will become critical much sooner than the random-collision problem.
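A rough micro-benchmark makes the difference visible (a sketch, not a rigorous benchmark; the sizes and names are mine, and timings will vary by machine):

using System;
using System.Collections.Generic;
using System.Diagnostics;

public class ContainsCost
{
    public static void Main()
    {
        const int size = 100000;
        List<int> list = new List<int>();
        HashSet<int> set = new HashSet<int>();
        for (int i = 0; i < size; i++) { list.Add(i); set.Add(i); }

        // A miss forces List.Contains to scan all `size` elements.
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000; i++) { list.Contains(-1); }
        Console.WriteLine("1000 misses on List:    " + sw.ElapsedMilliseconds + " ms");

        // HashSet.Contains is a constant-time hash lookup.
        sw.Restart();
        for (int i = 0; i < 1000; i++) { set.Contains(-1); }
        Console.WriteLine("1000 misses on HashSet: " + sw.ElapsedMilliseconds + " ms");
    }
}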

Putting it together

The problem you have in your code is that you are trying to do too much at once: you are using a List because that is your return value, when a HashSet would be better. If you break the problem down into stages, you will be able to solve things more elegantly.

If you add a duplicate value to a HashSet, it does not grow, and the operation performance is not dependent on the amount of data in the HashSet (it is $O(1)$). You can use the Count of the HashSet to manage the data uniqueness.
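In other words (a tiny sketch of the Add contract; the names are mine):

using System;
using System.Collections.Generic;

public class AddContract
{
    public static void Main()
    {
        HashSet<int> seen = new HashSet<int>();
        Console.WriteLine(seen.Add(42));   // True  - 42 was new
        Console.WriteLine(seen.Add(42));   // False - duplicate; the set is unchanged
        Console.WriteLine(seen.Count);     // 1
    }
}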

Once you have a clean set of unique random numbers, you can dump them into a List, then shuffle the list using an efficient shuffle.

Combining these data structures, in the right way, leads to an overall $O(n)$ solution, which will scale fairly well.

Here is some code, which works in Ideone too. Note, my C# is weak, so I have tried to make the logic clear.

using System;
using System.Collections.Generic;

public class Test
{
    static Random random = new Random();

    public static List<int> GenerateRandom(int count)
    {
        // generate count random values.
        HashSet<int> candidates = new HashSet<int>();
        while (candidates.Count < count)
        {
            // May strike a duplicate.
            candidates.Add(random.Next());
        }

        // load them into a list.
        List<int> result = new List<int>();
        result.AddRange(candidates);

        // shuffle the results:
        int i = result.Count;  
        while (i > 1)
        {  
            i--;  
            int k = random.Next(i + 1);  
            int value = result[k];  
            result[k] = result[i];  
            result[i] = value;  
        }  
        return result;
    }
    public static void Main()
    {
        List<int> vals = GenerateRandom(10);
        Console.WriteLine("Result: " + vals.Count);
        vals.ForEach(Console.WriteLine);
    }
}

The above code is my initial recommendation, and it will work well, and scale well for any reasonable number of values to return.

Second Alternate Algorithm

The problem with the above algorithm is threefold:

  1. When count is very large, the probability of collision is increased, and performance may be affected
  2. Data will need to be in both the HashSet and the List at some point, so the space usage is doubled.
  3. The shuffle at the end is needed to keep the data in a random order (HashSet does not keep the data in any specific order, and the hashing algorithm will cause the order to become biased and skewed).

These are only performance issues when count is very large, and only the collisions at large count impact the scalability of the solution (at large count it is no longer quite $O(n)$, and it becomes progressively worse as count approaches Int32.MaxValue). Note that in real life this will likely never happen… memory will become a problem before performance does.

@JerryCoffin pointed to an alternate algorithm from Bob Floyd, where a trick is played to ensure that collisions never happen. This solves the problem of scalability at very large counts. It does not solve the need for both a HashSet and a List, and it does not solve the need for the shuffle.

The algorithm as presented:

initialize set S to empty
for J := N-M + 1 to N do
    T := RandInt(1, J)
    if T is not in S then
        insert T in S
    else
        insert J in S

assumes that RandInt(1, J) returns values inclusive of J.

To understand the above algorithm, you need to realize that you choose a random value from a range that is smaller than the full range, and then, after each value, you extend that range to include one more value. In the event of a collision, you can safely insert the current maximum, because it was never possible to include it before. For example, with $N = 5$ and $M = 2$: the first draw comes from [1 .. 4]; if the second draw from [1 .. 5] collides with it, you insert 5, which the first draw could never have produced.

The chance of a collision grows at the same rate that the pool of unchosen values shrinks, so the probability of any one number being in the result is neither skewed nor biased.

Is this almost a final answer? No

So, using the above solution, in C#, would look like (in Ideone) (note, code is now different to Ideone):

public static List<int> GenerateRandom(int count)
{
    // generate count random values.
    HashSet<int> candidates = new HashSet<int>();
    for (Int32 top = Int32.MaxValue - count; top < Int32.MaxValue; top++)
    {
        // May strike a duplicate.
        if (!candidates.Add(random.Next(top + 1)))
        {
            candidates.Add(top);
        }
    }

    // load them into a list.
    List<int> result = candidates.ToList();

    // shuffle the results:
    int i = result.Count;  
    while (i > 1)
    {  
        i--;  
        int k = random.Next(i + 1);  
        int value = result[k];  
        result[k] = result[i];  
        result[i] = value;  
    }  
    return result;
}    

Note that you still need to shuffle the results, so that the HashSet ordering issue is resolved. Also note the loop condition: it has to be top < Int32.MaxValue rather than an inclusive bound, because incrementing top past Int32.MaxValue would overflow, and things get messy.

Can you avoid the shuffle?

So, this solves the need for the collision loop, but what about the shuffle? Can that be solved by maintaining the HashSet and the List at the same time? No! Consider this function (in Ideone too):

public static List<int> GenerateRandom(int count)
{
    List<int> result = new List<int>(count);

    // generate count random values.
    HashSet<int> candidates = new HashSet<int>();
    for (Int32 top = Int32.MaxValue - count; top < Int32.MaxValue; top++)
    {
        // May strike a duplicate.
        int value = random.Next(top + 1);
        if (candidates.Add(value))
        {
            result.Add(value);
        }
        else
        {
            result.Add(top);
            candidates.Add(top);
        }
    }

    return result;
}

The above function will never allow the first value in the result to be any of the values from max - count up to max (there can never be a collision on the first value, and the range is not yet complete at that point), and this is thus a broken random generator.
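To see that skew empirically, here is a minimal sketch (the deliberately small range and all names are mine) that shrinks the unshuffled variant down to a range of [0 .. 10) with a count of 5; the values 6 through 9 never show up in the first position:

using System;
using System.Collections.Generic;

public class BiasDemo
{
    static Random random = new Random();

    // The unshuffled variant from above, on a small range: count values from [0 .. max).
    static List<int> GenerateUnshuffled(int count, int max)
    {
        List<int> result = new List<int>(count);
        HashSet<int> candidates = new HashSet<int>();
        for (int top = max - count; top < max; top++)
        {
            int value = random.Next(top + 1);
            if (candidates.Add(value))
            {
                result.Add(value);
            }
            else
            {
                result.Add(top);
                candidates.Add(top);
            }
        }
        return result;
    }

    public static void Main()
    {
        int[] firstCounts = new int[10];
        for (int run = 0; run < 100000; run++)
        {
            firstCounts[GenerateUnshuffled(5, 10)[0]]++;
        }
        // The counts for 6 through 9 stay at zero: the first value is biased.
        for (int v = 0; v < 10; v++)
        {
            Console.WriteLine("first == " + v + ": " + firstCounts[v] + " times");
        }
    }
}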

Even with this collision-avoiding algorithm, you still need to shuffle the results in order to keep the ordering free of bias.


TL;DR

Is This Your Final Answer? Yes!

Having played with this code a lot, it is apparent that it is useful to have a range-based input as well as an Int32.MaxValue default. Messing with large ranges also leads to potential overflows in the 32-bit integer space.

Working with @mjolka, the following code would be the best of both worlds:

    static Random random = new Random();

    // Note, max is exclusive here!
    public static List<int> GenerateRandom(int count, int min, int max)
    {

        //  initialize set S to empty
        //  for J := N-M + 1 to N do
        //    T := RandInt(1, J)
        //    if T is not in S then
        //      insert T in S
        //    else
        //      insert J in S
        //
        // adapted for C# which does not have an inclusive Next(..)
        // and to make it from configurable range not just 1.

        if (max <= min || count < 0 || 
                // max - min > 0 required to avoid overflow
                (count > max - min && max - min > 0))
        {
            // need to use 64-bit to support big ranges (negative min, positive max)
            throw new ArgumentOutOfRangeException("Range " + min + " to " + max + 
                    " (" + ((Int64)max - (Int64)min) + " values), or count " + count + " is illegal");
        }

        // generate count random values.
        HashSet<int> candidates = new HashSet<int>();

        // start count values before max, and end at max
        for (int top = max - count; top < max; top++)
        {
            // May strike a duplicate.
            // Need to add +1 to make inclusive generator
            // +1 is safe even for MaxVal max value because top < max
            if (!candidates.Add(random.Next(min, top + 1))) {
                // collision, add inclusive max.
                // which could not possibly have been added before.
                candidates.Add(top);
            }
        }

        // load them in to a list, to sort
        List<int> result = candidates.ToList();

        // shuffle the results because HashSet has messed
        // with the order, and the algorithm does not produce
        // random-ordered results (e.g. max-1 will never be the first value)
        for (int i = result.Count - 1; i > 0; i--)
        {  
            int k = random.Next(i + 1);  
            int tmp = result[k];  
            result[k] = result[i];  
            result[i] = tmp;  
        }  
        return result;
    }

    public static List<int> GenerateRandom(int count)
    {
        return GenerateRandom(count, 0, Int32.MaxValue);
    }
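Usage would then look like this (a sketch, assuming the two methods above live in a class alongside a Main):

    public static void Main()
    {
        // Five unique values from [10, 20), in random order.
        List<int> sample = GenerateRandom(5, 10, 20);
        Console.WriteLine(string.Join(", ", sample));

        // Ten unique values from the full [0, Int32.MaxValue) range.
        List<int> wide = GenerateRandom(10);
        Console.WriteLine(string.Join(", ", wide));
    }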

Yes, there definitely is a better way.

You generate a collection of elements, mash it around and start pulling items out of it. A quick one-liner would be:

Enumerable.Range(0,100).OrderBy(x => Guid.NewGuid()).Take(20);

or alternatively

Enumerable.Range(0,100).OrderBy(x => random.Next()).Take(20);

This will give you 20 unique random values from the 0 to 100 range.
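A self-contained sketch of the second variant (the class and variable names are mine):

using System;
using System.Collections.Generic;
using System.Linq;

public class LinqSample
{
    static Random random = new Random();

    public static void Main()
    {
        // 20 unique values from [0, 100), in random order.
        List<int> picks = Enumerable.Range(0, 100)
                                    .OrderBy(x => random.Next())
                                    .Take(20)
                                    .ToList();
        Console.WriteLine(string.Join(", ", picks));
    }
}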

The difference with your approach is that you have a worst-case scenario of infinity: what if you are reaaaaally unlucky and end up with the same value constantly? You’ll never get your required amount of random values.

My approach on the other hand will first generate 100 values and then take a subset from it which you can criticize on its memory impact. Considering you used random.Next(), which uses half the integer range, you should in fact be wary of this since it will have a huge memory impact.

It also depends on your specific situation: if you have a very large pool of values (1,000,000) and you need 2 random values, then your approach will be much better. But when you need 999,999 values from that same pool, the better choice becomes debatable.

It will take some time to generate those last values; a lot, as you can test for yourself with this:

// LINQPad-style "C# Program" snippet: in a plain console app, wrap this in a
// class and add using System, System.Diagnostics and System.Linq.
void Main()
{
    var random = new Random();
    var times = new TimeSpan[512];
    var values = new bool[512];
    var sw = new Stopwatch();

    for(int i = 0; i < times.Length; i++) 
    {   
        sw.Restart();
        while(true) {
            int rand = random.Next();

            if(rand > 7894500 && rand < 7894512) 
            {
                int index = rand - 7894500;
                if(!values[index])
                {
                    values[index] = true;
                    break;
                }
            }
        }
        sw.Stop();
        times[i] = sw.Elapsed;
        Console.WriteLine ("Elapsed time: " + sw.Elapsed);
    }

    var orderedTime = times.OrderBy(x => x);
    for(int i = 0; i < 512; i++)
    {
        Console.WriteLine (orderedTime.ElementAt(i));
    }
}

It will keep looping randomly, 512 times, through your list of values, and consider the element found once it hits one of the (randomly picked by myself) values between 7894500 and 7894512. Afterwards that value is marked as visited, to correctly mimic reality (in an earlier version, all 512 turns had 512 values available to them). When you execute this, you'll notice it takes a lot of time to find the last values (sometimes it's fast and takes 39 ms; other times it takes over a minute). Evidently it's fast at the start and slow at the end.

Compare that to the memory overhead of my approach, which will first allocate 32 million integers, GUIDs, padding, object overhead, etc., and you're out a big chunk of memory.

You might be able to improve it by using a more “real” shuffling algorithm which doesn’t have the object and guid overhead.

Ultimately in an extreme situation where you need 32 million unique values in a randomized order out of a total population of 32 million + 1, you will either have to accept a big memory overhead or a big execution time overhead.


One last edit before this topic can be laid to rest from my part: I talked about it with @rolfl in chat and we have come to the conclusion that each of our solutions has its use, depending on what your situation is exactly. Summarized, it would come to this:

If you have a big range of values (like 0 to int.MaxValue) then my solution will eat any memory your PC has. You're looking at two collections of 2.1 billion integers each, which you then take a slice from.

In my solution you first allocate this entire population, then you sort on it (a different data structure) and then you take some of it. If this “some” is not close to 2.1 billion, you will have paid a huge allocation cost without using the data.

How does this compare to @rolfl's answer? He basically allocates the data as it is needed: if he needs 32 million values then he will only allocate those 32 million (×2) and not more. If he needs 2.1 billion, then he'll end up with a memory footprint like mine, but that's an uncommon worst-case scenario, while for me it is standard behaviour.

The main disadvantage of his approach is that when the amount of values you want approaches the total population, it becomes increasingly harder to get those last values, because there will be many collisions. Yet again, this will only be a problem when the population is actually really big.

So when should you use my solution? Like most things, there is a tradeoff between readability and performance. With a relatively small population, the readability weighs up against the performance impact, in my opinion. And when the population is relatively small and the amount of values you want is near it, my solution will have a memory impact similar to the other approach, while also avoiding many collisions.

Take your pick depending on what your situation is.

Instead of using a List<int>, you should use a HashSet<int>. A HashSet<> prohibits duplicate values, and its Add method returns a bool indicating whether the element was actually added. This way, you can change this code:

public static List<int> GetRandomNumbers(int count)
{
    List<int> randomNumbers = new List<int>(); 

    for (int i=0; i<count; i++) 
    {    
        int number;

        do number = random.Next();
        while (randomNumbers.Contains(number));

        randomNumbers.Add(number);
    }

    return randomNumbers;
}

to this:

public static IEnumerable<int> GetRandomNumbers(int count)
{
    HashSet<int> randomNumbers = new HashSet<int>();

    for (int i = 0; i < count; i++) 
        while (!randomNumbers.Add(random.Next()));

    return randomNumbers;
}
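Calling it would look like this (a sketch; it assumes the random field from the original code and a surrounding class):

static Random random = new Random();

public static void Main()
{
    foreach (int number in GetRandomNumbers(5))
    {
        Console.WriteLine(number);
    }
}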

Note that I changed the return value from List<int> to IEnumerable<int>. If you don't plan to add or remove values from your list, you should return IEnumerable<int>; if you do plan to, return ICollection<int>. You shouldn't return a List<int>, because it is an implementation detail and List<> isn't designed to be extended.
