Codementor Events

How to efficiently generate a random subset?

Published Apr 13, 2020
How to efficiently generate a random subset?

Suppose you have an array with n elements and you want a random subset with k elements. What strategies can you think of to do this as efficiently as possible?

Take a moment.

Why should you care about efficiency? It's because random number generators must make a tradeoff between the statistical tests they can pass and their performance.

Take another moment.

Hint: You can translate this to a problem of generating random numbers from 0 to n - 1 with no duplicates? These numbers can serve as the indices you use to select the members of the array.

Now what? Does your solution seem to do too much work when k is small? Or is it unfeasible when k is near n? Here is a link where it is discussed. If your solution has one of the mentioned problems, don't feel bad Jon Skeet gave 3 solutions, none of which are efficient in general.

One interesting solution is to combine his first and second approaches, i.e. keep picking random numbers and putting them in a hash set or binary tree when k is small and shuffle the array [0, ..., n - 1] when k is large. But what does small and large mean?

For the long story, you can watch a screencast where I derive the answer. The short version is we can ask the question "How big can k be before the expected number of random numbers needed will exceed n"? The reason that question is relevant is that approach 2 always requires n random numbers. It turns out to be a beautiful answer; k < (1 - 1/e)n = 0.63212055882 * n if and only if the expected number of random numbers necessary is < n.

TL;DR
Here's an implementation in Python:

from random import randrange
def shuffle(list):
    n = len(list)
    for i in range(n - 1):
        j = randrange(i, n-1) # i <= j < n
        value = list[i]
        list[i] = list[j]
        list[j] = value
def getDistinctRandomNumbers(inclusiveMin, exclusiveMax, amount):
    result = set()
    while len(result) != amount:
        result.add(randrange(inclusiveMin, exclusiveMax))
    return result
def getRandomElements(list, amount):
    length = len(list)
    amount = min(length, amount)
    if amount < 0.63212055882 * length:
        randomNumbers = getDistinctRandomNumbers(0, length, amount)
        result = [None]*amount
        i = 0
        for num in randomNumbers:
            result[i] = list[num]
            i+=1
        return result
    else:
        shuffle(list)
        return list[0:amount]
Discover and read more posts from Alex Williams
get started
post commentsBe the first to share your opinion
Show more replies