How to efficiently generate a random subset?

Published Apr 13, 2020

Suppose you have an array with n elements and you want a random subset with k elements. What strategies can you think of to do this as efficiently as possible?

Take a moment.

Why should you care about efficiency? It's because random number generators must make a tradeoff between the statistical tests they can pass and their performance.

Take another moment.

Hint: You can translate this to a problem of generating random numbers from 0 to n - 1 with no duplicates? These numbers can serve as the indices you use to select the members of the array.

Now what? Does your solution seem to do too much work when k is small? Or is it unfeasible when k is near n? Here is a link where it is discussed. If your solution has one of the mentioned problems, don't feel bad Jon Skeet gave 3 solutions, none of which are efficient in general.

One interesting solution is to combine his first and second approaches, i.e. keep picking random numbers and putting them in a hash set or binary tree when k is small and shuffle the array [0, ..., n - 1] when k is large. But what does small and large mean?

For the long story, you can watch a screencast where I derive the answer. The short version is we can ask the question "How big can k be before the expected number of random numbers needed will exceed n"? The reason that question is relevant is that approach 2 always requires n random numbers. It turns out to be a beautiful answer; k < (1 - 1/e)n = 0.63212055882 * n if and only if the expected number of random numbers necessary is < n.

TL;DR
Here's an implementation in Python:

from random import randrange
def shuffle(list):
    n = len(list)
    for i in range(n - 1):
        j = randrange(i, n-1) # i <= j < n
        value = list[i]
        list[i] = list[j]
        list[j] = value
def getDistinctRandomNumbers(inclusiveMin, exclusiveMax, amount):
    result = set()
    while len(result) != amount:
        result.add(randrange(inclusiveMin, exclusiveMax))
    return result
def getRandomElements(list, amount):
    length = len(list)
    amount = min(length, amount)
    if amount < 0.63212055882 * length:
        randomNumbers = getDistinctRandomNumbers(0, length, amount)
        result = [None]*amount
        i = 0
        for num in randomNumbers:
            result[i] = list[num]
            i+=1
        return result
    else:
        shuffle(list)
        return list[0:amount]

Algorithm Mathematics Python

Report

Enjoy this post? Give Alex Williams a like if it's helpful.

Alex Williams

Full Stack / Automation Developer - Concept to Delighted Stakeholders

Hi, I productively zoom in on details with engineers and zoom out the to 30,000 ft view with CEOs, CTOs, and CFOs. When designing a user interface, an API, a database, or a development pipeline I uncover the needs of all stakehol...

Discover and read more posts from Alex Williams

get started