If you want to uniformly sample a handful of elements from a very large stream of data you probably don’t want to read it all into memory first. It would be ideal if you could sample while streaming the data. For this, reservoir sampling provides one answer. The algorithm is summarized thusly
- Let be the number of samples required (without replacement), the array to hold the samples, and the stream of elements.
- Fill with the first elements from
- for each , randomly sample , and set if .
- The sample set is the final state of
Can we show that this algorithm does indeed uniformly sample the elements (without replacement), that is, the probability of sampling any subset of elements is ?
Suppose , , and . The probability of picking is the probability of picking it at step when and then making sure in the later steps it doesn’t get replaced by another element
Thus, the sampling is uniform when . In general, for sampling elements the probability due to step is
That is the probability that each element is put into its own slot. The probability that doesn’t get overwritten by the upcoming elements (except for which have already been picked by step ) is
Thus, the probability of selecting according to the algorithm is