Reservoir Sampling

If you want to uniformly sample a handful of elements from a very large stream of data you probably don’t want to read it all into memory first. It would be ideal if you could sample while streaming the data. For this, reservoir sampling provides one answer. The algorithm is summarized thusly

  1. Let k be the number of samples required (without replacement), R the array to hold the k samples, and S the stream of elements.
  2. Fill R with the first k elements from S
  3. for each k+1 <= i <= |S|, randomly sample j \in [1,i], and set R[j] = S[i] if j \le k.
  4. The sample set is the final state of R

Can we show that this algorithm does indeed uniformly sample the elements (without replacement), that is, the probability of sampling any subset of k elements is \frac{k!}{(n)_k} = \frac{k}{n}\frac{k-1}{n-1}\dots\frac{1}{n-k+1}?

Suppose k=1, |S|=n, and 1 \le r \le n. The probability of picking r is the probability of picking it at step 3 when i=r and then making sure in the later steps it doesn’t get replaced by another element

\displaystyle  \frac{1}{r} \times \prod_{r < s \le n} \frac{s-1}{s} \\ = \frac{1}{r} \times \frac{r}{n} \\ = \frac{1}{n}

Thus, the sampling is uniform when k=1. In general, for sampling k elements r_1 < r_2 < \dots < r_k the probability due to step 3 is

\displaystyle  \frac{k}{r_1}\frac{k-1}{r_2}\dots\frac{1}{r_k}

That is the probability that each element is put into its own slot. The probability that r_1 doesn’t get overwritten by the upcoming elements (except for r_2,\dots,r_k which have already been picked by step 3) is

\displaystyle  \prod_{r_1 < s \le n-k+1} \frac{s-1}{s} = \frac{r_1}{n-k+1}

Thus, the probability of selecting r_1 < \dots < r_k according to the algorithm is

\displaystyle  \frac{k}{r_1}\frac{k-1}{r_2}\dots\frac{1}{r_k} \times \frac{r_1}{n-k+1}\dots \frac{r_k}{n} = \frac{k!}{(n)_k}

Advertisements
This entry was posted in statistics and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s