Reservoir Sampling
2014-08-23 11:55
http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
A Primer on Reservoir Sampling
For this problem, picking a single element uniformly at random from a stream whose length is not known in advance, the simplest concrete example is a stream that contains only a single item. In this case, our algorithm should return that single element with probability 1.
Now let’s try a slightly harder problem: a stream with exactly two elements. We have to hold on to the first element we see, because we don’t yet know whether the stream has only one element. When the second element comes along, we want to return one of the two elements, each with probability 1/2. So let’s generate a random number R between 0 and 1, return the first element if R is less than 0.5, and return the second element otherwise.
Now let’s generalize this approach to a stream with three elements. After we’ve seen the second element, we’re holding on to either the first element or the second element, each with probability 1/2. What should we do when the third element arrives? If we know that there are only three elements in the stream, we need to return the third element with probability 1/3, which means that we return the element we’re already holding with probability 1 - 1/3 = 2/3. The probability of returning each element in the stream is therefore:
First Element: (1/2) * (2/3) = 1/3
Second Element: (1/2) * (2/3) = 1/3
Third Element: 1/3
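The arithmetic above can be checked exactly with rational numbers. The following is a minimal sketch, not part of the original post; the function name selection_probabilities and its parameter n are hypothetical names introduced for illustration. For each position in the stream, it multiplies the probability of picking the element when it arrives by the probability of keeping it at every later arrival.

```python
from fractions import Fraction


def selection_probabilities(n):
    """Exact probability that each of n stream positions is the final pick.

    Hypothetical helper for illustration; it evaluates the same product
    as the three-element example, for a stream of length n.
    """
    probs = []
    for i in range(1, n + 1):
        # The i-th element is picked with probability 1/i when it arrives ...
        p = Fraction(1, i)
        # ... and must survive each later arrival k with probability (k-1)/k.
        for k in range(i + 1, n + 1):
            p *= Fraction(k - 1, k)
        probs.append(p)
    return probs


print(selection_probabilities(3))
# prints [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
```

Using Fraction rather than floating point keeps the check exact, so the three values compare equal to 1/3 with no rounding.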
The three-element stream shows how to generalize this algorithm to any N: at step N, keep the N-th element of the stream with probability 1/N. This means that the element we are currently holding survives the step with probability (N-1)/N. Since that element was held with probability 1/(N-1) after the previous step, it is still held after step N with probability (1/(N-1)) * (N-1)/N = 1/N, the same as the new element.
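The rule above translates almost directly into code. What follows is a minimal sketch under stated assumptions, not the Cloudera ML implementation: the function name reservoir_sample_one and the rng parameter are hypothetical, and the stream is assumed to be any Python iterable that can be read once.

```python
import random
from collections import Counter


def reservoir_sample_one(stream, rng=random):
    """Return one element chosen uniformly at random from an iterable.

    The length of the stream does not need to be known in advance.
    Returns None for an empty stream (a choice made for this sketch).
    """
    chosen = None
    for n, item in enumerate(stream, start=1):
        # Keep the n-th element with probability 1/n; otherwise keep
        # the element we are already holding. For n == 1 the test
        # always succeeds because rng.random() is in [0, 1).
        if rng.random() < 1.0 / n:
            chosen = item
    return chosen


# Usage: sample repeatedly from a three-element stream and count the picks.
rng = random.Random(0)  # seeded only so that repeated runs match
counts = Counter(reservoir_sample_one(iter("abc"), rng) for _ in range(30000))
print(counts)
```

Each of the three elements should appear in roughly one third of the trials. Only a single element is held at any time, so memory use does not grow with the length of the stream.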
This general technique is called reservoir sampling, and it is useful in a number of applications that require us to analyze very large data sets. You can find an excellent overview of a set of algorithms for performing reservoir sampling in this blog post by Greg Grothaus. I’d like to focus on two of those algorithms in particular, and talk about how they are used in Cloudera ML, our open-source collection of data preparation and machine learning algorithms for Hadoop.