您的位置:首页 > 运维架构

Bloom Filter科普

2015-10-29 23:54 363 查看
Bloom Filter的中文翻译叫做布隆过滤器,是1970年由布隆提出的。它实际上是一个很长的二进制向量和一系列随机映射函数。布隆过滤器可以用于检索一个元素是否在一个集合中。它的优点是空间效率和查询时间都远远超过一般的算法,缺点是有一定的误识别率和删除困难。如文章标题所述,本文只是做简单介绍,属于科普文章。

一、应用场景

在正式介绍Bloom Filter算法之前,先来看看什么时候需要用到Bloom Filter算法。
1. HTTP缓存服务器、Web爬虫等

主要工作是判断一条URL是否在现有的URL集合之中(可以认为这里的数据量级上亿)。

对于HTTP缓存服务器,当本地局域网中的PC发起一条HTTP请求时,缓存服务器会先查看一下这个URL是否已经存在于缓存之中,如果存在的话就没有必要去原始的服务器拉取数据了(为了简单起见,我们假设数据没有发生变化),这样既能节省流量,还能加快访问速度,以提高用户体验。

对于Web爬虫,要判断当前正在处理的网页是否已经处理过了,同样需要当前URL是否存在于已经处理过的URL列表之中。

2. 垃圾邮件过滤

假设邮件服务器通过发送方的邮件域或者IP地址对垃圾邮件进行过滤,那么就需要判断当前的邮件域或者IP地址是否处于黑名单之中。如果邮件服务器的通信邮件数量非常大(也可以认为数据量级上亿),那么也可以使用Bloom Filter算法。

二、几个专业术语

这里有必要介绍一下False Positive和False Negative的概念(更形象的描述可以阅读第4条参考)。
False Positive中文可以理解为“假阳性”,形象的一点说就是“误报”,后面将会说道Bloom Filter存在误报的情况,现实生活中也有误报,比如说去体检的时候,医生告诉你XXX检测是阳性,而实际上是阴性,也就是说误报了,是假阳性,杀毒软件误报也是同样的概念。
False Negative,中文可以理解为“假阴性”,形象的一点说是“漏报”。医生告诉你XXX检测为阴性,实际上你是阳性,你是有病的(Sorry, it’s just a joke),那就是漏报了。同样杀毒软件也存在漏报的情况。

三、Bloom Filter算法

好了,终于要正式介绍Bloom Filter算法了。

初始状态下,Bloom Filter是一个m位的位数组,且数组被0所填充。同时,我们需要定义k个不同的hash函数,每一个hash函数都随机的将每一个输入元素映射到位数组中的一个位上。那么对于一个确定的输入,我们会得到k个索引。

插入元素:经过k个hash函数的映射,我们会得到k个索引,我们把位数组中这k个位置全部置1(不管其中的位之前是0还是1)

查询元素:输入元素经过k个hash函数的映射会得到k个索引,如果位数组中这k个索引任意一处是0,那么就说明这个元素不在集合之中;如果元素处于集合之中,那么当插入元素的时候这k个位都是1。但如果这k个索引处的位都是1,被查询的元素就一定在集合之中吗?答案是不一定,也就是说出现了False Positive的情况(但Bloom Filter不会出现False Negative的情况)




在上图中,当插入x、y、z这三个元素之后,再来查询w,会发现w不在集合之中,而如果w经过三个hash函数计算得出的结果所得索引处的位全是1,那么Bloom Filter就会告诉你,w在集合之中,实际上这里是误报,w并不在集合之中。

四、Bloom Filter算法的False Positive Rate

Bloom Filter的误报率到底有多大?下面在数学上进行一番推敲。假设HASH函数输出的索引值落在m位的数组上的每一位上都是等可能的。那么,对于一个给定的HASH函数,在进行某一个运算的时候,一个特定的位没有被设置为1的概率是



那么,对于所有的k个HASH函数,都没有把这个位设置为1的概率是



如果我们已经插入了n个元素,那么对于一个给定的位,这个位仍然是0的概率是



那么,如果插入n个元素之后,这个位是1的概率是



如果对一个特定的元素存在误报,那么这个元素的经过HASH函数所得到的k个索引全部都是1,概率也就是



根据常数e的定义,可以近似的表示为:



五、关于误报

有时候误报对实际操作并不会带来太大的影响,比如对于HTTP缓存服务器,如果一条URL被误以为存在与缓存服务器之中,那么当取数据的时候自然会无法取到,最终还是要从原始服务器当中获取,之后再把记录插入缓存服务器,几乎没有什么不可以接受的。

对于安全软件,有着“另可错报,不可误报”的说法,如果你把一个正常软件误判为病毒,对使用者来说不会有什么影响(如果用户相信是病毒,那么就是删除这个文件罢了,如果用户执意要执行,那么后果也只能由用户来承担);如果你把一个病毒漏判了,那么对用户造成的后果是不可设想的……更有甚者,误报在某种程度上能让部分用户觉得你很专业……

六、java代码实现

具体的实现如下(来自项目http://code.google.com/p/java-bloomfilter/,代码实现不错,可直接使用):

/**
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU Lesser General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
* GNU Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public License
* along with this program.  If not, see <http://www.gnu.org/licenses/>.
*/

package com.skjegstad.utils;

import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;

/**
* Implementation of a Bloom-filter, as described here:
* http://en.wikipedia.org/wiki/Bloom_filter *
* For updates and bugfixes, see http://github.com/magnuss/java-bloomfilter *
* Inspired by the SimpleBloomFilter-class written by Ian Clarke. This
* implementation provides a more evenly distributed Hash-function by
* using a proper digest instead of the Java RNG. Many of the changes
* were proposed in comments in his blog:
* http://blog.locut.us/2008/01/12/a-decent-stand-alone-java-bloom-filter-implementation/ *
* @param <E> Object type that is to be inserted into the Bloom filter, e.g. String or Integer.
* @author Magnus Skjegstad <magnus@skjegstad.com>
*/
public class BloomFilter<E> implements Serializable {
private BitSet bitset;
private int bitSetSize;
private double bitsPerElement;
private int expectedNumberOfFilterElements;
private int numberOfAddedElements;
private int k; // number of hash functions

static final Charset charset = Charset.forName("UTF-8");

static final String hashName = "MD5";
static final MessageDigest digestFunction;
static { // The digest method is reused between instances
MessageDigest tmp;
try {
tmp = java.security.MessageDigest.getInstance(hashName);
} catch (NoSuchAlgorithmException e) {
tmp = null;
}
digestFunction = tmp;
}

/**
* Constructs an empty Bloom filter. The total length of the Bloom filter will be
* c*n.
*
* @param c is the number of bits used per element.
* @param n is the expected number of elements the filter will contain.
* @param k is the number of hash functions used.
*/
public BloomFilter(double c, int n, int k) {
this.expectedNumberOfFilterElements = n;
this.k = k;
this.bitsPerElement = c;
this.bitSetSize = (int)Math.ceil(c * n);
numberOfAddedElements = 0;
this.bitset = new BitSet(bitSetSize);
}

/**
* Constructs an empty Bloom filter. The optimal number of hash functions (k) is estimated from
* the total size of the Bloom
* and the number of expected elements.
*
* @param bitSetSize defines how many bits should be used in total for the filter.
* @param expectedNumberOElements defines the maximum number of elements the filter is
* expected to contain.
*/
public BloomFilter(int bitSetSize, int expectedNumberOElements) {
this(bitSetSize / (double)expectedNumberOElements,
expectedNumberOElements,
(int) Math.round((bitSetSize / (double)expectedNumberOElements) * Math.log(2.0)));
}

/**
* Constructs an empty Bloom filter with a given false positive probability. The number of bits per
* element and the number of hash functions is estimated
* to match the false positive probability.
*
* @param falsePositiveProbability is the desired false positive probability.
* @param expectedNumberOfElements is the expected number of elements in the Bloom filter.
*/
public BloomFilter(double falsePositiveProbability, int expectedNumberOfElements) {
this(Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2))) / Math.log(2),
expectedNumberOfElements,
(int)Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2))));
}

/**
* Construct a new Bloom filter based on existing Bloom filter data.
*
* @param bitSetSize defines how many bits should be used for the filter.
* @param expectedNumberOfFilterElements defines the maximum number of elements the filter
*    is expected to contain.
* @param actualNumberOfFilterElements specifies how many elements have been inserted into
*    the <code>filterData</code> BitSet.
* @param filterData a BitSet representing an existing Bloom filter.
*/
public BloomFilter(int bitSetSize, int expectedNumberOfFilterElements,
int actualNumberOfFilterElements, BitSet filterData) {
this(bitSetSize, expectedNumberOfFilterElements);
this.bitset = filterData;
this.numberOfAddedElements = actualNumberOfFilterElements;
}

/**
* Generates a digest based on the contents of a String.
*
* @param val specifies the input data.
* @param charset specifies the encoding of the input data.
* @return digest as long.
*/
public static int createHash(String val, Charset charset) {
return createHash(val.getBytes(charset));
}

/**
* Generates a digest based on the contents of a String.
*
* @param val specifies the input data. The encoding is expected to be UTF-8.
* @return digest as long.
*/
public static int createHash(String val) {
return createHash(val, charset);
}

/**
* Generates a digest based on the contents of an array of bytes.
*
* @param data specifies input data.
* @return digest as long.
*/
public static int createHash(byte[] data) {
return createHashes(data, 1)[0];
}

/**
* Generates digests based on the contents of an array of bytes and splits the result into
* 4-byte int's and store them in an array. The
* digest function is called until the required number of int's are produced. For each call to
* digest a salt
* is prepended to the data. The salt is increased by 1 for each call.
*
* @param data specifies input data.
* @param hashes number of hashes/int's to produce.
* @return array of int-sized hashes
*/
public static int[] createHashes(byte[] data, int hashes) {
int[] result = new int[hashes];

int k = 0;
byte salt = 0;
while (k < hashes) {
byte[] digest;
synchronized (digestFunction) {
digestFunction.update(salt);
salt++;
digest = digestFunction.digest(data);
}

for (int i = 0; i < digest.length/4 && k < hashes; i++) {
int h = 0;
for (int j = (i*4); j < (i*4)+4; j++) {
h <<= 8;
h |= ((int) digest[j]) & 0xFF;
}
result[k] = h;
k++;
}
}
return result;
}

/**
* Compares the contents of two instances to see if they are equal.
*
* @param obj is the object to compare to.
* @return True if the contents of the objects are equal.
*/
@Override
public boolean equals(Object obj) {
if (obj == null) {
return false;
}
if (getClass() != obj.getClass()) {
return false;
}
final BloomFilter<E> other = (BloomFilter<E>) obj;
if (this.expectedNumberOfFilterElements != other.expectedNumberOfFilterElements) {
return false;
}
if (this.k != other.k) {
return false;
}
if (this.bitSetSize != other.bitSetSize) {
return false;
}
if (this.bitset != other.bitset && (this.bitset == null || !this.bitset.equals(other.bitset))) {
return false;
}
return true;
}

/**
* Calculates a hash code for this class.
* @return hash code representing the contents of an instance of this class.
*/
@Override
public int hashCode() {
int hash = 7;
hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
hash = 61 * hash + this.expectedNumberOfFilterElements;
hash = 61 * hash + this.bitSetSize;
hash = 61 * hash + this.k;
return hash;
}

/**
* Calculates the expected probability of false positives based on
* the number of expected filter elements and the size of the Bloom filter.
* <br /><br />
* The value returned by this method is the <i>expected</i> rate of false
* positives, assuming the number of inserted elements equals the number of
* expected elements. If the number of elements in the Bloom filter is less
* than the expected value, the true probability of false positives will be lower.
*
* @return expected probability of false positives.
*/
public double expectedFalsePositiveProbability() {
return getFalsePositiveProbability(expectedNumberOfFilterElements);
}

/**
* Calculate the probability of a false positive given the specified
* number of inserted elements.
*
* @param numberOfElements number of inserted elements.
* @return probability of a false positive.
*/
public double getFalsePositiveProbability(double numberOfElements) {
// (1 - e^(-k * n / m)) ^ k
return Math.pow((1 - Math.exp(-k * (double) numberOfElements
/ (double) bitSetSize)), k);

}

/**
* Get the current probability of a false positive. The probability is calculated from
* the size of the Bloom filter and the current number of elements added to it.
*
* @return probability of false positives.
*/
public double getFalsePositiveProbability() {
return getFalsePositiveProbability(numberOfAddedElements);
}

/**
* Returns the value chosen for K.<br />
* <br />
* K is the optimal number of hash functions based on the size
* of the Bloom filter and the expected number of inserted elements.
*
* @return optimal k.
*/
public int getK() {
return k;
}

/**
* Sets all bits to false in the Bloom filter.
*/
public void clear() {
bitset.clear();
numberOfAddedElements = 0;
}

/**
* Adds an object to the Bloom filter. The output from the object's
* toString() method is used as input to the hash functions.
*
* @param element is an element to register in the Bloom filter.
*/
public void add(E element) {
add(element.toString().getBytes(charset));
}

/**
* Adds an array of bytes to the Bloom filter.
*
* @param bytes array of bytes to add to the Bloom filter.
*/
public void add(byte[] bytes) {
int[] hashes = createHashes(bytes, k);
for (int hash : hashes)
bitset.set(Math.abs(hash % bitSetSize), true);
numberOfAddedElements ++;
}

/**
* Adds all elements from a Collection to the Bloom filter.
* @param c Collection of elements.
*/
public void addAll(Collection<? extends E> c) {
for (E element : c)
add(element);
}

/**
* Returns true if the element could have been inserted into the Bloom filter.
* Use getFalsePositiveProbability() to calculate the probability of this
* being correct.
*
* @param element element to check.
* @return true if the element could have been inserted into the Bloom filter.
*/
public boolean contains(E element) {
return contains(element.toString().getBytes(charset));
}

/**
* Returns true if the array of bytes could have been inserted into the Bloom filter.
* Use getFalsePositiveProbability() to calculate the probability of this
* being correct.
*
* @param bytes array of bytes to check.
* @return true if the array could have been inserted into the Bloom filter.
*/
public boolean contains(byte[] bytes) {
int[] hashes = createHashes(bytes, k);
for (int hash : hashes) {
if (!bitset.get(Math.abs(hash % bitSetSize))) {
return false;
}
}
return true;
}

/**
* Returns true if all the elements of a Collection could have been inserted
* into the Bloom filter. Use getFalsePositiveProbability() to calculate the
* probability of this being correct.
* @param c elements to check.
* @return true if all the elements in c could have been inserted into the Bloom filter.
*/
public boolean containsAll(Collection<? extends E> c) {
for (E element : c)
if (!contains(element))
return false;
return true;
}

/**
* Read a single bit from the Bloom filter.
* @param bit the bit to read.
* @return true if the bit is set, false if it is not.
*/
public boolean getBit(int bit) {
return bitset.get(bit);
}

/**
* Set a single bit in the Bloom filter.
* @param bit is the bit to set.
* @param value If true, the bit is set. If false, the bit is cleared.
*/
public void setBit(int bit, boolean value) {
bitset.set(bit, value);
}

/**
* Return the bit set used to store the Bloom filter.
* @return bit set representing the Bloom filter.
*/
public BitSet getBitSet() {
return bitset;
}

/**
* Returns the number of bits in the Bloom filter. Use count() to retrieve
* the number of inserted elements.
*
* @return the size of the bitset used by the Bloom filter.
*/
public int size() {
return this.bitSetSize;
}

/**
* Returns the number of elements added to the Bloom filter after it
* was constructed or after clear() was called.
*
* @return number of elements added to the Bloom filter.
*/
public int count() {
return this.numberOfAddedElements;
}

/**
* Returns the expected number of elements to be inserted into the filter.
* This value is the same value as the one passed to the constructor.
*
* @return expected number of elements.
*/
public int getExpectedNumberOfElements() {
return expectedNumberOfFilterElements;
}

/**
* Get expected number of bits per element when the Bloom filter is full. This value is set by the constructor
* when the Bloom filter is created. See also getBitsPerElement().
*
* @return expected number of bits per element.
*/
public double getExpectedBitsPerElement() {
return this.bitsPerElement;
}

/**
* Get actual number of bits per element based on the number of elements that have currently been inserted and the length
* of the Bloom filter. See also getExpectedBitsPerElement().
*
* @return number of bits per element.
*/
public double getBitsPerElement() {
return this.bitSetSize / (double)numberOfAddedElements;
}
}


六、参考文档

1. 布隆过滤器 http://zh.wikipedia.org/wiki/Bloom_filter
2. Bloom filter http://en.wikipedia.org/wiki/Bloom_filter
3. Bloom Filter算法 http://www.cnblogs.com/yuyijq/archive/2012/02/08/2343374.html 4. 什么是False Positive和False Negative http://simon.blog.51cto.com/80/73395/
参考:http://www.programlife.net/bloom-filter.html


内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签:  算法 hadoop