Question

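I am exploring Bloom filters. I have gone through most of the blog posts about Bloom filters and understand what they are, but I still cannot find an example that applies one to a join.

Every article says a Bloom filter will reduce network I/O, but none of them show how. One particularly good article is http://vanjakom.wordpress.com/tag/distributed-cache/, but it looks complicated to me because I have only just started with MapReduce.

Can anyone help me implement a Bloom filter in the following example (a reduce-side join)?

Two mappers read the user records and the department records, and a reducer joins them.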

User records

id, name
3738, Richie Gore
12946, Rony Sam
17556, David Gart
3443, Rachel Smith
5799, Paul Rosta

Department records

3738, Sales
12946, Marketing
17556, Marketing
3738, Sales

Code

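import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (id, "A" + name) for each user record; the "A" tag marks the user side of the join.
public class UserMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    private final Text outkey = new Text();
    private final Text outval = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] arryUsers = value.toString().split(",");
        if (arryUsers.length < 2) {
            return; // skip blank or malformed lines
        }
        outkey.set(arryUsers[0].trim());
        outval.set("A" + arryUsers[1].trim());
        output.collect(outkey, outval);
    }
}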

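import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (id, "B" + department) for each department record; the "B" tag marks the department side.
public class DepartMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    private final Text outk = new Text();
    private final Text outv = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] arryDept = value.toString().split(",");
        if (arryDept.length < 2) {
            return; // skip blank or malformed lines
        }
        outk.set(arryDept[0].trim());
        outv.set("B" + arryDept[1].trim());
        output.collect(outk, outv);
    }
}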

And the reducer:

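import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class JoinReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

    private final ArrayList<Text> listA = new ArrayList<Text>(); // user names for the current id
    private final ArrayList<Text> listB = new ArrayList<Text>(); // departments for the current id

    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        listA.clear();
        listB.clear();

        // Split the values by their one-character tag and strip the tag.
        // Hadoop reuses the Text instance, so each value is copied into a new Text.
        while (values.hasNext()) {
            Text tmp = values.next();
            if (tmp.charAt(0) == 'A') {
                listA.add(new Text(tmp.toString().substring(1)));
            } else if (tmp.charAt(0) == 'B') {
                listB.add(new Text(tmp.toString().substring(1)));
            }
        }
        executeJoinLogic(output);
    }

    // Inner join: emit every (name, department) pair, but only if both sides are present.
    private void executeJoinLogic(OutputCollector<Text, Text> output) throws IOException {
        if (!listA.isEmpty() && !listB.isEmpty()) {
            for (Text a : listA) {
                for (Text b : listB) {
                    output.collect(a, b);
                }
            }
        }
    }
}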

Is it possible to implement a Bloom filter in the scenario above?

If so, could you please help me implement it?

Answer 1

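It is only possible to implement a Bloom filter here if one of your two input tables is much smaller than the other, because every mapper has to read the smaller table to build the filter. The process you need to follow is:

Initialize the Bloom filter in the setup() method of the Mapper class (the filter object itself should be a field of the class so that the map() method can access it later):

filter = new BloomFilter(VECTOR_SIZE,NB_HASH,HASH_TYPE);

Read the smaller table in the Mapper's setup() method.

Add the ID of each record to the Bloom filter:

filter.add(ID);

In the map() method itself, call filter.membershipTest(ID) on every ID from the larger input source. If there is no match, the ID is not present in your smaller dataset, so the record should not be passed to the reducer.

Remember that a Bloom filter produces false positives, so some records that have no partner will still reach the reducer. Do not assume that everything arriving there will be joined.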
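The steps above can be sketched in plain Java, without a cluster. This is a minimal illustration, not the Hadoop implementation: `SimpleBloomFilter` and `BloomJoinSketch` are hypothetical names, and the filter is a JDK-only stand-in for `org.apache.hadoop.util.bloom.BloomFilter`, whose `add` and `membershipTest` take a `Key` rather than a `String`. The two sizing helpers use the standard textbook formulas for the bit-vector size and the number of hash functions, as one possible way to pick `VECTOR_SIZE` and `NB_HASH`. Note also that the question's mappers extend the old `MapReduceBase`, which has `configure(JobConf)` where the newer API has `setup()`, so the filter would presumably be built there.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical JDK-only stand-in for Hadoop's BloomFilter (assumption for illustration).
class SimpleBloomFilter {
    private final BitSet bits;
    private final int vectorSize;
    private final int nbHash;

    SimpleBloomFilter(int vectorSize, int nbHash) {
        this.bits = new BitSet(vectorSize);
        this.vectorSize = vectorSize;
        this.nbHash = nbHash;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String key, int i) {
        int h1 = 0x811c9dc5; // FNV-1a over the UTF-8 bytes
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h1 ^= (b & 0xff);
            h1 *= 16777619;
        }
        int h2 = key.hashCode() | 1; // odd step size
        return Math.floorMod(h1 + i * h2, vectorSize);
    }

    void add(String key) {
        for (int i = 0; i < nbHash; i++) {
            bits.set(position(key, i));
        }
    }

    // false: the key was definitely never added; true: the key was probably added.
    boolean membershipTest(String key) {
        for (int i = 0; i < nbHash; i++) {
            if (!bits.get(position(key, i))) {
                return false;
            }
        }
        return true;
    }
}

class BloomJoinSketch {
    // Bits needed for n keys at false-positive rate p: m = -n * ln(p) / (ln 2)^2.
    static int optimalVectorSize(int n, double p) {
        return (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
    }

    // Number of hash functions for m bits and n keys: k = (m / n) * ln 2.
    static int optimalNbHash(int n, int m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    // Setup phase: add the ID of every record of the smaller table to the filter.
    static SimpleBloomFilter buildFilter(List<String> smallTableLines, int vectorSize, int nbHash) {
        SimpleBloomFilter filter = new SimpleBloomFilter(vectorSize, nbHash);
        for (String line : smallTableLines) {
            filter.add(line.split(",")[0].trim());
        }
        return filter;
    }

    // Map phase: keep only records of the larger table whose ID passes the filter.
    static List<String> mapSideFilter(List<String> largeTableLines, SimpleBloomFilter filter) {
        List<String> emitted = new ArrayList<>();
        for (String line : largeTableLines) {
            String id = line.split(",")[0].trim();
            if (filter.membershipTest(id)) {
                emitted.add(line); // only these records would be shuffled to the reducer
            }
        }
        return emitted;
    }
}
```

Building the filter from the department records and passing the user records through `mapSideFilter` always keeps users 3738, 12946 and 17556, because a Bloom filter never gives a false negative. Users 3443 and 5799 are dropped unless they happen to be false positives, which is why the reducer's check that both sides are non-empty still has to stay. The saving in network I/O comes from the dropped records never entering the shuffle.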

Using a Bloom filter in a reduce-side join
