
A Document Inverted Index with Term-Frequency Payloads on Hadoop

An inverted index is a data structure that virtually every search engine supporting full-text retrieval depends on. Given a term, the index returns the list of documents that contain it. For example:
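As a small illustration (the documents here are hypothetical), suppose doc1 contains "fish cat fish" and doc2 contains "fish dog". The index then maps:

fish → doc1, doc2
cat  → doc1
dog  → doc2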


Once we also want to record attributes such as a term's frequency in each document, its positions, or the URL of the corresponding web document, this simple inverted-index scheme is no longer sufficient. These attributes (frequency, positions, and so on) are collectively called the payload. Here the payload is the term frequency, so each posting has the form <docid, tf>; for example, <doc1, 2> records that a term occurs twice in doc1.

The Map and Reduce steps in pseudocode:

class Mapper
    procedure Map(docid n, doc d)
        H ← new AssociativeArray
        for all term t ∈ doc d do
            H{t} ← H{t} + 1                   // accumulate the term-frequency payload
        for all term t ∈ H do
            Emit(term t, posting <n, H{t}>)

class Reducer
    procedure Reduce(term t, postings [<n1, f1>, <n2, f2> …])
        P ← new List
        for all posting <a, f> ∈ postings [<n1, f1>, <n2, f2> …] do
            Append(P, <a, f>)
        Sort(P)
        Emit(term t, postings P)

The overall MapReduce flow looks like this:
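As a sketch of the flow, using doc1 ("fish cat fish") and doc2 ("fish dog") from the illustration above, and assuming each document is a single line in its own file (the record format fileName@frequency is the one used by the Java code below):

Map output (one record per term per line):
    ("fish", "doc1@2")  ("cat", "doc1@1")  ("fish", "doc2@1")  ("dog", "doc2@1")
Combine output (per-file sums within each map task):
    unchanged here, since each term already has a single record per file
Reduce output (postings concatenated per term):
    ("fish", "doc1@2;doc2@1")  ("cat", "doc1@1")  ("dog", "doc2@1")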



The full Java implementation follows.

import java.io.IOException;
import java.util.Hashtable;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class InvertedIndex {

    public static class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // The name of the file this split belongs to becomes part of the posting.
            FileSplit fileSplit = (FileSplit) context.getInputSplit();
            String fileName = fileSplit.getPath().getName();

            String word;
            IntWritable frequence = new IntWritable();
            int one = 1;
            // Use String (not Text) as the hash key; a Text key misbehaves here.
            Hashtable<String, Integer> hashmap = new Hashtable<String, Integer>();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word = itr.nextToken();
                if (hashmap.containsKey(word)) {
                    // The map input key is the byte offset of one line, so this only
                    // counts duplicate words within a single line; the combiner below
                    // merges these per-line counts into per-file counts.
                    hashmap.put(word, hashmap.get(word) + 1);
                } else {
                    hashmap.put(word, one);
                }
            }

            for (Iterator<String> it = hashmap.keySet().iterator(); it.hasNext();) {
                word = it.next();
                frequence = new IntWritable(hashmap.get(word));
                Text fileName_frequence = new Text(fileName + "@" + frequence.toString());
                context.write(new Text(word), fileName_frequence);  // emit in the form ("fish", "doc1@1")
            }
        }
    }

    public static class InvertedIndexCombiner extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Merge the mapper's per-line outputs into one per-file frequency.
            String fileName = "";
            int sum = 0;
            String num;
            String s;
            for (Text val : values) {
                s = val.toString();
                fileName = s.substring(0, val.find("@"));
                // Extract the frequency after '@' in a value such as "doc1@1".
                num = s.substring(val.find("@") + 1, val.getLength());
                sum += Integer.parseInt(num);
            }
            IntWritable frequence = new IntWritable(sum);
            context.write(key, new Text(fileName + "@" + frequence.toString()));
        }
    }
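    // A note on why the combiner is correct: an input split never spans files, so every
    // value a combiner invocation sees for a given key comes from the same file and
    // carries the same fileName, and keeping the last one is safe. The combiner's output
    // format (fileName@frequency) also matches the mapper's, which matters because Hadoop
    // may run the combiner zero or more times.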

    public static class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Concatenate all postings for this term into a single line.
            Iterator<Text> it = values.iterator();
            StringBuilder all = new StringBuilder();
            if (it.hasNext()) {
                all.append(it.next().toString());
            }
            while (it.hasNext()) {
                all.append(";");
                all.append(it.next().toString());
            }
            // Final output, e.g.: ("fish", "doc1@2;doc2@1")
            context.write(key, new Text(all.toString()));
        }
    }
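    // Note that the pseudocode's Sort(P) step has no counterpart above: postings are
    // appended in whatever order the values iterator yields them. If a deterministic
    // posting order is wanted, the reduce body could be rewritten along these lines
    // (a minimal sketch reusing the identifiers above; it additionally requires
    // java.util.List, java.util.ArrayList, and java.util.Collections imports):
    //
    //     List<String> postings = new ArrayList<String>();
    //     for (Text val : values) {
    //         postings.add(val.toString());
    //     }
    //     Collections.sort(postings);  // lexicographic order on "fileName@frequency"
    //     StringBuilder all = new StringBuilder();
    //     for (int i = 0; i < postings.size(); i++) {
    //         if (i > 0) all.append(";");
    //         all.append(postings.get(i));
    //     }
    //     context.write(key, new Text(all.toString()));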

    public static void main(String[] args) {
        try {
            Configuration conf = new Configuration();
            // Let GenericOptionsParser strip the generic Hadoop options first,
            // then validate what remains.
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: InvertedIndex <in> <out>");
                System.exit(2);
            }

            Job job = new Job(conf, "invertedindex");
            job.setJarByClass(InvertedIndex.class);

            job.setMapperClass(InvertedIndexMapper.class);
            job.setCombinerClass(InvertedIndexCombiner.class);
            job.setReducerClass(InvertedIndexReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

}
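Assuming the class is packaged into a jar (the jar and directory names below are only placeholders), the job runs like any other MapReduce job:

hadoop jar invertedindex.jar InvertedIndex /user/hadoop/input /user/hadoop/output

With doc1 and doc2 from the earlier illustration placed in the input directory, the output file (part-r-00000, with the default tab between key and value) would contain lines such as:

cat	doc1@1
dog	doc2@1
fish	doc1@2;doc2@1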




Reference: Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce.