Top-k pairs with minimum Jaccard distance
2016-10-09 15:11
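A homework problem: compute the pairwise similarity of 130,000 users. Each user has two lists (L, U): the movies they like and the movies they dislike. The goal is to find the 100 most similar pairs of users, which takes more than 9 billion pairwise computations in total.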
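The Jaccard formula:
$$
\mathrm{Jaccard}_{ij} = \frac{\left|(L_i \cap L_j) \cup (U_i \cap U_j)\right|}{\left|(L_i \cup L_j) \cup (U_i \cup U_j)\right|}
$$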
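As a quick sanity check of the formula, here is a tiny worked example. The four sets are made up for illustration and do not come from the assignment data.

$$
L_i = \{1,2,3\},\quad L_j = \{2,3,4\},\quad U_i = \{7\},\quad U_j = \{7,8\}
$$
$$
\mathrm{Jaccard}_{ij} = \frac{\left|\{2,3\} \cup \{7\}\right|}{\left|\{1,2,3,4\} \cup \{7,8\}\right|} = \frac{3}{6} = 0.5
$$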
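Computed directly, this only needs the set_union and set_intersection functions. The direct approach is very slow, though, so the optimizations are described below, step by step: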
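A minimal sketch of the direct computation, assuming each list is stored as a sorted, duplicate-free `std::vector<int>`. The helper names `intersect_sorted`, `unite_sorted` and `jaccard_direct` are hypothetical, and returning 0 for two completely empty users is an assumption made here to avoid dividing by zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

using rec_type = std::vector<int>;

// Both inputs must be sorted and duplicate-free (std::set_* precondition).
rec_type intersect_sorted(const rec_type& a, const rec_type& b) {
    rec_type out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

rec_type unite_sorted(const rec_type& a, const rec_type& b) {
    rec_type out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}

// |(L1 ∩ L2) ∪ (U1 ∩ U2)| / |(L1 ∪ L2) ∪ (U1 ∪ U2)|, with no pruning at all.
double jaccard_direct(const rec_type& like_1, const rec_type& unlike_1,
                      const rec_type& like_2, const rec_type& unlike_2) {
    rec_type num = unite_sorted(intersect_sorted(like_1, like_2),
                                intersect_sorted(unlike_1, unlike_2));
    rec_type den = unite_sorted(unite_sorted(like_1, like_2),
                                unite_sorted(unlike_1, unlike_2));
    if (den.empty()) return 0.0;  // assumption: two empty users score 0
    return static_cast<double>(num.size()) / static_cast<double>(den.size());
}
```

Every call performs six set operations and allocates six temporary vectors, which is the cost that the pruning steps try to avoid for most pairs.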
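(1) Keep a top-k heap and compare each new score against its smallest element. If the new score is larger, pop the smallest element and push the new one.
(2) A few further tricks speed things up:
[1] The hot spot is computing the intersections and unions. set_union requires its input ranges to be sorted, so we sort every user's lists once up front and never have to sort again later.
[2] Intersections and unions satisfy two inequalities: |a∪b| ≥ max(|a|,|b|) and |a∩b| ≤ min(|a|,|b|). These two basic inequalities avoid a lot of unnecessary work. Before computing the Jaccard value, use them to estimate an upper bound on it; if that bound is already below the smallest value in the top-k heap, the pair can be skipped without computing anything. If the pair cannot be ruled out, carry out one more part of the real Jaccard computation, tighten the estimate with the exact sizes, and compare again. There are four such comparisons in total; each time a comparison fails to prune the pair, one more part of the Jaccard computation is performed to make the estimate more precise.
[3] Finally, macros and type aliases keep the code short.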
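The size-only estimate from [2] can be sketched on its own. This is an illustration under assumptions: `jaccard_upper_bound` and `can_skip` are hypothetical names, and the parameters are just the four list sizes of the two users. The numerator is bounded above by min(|L1|,|L2|) + min(|U1|,|U2|), and the denominator is bounded below by the largest of the four sizes.

```cpp
#include <algorithm>
#include <cstddef>

// Upper bound on the Jaccard value using list sizes only (no set operations).
// li, ui: sizes of user i's like/dislike lists; lj, uj: the same for user j.
double jaccard_upper_bound(std::size_t li, std::size_t ui,
                           std::size_t lj, std::size_t uj) {
    // |a ∩ b| <= min(|a|, |b|), applied to both the like and dislike lists.
    std::size_t num = std::min(li, lj) + std::min(ui, uj);
    // |a ∪ b| >= max(|a|, |b|), and the outer union is at least as large
    // as either of its two parts.
    std::size_t den = std::max(std::max(li, lj), std::max(ui, uj));
    if (den == 0) return 0.0;  // assumption: two empty users score 0
    return static_cast<double>(num) / static_cast<double>(den);
}

// A pair can be skipped when even its upper bound cannot beat the
// smallest score currently held in the top-k heap.
bool can_skip(std::size_t li, std::size_t ui, std::size_t lj, std::size_t uj,
              double current_min) {
    return jaccard_upper_bound(li, ui, lj, uj) < current_min;
}
```

The bound is loose and can exceed 1, so it only ever rules pairs out. It is most effective when the two users' list sizes differ a lot, for example a user with 2 liked movies against a user with 8.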
    #include <algorithm>
    #include <cstdio>
    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <map>
    #include <queue>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>
    #include <boost/algorithm/string/classification.hpp>
    #include <boost/algorithm/string/split.hpp>
    using namespace std;

    using rec_type = vector<int>;                   // sorted, duplicate-free movie ids
    using pair_type = pair<pair<int, int>, float>;  // ((user_1, user_2), score)

    // Apply a std set algorithm f to the sorted ranges a and b, appending to c.
    #define SET_OP(f, a, b, c) \
        (f(a.begin(), a.end(), b.begin(), b.end(), back_inserter(c)))

    // Once the heap is full, return 0 early if the upper bound
    // (a + b) / max(c, d) is already below the current top-k minimum.
    #define CHECK(a, b, c, d)                                                   \
        do {                                                                    \
            if (cnt > 100 && minimum > ((float)((a) + (b)) / (float)max(c, d))) \
                return 0;                                                       \
        } while (0)

    struct cmp {
        // Min-heap on the score: the smallest score sits at top().
        bool operator()(const pair_type& left, const pair_type& right) const {
            return left.second > right.second;
        }
    };

    float minimum = 0;      // smallest score currently in the top-k heap
    long long int cnt = 0;  // number of pairs processed so far

    float jaccard(const rec_type& like_1, const rec_type& unlike_1,
                  const rec_type& like_2, const rec_type& unlike_2);

    int main() {
        ifstream fin("result.txt");
        string s;
        map<int, pair<rec_type, rec_type>> data;
        // Parsed line layout: user id, TAB, liked ids, optional TAB + disliked ids.
        while (getline(fin, s)) {
            vector<string> line;
            boost::split(line, s, boost::is_any_of("\t"), boost::token_compress_on);
            if (line.size() < 2) continue;  // skip lines without a like list
            vector<string> L, U;
            set<int> like_set, unlike_set;  // std::set sorts and deduplicates
            boost::split(L, line[1], boost::is_any_of(" "), boost::token_compress_on);
            if (line.size() == 3)
                boost::split(U, line[2], boost::is_any_of(" "), boost::token_compress_on);
            for (const auto& x : U) if (!x.empty()) unlike_set.insert(stoi(x));
            for (const auto& x : L) if (!x.empty()) like_set.insert(stoi(x));
            data[stoi(line[0])] = pair<rec_type, rec_type>(
                rec_type(like_set.begin(), like_set.end()),
                rec_type(unlike_set.begin(), unlike_set.end()));
        }

        priority_queue<pair_type, vector<pair_type>, cmp> buf;
        for (auto it_1 = data.begin(); it_1 != data.end(); ++it_1) {
            // Bind both lists by reference to avoid copying them per user.
            const rec_type& like_1 = it_1->second.first;
            const rec_type& unlike_1 = it_1->second.second;
            for (auto it_2 = next(it_1); it_2 != data.end(); ++it_2) {
                const rec_type& like_2 = it_2->second.first;
                const rec_type& unlike_2 = it_2->second.second;
                float result = jaccard(like_1, unlike_1, like_2, unlike_2);
                ++cnt;
                if (buf.size() < 100) {
                    buf.push(make_pair(make_pair(it_1->first, it_2->first), result));
                } else if (result > buf.top().second) {
                    buf.pop();
                    buf.push(make_pair(make_pair(it_1->first, it_2->first), result));
                    minimum = buf.top().second;
                }
                // Progress indicator: one line per million pairs.
                if (cnt % 1000000 == 0) cout << cnt / 1000000 << " " << endl;
            }
        }

        // Print the top 100 pairs in ascending order of score.
        while (!buf.empty()) {
            printf("%d %d %f\n", buf.top().first.first, buf.top().first.second,
                   buf.top().second);
            buf.pop();
        }
        return 0;
    }

    inline float jaccard(const rec_type& like_1, const rec_type& unlike_1,
                         const rec_type& like_2, const rec_type& unlike_2) {
        // Check 1: sizes only, no set operations.
        auto v1_approximate = max(like_1.size(), like_2.size());      // <= |L1 ∪ L2|
        auto v2_approximate = max(unlike_1.size(), unlike_2.size());  // <= |U1 ∪ U2|
        auto v3_approximate = min(like_1.size(), like_2.size());      // >= |L1 ∩ L2|
        auto v4_approximate = min(unlike_1.size(), unlike_2.size());  // >= |U1 ∩ U2|
        CHECK(v3_approximate, v4_approximate, v1_approximate, v2_approximate);

        // Check 2: exact intersections, estimated unions.
        rec_type v3, v4;
        SET_OP(set_intersection, like_1, like_2, v3);
        SET_OP(set_intersection, unlike_1, unlike_2, v4);
        CHECK(v3.size(), v4.size(), v1_approximate, v2_approximate);

        // Check 3: exact per-list unions.
        rec_type v1, v2;
        SET_OP(set_union, like_1, like_2, v1);
        SET_OP(set_union, unlike_1, unlike_2, v2);
        CHECK(v3.size(), v4.size(), v1.size(), v2.size());

        // Check 4: exact denominator.
        rec_type v6;
        SET_OP(set_union, v1, v2, v6);
        if (v6.empty()) return 0;  // both users have empty lists
        CHECK(v3.size(), v4.size(), v6.size(), v6.size());

        // Exact numerator and final value.
        rec_type v5;
        SET_OP(set_union, v3, v4, v5);
        return (float)((double)v5.size() / (double)v6.size());
    }
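Step (1), the top-k heap, can also be isolated from the main loop. The sketch below is one possible way to package it, not the code used above: the class `TopK` and its members `offer`, `threshold`, `full` and `size` are hypothetical names. It wraps the same min-heap idea and exposes the pruning threshold that the program keeps in the global `minimum`.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Keeps the k largest scores seen so far in a min-heap.
class TopK {
public:
    explicit TopK(std::size_t k) : k_(k) {}

    // Offer a scored pair (a, b); returns true if it was kept.
    bool offer(int a, int b, float score) {
        if (heap_.size() < k_) {
            heap_.emplace(score, std::make_pair(a, b));
            return true;
        }
        // Heap is full: keep the new score only if it beats the smallest one.
        if (k_ == 0 || score <= heap_.top().first) return false;
        heap_.pop();
        heap_.emplace(score, std::make_pair(a, b));
        return true;
    }

    bool full() const { return heap_.size() >= k_; }

    // Smallest kept score once the heap is full, otherwise 0 (never prunes).
    float threshold() const {
        return (full() && !heap_.empty()) ? heap_.top().first : 0.0f;
    }

    std::size_t size() const { return heap_.size(); }

private:
    using entry = std::pair<float, std::pair<int, int>>;
    std::size_t k_;
    // std::greater turns the default max-heap into a min-heap on the score.
    std::priority_queue<entry, std::vector<entry>, std::greater<entry>> heap_;
};
```

With such a helper the inner loop would call `offer(user_1, user_2, result)` and pass `threshold()` into the pruning checks, so the threshold becomes valid as soon as the heap is full.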