Tianchi Competition: Taobao Clothes Matching (Data Preprocessing)
2015-10-23 01:16
Competition Overview
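Taobao is one of China's most popular online retail platforms, and apparel, shoes and bags account for the largest share of its market. A large number of excellent shopping-guide products for clothing, shoes and bags have grown up around Taobao. Outfit matching is a very important topic in this kind of shopping guidance, and the techniques and algorithms derived from it can be applied to almost every big-data marketing scenario, such as search, recommendation and marketing services. The Taobao Clothes Matching algorithm competition provides participants with outfit combinations created by matching experts and fashion influencers, text and image data for millions of Taobao items, and anonymized user behavior data. Participants are expected to mine outfit-matching models from this behavior, text and image data, and to give users personalized, high-quality, professional outfit recommendations.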
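Data Format
Match-set data: dim_fashion_match_sets
Field || Type || Description || Example
coll_id || bigint || Match-set ID || 1000
item_list || string || List of item IDs in the match set (semicolon-separated; each semicolon-delimited slot may hold more than one item, in which case the later items are substitutes, comma-separated) || 1002,1003,1004;439201;1569773,234303;19836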
dim_fashion_match_sets sample:
1 160870;3118604
2 1842610;2741506
3 893028;993019,1375599,1913565,3036503;2849440;2546147;2329974,2661094,347849;884801,127779,3122713;2338561
4 2612866;1272124;2181942
5 3128145;2683359;855149
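The item_list encoding can be unpacked in a few lines. The following is a minimal standalone sketch, not the competition's reference parser; the helper name `parse_item_list` is hypothetical, and the input is the example string from the item_list field description.

```python
from itertools import product


def parse_item_list(item_list):
    """Split an item_list string into slots.

    Slots are separated by ';'. Inside a slot, ',' separates
    interchangeable (substitute) item IDs.
    """
    return [[int(item_id) for item_id in slot.split(',')]
            for slot in item_list.split(';')]


example = '1002,1003,1004;439201;1569773,234303;19836'
slots = parse_item_list(example)
print(slots)
# [[1002, 1003, 1004], [439201], [1569773, 234303], [19836]]

# Picking one item from every slot yields one concrete outfit.
outfits = list(product(*slots))
print(len(outfits))  # 3 * 1 * 2 * 1 = 6
```

Each tuple in `outfits` takes exactly one item per slot, which is why the number of combinations is the product of the slot sizes.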
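Item information table: dim_items
Field || Type || Description || Example
item_id || bigint || Item ID || 439201
cat_id || bigint || ID of the category the item belongs to || 16
terms || string || Item title after word segmentation, with the token order shuffled || 5263,2541,2876263
img_data || string || Item image (note: in the preliminary round the images are provided directly as files named item_id.jpg, and this field is absent from the table)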
dim_items sample:
29 155 123950,53517,106068,59598,7503,171811,25618,147905,203432,123580,178091,154365,127004,31897,82406
49 228 73035,33202,116593,48909,92233,181255,127004,38910,182506,181709,207662,154365,103661,24893
59 284 123950,38910,22837,5026,15459,47776,158346,101881,131272,176333,196079,23211,148988,144893,167633
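Because the terms are anonymized token IDs in shuffled order, titles can only be compared as sets of IDs. The sketch below shows one simple way to do that on sample items 29 and 49. It is an illustration only, not something the original post does, and the helper name `parse_terms` is hypothetical.

```python
def parse_terms(terms):
    """Turn a comma-separated terms string into a set of integer token IDs."""
    return set(int(t) for t in terms.split(','))


# Terms of sample items 29 and 49, copied from the dim_items sample.
terms_29 = parse_terms('123950,53517,106068,59598,7503,171811,25618,147905,'
                       '203432,123580,178091,154365,127004,31897,82406')
terms_49 = parse_terms('73035,33202,116593,48909,92233,181255,127004,38910,'
                       '182506,181709,207662,154365,103661,24893')

shared = sorted(terms_29 & terms_49)
print(shared)  # [127004, 154365]

# Share of shared tokens among all distinct tokens of the two titles.
overlap = float(len(shared)) / len(terms_29 | terms_49)
print(round(overlap, 3))  # 0.074
```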
User purchase history table: user_bought_history
Field || Type || Description || Example
user_id || bigint || User ID || 62378843278
item_id || bigint || Item ID || 439201
create_at || string || Date of the action (purchase) || 20140911
user_bought_history sample:
1915871 8 20150417
4371603 8 20150418
8034236 8 20150516
6135829 8 20150405
11650079 8 20150404
5324797 23 20150509
7969649 23 20150413
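The create_at values are dates packed as YYYYMMDD. A small sketch of turning them into real date objects with the standard library, so that month, day, and weekday come for free; the helper name `parse_buy_date` is hypothetical.

```python
from datetime import datetime


def parse_buy_date(create_at):
    """Convert a YYYYMMDD value (int or str) into a datetime.date."""
    return datetime.strptime(str(create_at), '%Y%m%d').date()


d = parse_buy_date(20150417)  # first row of the sample above
print(d.year, d.month, d.day)  # 2015 4 17
print(d.isoweekday())          # 5 (Friday; Monday is 1)
```

Working with date objects rather than string slices also makes it straightforward to compute gaps between purchases by subtracting two dates.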
Items to predict: test_items
Field || Type || Description || Example
item_id || bigint || Item ID || 90832747
test_items sample:
1417
2227
3967
7237
8467
10477
10777
12547
Data Loading
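# -*- coding: utf-8 -*-
"""
Created on Sat Oct 03 13:53:48 2015

@author: Zhang_Jun
"""

import pandas as pd

# Load the four tables. The files are whitespace-delimited and have no header
# row, so the column names are passed explicitly.
# Forward slashes and a raw-string separator avoid accidental backslash
# escapes such as '\u' in '.\data\user_bought_history.txt'.
dim_fashion_matchsets = pd.read_table('./data/dim_fashion_matchsets.txt',
                                      sep=r'\s+', names=['coll_id', 'item_list'])

# img_data is absent from the preliminary-round file, so that column is
# expected to load empty (NaN).
dim_items = pd.read_table('./data/dim_items.txt',
                          sep=r'\s+', names=['item_id', 'cat_id', 'terms', 'img_data'])

user_bought_history = pd.read_table('./data/user_bought_history.txt',
                                    sep=r'\s+', names=['user_id', 'item_id', 'create_at'])

test_items = pd.read_table('./data/items.txt', names=['test_items_id'])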
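Before pointing the loader at the full files, it can help to check the parsing on a few sample rows held in memory. This is a minimal sketch under the assumption that the real files look like the samples shown earlier; it uses `pd.read_csv` on an `io.StringIO` buffer, and the variable names are illustrative.

```python
import io

import pandas as pd

# Three rows copied from the dim_fashion_match_sets sample.
sample = ("1 160870;3118604\n"
          "2 1842610;2741506\n"
          "4 2612866;1272124;2181942\n")

match_sets = pd.read_csv(io.StringIO(sample), sep=r'\s+',
                         names=['coll_id', 'item_list'])

# coll_id is parsed as an integer; item_list stays a string because of ';'.
print(match_sets.coll_id.tolist())  # [1, 2, 4]

row = match_sets.item_list[match_sets.coll_id == 4].values[0]
print(row.split(';'))  # ['2612866', '1272124', '2181942']
```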
Tool Preparation
# -*- coding: utf-8 -*-
"""
Created on Sat Oct 03 15:49:46 2015

@author: Zhang_Jun
"""

from collections import Counter
import itertools

#-----------------------------------------------------------

class item(object):
    def __init__(self, ID):
        self.id = ID
        self.match = []
        self.replacement = []
        self.title = []
        self.category = []
        self.buyer = []  # list of buyer objects
        self.buy_date = []
        self.img_data = []
        self.match_counter = []
        self.replace_counter = []
        self.also_buy_counter = []


class buyer(object):
    def __init__(self, user_id, user_bought_history, items):
        self.id = user_id
        self.items = []

    def get_buy_items(self, user_bought_history, items):
        """ return the item objects (restricted to items) bought by this user """
        item_id = get_item_id_from_user_history(user_bought_history, self.id)
        known_ids = set(obj.id for obj in items)
        return [get_item(items, i) for i in item_id if i in known_ids]

#-----------------------------------------------------------

def get_matchset(dim_fashion_matchsets, coll_id):  # coll_id: match-set ID
    """ return the match set of coll_id """
    return dim_fashion_matchsets.item_list[dim_fashion_matchsets.coll_id
                                           == coll_id].values[0].split(';')


def get_replace_matchset(dim_fashion_matchsets, coll_id):
    """ return the match set of coll_id, with each slot split into its replaceable items """
    return [content.split(',') for content in get_matchset(dim_fashion_matchsets, coll_id)]


def get_match_list(dim_fashion_matchsets, coll_id):
    """ return all the matched combinations of coll_id """
    matchset_combine = get_replace_matchset(dim_fashion_matchsets, coll_id)
    return list(itertools.product(*matchset_combine))


def get_category(dim_items, item_id):
    """ return the category ID of this item_id (cat_id) """
    return dim_items.cat_id[dim_items.item_id == item_id].values[0]


def get_term_title(dim_items, item_id):
    """ return the title terms of this item_id """
    return dim_items.terms[dim_items.item_id == item_id].values[0].split(',')


def get_term_img_data(dim_items, item_id):
    """ return image data """
    return dim_items.img_data[dim_items.item_id == item_id].values


def get_user_id(user_bought_history, item_id):
    """ return who bought this item """
    return list(user_bought_history.user_id[user_bought_history.item_id == item_id].values)


def get_buy_date(user_bought_history, item_id):
    """ return the dates on which this item was bought """
    return list(user_bought_history.create_at[user_bought_history.item_id == item_id].values)


def get_detail_buy_date(buy_date_list):
    """ get the year, month, day of each purchase (as strings) """
    year, month, day = [], [], []
    for buy_date in buy_date_list:
        date = str(buy_date)
        year.append(date[:4])
        month.append(date[4:6])
        day.append(date[6:])
    return year, month, day


def get_item(items, item_id):
    """ use item_id to get item in the set of items (obj set) """
    return [obj for obj in items if obj.id == item_id][0]


def add_replacement_to_item(items, dim_fashion_matchsets):
    """ add replacement items to each item in the set of items (obj set) """
    item_ids = set(obj.id for obj in items)
    for coll_id in dim_fashion_matchsets.coll_id:
        for group in get_replace_matchset(dim_fashion_matchsets, coll_id):
            for raw_id in group:
                if int(raw_id) in item_ids:
                    obj = get_item(items, int(raw_id))
                    obj.replacement += group
                    # a slot holding only the item itself has no substitutes
                    if len(set(obj.replacement)) == 1:
                        obj.replacement = []


def add_replacement_counter_to_item(items):
    """ count the frequency of replacement items """
    for obj in items:
        obj.replace_counter = Counter(obj.replacement)


def add_match_to_item(items, dim_fashion_matchsets):
    """ add matched items to each item in the set of items (obj set) """
    item_ids = set(obj.id for obj in items)
    for coll_id in dim_fashion_matchsets.coll_id:
        for combo in get_match_list(dim_fashion_matchsets, coll_id):
            for raw_id in combo:
                if int(raw_id) in item_ids:
                    get_item(items, int(raw_id)).match += combo


def add_match_counter_to_item(items):
    """ count the frequency of matched items, most frequent first """
    for obj in items:
        obj.match_counter = sorted(Counter(obj.match).items(),
                                   key=lambda d: d[1], reverse=True)


def get_item_id_from_user_history(user_bought_history, user_id):
    """ return item_id based on user_id """
    return list(user_bought_history.item_id[user_bought_history.user_id == user_id].values)


def add_buyer_to_items(items, user_bought_history):
    """ add buyer objects [id / items] to each item in the set of items (obj set) """
    # `x in series` tests the index labels, not the values, so compare
    # against the values explicitly.
    bought_ids = set(user_bought_history.item_id.values)
    for obj in items:
        if obj.id in bought_ids:
            buyer_id = get_user_id(user_bought_history, obj.id)
            obj.buyer = [buyer(i, user_bought_history, items) for i in buyer_id]


def get_also_buy_item_id(user_bought_history, items, item_id):
    """ count the items also bought by the buyers of the item whose id is item_id """
    also_buy = []
    for b in get_item(items, item_id).buyer:
        for obj in b.get_buy_items(user_bought_history, items):
            also_buy.append(obj.id)
    return Counter(also_buy)


def add_also_buy_counter_to_items(user_bought_history, items):
    """ count the frequency of also-bought item ids for every item """
    for item_id in [obj.id for obj in items]:
        get_item(items, item_id).also_buy_counter = get_also_buy_item_id(
            user_bought_history, items, item_id)
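`get_item` scans the whole list on every call, which gets slow when it is called inside nested loops over all match sets. One possible alternative, sketched here under stated assumptions rather than taken from the original post, is to build a dict keyed by item ID once and look items up in constant time. The `Item` class below is a reduced stand-in for the `item` class, and `build_index` is a hypothetical helper.

```python
class Item(object):
    """Reduced stand-in for the `item` class: only the fields used here."""

    def __init__(self, item_id):
        self.id = item_id
        self.match = []


def build_index(items):
    """Map item ID -> item object for constant-time lookups."""
    return {obj.id: obj for obj in items}


items = [Item(1417), Item(2227), Item(3967)]
index = build_index(items)

# IDs parsed from item_list are strings; 9999 is not in `items`.
outfit = ('1417', '9999', '3967')
for raw_id in outfit:
    obj = index.get(int(raw_id))  # None when the ID is not a tracked item
    if obj is not None:
        obj.match += outfit

print(index[1417].match)  # ['1417', '9999', '3967']
print(index[2227].match)  # []
```

The dict replaces both the membership test and the lookup, so each ID costs one hash lookup instead of two passes over `items`.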
Test
# -*- coding: utf-8 -*-
"""
Created on Sun Oct 04 23:26:52 2015

@author: Zhang_Jun
"""

from tools_preparation import *

# This script uses the data frames created in the Data Loading step
# (test_items, dim_items, user_bought_history, dim_fashion_matchsets);
# they must already be defined in the session.
items = []
for i in list(test_items.test_items_id):
    obj = item(i)
    obj.title = get_term_title(dim_items, i)
    obj.category = get_category(dim_items, i)
    obj.buy_date = get_buy_date(user_bought_history, i)
    items.append(obj)

add_replacement_to_item(items, dim_fashion_matchsets)
add_match_to_item(items, dim_fashion_matchsets)
add_buyer_to_items(items, user_bought_history)
add_match_counter_to_item(items)
add_also_buy_counter_to_items(user_bought_history, items)
add_replacement_counter_to_item(items)
For example:
In [161]: get_match_list(dim_fashion_matchsets,11)
Out[161]:
[('1463018', '1596334', '2226122'),
 ('1463018', '1596334', '284814'),
 ('1463018', '1596334', '36278'),
 ('1463018', '1596334', '480281'),
 ('1463018', '1704853', '2226122'),
 ('1463018', '1704853', '284814'),
 ('1463018', '1704853', '36278'),
 ('1463018', '1704853', '480281'),
 ('230955', '1596334', '2226122'),
 ('230955', '1596334', '284814'),
 ('230955', '1596334', '36278'),
 ('230955', '1596334', '480281'),
 ('230955', '1704853', '2226122'),
 ('230955', '1704853', '284814'),
 ('230955', '1704853', '36278'),
 ('230955', '1704853', '480281')]

In [162]: get_buy_date(user_bought_history,33547)
Out[162]:
[20150531,
 20150525,
 20150506,
 20150527,
 20150528,
 20150523,
 20150526,
 20150609,
 20150428,
 20150510,
 20150608,
 20150523,
 ...]

In [160]: get_also_buy_item_id(user_bought_history,items,33547)
Out[160]: Counter({33547: 81, 40867: 1})

In [159]: [a.also_buy_counter for a in items]
Out[159]:
[Counter(),
 Counter(),
 Counter(),
 Counter({33517: 97}),
 Counter({33547: 81, 40867: 1}),
 Counter(),
 Counter({39667: 32}),
 Counter(),
 Counter({33547: 1, 40867: 139}),
 Counter(),
 Counter({42217: 501, 51367: 1}),
 Counter(),
 Counter({45517: 1}),
 Counter({45817: 85}),
 Counter({50377: 108}),
 Counter(),
 Counter({42217: 1, 51367: 165}),
 Counter(),
 Counter({55117: 15}),
 Counter()
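In Out[160] the queried item 33547 is itself the most frequent key, because every buyer's purchase list contains the item they were selected by. When the counter is meant to answer "what else did these buyers buy", the item's own entry has to be dropped first. A minimal sketch, with `top_also_bought` as a hypothetical helper name:

```python
from collections import Counter


def top_also_bought(also_buy_counter, item_id, n=5):
    """Return the n most frequent co-purchased items, excluding item_id itself."""
    others = Counter({k: v for k, v in also_buy_counter.items() if k != item_id})
    return others.most_common(n)


counter = Counter({33547: 81, 40867: 1})  # values taken from Out[160]
print(top_also_bought(counter, 33547))  # [(40867, 1)]
```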