您的位置:首页 > 其它

pig强制转换(字符到整数):首位0怎么处理,‘01’到1的转化,

2014-12-09 15:13 281 查看
pig支持的类型转换(cast)Pig Latin supports casts as shown in this table.
from / tobagtuplemapintlongfloatdoublechararraybytearrayboolean
bagerrorerrorerrorerrorerrorerrorerrorerrorerror
tupleerrorerrorerrorerrorerrorerrorerrorerrorerror
maperrorerrorerrorerrorerrorerrorerrorerrorerror
interrorerrorerroryesyesyesyeserrorerror
longerrorerrorerroryesyesyesyeserrorerror
floaterrorerrorerroryesyesyesyeserrorerror
doubleerrorerrorerroryesyesyesyeserrorerror
chararrayerrorerrorerroryesyesyesyeserroryes
bytearrayyesyesyesyesyesyesyesyesyes
booleanerrorerrorerrorerrorerrorerrorerroryeserror
表中,将字符类型转化做int是可以的。那么001 转化为int型后是1么?测试如下数据文件test_file.txt内容为:1011,0120111,1000011,0100001,0101010,001读入test_file.txt,将第一列的chararray类型数据,提取前三个字段,强制转换为int类型。而第二列直接按int类型读入,看看首位0怎么处理pig 代码:%default testFile /user/wizad/test/lmj/test_file.txttest_data = LOAD '$testFile' USING PigStorage(',')AS(str1:chararray,number:int);dump test_data;my_result = foreach test_data generate (int)SUBSTRING(str1,0,3);dump my_result;describe my_result;--myts = sample g_log 0.0001;--myts = limit g_log 10;--dump myts;--STORE myts INTO '/user/wizad/tmp/my' USING PigStorage(',');运行结果dump test_data:(1011,12)(0111,100)(0011,10)(0001,10)(1010,1)dump my_result:(101)(11)(1)(0)(101)可以看出首位0处理没有任何问题。
顺便一提:
debug或检查数据时,能用store,不用dump。要用dump就只用dump。我dump前,都先limit 10,只dump 10条数据
因为,dump会让某些multi-query execution失效。看起来像降低运行数据。举例子:两个脚本一个执行 A > B >DUMP而另一个执行A > B > C > STORE第一个脚本:A = LOAD 'input'AS (x, y, z);B = FILTER A BY x> 5;DUMP B;C = FOREACH BGENERATE y, z;STORE C INTO'output';store脚本:生成output1和output2两个文件,执行A > B > C > STOREA = LOAD 'input'AS (x, y, z);B = FILTER A BY x> 5;STORE B INTO'output1';C = FOREACH BGENERATE y, z;STORE C INTO'output2';        
我工作中,需要比较两个日志的ip,time,os,来识别是否是相同用户。而time需要判定5分钟内相同用户。所以我做了一个小处理,将是时间从分钟切分:time_extract = foreach cookie_data generate cookie_id,ip,SUBSTRING(time,0,13) as time,SUBSTRING(time,14,16) as minute1,os_id,os_version;log_data = foreach click_log generate log_type,guid,ip,SUBSTRING(create_time,0,13) as time,SUBSTRING(create_time,14,16) as minute2,os_id,os_version;然后,将两个relation按 time进行join,就是比较到小时的:join_data = join time_extract by (ip,time,os_id), log_data by (ip,time,os_id);在相同小时的记录中,找5分钟内的:minute_compare = foreach join_data generate log_type,cookie_id,guid,(int)minute1,(int)minute2,time_extract::os_version,log_data::os_version;same_users = filter minute_compare by (ABS(minute1-minute2) <= 5);绝对值小于5的。完成代码如下:SET job.name 'mapping_from_mobile_to_pc';SET job.priority HIGH;REGISTER piggybank.jar;DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();%default cleanedLog /user/wizad/data/wizad/cleaned/2014-11-{0[3-9],1[0-8]}/*/part*%default cookieLog /user/wizad/tmp/ip_cookie.txt%default output_path1 /user/wizad/tmp/mapping_cookie%default output_path2 /user/wizad/tmp/edition_comparecookie_data = LOAD '$cookieLog' USING PigStorage(',')AS(cookie_id:chararray,ip:chararray,time:chararray,os_id:chararray,os_version:chararray);--test = limit cookie_data 100;--dump test;time_extract = foreach cookie_data generate cookie_id,ip,SUBSTRING(time,0,13) as time,SUBSTRING(time,14,16) as minute1,os_id,os_version;describe time_extract;origin_cleaned_data = LOAD '$cleanedLog' USING SequenceFileLoaderAS (ad_network_id:chararray,wizad_ad_id:chararray,guid:chararray,id:chararray,create_time:chararray,action_time:chararray,log_type:chararray,ad_id:chararray,positioning_method:chararray,location_accuracy:chararray,lat:chararray,lon:chararray,cell_id:chararray,lac:chararray,mcc:chararray,mnc:chararray,ip:chararray,connection_type:chararray,imei:chararray,android_id:chararray,android_advertising_id:chararray,udid:chararray,openudid:chararray,idfa:chararray,mac_address:chararray,uid:chararray,density:chararray,screen_height:chararray,screen_width:chararray,user_agent:chararray,app_id:chararray,app_category_id:chararray,device_model_id:chararray,carrier_id:chararray,os_id:chararray,device_type:chararray,os_version:chararray,country_region_id:chararray,province_region_id:chararray,city_region_id:chararray,ip_lat:chararray,ip_lon:chararray,quadkey:chararray);click_log = filter origin_cleaned_data by log_type=='2';log_data = foreach click_log generate log_type,guid,ip,SUBSTRING(create_time,0,13) as time,SUBSTRING(create_time,14,16) as minute2,os_id,os_version,device_type;--join_data = join time_extract by ip, log_data by ip;--join_data = join time_extract by (ip,time), log_data by (ip,time);join_data = join time_extract by (ip,time,os_id), log_data by (ip,time,os_id);minute_compare = foreach join_data generate log_type,cookie_id,guid,(int)minute1,(int)minute2,time_extract::os_version,log_data::os_version;same_users = filter minute_compare by (ABS(minute1-minute2) <= 5);--dump same_users;describe same_users;mapping_cookie_id = foreach same_users generate cookie_id,guid;uniq_cookie_guid = distinct mapping_cookie_id;store uniq_cookie_guid INTO '$output_path1' USING PigStorage(',');os_edition = foreach same_users generate cookie_id,guid,SUBSTRING(time_extract::os_version,0,2) as os_ut,SUBSTRING(log_data::os_version,0,2) as os_mdm;same_os_edition = filter os_edition by (os_ut == os_mdm) or (os_ut == 'x')or (os_mdm == 'x');dump same_os_edition;cookie_guid_with_edition = foreach same_os_edition generate cookie_id,guid;uniq_c_g_editions = distinct cookie_guid_with_edition;store uniq_c_g_editions INTO '$output_path2' USING PigStorage(',');
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: 
相关文章推荐