I dug a pit two weeks ago and have now fallen into it (a Riak MapReduce analysis); as a side effect, pagination is now possible

langzhe

Blog category: riak

I have a bucket like this:

id  followed_id  followr_id
1   lxw          jason
2   jason        lxw
3   langxw       jason
4   jason        langxw

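I need to produce the data set shown in [1] (a follower count per user; the counts could of course be precomputed rather than queried every time). When I first needed this, I found a ready-made interface in the code and simply called it, even though I suspected it had problems and might cause trouble later. Sure enough, the query now always times out. The lesson: never cut corners, because you only end up spending more time making up for it.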

[1]

{name, count}
{lxw, 1}
{jason, 2}
{langxw, 1}

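So I decided to write the query myself. The first idea was to compute it with MapReduce.

[2] The map function is written as follows; its only job is to emit a list of the form [{FollowedId, 1}].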

map(Record, undefined, {sub_rank}) ->
    ?DEBUG("~p:map sub_rank ~p ~n", [?MODULE, ?LINE]),
    case riak_kv_util:is_x_deleted(Record) of
        false ->
            {struct, List} = mochijson:decode(riak_object:get_value(Record)),
            FollowedId = get_value(List, "followed_id"), 
            [{FollowedId, 1}];
        _ ->[]
    end;

Next I wrote this reduce.

reduce[1]: sum the values that share the same FollowedId

reduce(Records, {sub_rank}) ->
    %% Fold over the {FollowedId, Count} pairs, keeping one entry per FollowedId.
    FSum = fun({FollowedId, Count}, Acc) ->
                   Value = proplists:get_value(FollowedId, Acc, 0),
                   [{FollowedId, Value+Count}|proplists:delete(FollowedId, Acc)]
           end,
    lists:foldr(FSum, [], Records);

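I ran it once against 3,000 objects and it worked without problems.

The query also has to return the top 50 entries ordered by count,

so a second reduce, reduce[2], is needed.

reduce[2]: take the top 50 records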

reduce(Records,  {sub_rank, Max}) when is_integer(Max) ->
    ?DEBUG("~p:reduce sub_rank ~p Max=~p, Records=~p~n", [?MODULE, ?LINE, Max, Records]),
    lists:sublist(lists:reverse(lists:keysort(2,Records)), Max);

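I thought that was the end of it, but with 200,000 objects this MapReduce job still returned no data and kept reporting a timeout, even with the timeout set to 10 or 15 minutes.

Query code: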

Query=[{map,{modfun,trend_riak,map},{sub_rank},false},   
       {reduce,{modfun,trend_riak,reduce},{sub_rank},false},   
       {reduce,{modfun,trend_riak,reduce},{sub_rank, 50},true}],

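1. Finding out why it is slow

Run only the map phase with riak_pb_socket:mapred/3, without any reduce:

Query=[{map,{modfun,trend_riak,map},{sub_rank},false}],

The data came back in about 4 s.

That puzzled me: why is the query faster when it returns more data, and slower when reduce makes it return less?

So I had to read the reduce code carefully.

The problem turned out to be the proplists operations: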

Value = proplists:get_value(FollowedId, Acc, 0),
[{FollowedId, Value+Count}|proplists:delete(FollowedId, Acc)]

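A quick test confirmed that these two operations are indeed expensive: each one scans the accumulator list linearly, and they run once per record.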
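One way to reproduce this kind of measurement outside Riak is to time the fold on synthetic data. The sketch below is only an illustration: the module name `proplist_cost`, the function names, and the synthetic records (N records with N distinct integer keys, which is the worst case for the list scans) are assumptions, not the data used in this post.

```erlang
-module(proplist_cost).
-export([sum/1, time/1]).

%% The same fold as the reduce above: every step scans Acc twice
%% (proplists:get_value/3, then proplists:delete/2).
sum(Records) ->
    FSum = fun({Key, Count}, Acc) ->
                   Value = proplists:get_value(Key, Acc, 0),
                   [{Key, Value + Count} | proplists:delete(Key, Acc)]
           end,
    lists:foldr(FSum, [], Records).

%% Time sum/1 on N synthetic records with N distinct keys (hypothetical data).
%% Returns the elapsed time in microseconds.
time(N) ->
    Records = [{I, 1} || I <- lists:seq(1, N)],
    {Micros, _Result} = timer:tc(fun() -> sum(Records) end),
    Micros.
```

Comparing `proplist_cost:time(2000)` with `proplist_cost:time(4000)` in the shell should show whether the cost grows faster than linearly; the actual numbers depend on the machine and the key distribution.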

That is when I thought of using dict instead.

Before dict, the implementation was:

Computed in the shell, 200,000 records took 141 s:
    FSum = fun({FollowedId, Count}, Acc) ->
                Value = proplists:get_value(FollowedId, Acc, 0),
                [{FollowedId, Value+Count}|proplists:delete(FollowedId, Acc)]
           end,
    lists:foldr(FSum, [], Records);

My first dict version:

Computed in the shell, 200,000 records took 12 s:
    FSum = fun({FollowedId, Count}, Acc) ->
                case  dict:is_key(FollowedId, Acc) of
                    true ->
                        Value = dict:fetch(FollowedId, Acc),
                        dict:store(FollowedId, Count+Value, Acc);
                    false ->
                        dict:store(FollowedId, Count, Acc)
                end
            end,
    lists:foldr(FSum, dict:new(), Records);

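Drawback: calling dict:fetch/2 directly raises an exception when the key does not exist, which is why I normally avoid dict; I am too lazy to call dict:is_key/2 every time. This version checks once, fetches once, and stores once, so each record costs three dict operations when the key exists and two when it does not.

To cut down on the operations I switched to dict:update_counter/3.

Computed in the shell, 200,000 records took 10 s: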
    FSum2 = fun({FollowedId, Count}, Acc) ->
                case  dict:is_key(FollowedId, Acc) of
                    true ->
                        dict:update_counter(FollowedId, Count, Acc);
                    false ->
                        dict:store(FollowedId, Count, Acc)
                end
            end,
    lists:foldr(FSum2, dict:new(), Records);

It can be simpler still, without the check:

Computed in the shell, 200,000 records took 3 s:
    FSum3 = fun({FollowedId, Count}, Acc) ->
                dict:update_counter(FollowedId, Count, Acc)
            end,
    lists:foldr(FSum3, dict:new(), Records);

The documentation for dict:update_counter/3 says: "Add Increment to the value associated with Key and store this value. If Key is not present in the dictionary then Increment will be stored as the first value."
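The behaviour for a missing key can be checked in isolation with the sample bucket from the top of this post. This is a minimal sketch; the module name `counter_demo` and the function `count/1` are made up for the illustration.

```erlang
-module(counter_demo).
-export([count/1]).

%% Sum the counts per key. dict:update_counter/3 stores Count as the
%% first value when Key is not yet in the dict, so no is_key check is needed.
count(Pairs) ->
    Dict = lists:foldl(fun({Key, Count}, Acc) ->
                               dict:update_counter(Key, Count, Acc)
                       end, dict:new(), Pairs),
    %% Sort only to get a stable order for display.
    lists:sort(dict:to_list(Dict)).
```

Calling `counter_demo:count([{"lxw",1},{"jason",1},{"langxw",1},{"jason",1}])` returns `[{"jason",2},{"langxw",1},{"lxw",1}]`, which matches the counts in [1].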

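This already saves a lot of time: from 141 s down to 3 s.

But after putting this code into the reduce function, the query was still slow. The good news was that it finally returned data, but it took 101 s in total, clearly short of what I expected.

I analysed the reduce code in light of how map-reduce works. My guess is that a lot of data that has already been reduced gets reduced again; how many times, I cannot tell at the moment. So I changed reduce to return a single result and added a clause to recognise it: data that has already been reduced is no longer traversed element by element, yet it is still merged into the totals.

Implementation:

Return a one-element list, [dict()]: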
reduce(Records,  {sub_rank}) ->
    FSum3 = fun({FollowedId, Count}, Acc) ->
                   dict:update_counter(FollowedId, Count, Acc);
               (Dict, Acc) ->
                   dict:merge(fun(_K, V, V1) ->V+V1 end, Dict, Acc)
            end,
    Return = lists:foldr(FSum3, dict:new(), Records),
    [Return];
The second reduce has to be rewritten accordingly:
% take the top Max entries
reduce([Records],  {sub_rank, Max}) when is_integer(Max) ->
    lists:sublist(lists:reverse(lists:keysort(2,dict:to_list(Records))), Max);
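A standalone check of the idea: if earlier reduce output really is fed back in together with new map results, then reducing in two batches must give the same totals as a single pass. The sketch below does not talk to Riak; the module name `rereduce_demo` and the helper `to_sorted/1` are assumptions made for this illustration.

```erlang
-module(rereduce_demo).
-export([reduce/1, to_sorted/1]).

%% Accepts a mix of raw {Key, Count} pairs and dicts produced by an
%% earlier call, and always returns a one-element list [Dict].
reduce(Inputs) ->
    Merge = fun({Key, Count}, Acc) ->
                    %% Raw map output: add it to the running totals.
                    dict:update_counter(Key, Count, Acc);
               (Dict, Acc) ->
                    %% Already-reduced data: merge whole dicts, summing shared keys.
                    dict:merge(fun(_K, V1, V2) -> V1 + V2 end, Dict, Acc)
            end,
    [lists:foldl(Merge, dict:new(), Inputs)].

%% Turn the one-element result into a sorted list for comparison.
to_sorted([Dict]) ->
    lists:sort(dict:to_list(Dict)).
```

With `A = [{"lxw",1},{"jason",1}]` and `B = [{"langxw",1},{"jason",1}]`, both `reduce(A ++ B)` and `reduce(reduce(A) ++ B)` give the same totals once passed through `to_sorted/1`.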

Now it finally works.

Query code (takes 12 s to run):

Query=[{map,{modfun,trend_riak,map},{sub_rank},false},   
       {reduce,{modfun,trend_riak,reduce},{sub_rank},false},   
       {reduce,{modfun,trend_riak,reduce},{sub_rank, 50},true}],

Building on this final approach, the second reduce can also be used to implement pagination.
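A possible shape for that pagination step, as a sketch only: instead of taking the first Max entries, take a window out of the sorted list. The module name `page_demo`, the function `page/3`, and the parameters PageNo and PageSize are hypothetical; in the real query they would have to be passed in through the reduce phase argument.

```erlang
-module(page_demo).
-export([page/3]).

%% Return page PageNo (1-based) of PageSize entries, highest count first.
%% Dict is the merged dict produced by the first reduce.
page(Dict, PageNo, PageSize)
  when is_integer(PageNo), PageNo >= 1, is_integer(PageSize), PageSize >= 1 ->
    Sorted = lists:reverse(lists:keysort(2, dict:to_list(Dict))),
    Start = (PageNo - 1) * PageSize + 1,
    case Start > length(Sorted) of
        true  -> [];                                   % past the last page
        false -> lists:sublist(Sorted, Start, PageSize)
    end.
```

Note that this sorts the whole result for every page, so for large result sets it may be worth caching the sorted list rather than recomputing it.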

2013-08-08 18:36