<![CDATA[关于group by]]>

本文探讨了在搜索引擎中使用Lucene实现GroupBy操作的方法,包括如何通过Hash表进行数据聚合,以及此方法可能带来的I/O负担和内存消耗等问题。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

周五因为在搜索引擎中要使用group by操作,原有的 count发似乎会出现很大的效率问题,所以也简单考虑了下groupby的实现,在sql里我猜想(我对这个算法完全没有概念),首先利用矩阵算法将所有的row排序好,然后以此顺序取出,然后再加入聚合函数count或者sum,也就是说在sql中实际上他是先把按groupby字段索引排序然后再进行数据聚合,最后再拉出需要的数据(这个猜想是基于group by字段需要索引得到的,不知道是否正确)
而在lucene中也是类似的状况,现在我们考虑lucene 大家知道在lucene中得到的结果集是hits,hits实际上和sql中的结果集recordset是一个概念的,即使你得到了一个数百万的结果集,数据也并没有立即从中拉出(这应该就是为什么表需要有主键的原因)。但lucene中有个问题,我无法得到一个多结果集,也就是说我无法得到一个按groupby排序的某个中间结果,我得到的只是输出,因此在group中我只能选择先将所有数据全部读出这条路,那么读出来的数据怎样groupby?我实现的是基于hash表的 数组遍历。是这样的过程:
hash<字段名,groupby结果集>  
groupby结果集<字段值,聚合值>
用这样的hash嵌套实现,算法中我只使用了一次hash循环,但比较大的伤害是,数据集似乎被从头到尾取了一次,会产生很多io,并且由于我的group和查询是一种并行算法,因此等于结果集被取出了两遍(目标中的group是类似taobao的产品列表的类别归类查询页面,所以group是一个独立结果集并非输出结果集),目前测试下来还没有感觉内存消耗有极其大的影响,但不知道在高负荷状态下的水平,因此还需要继续查看。
关于group的算法,如果有更好的办法和算法或者在lucene下有更好的解决方案欢请路过不要吝啬你的idea告诉我吧。


Tags - lucene , groupby , 算法
select wo.id, wo.work_order_no, wo.work_order_stauts, wo.curr_approver_code, wo.curr_approver_name, wo.create_time, wo.city_code, wo.city_name, wo.applicant_code, wo.applicant_name, wo.district_code, wo.district_name, wo.approver_desc, wo.annex_info, wo.curr_work_id, wo.work_id from t_process_work_order wo left join t_process_claim_user u on u.city_code = wo.city_code and wo.district_code = u.district_code and u.person_type = wo.curr_approver_code <where> <if test="id != null"> and wo.id = #{id} </if> <if test="workOrderNo != null and workOrderNo != ''"> and wo.work_order_no = #{workOrderNo} </if> <if test="status != null and status == 1"> and wo.work_order_stauts = 1 </if> <if test="status != null and status == 2"> and wo.work_order_stauts in (2,3) </if> <if test="workOrderStauts != null"> and wo.work_order_stauts = #{workOrderStauts} </if> <if test="currApproverCode != null and currApproverCode != ''"> and wo.curr_approver_code = #{currApproverCode} </if> <if test="currApproverName != null and currApproverName != ''"> and wo.curr_approver_name like concat('%',#{currApproverName},'%') </if> <if test="endTime != null"> and wo.create_time <![CDATA[<=]]> #{endTime} </if> <if test="startTime != null"> and wo.create_time >= #{startTime} </if> <if test='areaLevel == "2"'> and wo.city_code = #{areaCode} </if> <if test='areaLevel == "3"'> and wo.district_code = #{areaCode} </if> <if test="cityCode != null and cityCode != ''"> and wo.city_code = #{cityCode} </if> <if test="cityName != null and cityName != ''"> and wo.city_name = #{cityName} </if> <if test="applicantCode != null and applicantCode != ''"> and wo.applicant_code = #{applicantCode} </if> <if test="applicantName != null and applicantName != ''"> and wo.applicant_name concat like concat('%',#{applicantName},'%') </if> <if test="districtCode != null and districtCode != ''"> and wo.district_code = #{districtCode} </if> <if test="districtName != null and districtName != ''"> and wo.district_name like concat('%', #{districtName},'%') </if> <if test="approverDesc != null and approverDesc != ''"> and wo.approver_desc = #{approverDesc} </if> <if test="annexInfo != null and annexInfo != ''"> and wo.annex_info = #{annexInfo} </if> <if test="currWorkId != null"> and wo.curr_work_id = #{currWorkId} </if> <if test="workId != null"> and wo.work_id = #{workId} </if> </where>帮我改一下,,现在会对应多个用户数据,但是我现在不需要展示多条,因为我只要用户的名字,把用户的名字通过逗号拼接成为 curr_approver_name
07-22
Select <foreach collection="mapList" item="map" separator="," close=","> <foreach collection="map.entrySet()" item="item" index="key" separator=","> max( round( accuracy,4)) Filter (Where bucket =#{item} ) As "${key}" </foreach> </foreach> case when sum(system_lock_qty) = 0 then 0 when 1.0 - (sum(system_dif_qty_auto) / sum(system_lock_qty)) <![CDATA[<]]> 0 then 0 <!--小于0的时候也显示为0--> else round(1.0 - (sum(system_dif_qty_auto) / sum(system_lock_qty)),4) end as TOTAL, prod_series_en_name From ( Select prod_series_en_name, bucket, sum(system_lock_qty) system_lock_qty, sum(system_dif_qty_auto) system_dif_qty_auto, case when sum(system_lock_qty) = 0 then 0 when 1.0 - (sum(system_dif_qty_auto) /sum(system_lock_qty)) <![CDATA[< ]]> 0 then 0 else 1.0 - (sum(system_dif_qty_auto) / sum(system_lock_qty)) end as accuracy From ( Select prod_series_en_name, system_lock_qty, work_order_qty, abs(system_lock_qty-work_order_qty) system_dif_qty_auto, <choose> <when test="condition.dimensionStr eq 1"> (select to_char(week,'yyyy-mm-dd')) as bucket <!--周--> </when> <when test="condition.dimensionStr eq 2"> (select to_char(week,'yyyy-mm')) bucket <!--月--> </when> <when test="condition.dimensionStr eq 3"> ( select to_char(week,'yyyy')||'_Q'||(select quarters from ( select extract (quarter from week) quarters)t )) as bucket <!--季度--> </when> <when test="condition.dimensionStr eq 4"> (select to_char(week,'yyyy')) bucket <!--年--> </when> </choose> <!--维度处理地方--> From t_base_weekly_plan_accuracy Where 1 = 1 <!--过滤地方--> <if test="condition.startweekTimeStr != null and condition.startweekTimeStr != ''"> AND week >= TO_TIMESTAMP(#{condition.startweekTimeStr,jdbcType=VARCHAR},'yyyy-mm-dd') </if> <if test="condition.endweekTimeStr != null and condition.endweekTimeStr != ''"> AND week <= TO_TIMESTAMP(#{condition.endweekTimeStr,jdbcType=VARCHAR},'yyyy-mm-dd') </if> <if test="condition.produceTypeList != null and condition.produceTypeList.size > 0"> AND prod_item_type IN <foreach collection="condition.produceTypeList" item="produceType" separator="," index="index" open="(" close=")"> #{produceType,jdbcType=VARCHAR} </foreach> </if> <if test="condition.prodLineCodeList != null and condition.prodLineCodeList.size > 0"> AND prod_line_cn_name IN <foreach collection="condition.prodLineCodeList" item="prodLineCode" separator="," index="index" open="(" close=")"> #{prodLineCode,jdbcType=VARCHAR} </foreach> </if> <if test="condition.prodSeriesCodeList != null and condition.prodSeriesCodeList.size > 0"> AND prod_series_en_name IN <foreach collection="condition.prodSeriesCodeList" item="prodSeriesCode" separator="," index="index" open="(" close=")"> #{prodSeriesCode,jdbcType=VARCHAR} </foreach> </if> ) b Group By prod_series_en_name, bucket) c Group By prod_series_en_name </select>转化成sql语句
最新发布
08-07
<scenario name="Basic Sipstone UAC"> <!-- In client mode (sipp placing calls), the Call-ID MUST be --> <!-- generated by sipp. To do so, use [call_id] keyword. --> <!-- Content-Type: application/sdp --> <send retrans="500"> <![CDATA[ INVITE sip:[service]@[remote_ip]:[remote_port] SIP/2.0 Via: SIP/2.0/[transport] [local_ip]:[local_port];branch=[branch] From: sipp <sip:sipp@[local_ip]:[local_port]>;tag=[pid]SIPpTag00[call_number] To: sut <sip:[service]@[remote_ip]:[remote_port]> Call-ID: [call_id] CSeq: 1 INVITE Contact: sip:sipp@[local_ip]:[local_port] Max-Forwards: 70 Subject: Performance Test Content-Type: application/sdp Content-Length: [len] v=0 o=user1 53655765 2353687637 IN IP[local_ip_type] [local_ip] s=- c=IN IP[media_ip_type] [media_ip] t=0 0 m=audio [media_port] RTP/AVP 0 a=rtpmap:0 PCMU/8000 ]]> </send> <recv response="100" optional="true"> </recv> <recv response="180" optional="true"> </recv> <recv response="183" optional="true"> </recv> <!-- By adding rrs="true" (Record Route Sets), the route sets --> <!-- are saved and used for following messages sent. Useful to test --> <!-- against stateful SIP proxies/B2BUAs. --> <recv response="200"> <action> <ereg regexp="30061" search_in="hdr" header="To:" check_it="true" assign_to="6"/> </action> </recv> <nop test="6" next="success" /> <nop next="fail"/> <!-- Packet lost can be simulated in any send/recv message by --> <!-- by adding the 'lost = "10"'. Value can be [1-100] percent. --> <label id="success"/> <send> <![CDATA[ ACK sip:[service]@[remote_ip]:[remote_port] SIP/2.0 Via: SIP/2.0/[transport] [local_ip]:[local_port];branch=[branch] From: sipp <sip:sipp@[local_ip]:[local_port]>;tag=[pid]SIPpTag00[call_number] To: sut <sip:[service]@[remote_ip]:[remote_port]>[peer_tag_param] Call-ID: [call_id] CSeq: 1 ACK Contact: sip:sipp@[local_ip]:[local_port] Max-Forwards: 70 Subject: Performance Test Content-Length: 0 ]]> </send> <!-- This delay can be customized by the -d command-line option --> <!-- or by adding a 'milliseconds = "value"' option here. --> <pause/> <!-- The 'crlf' option inserts a blank line in the statistics report. --> <!-- 延迟挂断 --> <pause milliseconds="5000"/> <send retrans="500"> <![CDATA[ BYE sip:[service]@[remote_ip]:[remote_port] SIP/2.0 Via: SIP/2.0/[transport] [local_ip]:[local_port];branch=[branch] From: sipp <sip:sipp@[local_ip]:[local_port]>;tag=[pid]SIPpTag00[call_number] To: sut <sip:[service]@[remote_ip]:[remote_port]>[peer_tag_param] Call-ID: [call_id] CSeq: 2 BYE Contact: sip:sipp@[local_ip]:[local_port] Max-Forwards: 70 Subject: Performance Test Content-Length: 0 ]]> </send> <recv response="200" crlf="true"> </recv> <nop next="end"/> <label id="fail"/> <pause milliseconds="3000"/> <send retrans="500"> <![CDATA[ BYE sip:[service]@[remote_ip]:[remote_port] SIP/2.0 Via: SIP/2.0/[transport] [local_ip]:[local_port];branch=[branch] From: sipp <sip:sipp@[local_ip]:[local_port]>;tag=[pid]SIPpTag00[call_number] To: sut <sip:[service]@[remote_ip]:[remote_port]>[peer_tag_param] Call-ID: [call_id] CSeq: 2 BYE Contact: sip:sipp@[local_ip]:[local_port] Max-Forwards: 70 Subject: Performance Test Content-Length: 0 ]]> </send> <recv response="200" crlf="true"> </recv> <nop next="end"/> <label id="end"/> <!-- 统计分析 --> <!-- definition of the response time repartition table (unit is ms) --> <ResponseTimeRepartition value="10, 20, 30, 40, 50, 100, 150, 200"/> <!-- definition of the call length repartition table (unit is ms) --> <CallLengthRepartition value="10, 50, 100, 500, 1000, 5000, 10000"/> </scenario>
06-14
以下sql存在多条一样的general_master_type_nm数据,怎么修改呢? WITH -- 表示権限区分が0の汎用マスタ区分を洗い出す a1 AS ( SELECT DISTINCT GENERAL_MASTER_TYPE, DISPLAY_AUTH_CLS AS displayAuthCls FROM m_general WHERE DISPLAY_AUTH_CLS = '0' ), -- 更新権限区分が0の汎用マスタ区分を洗い出す a2 AS ( SELECT DISTINCT GENERAL_MASTER_TYPE, UPD_AUTH_CLS AS updAuthCls FROM m_general WHERE UPD_AUTH_CLS = '0' ), b AS ( SELECT GENERAL_MASTER_TYPE, GENERAL_MASTER_TYPE_NM, MIN( DISPLAY_AUTH_CLS ) AS displayAuthCls, MIN( UPD_AUTH_CLS ) AS updAuthCls FROM m_general GROUP BY GENERAL_MASTER_TYPE, GENERAL_MASTER_TYPE_NM ),-- a と bを合併した最小表示権限区分 c AS ( SELECT b.GENERAL_MASTER_TYPE, b.GENERAL_MASTER_TYPE_NM, IFNULL ( a1.displayAuthCls, b.displayAuthCls ) AS displayAuthCls, IFNULL ( a2.updAuthCls, b.updAuthCls ) AS updAuthCls FROM b LEFT JOIN a1 ON b.GENERAL_MASTER_TYPE = a1.GENERAL_MASTER_TYPE LEFT JOIN a2 ON b.GENERAL_MASTER_TYPE = a2.GENERAL_MASTER_TYPE ) -- 抽出結果 SELECT DISTINCT (mg.GENERAL_MASTER_TYPE), mg.GENERAL_MASTER_TYPE_NM, c.updAuthCls, c.displayAuthCls, #{authority.slpmAuthCls} as slpmAuthCls FROM m_general mg INNER JOIN c ON mg.GENERAL_MASTER_TYPE = c.GENERAL_MASTER_TYPE AND (c.displayAuthCls <![CDATA[ <> '0']]> OR ( #{authority.slpmAuthCls} = '0' AND #{authority.levelType} ='1') ) WHERE 1 = 1 <if test="!containsDel"> and mg.del_flg <![CDATA[ <> '1']]> </if> ORDER BY mg.GENERAL_MASTER_TYPE_NM ASC </select>
04-21
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值