Survive by day and develop by night.
Talk for important biz, show your perfect code, full busy, skip hardness, make a better result, wait for change, challenge to survive.
Happy for hardness to solve dependencies.
Overview
Web crawling is a very common requirement.
Requirements:
Design approach
Implementation analysis
1. Single-record mode
private void processSingle(List<Map<String, Object>> list1) {
    // 1. Iterate over the source records
    for (int i = 0; i < list1.size(); i++) {
        // 2. Copy the record, converting snake_case keys to camelCase
        Map<String, Object> dataMap = list1.get(i);
        Map<Object, Object> dm = new HashMap<>();
        for (Map.Entry<String, Object> entry : dataMap.entrySet()) {
            dm.put(lineToHump(entry.getKey()), entry.getValue());
        }
        // 3. Fill in the fixed sample fields once per record (they override source values)
        dm.put("description", "Description");
        dm.put("year", 2008);
        dm.put("trxId", "Transaction ID");
        dm.put("contractNo", "12332131");
        dm.put("deadline", 12332L);
        // 4. Convert the map to a document and insert it on its own
        ArcDocument arcDocument = arcDocumentConvert.convert(dm);
        arcDocumentService.createDoc(arcDocument);
        log.info("Do create action, id={} record index={}", arcDocument.getId(), i);
    }
}
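The snippets in this post call `lineToHump` without showing it. A minimal sketch of such a helper follows, assuming it converts snake_case column names to camelCase and lowercases the input first (a common convention for database column names). The class name `CaseConverter` and the exact behavior are assumptions, not the author's actual implementation.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the lineToHump helper used in this post.
final class CaseConverter {
    // Matches an underscore followed by one lowercase letter or digit, e.g. "_i" in "arc_id".
    private static final Pattern UNDERSCORE = Pattern.compile("_([a-z0-9])");

    private CaseConverter() {
    }

    /** Converts a snake_case name to camelCase; the input is lowercased first. */
    static String lineToHump(String name) {
        if (name == null || name.isEmpty()) {
            return name;
        }
        Matcher matcher = UNDERSCORE.matcher(name.toLowerCase(Locale.ROOT));
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            // Replace "_x" with "X".
            matcher.appendReplacement(sb, matcher.group(1).toUpperCase(Locale.ROOT));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(lineToHump("arc_id"));      // prints arcId
        System.out.println(lineToHump("CONTRACT_NO")); // prints contractNo
    }
}
```

Because the input is lowercased first, a key that is already camelCase (such as `trxId`) would come out as `trxid`; drop the lowercasing step if the source keys can be mixed case.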
2. Batch mode
private void processBatch(List<Map<String, Object>> list1) {
    List<ArcDocument> docList = new ArrayList<>();
    // 1. Iterate over the source records
    for (int i = 0; i < list1.size(); i++) {
        // 2. Copy the record, converting snake_case keys to camelCase
        Map<String, Object> dataMap = list1.get(i);
        Map<Object, Object> dm = new HashMap<>();
        for (Map.Entry<String, Object> entry : dataMap.entrySet()) {
            dm.put(lineToHump(entry.getKey()), entry.getValue());
        }
        // 3. Fill in the fixed sample fields once per record (they override source values)
        dm.put("description", "Description");
        dm.put("year", 2008);
        dm.put("trxId", "Transaction ID");
        dm.put("contractNo", "12332131");
        dm.put("deadline", 12332L);
        // 4. Convert the map to a document and collect it for the batch
        ArcDocument arcDocument = arcDocumentConvert.convert(dm);
        docList.add(arcDocument);
        log.info("batch action, id={} record index={}", arcDocument.getId(), i);
    }
    // 5. Insert all collected documents in one call
    arcDocumentService.insertBatch(docList);
}
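The batch method above hands the whole list to `insertBatch` in one call, which can get heavy for large inputs. One possible way to bound the size of each call is to split the list into fixed-size chunks first. The sketch below uses only the JDK; the class name `BatchPartitioner` and the chunk size in the demo are illustrative assumptions, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a list into chunks of bounded size.
final class BatchPartitioner {
    private BatchPartitioner() {
    }

    /** Splits source into consecutive chunks of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> source, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < source.size(); from += batchSize) {
            int to = Math.min(from + batchSize, source.size());
            // Copy the sub-list so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(source.subList(from, to)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 7; i++) {
            ids.add(i);
        }
        // Prints [0, 1, 2], then [3, 4, 5], then [6], one chunk per line.
        for (List<Integer> chunk : partition(ids, 3)) {
            System.out.println(chunk);
        }
    }
}
```

With such a helper, the final step of the batch method could loop over `partition(docList, n)` and call `insertBatch` once per chunk; a suitable `n` depends on the storage backend and should be measured rather than guessed.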
Asynchronous code sample:
/**
 * Imports data into Elasticsearch with a single bulk request.
 */
public void addElasticsearchData(List<Map<String, Object>> addEsDataMapList) {
    // Obtain the client
    RestHighLevelClient client = restHighLevelClient();
    try {
        // Create the bulk request that collects one index request per record
        BulkRequest bulkRequest = new BulkRequest();
        // Declared outside the loop as in the original. The client talks to Elasticsearch
        // over HTTP (TCP), not UDP, so this placement does not affect delivery.
        IndexRequest requestData = null;
        for (Map<String, Object> addEsDataMap : addEsDataMapList) {
            // Use a fresh map per record so keys from a previous record cannot leak into this one
            Map<Object, Object> dataMap = new HashMap<>();
            for (Map.Entry<String, Object> entry : addEsDataMap.entrySet()) {
                dataMap.put(lineToHump(entry.getKey()), entry.getValue());
            }
            // Fill in the fixed sample fields once per record
            dataMap.put("description", "Description");
            dataMap.put("year", 2008);
            dataMap.put("trxId", "Transaction ID");
            dataMap.put("contractNo", "12332131");
            dataMap.put("deadline", 12332L);
            ArcDocument arcDocument = arcDocumentConvert.convert(dataMap);
            // NOTE: source(String, XContentType) expects a JSON string; if ArcDocument is a
            // plain object, serialize it to JSON before passing it here.
            requestData = new IndexRequest(arc_document, "_doc", dataMap.get("arcId").toString())
                    .source(arcDocument, XContentType.JSON);
            bulkRequest.add(requestData);
        }
        log.info("ES sync record count: {}", bulkRequest.numberOfActions());
        // Make the indexed documents visible to search immediately
        bulkRequest.setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE);
        // Submit only if the request contains at least one action
        if (bulkRequest.numberOfActions() >= 1) {
            BulkResponse bulkResponse = client.bulk(bulkRequest, RequestOptions.DEFAULT);
            if (bulkResponse.hasFailures()) {
                log.error("Bulk write failed: {}", bulkResponse.buildFailureMessage());
            } else {
                log.info("Real-time messages written to ES successfully");
            }
        }
    } catch (Exception e) {
        // Pass the exception as the last argument so the stack trace is logged
        log.error("ES sync failed: {}", addEsDataMapList, e);
    } finally {
        try {
            client.close();
        } catch (IOException e) {
            log.warn("Failed to close the ES client", e);
        }
    }
}
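The bulk method above sends everything in one blocking `client.bulk` call. As a rough sketch of what an asynchronous variant could look like, the example below submits chunks of records on a thread pool with `CompletableFuture` and waits for all of them to finish. It uses only the JDK: `AsyncBatchSubmitter`, `submitAll`, and the `sink` callback are hypothetical names, and the sink stands in for whatever actually writes a chunk (a batch insert or a bulk request).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical helper: writes chunks concurrently through a caller-supplied sink.
final class AsyncBatchSubmitter {
    private AsyncBatchSubmitter() {
    }

    /**
     * Hands each chunk to the sink on the given executor, waits for all chunks,
     * and returns the number of records whose sink call completed normally.
     */
    static <T> int submitAll(List<List<T>> chunks, Consumer<List<T>> sink, ExecutorService executor) {
        AtomicInteger written = new AtomicInteger();
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        for (List<T> chunk : chunks) {
            futures.add(CompletableFuture.runAsync(() -> {
                sink.accept(chunk);
                written.addAndGet(chunk.size());
            }, executor));
        }
        // join() blocks until every chunk is done and throws CompletionException if any sink call failed.
        CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        return written.get();
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            List<List<String>> chunks = Arrays.asList(
                    Arrays.asList("doc-1", "doc-2", "doc-3"),
                    Arrays.asList("doc-4", "doc-5"));
            // The sink here only prints; a real sink would call the batch insert or bulk API.
            int written = submitAll(chunks, chunk -> System.out.println("writing " + chunk), executor);
            System.out.println("records written: " + written); // prints "records written: 5"
        } finally {
            executor.shutdown();
        }
    }
}
```

The order in which chunks are written is not guaranteed. With Elasticsearch specifically, `RestHighLevelClient` also provides `bulkAsync`, which takes an `ActionListener<BulkResponse>` and avoids blocking a thread per request; whether that or a thread-pool approach fits better depends on the workload.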
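Extended implementation
For reference, see GitHub for a simple implementation of the flow above:
Entry-level implementation:
: [Partial source implementation]
: Source implementation
Performance parameter tests: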