Practice: Loading Local Data into a Doris Table
Test scenarios:
1. The data to be loaded is missing one column compared with the table schema.
2. The data to be loaded contains one duplicate row.
Table creation statement:
create table if not exists table2
(
`event_day` date,
`siteid` int default '10' comment "site id",
`citycode` smallint comment "city code",
`username` varchar(32) default '' comment "user name",
`pv` bigint sum default '0',
times int sum default '0' comment "number of times"
)
engine=olap
aggregate key(`event_day`,`siteid`,`citycode`,`username`)
partition by range(`event_day`)
(
partition `p202006` values less than("2020-07-01"),
partition `p202007` values less than("2020-08-01"),
partition `p202008` values less than("2020-09-01")
)
distributed by hash(`siteid`) buckets 10
properties("replication_num"="1");
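After creating the table, you can confirm the schema and partitions from the MySQL client. A minimal sketch, assuming the FE runs on node1 with Doris's default query port 9030 and the same root/123456 credentials used for the loads below:
[root@node1 data]# mysql -h node1 -P 9030 -uroot -p123456 -e "desc test_db.table2; show partitions from test_db.table2\G"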
Local data file table2_data.txt:
2020-07-03|11|1|jim|2
2020-07-05|12|1|grace|2
2020-07-05|12|1|grace|2
2020-07-12|13|2|tom|2
2020-07-15|14|3|bush|3
2020-07-12|15|3|helen|3
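Note that every line has only 5 fields while the table defines 6 columns (the times column is missing), which is test scenario 1. A quick way to spot this before loading, sketched with standard shell tools:
[root@node1 data]# awk -F'|' '{print NF, $0}' table2_data.txt    # prints each line's field count; the table expects 6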
First bulk load attempt:
[root@node1 data]# curl --location-trusted -u root:123456 -H "label:table2_20221118" -H "column_separator:|" -T table2_data.txt http://node1:8030/api/test_db/table2/_stream_load
{
"TxnId": 9008,
"Label": "table2_20221118",
"TwoPhaseCommit": "false",
"Status": "Fail",
"Message": "too many filtered rows",
"NumberTotalRows": 6,
"NumberLoadedRows": 0,
"NumberFilteredRows": 6,
"NumberUnselectedRows": 0,
"LoadBytes": 139,
"LoadTimeMs": 6,
"BeginTxnTimeMs": 0,
"StreamLoadPutTimeMs": 2,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 2,
"CommitAndPublishTimeMs": 0,
"ErrorURL": "http://192.168.88.161:8040/api/_load_error_log?file=__shard_2/error_log_insert_stmt_de416744d8ce5505-5faf6147979143b1_de416744d8ce5505_5faf6147979143b1"
}
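All 6 rows were filtered, so the whole load fails. The ErrorURL field points at the BE's error log for this load; fetching it should show the per-row rejection reason (here, the column-count mismatch), though the exact wording depends on the Doris version:
[root@node1 data]# curl "http://192.168.88.161:8040/api/_load_error_log?file=__shard_2/error_log_insert_stmt_de416744d8ce5505-5faf6147979143b1_de416744d8ce5505_5faf6147979143b1"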
Then supplement the data in table2_data.txt by adding the missing times column:
2020-07-03|11|1|jim|2|1
2020-07-05|12|1|grace|2|2
2020-07-05|12|1|grace|2|3
2020-07-12|13|2|tom|2|3
2020-07-15|14|3|bush|3|3
2020-07-12|15|3|helen|3|1
Then edit the file and run the load again:
[root@node1 data]# vim table2_data.txt
[root@node1 data]# curl --location-trusted -u root:123456 -H "label:table2_20221118" -H "column_separator:|" -T table2_data.txt http://node1:8030/api/test_db/table2/_stream_load
{
"TxnId": 9010,
"Label": "table2_20221118",
"TwoPhaseCommit": "false",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 6,
"NumberLoadedRows": 6,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 151,
"LoadTimeMs": 329,
"BeginTxnTimeMs": 0,
"StreamLoadPutTimeMs": 1,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 311,
"CommitAndPublishTimeMs": 14
}
The load status is Success, so query the data:
mysql> select * from table2;
+------------+--------+----------+----------+------+-------+
| event_day  | siteid | citycode | username | pv   | times |
+------------+--------+----------+----------+------+-------+
| 2020-07-15 |     14 |        3 | bush     |    3 |     3 |
| 2020-07-12 |     15 |        3 | helen    |    3 |     1 |
| 2020-07-05 |     12 |        1 | grace    |    4 |     5 |
| 2020-07-03 |     11 |        1 | jim      |    2 |     1 |
| 2020-07-12 |     13 |        2 | tom      |    2 |     3 |
+------------+--------+----------+----------+------+-------+
5 rows in set (0.03 sec)
Because this test uses the aggregate data model and the value columns use the sum aggregation function, rows with identical key columns are merged and their values added up: the two duplicate grace rows collapse into one row with pv = 2 + 2 = 4 and times = 2 + 3 = 5.
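The same merging applies to any later writes that reuse an existing key. A small sketch (the inserted values are made up for illustration, and port 9030 is assumed as in the earlier check): inserting another jim row with the same key columns adds its pv and times onto the stored row instead of creating a second row.
[root@node1 data]# mysql -h node1 -P 9030 -uroot -p123456 -e "insert into test_db.table2 values('2020-07-03',11,1,'jim',5,1); select * from test_db.table2 where username='jim';"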
Note: if you later run into the "Message": "too many filtered rows" error in your own tests, check whether the data you are loading is missing a column that the table expects!
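Alternatively, if a certain share of bad rows is acceptable, Stream Load can be told to tolerate them via the max_filter_ratio header instead of failing the whole job. A sketch assuming a tolerance of up to 20% filtered rows (the label is changed here so it does not collide with the earlier ones):
[root@node1 data]# curl --location-trusted -u root:123456 -H "label:table2_max_filter" -H "column_separator:|" -H "max_filter_ratio:0.2" -T table2_data.txt http://node1:8030/api/test_db/table2/_stream_load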