Practice: Loading Local Data into a Doris Table
Test scenarios:
1. The data to be loaded is missing one column compared with the table schema.
2. The data to be loaded contains one duplicate row.
Table creation statement:
create table if not exists table2
(
`event_day` date,
`siteid` int default '10' comment "site id",
`citycode` smallint comment "city code",
`username` varchar(32) default '' comment "user name",
`pv` bigint sum default '0',
times int sum default '0' comment "number of times"
)
engine=olap
aggregate key(`event_day`,`siteid`,`citycode`,`username`)
partition by range(`event_day`)
(
partition `p202006` values less than("2020-07-01"),
partition `p202007` values less than("2020-08-01"),
partition `p202008` values less than("2020-09-01")
)
distributed by hash(`siteid`) buckets 10
properties("replication_num"="1");
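After creating the table, you can confirm the schema and partitions from the MySQL client. A minimal sketch, assuming the FE runs on node1 with Doris's default query port 9030 and the same root/123456 credentials used for the loads below:
[root@node1 data]# mysql -h node1 -P 9030 -uroot -p123456 -e "desc test_db.table2; show partitions from test_db.table2\G"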
Local data file table2_data.txt:
2020-07-03|11|1|jim|2
2020-07-05|12|1|grace|2
2020-07-05|12|1|grace|2
2020-07-12|13|2|tom|2
2020-07-15|14|3|bush|3
2020-07-12|15|3|helen|3
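Note that every line has only 5 fields while the table defines 6 columns (the times column is missing), which is test scenario 1. A quick way to spot this before loading, sketched with standard shell tools:
[root@node1 data]# awk -F'|' '{print NF, $0}' table2_data.txt    # prints each line's field count; the table expects 6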
First bulk load attempt:
[root@node1 data]# curl --location-trusted -u root:123456 -H "label:table2_20221118" -H "column_separator:|" -T table2_data.txt http://node1:8030/api/test_db/table2/_stream_load
{
"TxnId": 9008,
"Label": "table2_20221118",
"TwoPhaseCommit": "false",
"Status": "Fail",
"Message": "too many filtered rows",
"NumberTotalRows": 6,
"NumberLoadedRows": 0,
"NumberFilteredRows": 6,
"NumberUnselectedRows": 0,
"LoadBytes": 139,
"LoadTimeMs": 6,
"BeginTxnTimeMs": 0,
"StreamLoadPutTimeMs": 2,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 2,
"CommitAndPublishTimeMs": 0,
"ErrorURL": "http://192.168.88.161:8040/api/_load_error_log?file=__shard_2/error_log_insert_stmt_de416744d8ce5505-5faf6147979143b1_de416744d8ce5505_5faf6147979143b1"
}
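All 6 rows were filtered, so the whole load fails. The ErrorURL field points at the BE's error log for this load; fetching it should show the per-row rejection reason (here, the column-count mismatch), though the exact wording depends on the Doris version:
[root@node1 data]# curl "http://192.168.88.161:8040/api/_load_error_log?file=__shard_2/error_log_insert_stmt_de416744d8ce5505-5faf6147979143b1_de416744d8ce5505_5faf6147979143b1"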
Then supplement the data in table2_data.txt by adding the missing times column:
2020-07-03|11|1|jim|2|1
2020-07-05|12|1|grace|2|2
2020-07-05|12|1|grace|2|3
2020-07-12|13|2|tom|2|3
2020-07-15|14|3|bush|3|3
2020-07-12|15|3|helen|3|1
Then edit the file and run the load again:
[root@node1 data]# vim table2_data.txt
[root@node1 data]# curl --location-trusted -u root:123456 -H "label:table2_20221118" -H "column_separator:|" -T table2_data.txt http://node1:8030/api/test_db/table2/_stream_load
{
"TxnId": 9010,
"Label": "table2_20221118",
"TwoPhaseCommit": "false",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 6,
"NumberLoadedRows": 6,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 151,
"LoadTimeMs": 329,
"BeginTxnTimeMs": 0,
"StreamLoadPutTimeMs": 1,
"ReadDataTimeMs": 0,
"WriteDataTimeMs": 311,
"CommitAndPublishTimeMs": 14
}
The load status is Success, so query the data:
mysql> select * from table2;
+------------+--------+----------+----------+------+-------+
| event_day  | siteid | citycode | username | pv   | times |
+------------+--------+----------+----------+------+-------+
| 2020-07-15 |     14 |        3 | bush     |    3 |     3 |
| 2020-07-12 |     15 |        3 | helen    |    3 |     1 |
| 2020-07-05 |     12 |        1 | grace    |    4 |     5 |
| 2020-07-03 |     11 |        1 | jim      |    2 |     1 |
| 2020-07-12 |     13 |        2 | tom      |    2 |     3 |
+------------+--------+----------+----------+------+-------+
5 rows in set (0.03 sec)
Because this test uses the aggregate data model and the value columns use the sum aggregation function, rows with identical key columns are merged and their values added up: the two duplicate grace rows collapse into one row with pv = 2 + 2 = 4 and times = 2 + 3 = 5.
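The same merging applies to any later writes that reuse an existing key. A small sketch (the inserted values are made up for illustration, and port 9030 is assumed as in the earlier check): inserting another jim row with the same key columns adds its pv and times onto the stored row instead of creating a second row.
[root@node1 data]# mysql -h node1 -P 9030 -uroot -p123456 -e "insert into test_db.table2 values('2020-07-03',11,1,'jim',5,1); select * from test_db.table2 where username='jim';"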
Note: if you later run into the "Message": "too many filtered rows" error in your own tests, check whether the data you are loading is missing a column that the table expects!
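Alternatively, if a certain share of bad rows is acceptable, Stream Load can be told to tolerate them via the max_filter_ratio header instead of failing the whole job. A sketch assuming a tolerance of up to 20% filtered rows (the label is changed here so it does not collide with the earlier ones):
[root@node1 data]# curl --location-trusted -u root:123456 -H "label:table2_max_filter" -H "column_separator:|" -H "max_filter_ratio:0.2" -T table2_data.txt http://node1:8030/api/test_db/table2/_stream_load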