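Byzer-lang is a declarative language and in that respect closely resembles SQL. Unlike SQL, it also supports Python scripts, and users can develop and register UDFs dynamically in Scala or Java, which makes it considerably more flexible.
For big-data workflows, Byzer-lang abstracts the following syntactic constructs: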
- Data loading / Load
- Data transformation / Select
- Data saving / Save
- Code inclusion / Include
- Macro functions / !Macro
- Variable setting / Set
- Branch statements / !If|!Else
For machine learning, it provides similar abstractions:
- Model training / Train|Run
- Model registration / Register
- Model prediction / Predict
In addition, for code reuse, Byzer-lang supports script and package management.
Data Loading / Load
- Load a JSON data source
-- Define a JSON data set with the set syntax
set abc='''
{ "x": 100, "y": 200, "z": 200 ,"dataType":"A group"}
{ "x": 120, "y": 100, "z": 260 ,"dataType":"B group"}
''';
-- load json string
load jsonStr.`abc` as table1;
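To make the one-record-per-line behaviour concrete, here is a minimal Python sketch of how a line-delimited JSON string such as `abc` maps to rows. It only models the parsing and is not Byzer's implementation; `parse_json_lines` is a hypothetical helper name.

```python
import json


def parse_json_lines(text):
    """Parse one JSON object per non-empty line (hypothetical helper)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Same content as the `abc` variable defined with `set` above.
abc = """
{ "x": 100, "y": 200, "z": 200 ,"dataType":"A group"}
{ "x": 120, "y": 100, "z": 260 ,"dataType":"B group"}
"""

table1 = parse_json_lines(abc)
for row in table1:
    print(row)
```

Each physical line becomes one row, which is why the two objects must not share a line.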
- Load a CSV data source
-- Specify csv as the format, pass the file path, keep the header row, and enable schema inference
load csv.`/tmp/upload/green_tripdata_2022-01.csv` where header='true' and inferSchema='true' as trip_data;
- Load data over a JDBC connection
connect jdbc where
url="jdbc:mysql://127.0.0.1:3306/wow?characterEncoding=utf8&zeroDateTimeBehavior=convertToNull&tinyInt1isBit=false"
and driver="com.mysql.jdbc.Driver"
and user="xxxx"
and password="xxxx"
as mysql_instance;
load jdbc.`mysql_instance.test1` where directQuery='''
select * from test1 limit 10
''' as newtable;
select * from newtable as output;
Data Querying and Transformation / Select
- Create and query a table with the select syntax
-- Create
select 1 as col1 as table1;
-- Query
select * from table1 as output1;
- Process data with the template feature of the select syntax
Approach 1:
select "" as features, 1 as label as mockData;
select #set($columns=["features","label"])
#foreach( $column in $columns )
SUM( case when `$column` is null or `$column`='' then 1 else 0 end ) as $column,
#end
1 as a from mockData as output1;
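For orientation, the `#set`/`#foreach` directives in approach 1 are expanded into plain SQL text before the query runs. The Python sketch below reproduces that expansion by hand so the resulting query is visible. It does not call Byzer or any template engine, and `expand_null_counts` is a hypothetical name.

```python
def expand_null_counts(columns):
    """Build the statement the #foreach loop would generate (hypothetical helper)."""
    parts = [
        f"SUM( case when `{c}` is null or `{c}`='' then 1 else 0 end ) as {c},"
        for c in columns
    ]
    # Every generated column ends with a comma, so the fixed "1 as a"
    # column is needed to keep the SELECT list syntactically valid.
    return "select\n" + "\n".join(parts) + "\n1 as a from mockData as output1;"


sql = expand_null_counts(["features", "label"])
print(sql)
```

This also explains the otherwise odd `1 as a` column: it absorbs the trailing comma left by the last loop iteration.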
Approach 2:
select "" as features, 1 as label as mockData;
set sum_tpl = '''
SUM( case when `{0}` is null or `{0}`='' then 1 else 0 end ) as {0}
''';
select ${template.get("sum_tpl","features")}, ${template.get("sum_tpl","label")} from mockData as output1;
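In approach 2, `template.get` fills the positional placeholder `{0}` of the named template with the argument that follows the template name. A rough Python equivalent, assuming plain positional substitution, is sketched below. The `templates` dict and `template_get` function are hypothetical stand-ins, not Byzer APIs.

```python
# Stand-in for the variable created by `set sum_tpl = ...`.
templates = {
    "sum_tpl": "SUM( case when `{0}` is null or `{0}`='' then 1 else 0 end ) as {0}",
}


def template_get(name, *args):
    """Substitute positional placeholders {0}, {1}, ... (hypothetical stand-in)."""
    return templates[name].format(*args)


# One expanded expression per column, joined into a SELECT list.
select_list = ", ".join(template_get("sum_tpl", col) for col in ["features", "label"])
print(f"select {select_list} from mockData as output1;")
```

Compared with approach 1, the template is defined once and reused per column, so no trailing-comma workaround is needed.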
Saving Data / Save
Basic usage, using JSON output as an example:
- overwrite: replace the existing data in full
set rawData='''
{"jack":1,"jack2":2}
{"jack":2,"jack2":3}
''';
load jsonStr.`rawData` as table1;
save overwrite table1 as json.`/tmp/jack`;
- append: add to the existing data incrementally (note: the mock records must be on separate lines to be parsed as two rows)
set rawData1='''
{"jack":1,"jack2":2}
{"jack":2,"jack2":3}
''';
load jsonStr.`rawData1` as table1;
save overwrite table1 as json.`/tmp/jack`;
set rawData2='''
{"jack":3,"jack2":4}
{"jack":4,"jack2":5}
''';
load jsonStr.`rawData2` as table2;
save append table2 as json.`/tmp/jack`;
load json.`/tmp/jack` as output;
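The difference between the save modes in this section can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions: it writes a single JSON-lines file rather than the output layout Byzer actually produces, and `save` and `load` are hypothetical helpers, not Byzer functions.

```python
import json
import os
import tempfile


def save(rows, path, mode):
    """Write rows as JSON lines under a simplified model of the save modes."""
    if mode not in ("overwrite", "append", "ignore"):
        raise ValueError(f"unknown save mode: {mode}")
    if mode == "ignore" and os.path.exists(path):
        return  # target already exists: skip the write entirely
    file_mode = "a" if mode == "append" else "w"
    with open(path, file_mode, encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def load(path):
    """Read the JSON-lines file back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "jack.json")
table1 = [{"jack": 1, "jack2": 2}, {"jack": 2, "jack2": 3}]
table2 = [{"jack": 3, "jack2": 4}, {"jack": 4, "jack2": 5}]

save(table1, path, "overwrite")   # file now holds the 2 rows of table1
save(table2, path, "append")      # 2 more rows are added
after_append = load(path)

save(table1, path, "ignore")      # no-op, because the target exists
after_ignore = load(path)

print(len(after_append), len(after_ignore))
```

In this model the append step leaves four rows in the target, and a later write in ignore mode changes nothing.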
- ignore: skip the write if the file already exists
set rawData1='''
{"jack":1,"jack2":2}
{"jack":2,"jack2":3}
''';
load jsonStr.`rawData1` as table1;
save overwrite table1 as json.`/tmp/jack`;
set rawData2='''
{"jack":3,"jack2":4}
{"jack":4,"jack2":5}
''&