1. Query engine load-test demo: implementation logic
It has been a while since I last wrote a Spark job. Recently I needed a load-test tool for a query engine: replay the queries that land in Hive (HDFS) every day against the engine to measure its performance. That is a natural fit for Spark reading HDFS and writing results back to Hive, so here is a small demo.
(1) Implementation logic
- Spark reads the queries of an arbitrary day (passed in as a parameter) from HDFS; the second column of the hive_test.engine_queryjson table is the query
- Queries are submitted to the engine at 2-second intervals
- Query results are polled every 2 seconds; a query that has not finished within 5 minutes is treated as timed out (see the polling sketch after this list)
- The execution status of each query is written to the status result table hive_test.query_result_info
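The submit/poll helpers themselves are not part of the snippets below, so here is a minimal sketch of the timeout logic, assuming it sits inside QuerySender next to the constants shown later; checkStatus and the result codes are illustrative placeholders, not the real HttpUtil calls:

def checkStatus(queryId: String): Boolean = ??? // placeholder: ask queryResultUrl whether the query has finished
def waitForResult(query: String, queryId: String): QueryResult = {
  val start = System.currentTimeMillis()
  while (System.currentTimeMillis() - start < queryTimeout) { // 5-minute budget per query
    if (checkStatus(queryId)) {
      return QueryResult(query, 0, "finished") // engine reports the query as done
    }
    Thread.sleep(queryPolingInterval) // poll every 2 seconds
  }
  QueryResult(query, -1, "timeout") // still not finished after 5 minutes
}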
(2) Submitting the code to Spark
- Package the project into a fat jar (see the packaging note below)
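The jar name used in the spark-submit example below ends with -jar-with-dependencies, which is the naming convention of the Maven assembly plugin, so packaging presumably comes down to something like this (assuming such a Maven setup):

mvn clean package
# or, if the assembly plugin's single goal is not bound to the package phase:
mvn clean package assembly:single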
- Submit the jar with spark-submit
For spark-submit usage see the official Spark documentation: Submitting Applications.
The main function takes 3 arguments:
- Argument 1: a date in the form "20190101", identifying which day's queries to read from hive_test.engine_queryjson
- Argument 2: the query stage, stage-a or stage-b
- Argument 3: the number of queries to run (the LIMIT in the SQL)
Note: in cluster mode, reading Hive requires passing hive-site.xml via --files; this can be skipped if the cluster is already configured.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --files hdfs://xxxxx/hive-site.xml \
  --queue your_queue_here \
  --class com.learn.QuerySender \
  ./query_sender-1.0-SNAPSHOT-jar-with-dependencies.jar \
  20190106 stage-a 100
(3) Analyzing the results in Hive
After the run finishes, the execution status of each query has been written to hive_test.query_result_info; query that table for whatever statistics or analysis you need.
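For instance, a per-day breakdown by result code (the meaning of individual code values is up to the job; this query just groups by them):

select dt, code, count(*) as query_cnt
from hive_test.query_result_info
where dt = '20190106'
group by dt, code;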
2. Main code walkthrough
(1) DDL for hive_test.query_result_info
CREATE EXTERNAL TABLE `hive_test.query_result_info`(
`query` string COMMENT 'query',
`code` int COMMENT 'result code',
`info` string COMMENT 'query info')
COMMENT 'query result table'
PARTITIONED BY (
`dt` string COMMENT 'dt');
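The Scala excerpt below ends before the final write, so here is a minimal sketch of how the collected QueryResult rows could be inserted into this partitioned table, assuming the sc, sqlContext (HiveContext), queryResults and dt defined in main; the actual write code in the full job may differ:

// Sketch: write queryResults (ArrayBuffer[QueryResult]) into the dt partition of the result table
val resultDF = sc.parallelize(queryResults).toDF("query", "code", "info")
resultDF.registerTempTable("tmp_query_result")
sqlContext.sql(
  s"insert overwrite table hive_test.query_result_info partition (dt='$dt') " +
  "select query, code, info from tmp_query_result")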
(2) Main code snippets
package com.learn
import com.alibaba.fastjson.JSON
import com.learn.util.HttpUtil
import org.slf4j.LoggerFactory
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import scala.language.postfixOps
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
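// Note: JSON (fastjson), HttpUtil and Future are presumably used by the submit/poll helpers
// (e.g. submmitQuery) further down in the full source; they are not exercised in this excerpt.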
/**
* A row of the query log table
* @param queryJson the query json
* @param dt the date (dt)
*/
case class LogTableRow (
queryJson: String,
dt: String
)
/**
* query result
* @param query query json
* @param code result code
* @param info query info
*/
case class QueryResult (
query: String, // query json
code: Int, // result code
info: String // query info
)
object QuerySender {
val queryUrl = "xxx"
val queryResultUrl = "XXX"
val queryTimeout = 300000 // 5-minute timeout
val queryPolingInterval = 2000 // poll the query result every 2 seconds
val queryInterval = 2000 // 2-second interval between query submissions
def main(args: Array[String]) {
val logger = LoggerFactory.getLogger(QuerySender.getClass)
val dt = args(0) // date, e.g. "20190106"
println("dt=" + dt)
val queryStage = args(1) // stage-a or stage-b
val limit = args(2) // number of queries (the SQL limit)
//val hadoopUserName = args.apply(1)
//val hadoopUserPassWord = args.apply(2)
//println("hadoopUserName=" + hadoopUserName)
//println("hadoopUserPassWord=" + "hadoopUserPassWord")
//System.setProperty("HADOOP_USER_NAME", hadoopUserName)
//System.setProperty("HADOOP_USER_PASSWORD", hadoopUserPassWord)
val conf = new SparkConf()
conf.setAppName("HdfsReader")
//conf.setMaster("local") // 本地测试
// -Method 1--read from HDFS directly (shown for reference; method 2 below reads the Hive table instead)--
/*
val sc = SparkContext.getOrCreate(conf)
var hdfsFile = sc.textFile(s"hdfs://xxxx/engine_queryjson/dt=$dt/000000_0")
var hivedata = hdfsFile.map(_.split("\t")).map(e => (e(0), e(1), e(2)))
println(hivedata.first()._2) // the second column is the query string
// ---submit the queries---
var queryResults = scala.collection.mutable.ArrayBuffer[QueryResult]()
println("Submitting queries.....")
hivedata.collect().foreach(v => {
queryResults.append(submmitQuery(v._2)) // collect() first: appending to the driver-side buffer inside rdd.foreach would run on executors and be lost
Thread.sleep(queryInterval) // 2-second interval between queries
})
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
*/
// -Method 2--read the Hive table---
val sc = SparkContext.getOrCreate(conf) // method 1 above is commented out, so the SparkContext is created here
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
val sql = "select " +
"query_json, dt " +
"from hive_test.engine_queryjson " +
s"where dt='$dt' " +
"and get_json_object(query_json,'$.stage') " +
s"='$queryStage' limit $limit "
println("执行SQL: " + sql)
val queryTable = sqlContext.sql(sql)
val queries = queryTable.map{
row => LogTableRow(row.getString(0), row.getString(1))
}
// ---Submit the queries---
val collectedQueries = queries.collect() // collect once; reused for the size log and the submit loop
var queryResults = scala.collection.mutable.ArrayBuffer[QueryResult]()
println("Submitting queries.....")
println("queries size = " + collectedQueries.length)
var querySubmited = 0
collectedQueries.foreach(v => {
println(