Spark SQL解析复杂、嵌套、加密json大文件数据思路(java)

这篇博客讲述了如何使用Spark SQL解析200GB大小的、包含加密JSON的文件。首先,通过unbase64和decode函数解密并转换字段,然后去除字符串中的转义字符,将JSON字符串转换为结构化数据。通过Spark SQL内置函数解析和展开JSON字段,生成三张表:xxxstatus、xxxinfo和xxxrxtx。文章提供了优化建议,如拆分大文件和处理缺失字段。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >


文件gz压缩格式20G,解压后200G左右,每行都是一个json字符串,大概500w行。每行json格式简化后如下:

{
	"tel": "15311111111111",
	"mac": "33:D3:B3:43:83:03",
	"name": "zhongxing",
	"type1": "140102",
	"timestamp": "1572735114",
	"event": {
		"ET": "SGW_QUALITY_REPORT",
		"ED": "eyJ4eHhzdGF0dXMiOlt7IlRpbWUiOiIyMDE5LTExLTAzIDA2OjE2OjAwIiwicnVubmluZ1RpbWUiOjI0MDE4Nywic3RhdHVzMDEiOjIsInN0YXR1czAyIjo0MX1dLCJ4eHhpbmZvIjpbeyJUaW1lIjoiMjAxOS0xMS0wMyAwNjoxNjowMCIsInh4eERldiI6W3sieHh4RGV2MDEiOjEsInh4eERldjAyIjoieHh4LTAwMDAwIiwieHh4RGV2MDMiOiJ6ekctMSIsInh4eERldjA0Ijoi6Zi/6YeMIiwieHh4RGV2MDUiOiIyMTU0ODc4Nzg3IiwieHh4RGV2MDYiOjQsInh4eERldjA3Ijo0LCJ4eHhEZXYwOCI6MCwieHh4RGV2MDkiOiJXUEEvT1NUIiwieHh4RGV2MTAiOiJhLHQsYyIsInh4eERldjExIjoiNDAtYml0IiwieHh4RGV2MTIiOiJBbGwifSx7Inh4eERldjAxIjowLCJ4eHhEZXYwMiI6Ikd1ZXN0LXRlc3QiLCJ4eHhEZXYwMyI6Inp6Ry0yIiwieHh4RGV2MDQiOiJBUFAiLCJ4eHhEZXYwNSI6Ijc3Nzc3NzciLCJ4eHhEZXYwNiI6NCwieHh4RGV2MDciOjQsInh4eERldjA4IjowLCJ4eHhEZXYwOSI6IldQQS9PU1QiLCJ4eHhEZXYxMCI6ImEsdCxjIiwieHh4RGV2MTEiOiI0MC1iaXQiLCJ4eHhEZXYxMiI6Ik9TeXN0ZW0ifV19XSwieHh4cnh0eCI6W3siVGltZSI6IjIwMTktMTEtMDMgMDQ6NTU6MDgiLCJ4eHhMaXN0IjpbeyJ4eHhMaXN0MDEiOiIyNTY0NTg3OTQ1IiwieHh4TGlzdDAyIjoicGhvbmUiLCJ4eHhMaXN0MDMiOiI1RyIsInh4eExpc3QwNCI6IjAwQ0RONTY1QkdGIiwieHh4TGlzdDA1IjoiMTkyLjE2OC4wLjEyNCIsInh4eExpc3QwNiI6Inh455qE5omL5py6IiwieHh4TGlzdDA3IjoiaVBob25lIDZzIiwieHh4TGlzdDA4Ijoi6Iu55p6cIiwieHh4TGlzdDA5IjoieHh4UGhvbmUiLCJ4eHhMaXN0MTAiOjE1MjAwMDAsInh4eExpc3QxMSI6Mjc0MDAwMDAwfSx7Inh4eExpc3QwMSI6IjI1Njg3OTQxMTExIiwieHh4TGlzdDAyIjoicGhvbmUiLCJ4eHhMaXN0MDMiOiI1RyIsInh4eExpc3QwNCI6IjlDQkRGNDU2RkciLCJ4eHhMaXN0MDUiOiIxOTIuMTY4LjAuMTI1IiwieHh4TGlzdDA2IjoieHjnmoTmiYvmnLoiLCJ4eHhMaXN0MDciOiLljY7kuLpHOemdkuaYpeeJiCIsInh4eExpc3QwOCI6IuWNjuS4uiIsInh4eExpc3QwOSI6Inh4eF9Zb3V0aCIsInh4eExpc3QxMCI6MTU0NjAwMCwieHh4TGlzdDExIjoxMDI1ODAwMDB9LHsieHh4TGlzdDAxIjoiMTg5Nzk0NTYxMiIsInh4eExpc3QwMiI6IlNWQyIsInh4eExpc3QwMyI6IjEiLCJ4eHhMaXN0MDQiOiIyNEhEQlYyMzU2QkMiLCJ4eHhMaXN0MDUiOiIyNTUuMjU1LjI1NS4yNTUiLCJ4eHhMaXN0MDYiOiJ4eOeahOacuumhtuebkiIsInh4eExpc3QwNyI6Iue9keaYk+S6keebkiIsInh4eExpc3QwOCI6Iue9keaYkyIsInh4eExpc3QwOSI6ImFuZHJvaWQteHh4eCIsInh4eExpc3QxMCI6NTg5NzEwMDAwLCJ4eHhMaXN0MTEiOjE1Njg3OTAwMDAwfV19LHsiVGltZSI6IjIwMTktMTEtMDMgMDU6MDA6MDkiLCJ4eHhMaXN0IjpbeyJ4eHhMaXN0MDEiOiI1Njg5NzQ1NjY2NCIsInh4eExpc3QwMiI6InBob25lIiwieHh4TGlzdDAzIjoiM0ciLCJ4eHhMaXN0MDQiOiIwMDBDREI0ODJCQyIsInh4eExpc3QwNSI6IjE5Mi4xNjguMC4xMjYiLCJ4eHhMaXN0MDYiOiJ4eOeahOaJi+acuiIsInh4eExpc3QwNyI6ImlQaG9uZSA2cyIsInh4eExpc3QwOCI6IuiLueaenCIsInh4eExpc3QwOSI6Inh4eFBob25lIiwieHh4TGlzdDEwIjo1Njg5NzQwMCwieHh4TGlzdDExIjoyODg1NjEwMDAwfV19XX0=",
		"MAC": "33DA37488333",
		"CJ": "zhongxing",
		"PC": "ZX HNC(3-2)",
		"Version": "1.0"
	}
}

其中 ,ED字段为加密字段,解密后还是一段JSON:

{
	"xxxstatus": [{
		"Time": "2019-11-03 06:16:00",
		"runningTime": 240187,
		"status01": 2,
		"status02": 41
	}],
	"xxxinfo": [{
		"Time": "2019-11-03 06:16:00",
		"xxxDev": [{
			"xxxDev01": 1,
			"xxxDev02": "xxx-00000",
			"xxxDev03": "zzG-1",
			"xxxDev04": "阿里",
			"xxxDev05": "2154878787",
			"xxxDev06": 4,
			"xxxDev07": 4,
			"xxxDev08": 0,
			"xxxDev09": "WPA/OST",
			"xxxDev10": "a,t,c",
			"xxxDev11": "40-bit",
			"xxxDev12": "All"
		}, {
			"xxxDev01": 0,
			"xxxDev02": "Guest-test",
			"xxxDev03": "zzG-2",
			"xxxDev04": "APP",
			"xxxDev05": "7777777",
			"xxxDev06": 4,
			"xxxDev07": 4,
			"xxxDev08": 0,
			"xxxDev09": "WPA/OST",
			"xxxDev10": "a,t,c",
			"xxxDev11": "40-bit",
			"xxxDev12": "OSystem"
		}]
	}],
	"xxxrxtx": [{
		"Time": "2019-11-03 04:55:08",
		"xxxList": [{
			"xxxList01": "2564587945",
			"xxxList02": "phone",
			"xxxList03": "5G",
			"xxxList04": "00CDN565BGF",
			"xxxList05": "192.168.0.124",
			"xxxList06": "xx的手机",
			"xxxList07": "iPhone 6s",
			"xxxList08": "苹果",
			"xxxList09": "x
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值