Integrating Great Expectations with Azure: Data Quality Assurance in the Microsoft Cloud
Data Quality Pain Points and Solutions
Are you facing data quality challenges in your Azure cloud environment? Pipeline delays, ETL errors, and inconsistent reports not only undermine business decisions but can also create compliance risk. According to a 2024 Gartner report, 60% of data lake projects failed to realize their expected value because of quality problems. This article shows how a deep integration between Great Expectations (GE) and the Azure ecosystem builds an end-to-end data quality assurance system, securing data reliability along the entire path from Blob Storage to Synapse Analytics.
After reading this article you will have:
- Working GE configurations for three Azure authentication methods
- A data quality monitoring setup for Pandas/Spark data flows over Azure Blob Storage
- Automated data validation with Key Vault credential management
- A production-grade data quality checklist and troubleshooting guide
Technical Architecture and Integration Principles
System Architecture Overview
(Architecture diagram: data lands in Azure Blob Storage, GE validates it with Pandas or Spark, results are published as Data Docs, and validated data flows on to Synapse Analytics.)
Core Integration Components
Great Expectations integrates with Azure through the following components:
| Component | Role | Implementation |
|---|---|---|
| AzureBlobStorageDatasource | Connects to Blob Storage | azure-storage-blob SDK |
| Key Vault Secrets Provider | Secure credential management | azure-keyvault-secrets |
| Data Docs Azure Host | Stores validation result docs | Blob Storage static website |
| Synapse SQL Validator | Database quality checks | SQLAlchemy engine adapter |
Environment Setup and Dependencies
Installing the required packages
# Core package
pip install great_expectations
# Azure-specific dependencies (quote the specifiers so the shell
# does not treat ">" as output redirection)
pip install "azure-identity>=1.10.0"
pip install "azure-keyvault-secrets>=4.0.0"
pip install "azure-storage-blob>=12.5.0"
# Data processing engines
pip install pandas pyspark
Version compatibility matrix
| Great Expectations | Python | azure-storage-blob | PySpark |
|---|---|---|---|
| 0.18.11+ | 3.8-3.11 | 12.5.0+ | 3.3.0+ |
| 0.17.0-0.18.10 | 3.7-3.10 | 12.0.0-12.4.0 | 3.1.0+ |
| <0.17.0 | 3.6-3.9 | <12.0.0 | 2.4.0+ |
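A quick runtime check against this matrix (a minimal sketch; each of these packages exposes the standard __version__ attribute):
# Print installed versions to compare with the compatibility matrix
import sys
import great_expectations as gx
import azure.storage.blob
import pyspark

print("Python:", sys.version.split()[0])
print("Great Expectations:", gx.__version__)
print("azure-storage-blob:", azure.storage.blob.__version__)
print("PySpark:", pyspark.__version__)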
Authentication and Datasource Connections
Comparing authentication options
Great Expectations supports three Azure authentication modes, each suited to a different scenario:
1. Connection string authentication (development)
import great_expectations as gx

# One context object manages datasources, suites, and Data Docs;
# the examples below all reuse it
context = gx.get_context()

datasource = context.sources.add_pandas_abs(
    name="azure_blob_ds",
    azure_options={
        "conn_str": "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey;EndpointSuffix=core.windows.net"
    },
)
2. Account URL plus credential (production)
datasource = context.sources.add_pandas_abs(
    name="azure_blob_ds",
    azure_options={
        "account_url": "https://myaccount.blob.core.windows.net",
        "credential": "my_account_key",  # an account key or SAS token
    },
)
3. Key Vault integration (enterprise-grade security)
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential resolves a managed identity, service principal,
# or developer login, so no secret ever lives in code or config
credential = DefaultAzureCredential()
secret_client = SecretClient(vault_url="https://myvault.vault.azure.net", credential=credential)
account_key = secret_client.get_secret("AZURE-BLOB-ACCOUNT-KEY").value

datasource = context.sources.add_pandas_abs(
    name="azure_blob_ds",
    azure_options={
        "account_url": "https://myaccount.blob.core.windows.net",
        "credential": account_key,
    },
)
Verifying the datasource configuration
# Test the connection
datasource.test_connection()

# Add a CSV data asset
asset = datasource.add_csv_asset(
    name="customer_data",
    abs_container="raw-data",
    abs_name_starts_with="customers/",
    abs_recursive_file_discovery=True
)

# List the options available when building batch requests for this asset
print(asset.batch_request_options)
Data Assets and Batch Processing
Working with time-partitioned data
# Add a Parquet asset whose monthly files are matched by batching_regex;
# the named groups (year, month) become batch request options
parquet_asset = datasource.add_parquet_asset(
    name="sales_data",
    abs_container="processed-data",
    abs_name_starts_with="sales/",
    batching_regex=r"sales_(?P<year>\d{4})-(?P<month>\d{2})\.parquet",
)

# Request the January 2024 batch; each option takes a single value,
# so covering Q1 2024 means three requests (months 01, 02, 03)
batch_request = parquet_asset.build_batch_request(
    options={"year": "2024", "month": "01"}
)
Advanced Spark datasource configuration
spark_ds = context.sources.add_spark_abs(
    name="spark_azure_ds",
    azure_options={
        "conn_str": "DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=mykey;EndpointSuffix=core.windows.net"
    },
    spark_config={
        "spark.sql.parquet.enableVectorizedReader": "true",
        "spark.hadoop.fs.azure.createRemoteFileSystemDuringInitialization": "false"
    }
)

# Add a Delta Lake asset (Spark itself also needs the hadoop-azure
# connector configured to read from Blob Storage)
delta_asset = spark_ds.add_delta_asset(
    name="user_activity",
    abs_container="analytics",
    abs_name_starts_with="delta/lake/users/"
)
Defining and Running Expectation Suites
Basic data quality checks
from great_expectations.core import ExpectationSuite
from great_expectations.core.expectation_configuration import ExpectationConfiguration

# GE 0.18.x uses expectation_suite_name; GE 1.x renamed this argument to name
suite = ExpectationSuite(expectation_suite_name="customer_data_suite")

# Add expectations
suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "customer_id"}
    )
)
suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "registration_date",
            "min_value": "2020-01-01",
            "max_value": "2024-01-01"
        }
    )
)

# Run the validation against a batch of the asset registered earlier
validator = context.get_validator(
    batch_request=asset.build_batch_request(),
    expectation_suite=suite
)
results = validator.validate()
print(results.success)
Azure-specific expectations
Note that neither expectation below ships in GE core: treat them as custom expectations you would implement yourself (or source from the contrib/experimental packages).
# Blob file-size integrity check (custom/contrib expectation)
suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_file_size_to_be_between",
        kwargs={
            "min_value": 1024,      # 1 KB
            "max_value": 10485760   # 10 MB
        }
    )
)
# Data lake partition completeness (custom expectation, implemented by you)
suite.add_expectation(
    ExpectationConfiguration(
        expectation_type="expect_table_row_count_to_match_partition_metadata",
        kwargs={
            "partition_col": "date",
            "metadata_path": "https://myaccount.blob.core.windows.net/metadata/partition_counts.json"
        }
    )
)
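Until such custom expectations exist, the same file-size check can be run directly against the Azure SDK before validation. A minimal sketch, assuming the raw-data container and customers/ prefix used earlier and a conn_str variable holding the storage connection string:
# Pre-validation file-size check using the Azure SDK directly
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(conn_str, "raw-data")
for blob in container.list_blobs(name_starts_with="customers/"):
    # BlobProperties.size is the blob length in bytes
    if not 1024 <= blob.size <= 10 * 1024 * 1024:
        print(f"Size check failed for {blob.name}: {blob.size} bytes")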
Storing and Visualizing Validation Results
Configuring Data Docs on Azure Blob Storage
# great_expectations.yml
data_docs_sites:
  azure_blob_site:
    class_name: SiteBuilder
    module_name: great_expectations.data_docs.site_builder
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_how_to_buttons: true
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: data-docs
      # Keep credentials out of version control: GE substitutes the
      # environment variable at runtime instead of a hard-coded SAS token
      connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
Automating Data Docs updates
# Build and publish the Data Docs site (site_names takes a list)
context.build_data_docs(site_names=["azure_blob_site"])
context.open_data_docs(site_name="azure_blob_site")
Power BI dashboard integration
Because validation results are stored as JSON in Blob Storage alongside the Data Docs site, they can also be flattened into a table that Power BI ingests for trend dashboards.
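A minimal sketch of that flattening step, assuming validation result JSON lives under a validations/ prefix in the data-docs container and conn_str holds the storage connection string:
# Flatten GE validation results into one row per expectation for Power BI
import json
import pandas as pd
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(conn_str, "data-docs")
rows = []
for blob in container.list_blobs(name_starts_with="validations/"):
    result = json.loads(container.download_blob(blob.name).readall())
    for r in result.get("results", []):
        rows.append({
            "suite": result["meta"].get("expectation_suite_name"),
            "expectation": r["expectation_config"]["expectation_type"],
            "success": r["success"],
        })
# Export for Power BI (or load into a table that Power BI reads directly)
pd.DataFrame(rows).to_csv("validation_summary.csv", index=False)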
Production Deployment and Monitoring
Azure DevOps CI/CD pipeline
# azure-pipelines.yml
trigger:
  branches:
    include:
      - main
  paths:
    include:
      - expectations/
      - great_expectations.yml

pool:
  vmImage: 'ubuntu-latest'

steps:
  - task: UsePythonVersion@0
    inputs:
      versionSpec: '3.9'
  - script: |
      pip install -r requirements.txt
      great_expectations --version
    displayName: 'Install dependencies'
  - script: |
      great_expectations checkpoint run azure_quality_checkpoint
    displayName: 'Run data quality checks'
    env:
      AZURE_CLIENT_ID: $(AZURE_CLIENT_ID)
      AZURE_TENANT_ID: $(AZURE_TENANT_ID)
      AZURE_CLIENT_SECRET: $(AZURE_CLIENT_SECRET)
  - task: PublishBuildArtifacts@1
    inputs:
      PathtoPublish: 'great_expectations/uncommitted/data_docs'
      ArtifactName: 'data_docs'
    condition: succeededOrFailed()
Azure Monitor告警配置
from azure.mgmt.monitor import MonitorManagementClient
from azure.identity import DefaultAzureCredential
credential = DefaultAzureCredential()
monitor_client = MonitorManagementClient(credential, subscription_id="your-sub-id")
# 创建指标告警
monitor_client.metric_alerts.create_or_update(
resource_group_name="data-quality-rg",
rule_name="data-quality-failure-alert",
parameters={
"location": "eastus",
"severity": 2,
"enabled": True,
"scopes": ["/subscriptions/your-sub-id/resourceGroups/data-quality-rg/providers/Microsoft.Storage/storageAccounts/myaccount"],
"evaluation_frequency": "PT5M",
"window_size": "PT15M",
"criteria": {
"odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
"allOf": [
{
"name": "DataQualityFailure",
"metricName": "ValidationFailures",
"dimensions": [
{"name": "ExpectationSuite", "operator": "Include", "values": ["customer_data_suite"]}
],
"operator": "GreaterThan",
"threshold": 0,
"timeAggregation": "Count"
}
]
},
"actions": [
{
"actionGroupId": "/subscriptions/your-sub-id/resourceGroups/data-quality-rg/providers/microsoft.insights/actionGroups/data-quality-alerts",
"webhookProperties": {
"alertType": "DataQuality"
}
}
]
}
)
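The alert above only fires if something publishes the ValidationFailures metric. A hedged sketch of doing that from the checkpoint job via the Azure Monitor custom-metrics REST API; the resource ID, region, and DataQuality namespace are assumptions to adapt:
# Publish the custom ValidationFailures metric that the alert rule watches
from datetime import datetime, timezone

import requests
from azure.identity import DefaultAzureCredential

RESOURCE_ID = ("/subscriptions/your-sub-id/resourceGroups/data-quality-rg"
               "/providers/Microsoft.Storage/storageAccounts/myaccount")  # assumed
REGION = "eastus"  # assumed

def publish_failures(failed_count: int, suite_name: str) -> None:
    token = DefaultAzureCredential().get_token("https://monitoring.azure.com/.default")
    body = {
        "time": datetime.now(timezone.utc).isoformat(),
        "data": {"baseData": {
            "metric": "ValidationFailures",
            "namespace": "DataQuality",  # assumed custom namespace
            "dimNames": ["ExpectationSuite"],
            "series": [{
                "dimValues": [suite_name],
                "min": failed_count, "max": failed_count,
                "sum": failed_count, "count": 1,
            }],
        }},
    }
    resp = requests.post(
        f"https://{REGION}.monitoring.azure.com{RESOURCE_ID}/metrics",
        json=body,
        headers={"Authorization": f"Bearer {token.token}"},
        timeout=30,
    )
    resp.raise_for_status()

# Example: call after a checkpoint run with the number of failed expectations
# result = context.run_checkpoint(checkpoint_name="azure_quality_checkpoint")
# publish_failures(failed_count=..., suite_name="customer_data_suite")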
Best Practices and Performance Tuning
Strategies for large-scale data
| Scenario | Optimization | Typical gain |
|---|---|---|
| TB-scale Blob Storage | Partitioned validation + sampling | ~80% less runtime |
| Streaming data | Incremental expectations + state caching | ~90% lower resource usage |
| Multi-region deployment | Regional endpoints + local caching | ~60% lower latency |
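A minimal sketch of the partitioned-validation-plus-sampling pattern on Spark, assuming the sales partition layout from earlier and that customer_data_suite exists in the context; the path and the 1% fraction are illustrative:
# Validate a 1% sample of one monthly partition instead of the full asset
import great_expectations as gx
from pyspark.sql import SparkSession

context = gx.get_context()
spark = SparkSession.builder.getOrCreate()

# Read a single partition and sample it before validation
df = (
    spark.read.parquet("abfss://processed-data@myaccount.dfs.core.windows.net/sales/sales_2024-01.parquet")
    .sample(fraction=0.01, seed=42)
)

spark_ds = context.sources.add_or_update_spark(name="spark_sampled")
sample_asset = spark_ds.add_dataframe_asset(name="sales_sample")
batch_request = sample_asset.build_batch_request(dataframe=df)

validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="customer_data_suite",
)
print(validator.validate().success)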
Troubleshooting Guide
Connection timeouts
Symptom: connections to Azure Blob Storage time out.
Likely causes:
1. A network security group is blocking port 443
2. The storage account firewall restricts access
3. The SDK version is incompatible with the service API version
Fixes:
- Check the NSG rule: `az network nsg rule show --resource-group myrg --nsg-name <nsg-name> --name AllowBlobAccess`
- Verify the storage account network settings: `az storage account show --name myaccount --query networkRuleSet`
- Upgrade the SDK: `pip install --upgrade azure-storage-blob`
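When the CLI checks look clean, a direct SDK probe separates network problems from GE configuration problems. A small sketch, assuming the account URL used earlier:
# Connectivity probe: raises quickly on network, firewall, or auth failures
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://myaccount.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)
print(service.get_service_properties())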
Handling permission errors
Symptom: AuthenticationFailed errors.
Fixes:
1. Inspect the service principal:
`az ad sp show --id <client-id>`
2. Check the RBAC role assignments on the storage account:
`az role assignment list --assignee <client-id> --scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>`
3. Reset the Key Vault access policy:
`az keyvault set-policy --name myvault --spn <client-id> --secret-permissions get list`
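If the role assignments look correct, confirm the identity can actually obtain a token for the storage resource; this isolates identity problems from permission problems:
# Token acquisition check for the storage resource scope
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
token = credential.get_token("https://storage.azure.com/.default")
print("Token acquired; expires at:", token.expires_on)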
Summary and Outlook
The integration of Great Expectations with Azure provides a complete solution for cloud-native data quality assurance, covering the full lifecycle from development and testing to production deployment and helping data teams build reliable pipelines. As Azure's AI services evolve, we can expect:
- Intelligent anomaly detection: automatic identification of data drift with Azure ML
- Zero-trust security: Azure AD federated identity with dynamic access control
- Real-time quality monitoring: deeper integration with Azure Stream Analytics
With the methods introduced here, you now have the core skills to implement enterprise-grade data quality assurance in an Azure environment. Put these practices to work on your data platform to improve reliability and reduce business risk.
Bookmark this article and follow the data quality series; the next installment will cover deep integration of Great Expectations with Azure Synapse. Questions and suggestions are welcome in the comments.
Disclosure: parts of this article were produced with AI assistance (AIGC) and are provided for reference only.



