Exam content overview
| Learning Objectives | Tools |
| --- | --- |
| Designing and implementing data storage | 1. Storage: Azure Synapse Analytics, Azure Databricks, Azure Data Lake Storage Gen2 (ADLS2, which can replace the Hadoop Distributed File System, HDFS)<br>2. Sharding: partition a data store into multiple shards to improve efficiency (see the sketch after this table)<br>3. Serving Layer |
| Designing and developing data processing | 1. Ingesting and transforming data. Transform data with Apache Spark and SQL. Transform tools: Azure Data Factory (ADF) can ingest data from storage A, transform it, and store it in storage B; Azure Synapse Pipelines is a subset of ADF integrated into Azure Synapse Analytics; Stream Analytics is older and uses SQL statements<br>2. Batch processing: Azure Data Factory (ADF), Azure Synapse Pipelines, Azure Databricks, Azure Data Lake Storage Gen2 (ADLS2)<br>3. Stream processing: Stream Analytics, Azure Databricks, Azure Event Hubs<br>4. Managing batches and pipelines: Azure Data Factory (ADF), Azure Synapse Pipelines |
| Designing and implementing data security | Encryption, etc. |
| Monitoring and optimizing data storage and data processing | 1. Monitor: Azure Monitor<br>2. Optimize: Azure Data Factory (ADF), Azure Databricks, Azure Synapse Analytics |
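The shard/partition item above is easiest to see in code. Below is a minimal PySpark sketch, assuming a made-up ADLS2 account, container layout, and `sale_date` column, that writes a dataset partitioned by a key so queries only scan the shards they need:

```python
# Minimal PySpark sketch: partition ("shard") a dataset by a key when writing it
# to the lake. The storage account, containers, and sale_date column are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Read raw JSON from an (assumed) ADLS Gen2 container.
raw = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/sales/")

# Write it back partitioned by date: each partition value becomes its own folder,
# so queries that filter on sale_date only scan the matching shards.
(raw.write
    .mode("overwrite")
    .partitionBy("sale_date")
    .parquet("abfss://curated@examplelake.dfs.core.windows.net/sales/"))
```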
Storage options
| Category | Services |
| --- | --- |
| Raw Storage | Azure Blob Storage, Azure File Storage, Azure Data Lake Storage Gen2 (ADLS2) |
| Relational DBs Storage | Azure SQL Database, Azure DB for MySQL, Azure DB for MariaDB, Azure DB for PostgreSQL |
| NoSQL DBs | Azure Cosmos DB, Azure Cache for Redis |
Data Warehouse/Data Lake
A common pattern is to process data in the data lake and then output the results to the data warehouse.
| Name | Key Points | Corresponding feature in Azure Synapse Analytics |
| --- | --- | --- |
| Data Warehouse | • Structured data (relational tables)<br>• Collects data from many databases<br>• Designed for reporting<br>• Query using SQL | Dedicated SQL Pool<br>• Lets you run SQL queries on structured, relational tables |
| Data Lake | • Structured, semi-structured, and unstructured data, usually kept in raw format<br>• Designed for big data analytics<br>• Query using Spark | Spark Pool<br>• Lets you use Spark to query both structured and unstructured data (see the sketch after this table) |
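As a rough illustration of the Spark Pool row, here is a hedged PySpark sketch that queries structured Parquet and semi-structured JSON side by side; all paths, account names, and column names are assumptions:

```python
# Sketch of a Spark pool notebook querying structured and semi-structured data
# together. Paths, the storage account, and column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Structured: relational-style Parquet tables in the lake.
orders = spark.read.parquet("abfss://curated@examplelake.dfs.core.windows.net/orders/")
orders.createOrReplaceTempView("orders")

# Semi-structured: raw JSON click logs.
clicks = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/clicklogs/")
clicks.createOrReplaceTempView("clicks")

# Spark SQL can join both in a single query.
spark.sql("""
    SELECT o.customer_id, COUNT(c.url) AS page_views
    FROM orders o
    LEFT JOIN clicks c ON o.customer_id = c.customer_id
    GROUP BY o.customer_id
""").show()
```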
Data flow design considerations:
| Factor | Details |
| --- | --- |
| Security | Who needs to access the data? What can they do with it? What are the security requirements of the source data? |
| Auditing requirements | Metadata (creation date, batch data, modified dates) |
| Logging requirements | Operation counts, insertion counts, execution times, success or failure information (see the sketch after this table) |
| Metadata | What metadata is necessary? What business intelligence tools will be used to analyze the data? |
| Storage retention requirements | Store the data forever? Store it in raw format? Are there any retention policies? Does the GDPR (General Data Protection Regulation) apply? |
| Others | Existing skill sets, pre-existing packages, interdependencies |
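To make the logging requirements row concrete, here is a minimal Python sketch of a per-run record that captures operation counts, insertion counts, execution time, and success or failure; the field names and wrapper function are illustrative, not any particular Azure API:

```python
# Minimal sketch of a per-run log record covering the items in the
# "Logging requirements" row. Field names and the wrapper are illustrative.
import json
import time

def run_with_logging(batch_id, load_fn):
    """Run one load step and return a record with counts, timing, and outcome."""
    record = {"batch_id": batch_id}
    start = time.time()
    try:
        rows_read, rows_inserted = load_fn()         # the actual pipeline step
        record.update(operation_count=1,
                      rows_read=rows_read,
                      rows_inserted=rows_inserted,   # insertion counts
                      status="success")
    except Exception as exc:                         # success-or-failure information
        record.update(operation_count=1, status="failure", error=str(exc))
    record["execution_seconds"] = round(time.time() - start, 3)  # execution time
    return record

# Example usage with a dummy load step that "reads" 1000 rows and inserts 998.
print(json.dumps(run_with_logging("batch-2024-01-01", lambda: (1000, 998)), indent=2))
```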
About Apache Spark and Hadoop MapReduce
Both are open-source frameworks.
| Aspect | Apache Spark | Hadoop MapReduce |
| --- | --- | --- |
| Processing Speed | Much faster due to in-memory processing | Slower, as it processes data on disk |
| Data Processing | Real-time and iterative analytics | Batch processing |
| Ease of Use | User-friendly APIs, supports multiple languages | Requires Java code |
| Fault Tolerance | Resilient Distributed Datasets (RDDs) provide better fault tolerance | Relies on replication in the Hadoop Distributed File System (HDFS) |
| Integration | Extensive ecosystem, integrates well with other big data tools; Spark on Azure Databricks can process data in ADLS (see the sketch after this table) | Primarily designed for HDFS |
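The speed and iterative-analytics rows come down to caching: Spark keeps a dataset in memory and reuses it across passes, whereas a chain of MapReduce jobs rereads from disk. A hedged PySpark sketch (paths and columns are assumptions):

```python
# Sketch showing Spark's in-memory, iterative style: cache the dataset once and
# reuse it across several aggregations instead of rereading it from disk each
# time (as a chain of MapReduce jobs would). Paths and columns are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (spark.read
          .parquet("abfss://raw@examplelake.dfs.core.windows.net/events/")
          .cache())                                  # keep the data in memory

# Several passes over the same cached DataFrame.
daily_counts = events.groupBy("event_date").count()
sessions_per_user = events.groupBy("user_id").agg(F.countDistinct("session_id"))
error_count = events.filter(F.col("status") == "error").count()

daily_counts.show()
sessions_per_user.show()
print("error events:", error_count)
```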
Synapse Pipeline
| Topic | Notes |
| --- | --- |
| Pipeline Flow sample | • Copy data from Azure Cosmos DB to a Spark pool<br>• Run a Spark job that creates statistics about the data<br>• Store the results in a SQL pool |
| Steps to use Synapse Pipeline | Using code (see the notebook sketch after this table):<br>1. Create a linked service for Cosmos DB as the data source<br>2. Create a Spark pool for running code<br>3. Create a SQL pool as the target storage<br>4. Write analytics code in a notebook<br>5. The code analyzes data from Cosmos DB and outputs the results to the SQL pool<br>6. If it needs to run on a regular basis, create a pipeline and set a schedule<br>No code:<br>1. Create a linked service for Cosmos DB as the data source<br>2. Create a SQL pool<br>3. Create a Synapse pipeline with a Data Flow in it; the Data Flow will (a) load data from the Cosmos DB source, (b) transform the data, and (c) store the data in the SQL pool<br>Note: both pools are VM clusters; you can use auto-scale and auto-pause to save money |
| Storage Concepts | Dedicated SQL Pool (formerly Azure SQL Data Warehouse)<br>• When paused, there is no cost for the cluster, but storage is still billed<br>• DWUs (Data Warehousing Units) set the amount of CPU, memory, and I/O in the compute cluster<br>• No auto-scale<br>Serverless SQL Pool<br>• No storage of its own<br>• Cannot access the Dedicated SQL Pool; it can query data in other places, such as CSV/JSON files in Azure Storage, Azure Open Datasets, and external tables created by Spark pools (see the pyodbc sketch after this table)<br>• No fixed compute cost; cost is based on the amount of data processed by queries<br>• The serverless pool is created automatically when the Synapse workspace is created |
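A hedged notebook sketch of the "Using code" steps above: read the Cosmos DB analytical store through an assumed Synapse Link linked service, compute simple statistics, and write them to a dedicated SQL pool with the Synapse Spark connector. All linked-service, container, pool, and table names are made up, and the `synapsesql` writer is only available inside a Synapse Spark pool:

```python
# Hedged sketch of the "Using code" steps: read the Cosmos DB analytical store
# via an assumed Synapse Link linked service, build simple statistics, and write
# them to a dedicated SQL pool. Uses the `spark` session that Synapse notebooks
# provide; all linked-service, container, pool, and table names are made up.
from pyspark.sql import functions as F

# 1. Read from Cosmos DB through the analytical store (Synapse Link).
sales = (spark.read
         .format("cosmos.olap")
         .option("spark.synapse.linkedService", "CosmosDbSalesLinkedService")
         .option("spark.cosmos.container", "Sales")
         .load())

# 2. Create statistics about the data.
stats = (sales.groupBy("productCategory")
              .agg(F.count("*").alias("order_count"),
                   F.sum("totalAmount").alias("revenue")))

# 3. Store the results in the dedicated SQL pool via the Synapse Spark connector.
stats.write.synapsesql("SalesPool.dbo.CategoryStats")
```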
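And a hedged sketch of the serverless SQL pool querying a CSV file in Azure Storage with OPENROWSET, driven from Python over pyodbc; the workspace endpoint, login, and file path are assumptions:

```python
# Hedged sketch: run a serverless SQL pool query over a CSV file in Azure Storage
# from Python via pyodbc. The workspace endpoint, login, and file path are
# assumptions; the pool bills only for the data the query actually scans.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=example-workspace-ondemand.sql.azuresynapse.net;"   # serverless endpoint
    "Database=master;Uid=sqladminuser;Pwd=<password>;Encrypt=yes;"
)

# OPENROWSET reads the file in place -- no table or loading step needed.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://examplelake.dfs.core.windows.net/raw/sales/2024/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS sales;
"""

for row in conn.cursor().execute(query):
    print(row)
```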