DP-203 Study Notes

Exam Content Overview

Learning objectives and the tools for each:

Designing and implementing data storage

1. Storage

Azure Synapse Analytics

Azure Databricks

Azure Data Lake Storage Gen2 (ADLS2; can replace the Hadoop Distributed File System, HDFS)

2. Sharding

Partition a data store into multiple shards to improve efficiency and scalability (see the hash-sharding sketch at the end of this overview).

3. Serving Layer

Designing and developing data processing

1. Ingesting and transforming data

Transform data with Apache Spark and SQL

Transformation tools:

Azure Data Factory (ADF) can ingest data from storage A, transform it, and store it in storage B;

Azure Synapse Pipelines is a subset of ADF, integrated into Azure Synapse Analytics;

Azure Stream Analytics is an older service whose queries are written in SQL.

2. Batch Processing

Azure Data Factory(ADF)

Azure Synapse Pipelines

Azure Databricks

Azure Data Lake Storage Gen2(ADLS2)

3. Stream Processing

Stream Analytics

Azure Databricks

Azure Event Hubs

4. Managing Batches and Pipelines

Azure Data Factory(ADF)

Azure Synapse Pipelines

Designing and implementing data security

Encryption, etc.

Monitoring and optimizing data storage and data processing

1. Monitor

Azure Monitor

2. Optimize

Azure Data Factory(ADF)

Azure Databricks

Azure Synapse Analytics
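
To make the sharding idea from the overview concrete, here is a minimal Python sketch of hash-based partitioning; the shard count, keys, and in-memory "stores" are hypothetical, and this illustrates the general technique rather than any specific Azure API.

```python
import hashlib

NUM_SHARDS = 4
# Stand-ins for four separate physical stores.
shards = [[] for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    # Stable hash: unlike the built-in hash(), md5 gives the same
    # result across processes, so a key always maps to the same shard.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

for user_id in ["alice", "bob", "carol", "dave", "erin", "frank"]:
    shards[shard_for(user_id)].append(user_id)

print([len(s) for s in shards])  # records spread across the shards
```

Because each shard holds only a fraction of the data, reads and writes can be served in parallel, which is the efficiency gain the exam objective refers to.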

Storage Solutions

Raw Storage

Azure Blob Storage

Azure File Storage

Azure Data Lake Storage Gen2 (ADLS2)

Relational Databases

Azure SQL Database

Azure DB for MySQL

Azure DB for MariaDB

Azure DB for PostgreSQL

NoSQL Databases

Azure Cosmos DB

Azure Cache for Redis

Data Warehouse/Data Lake

A common pattern is to process data in the data lake and then output the results to the data warehouse (see the PySpark sketch after the table below).

Name: Data Warehouse

Key points:
• Structured data (relational tables)
• Collects data from many databases
• Designed for reporting
• Queried using SQL

Corresponding feature in Azure Synapse Analytics: Dedicated SQL Pool (lets you run SQL queries on structured, relational tables)

Name: Data Lake

Key points:
• Unstructured data (documents, etc.) and structured data
• Collects all kinds of data in one place
• Designed for big data analytics
• Queried using Apache Spark

Corresponding feature in Azure Synapse Analytics: Spark Pool (lets you use Spark to query both structured and unstructured data)
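
A minimal PySpark sketch of this lake-to-warehouse pattern, as it might look in a Synapse Spark pool notebook (`spark` is the ambient session there). The storage path, column names, and target table are hypothetical, and the `synapsesql` write assumes the dedicated SQL pool connector that the Synapse Spark runtime provides; outside Synapse, a generic JDBC write would be the equivalent.

```python
from pyspark.sql import functions as F

# Read raw CSV files from the data lake (ADLS Gen2, abfss:// protocol).
raw = (spark.read
       .option("header", "true")
       .csv("abfss://raw@mydatalake.dfs.core.windows.net/sales/*.csv"))

# Transform in the lake: cast and aggregate to a daily summary.
daily = (raw.withColumn("amount", F.col("amount").cast("double"))
         .groupBy("order_date")
         .agg(F.count("*").alias("orders"),
              F.sum("amount").alias("revenue")))

# Output the result to the data warehouse (a dedicated SQL pool table).
daily.write.synapsesql("mydw.dbo.DailySales")
```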

Data flow design considerations:

Factor: Security
• Who needs to access the data?
• What can they do with it?
• What are the security requirements of the source data?

Factor: Auditing requirements
• Metadata (creation date, batch data, modified dates)
• Logging requirements
• Operation counts
• Insertion counts
• Execution times
• Success or failure information

Factor: Metadata
• What metadata is necessary?
• What business intelligence tools will be used to analyze the data?

Factor: Storage retention requirements
• Store the data forever?
• Store in raw format?
• Are there any retention policies?
• Does the GDPR (General Data Protection Regulation) apply?

Factor: Others
• Existing skillsets
• Pre-existing packages
• Interdependencies

Apache Spark vs. Hadoop MapReduce

Both are open-source frameworks.

Aspect: Processing speed
• Apache Spark: much faster, due to in-memory processing
• Hadoop MapReduce: slower, as it processes data on disk

Aspect: Data processing
• Apache Spark: real-time and iterative analytics
• Hadoop MapReduce: batch processing

Aspect: Ease of use
• Apache Spark: user-friendly APIs, supports multiple languages
• Hadoop MapReduce: requires Java code

Aspect: Fault tolerance
• Apache Spark: Resilient Distributed Datasets (RDDs) for better fault tolerance
• Hadoop MapReduce: relies on the Hadoop Distributed File System (HDFS)

Aspect: Integration
• Apache Spark: extensive ecosystem, integrates well with other big data tools
• Hadoop MapReduce: primarily designed for HDFS

Note: Spark on Azure Databricks can process data stored in ADLS, as sketched below.
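
A minimal sketch of the note above, assuming a Databricks notebook whose cluster is already configured with access to the storage account; the path and column names are hypothetical. It also shows the in-memory caching that gives Spark its speed edge over disk-based MapReduce.

```python
# `spark` is the ambient SparkSession in a Databricks notebook.
# Read Parquet files from ADLS Gen2 over the abfss:// protocol.
df = spark.read.parquet(
    "abfss://data@mydatalake.dfs.core.windows.net/events/"
)

# Cache once, then run several iterative queries against memory;
# MapReduce would re-read from disk for each pass.
df.cache()
print(df.count())                         # materializes the cache
df.groupBy("event_type").count().show()  # served from memory
print(df.filter(df.status == "error").count())  # served from memory
```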

Synapse Pipeline

Sample pipeline flow:

• Copy data from Azure Cosmos DB to a Spark pool

• Run a Spark job that creates statistics about the data

• Store the results in a SQL pool

Steps to use a Synapse pipeline

Using code

1. Create a linked service for Cosmos DB as the data source

2. Create a Spark pool for running code

3. Create a SQL pool as the target storage

4. Write the analytics code in a notebook

5. The code analyzes data from Cosmos DB and outputs the results to the SQL pool

6. If it needs to run on a regular basis, create a pipeline and set a schedule (a notebook sketch follows these steps)
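
A minimal sketch of steps 4 and 5, assuming a Synapse Spark notebook, a Cosmos DB linked service named CosmosLinked with the analytical store enabled, and a hypothetical container and target table; the `cosmos.olap` source and the `synapsesql` write are the connectors the Synapse runtime exposes for this scenario.

```python
from pyspark.sql import functions as F

# Read the Cosmos DB analytical store through the linked service.
orders = (spark.read
          .format("cosmos.olap")
          .option("spark.synapse.linkedService", "CosmosLinked")
          .option("spark.cosmos.container", "orders")
          .load())

# Steps 4-5: compute simple statistics about the data.
stats = (orders.groupBy("customerId")
         .agg(F.count("*").alias("order_count"),
              F.avg("total").alias("avg_total")))

# Output the results to the dedicated SQL pool.
stats.write.synapsesql("mydw.dbo.CustomerOrderStats")
```

For step 6, this notebook can be added as a Notebook activity in a Synapse pipeline and run on a schedule trigger.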
 

No code

1. Create a linked service for Cosmos DB as the data source

2. Create a SQL pool

3. Create a Synapse pipeline with a Data Flow inside it.

The Data Flow will:

a. Load data from the source database (here, Cosmos DB)

b. Transform the data

c. Store the data in the target SQL pool

Note: both pools are VM clusters. You can use auto-scale and auto-pause to save money.

Storage Concepts

Dedicated SQL Pool (formerly Azure SQL Data Warehouse)

• When paused, there is no compute cost, but storage is still billed

• DWUs (Data Warehousing Units) set the amount of CPU, memory, and I/O in the compute cluster

• No Auto-scale

Serverless SQL Pool

• No storage of its own

• Cannot access data in a dedicated SQL pool

• Can query data in other places, such as CSV/JSON files in Azure Storage, Azure Open Datasets, and external tables created by Spark pools

• No fixed compute cost; billing is based on the data processed by queries

• A serverless pool is created automatically when a Synapse workspace is created (see the query sketch below)
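
A minimal sketch of such a query from Python, using pyodbc against the serverless endpoint and T-SQL's OPENROWSET; the workspace name, file path, and authentication mode are hypothetical and assume access to the storage account is already configured.

```python
import pyodbc

# The serverless pool's endpoint follows the pattern
# <workspace>-ondemand.sql.azuresynapse.net (hypothetical workspace name).
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

# OPENROWSET reads the file in place; billing is per data processed.
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/2024.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS rows;
"""

for row in conn.execute(sql):
    print(row)
```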
