Exam content overview
| Learning Objectives | Tools |
| --- | --- |
| Designing and implementing data storage | 1. Storage: Azure Synapse Analytics, Azure Databricks, Azure Data Lake Storage Gen2 (ADLS2, which can replace the Hadoop Distributed File System, HDFS)<br>2. Sharding: partition a data store into multiple shards to improve efficiency (see the sketch after this table)<br>3. Serving Layer |
| Designing and developing data processing | 1. Ingesting and transforming data. Transform data with Apache Spark and SQL. Transform tools: Azure Data Factory (ADF) can ingest data from storage A, transform it, and store it in storage B; Azure Synapse Pipelines is a subset of ADF integrated into Azure Synapse Analytics; Stream Analytics is older and uses SQL statements<br>2. Batch processing: Azure Data Factory (ADF), Azure Synapse Pipelines, Azure Databricks, Azure Data Lake Storage Gen2 (ADLS2)<br>3. Stream processing: Stream Analytics, Azure Databricks, Azure Event Hubs<br>4. Managing batches and pipelines: Azure Data Factory (ADF), Azure Synapse Pipelines |
| Designing and implementing data security | Encryption, etc. |
| Monitoring and optimizing data storage and data processing | 1. Monitor: Azure Monitor<br>2. Optimize: Azure Data Factory (ADF), Azure Databricks, Azure Synapse Analytics |
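The shard/partition item above is easiest to see in code. Below is a minimal PySpark sketch, assuming a made-up ADLS2 account, container layout, and `sale_date` column, that writes a dataset partitioned by a key so queries only scan the shards they need:

```python
# Minimal PySpark sketch: partition ("shard") a dataset by a key when writing it
# to the lake. The storage account, containers, and sale_date column are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

# Read raw JSON from an (assumed) ADLS Gen2 container.
raw = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/sales/")

# Write it back partitioned by date: each partition value becomes its own folder,
# so queries that filter on sale_date only scan the matching shards.
(raw.write
    .mode("overwrite")
    .partitionBy("sale_date")
    .parquet("abfss://curated@examplelake.dfs.core.windows.net/sales/"))
```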
Storage options
| Category | Services |
| --- | --- |
| Raw Storage | Azure Blob Storage, Azure File Storage, Azure Data Lake Storage Gen2 (ADLS2) |
| Relational DBs Storage | Azure SQL Database, Azure DB for MySQL, Azure DB for MariaDB, Azure DB for PostgreSQL |
| NoSQL DBs | Azure Cosmos DB, Azure Cache for Redis |
Data Warehouse/Data Lake
A common pattern is to process data in the data lake and then output the results to the data warehouse.
| Name | Key Points | Corresponding feature in Azure Synapse Analytics |
| --- | --- | --- |
| Data Warehouse | • Structured data (relational tables)<br>• Collects data from many databases<br>• Designed for reporting<br>• Query using SQL | Dedicated SQL Pool<br>• Lets you run SQL queries on structured, relational tables |
| Data Lake | • Structured, semi-structured, and unstructured data, usually kept in raw format<br>• Designed for big data analytics<br>• Query using Spark | Spark Pool<br>• Lets you use Spark to query both structured and unstructured data (see the sketch after this table) |
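As a rough illustration of the Spark Pool row, here is a hedged PySpark sketch that queries structured Parquet and semi-structured JSON side by side; all paths, account names, and column names are assumptions:

```python
# Sketch of a Spark pool notebook querying structured and semi-structured data
# together. Paths, the storage account, and column names are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Structured: relational-style Parquet tables in the lake.
orders = spark.read.parquet("abfss://curated@examplelake.dfs.core.windows.net/orders/")
orders.createOrReplaceTempView("orders")

# Semi-structured: raw JSON click logs.
clicks = spark.read.json("abfss://raw@examplelake.dfs.core.windows.net/clicklogs/")
clicks.createOrReplaceTempView("clicks")

# Spark SQL can join both in a single query.
spark.sql("""
    SELECT o.customer_id, COUNT(c.url) AS page_views
    FROM orders o
    LEFT JOIN clicks c ON o.customer_id = c.customer_id
    GROUP BY o.customer_id
""").show()
```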
Data flow design considerations:
| Factor | Details |
| --- | --- |
| Security | Who needs to access the data? What can they do with it? What are the security requirements of the source data? |
| Auditing requirements | Metadata (creation date, batch data, modified dates) |
| Logging requirements | Operation counts, insertion counts, execution times, success or failure information (see the sketch after this table) |
| Metadata | What metadata is necessary? What business intelligence tools will be used to analyze the data? |
| Storage retention requirements | Store the data forever? Store it in raw format? Are there any retention policies? Does the GDPR (General Data Protection Regulation) apply? |
| Others | Existing skill sets, pre-existing packages, interdependencies |
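To make the logging requirements row concrete, here is a minimal Python sketch of a per-run record that captures operation counts, insertion counts, execution time, and success or failure; the field names and wrapper function are illustrative, not any particular Azure API:

```python
# Minimal sketch of a per-run log record covering the items in the
# "Logging requirements" row. Field names and the wrapper are illustrative.
import json
import time

def run_with_logging(batch_id, load_fn):
    """Run one load step and return a record with counts, timing, and outcome."""
    record = {"batch_id": batch_id}
    start = time.time()
    try:
        rows_read, rows_inserted = load_fn()         # the actual pipeline step
        record.update(operation_count=1,
                      rows_read=rows_read,
                      rows_inserted=rows_inserted,   # insertion counts
                      status="success")
    except Exception as exc:                         # success-or-failure information
        record.update(operation_count=1, status="failure", error=str(exc))
    record["execution_seconds"] = round(time.time() - start, 3)  # execution time
    return record

# Example usage with a dummy load step that "reads" 1000 rows and inserts 998.
print(json.dumps(run_with_logging("batch-2024-01-01", lambda: (1000, 998)), indent=2))
```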
About Apache Spark and Hadoop MapReduce
Both are open-source frameworks.
| Aspect | Apache Spark | Hadoop MapReduce |
| --- | --- | --- |
| Processing Speed | Much faster due to in-memory processing | Slower, as it processes data on disk |
| Data Processing | Real-time and iterative analytics | Batch processing |
| Ease of Use | User-friendly APIs, supports multiple languages | Requires Java code |
| Fault Tolerance | Resilient Distributed Datasets (RDDs) provide better fault tolerance | Relies on replication in the Hadoop Distributed File System (HDFS) |
| Integration | Extensive ecosystem, integrates well with other big data tools; Spark on Azure Databricks can process data in ADLS (see the sketch after this table) | Primarily designed for HDFS |
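The speed and iterative-analytics rows come down to caching: Spark keeps a dataset in memory and reuses it across passes, whereas a chain of MapReduce jobs rereads from disk. A hedged PySpark sketch (paths and columns are assumptions):

```python
# Sketch showing Spark's in-memory, iterative style: cache the dataset once and
# reuse it across several aggregations instead of rereading it from disk each
# time (as a chain of MapReduce jobs would). Paths and columns are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

events = (spark.read
          .parquet("abfss://raw@examplelake.dfs.core.windows.net/events/")
          .cache())                                  # keep the data in memory

# Several passes over the same cached DataFrame.
daily_counts = events.groupBy("event_date").count()
sessions_per_user = events.groupBy("user_id").agg(F.countDistinct("session_id"))
error_count = events.filter(F.col("status") == "error").count()

daily_counts.show()
sessions_per_user.show()
print("error events:", error_count)
```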
Synapse Pipeline
| Topic | Notes |
| --- | --- |
| Pipeline Flow sample | • Copy data from Azure Cosmos DB to a Spark pool<br>• Run a Spark job that creates statistics about the data<br>• Store the results in a SQL pool |
| Steps to use Synapse Pipeline | Using code (see the notebook sketch after this table):<br>1. Create a linked service for Cosmos DB as the data source<br>2. Create a Spark pool for running code<br>3. Create a SQL pool as the target storage<br>4. Write analytics code in a notebook<br>5. The code analyzes data from Cosmos DB and outputs the results to the SQL pool<br>6. If it needs to run on a regular basis, create a pipeline and set a schedule<br>No code:<br>1. Create a linked service for Cosmos DB as the data source<br>2. Create a SQL pool<br>3. Create a Synapse pipeline with a Data Flow in it; the Data Flow will (a) load data from the Cosmos DB source, (b) transform the data, and (c) store the data in the SQL pool<br>Note: both pools are VM clusters; you can use auto-scale and auto-pause to save money |
| Storage Concepts | Dedicated SQL Pool (formerly Azure SQL Data Warehouse)<br>• When paused, there is no cost for the cluster, but storage is still billed<br>• DWUs (Data Warehousing Units) set the amount of CPU, memory, and I/O in the compute cluster<br>• No auto-scale<br>Serverless SQL Pool<br>• No storage of its own<br>• Cannot access the Dedicated SQL Pool; it can query data in other places, such as CSV/JSON files in Azure Storage, Azure Open Datasets, and external tables created by Spark pools (see the pyodbc sketch after this table)<br>• No fixed compute cost; cost is based on the amount of data processed by queries<br>• The serverless pool is created automatically when the Synapse workspace is created |
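A hedged notebook sketch of the "Using code" steps above: read the Cosmos DB analytical store through an assumed Synapse Link linked service, compute simple statistics, and write them to a dedicated SQL pool with the Synapse Spark connector. All linked-service, container, pool, and table names are made up, and the `synapsesql` writer is only available inside a Synapse Spark pool:

```python
# Hedged sketch of the "Using code" steps: read the Cosmos DB analytical store
# via an assumed Synapse Link linked service, build simple statistics, and write
# them to a dedicated SQL pool. Uses the `spark` session that Synapse notebooks
# provide; all linked-service, container, pool, and table names are made up.
from pyspark.sql import functions as F

# 1. Read from Cosmos DB through the analytical store (Synapse Link).
sales = (spark.read
         .format("cosmos.olap")
         .option("spark.synapse.linkedService", "CosmosDbSalesLinkedService")
         .option("spark.cosmos.container", "Sales")
         .load())

# 2. Create statistics about the data.
stats = (sales.groupBy("productCategory")
              .agg(F.count("*").alias("order_count"),
                   F.sum("totalAmount").alias("revenue")))

# 3. Store the results in the dedicated SQL pool via the Synapse Spark connector.
stats.write.synapsesql("SalesPool.dbo.CategoryStats")
```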
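And a hedged sketch of the serverless SQL pool querying a CSV file in Azure Storage with OPENROWSET, driven from Python over pyodbc; the workspace endpoint, login, and file path are assumptions:

```python
# Hedged sketch: run a serverless SQL pool query over a CSV file in Azure Storage
# from Python via pyodbc. The workspace endpoint, login, and file path are
# assumptions; the pool bills only for the data the query actually scans.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=example-workspace-ondemand.sql.azuresynapse.net;"   # serverless endpoint
    "Database=master;Uid=sqladminuser;Pwd=<password>;Encrypt=yes;"
)

# OPENROWSET reads the file in place -- no table or loading step needed.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://examplelake.dfs.core.windows.net/raw/sales/2024/*.csv',
    FORMAT = 'CSV',
    PARSER_VERSION = '2.0',
    HEADER_ROW = TRUE
) AS sales;
"""

for row in conn.cursor().execute(query):
    print(row)
```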