39. AWS Glue

Overview

  • AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams.
  • AWS Glue consists of a central metadata repository known as the AWS Glue Data Catalog, an ETL engine that automatically generates Python or Scala code, and a flexible scheduler that handles dependency resolution, job monitoring, and retries.
  • AWS Glue is serverless, so there’s no infrastructure to set up or manage.
  • AWS Glue is designed to work with semi-structured data.
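
As a quick way to explore the Data Catalog mentioned above, the following is a minimal sketch using boto3. The databases and tables it prints depend entirely on your account; get_databases and get_tables are standard boto3 Glue client operations, and pagination is omitted for brevity.

    import boto3

    glue = boto3.client("glue")

    # Walk the Data Catalog: list every database and the tables it contains.
    for database in glue.get_databases()["DatabaseList"]:
        print("Database:", database["Name"])
        for table in glue.get_tables(DatabaseName=database["Name"])["TableList"]:
            location = table.get("StorageDescriptor", {}).get("Location", "n/a")
            print("  Table:", table["Name"], "->", location)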

When Should You Use AWS Glue

  • You can use AWS Glue to organize, cleanse, validate, and format data for storage in a data warehouse or data lake.
  • You can use AWS Glue when you run serverless queries against your Amazon S3 data lake.
    • AWS Glue can catalog your Amazon Simple Storage Service (Amazon S3) data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum.
    • With crawlers, your metadata stays in sync with the underlying data.
    • Athena and Redshift Spectrum can directly query your Amazon S3 data lake using the AWS Glue Data Catalog (see the Athena sketch after this list).
  • You can create event-driven ETL pipelines with AWS Glue. You can run your ETL jobs as soon as new data becomes available in Amazon S3 by invoking your AWS Glue ETL jobs from an AWS Lambda function (see the Lambda sketch after this list). You can also register this new dataset in the AWS Glue Data Catalog as part of your ETL jobs.
  • You can use AWS Glue to understand your data assets. You can store your data using various AWS services and still maintain a unified view of your data using the AWS Glue Data Catalog.
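
To make the serverless-query point above concrete, here is a minimal sketch that queries an S3-backed table through Athena, with the table schema resolved from the Glue Data Catalog. The database name (sales_db), table name (raw_orders), and result bucket are hypothetical; start_query_execution is the standard boto3 Athena call.

    import boto3

    athena = boto3.client("athena")

    # Run a serverless SQL query against S3 data; the table schema is resolved
    # from the AWS Glue Data Catalog, so no data loading step is needed.
    response = athena.start_query_execution(
        QueryString="SELECT order_id, amount FROM raw_orders LIMIT 10",
        QueryExecutionContext={"Database": "sales_db"},  # a Data Catalog database
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    print("Started Athena query:", response["QueryExecutionId"])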
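
The event-driven pattern above can be sketched as an AWS Lambda handler that starts a Glue job whenever S3 reports a new object. The job name (example-etl-job) and the --source_path argument are hypothetical placeholders; start_job_run is the standard boto3 Glue call.

    import boto3

    glue = boto3.client("glue")

    def lambda_handler(event, context):
        # Triggered by an S3 ObjectCreated notification: start one Glue job run
        # per new object and pass its location to the job as an argument.
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            run = glue.start_job_run(
                JobName="example-etl-job",  # hypothetical job name
                Arguments={"--source_path": f"s3://{bucket}/{key}"},
            )
            print(f"Started job run {run['JobRunId']} for s3://{bucket}/{key}")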

Architecture

The basic concepts are populating your Data Catalog and processing an ETL dataflow in AWS Glue.

  • You define jobs in AWS Glue to accomplish the work that's required to extract, transform, and load (ETL) data from a data source to a data target. You typically perform the following actions:
    • For data store sources, you define a crawler to populate your AWS Glue Data Catalog with metadata table definitions. You point your crawler at a data store, and the crawler creates table definitions in the Data Catalog (see the crawler sketch after this list).
    • For streaming sources, you manually define Data Catalog tables and specify data stream properties.
    • In addition to table definitions, the AWS Glue Data Catalog contains other metadata that is required to define ETL jobs. You use this metadata when you define a job to transform your data.
    • AWS Glue can generate a script to transform your data, or you can provide your own script in the AWS Glue console.
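
A generated PySpark script typically follows the shape sketched below: read a DynamicFrame from a Data Catalog table, apply a mapping, and write the result out. This is a minimal sketch; the database, table, column names, and output path are assumptions, while GlueContext, ApplyMapping, and Job come from the real awsglue ETL library.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Extract: read the source table that a crawler registered in the Data Catalog.
    source = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="raw_orders")  # hypothetical names

    # Transform: rename and cast columns with a simple mapping.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[("order_id", "string", "order_id", "string"),
                  ("amount", "string", "amount", "double")])

    # Load: write the curated data back to S3 as Parquet.
    glueContext.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/orders/"},
        format="parquet")

    job.commit()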
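
For the crawler step mentioned earlier in this list, the sketch below creates and starts a crawler over an S3 prefix with boto3. The crawler name, IAM role ARN, database, and path are placeholders; create_crawler and start_crawler are standard Glue API operations.

    import boto3

    glue = boto3.client("glue")

    # Define a crawler that scans an S3 prefix and writes table definitions
    # into the Data Catalog database "sales_db".
    glue.create_crawler(
        Name="raw-orders-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumed IAM role
        DatabaseName="sales_db",
        Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/orders/"}]},
    )

    # Run it once; a schedule can be attached so the metadata stays in sync.
    glue.start_crawler(Name="raw-orders-crawler")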