AMAZON Redshift(1)Introduction

本文介绍了使用 Amazon Redshift 构建数据仓库的最佳实践,包括如何利用 Python 和 SQL 进行数据操作,如何设计和加载数据到 Redshift,以及如何进行高效的数据查询。此外还涉及了 NumPy、SciPy 和 Pandas 等 Python 库的应用。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

AMAZON Redshift(1)Introduction

Python is well used here with SQL.
Normal SQL
select regex_replace(url, ‘(https?)://([^@]*@)?([^:/]*)([/:].*)$)’, ‘\3’) FROM table;

===>
Python and SQL
create function f_hostname(url VARCHAR) returns archer Immutable as
$$ import url parse.urlparse(url).hostname $$
LANGUAGE plpython;

select f_hostname(url) FROM table;

NumPy SciPy: math tool
Pandas: SQL operation on top of SciPy and NumPy
Dateutil and Pytz: Date and Timezone

http://www.numpy.org/

http://scipy.org/about.html

http://pandas.pydata.org/

https://dateutil.readthedocs.org/en/latest/

https://pypi.python.org/pypi/pytz/

Data Warehouse System Architecture
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_high_level_system_architecture.html

Industry-standard PostgreSQL JDBC and ODBC driver.

Leader node —> compile codes and distribute the compiled code to the compute nodes, assigns a portion of the data to each compute node

Compute nodes —> 160 GB node

Load data from S3 into Redshift
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_Loading-data-from-S3.html

Copy Command to Load the Data
copy <table_name> from ‘s3://<bucket_name>/<object_prefix>'
credentials ‘<aws-auth-args>’;
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_loading-tables-from-s3.html

http://docs.aws.amazon.com/zh_cn/datapipeline/latest/DeveloperGuide/dp-copydata-redshift.html

Work on the DB
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_deleting_redshift_user_cmd.html

How to Design the Table
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_designing-tables-best-practices.html

http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_Creating_tables.html

How to Load Data
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_loading-data-best-practices.html

How to Query Data
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_designing-queries-best-practices.html

DataBase Admin’s Command
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/t_querying_redshift_system_tables.html

Table Design
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. - timestamp

If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. - range or equality

If you frequently join a table, specify the join column as both the sort key and the distribution key.

References:
http://docs.aws.amazon.com/zh_cn/redshift/latest/dg/c_redshift_system_overview.html

https://aws.amazon.com/cn/documentation/redshift/
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值