这部分主要讲解关于数据清理方面的知识,涉及到如下函数的详细讲解:
LEFT 、RIGHT、LENGTH 函数
POSITION、STRPOS、SUBSTR函数
CONCAT函数
CAST函数
COALESCE函数
3 练习LEFT RIGHT LENGTH函数
1.
SELECT RIGHT(website,3) AS domain, COUNT(*) num_types
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
#结果返回
domain num_types
com 349
net 1
org 1
2.
SELECT LEFT(name,1) AS first_char, COUNT(*) num_types
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
# 结果返回
first_char num_types
C 37
A 37
P 27
M 22
S 17
T 17
D 17
B 16
L 16
E 15
N 15
H 15
G 14
U 13
W 12
F 12
R 8
J 7
K 7
O 7
I 7
V 7
X 2
3 1
Y 1
e 1
Q 1
3.
SELECT SUM(num) nums, SUM(letter) letters
FROM (SELECT name, CASE WHEN LEFT(UPPER(name), 1) IN ('0','1','2','3','4','5','6','7','8','9')
THEN 1 ELSE 0 END AS num,
CASE WHEN LEFT(UPPER(name), 1) IN ('0','1','2','3','4','5','6','7','8','9')
THEN 0 ELSE 1 END AS letter
FROM accounts) t1;
# 结果返回
nums letters
1 350
4.
SELECT SUM(vowel) num_vowel,SUM(non_vowel) num_non
FROM (
SELECT name,CASE WHEN LEFT(upper(name),1) IN('A','E','I','O','U')THEN 1 ELSE 0 END AS vowel,
CASE WHEN LEFT(upper(name),1) IN('A','E','I','O','U')THEN 0 ELSE 1 END AS non_vowel
FROM accounts)t1;
# 结果 ### 运用子查询,在运用CASE语句的时候,注意不要忘记了END
num_vowel num_non
80 271
官网给出的答案
LEFT & RIGHT 解决方案
1.
SELECT RIGHT(website, 3) AS domain, COUNT(*) num_companies
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
2.
SELECT LEFT(UPPER(name), 1) AS first_letter, COUNT(*) num_companies
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
3.
SELECT SUM(num) nums, SUM(letter) letters
FROM (SELECT name, CASE WHEN LEFT(UPPER(name), 1) IN ('0','1','2','3','4','5','6','7','8','9')
THEN 1 ELSE 0 END AS num,
CASE WHEN LEFT(UPPER(name), 1) IN ('0','1','2','3','4','5','6','7','8','9')
THEN 0 ELSE 1 END AS letter
FROM accounts) t1;
有 350 个公司名称以字母开头,1 个公司以数字开头。因此有 350/351 的公司名称以字母开头,即百分比是 99.7%。
4.
SELECT SUM(vowels) vowels, SUM(other) other
FROM (SELECT name, CASE WHEN LEFT(UPPER(name), 1) IN ('A','E','I','O','U')
THEN 1 ELSE 0 END AS vowels,
CASE WHEN LEFT(UPPER(name), 1) IN ('A','E','I','O','U')
THEN 0 ELSE 1 END AS other
FROM accounts) t1;
有 80 的公司名称以元音开头,271 的公司以其他音节开头。元音的比例是 80/351,或 22.8%。因此,有 77.2% 的公司名称没有以元音开头。
6.练习POSITION STRPOS UPPER LOWER函数的运用
练习:POSITION 和 STRPOS
对于以下练习,需要用到 LEFT 和 RIGHT 以及 POSITION 或 STRPOS 知识。
使用 accounts 表格创建一个名字和姓氏列,用于存储 primary_poc 的名字和姓氏。
SELECT primary_poc,
LEFT(primary_poc,POSITION(' 'IN primary_poc)-1) as last_name,
RIGHT(primary_poc,LENGTH(primary_poc)-POSITION(' 'IN primary_poc)) as first_name
FROM accounts;
现在创建一个包含 sales_rep 表格中每个销售代表姓名的列,同样,需要提供名字和姓氏列。
SELECT name,
LEFT(name,POSITION(' 'IN name)-1) as last_name,
RIGHT(name,LENGTH(name)-POSITION(' 'IN name)) as first_name
FROM sales_reps;
# 结果返回
name last_name first_name
Samuel Racine Samuel Racine
Eugena Esser Eugena Esser
Michel Averette Michel Averette
Renetta Carew Renetta Carew
Cara Clarke Cara Clarke
Lavera Oles Lavera Oles
Elba Felder Elba Felder
………………
# 官方给出的答案:
SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name,
RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name
FROM accounts;
SELECT LEFT(name, STRPOS(name, ' ') -1 ) first_name,
RIGHT(name, LENGTH(name) - STRPOS(name, ' ')) last_name
FROM sales_reps;
9.l练习CONCAT 函数
练习:CONCAT
accounts 表格中的每个客户都想为每个 primary_poc 创建一个电子邮箱。邮箱应该是 primary_poc 的名字.primary_poc的姓氏@公司名称.com。
WITH t1 AS(
SELECT name,LEFT(primary_poc, POSITION(' 'IN primary_poc)-1) AS last_name,
RIGHT(primary_poc,LENGTH(primary_poc)-POSITION(' 'IN primary_poc))AS first_name
FROM accounts)
SELECT name,first_name,last_name,CONCAT(first_name,'.',last_name,'@',name,'.com')
FROM t1;
你可能注意到了,在上一个答案中,有些公司名称存在空格,肯定不适合作为邮箱地址。看看你能否通过删掉客户名称中的所有空格来创建合适的邮箱地址,否则你的答案就和问题 1. 的一样。此处是一些实用的文档。
WITH t1 AS(
SELECT name,LEFT(primary_poc, POSITION(' 'IN primary_poc)-1) AS last_name,
RIGHT(primary_poc,LENGTH(primary_poc)-POSITION(' 'IN primary_poc))AS first_name
FROM accounts)
SELECT name,first_name,last_name,CONCAT(first_name,'.',last_name,'@',REPLACE(name,' ',''),'.com')
FROM t1;
我们还需要创建初始密码,在用户第一次登录时将更改。初始密码将是 primary_poc 的名字的第一个字母(小写),然后依次是名字的最后一个字母(小写)、姓氏的第一个字母(小写)、姓氏的最后一个字母(小写)、名字的字母数量、姓氏的字母数量,然后是合作的公司名称(全大写,没有空格)
WITH t1 AS(
SELECT name,LEFT(primary_poc, POSITION(' 'IN primary_poc)-1) AS last_name,
RIGHT(primary_poc,LENGTH(primary_poc)-POSITION(' 'IN primary_poc))AS first_name
FROM accounts)
SELECT name, first_name, last_name, CONCAT(first_name,'.',last_name,'@',REPLACE(name,' ',''),'.com') email,
CONCAT(lower(LEFT(last_name,1)),lower(RIGHT(last_name,1)),lower(LEFT(first_name,1)),lower(RIGHT(first_name,1)),LENGTH(last_name),LENGTH(first_name),REPLACE(upper(name),' ','')) AS password
FROM t1;
# 结果 ;可以利用这样的方式,给新加入公司的所有人定义一个初始密码。
name first_name last_name email password
Walmart Tuma Tamara Tuma.Tamara@Walmart.com tata64WALMART
Exxon Mobil Shields Sung Shields.Sung@ExxonMobil.com sgss47EXXONMOBIL
Apple Lupo Jodee Lupo.Jodee@Apple.com jelo54APPLE
Berkshire Hathaway Banda Serafina Banda.Serafina@BerkshireHathaway.com saba85BERKSHIREHATHAWAY
McKesson Crusoe Angeles Crusoe.Angeles@McKesson.com asce76MCKESSON
UnitedHealth Group Gayman Savanna Gayman.Savanna@UnitedHealthGroup.com sagn76UNITEDHEALTHGROUP
CVS Health Haskell Anabel Haskell.Anabel@CVSHealth.com alhl67CVSHEALTH
………………
官网给出的答案::
CONCAT 解决方案
-
WITH t1 AS ( SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name, RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name FROM accounts) SELECT first_name, last_name, CONCAT(first_name, '.', last_name, '@', name, '.com') FROM t1;
-
WITH t1 AS ( SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name, RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name FROM accounts) SELECT first_name, last_name, CONCAT(first_name, '.', last_name, '@', REPLACE(name, ' ', ''), '.com') FROM t1;
3.
WITH t1 AS (
SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name, RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name
FROM accounts)
SELECT first_name, last_name, CONCAT(first_name, '.', last_name, '@', name, '.com'), LEFT(LOWER(first_name), 1) || RIGHT(LOWER(first_name), 1) || LEFT(LOWER(last_name), 1) || RIGHT(LOWER(last_name), 1) || LENGTH(first_name) || LENGTH(last_name) || REPLACE(UPPER(name), ' ', '')
FROM t1;
12 练习CAST函数
WITH T1 AS(SELECT date,RIGHT(LEFT(date,POSITION(' ' IN date)-1),4) as year,
REPLACE(LEFT(LEFT(date,POSITION(' ' IN date)-1),5),'/','-') as month_day
FROM sf_crime_data)
SELECT date,(year||'-'||LEFT(month_day,2)||'-'||RIGHT(month_day,2))::date AS date_formatted
FROM t1
LIMIT 10;
# 结果返回
date date_formatted
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
官网给的答案:
-
SELECT * FROM sf_crime_data LIMIT 10;
- yyyy-mm-dd
date
列的格式是 mm/dd/yyyy,日期末尾的时间也不正确。
-
SELECT date orig_date, (SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2)) new_date FROM sf_crime_data; # 结果 orig_date new_date 01/31/2014 08:00:00 AM +0000 2014-01-31 01/31/2014 08:00:00 AM +0000 2014-01-31 01/31/2014 08:00:00 AM +0000 2014-01-31 01/31/2014 08:00:00 AM +0000 2014-01-31 01/31/2014 08:00:00 AM +0000 2014-01-31 01/31/2014 08:00:00 AM +0000 2014-01-31 01/31/2014 08:00:00 AM +0000 2014-01-31 01/31/2014 08:00:00 AM +0000 2014-01-31 01/31/2014 08:00:00 AM +0000 2014-01-31 01/31/2014 08:00:00 AM +0000 2014-01-31
- 注意,可以像之前的课程中一样,使用 DATE_TRUNC 和 DATE_PART 处理这一新的日期。
SELECT date orig_date, (SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2))::DATE new_date
FROM sf_crime_data;
# 结果返回
orig_date new_date
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000 2014-01-31T00:00:00.000Z
15.练习 COALESCE函数
1.
SELECT *
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;
#结果
id name website lat long primary_poc sales_rep_id account_id occurred_at standard_qty gloss_qty poster_qty total standard_amt_usd gloss_amt_usd poster_amt_usd total_amt_usd
Goldman Sachs Group www.gs.com 40.75744399 -73.96730918 Loris Manfredi 321690
2.
SELECT COALESCE(a.id, a.id) filled_id, a.name, a.website, a.lat, a.long, a.primary_poc, a.sales_rep_id, o.*
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;
#结果返回
filled_id name website lat long primary_poc sales_rep_id id account_id occurred_at standard_qty gloss_qty poster_qty total standard_amt_usd gloss_amt_usd poster_amt_usd total_amt_usd
1731 Goldman Sachs Group www.gs.com 40.75744399 -73.96730918 Loris Manfredi 321690
3.
SELECT COALESCE(a.id, a.id) filled_id, a.name, a.website, a.lat, a.long, a.primary_poc, a.sales_rep_id, COALESCE(o.account_id, a.id),o.occurred_at,o.standard_qty,o.gloss_qty,o.poster_qty,o.total,o.standard_amt_usd,o.gloss_amt_usd,o.poster_amt_usd,o.total_amt_usd
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;
#结果返回
filled_id name website lat long primary_poc sales_rep_id coalesce occurred_at standard_qty gloss_qty poster_qty total standard_amt_usd gloss_amt_usd poster_amt_usd total_amt_usd
1731 Goldman Sachs Group www.gs.com 40.75744399 -73.96730918 Loris Manfredi 321690 1731
4.
SELECT COALESCE(a.id, a.id) filled_id, a.name, a.website, a.lat, a.long, a.primary_poc, a.sales_rep_id, COALESCE(o.account_id, a.id) account_id, o.occurred_at, COALESCE(o.standard_qty, 0) standard_qty, COALESCE(o.gloss_qty,0) gloss_qty, COALESCE(o.poster_qty,0) poster_qty, COALESCE(o.total,0) total, COALESCE(o.standard_amt_usd,0) standard_amt_usd, COALESCE(o.gloss_amt_usd,0) gloss_amt_usd, COALESCE(o.poster_amt_usd,0) poster_amt_usd, COALESCE(o.total_amt_usd,0) total_amt_usd
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;
# 结果返回
filled_id name website lat long primary_poc sales_rep_id account_id occurred_at standard_qty gloss_qty poster_qty total standard_amt_usd gloss_amt_usd poster_amt_usd total_amt_usd
1731 Goldman Sachs Group www.gs.com 40.75744399 -73.96730918 Loris Manfredi 321690 1731 0 0 0 0 0 0 0 0
总结:
本部分主要针对数据的清理,主要讲解了以下部分的知识:
1)清理和重新整理混乱的数据
2)将列转换成不同的数据类型
3)处理NULL的技巧
在该部分也学习了以下的常用有效函数:
1)LEFT、RIGHT、LENGTH函数分别用于从左边去字符,从右边取字符,计算字符串的长度
2)POSITION、STRPOS、SUBSTR函数,前两者的功能一样,但是函数的用法形式不太一样,最后一个函数,可以实现从第几个位置开始取,一共取几个字符的功能实现
3)CONACT函数,实现连接字符串
4)CAST函数与:DATE方法均能实现,将字符串形式的日期转换为日期格式的日期。
5)COALESCE用于处理NULL的行,对于相应是NULL的行进行值填充。