SQL学习之-2.9 数据清理相关知识

这部分主要讲解关于数据清理方面的知识,涉及到如下函数的详细讲解:

LEFT 、RIGHT、LENGTH 函数

POSITION、STRPOS、SUBSTR函数

CONCAT函数

CAST函数

COALESCE函数

3 练习LEFT RIGHT LENGTH函数

1.
SELECT RIGHT(website,3) AS domain, COUNT(*) num_types
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
#结果返回
domain	num_types
com	349
net	1
org	1

2.
SELECT LEFT(name,1) AS first_char, COUNT(*) num_types
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
# 结果返回
first_char	num_types
C	37
A	37
P	27
M	22
S	17
T	17
D	17
B	16
L	16
E	15
N	15
H	15
G	14
U	13
W	12
F	12
R	8
J	7
K	7
O	7
I	7
V	7
X	2
3	1
Y	1
e	1
Q	1

3.
SELECT SUM(num) nums, SUM(letter) letters
FROM (SELECT name, CASE WHEN LEFT(UPPER(name), 1) IN ('0','1','2','3','4','5','6','7','8','9') 
                       THEN 1 ELSE 0 END AS num, 
         CASE WHEN LEFT(UPPER(name), 1) IN ('0','1','2','3','4','5','6','7','8','9') 
                       THEN 0 ELSE 1 END AS letter
      FROM accounts) t1;
# 结果返回
nums	letters
1	350

4.
SELECT SUM(vowel) num_vowel,SUM(non_vowel) num_non
FROM (
SELECT name,CASE WHEN LEFT(upper(name),1) IN('A','E','I','O','U')THEN 1 ELSE 0 END AS vowel,
CASE WHEN LEFT(upper(name),1) IN('A','E','I','O','U')THEN 0 ELSE 1 END AS non_vowel
FROM accounts)t1;
# 结果  ### 运用子查询,在运用CASE语句的时候,注意不要忘记了END 
num_vowel	num_non
80	271

官网给出的答案

LEFT & RIGHT 解决方案
1.
SELECT RIGHT(website, 3) AS domain, COUNT(*) num_companies
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
2.
SELECT LEFT(UPPER(name), 1) AS first_letter, COUNT(*) num_companies
FROM accounts
GROUP BY 1
ORDER BY 2 DESC;
3.
SELECT SUM(num) nums, SUM(letter) letters
FROM (SELECT name, CASE WHEN LEFT(UPPER(name), 1) IN ('0','1','2','3','4','5','6','7','8','9') 
                       THEN 1 ELSE 0 END AS num, 
         CASE WHEN LEFT(UPPER(name), 1) IN ('0','1','2','3','4','5','6','7','8','9') 
                       THEN 0 ELSE 1 END AS letter
      FROM accounts) t1;
有 350 个公司名称以字母开头,1 个公司以数字开头。因此有 350/351 的公司名称以字母开头,即百分比是 99.7%。
4.
SELECT SUM(vowels) vowels, SUM(other) other
FROM (SELECT name, CASE WHEN LEFT(UPPER(name), 1) IN ('A','E','I','O','U') 
                        THEN 1 ELSE 0 END AS vowels, 
          CASE WHEN LEFT(UPPER(name), 1) IN ('A','E','I','O','U') 
                       THEN 0 ELSE 1 END AS other
         FROM accounts) t1;
有 80 的公司名称以元音开头,271 的公司以其他音节开头。元音的比例是 80/351,或 22.8%。因此,有 77.2% 的公司名称没有以元音开头。

6.练习POSITION STRPOS UPPER LOWER函数的运用

练习:POSITION 和 STRPOS
对于以下练习,需要用到 LEFT 和 RIGHT 以及 POSITION 或 STRPOS 知识。

使用 accounts 表格创建一个名字和姓氏列,用于存储 primary_poc 的名字和姓氏。
SELECT primary_poc,
LEFT(primary_poc,POSITION(' 'IN primary_poc)-1) as last_name,
RIGHT(primary_poc,LENGTH(primary_poc)-POSITION(' 'IN primary_poc)) as first_name
FROM accounts;


现在创建一个包含 sales_rep 表格中每个销售代表姓名的列,同样,需要提供名字和姓氏列。
SELECT name,
LEFT(name,POSITION(' 'IN name)-1) as last_name,
RIGHT(name,LENGTH(name)-POSITION(' 'IN name)) as first_name
FROM sales_reps;
# 结果返回
name	last_name	first_name
Samuel Racine	Samuel	Racine
Eugena Esser	Eugena	Esser
Michel Averette	Michel	Averette
Renetta Carew	Renetta	Carew
Cara Clarke	Cara	Clarke
Lavera Oles	Lavera	Oles
Elba Felder	Elba	Felder
………………


# 官方给出的答案:
SELECT LEFT(primary_poc, STRPOS(primary_poc, ' ') -1 ) first_name, 
RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name
FROM accounts;

SELECT LEFT(name, STRPOS(name, ' ') -1 ) first_name, 
       RIGHT(name, LENGTH(name) - STRPOS(name, ' ')) last_name
FROM sales_reps;

9.l练习CONCAT 函数

练习:CONCAT
accounts 表格中的每个客户都想为每个 primary_poc 创建一个电子邮箱。邮箱应该是 primary_poc 的名字.primary_poc的姓氏@公司名称.com。
WITH t1 AS(
  SELECT name,LEFT(primary_poc, POSITION(' 'IN primary_poc)-1) AS last_name,
  RIGHT(primary_poc,LENGTH(primary_poc)-POSITION(' 'IN primary_poc))AS first_name
FROM accounts)

SELECT name,first_name,last_name,CONCAT(first_name,'.',last_name,'@',name,'.com')
FROM t1;

你可能注意到了,在上一个答案中,有些公司名称存在空格,肯定不适合作为邮箱地址。看看你能否通过删掉客户名称中的所有空格来创建合适的邮箱地址,否则你的答案就和问题 1. 的一样。此处是一些实用的文档。
WITH t1 AS(
  SELECT name,LEFT(primary_poc, POSITION(' 'IN primary_poc)-1) AS last_name,
  RIGHT(primary_poc,LENGTH(primary_poc)-POSITION(' 'IN primary_poc))AS first_name
FROM accounts)

SELECT name,first_name,last_name,CONCAT(first_name,'.',last_name,'@',REPLACE(name,' ',''),'.com')
FROM t1;


我们还需要创建初始密码,在用户第一次登录时将更改。初始密码将是 primary_poc 的名字的第一个字母(小写),然后依次是名字的最后一个字母(小写)、姓氏的第一个字母(小写)、姓氏的最后一个字母(小写)、名字的字母数量、姓氏的字母数量,然后是合作的公司名称(全大写,没有空格)

WITH t1 AS(
  SELECT name,LEFT(primary_poc, POSITION(' 'IN primary_poc)-1) AS last_name,
  RIGHT(primary_poc,LENGTH(primary_poc)-POSITION(' 'IN primary_poc))AS first_name
FROM accounts)

SELECT name, first_name, last_name, CONCAT(first_name,'.',last_name,'@',REPLACE(name,' ',''),'.com') email,
CONCAT(lower(LEFT(last_name,1)),lower(RIGHT(last_name,1)),lower(LEFT(first_name,1)),lower(RIGHT(first_name,1)),LENGTH(last_name),LENGTH(first_name),REPLACE(upper(name),' ','')) AS password
FROM t1;
# 结果 ;可以利用这样的方式,给新加入公司的所有人定义一个初始密码。
name	first_name	last_name	email	password
Walmart	Tuma	Tamara	Tuma.Tamara@Walmart.com	tata64WALMART
Exxon Mobil	Shields	Sung	Shields.Sung@ExxonMobil.com	sgss47EXXONMOBIL
Apple	Lupo	Jodee	Lupo.Jodee@Apple.com	jelo54APPLE
Berkshire Hathaway	Banda	Serafina	Banda.Serafina@BerkshireHathaway.com	saba85BERKSHIREHATHAWAY
McKesson	Crusoe	Angeles	Crusoe.Angeles@McKesson.com	asce76MCKESSON
UnitedHealth Group	Gayman	Savanna	Gayman.Savanna@UnitedHealthGroup.com	sagn76UNITEDHEALTHGROUP
CVS Health	Haskell	Anabel	Haskell.Anabel@CVSHealth.com	alhl67CVSHEALTH
………………

去除公司名称里的空格

官网给出的答案::

CONCAT 解决方案

  1. WITH t1 AS (
     SELECT LEFT(primary_poc,     STRPOS(primary_poc, ' ') -1 ) first_name,  RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name
     FROM accounts)
    SELECT first_name, last_name, CONCAT(first_name, '.', last_name, '@', name, '.com')
    FROM t1;
    
  2. WITH t1 AS (
     SELECT LEFT(primary_poc,     STRPOS(primary_poc, ' ') -1 ) first_name,  RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name
     FROM accounts)
    SELECT first_name, last_name, CONCAT(first_name, '.', last_name, '@', REPLACE(name, ' ', ''), '.com')
    FROM  t1;
    

 

3.

WITH t1 AS (
 SELECT LEFT(primary_poc,     STRPOS(primary_poc, ' ') -1 ) first_name,  RIGHT(primary_poc, LENGTH(primary_poc) - STRPOS(primary_poc, ' ')) last_name, name
 FROM accounts)
SELECT first_name, last_name, CONCAT(first_name, '.', last_name, '@', name, '.com'), LEFT(LOWER(first_name), 1) || RIGHT(LOWER(first_name), 1) || LEFT(LOWER(last_name), 1) || RIGHT(LOWER(last_name), 1) || LENGTH(first_name) || LENGTH(last_name) || REPLACE(UPPER(name), ' ', '')
FROM t1;

 

12 练习CAST函数

WITH T1 AS(SELECT date,RIGHT(LEFT(date,POSITION(' ' IN date)-1),4) as year,
REPLACE(LEFT(LEFT(date,POSITION(' ' IN date)-1),5),'/','-') as month_day
FROM sf_crime_data)

SELECT date,(year||'-'||LEFT(month_day,2)||'-'||RIGHT(month_day,2))::date AS date_formatted
FROM t1
LIMIT 10;

# 结果返回
date	date_formatted
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z

官网给的答案:

  1. SELECT *
    FROM sf_crime_data
    LIMIT 10;
    
  2. yyyy-mm-dd
     
  3. date 列的格式是 mm/dd/yyyy,日期末尾的时间也不正确。
     
  4. SELECT date orig_date, (SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2)) new_date
    FROM sf_crime_data;
    
    # 结果
    orig_date	new_date
    01/31/2014 08:00:00 AM +0000	2014-01-31
    01/31/2014 08:00:00 AM +0000	2014-01-31
    01/31/2014 08:00:00 AM +0000	2014-01-31
    01/31/2014 08:00:00 AM +0000	2014-01-31
    01/31/2014 08:00:00 AM +0000	2014-01-31
    01/31/2014 08:00:00 AM +0000	2014-01-31
    01/31/2014 08:00:00 AM +0000	2014-01-31
    01/31/2014 08:00:00 AM +0000	2014-01-31
    01/31/2014 08:00:00 AM +0000	2014-01-31
    01/31/2014 08:00:00 AM +0000	2014-01-31
    
  5. 注意,可以像之前的课程中一样,使用 DATE_TRUNC 和 DATE_PART 处理这一新的日期。
SELECT date orig_date, (SUBSTR(date, 7, 4) || '-' || LEFT(date, 2) || '-' || SUBSTR(date, 4, 2))::DATE new_date
FROM sf_crime_data;

# 结果返回
orig_date	new_date
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z
01/31/2014 08:00:00 AM +0000	2014-01-31T00:00:00.000Z

15.练习 COALESCE函数

1.
SELECT *
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL; 
#结果
id	name	website	lat	long	primary_poc	sales_rep_id	account_id	occurred_at	standard_qty	gloss_qty	poster_qty	total	standard_amt_usd	gloss_amt_usd	poster_amt_usd	total_amt_usd
Goldman Sachs Group	www.gs.com	40.75744399	-73.96730918	Loris Manfredi	321690		

2.
SELECT COALESCE(a.id, a.id) filled_id, a.name, a.website, a.lat, a.long, a.primary_poc, a.sales_rep_id, o.*
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;
#结果返回
filled_id	name	website	lat	long	primary_poc	sales_rep_id	id	account_id	occurred_at	standard_qty	gloss_qty	poster_qty	total	standard_amt_usd	gloss_amt_usd	poster_amt_usd	total_amt_usd
1731	Goldman Sachs Group	www.gs.com	40.75744399	-73.96730918	Loris Manfredi	321690											

3.
SELECT COALESCE(a.id, a.id) filled_id, a.name, a.website, a.lat, a.long, a.primary_poc, a.sales_rep_id, COALESCE(o.account_id, a.id),o.occurred_at,o.standard_qty,o.gloss_qty,o.poster_qty,o.total,o.standard_amt_usd,o.gloss_amt_usd,o.poster_amt_usd,o.total_amt_usd
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;
#结果返回
filled_id	name	website	lat	long	primary_poc	sales_rep_id	coalesce	occurred_at	standard_qty	gloss_qty	poster_qty	total	standard_amt_usd	gloss_amt_usd	poster_amt_usd	total_amt_usd
1731	Goldman Sachs Group	www.gs.com	40.75744399	-73.96730918	Loris Manfredi	321690	1731									

4.
								
SELECT COALESCE(a.id, a.id) filled_id, a.name, a.website, a.lat, a.long, a.primary_poc, a.sales_rep_id, COALESCE(o.account_id, a.id) account_id, o.occurred_at, COALESCE(o.standard_qty, 0) standard_qty, COALESCE(o.gloss_qty,0) gloss_qty, COALESCE(o.poster_qty,0) poster_qty, COALESCE(o.total,0) total, COALESCE(o.standard_amt_usd,0) standard_amt_usd, COALESCE(o.gloss_amt_usd,0) gloss_amt_usd, COALESCE(o.poster_amt_usd,0) poster_amt_usd, COALESCE(o.total_amt_usd,0) total_amt_usd
FROM accounts a
LEFT JOIN orders o
ON a.id = o.account_id
WHERE o.total IS NULL;
# 结果返回
filled_id	name	website	lat	long	primary_poc	sales_rep_id	account_id	occurred_at	standard_qty	gloss_qty	poster_qty	total	standard_amt_usd	gloss_amt_usd	poster_amt_usd	total_amt_usd
1731	Goldman Sachs Group	www.gs.com	40.75744399	-73.96730918	Loris Manfredi	321690	1731		0	0	0	0	0	0	0	0

 

总结:

本部分主要针对数据的清理,主要讲解了以下部分的知识:

1)清理和重新整理混乱的数据

2)将列转换成不同的数据类型

3)处理NULL的技巧

 

在该部分也学习了以下的常用有效函数:

1)LEFT、RIGHT、LENGTH函数分别用于从左边去字符,从右边取字符,计算字符串的长度

2)POSITION、STRPOS、SUBSTR函数,前两者的功能一样,但是函数的用法形式不太一样,最后一个函数,可以实现从第几个位置开始取,一共取几个字符的功能实现

3)CONACT函数,实现连接字符串

4)CAST函数与:DATE方法均能实现,将字符串形式的日期转换为日期格式的日期。

5)COALESCE用于处理NULL的行,对于相应是NULL的行进行值填充。

 

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值