Udacity Data Wrangling Project 3: wrangle_act

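This is my write-up for Udacity's Project 3.

Most of the reasoning and explanation lives in the comments inside the code blocks (the text after each #), so it can be easy to miss and is worth reading closely.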

# Import the packages that may be needed
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import json
import requests
from pprint import pprint
import re

1. Gather

# Download the files programmatically: 1) image-predictions.tsv
url = "https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv"
response = requests.get(url)
with open(url.split("/")[-1], mode="wb") as file:
    file.write(response.content)   # write the downloaded bytes to a local file
    print("Download complete!")
#!ls 
Download complete!
# Download the files programmatically: 2) twitter-archive-enhanced.csv
url2 = "https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/twitter-archive-enhanced.csv"

response2 = requests.get(url2)
with open(url2.split("/")[-1], mode="wb") as file:
    file.write(response2.content)   # write the downloaded bytes to a local file
    print("Download complete!")
#!ls   
Download complete!
# Download the files programmatically: 3) tweet_json.txt
url3 = "https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/tweet_json.txt"
response3 = requests.get(url3)
with open(url3.split("/")[-1], mode="wb") as file:
    file.write(response3.content)   # write the downloaded bytes to a local file
    print("Download complete!")
#!ls   # check that the files were downloaded into the working directory
Download complete!
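The three download cells above repeat the same steps, so they could be folded into one helper. The following is only a sketch: the names filename_from_url and download_file are hypothetical (they do not appear in the original notebook), and the raise_for_status() check and the timeout are additions that the original cells do not have.

```python
import os
import posixpath
from urllib.parse import unquote, urlsplit

import requests


def filename_from_url(url):
    """Return the decoded last path segment of a URL, ignoring any query string."""
    path = urlsplit(url).path
    return unquote(posixpath.basename(path))


def download_file(url, dest_dir=".", timeout=30):
    """Download url into dest_dir and return the local path (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # Fail loudly on 4xx/5xx instead of silently saving an error page to disk.
    response.raise_for_status()
    local_path = os.path.join(dest_dir, filename_from_url(url))
    with open(local_path, mode="wb") as file:
        file.write(response.content)
    return local_path
```

With this in place, a loop such as `for u in (url, url2, url3): download_file(u)` would replace the three separate cells.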

2. Assess

Read each file in turn and inspect the data.

# Read the files into DataFrames first
twitter_achieve = pd.read_csv("twitter-archive-enhanced.csv")
image_predictions = pd.read_csv("image-predictions.tsv", sep="\t")
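The third downloaded file, tweet_json.txt, is not read in the cell above. A minimal sketch of one way to load it follows, under the assumption that the file holds one JSON object per line and that each object has id, retweet_count and favorite_count keys. Neither the layout nor the field names are confirmed in this part of the notebook, and tweets_from_json_lines is a hypothetical helper.

```python
import json

import pandas as pd


def tweets_from_json_lines(lines, fields=("id", "retweet_count", "favorite_count")):
    """Build a DataFrame from an iterable of JSON lines, keeping only `fields`."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # .get() leaves a missing key as None instead of raising KeyError.
        records.append({field: tweet.get(field) for field in fields})
    return pd.DataFrame(records, columns=list(fields))


# Made-up sample lines standing in for the real file contents.
sample_lines = [
    '{"id": 1, "retweet_count": 5, "favorite_count": 9, "lang": "en"}',
    "",
    '{"id": 2, "retweet_count": 0, "favorite_count": 3}',
]
tweet_counts = tweets_from_json_lines(sample_lines)
```

For the real file, the same function could be called on an open file handle, for example `with open("tweet_json.txt", encoding="utf-8") as f: tweet_counts = tweets_from_json_lines(f)`.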

twitter_achieve[twitter_achieve['expanded_urls'].isnull()]
index | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo
30 | 886267009285017600 | 8.862664e+17 | 2.281182e+09 | 2017-07-15 16:51:35 +0000 | <a href="http://twitter.com/download/iphone" r... | @NonWhiteHat @MayhewMayhem omg hello tanner yo... | NaN | NaN | NaN | NaN | 12 | 10 | None | None | None | None | None
55 | 881633300179243008 | 8.816070e+17 | 4.738443e+07 | 2017-07-02 21:58:53 +0000 | <a href="http://twitter.com/download/iphone" r... | @roushfenway These are good dogs but 17/10 is ... | NaN | NaN | NaN | NaN | 17 | 10 | None | None | None | None | None
64 | 879674319642796034 | 8.795538e+17 | 3.105441e+09 | 2017-06-27 12:14:36 +0000 | <a href="http://twitter.com/download/iphone" r... | @RealKentMurphy 14/10 confirmed | NaN | NaN | NaN | NaN | 14 | 10 | None | None | None | None | None
113 | 870726314365509632 | 8.707262e+17 | 1.648776e+07 | 2017-06-02 19:38:25 +0000 | <a href="http://twitter.com/download/iphone" r... | @ComplicitOwl @ShopWeRateDogs &gt;10/10 is res... | NaN | NaN | NaN | NaN | 10 | 10 | None | None | None | None | None
148 | 863427515083354112 | 8.634256e+17 | 7.759620e+07 | 2017-05-13 16:15:35 +0000 | <a href="http://twitter.com/download/iphone" r... | @Jack_Septic_Eye I'd need a few more pics to p... | NaN | NaN | NaN | NaN | 12 | 10 | None | None | None | None | None
179 | 857214891891077121 | 8.571567e+17 | 1.806710e+08 | 2017-04-26 12:48:51 +0000 | <a href="http://twitter.com/download/iphone" r... | @Marc_IRL pixelated af 12/10 | NaN | NaN | NaN | NaN | 12 | 10 | None | None | None | None | None
185 | 856330835276025856 | NaN | NaN | 2017-04-24 02:15:55 +0000 | <a href="http://twitter.com/download/iphone" r... | RT @Jenna_Marbles: @dog_rates Thanks for ratin... | 8.563302e+17 | 66699013.0 | 2017-04-24 02:13:14 +0000 | NaN | 14 | 10 | None | None | None | None | None
186 | 856288084350160898 | 8.562860e+17 | 2.792810e+08 | 2017-04-23 23:26:03 +0000 | <a href="http://twitter.com/download/iphone" r... | @xianmcguire @Jenna_Marbles Kardashians wouldn... | NaN | NaN | NaN | NaN | 14 | 10 | None | None | None | None | None
188 | 855862651834028034 | 8.558616e+17 | 1.943518e+08 | 2017-04-22 19:15:32 +0000 | <a href="http://twitter.com/download/iphone" r... | @dhmontgomery We also gave snoop dogg a 420/10... | NaN | NaN | NaN | NaN | 420 | 10 | None | None | None | None | None
189 | 855860136149123072 | 8.558585e+17 | 1.361572e+07 | 2017-04-22 19:05:32 +0000 | <a href="http://twitter.com/download/iphone" r... | @s8n You tried very hard to portray this good ... | NaN | NaN | NaN | NaN | 666 | 10 | None | None | None | None | None
218 | 850333567704068097 | 8.503288e+17 | 2.195506e+07 | 2017-04-07 13:04:55 +0000 | <a href="http://twitter.com/download/iphone" r... | @markhoppus MARK THAT DOG HAS SEEN AND EXPERIE... | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | None
228 | 848213670039564288 | 8.482121e+17 | 4.196984e+09 | 2017-04-01 16:41:12 +0000 | <a href="http://twitter.com/download/iphone" r... | Jerry just apuppologized to me. He said there ... | NaN | NaN | NaN | NaN | 11 | 10 | None | None | None | None | None
234 | 847617282490613760 | 8.476062e+17 | 4.196984e+09 | 2017-03-31 01:11:22 +0000 | <a href="http://twitter.com/download/iphone" r... | .@breaannanicolee PUPDATE: Cannon has a heart ... | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | None
274 | 840698636975636481 | 8.406983e+17 | 8.405479e+17 | 2017-03-11 22:59:09 +0000 | <a href="http://twitter.com/download/iphone" r... | @0_kelvin_0 &gt;10/10 is reserved for puppos s... | NaN | NaN | NaN | NaN | 10 | 10 | None | None | None | None | None
290 | 838150277551247360 | 8.381455e+17 | 2.195506e+07 | 2017-03-04 22:12:52 +0000 | <a href="http://twitter.com/download/iphone" r... | @markhoppus 182/10 | NaN | NaN | NaN | NaN | 182 | 10 | None | None | None | None | None
291 | 838085839343206401 | 8.380855e+17 | 2.894131e+09 | 2017-03-04 17:56:49 +0000 | <a href="http://twitter.com/download/iphone" r... | @bragg6of8 @Andy_Pace_ we are still looking fo... | NaN | NaN | NaN | NaN | 15 | 10 | None | None | None | None | None
313 | 835246439529840640 | 8.352460e+17 | 2.625958e+07 | 2017-02-24 21:54:03 +0000 | <a href="http://twitter.com/download/iphone" r... | @jonnysun @Lin_Manuel ok jomny I know you're e... | NaN | NaN | NaN | NaN | 960 | 0 | None | None | None | None | None
342 | 832088576586297345 | 8.320875e+17 | 3.058208e+07 | 2017-02-16 04:45:50 +0000 | <a href="http://twitter.com/download/iphone" r... | @docmisterio account started on 11/15/15 | NaN | NaN | NaN | NaN | 11 | 15 | None | None | None | None | None
346 | 831926988323639298 | 8.319030e+17 | 2.068372e+07 | 2017-02-15 18:03:45 +0000 | <a href="http://twitter.com/download/iphone" r... | @UNC can confirm 12/10 | NaN | NaN | NaN | NaN | 12 | 10 | None | None | None | None | None
375 | 828361771580813312 | NaN | NaN | 2017-02-05 21:56:51 +0000 | <a href="http://twitter.com" rel="nofollow">Tw... | Beebop and Doobert should start a band 12/10 w... | NaN | NaN | NaN | NaN | 12 | 10 | None | None | None | None | None
387 | 826598799820865537 | 8.265984e+17 | 4.196984e+09 | 2017-02-01 01:11:25 +0000 | <a href="http://twitter.com/download/iphone" r... | I was going to do 007/10, but the joke wasn't ... | NaN | NaN | NaN | NaN | 7 | 10 | None | None | None | None | None
409 | 823333489516937216 | 8.233264e+17 | 1.582854e+09 | 2017-01-23 00:56:15 +0000 | <a href="http://twitter.com/download/iphone" r... | @HistoryInPics 13/10 | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | None
427 | 821153421864615936 | 8.211526e+17 | 1.132119e+08 | 2017-01-17 00:33:26 +0000 | <a href="http://twitter.com/download/iphone" r... | @imgur for a polar bear tho I'd say 13/10 is a... | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | None
498 | 813130366689148928 | 8.131273e+17 | 4.196984e+09 | 2016-12-25 21:12:41 +0000 | <a href="http://twitter.com/download/iphone" r... | I've been informed by multiple sources that th... | NaN | NaN | NaN | NaN | 12 | 10 | None | None | None | None | None
513 | 811647686436880384 | 8.116272e+17 | 4.196984e+09 | 2016-12-21 19:01:02 +0000 | <a href="http://twitter.com/download/iphone" r... | PUPDATE: I've been informed that Augie was act... | NaN | NaN | NaN | NaN | 11 | 10 | None | None | None | None | None
570 | 801854953262350336 | 8.018543e+17 | 1.185634e+07 | 2016-11-24 18:28:13 +0000 | <a href="http://twitter.com/download/iphone" r... | .@NBCSports OMG THE TINY HAT I'M GOING TO HAVE... | NaN | NaN | NaN | NaN | 11 | 10 | None | None | None | None | None
576 | 800859414831898624 | 8.008580e+17 | 2.918590e+08 | 2016-11-22 00:32:18 +0000 | <a href="http://twitter.com/download/iphone" r... | @SkyWilliams doggo simply protecting you from ... | NaN | NaN | NaN | NaN | 11 | 10 | None | doggo | None | None | None
611 | 797165961484890113 | 7.971238e+17 | 2.916630e+07 | 2016-11-11 19:55:50 +0000 | <a href="http://twitter.com/download/iphone" r... | @JODYHiGHROLLER it may be an 11/10 but what do... | NaN | NaN | NaN | NaN | 11 | 10 | None | None | None | None | None
701 | 786051337297522688 | 7.727430e+17 | 7.305050e+17 | 2016-10-12 03:50:17 +0000 | <a href="http://twitter.com/download/iphone" r... | 13/10 for breakdancing puppo @shibbnbot | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | puppo
707 | 785515384317313025 | NaN | NaN | 2016-10-10 16:20:36 +0000 | <a href="http://twitter.com/download/iphone" r... | Today, 10/10, should be National Dog Rates Day | NaN | NaN | NaN | NaN | 10 | 10 | None | None | None | None | None
843 | 766714921925144576 | 7.667118e+17 | 4.196984e+09 | 2016-08-19 19:14:16 +0000 | <a href="http://twitter.com/download/iphone" r... | His name is Charley and he already has a new s... | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | None
857 | 763956972077010945 | 7.638652e+17 | 1.584641e+07 | 2016-08-12 04:35:10 +0000 | <a href="http://twitter.com/download/iphone" r... | @TheEllenShow I'm not sure if you know this bu... | NaN | NaN | NaN | NaN | 12 | 10 | None | doggo | None | None | None
967 | 750381685133418496 | 7.501805e+17 | 4.717297e+09 | 2016-07-05 17:31:49 +0000 | <a href="http://twitter.com/download/iphone" r... | 13/10 such a good doggo\n@spaghemily | NaN | NaN | NaN | NaN | 13 | 10 | None | doggo | None | None | None
1005 | 747651430853525504 | 7.476487e+17 | 4.196984e+09 | 2016-06-28 04:42:46 +0000 | <a href="http://twitter.com/download/iphone" r... | Other pupper asked not to have his identity sh... | NaN | NaN | NaN | NaN | 12 | 10 | None | None | None | pupper | None
1080 | 738891149612572673 | 7.384119e+17 | 3.589728e+08 | 2016-06-04 00:32:32 +0000 | <a href="http://twitter.com/download/iphone" r... | @mount_alex3 13/10 | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | None
1295 | 707983188426153984 | 7.079801e+17 | 2.319108e+09 | 2016-03-10 17:35:20 +0000 | <a href="http://twitter.com/download/iphone" r... | @serial @MrRoles OH MY GOD I listened to all o... | NaN | NaN | NaN | NaN | 12 | 10 | None | None | None | None | None
1345 | 704491224099647488 | 7.044857e+17 | 2.878549e+07 | 2016-03-01 02:19:31 +0000 | <a href="http://twitter.com/download/iphone" r... | 13/10 hero af\n@ABC | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | None
1445 | 696518437233913856 | NaN | NaN | 2016-02-08 02:18:30 +0000 | <a href="http://twitter.com/download/iphone" r... | Oh my god 10/10 for every little hot dog pupper | NaN | NaN | NaN | NaN | 10 | 10 | None | None | None | pupper | None
1446 | 696490539101908992 | 6.964887e+17 | 4.196984e+09 | 2016-02-08 00:27:39 +0000 | <a href="http://twitter.com/download/iphone" r... | After reading the comments I may have overesti... | NaN | NaN | NaN | NaN | 1 | 10 | None | None | None | None | None
1474 | 693644216740769793 | 6.936422e+17 | 4.196984e+09 | 2016-01-31 03:57:23 +0000 | <a href="http://twitter.com/download/iphone" r... | BREAKING PUPDATE: I've just been notified that... | NaN | NaN | NaN | NaN | 10 | 10 | None | None | None | None | None
1479 | 693582294167244802 | 6.935722e+17 | 1.198989e+09 | 2016-01-30 23:51:19 +0000 | <a href="http://twitter.com/download/iphone" r... | Personally I'd give him an 11/10. Not sure why... | NaN | NaN | NaN | NaN | 11 | 10 | None | None | None | None | None
1497 | 692423280028966913 | 6.924173e+17 | 4.196984e+09 | 2016-01-27 19:05:49 +0000 | <a href="http://twitter.com/download/iphone" r... | PUPDATE: just noticed this dog has some extra ... | NaN | NaN | NaN | NaN | 9 | 10 | None | None | None | None | None
1523 | 690607260360429569 | 6.903413e+17 | 4.670367e+08 | 2016-01-22 18:49:36 +0000 | <a href="http://twitter.com/download/iphone" r... | 12/10 @LightningHoltt | NaN | NaN | NaN | NaN | 12 | 10 | None | None | None | None | None
1598 | 686035780142297088 | 6.860340e+17 | 4.196984e+09 | 2016-01-10 04:04:10 +0000 | <a href="http://twitter.com/download/iphone" r... | Yes I do realize a rating of 4/20 would've bee... | NaN | NaN | NaN | NaN | 4 | 20 | None | None | None | None | None
1605 | 685681090388975616 | 6.855479e+17 | 4.196984e+09 | 2016-01-09 04:34:45 +0000 | <a href="http://twitter.com/download/iphone" r... | Jack deserves another round of applause. If yo... | NaN | NaN | NaN | NaN | 14 | 10 | None | None | None | None | None
1618 | 684969860808454144 | 6.849598e+17 | 4.196984e+09 | 2016-01-07 05:28:35 +0000 | <a href="http://twitter.com/download/iphone" r... | For those who claim this is a goat, u are wron... | NaN | NaN | NaN | NaN | 5 | 10 | None | None | None | None | None
1663 | 682808988178739200 | 6.827884e+17 | 4.196984e+09 | 2016-01-01 06:22:03 +0000 | <a href="http://twitter.com/download/iphone" r... | I'm aware that I could've said 20/16, but here... | NaN | NaN | NaN | NaN | 20 | 16 | None | None | None | None | None
1689 | 681340665377193984 | 6.813394e+17 | 4.196984e+09 | 2015-12-28 05:07:27 +0000 | <a href="http://twitter.com/download/iphone" r... | I've been told there's a slight possibility he... | NaN | NaN | NaN | NaN | 5 | 10 | None | None | None | None | None
1774 | 678023323247357953 | 6.780211e+17 | 4.196984e+09 | 2015-12-19 01:25:31 +0000 | <a href="http://twitter.com/download/iphone" r... | After getting lost in Reese's eyes for several... | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | None
1819 | 676590572941893632 | 6.765883e+17 | 4.196984e+09 | 2015-12-15 02:32:17 +0000 | <a href="http://twitter.com/download/iphone" r... | After some outrage from the crowd. Bubbles is ... | NaN | NaN | NaN | NaN | 7 | 10 | None | None | None | None | None
1844 | 675849018447167488 | 6.758457e+17 | 4.196984e+09 | 2015-12-13 01:25:37 +0000 | <a href="http://twitter.com/download/iphone" r... | This dog is being demoted to a 9/10 for not we... | NaN | NaN | NaN | NaN | 9 | 10 | None | None | None | None | None
1895 | 674742531037511680 | 6.747400e+17 | 4.196984e+09 | 2015-12-10 00:08:50 +0000 | <a href="http://twitter.com/download/iphone" r... | Some clarification is required. The dog is sin... | NaN | NaN | NaN | NaN | 11 | 10 | None | None | None | None | None
1905 | 674606911342424069 | 6.744689e+17 | 4.196984e+09 | 2015-12-09 15:09:55 +0000 | <a href="http://twitter.com/download/iphone" r... | The 13/10 also takes into account this impecca... | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | None
1914 | 674330906434379776 | 6.658147e+17 | 1.637468e+07 | 2015-12-08 20:53:11 +0000 | <a href="http://twitter.com/download/iphone" r... | 13/10\n@ABC7 | NaN | NaN | NaN | NaN | 13 | 10 | None | None | None | None | None
1940 | 673716320723169284 | 6.737159e+17 | 4.196984e+09 | 2015-12-07 04:11:02 +0000 | <a href="http://twitter.com/download/iphone" r... | The millennials have spoken and we've decided ... | NaN | NaN | NaN | NaN | 1 | 10 | None | None | None | None | None
2038 | 671550332464455680 | 6.715449e+17 | 4.196984e+09 | 2015-12-01 04:44:10 +0000 | <a href="http://twitter.com/download/iphone" r... | After 22 minutes of careful deliberation this ... | NaN | NaN | NaN | NaN | 1 | 10 | None | None | None | None | None
2149 | 669684865554620416 | 6.693544e+17 | 4.196984e+09 | 2015-11-26 01:11:28 +0000 | <a href="http://twitter.com/download/iphone" r... | After countless hours of research and hundreds... | NaN | NaN | NaN | NaN | 11 | 10 | None | None | None | None | None
2189 | 668967877119254528 | 6.689207e+17 | 2.143566e+07 | 2015-11-24 01:42:25 +0000 | <a href="http://twitter.com/download/iphone" r... | 12/10 good shit Bubka\n@wane15 | NaN | NaN | NaN | NaN | 12 | 10 | None | None | None | None | None
2298 | 667070482143944705 | 6.670655e+17 | 4.196984e+09 | 2015-11-18 20:02:51 +0000 | <a href="http://twitter.com/download/iphone" r... | After much debate this dog is being upgraded t... | NaN | NaN | NaN | NaN | 10 | 10 | None | None | None | None | None
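twitter_achieve.head(3)
# A first look at the dataset already reveals some issues
# 1) Quality: source contains HTML tags; the useful part could be extracted from them
# 2) Tidiness: the dog stage (doggo, floofer, pupper, puppo) is categorical data and should live in a single column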
index | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | rating_numerator | rating_denominator | name | doggo | floofer | pupper | puppo
0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | 13 | 10 | Phineas | None | None | None | None
1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | 13 | 10 | Tilly | None | None | None | None
2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03 +0000 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | 12 | 10 | Archie | None | None | None | None
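For tidiness issue 2, one possible way to collapse the four stage columns into a single column is sketched below. It assumes, as the output above suggests, that a missing stage is stored as the string "None". The function name combine_stages and the choice to join multiple stages with a comma are assumptions for illustration, not the notebook's own cleaning code.

```python
import numpy as np
import pandas as pd

STAGE_COLUMNS = ["doggo", "floofer", "pupper", "puppo"]


def combine_stages(df):
    """Replace the four stage columns with one 'stage' column (hypothetical helper)."""
    # Join every stage that is actually set; the literal string "None" means "not set".
    combined = df[STAGE_COLUMNS].apply(
        lambda row: ",".join(value for value in row if value != "None"), axis=1
    )
    result = df.drop(columns=STAGE_COLUMNS)
    # Rows with no stage at all become NaN instead of an empty string.
    result["stage"] = combined.where(combined != "", np.nan)
    return result


# Made-up sample rows in the same shape as the archive table.
sample = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "doggo": ["None", "doggo", "doggo"],
    "floofer": ["None", "None", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})
tidy = combine_stages(sample)
```

Rows that carry two stages end up as a combined value such as "doggo,pupper", which keeps them visible for a later manual check.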
twitter_achieve.name.value_counts() 
# 3) Quality: dogs may legitimately share a name, but some values are not names at all: a, None, the, an
None         745
a             55
Charlie       12
Oliver        11
Lucy          11
Cooper        11
Lola          10
Tucker        10
Penny         10
Bo             9
Winston        9
Sadie          8
the            8
Buddy          7
Daisy          7
Toby           7
Bailey         7
an             7
Scout          6
Jack           6
Dave           6
Oscar          6
Koda           6
Stanley        6
Leo            6
Milo           6
Rusty          6
Jax            6
Bella          6
Finn           5
            ... 
Gustav         1
Andy           1
Pippin         1
Molly          1
Sage           1
Ashleigh       1
Schnozz        1
Shiloh         1
Margo          1
Tito           1
Brownie        1
my             1
Pherb          1
Colin          1
Buckley        1
Alexander      1
Kulet          1
Trigger        1
Aja            1
Petrick        1
Izzy           1
Milky          1
Dido           1
Kara           1
Wiggles        1
Carter         1
JD             1
by             1
Boston         1
Jarod          1
Name: name, Length: 957, dtype: int64
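The invalid values in the counts above (a, the, an, my, by) are all lowercase, while the real names start with a capital letter. A minimal sketch that uses this pattern follows. The lowercase rule is a heuristic inferred from this output, not a rule stated in the project, and clean_names is a hypothetical helper.

```python
import pandas as pd


def clean_names(names):
    """Set placeholder names ('None' or all-lowercase words) to NaN."""
    filled = names.fillna("None")
    # Heuristic: genuine names are capitalised, extraction mistakes are lowercase.
    is_placeholder = (filled == "None") | filled.str.islower()
    return names.mask(is_placeholder)


sample_names = pd.Series(["Charlie", "a", "None", "the", "JD"])
cleaned = clean_names(sample_names)
```

Names such as JD survive because `str.islower()` is only true when every cased character is lowercase.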
twitter_achieve.retweeted_status_user_id.notnull().value_counts()  
# 4) Quality: rows with a non-null retweeted_status_user_id are retweets; remove them and keep only original tweets
False    2175
True      181
Name: retweeted_status_user_id, dtype: int64
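Issue 4 could be handled with a boolean filter. The sketch below keeps the rows whose retweeted_status_id is null and then drops the three retweet columns, which are empty afterwards. Dropping the columns is an extra step assumed here, and drop_retweets is a hypothetical name.

```python
import numpy as np
import pandas as pd

RETWEET_COLUMNS = [
    "retweeted_status_id",
    "retweeted_status_user_id",
    "retweeted_status_timestamp",
]


def drop_retweets(df):
    """Keep only original tweets and remove the now-empty retweet columns."""
    originals = df[df["retweeted_status_id"].isnull()]
    return originals.drop(columns=RETWEET_COLUMNS)


# Made-up sample: the second row is a retweet.
sample = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "retweeted_status_id": [np.nan, 8.563302e17, np.nan],
    "retweeted_status_user_id": [np.nan, 66699013.0, np.nan],
    "retweeted_status_timestamp": [np.nan, "2017-04-24 02:13:14 +0000", np.nan],
})
originals = drop_retweets(sample)
```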
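twitter_achieve.info()
# 5) Quality: in_reply_to_status_id and in_reply_to_user_id have only 78 non-null values each, so they are almost entirely empty and need cleaning
# 6) Quality: tweet_id is an identifier and should be a string, not the numeric int64 type, so it needs converting. After reading all three tables, the same conversion turns out to be needed in each of them
# 7) Quality: expanded_urls holds the tweet links and has missing values; the rows without a link may no longer be valid and need handling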
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
tweet_id                      2356 non-null int64
in_reply_to_status_id         78 non-null float64
in_reply_to_user_id           78 non-null float64
timestamp                     2356 non-null object
source                        2356 non-null object
text                          2356 non-null object
retweeted_status_id           181 non-null float64
retweeted_status_user_id      181 non-null float64
retweeted_status_timestamp    181 non-null object
expanded_urls                 2297 non-null object
rating_numerator              2356 non-null int64
rating_denominator            2356 non-null int64
name                          2356 non-null object
doggo                         2356 non-null object
floofer                       2356 non-null object
pupper                        2356 non-null object
puppo                         2356 non-null object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
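One reason issue 6 matters: the IDs are 18 digits long, which is more than a float64 can represent exactly, so any accidental conversion to float would corrupt them. A short sketch of the conversion follows; the helper name ids_to_str is hypothetical.

```python
import pandas as pd


def ids_to_str(df, column="tweet_id"):
    """Return a copy of df with the ID column stored as strings."""
    result = df.copy()
    result[column] = result[column].astype(str)
    return result


sample = pd.DataFrame({"tweet_id": [892420643555336193, 892177421306343426]})
converted = ids_to_str(sample)

# A float64 cannot hold this ID exactly, so a round trip through float changes it.
round_trip = int(float(892420643555336193))
```

An alternative is to pass `dtype={"tweet_id": str}` to `pd.read_csv` so the column never becomes numeric in the first place.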
twitter_achieve.expanded_urls
# 8) Tidiness: some expanded_urls values hold several links joined by commas, some of them repeated, as in row 6
0       https://twitter.com/dog_rates/status/892420643...
1       https://twitter.com/dog_rates/status/892177421...
2       https://twitter.com/dog_rates/status/891815181...
3       https://twitter.com/dog_rates/status/891689557...
4       https://twitter.com/dog_rates/status/891327558...
5       https://twitter.com/dog_rates/status/891087950...
6       https://gofundme.com/ydvmve-surgery-for-jax,ht...
7       https://twitter.com/dog_rates/status/890729181...
8       https://twitter.com/dog_rates/status/890609185...
9       https://twitter.com/dog_rates/status/890240255...
10      https://twitter.com/dog_rates/status/890006608...
11      https://twitter.com/dog_rates/status/889880896...
12      https://twitter.com/dog_rates/status/889665388...
13      https://twitter.com/dog_rates/status/889638837...
14      https://twitter.com/dog_rates/status/889531135...
15      https://twitter.com/dog_rates/status/889278841...
16      https://twitter.com/dog_rates/status/888917238...
17      https://twitter.com/dog_rates/status/888804989...
18      https://twitter.com/dog_rates/status/888554962...
19      https://twitter.com/dog_rates/status/887473957...
20      https://twitter.com/dog_rates/status/888078434...
21      https://twitter.com/dog_rates/status/887705289...
22      https://twitter.com/dog_rates/status/887517139...
23      https://twitter.com/dog_rates/status/887473957...
24      https://twitter.com/dog_rates/status/887343217...
25      https://twitter.com/dog_rates/status/887101392...
26      https://twitter.com/dog_rates/status/886983233...
27      https://www.gofundme.com/mingusneedsus,https:/...
28      https://twitter.com/dog_rates/status/886680336...
29      https://twitter.com/dog_rates/status/886366144...
                              ...                        
2326    https://twitter.com/dog_rates/status/666411507...
2327    https://twitter.com/dog_rates/status/666407126...
2328    https://twitter.com/dog_rates/status/666396247...
2329    https://twitter.com/dog_rates/status/666373753...
2330    https://twitter.com/dog_rates/status/666362758...
2331    https://twitter.com/dog_rates/status/666353288...
2332    https://twitter.com/dog_rates/status/666345417...
2333    https://twitter.com/dog_rates/status/666337882...
2334    https://twitter.com/dog_rates/status/666293911...
2335    https://twitter.com/dog_rates/status/666287406...
2336    https://twitter.com/dog_rates/status/666273097...
2337    https://twitter.com/dog_rates/status/666268910...
2338    https://twitter.com/dog_rates/status/666104133...
2339    https://twitter.com/dog_rates/status/666102155...
2340    https://twitter.com/dog_rates/status/666099513...
2341    https://twitter.com/dog_rates/status/666094000...
2342    https://twitter.com/dog_rates/status/666082916...
2343    https://twitter.com/dog_rates/status/666073100...
2344    https://twitter.com/dog_rates/status/666071193...
2345    https://twitter.com/dog_rates/status/666063827...
2346    https://twitter.com/dog_rates/status/666058600...
2347    https://twitter.com/dog_rates/status/666057090...
2348    https://twitter.com/dog_rates/status/666055525...
2349    https://twitter.com/dog_rates/status/666051853...
2350    https://twitter.com/dog_rates/status/666050758...
2351    https://twitter.com/dog_rates/status/666049248...
2352    https://twitter.com/dog_rates/status/666044226...
2353    https://twitter.com/dog_rates/status/666033412...
2354    https://twitter.com/dog_rates/status/666029285...
2355    https://twitter.com/dog_rates/status/666020888...
Name: expanded_urls, Length: 2356, dtype: object
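For issue 8, a comma-joined value can be split and de-duplicated while keeping the original order. This is only a sketch: unique_urls is a hypothetical helper, and it assumes that no URL contains a comma itself, which would make the split incorrect.

```python
import pandas as pd


def unique_urls(value):
    """Split a comma-joined URL string into a list of distinct URLs, in order."""
    if not isinstance(value, str):
        return []  # missing values (NaN) have no links
    seen = []
    for url in value.split(","):
        if url and url not in seen:
            seen.append(url)
    return seen


# Made-up URLs that mimic the repeated-link pattern.
sample = pd.Series(["https://a.example/1,https://b.example/2,https://a.example/1", float("nan")])
url_lists = sample.apply(unique_urls)
```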
twitter_achieve.rating_denominator.value_counts() 
# 9) Quality: a few rating denominators are not 10 (for example 11, 2, 7, 0 and multiples of 10); they need to be rechecked or re-extracted
10     2333
11        3
50        3
80        2
20        2
2         1
16        1
40        1
70        1
15        1
90        1
110       1
120       1
130       1
150       1
170       1
7         1
0         1
Name: rating_denominator, dtype: int64
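Re-extracting the ratings for issue 9 could start from a regular expression over the text column, as sketched below. The pattern, including its support for decimal numerators, is an assumption, and extract_ratings is a hypothetical helper. It takes the first number/number match in each tweet, so dates such as 11/15/15 still produce a false rating and must be reviewed by hand.

```python
import pandas as pd

# First "number/number" occurrence; the numerator may contain a decimal point.
RATING_PATTERN = r"(\d+(?:\.\d+)?)/(\d+)"


def extract_ratings(text):
    """Return a DataFrame with float 'numerator' and 'denominator' columns."""
    ratings = text.str.extract(RATING_PATTERN)
    ratings.columns = ["numerator", "denominator"]
    return ratings.astype(float)


sample_text = pd.Series([
    "@Marc_IRL pixelated af 12/10",
    "@docmisterio account started on 11/15/15",
    "no rating in this made-up tweet",
    "a made-up decimal rating of 9.75/10",
])
ratings = extract_ratings(sample_text)
# Rows whose denominator is missing or not 10 are flagged for manual review.
needs_review = ratings["denominator"].ne(10)
```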
image_predictions.info()     # assess the image_predictions table
image_predictions.head(4)  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2075 entries, 0 to 2074
Data columns (total 12 columns):
tweet_id    2075 non-null int64
jpg_url     2075 non-null object
img_num     2075 non-null int64
p1          2075 non-null object
p1_conf     2075 non-null float64
p1_dog      2075 non-null bool
p2          2075 non-null object
p2_conf     2075 non-null float64
p2_dog      2075 non-null bool
p3          2075 non-null object
p3_conf     2075 non-null float64
p3_dog      2075 non-null bool
dtypes: bool(3), float64(3), int64(2), object(4)
memory usage: 152.1+ KB
index | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog
0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True
1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True
2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True
3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True

image_predictions[image_predictions['jpg_url'].duplicated()]['jpg_url'].value_counts()
# 10) Quality: image_predictions contains as many as 66 duplicated image links, which need to be removed
https://pbs.twimg.com/media/CtzKC7zXEAALfSo.jpg                                            1
https://pbs.twimg.com/media/CvoBPWRWgAA4het.jpg                                            1
https://pbs.twimg.com/media/Co-hmcYXYAASkiG.jpg                                            1
https://pbs.twimg.com/media/CsrjryzWgAAZY00.jpg                                            1
https://pbs.twimg.com/media/Cp6db4-XYAAMmqL.jpg                                            1
https://pbs.twimg.com/media/CV_cnjHWUAADc-c.jpg                                            1
https://pbs.twimg.com/media/CdHwZd0VIAA4792.jpg                                            1
https://pbs.twimg.com/media/Crwxb5yWgAAX5P_.jpg                                            1
https://pbs.twimg.com/media/DFDw2tyUQAAAFke.jpg                                            1
https://pbs.twimg.com/media/CVgdFjNWEAAxmbq.jpg                                            1
https://pbs.twimg.com/media/CUN4Or5UAAAa5K4.jpg                                            1
https://pbs.twimg.com/media/CuRDF-XWcAIZSer.jpg                                            1
https://pbs.twimg.com/media/Cq9guJ5WgAADfpF.jpg                                            1
https://pbs.twimg.com/media/Ct2qO5PXEAE6eB0.jpg                                            1
https://pbs.twimg.com/media/CxqsX-8XUAAEvjD.jpg                                            1
https://pbs.twimg.com/ext_tw_video_thumb/817423809049493505/pu/img/5OFW0yueFu9oTUiQ.jpg    1
https://pbs.twimg.com/media/CWyD2HGUYAQ1Xa7.jpg                                            1
https://pbs.twimg.com/media/CYLDikFWEAAIy1y.jpg                                            1
https://pbs.twimg.com/media/CVuQ2LeUsAAIe3s.jpg                                            1
https://pbs.twimg.com/media/CvaYgDOWgAEfjls.jpg                                            1
https://pbs.twimg.com/media/C4bTH6nWMAAX_bJ.jpg                                            1
https://pbs.twimg.com/media/CW88XN4WsAAlo8r.jpg                                            1
https://pbs.twimg.com/media/CvJCabcWgAIoUxW.jpg                                            1
https://pbs.twimg.com/media/Ck2d7tJWUAEPTL3.jpg                                            1
https://pbs.twimg.com/media/CtKHLuCWYAA2TTs.jpg                                            1
https://pbs.twimg.com/media/Cwx99rpW8AMk_Ie.jpg                                            1
https://pbs.twimg.com/media/CvyVxQRWEAAdSZS.jpg                                            1
https://pbs.twimg.com/media/CvT6IV6WEAQhhV5.jpg                                            1
https://pbs.twimg.com/ext_tw_video_thumb/815965888126062592/pu/img/JleSw4wRhgKDWQj5.jpg    1
https://pbs.twimg.com/media/C3nygbBWQAAjwcW.jpg                                            1
                                                                                          ..
https://pbs.twimg.com/media/CiyHLocU4AI2pJu.jpg                                            1
https://pbs.twimg.com/media/Ct72q9jWcAAhlnw.jpg                                            1
https://pbs.twimg.com/media/CkNjahBXAAQ2kWo.jpg                                            1
https://pbs.twimg.com/media/CwJR1okWIAA6XMp.jpg                                            1
https://pbs.twimg.com/tweet_video_thumb/CeBym7oXEAEWbEg.jpg                                1
https://pbs.twimg.com/media/CsVO7ljW8AAckRD.jpg                                            1
https://pbs.twimg.com/ext_tw_video_thumb/807106774843039744/pu/img/8XZg1xW35Xp2J6JW.jpg    1
https://pbs.twimg.com/media/CU1zsMSUAAAS0qW.jpg                                            1
https://pbs.twimg.com/media/Cbs3DOAXIAAp3Bd.jpg                                            1
https://pbs.twimg.com/media/Cveg1-NXgAASaaT.jpg                                            1
https://pbs.twimg.com/media/ChK1tdBWwAQ1flD.jpg                                            1
https://pbs.twimg.com/media/CwiuEJmW8AAZnit.jpg                                            1
https://pbs.twimg.com/media/CU3mITUWIAAfyQS.jpg                                            1
https://pbs.twimg.com/media/CZhn-QAWwAASQan.jpg                                            1
https://pbs.twimg.com/media/CmoPdmHW8AAi8BI.jpg                                            1
https://pbs.twimg.com/media/Cx5R8wPVEAALa9r.jpg                                            1
https://pbs.twimg.com/media/CpmyNumW8AAAJGj.jpg                                            1
https://pbs.twimg.com/media/Cs_DYr1XEAA54Pu.jpg                                            1
https://pbs.twimg.com/media/C12whDoVEAALRxa.jpg                                            1
https://pbs.twimg.com/media/CtVAvX-WIAAcGTf.jpg                                            1
https://pbs.twimg.com/media/C4KHj-nWQAA3poV.jpg                                            1
https://pbs.twimg.com/media/CrXhIqBW8AA6Bse.jpg                                            1
https://pbs.twimg.com/media/CVMOlMiWwAA4Yxl.jpg                                            1
https://pbs.twimg.com/media/C12x-JTVIAAzdfl.jpg                                            1
https://pbs.twimg.com/media/C2oRbOuWEAAbVSl.jpg                                            1
https://pbs.twimg.com/media/CWza7kpWcAAdYLc.jpg                                            1
https://pbs.twimg.com/media/CsGnz64WYAEIDHJ.jpg                                            1
https://pbs.twimg.com/media/C2kzTGxWEAEOpPL.jpg                                            1
https://pbs.twimg.com/media/CiibOMzUYAA9Mxz.jpg                                            1
https://pbs.twimg.com/media/CwS4aqZXUAAe3IO.jpg                                            1
Name: jpg_url, Length: 66, dtype: int64
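Removing the duplicated links from issue 10 could look like the sketch below. Keeping the first occurrence of each jpg_url is an assumption (the notebook does not say which copy to keep), and drop_duplicate_images is a hypothetical name.

```python
import pandas as pd


def drop_duplicate_images(df):
    """Keep the first row for each jpg_url and renumber the index."""
    return df.drop_duplicates(subset="jpg_url", keep="first").reset_index(drop=True)


# Made-up sample: tweets 1 and 3 point at the same image.
sample = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "jpg_url": ["https://img.example/a.jpg", "https://img.example/b.jpg", "https://img.example/a.jpg"],
})
deduplicated = drop_duplicate_images(sample)
```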
print(image_predictions[image_predictions['jpg_url'].isnull()]['jpg_url'].value_counts())   # confirms that no image link is null, so nothing needs handling here
Series([], Name: jpg_url, dtype: int64)
image_predictions[(image_predictions['p1_dog']==False)&(image_predictions['p2_dog']==False)&(image_predictions['p3_dog']==False) ]
# 11) Quality: 324 rows in image_predictions have no dog among their three predictions and need to be removed
index | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog
6 | 666051853826850816 | https://pbs.twimg.com/media/CT5KoJ1WoAAJash.jpg | 1 | box_turtle | 0.933012 | False | mud_turtle | 4.588540e-02 | False | terrapin | 1.788530e-02 | False
17 | 666104133288665088 | https://pbs.twimg.com/media/CT56LSZWoAAlJj2.jpg | 1 | hen | 0.965932 | False | cock | 3.391940e-02 | False | partridge | 5.206580e-05 | False
18 | 666268910803644416 | https://pbs.twimg.com/media/CT8QCd1WEAADXws.jpg | 1 | desktop_computer | 0.086502 | False | desk | 8.554740e-02 | False | bookcase | 7.947970e-02 | False
21 | 666293911632134144 | https://pbs.twimg.com/media/CT8mx7KW4AEQu8N.jpg | 1 | three-toed_sloth | 0.914671 | False | otter | 1.525000e-02 | False | great_grey_owl | 1.320720e-02 | False
25 | 666362758909284353 | https://pbs.twimg.com/media/CT9lXGsUcAAyUFt.jpg | 1 | guinea_pig | 0.996496 | False | skunk | 2.402450e-03 | False | hamster | 4.608630e-04 | False
29 | 666411507551481857 | https://pbs.twimg.com/media/CT-RugiWIAELEaq.jpg | 1 | coho | 0.404640 | False | barracouta | 2.714850e-01 | False | gar | 1.899450e-01 | False
45 | 666786068205871104 | https://pbs.twimg.com/media/CUDmZIkWcAAIPPe.jpg | 1 | snail | 0.999888 | False | slug | 5.514170e-05 | False | acorn | 2.625800e-05 | False
50 | 666837028449972224 | https://pbs.twimg.com/media/CUEUva1WsAA2jPb.jpg | 1 | triceratops | 0.442113 | False | armadillo | 1.140710e-01 | False | common_iguana | 4.325530e-02 | False
51 | 666983947667116034 | https://pbs.twimg.com/media/CUGaXDhW4AY9JUH.jpg | 1 | swab | 0.589446 | False | chain_saw | 1.901420e-01 | False | wig | 3.450970e-02 | False
53 | 667012601033924608 | https://pbs.twimg.com/media/CUG0bC0U8AAw2su.jpg | 1 | hyena | 0.987230 | False | African_hunting_dog | 1.260080e-02 | False | coyote | 5.735010e-05 | False
56 | 667065535570550784 | https://pbs.twimg.com/media/CUHkkJpXIAA2w3n.jpg | 1 | jigsaw_puzzle | 0.560001 | False | doormat | 1.032590e-01 | False | space_heater | 4.256800e-02 | False
69 | 667188689915760640 | https://pbs.twimg.com/media/CUJUk2iWUAAVtOv.jpg | 1 | vacuum | 0.335830 | False | swab | 2.652780e-01 | False | toilet_tissue | 1.407030e-01 | False
73 | 667369227918143488 | https://pbs.twimg.com/media/CUL4xR9UkAEdlJ6.jpg | 1 | teddy | 0.709545 | False | bath_towel | 1.272850e-01 | False | Christmas_stocking | 2.856750e-02 | False
77 | 667437278097252352 | https://pbs.twimg.com/media/CUM2qWaWoAUZ06L.jpg | 1 | porcupine | 0.989154 | False | bath_towel | 6.300490e-03 | False | badger | 9.663400e-04 | False
78 | 667443425659232256 | https://pbs.twimg.com/media/CUM8QZwW4AAVsBl.jpg | 1 | goose | 0.980815 | False | drake | 6.917770e-03 | False | hen | 5.255170e-03 | False
93 | 667549055577362432 | https://pbs.twimg.com/media/CUOcVCwWsAERUKY.jpg | 1 | electric_fan | 0.984377 | False | spotlight | 7.736710e-03 | False | lampshade | 1.901230e-03 | False
94 | 667550882905632768 | https://pbs.twimg.com/media/CUObvUJVEAAnYPF.jpg | 1 | web_site | 0.998258 | False | dishwasher | 2.010840e-04 | False | oscilloscope | 1.417360e-04 | False
96 | 667724302356258817 | https://pbs.twimg.com/media/CUQ7tv3W4AA3KlI.jpg | 1 | ibex | 0.619098 | False | bighorn | 1.251190e-01 | False | ram | 7.467320e-02 | False
98 | 667766675769573376 | https://pbs.twimg.com/media/CURiQMnUAAAPT2M.jpg | 1 | fire_engine | 0.883493 | False | tow_truck | 7.473390e-02 | False | jeep | 1.277260e-02 | False
100 | 667782464991965184 | https://pbs.twimg.com/media/CURwm3cUkAARcO6.jpg | 1 | lorikeet | 0.466149 | False | hummingbird | 8.301100e-02 | False | African_grey | 5.424740e-02 | False
106 | 667866724293877760 | https://pbs.twimg.com/media/CUS9PlUWwAANeAD.jpg | 1 | jigsaw_puzzle | 1.000000 | False | prayer_rug | 1.011300e-08 | False | doormat | 1.740170e-10 | False
107 | 667873844930215936 | https://pbs.twimg.com/media/CUTDtyGXIAARxus.jpg | 1 | common_iguana | 0.999647 | False | frilled_lizard | 1.811500e-04 | False | African_chameleon | 1.283570e-04 | False
112 | 667911425562669056 | https://pbs.twimg.com/media/CUTl5m1WUAAabZG.jpg | 1 | frilled_lizard | 0.257695 | False | ox | 2.351600e-01 | False | triceratops | 8.531690e-02 | False
115 | 667937095915278337 | https://pbs.twimg.com/media/CUT9PuQWwAABQv7.jpg | 1 | hamster | 0.172078 | False | guinea_pig | 9.492420e-02 | False | Band_Aid | 5.999520e-02 | False
117 | 668142349051129856 | https://pbs.twimg.com/media/CUW37BzWsAAlJlN.jpg | 1 | Angora | 0.918834 | False | hen | 3.779340e-02 | False | wood_rabbit | 1.101490e-02 | False
118 | 668154635664932864 | https://pbs.twimg.com/media/CUXDGR2WcAAUQKz.jpg | 1 | Arctic_fox | 0.473584 | False | wallaby | 2.614110e-01 | False | white_wolf | 8.094780e-02 | False
123 | 668226093875376128 | https://pbs.twimg.com/media/CUYEFlQXAAUkPGm.jpg | 1 | trombone | 0.390339 | False | cornet | 3.141490e-01 | False | French_horn | 2.551820e-01 | False
130 | 668291999406125056 | https://pbs.twimg.com/media/CUZABzGW4AE5F0k.jpg | 1 | web_site | 0.995535 | False | skunk | 1.363490e-03 | False | badger | 6.856500e-04 | False
132 | 668466899341221888 | https://pbs.twimg.com/media/CUbfGbbWoAApZth.jpg | 1 | shopping_basket | 0.398361 | False | hamper | 3.632220e-01 | False | bassinet | 8.417350e-02 | False
140 | 668544745690562560 | https://pbs.twimg.com/media/CUcl5jeWsAA6ufS.jpg | 1 | bearskin | 0.427870 | False | bow | 2.588580e-01 | False | panpipe | 2.156260e-02 | False
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
1839 | 837482249356513284 | https://pbs.twimg.com/media/C59VqMUXEAAzldG.jpg | 2 | birdhouse | 0.541196 | False | can_opener | 1.210940e-01 | False | carton | 5.613670e-02 | False
1844 | 838916489579200512 | https://pbs.twimg.com/media/C6RkiQZUsAAM4R4.jpg | 2 | web_site | 0.993651 | False | monitor | 1.405900e-03 | False | envelope | 1.093090e-03 | False
1847 | 839290600511926273 | https://pbs.twimg.com/media/C6XBt9XXEAEEW9U.jpg | 1 | web_site | 0.670892 | False | monitor | 1.015650e-01 | False | screen | 7.530610e-02 | False
1851 | 840370681858686976 | https://pbs.twimg.com/media/C6mYrK0UwAANhep.jpg | 1 | teapot | 0.981819 | False | cup | 1.402580e-02 | False | coffeepot | 2.420540e-03 | False
1853 | 840696689258311684 | https://pbs.twimg.com/media/C6rBLenU0AAr8MN.jpg | 1 | web_site | 0.841768 | False | rule | 7.087310e-03 | False | envelope | 6.820300e-03 | False
1869844580511645339650https://pbs.twimg.com/media/C7iNfq1W0AAcbsR.jpg1washer0.903064Falsedishwasher3.248900e-02Falseprinter1.645620e-02False
1886847962785489326080https://pbs.twimg.com/media/C8SRpHNUIAARB3j.jpg1sea_lion0.882654Falsemink6.688020e-02Falseotter2.567870e-02False
1887847971574464610304https://pbs.twimg.com/media/C8SZH1EWAAAIRRF.jpg1coffee_mug0.633652Falsecup2.733920e-01Falsetoilet_tissue6.665580e-02False
1891849051919805034497https://pbs.twimg.com/media/C8hwNxbXYAAwyVG.jpg1fountain0.997509FalseAmerican_black_bear1.413120e-03Falsesundial6.811150e-04False
1892849336543269576704https://pbs.twimg.com/media/C8lzFC4XcAAQxB4.jpg1patio0.521788Falseprison1.495440e-01Falserestaurant2.715260e-02False
1900851464819735769094https://pbs.twimg.com/media/C9ECujZXsAAPCSM.jpg2web_site0.919649Falsemenu2.630610e-02Falsecrossword_puzzle3.481510e-03False
1902851861385021730816https://pbs.twimg.com/media/C8W6sY_W0AEmttW.jpg1pencil_box0.662183Falsepurse6.650550e-02Falsepillow4.472530e-02False
1905852226086759018497https://pbs.twimg.com/ext_tw_video_thumb/85222...1prison0.352793Falsedishwasher1.107230e-01Falsefile9.411200e-02False
1906852311364735569921https://pbs.twimg.com/media/C9QEqZ7XYAIR7fS.jpg1barbell0.971581Falsedumbbell2.841790e-02Falsego-kart5.595040e-07False
1910853299958564483072https://pbs.twimg.com/media/C9eHyF7XgAAOxPM.jpg1grille0.652280Falsebeach_wagon1.128460e-01Falseconvertible8.625230e-02False
1931859074603037188101https://pbs.twimg.com/media/C-wLyufW0AA546I.jpg1revolver0.190292Falseprojectile1.490640e-01Falsefountain6.604660e-02False
1936860184849394610176https://pbs.twimg.com/media/C-_9jWWUwAAnwkd.jpg1chimpanzee0.267612Falsegorilla1.042930e-01Falseorangutan5.990750e-02False
1937860276583193509888https://pbs.twimg.com/media/C_BQ_NlVwAAgYGD.jpg1lakeside0.312299Falsedock1.598420e-01Falsecanoe7.079450e-02False
1940860924035999428608https://pbs.twimg.com/media/C_KVJjDXsAEUCWn.jpg2envelope0.933016Falseoscilloscope1.259140e-02Falsepaper_towel1.117850e-02False
1946862457590147678208https://pbs.twimg.com/media/C_gQmaTUMAAPYSS.jpg1home_theater0.496348Falsestudio_couch1.672560e-01Falsebarber_chair5.262500e-02False
1953863907417377173506https://pbs.twimg.com/media/C_03NPeUQAAgrMl.jpg1marmot0.358828Falsemeerkat1.747030e-01Falseweasel1.234850e-01False
1956864873206498414592https://pbs.twimg.com/media/DAClmHkXcAA1kSv.jpg2pole0.478616Falselakeside1.141820e-01Falsewreck5.592650e-02False
1975870063196459192321https://pbs.twimg.com/media/DBMV3NnXUAAm0Pp.jpg1comic_book0.534409Falseenvelope2.807220e-01Falsebook_jacket4.378550e-02False
1979870804317367881728https://pbs.twimg.com/media/DBW35ZsVoAEWZUU.jpg1home_theater0.168290Falsesandbar9.804040e-02Falsetelevision7.972940e-02False
2012879050749262655488https://pbs.twimg.com/media/DDMD_phXoAQ1qf0.jpg1tabby0.311861Falsewindow_screen1.691230e-01FalseEgyptian_cat1.329320e-01False
2021880935762899988482https://pbs.twimg.com/media/DDm2Z5aXUAEDS2u.jpg1street_sign0.251801Falseumbrella1.151230e-01Falsetraffic_light6.953380e-02False
2022881268444196462592https://pbs.twimg.com/media/DDrk-f9WAAI-WQv.jpg1tusker0.473303FalseIndian_elephant2.456460e-01Falseibex5.566070e-02False
2046886680336477933568https://pbs.twimg.com/media/DE4fEDzWAAAyHMM.jpg1convertible0.738995Falsesports_car1.399520e-01Falsecar_wheel4.417270e-02False
2052887517139158093824https://pbs.twimg.com/ext_tw_video_thumb/88751...1limousine0.130432Falsetow_truck2.917540e-02Falseshopping_cart2.632080e-02False
2074892420643555336193https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg1orange0.097049Falsebagel8.585110e-02Falsebanana7.611000e-02False

324 rows × 12 columns

# Read the third file, tweet_json.txt.

tweetIdList = []    # tweet IDs
tweetReTList = []   # retweet counts
tweetFavList = []   # favorite (like) counts
with open('tweet_json.txt') as json_file:      # loading the whole file in one call failed, so parse it line by line instead
    for oneLine in json_file:
        tempDic = json.loads(oneLine)    # each line is one tweet, parsed into a dict
        tempID = tempDic['id_str']       # extract the tweet ID, retweet count and favorite count
        tempRe = tempDic['retweet_count']
        tempFa = tempDic['favorite_count']
        tweetIdList.append(tempID)
        tweetReTList.append(tempRe)
        tweetFavList.append(tempFa)

tweet_json = pd.DataFrame({'tweet_id':tweetIdList,'retweet_count':tweetReTList,'favorite_count':tweetFavList})
tweet_json   # tweet ID, retweet count and favorite count extracted from tweet_json.txt
# Issue found (tidiness; item 10 in the summary below): all three tables share the tweet_id field, so they can be merged into one table.
tweet_idretweet_countfavorite_count
0892420643555336193884239492
1892177421306343426648033786
2891815181378084864430125445
3891689557279858688892542863
4891327558926688256972141016
5891087950875897856324020548
6890971913173991426214212053
78907291814112378881954866596
8890609185150312448440328187
9890240255349198849768432467
10890006608113172480758431127
11889880896479866881511628208
12889665388333682689850238745
13889638837579907072470527633
14889531135344209921230915329
15889278841981685760563525712
16888917238123831296468129555
17888804989199671297453526021
18888554962724278272372220267
19888078434458587136363722144
20887705289381826560558430690
218875171391580938241205346940
228874739571039518831881370007
238873432170453688321071334223
24887101392804085760614731045
25886983233522544640804535786
26886736880519319552342012286
27886680336477933568459722802
28886366144734445568329721488
298862670092850176004117
............
2322666411507551481857337457
232366640712685676544043113
232466639624737329152091171
232566637375374458880299194
2326666362758909284353590801
232766635328845610188876228
2328666345417576210432146308
232966633788230352486496203
2330666293911632134144365519
233166628740622469529671152
233266627309761663795281183
233366626891080364441637108
2334666104133288665088683514703
23356661021559091445761581
233666609951378705203273160
233766609400002215936278168
233866608291673319833747121
2339666073100786774016173334
234066607119322150912067154
2341666063827256086533230494
234266605860052415692861117
2343666057090499244032146304
2344666055525042405380261449
23456660518538268508168771250
234666605075879469465760136
234766604924816582246541111
2348666044226329800704147309
234966603341270103244947128
235066602928500262092848132
23516660208880227901495302528

2352 rows × 3 columns

tweet_json.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2352 entries, 0 to 2351
Data columns (total 3 columns):
tweet_id          2352 non-null object
retweet_count     2352 non-null int64
favorite_count    2352 non-null int64
dtypes: int64(2), object(1)
memory usage: 55.2+ KB
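The line-by-line reading above relies on tweet_json.txt holding one JSON object per line. Below is a minimal sketch of that parsing step using only the standard library. The helper name parse_tweet_lines is hypothetical, the in-memory sample stands in for the real file, and its counts are made up for illustration.

```python
import io
import json

def parse_tweet_lines(lines):
    """Parse an iterable of JSON strings, one tweet object per line."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:          # skip blank lines instead of failing on them
            continue
        tweet = json.loads(line)
        records.append({
            "tweet_id": tweet["id_str"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return records

# Illustrative stand-in for tweet_json.txt (the counts are made up).
sample = io.StringIO(
    '{"id_str": "892420643555336193", "retweet_count": 10, "favorite_count": 20}\n'
    '\n'
    '{"id_str": "892177421306343426", "retweet_count": 3, "favorite_count": 7}\n'
)
records = parse_tweet_lines(sample)
print(records)
```

Passing records to pd.DataFrame would then give the same three columns as tweet_json.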

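3. Cleaning

The data issues found during assessment are summarized below.

Issue definitions

Quality issues:
  • 1) tweet_id is a tweet identifier and should be a string, not an int64 number, so its type needs to be changed.

  • 2) expanded_urls holds the tweet's link; some values are missing, probably because the tweet is no longer available, and those rows need to be handled.

  • 3) source contains HTML tags; the text inside the tag can be extracted to show where the tweet was posted from.

  • 4) jpg_url has duplicate values that need to be cleaned.

  • 5) Rows with a non-null retweeted_status_user_id are retweets; they need to be removed so that only original tweets remain.

  • 6) Dog names may legitimately repeat, but name contains invalid values: a, None, the, an.

  • 7) in_reply_to_status_id and in_reply_to_user_id each have only 78 non-null values and are otherwise empty, so they need to be cleaned.

  • 8) In image_predictions, 324 rows have no dog in any of the three predictions; these rows need to be removed.

  • 9) Most rating denominators are 10 or a multiple of 10, but a few are not (for example 11, 2 and 7); these need to be rechecked or re-extracted.

Tidiness issues:
  • 10) All three tables share the tweet_id field, so they can be merged into one table.
  • 11) Some expanded_urls cells contain the same URL several times.
  • 12) The dog stage (doggo, floofer, pupper, puppo) is categorical data and should be stored in a single column.

The issues are handled roughly in the order above, but some must be fixed before others can be, so the order does not match exactly. For example, tweet_id is converted early because the merge needs matching key types. In total, at least eight quality issues and two tidiness issues are resolved.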

# Show all columns
pd.options.display.max_columns=500
pd.set_option('display.max_colwidth',200)
# Define:  1) quality issue: tweet_id is a tweet identifier and should be a string, not an int64 number.
# Define: 10) tidiness issue: merge the three tables.

# Code: fix the tweet_id data type and merge the three datasets
tempTable  = pd.merge(twitter_achieve,image_predictions,on="tweet_id")   # merge keys must share a dtype; both are int64 here
tempTable_clean = tempTable.copy()         # work on a copy
tempTable_clean['tweet_id'] = tempTable_clean['tweet_id'].astype(str)     # convert the merged tweet_id to str so it matches tweet_json
tempTable_all  = pd.merge(tempTable_clean,tweet_json,on="tweet_id")
final_Data_clean = tempTable_all.copy()    # copy so the intermediate table stays unchanged
final_Data_clean.info()  

# Test: tweet_id is now a string (object dtype) and the three tables are merged; both verified.
final_Data_clean.head(4)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2073 entries, 0 to 2072
Data columns (total 30 columns):
tweet_id                      2073 non-null object
in_reply_to_status_id         23 non-null float64
in_reply_to_user_id           23 non-null float64
timestamp                     2073 non-null object
source                        2073 non-null object
text                          2073 non-null object
retweeted_status_id           79 non-null float64
retweeted_status_user_id      79 non-null float64
retweeted_status_timestamp    79 non-null object
expanded_urls                 2073 non-null object
rating_numerator              2073 non-null int64
rating_denominator            2073 non-null int64
name                          2073 non-null object
doggo                         2073 non-null object
floofer                       2073 non-null object
pupper                        2073 non-null object
puppo                         2073 non-null object
jpg_url                       2073 non-null object
img_num                       2073 non-null int64
p1                            2073 non-null object
p1_conf                       2073 non-null float64
p1_dog                        2073 non-null bool
p2                            2073 non-null object
p2_conf                       2073 non-null float64
p2_dog                        2073 non-null bool
p3                            2073 non-null object
p3_conf                       2073 non-null float64
p3_dog                        2073 non-null bool
retweet_count                 2073 non-null int64
favorite_count                2073 non-null int64
dtypes: bool(3), float64(7), int64(5), object(15)
memory usage: 459.5+ KB
tweet_idin_reply_to_status_idin_reply_to_user_idtimestampsourcetextretweeted_status_idretweeted_status_user_idretweeted_status_timestampexpanded_urlsrating_numeratorrating_denominatornamedoggoflooferpupperpuppojpg_urlimg_nump1p1_confp1_dogp2p2_confp2_dogp3p3_confp3_dogretweet_countfavorite_count
0892420643555336193NaNNaN2017-08-01 16:23:56 +0000<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJUNaNNaNNaNhttps://twitter.com/dog_rates/status/892420643555336193/photo/11310PhineasNoneNoneNoneNonehttps://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg1orange0.097049Falsebagel0.085851Falsebanana0.076110False884239492
1892177421306343426NaNNaN2017-08-01 00:17:27 +0000<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>This is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIVNaNNaNNaNhttps://twitter.com/dog_rates/status/892177421306343426/photo/11310TillyNoneNoneNoneNonehttps://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg1Chihuahua0.323581TruePekinese0.090647Truepapillon0.068957True648033786
2891815181378084864NaNNaN2017-07-31 00:18:03 +0000<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>This is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJBNaNNaNNaNhttps://twitter.com/dog_rates/status/891815181378084864/photo/11210ArchieNoneNoneNoneNonehttps://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg1Chihuahua0.716012Truemalamute0.078253Truekelpie0.031379True430125445
3891689557279858688NaNNaN2017-07-30 15:58:51 +0000<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>This is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQNaNNaNNaNhttps://twitter.com/dog_rates/status/891689557279858688/photo/11310DarlaNoneNoneNoneNonehttps://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg1paper_towel0.170278FalseLabrador_retriever0.168086Truespatula0.040836False892542863
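The merge above works only once the key columns share a dtype. The sketch below shows the same idea on three tiny, made-up tables: every tweet_id is converted to str first, and the validate argument of merge is used to raise an error if a key is unexpectedly repeated. The sample values are assumptions for illustration, not rows from the project data.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
archive = pd.DataFrame({"tweet_id": [1, 2, 3], "rating_numerator": [13, 12, 11]})
predictions = pd.DataFrame({"tweet_id": [1, 2], "p1": ["orange", "Chihuahua"]})
counts = pd.DataFrame({"tweet_id": ["1", "2", "3"], "retweet_count": [5, 6, 7]})

# Convert every key to str first so all merges compare the same dtype.
for frame in (archive, predictions, counts):
    frame["tweet_id"] = frame["tweet_id"].astype(str)

# validate="one_to_one" raises MergeError if a tweet_id repeats on either side.
merged = (
    archive
    .merge(predictions, on="tweet_id", how="inner", validate="one_to_one")
    .merge(counts, on="tweet_id", how="inner", validate="one_to_one")
)
print(merged)
```

An inner join keeps only tweets present in all three tables, which matches the default behaviour of pd.merge used in the report.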
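# Define: 2) quality issue: expanded_urls holds the tweet's link; some values are missing, probably because the tweet is no longer available, so those rows are handled here.

# Code
final_Data_clean  =  final_Data_clean[final_Data_clean['expanded_urls'].notnull()]   # keep only the rows whose expanded_urls is present

# Test
final_Data_clean.expanded_urls.isnull().value_counts()   # verified: no missing expanded_urls remain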
False    2073
Name: expanded_urls, dtype: int64
# Define: 3) quality issue: source contains HTML tags; extract the text inside them.

# Code
urlString = '<a href="https://www.baidu.com/link?url=A3b9CWrhoCv4Oxw6z40oAU2_qNwN9756AJwaCLaPmBpK0bFjU8Rjv2LwWLL7fvHwgyq4cwaMfgO6_as6CpzUg_&amp;wd=&amp;eqid=be1ccee200026e65000000065c81f06b" target="_blank"><em>Beautiful Soup</em> Documentation — <em>Beautiful Soup</em> 4.4.0 ...</a>'
from bs4 import BeautifulSoup
def phraseHtml(string):   # return the text inside the <a> tag of a source value
    soup = BeautifulSoup(string,'lxml')  # parse the HTML with BeautifulSoup (needs the lxml parser installed)
    text = soup.find("a").string
    return text

final_Data_clean['source'] = final_Data_clean['source'].apply(phraseHtml)   # replace each source with the text of its <a> tag

# Test: check that the <a> tag text was extracted from source; verified.
print(final_Data_clean['source'].value_counts())
final_Data_clean.head(4)
Twitter for iPhone    2032
Twitter Web Client      30
TweetDeck               11
Name: source, dtype: int64
tweet_idin_reply_to_status_idin_reply_to_user_idtimestampsourcetextretweeted_status_idretweeted_status_user_idretweeted_status_timestampexpanded_urlsrating_numeratorrating_denominatornamedoggoflooferpupperpuppojpg_urlimg_nump1p1_confp1_dogp2p2_confp2_dogp3p3_confp3_dogretweet_countfavorite_count
0892420643555336193NaNNaN2017-08-01 16:23:56 +0000Twitter for iPhoneThis is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJUNaNNaNNaNhttps://twitter.com/dog_rates/status/892420643555336193/photo/11310PhineasNoneNoneNoneNonehttps://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg1orange0.097049Falsebagel0.085851Falsebanana0.076110False884239492
1892177421306343426NaNNaN2017-08-01 00:17:27 +0000Twitter for iPhoneThis is Tilly. She's just checking pup on you. Hopes you're doing ok. If not, she's available for pats, snugs, boops, the whole bit. 13/10 https://t.co/0Xxu71qeIVNaNNaNNaNhttps://twitter.com/dog_rates/status/892177421306343426/photo/11310TillyNoneNoneNoneNonehttps://pbs.twimg.com/media/DGGmoV4XsAAUL6n.jpg1Chihuahua0.323581TruePekinese0.090647Truepapillon0.068957True648033786
2891815181378084864NaNNaN2017-07-31 00:18:03 +0000Twitter for iPhoneThis is Archie. He is a rare Norwegian Pouncing Corgo. Lives in the tall grass. You never know when one may strike. 12/10 https://t.co/wUnZnhtVJBNaNNaNNaNhttps://twitter.com/dog_rates/status/891815181378084864/photo/11210ArchieNoneNoneNoneNonehttps://pbs.twimg.com/media/DGBdLU1WsAANxJ9.jpg1Chihuahua0.716012Truemalamute0.078253Truekelpie0.031379True430125445
3891689557279858688NaNNaN2017-07-30 15:58:51 +0000Twitter for iPhoneThis is Darla. She commenced a snooze mid meal. 13/10 happens to the best of us https://t.co/tD36da7qLQNaNNaNNaNhttps://twitter.com/dog_rates/status/891689557279858688/photo/11310DarlaNoneNoneNoneNonehttps://pbs.twimg.com/media/DF_q7IAWsAEuuN8.jpg1paper_towel0.170278FalseLabrador_retriever0.168086Truespatula0.040836False892542863
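For a value as regular as source, the anchor text can also be pulled out without an HTML parser. The following is a rough sketch using only the re module; it is an alternative to the BeautifulSoup function above, not the method used in this report, and the name anchor_text is hypothetical.

```python
import re

# Capture whatever sits between the first <a ...> and its closing </a>.
ANCHOR_TEXT_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)

def anchor_text(html):
    """Return the visible text of the first <a> element, or None if absent."""
    match = ANCHOR_TEXT_RE.search(html)
    return match.group(1).strip() if match else None

source = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(anchor_text(source))
```

A regular expression is only reasonable here because every source value has the same simple shape; for arbitrary HTML a real parser remains the safer choice.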
# Define: 4) quality issue: jpg_url has duplicate values that need to be cleaned.

# Code
print(final_Data_clean['jpg_url'].duplicated().sum())   # 64 rows have a jpg_url that already appeared in an earlier row
final_Data_clean = final_Data_clean[~final_Data_clean['jpg_url'].duplicated()]   # keep only the first row for each jpg_url

# Test
final_Data_clean[final_Data_clean['jpg_url'].duplicated()]   # empty result: no duplicated jpg_url remains
64
(empty DataFrame: 0 rows × 30 columns)
# Define: 5) rows with a non-null retweeted_status_user_id are retweets; remove them and keep only original tweets.

# Code
final_Data_clean = final_Data_clean[final_Data_clean['retweeted_status_user_id'].isnull()]     # keep only the rows that are not retweets

# Test: verified, the retweets have been removed.
final_Data_clean[final_Data_clean['retweeted_status_user_id'].notnull()]
(empty DataFrame: 0 rows × 30 columns)
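The row counts in this report (2073 rows after the merge, 64 duplicates dropped, 79 retweets, 1930 rows in the later info() output) suggest that every retweet survived the jpg_url de-duplication, so the 64 rows dropped as duplicates were probably the original tweets that those retweets repeat. The sketch below, on made-up rows, shows how the order of the two steps changes the result; it illustrates the effect and is not a re-run of the project data.

```python
import pandas as pd

# Hypothetical rows, newest first: a retweet, an unrelated tweet, and the
# original tweet whose image the retweet repeats (same jpg_url).
tweets = pd.DataFrame({
    "tweet_id": ["30", "20", "10"],
    "retweeted_status_user_id": [4.2e9, None, None],
    "jpg_url": ["a.jpg", "b.jpg", "a.jpg"],
})

# Order used in this report: de-duplicate first, then drop retweets.
dedup_first = tweets[~tweets["jpg_url"].duplicated()]
dedup_first = dedup_first[dedup_first["retweeted_status_user_id"].isnull()]

# Alternative order: drop retweets first, then de-duplicate.
retweets_first = tweets[tweets["retweeted_status_user_id"].isnull()]
retweets_first = retweets_first.drop_duplicates(subset="jpg_url", keep="first")

print(list(dedup_first["tweet_id"]))
print(list(retweets_first["tweet_id"]))
```

With the first order the original tweet "10" is lost along with its retweet; with the second order it is kept.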
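# Define: 6) quality issue: dog names may repeat, but name contains invalid values: a, None, the, an.

# Code
final_Data_clean['name'] = final_Data_clean['name'].replace(['a','None','the','an'],np.nan)  # replace a, None, the and an in name with NaN

# Test: a, None, the and an are gone (other lowercase words such as "my" are not in the list and remain).
final_Data_clean['name'].value_counts()   # value_counts() skips NaN by default, so the replaced values no longer appear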
Charlie      11
Cooper       10
Lucy         10
Oliver       10
Winston       8
Sadie         8
Penny         8
Tucker        8
Toby          7
Daisy         7
Stanley       6
Bella         6
Koda          6
Jax           6
Lola          6
Oscar         5
Leo           5
Chester       5
Louis         5
Buddy         5
Phil          4
Maggie        4
Duke          4
Gus           4
Rusty         4
Brody         4
Scout         4
Milo          4
Archie        4
Dexter        4
             ..
Blipson       1
Jangle        1
Taco          1
Willy         1
Pepper        1
Pipsy         1
Aja           1
Noah          1
Pip           1
Sailer        1
Clifford      1
Bertson       1
Thor          1
Julius        1
Flash         1
Binky         1
Ralphus       1
Rover         1
Shiloh        1
Margo         1
Tito          1
Brownie       1
my            1
Colin         1
Buckley       1
Alexander     1
Kulet         1
Keurig        1
Trigger       1
Jarod         1
Name: name, Length: 909, dtype: int64
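The output above still contains "my", which the hand-written list does not cover. One possible generalisation, sketched below, is to treat every all-lowercase value as a non-name. This rests on the assumption that real names in this dataset start with an uppercase letter; the sample Series is made up for illustration.

```python
import numpy as np
import pandas as pd

names = pd.Series(["Charlie", "a", "None", "my", "the", "Lucy", "an"])

# Assumption: a real name starts with an uppercase letter, and the literal
# string "None" is the archive's placeholder for "no name found".
not_a_name = names.str.islower() | (names == "None")
cleaned = names.mask(not_a_name, np.nan)
print(cleaned.tolist())
```

The rule is a heuristic, so the flagged values are worth a quick look before they are replaced.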
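# Define: 7) quality issue: in_reply_to_status_id and in_reply_to_user_id each have only 78 non-null values in the archive (23 after the merge) and are otherwise empty, so both columns are dropped.

# Code
final_Data_clean= final_Data_clean.drop(['in_reply_to_status_id','in_reply_to_user_id'],axis='columns')

# Test: passed, the two unneeded columns are gone.
final_Data_clean.info()  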
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1930 entries, 0 to 2072
Data columns (total 28 columns):
tweet_id                      1930 non-null object
timestamp                     1930 non-null object
source                        1930 non-null object
text                          1930 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1930 non-null object
rating_numerator              1930 non-null int64
rating_denominator            1930 non-null int64
name                          1333 non-null object
doggo                         1930 non-null object
floofer                       1930 non-null object
pupper                        1930 non-null object
puppo                         1930 non-null object
jpg_url                       1930 non-null object
img_num                       1930 non-null int64
p1                            1930 non-null object
p1_conf                       1930 non-null float64
p1_dog                        1930 non-null bool
p2                            1930 non-null object
p2_conf                       1930 non-null float64
p2_dog                        1930 non-null bool
p3                            1930 non-null object
p3_conf                       1930 non-null float64
p3_dog                        1930 non-null bool
retweet_count                 1930 non-null int64
favorite_count                1930 non-null int64
dtypes: bool(3), float64(5), int64(5), object(15)
memory usage: 397.7+ KB
# Define: 8) quality issue: 324 rows of image_predictions have no dog in any of the three predictions; remove them.

# Code
final_Data_clean[(final_Data_clean['p1_dog']==False)&(final_Data_clean['p2_dog']==False)&(final_Data_clean['p3_dog']==False)]  # rows where no prediction is a dog; these must be excluded
final_Data_clean = final_Data_clean[final_Data_clean['p1_dog']|final_Data_clean['p2_dog']|final_Data_clean['p3_dog']]   # keep only rows where at least one prediction is a dog breed

# Test: verified, rows where none of the three predictions is a dog have been removed.
final_Data_clean[(final_Data_clean['p1_dog']==False)&(final_Data_clean['p2_dog']==False)&(final_Data_clean['p3_dog']==False)]
(empty DataFrame: 0 rows × 28 columns)
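The three-way boolean condition above can also be written with any(axis=1), which reads as "at least one prediction is a dog". A small sketch on made-up flags:

```python
import pandas as pd

# Hypothetical prediction flags for three tweets.
preds = pd.DataFrame({
    "p1_dog": [False, True, False],
    "p2_dog": [False, False, False],
    "p3_dog": [False, True, True],
})

# True when at least one of the three predictions is a dog breed.
any_dog = preds[["p1_dog", "p2_dog", "p3_dog"]].any(axis=1)
dogs_only = preds[any_dog]
print(dogs_only.index.tolist())
```

The mask any_dog is the logical complement of the "no prediction is a dog" condition used in the test cell.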
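# Define: 9) quality issue: most rating denominators are 10 or a multiple of 10, but a few are not (for example 11, 2 and 7); recheck or re-extract them. (Tidiness issue 10, merging the three tables, was handled together with issue 1.)

# Code
print((final_Data_clean['rating_denominator']!=10).sum())    # only a few rows have a denominator other than 10, so inspect and fix them by hand
final_Data_clean[final_Data_clean['rating_denominator']!=10][['text','rating_numerator','rating_denominator']]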
17
index | text | rating_numerator | rating_denominator
344 | The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd | 84 | 70
414 | Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx | 24 | 7
734 | Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE | 165 | 150
876 | After so many requests, this is Bretagne. She was the last surviving 9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ | 9 | 11
967 | Happy 4/20 from the squad! 13/10 for all https://t.co/eV1diwds8a | 4 | 20
1001 | This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq | 50 | 50
1022 | Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1 | 99 | 90
1047 | Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12 | 80 | 80
1065 | From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK | 45 | 50
1131 | Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa | 60 | 50
1207 | Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ | 44 | 40
1379 | Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3 | 143 | 130
1380 | Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55 | 121 | 110
1405 | This is Darrel. He just robbed a 7/11 and is in a high speed police chase. Was just spotted by the helicopter 10/10 https://t.co/7EsP8LmSp5 | 7 | 11
1512 | IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq | 144 | 120
1571 | Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw | 88 | 80
2052 | This is an Albanian 3 1/2 legged Episcopalian. Loves well-polished hardwood flooring. Penis on the collar. 9/10 https://t.co/d9NcXFKwLv | 1 | 2
# There are only a few rows, so they can be checked by eye; most of them turn out to be correct.
# The wrong ones are mainly at index labels [876, 967, 1405, 2052]; their correct ratings are "14/10", "13/10", "10/10" and "9/10", so each is set individually.
# Fix the values
final_Data_clean.loc[876,"rating_numerator" ] = 14   # set numerator and denominator
final_Data_clean.loc[876,"rating_denominator" ]= 10

final_Data_clean.loc[967,"rating_numerator" ] = 13   # set numerator and denominator
final_Data_clean.loc[967,"rating_denominator" ]= 10

final_Data_clean.loc[1405,"rating_numerator" ]   =10 # set numerator and denominator
final_Data_clean.loc[1405,"rating_denominator" ]= 10

final_Data_clean.loc[2052,"rating_numerator" ]=9    # .loc selects by index label; rows have been dropped, so positional .iloc would no longer line up with these labels
final_Data_clean.loc[2052,"rating_denominator" ]=10


# Test 1
final_Data_clean[final_Data_clean['rating_denominator']!=10][['text','rating_numerator','rating_denominator']] # verified: the four rows were fixed
index | text | rating_numerator | rating_denominator
344 | The floofs have been released I repeat the floofs have been released. 84/70 https://t.co/NIYC820tmd | 84 | 70
414 | Meet Sam. She smiles 24/7 &amp; secretly aspires to be a reindeer. \nKeep Sam smiling by clicking and sharing this link:\nhttps://t.co/98tB8y7y7t https://t.co/LouL5vdvxx | 24 | 7
734 | Why does this never happen at my front door... 165/150 https://t.co/HmwrdfEfUE | 165 | 150
1001 | This is Bluebert. He just saw that both #FinalFur match ups are split 50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq | 50 | 50
1022 | Happy Saturday here's 9 puppers on a bench. 99/90 good work everybody https://t.co/mpvaVxKmc1 | 99 | 90
1047 | Here's a brigade of puppers. All look very prepared for whatever happens next. 80/80 https://t.co/0eb7R1Om12 | 80 | 80
1065 | From left to right:\nCletus, Jerome, Alejandro, Burp, &amp; Titson\nNone know where camera is. 45/50 would hug all at once https://t.co/sedre1ivTK | 45 | 50
1131 | Here is a whole flock of puppers. 60/50 I'll take the lot https://t.co/9dpcw6MdWa | 60 | 50
1207 | Happy Wednesday here's a bucket of pups. 44/40 would pet all at once https://t.co/HppvrYuamZ | 44 | 40
1379 | Two sneaky puppers were not initially seen, moving the rating to 143/130. Please forgive us. Thank you https://t.co/kRK51Y5ac3 | 143 | 130
1380 | Someone help the girl is being mugged. Several are distracting her while two steal her shoes. Clever puppers 121/110 https://t.co/1zfnTJLt55 | 121 | 110
1512 | IT'S PUPPERGEDDON. Total of 144/120 ...I think https://t.co/ZanVtAtvIq | 144 | 120
1571 | Here we have an entire platoon of puppers. Total score: 88/80 would pet all at once https://t.co/y93p6FLvVw | 88 | 80
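The four manual fixes share one pattern: the text contains an earlier fraction such as 9/11 or 4/20, and the real rating is the last fraction in the tweet. A rough sketch of re-extracting ratings under that assumption follows; the helper last_rating is hypothetical and the sample texts are copied from the rows listed above.

```python
import re

# A rating looks like "<number>/<integer>"; the numerator may have decimals.
RATING_RE = re.compile(r"(\d+(?:\.\d+)?)/(\d+)")

def last_rating(text):
    """Return (numerator, denominator) of the last fraction in text, or None."""
    matches = RATING_RE.findall(text)
    if not matches:
        return None
    numerator, denominator = matches[-1]
    return float(numerator), int(denominator)

bretagne = ("After so many requests, this is Bretagne. She was the last surviving "
            "9/11 search dog, and our second ever 14/10. RIP https://t.co/XAVDNDaVgQ")
bluebert = ("This is Bluebert. He just saw that both #FinalFur match ups are split "
            "50/50. Amazed af. 11/10 https://t.co/Kky1DPG4iq")

print(last_rating(bretagne))
print(last_rating(bluebert))
```

Applied to the rows listed above, the same rule would also pick 11/10 for the Bluebert tweet (index 1001), which the manual pass leaves at 50/50. A tweet with no real rating, such as the "24/7" one, would still need a manual check.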
# Divide the numerator by the denominator and store the result in a new column.
# Test 2
final_Data_clean['Final_Grade']=final_Data_clean['rating_numerator']/final_Data_clean['rating_denominator']
final_Data_clean['Final_Grade']=final_Data_clean['Final_Grade'].apply(lambda x: '%.2f'%x   )   # format to two decimal places (the values become strings)
final_Data_clean['Final_Grade'].value_counts()   # distribution of the final grades

1.20    407
1.00    356
1.10    348
1.30    214
0.90    134
0.80     68
0.70     31
1.40     22
0.60     16
0.50     14
0.40      6
0.30      5
0.20      2
7.50      1
3.43      1
2.70      1
0.00      1
2.60      1
Name: Final_Grade, dtype: int64
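Formatting with '%.2f' turns Final_Grade into strings, which makes later numeric work (sorting, averaging, plotting) awkward. A minimal sketch of keeping the column numeric with round(2) instead; the three sample ratings are made up, and this is an alternative to the cell above rather than what the report ran.

```python
import pandas as pd

# Hypothetical ratings.
ratings = pd.DataFrame({
    "rating_numerator": [12, 84, 5],
    "rating_denominator": [10, 70, 10],
})

# round() keeps the column as float64, unlike '%.2f' % x, which yields strings.
ratings["Final_Grade"] = (
    ratings["rating_numerator"] / ratings["rating_denominator"]
).round(2)

print(ratings["Final_Grade"].dtype)
print(ratings["Final_Grade"].tolist())
```

With a float column, value_counts() labels would print as 1.2 rather than 1.20, so the displayed output would differ slightly from the one above.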
# Define: 11) tidiness issue: some expanded_urls cells contain the same URL several times. (Tidiness issue 10, the merge, was resolved at the very start.)
final_Data_clean['expanded_urls']  # some cells hold several copies of the same link, joined by commas

# Code
def cleanExpanded_urls(string):
    # keep only the last link of a comma-separated cell
    return string.split(",")[-1]
final_Data_clean['expanded_urls']  = final_Data_clean['expanded_urls'].apply(cleanExpanded_urls)


# Test: cleaned
final_Data_clean['expanded_urls'] 
1       https://twitter.com/dog_rates/status/892177421306343426/photo/1
2       https://twitter.com/dog_rates/status/891815181378084864/photo/1
3       https://twitter.com/dog_rates/status/891689557279858688/photo/1
4       https://twitter.com/dog_rates/status/891327558926688256/photo/1
5       https://twitter.com/dog_rates/status/891087950875897856/photo/1
6       https://twitter.com/dog_rates/status/890971913173991426/photo/1
7       https://twitter.com/dog_rates/status/890729181411237888/photo/1
8       https://twitter.com/dog_rates/status/890609185150312448/photo/1
9       https://twitter.com/dog_rates/status/890240255349198849/photo/1
10      https://twitter.com/dog_rates/status/890006608113172480/photo/1
11      https://twitter.com/dog_rates/status/889880896479866881/photo/1
12      https://twitter.com/dog_rates/status/889665388333682689/photo/1
13      https://twitter.com/dog_rates/status/889638837579907072/photo/1
14      https://twitter.com/dog_rates/status/889531135344209921/photo/1
15      https://twitter.com/dog_rates/status/889278841981685760/video/1
16      https://twitter.com/dog_rates/status/888917238123831296/photo/1
17      https://twitter.com/dog_rates/status/888804989199671297/photo/1
18      https://twitter.com/dog_rates/status/888554962724278272/photo/1
19      https://twitter.com/dog_rates/status/888078434458587136/photo/1
20      https://twitter.com/dog_rates/status/887705289381826560/photo/1
22      https://twitter.com/dog_rates/status/887473957103951883/photo/1
23      https://twitter.com/dog_rates/status/887343217045368832/video/1
24      https://twitter.com/dog_rates/status/887101392804085760/photo/1
25      https://twitter.com/dog_rates/status/886983233522544640/photo/1
26      https://twitter.com/dog_rates/status/886736880519319552/photo/1
28      https://twitter.com/dog_rates/status/886366144734445568/photo/1
29      https://twitter.com/dog_rates/status/886258384151887873/photo/1
30      https://twitter.com/dog_rates/status/885984800019947520/photo/1
31      https://twitter.com/dog_rates/status/885528943205470208/photo/1
33      https://twitter.com/dog_rates/status/885167619883638784/photo/1
                                     ...                               
2037    https://twitter.com/dog_rates/status/666437273139982337/photo/1
2038    https://twitter.com/dog_rates/status/666435652385423360/photo/1
2039    https://twitter.com/dog_rates/status/666430724426358785/photo/1
2040    https://twitter.com/dog_rates/status/666428276349472768/photo/1
2041    https://twitter.com/dog_rates/status/666421158376562688/photo/1
2042    https://twitter.com/dog_rates/status/666418789513326592/photo/1
2044    https://twitter.com/dog_rates/status/666407126856765440/photo/1
2045    https://twitter.com/dog_rates/status/666396247373291520/photo/1
2046    https://twitter.com/dog_rates/status/666373753744588802/photo/1
2048    https://twitter.com/dog_rates/status/666353288456101888/photo/1
2049    https://twitter.com/dog_rates/status/666345417576210432/photo/1
2050    https://twitter.com/dog_rates/status/666337882303524864/photo/1
2052    https://twitter.com/dog_rates/status/666287406224695296/photo/1
2053    https://twitter.com/dog_rates/status/666273097616637952/photo/1
2056    https://twitter.com/dog_rates/status/666102155909144576/photo/1
2057    https://twitter.com/dog_rates/status/666099513787052032/photo/1
2058    https://twitter.com/dog_rates/status/666094000022159362/photo/1
2059    https://twitter.com/dog_rates/status/666082916733198337/photo/1
2060    https://twitter.com/dog_rates/status/666073100786774016/photo/1
2061    https://twitter.com/dog_rates/status/666071193221509120/photo/1
2062    https://twitter.com/dog_rates/status/666063827256086533/photo/1
2063    https://twitter.com/dog_rates/status/666058600524156928/photo/1
2064    https://twitter.com/dog_rates/status/666057090499244032/photo/1
2065    https://twitter.com/dog_rates/status/666055525042405380/photo/1
2067    https://twitter.com/dog_rates/status/666050758794694657/photo/1
2068    https://twitter.com/dog_rates/status/666049248165822465/photo/1
2069    https://twitter.com/dog_rates/status/666044226329800704/photo/1
2070    https://twitter.com/dog_rates/status/666033412701032449/photo/1
2071    https://twitter.com/dog_rates/status/666029285002620928/photo/1
2072    https://twitter.com/dog_rates/status/666020888022790149/photo/1
Name: expanded_urls, Length: 1628, dtype: object
# Define: 12) Tidiness issue: the dog stage (which depends on the dog's size) is a single categorical variable and should be stored in one column

# Code
final_Data_clean['stage'] = final_Data_clean['text'].str.findall("(puppo|doggo|pupper|floofer)")
# The stage labels originally come from the tweet text, so they are re-extracted from the text here.
# One tweet can mention the same stage several times: as the counts below show, findall keeps matching to the end of the
# text and returns a list, so joining that list directly would repeat labels.
# The fix is to de-duplicate with set(listObject) so that each stage word is kept only once.
print(final_Data_clean['stage'].value_counts())  # 1362 tweets have no stage word in their text; they are set to NaN below


final_Data_clean['stage'] = final_Data_clean['stage'].apply(lambda x: "-".join(set(x)))  # join was my own idea; set() was borrowed from an online example
print("Before setting empty values to NaN")
print(final_Data_clean['stage'].value_counts())  # the 1362 empty strings are still counted here
final_Data_clean['stage'] = final_Data_clean['stage'].replace("", np.nan)

# Test: verify the result.
print()
print("After setting empty values to NaN")
print(final_Data_clean['stage'].value_counts())  # NaN values are excluded from the counts

final_Data_clean.drop(['doggo', 'puppo', 'pupper', 'floofer'], axis=1, inplace=True)  # drop the four original stage columns
final_Data_clean.info()

[]                          1362
[pupper]                     171
[doggo]                       54
[puppo]                       24
[pupper, pupper]               6
[doggo, pupper]                4
[floofer]                      3
[puppo, doggo]                 2
[pupper, pupper, pupper]       1
[pupper, doggo, doggo]         1
Name: stage, dtype: int64
Before setting empty values to NaN
                1362
pupper           178
doggo             54
puppo             24
pupper-doggo       5
floofer            3
puppo-doggo        2
Name: stage, dtype: int64

After setting empty values to NaN
pupper          178
doggo            54
puppo            24
pupper-doggo      5
floofer           3
puppo-doggo       2
Name: stage, dtype: int64
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1628 entries, 1 to 2072
Data columns (total 26 columns):
tweet_id                      1628 non-null object
timestamp                     1628 non-null object
source                        1628 non-null object
text                          1628 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null object
expanded_urls                 1628 non-null object
rating_numerator              1628 non-null int64
rating_denominator            1628 non-null int64
name                          1166 non-null object
jpg_url                       1628 non-null object
img_num                       1628 non-null int64
p1                            1628 non-null object
p1_conf                       1628 non-null float64
p1_dog                        1628 non-null bool
p2                            1628 non-null object
p2_conf                       1628 non-null float64
p2_dog                        1628 non-null bool
p3                            1628 non-null object
p3_conf                       1628 non-null float64
p3_dog                        1628 non-null bool
retweet_count                 1628 non-null int64
favorite_count                1628 non-null int64
Final_Grade                   1628 non-null object
stage                         266 non-null object
dtypes: bool(3), float64(5), int64(5), object(13)
memory usage: 390.0+ KB
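The de-duplication step relies on set(), whose iteration order is not guaranteed, so the same pair of stages could in principle come out as pupper-doggo in one run and doggo-pupper in another. Below is a minimal sketch of an order-stable variant using only the standard library. The helper name extract_stage is hypothetical and not part of the original notebook, and sorting the labels alphabetically is an assumption of this sketch, so combined labels would read doggo-pupper rather than pupper-doggo.

```python
import re

# Same alternatives as the pattern used in the cleaning step.
STAGE_PATTERN = re.compile(r"puppo|doggo|pupper|floofer")


def extract_stage(text):
    """Return the distinct stage words in text joined by '-', or None if there are none.

    Hypothetical helper: sorting makes the label order deterministic.
    """
    found = STAGE_PATTERN.findall(text)
    if not found:
        return None
    return "-".join(sorted(set(found)))


print(extract_stage("Such a good pupper. Best pupper."))
print(extract_stage("A doggo and a pupper"))
print(extract_stage("No stage mentioned here"))
```

Under these assumptions it could be applied to the text column with something like final_Data_clean['text'].apply(extract_stage), which would also make the separate replace("", np.nan) step unnecessary because missing stages are returned as None.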
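Save the cleaned and merged dataset
# Save the clean dataset to twitter_archive_master.csv.
# Note: the index is written as well, which is why an "Unnamed: 0" column appears when the file is read back; passing index=False would avoid that.
final_Data_clean.to_csv("twitter_archive_master.csv", encoding='utf-8')
print("Saved: twitter_archive_master.csv")
Saved: twitter_archive_master.csv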

4. Analysis

Questions

  • Which stage is most common among the dogs?
  • What are the 10 most common names in the dataset?
  • What score do most dogs receive?
twitter_archive_master= pd.read_csv("twitter_archive_master.csv")  # load the cleaned dataset
twitter_archive_master.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1628 entries, 0 to 1627
Data columns (total 27 columns):
Unnamed: 0                    1628 non-null int64
tweet_id                      1628 non-null int64
timestamp                     1628 non-null object
source                        1628 non-null object
text                          1628 non-null object
retweeted_status_id           0 non-null float64
retweeted_status_user_id      0 non-null float64
retweeted_status_timestamp    0 non-null float64
expanded_urls                 1628 non-null object
rating_numerator              1628 non-null int64
rating_denominator            1628 non-null int64
name                          1166 non-null object
jpg_url                       1628 non-null object
img_num                       1628 non-null int64
p1                            1628 non-null object
p1_conf                       1628 non-null float64
p1_dog                        1628 non-null bool
p2                            1628 non-null object
p2_conf                       1628 non-null float64
p2_dog                        1628 non-null bool
p3                            1628 non-null object
p3_conf                       1628 non-null float64
p3_dog                        1628 non-null bool
retweet_count                 1628 non-null int64
favorite_count                1628 non-null int64
Final_Grade                   1628 non-null float64
stage                         266 non-null object
dtypes: bool(3), float64(7), int64(7), object(10)
memory usage: 310.1+ KB
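# Analysis: which stage is most common among the dogs?

plt.title("Proportion of dogs in each stage")
twitter_archive_master.stage.value_counts().plot(kind="pie",figsize=(8, 8),autopct='%.1f')   

# Conclusion: as the chart shows, pupper is the most common stage, at 66.9% of the 266 dogs that have a stage label
<matplotlib.axes._subplots.AxesSubplot at 0x1c04945a080>

[Figure: pie chart "Proportion of dogs in each stage"]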
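The 66.9% figure is a share of the labelled dogs only, not of the whole dataset. As a quick check, the following sketch recomputes the shares from the counts printed by value_counts() in the cleaning step; the variable names are illustrative and not taken from the notebook.

```python
# Counts copied from the value_counts() output of the 'stage' column.
stage_counts = {
    "pupper": 178,
    "doggo": 54,
    "puppo": 24,
    "pupper-doggo": 5,
    "floofer": 3,
    "puppo-doggo": 2,
}

total_labelled = sum(stage_counts.values())  # dogs with a stage label
total_tweets = 1628                          # rows in the cleaned dataset

# Share of each stage among the labelled dogs (what the pie chart shows).
for stage, count in stage_counts.items():
    print(f"{stage:<13}{100 * count / total_labelled:5.1f}%")

# Share of pupper among labelled dogs versus among all tweets.
pupper_share_labelled = 100 * stage_counts["pupper"] / total_labelled
pupper_share_all = 100 * stage_counts["pupper"] / total_tweets
print(round(pupper_share_labelled, 1), round(pupper_share_all, 1))
```

Only 266 of the 1628 tweets carry a stage label, so the pie chart describes a small subset of the data.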

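# Analysis: which names are most popular for dogs in the dataset?
plt.title("Top 10 favorite names")
# The bars are horizontal, so the counts are on the x-axis and the names on the y-axis
plt.xlabel("Quantity")
plt.ylabel("Names")
print(twitter_archive_master.name.value_counts().iloc[0:10])
twitter_archive_master.name.value_counts().iloc[0:10].sort_values().plot(kind="barh")

# Conclusion: the 10 most common names are listed below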
Cooper     10
Lucy       10
Charlie    10
Oliver      9
Tucker      8
Winston     7
Daisy       7
Penny       7
Sadie       7
Jax         6
Name: name, dtype: int64
<matplotlib.axes._subplots.AxesSubplot at 0x1c049385c88>

[Figure: horizontal bar chart "Top 10 favorite names"]

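# Analysis: what score do most dogs receive?

plt.title("Dog score")

print(twitter_archive_master.Final_Grade.value_counts().iloc[0:10])
twitter_archive_master.Final_Grade.value_counts().iloc[0:10].plot(kind="pie",autopct='%.2f',figsize=(8, 8))

# Conclusion: the four largest slices (1.2, 1.0, 1.1 and 1.3) are all scores of 1.0 or above. Together they make up
# 25.28 + 22.11 + 21.61 + 13.29 ≈ 82% of the tweets in the pie (the ten most common scores), so roughly four out of five dogs are rated 1.0 or higher.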
1.2    407
1.0    356
1.1    348
1.3    214
0.9    134
0.8     68
0.7     31
1.4     22
0.6     16
0.5     14
Name: Final_Grade, dtype: int64
<matplotlib.axes._subplots.AxesSubplot at 0x1c0496d3be0>

[Figure: pie chart "Dog score"]
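The percentages in the pie are relative to the ten most common scores, not to all 1628 tweets. The sketch below recomputes them from the printed counts, so the share of scores at or above 1.0 can be checked directly; the variable names are illustrative and not from the notebook.

```python
# Counts copied from the value_counts() output of Final_Grade (ten most common scores).
score_counts = {
    1.2: 407, 1.0: 356, 1.1: 348, 1.3: 214, 0.9: 134,
    0.8: 68, 0.7: 31, 1.4: 22, 0.6: 16, 0.5: 14,
}

top10_total = sum(score_counts.values())

# Tweets whose score is 1.0 or higher, among the ten most common scores.
at_least_one = sum(count for score, count in score_counts.items() if score >= 1.0)

# The four largest slices only (1.2, 1.0, 1.1, 1.3).
four_largest = sum(sorted(score_counts.values(), reverse=True)[:4])

share_at_least_one = 100 * at_least_one / top10_total
share_four_largest = 100 * four_largest / top10_total
print(top10_total, at_least_one, four_largest)
print(round(share_at_least_one, 1), round(share_four_largest, 1))
```

The four largest slices alone account for about 82% of the pie, and including the 1.4 slice raises the share of scores at or above 1.0 to about 84%.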

