Twitter Dog Data Cleaning

Latest recommended article published on 2021-11-19 17:49:02

License: CC 4.0 BY-SA

Gathering Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import json
import os

# Load the Twitter archive
twitter_archive_enhanced=pd.read_csv('twitter-archive-enhanced.csv')

# Download the image prediction data
url='https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/\
WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv'

response=requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
with open('image-predictions',mode='wb') as file:
    file.write(response.content)
    
image_predictions=pd.read_csv('image-predictions',sep='\t')

image_predictions.to_csv('image_predictions.tsv', index=False)

# Load the additional tweet data
# Parse each JSON line into a dict, collect the dicts in a list, then build a DataFrame from that list
tweet_list=[]
with open('d:\\tweet_json.txt','r') as f:
    for row in f:
        json_dict= json.loads(row) # json.loads parses a string; json.load reads from a file object
        to_append= {
            'tweet_id':json_dict['id_str'],
            'retweet_count':json_dict['retweet_count'],
            'favorite_count':json_dict['favorite_count']
        }
        tweet_list.append(to_append)
tweet = pd.DataFrame(tweet_list, columns = ['tweet_id','retweet_count','favorite_count'])
tweet.head()

Assessing Data

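twitter_archive_enhanced table columns
tweet_id: the tweet ID in the archive
in_reply_to_status_id: ID of the tweet being replied to
in_reply_to_user_id: ID of the user who wrote the tweet being replied to
timestamp: time the tweet was posted
source: where the tweet came from (the device or client used)
text: tweet text
retweeted_status_id: ID of the retweeted tweet
retweeted_status_user_id: ID of the user who was retweeted
retweeted_status_timestamp: time of the retweeted tweet
expanded_urls: URL of the tweet
rating_numerator: rating numerator
rating_denominator: rating denominator
name: the pet's name
doggo: dog growth stage, categorical variable
floofer: dog growth stage, categorical variable
pupper: dog growth stage, categorical variable
puppo: dog growth stage, categorical variable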

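image_predictions table columns
tweet_id: the tweet ID in the archive
jpg_url: URL of the image that was classified
img_num: number of the image that produced the most confident prediction
p1: the algorithm's first prediction for the image in the tweet
p1_conf: confidence of the first prediction
p1_dog: whether the first prediction is a dog (it may be another species, such as a bear or a horse)
p2: the algorithm's second most likely prediction for the image
p2_conf: confidence of the second prediction
p2_dog: whether the second prediction is a dog
p3: the algorithm's third most likely prediction for the image
p3_conf: confidence of the third prediction
p3_dog: whether the third prediction is a dog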

Quality

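twitter_archive_enhanced table
1. The source column contains surplus HTML markup that needs to be stripped.
2. Retweets need to be removed.
3. expanded_urls has missing values.
4. Rating denominators are not all 10; ratings should be re-extracted from text.
5. The dog stage columns have many missing values, and some tweets are tagged with two stages.
6. The name column contains wrongly extracted values, such as "a".
7. The tweet_id column has the wrong type: it is int64 but should be a string.
8. in_reply_to_status_id and in_reply_to_user_id are mostly missing and need to be dropped.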
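image_predictions table
9. The image URLs contain 66 duplicates, which are presumably retweets and need to be removed.
10. The tweet_id column has the same problem: it is int64 but should be a string.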
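The two image_predictions issues above can be checked and fixed in a few lines. This is a sketch under assumptions: the helper names are hypothetical, and keeping the first row for each `jpg_url` assumes the original tweet appears before its retweet, which the source does not confirm.

```python
import pandas as pd

def count_duplicate_images(df: pd.DataFrame) -> int:
    """Number of rows whose jpg_url already appeared in an earlier row."""
    return int(df['jpg_url'].duplicated().sum())

def dedupe_predictions(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with string tweet_id values and one row per jpg_url.

    Hypothetical helper; keep='first' is an assumption about row order.
    """
    out = df.copy()
    out['tweet_id'] = out['tweet_id'].astype(str)
    return out.drop_duplicates(subset='jpg_url', keep='first').reset_index(drop=True)
```

Running `count_duplicate_images` before and after `dedupe_predictions` gives a quick confirmation that the duplicates are gone.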

tweet table
None found so far.

Tidiness

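1. In the twitter_archive_enhanced table, doggo, floofer, pupper and puppo are values of one categorical variable and should be a single column.
2. All three tables describe the same observational unit, so the three pieces of data can be merged.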
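Both tidiness points can be sketched as follows. The helper names and the new column name `stage` are hypothetical, the stage columns are assumed to hold their own name when the stage applies, and the inner join is a design choice made here for illustration (it keeps only tweets present in all three tables).

```python
import pandas as pd

STAGE_COLS = ['doggo', 'floofer', 'pupper', 'puppo']

def combine_stages(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse the four stage columns into one 'stage' column."""
    out = df.copy()
    # True where a stage column holds its own name (assumption about the raw data)
    flags = out[STAGE_COLS].isin(STAGE_COLS)
    # Join the names of every stage present, e.g. 'doggo,pupper' for two stages
    stage = flags.apply(lambda row: ','.join(c for c in STAGE_COLS if row[c]), axis=1)
    # Rows with no stage become missing values instead of empty strings
    out['stage'] = stage.where(stage != '')
    return out.drop(columns=STAGE_COLS)

def merge_tables(archive: pd.DataFrame, predictions: pd.DataFrame,
                 tweets: pd.DataFrame) -> pd.DataFrame:
    """Join the three tables on tweet_id after casting the key to string."""
    frames = [f.assign(tweet_id=f['tweet_id'].astype(str))
              for f in (archive, predictions, tweets)]
    merged = frames[0].merge(frames[1], on='tweet_id', how='inner')
    return merged.merge(frames[2], on='tweet_id', how='inner')
```

Casting `tweet_id` to string inside `merge_tables` matters because the tweet table stores it as a string (from `id_str`) while the other two tables store it as int64, and pandas will not merge on keys of mismatched types.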

Cleaning Data

# Copy the datasets
archive_enhanced_clean = twitter_archive_enhanced.copy()
image_predictions_clean = image_predictions.copy()
tweet_clean = tweet.copy()

# Method 1
from bs4 import BeautifulSoup
# Extract the link text from the HTML in source
df_list=[]
for i in archive_enhanced_clean.source:
    soup=BeautifulSoup(i,'lxml')
    sources=soup.find('a').string
    df_list.append(sources)
# Write the result back to the DataFrame
archive_enhanced_clean.source=df_list

# Method 2 (improved); use it instead of method 1, not after it
#archive_enhanced_clean.source = archive_enhanced_clean.source.str.extract('>(.+)<',expand = False)

# Check the result
print(archive_enhanced_clean.source.value_counts())
Ser