1. Scraping and cleaning the data
(1) Fetching the titles and authors and organizing the data
import requests
from bs4 import BeautifulSoup

data_all = []
for i in range(0, 10):
    # NOTE: as written the same page is fetched 10 times; the URL should
    # change with i to walk through successive list pages.
    url = 'http://bbs.tianya.cn/list-no02-1.shtml'
    douban_data = requests.get(url)
    soup = BeautifulSoup(douban_data.text, 'lxml')
    titles = soup.select('tr.bg td.td-title a')
    authors = soup.select('tr.bg td a.author')
    for title, author in zip(titles, authors):
        data = {'title': title.get_text().strip().split()[0],
                'author': author.get_text().strip()}
        data_all.append(data)

len(data_all)
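The pairing logic above can be checked offline: `zip` matches titles with authors by position, and `split()[0]` keeps only the first whitespace-separated token of the title cell. A minimal sketch using plain strings in place of the `<a>` tags (the sample values are made up for illustration):

```python
# Stand-ins for the tag texts extracted from the page (hypothetical values).
titles = ['FirstPost extra-token', 'SecondPost']
authors = ['user_a', 'user_b']

data_all = [{'title': t.strip().split()[0],   # keep only the first token
             'author': a.strip()}
            for t, a in zip(titles, authors)]  # pair items by position
```

Note that `zip` stops at the shorter list, so if a row is missing its author link, titles and authors can silently fall out of alignment.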
(2) Fetching the click counts and reply counts (this should be done in a loop, because each page has a different URL)
import requests
from bs4 import BeautifulSoup

url = 'http://bbs.tianya.cn/list.jsp?item=no02&nextid=1556923587000'
douban_data = requests.get(url)
soup = BeautifulSoup(douban_data.text, 'lxml')
a_all = soup.select('td')  # every <td> cell on the page
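The heading notes that each page has a different URL; judging from the sample URL, what changes between pages is the `nextid` query parameter. A small standard-library sketch showing how that parameter can be read out of a URL (the assumption that `nextid` is the paging cursor comes only from the sample URL above):

```python
from urllib.parse import urlparse, parse_qs

# Parse the query string to see which parameter would have to change
# between requests when looping over pages.
url = 'http://bbs.tianya.cn/list.jsp?item=no02&nextid=1556923587000'
params = parse_qs(urlparse(url).query)
```

In practice the value of the next `nextid` would have to be scraped from each page's "next" link rather than guessed.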
(3) Organizing the click-count and reply-count data
import pandas as pd

# The click count sits at index j and the reply count at j + 1,
# repeating every 5 <td> cells starting from index 2.
data_all1 = []
for j in range(2, 400, 5):
    a_data = {'click': a_all[j].get_text().strip(),
              'response': a_all[j + 1].get_text().strip()}
    data_all1.append(a_data)
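The indexing pattern can be verified without the live page: a plain list of fake cell texts stands in for `a_all`, and the same stride-of-5 walk picks out the (click, reply) pairs.

```python
# Stand-in for a_all: 20 fake cell texts; on the real page these are <td> tags.
cells = ['cell{}'.format(i) for i in range(20)]

# Click count at index j, reply count at j + 1, one pair per 5-cell row.
pairs = [{'click': cells[j], 'response': cells[j + 1]}
         for j in range(2, len(cells) - 1, 5)]
# j takes the values 2, 7, 12, 17 -> four pairs
```

The hard-coded 400 in the loop above is the same idea with the page's cell count baked in; deriving the bound from `len(a_all)` as here is safer.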
(4) Merging the two datasets
① First convert the list-format data to DataFrames
1) import pandas as pd
data_pd1 = pd.DataFrame(data_all, columns=['author', 'title'])
2) data_pd2 = pd.DataFrame(data_all1, columns=['click', 'response'])
② Generate a key column for the merge
# Generate a set of non-repeating random numbers to serve as the key
# on which the two DataFrames will be merged
import random
listww = random.sample(range(0, 400), 400)
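`random.sample` draws without replacement, which is what makes these values usable as a merge key: every row gets a distinct value, so the later `merge` pairs each row of one frame with exactly one row of the other.

```python
import random

# Sampling 400 values from a population of 400 without replacement
# yields a random permutation of 0..399 -- all values distinct.
listww = random.sample(range(0, 400), 400)
```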
③ Add the key column to both DataFrames
1) data_pd1['key'] = listww
data_pd1
2) data_pd2['key'] = listww
data_pd2
④ Merge
result = pd.merge(data_pd1, data_pd2, on='key')
⑤ Drop the key column
re = result.drop('key', axis=1)  # note: the name "re" shadows Python's standard re module
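Since both frames have the same length and the same row order, the synthetic key is not strictly needed: `pd.concat` along `axis=1` aligns rows by position directly. A sketch with made-up data:

```python
import pandas as pd

# Alternative to the key-based merge: column-wise concatenation pairs
# row i of one frame with row i of the other, no key column required.
df_titles = pd.DataFrame({'author': ['a1', 'a2'], 'title': ['t1', 't2']})
df_counts = pd.DataFrame({'click': [10, 20], 'response': [1, 2]})
combined = pd.concat([df_titles.reset_index(drop=True),
                      df_counts.reset_index(drop=True)], axis=1)
```

The `reset_index(drop=True)` calls guard against mismatched index labels, which would otherwise make `concat` align (and possibly misalign) rows by label.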
(5) Saving the file in CSV format
re.to_csv('sqy.csv', index=False)
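`index=False` keeps pandas's row index out of the file, so reading the CSV back reproduces exactly the original columns. A self-contained round-trip sketch (the file name and data are hypothetical):

```python
import os
import tempfile

import pandas as pd

# Write a small frame without the index, then read it back.
df = pd.DataFrame({'title': ['t1', 't2'], 'author': ['a1', 'a2']})
path = os.path.join(tempfile.mkdtemp(), 'sqy_demo.csv')  # hypothetical path
df.to_csv(path, index=False)
back = pd.read_csv(path)
```

Without `index=False`, the saved file would gain an extra unnamed column holding the 0..n-1 index.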