Python Web Scraping (Part 1)

Latest recommended article published on 2025-07-29 19:45:28

北邮张博


Category: python. Tags: python, web scraping


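This article presents a simple Python scraper that fetches images and related information from a given web page and saves them locally. It uses urllib to retrieve the page and BeautifulSoup to parse it.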


Saving an image

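from urllib.request import urlopen

# Given an image URL and a file name, save a single image (Python 3).
def saveImg(self, imageURL, fileName):
    # Download the raw bytes, then write them in binary mode.
    with urlopen(imageURL) as u:
        data = u.read()
    with open(fileName, 'wb') as f:
        f.write(data)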
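
For larger images, reading the whole response into memory before writing it can be wasteful. The following is a minimal sketch, assuming Python 3, that streams the response to disk instead. The name `download_to` and the timeout value are illustrative assumptions, not part of the original code.

```python
import shutil
from urllib.request import urlopen


def download_to(url, file_name, timeout=10):
    """Stream the body at `url` into `file_name` (hypothetical helper).

    The default timeout is an illustrative assumption.
    """
    with urlopen(url, timeout=timeout) as response, open(file_name, "wb") as out:
        # copyfileobj copies in chunks, so the whole image is never held in memory.
        shutil.copyfileobj(response, out)
    return file_name
```

Because the files are opened with `with`, they are closed even if the download fails partway through.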

Writing to a file

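def saveBrief(self, content, name):
    # Write the text to <name>/<name>.txt; the directory must already exist.
    fileName = name + "/" + name + ".txt"
    print("Saving info to", fileName)
    # Open with an explicit encoding and close the file when done (Python 3).
    with open(fileName, "w", encoding="utf-8") as f:
        f.write(content)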

Creating a new folder

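import os

# Create a new directory
def mkdir(self, path):
    path = path.strip()
    # os.path.exists returns True if the path exists, False otherwise
    isExists = os.path.exists(path)
    if not isExists:
        # The directory does not exist, so create it (including any parents)
        os.makedirs(path)
        return True
    else:
        # The directory already exists, so nothing is created
        return False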
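
A directory check like this is usually followed by the file write. As a minimal sketch, assuming Python 3, the two steps can be combined as below. The names `ensure_dir` and `save_text` are hypothetical, and `os.makedirs(..., exist_ok=True)` stands in for the explicit existence check.

```python
import os


def ensure_dir(path):
    """Create `path` if needed; return True only when it was newly created."""
    path = path.strip()
    existed = os.path.isdir(path)
    # exist_ok=True makes the call safe when the directory is already there.
    os.makedirs(path, exist_ok=True)
    return not existed


def save_text(content, directory, name):
    """Write `content` as UTF-8 into <directory>/<name>.txt and return the path."""
    ensure_dir(directory)
    file_name = os.path.join(directory, name + ".txt")
    with open(file_name, "w", encoding="utf-8") as f:
        f.write(content)
    return file_name
```

Calling `save_text` never fails because of a missing directory, since the directory is created on demand.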

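After reading the first chapter of 《python数据采集》 (Python data collection), I practiced by saving the images and image information from the demo page to local disk. I am keeping the code here for future reference.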

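from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup


def saveImg(imageURL, fileName):
    # Download the image bytes and write them under the project directory.
    with urlopen(imageURL) as u:
        data = u.read()
    with open("/home/zhangbo/PycharmProjects/reptile/" + fileName, 'wb') as f:
        f.write(data)

main_url = "http://www.pythonscraping.com/pages/"
response = urlopen(main_url + "page3.html")
# Name the parser explicitly so bs4 does not have to guess one.
bsObj = BeautifulSoup(response.read(), "html.parser")
giftList = bsObj.find_all("tr", {"class": "gift"})
for gift in giftList:
    cells = gift.find_all("td")
    # The first cell holds the item title, used here as the file name.
    fileName = cells[0].get_text().strip()
    # Read the src attribute instead of slicing the tag's string form,
    # and resolve the relative path against the page URL.
    imgURL = urljoin(main_url, cells[3].find("img")["src"])
    saveImg(imgURL, fileName)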
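
The script above builds each image URL from a path relative to the page, and the item titles used as file names carry no extension. The sketch below shows how `urljoin` resolves such a path and how the extension can be taken from the URL. It assumes an `src` value of the form `../img/gifts/img1.jpg`, which is an assumption about the demo page's markup. The helper `image_target` and the sample title are hypothetical.

```python
import os
from urllib.parse import urljoin, urlparse


def image_target(page_url, src, title):
    """Return (absolute image URL, local file name) for one table row."""
    image_url = urljoin(page_url, src)
    # Take the extension (for example ".jpg") from the URL path.
    extension = os.path.splitext(urlparse(image_url).path)[1]
    return image_url, title.strip() + extension


url, name = image_target(
    "http://www.pythonscraping.com/pages/page3.html",
    "../img/gifts/img1.jpg",   # assumed src value
    " Vegetable Basket ",      # illustrative title
)
print(url)   # http://www.pythonscraping.com/img/gifts/img1.jpg
print(name)  # Vegetable Basket.jpg
```

Resolving against the page URL avoids leaving a `..` segment in the request path, and keeping the extension lets image viewers recognize the saved files.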
