How to scrape China's GDP for the last ten years, process the data, and write it to a CSV file

Article link: https://blog.youkuaiyun.com/m0_50628114/article/details/112666855

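Previous post: https://blog.youkuaiyun.com/m0_50628114/article/details/112561146. That post showed how to extract China's GDP figures for the last ten years, but the raw data is untidy. This post shows how to pull out the values we actually want. The raw data looks like this (column names: 年份 = year, 中国 = China, GDP(美元) = GDP in US dollars, 占世界% = share of world total):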
年份,中国,GDP(美元),占世界%
2019,"14.34万亿 (14,342,902,842,915)",16.3550%
2018,"13.89万亿 (13,894,817,110,036)",16.0900%
2017,"12.31万亿 (12,310,408,652,423)",15.1552%
2016,"11.23万亿 (11,233,277,146,512)",14.7156%
2015,"11.06万亿 (11,061,552,790,044)",14.7098%
2014,"10.48万亿 (10,475,682,846,632)",13.1851%
2013,"9.57万亿 (9,570,405,758,739)",12.3805%
2012,"8.53万亿 (8,532,230,724,141)",11.3542%
2011,"7.55万亿 (7,551,500,425,597)",10.2814%
2010,"6.09万亿 (6,087,164,527,421)",9.2072%
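There are two ways to process the GDP column above: extract the exact figure inside the parentheses, or extract the number in front of 万亿 (trillion). Either can be done with a regular expression or by splitting the string; I use splitting here. For the 占世界% column, the % sign is removed with a regular expression so that only the number remains. The earlier steps are not repeated here; see the link above if you need them.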
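
To make the two extraction options concrete, here is a minimal sketch applied to a single raw value from the table above. The helper names (`gdp_trillions_by_split`, `gdp_exact_by_regex`, `strip_percent`) are made up for this illustration and are not part of the original script.

```python
import re

# One raw GDP cell and one raw share cell, copied from the table above.
raw_gdp = "14.34万亿 (14,342,902,842,915)"
raw_share = "16.3550%"


def gdp_trillions_by_split(text):
    """Return the number in front of 万亿, using only string splitting."""
    head = text.split(" ")[0]   # "14.34万亿"
    return head[:-2]            # drop the two characters "万亿"


def gdp_exact_by_regex(text):
    """Return the exact figure inside the parentheses, without commas."""
    match = re.search(r"\(([\d,]+)\)", text)
    return match.group(1).replace(",", "") if match else None


def strip_percent(text):
    """Remove the percent sign and keep the digits as a string."""
    return re.sub("%", "", text)


print(gdp_trillions_by_split(raw_gdp))  # 14.34
print(gdp_exact_by_regex(raw_gdp))      # 14342902842915
print(strip_percent(raw_share))         # 16.3550
```

All three helpers return strings, so trailing zeros such as 16.3550 survive; convert with `float()` or `int()` only when you need to calculate with the values.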

Data processing

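This step processes the data_header and data_detail values that were returned earlier. It uses split, regular expressions, and the built-in functions reversed and zip; if any of these are unfamiliar, see the official Python documentation.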

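    # Continuation of get_data() from the previous post.
    # Requires `import re` at the top of the file.

    # Collect the years (every third item, starting at index 0)
    year = []
    for i in range(0, len(data_detail), 3):
        year.append(data_detail[i])
    year1 = reversed(year)

    # Collect the GDP values and keep only the number (still a string)
    China_GDP = []
    for i in range(1, len(data_detail), 3):
        a = data_detail[i]

        # To take the exact figure inside the parentheses instead:
        # data = a.split('(')[1].rstrip(')')

        # Take the text before the space, i.e. the number followed by 万亿
        b = a.split(' ')[0]
        # Drop the last two characters (万亿)
        data = b[0: len(b)-2]
        China_GDP.append(data)
    China_GDP1 = reversed(China_GDP)

    # Collect the share of world total and strip the % sign
    China_percentage = []
    for i in range(2, len(data_detail), 3):
        remove2 = '%'
        data2 = re.sub(remove2, '', data_detail[i])
        China_percentage.append(data2)
    China_percentage1 = reversed(China_percentage)

    # zip the three columns back into rows (now oldest year first)
    zipped_GDP = zip(year1, China_GDP1, China_percentage1)
    GDP = list(zipped_GDP)

    # Remove the country name (中国) from the header
    del data_header[1]
    # Rename the GDP column from US dollars to trillions of US dollars
    data_header[1] = 'GDP(万亿美元)'

    # Return the rows and the header
    return (GDP, data_header)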
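
The three loops above all walk the same flat list with a step of 3. As an alternative formulation (not the author's code), the same result can be sketched with slicing and list comprehensions. The `data_detail` list below is a shortened, hand-written stand-in with the shape the loops imply: year, GDP text, share text, repeated, newest year first.

```python
import re

# Assumed shape of data_detail, using three rows from the raw table.
data_detail = [
    "2019", "14.34万亿 (14,342,902,842,915)", "16.3550%",
    "2018", "13.89万亿 (13,894,817,110,036)", "16.0900%",
    "2017", "12.31万亿 (12,310,408,652,423)", "15.1552%",
]

# Every third item starting at offsets 0, 1 and 2 gives one column each.
years = data_detail[0::3]
gdp = [text.split(" ")[0][:-2] for text in data_detail[1::3]]
share = [re.sub("%", "", text) for text in data_detail[2::3]]

# zip rebuilds the rows; [::-1] reverses them so the oldest year comes first.
rows = list(zip(years, gdp, share))[::-1]
print(rows)
```

Reversing once after `zip` is equivalent to reversing each column separately, because all three columns have the same length and order.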

The corresponding file-writing function

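import csv

def data_write(GDP, header):
    # newline='' lets the csv module control line endings itself
    with open('china_data1_1.csv', mode='w', encoding='utf-8', newline='') as f:
        # Build a csv writer on top of the file object
        csv_writer = csv.writer(f)
        # Write the header row
        csv_writer.writerow(header)
        # Write the data rows
        csv_writer.writerows(GDP)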
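
To see what `csv.writer` produces without touching the disk, the sketch below writes to an in-memory `io.StringIO` buffer instead of a file. The sample rows are taken from the tables in this post; the buffer is only a stand-in for the real file.

```python
import csv
import io

header = ["年份", "GDP(万亿美元)", "占世界%"]
rows = [("2010", "6.09", "9.2072"), ("2011", "7.55", "10.2814")]

# newline="" plays the same role as in open(): no newline translation.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)   # one row
writer.writerows(rows)    # many rows at once

# A field that contains commas is quoted automatically.
writer.writerow(["2019", "14.34万亿 (14,342,902,842,915)", "16.3550%"])

csv_text = buffer.getvalue()
print(csv_text)
```

The last row shows why the GDP column in the raw data is wrapped in quotes: the value itself contains commas, so the writer quotes it to keep it in one field. The `newline=''` argument matters because the csv module writes its own `\r\n` line endings; without it, extra blank lines can appear on Windows.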

The data written this way is the cleaned result we wanted:
年份,GDP(万亿美元),占世界%
2010,6.09,9.2072
2011,7.55,10.2814
2012,8.53,11.3542
2013,9.57,12.3805
2014,10.48,13.1851
2015,11.06,14.7098
2016,11.23,14.7156
2017,12.31,15.1552
2018,13.89,16.0900
2019,14.34,16.3550
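
The values in this file are still text. The plotting code later in this post relies on pandas to infer numeric types when it reads the CSV; if you want to work with the numbers using only the standard library, a minimal sketch could look like the following. The English dictionary keys are names chosen for this example.

```python
import csv
import io

# Two rows of the cleaned CSV shown above, inlined so the sketch is self-contained.
csv_text = """年份,GDP(万亿美元),占世界%
2010,6.09,9.2072
2019,14.34,16.3550
"""

records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    records.append({
        "year": int(row["年份"]),
        "gdp_trillion_usd": float(row["GDP(万亿美元)"]),
        "world_share_pct": float(row["占世界%"]),
    })

print(records)
```

With the real file, replace `io.StringIO(csv_text)` with a file object opened with `encoding='utf-8'` and `newline=''`.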

Data visualization

Next, plot the data as a bar chart and a line chart.

1. Bar chart

import matplotlib.pyplot as plt
import pandas as pd

def bar_view():
    # Use a font that can render the Chinese labels correctly
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    with open('china_data1_1.csv', mode='r', encoding='utf-8') as f:
        china_data = pd.read_csv(f)
        xdata = china_data.loc[:, '年份']
        ydata = china_data.loc[:, 'GDP(万亿美元)']
        plt.bar(xdata, ydata)
        # Add a data label above each bar
        # matplotlib.pyplot.text(x, y, s, fontdict=None, **kwargs)
        for x, y in zip(xdata, ydata):
            plt.text(x, y+0.2, y,
                     ha='center', va='bottom',
                     backgroundcolor='green',
                     rotation=30,
                     fontsize=8)
        plt.title('中国2010-2019的GDP')   # "China's GDP, 2010-2019"
        plt.xlabel('年份')                # "Year"
        plt.ylabel('单位/万亿美元')        # "Unit: trillion USD"
        plt.savefig('china_GDP.png')
        plt.show()

2. Line chart

def line_view():
    # Requires matplotlib.pyplot as plt and pandas as pd.
    # Use a font that can render the Chinese labels correctly
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    with open('china_data1_1.csv', mode='r', encoding='utf-8') as f:
        china_data = pd.read_csv(f)
        xdata = china_data.loc[:, '年份']
        ydata = china_data.loc[:, '占世界%']
        plt.plot(xdata, ydata, marker='o',
                 color='blue', linewidth=1)
        # Add a data label at each point
        for x, y in zip(xdata, ydata):
            plt.text(x, y, y,
                     ha='center', va='bottom',
                     rotation=30,
                     fontsize=8)
        plt.title('中国近十年GDP占世界百分比', size=10)  # "China's share of world GDP, last ten years"
        plt.xlabel('年份', size=20)                     # "Year"
        plt.ylabel('占世界%', size=10)                  # "Share of world total (%)"
        plt.savefig('china_percentage.png')
        plt.show()

Main function

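# exception_handling() comes from the previous post; get_data() is the function shown above.
if __name__ == '__main__':
    url = "https://www.kylc.com/stats/global/yearly_per_country/g_gdp/chn.html"
    html = exception_handling(url)
    if html is not None:
        GDP, header = get_data(html)
        data_write(GDP, header)
        bar_view()
        line_view()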
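
The main function depends on `exception_handling(url)` returning `None` when the page cannot be fetched. That function is defined in the previous post and is not shown here, so the following is only a hypothetical sketch of the same contract using the standard library. The name `fetch_html_or_none`, the timeout, and the User-Agent header are assumptions, and the author's version may differ (for example, it may use a different HTTP library).

```python
import urllib.error
import urllib.request


def fetch_html_or_none(url, timeout=10):
    """Return the page text, or None if the request fails for any reason."""
    try:
        # Building the Request inside the try block also catches malformed URLs.
        request = urllib.request.Request(
            url, headers={"User-Agent": "Mozilla/5.0"}
        )
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, ValueError, OSError) as exc:
        print(f"Request failed: {exc}")
        return None


# A malformed URL fails without any network access and yields None.
print(fetch_html_or_none("not-a-url"))
```

Returning `None` instead of raising keeps the calling code down to a single `if html is not None:` check, which is the pattern the main function uses.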