Designing a Crawler for Tabular Web Data on US Stocks

AriesSurfer
License: CC 4.0 BY-SA
Article link: https://blog.youkuaiyun.com/AriesSurfer/article/details/8129246
#!/usr/bin/python
# -*- coding: gbk -*-
# program: spider -- crawl financial data from the web pages of 500 different US stocks.
# Fetches web page content.
# author : Douronggang
# date   : 2012-10-30
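'''
Notes:
1. Variables used in the code below:
   symbol -- the ticker symbol of a US stock, e.g. GOOG for Google.
2. If a financial-data page is found for a stock, its data is saved to the specified path.
3. Data fields:
----------- Financial data -----------------------------------
Income statement: total revenue -- General_income; gross profit -- Total_gross_profit; net profit -- retained_profits
Balance sheet: total current assets -- Total_current_assets; total assets -- Total_assets; total current liabilities -- Total_current_liabilities;
          total liabilities -- Total_liabilites; total shareholders' equity -- Total_Shareholder_equity
Cash flow statement: net profit -- Retained_profits; cash flow from operating activities -- Business_cashflow;
          cash flow from investing activities -- Investment_cashflow; cash flow from financing activities -- Financing_cashflow;
          net change in cash -- Cash_net_zje
----------- Key financial indicators ---------------------------------
Key indicators: net profit margin -- Net_profit_rate; operating profit margin -- Operating_profit_ratio;
          EBIT margin -- EBIT_margin; return on average assets -- ROAA;
          return on average equity -- ROAE; number of employees -- Employee_number

(Amounts are in millions of US dollars.)
'''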
import urllib2
import os
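# Column headers written to the output files (kept in Chinese, as in the source page).
# Text_TAG1: total revenue, gross profit, net profit, total current assets, total assets,
# total current liabilities, total liabilities, total shareholders' equity,
# net profit (from the cash flow statement), cash flow from operating, investing
# and financing activities, net change in cash.
Text_TAG1='''总收入\t毛利总额\t净利润\t流动资产总额\t资产总额\t流动负债总额\t负债总额\t股东权益总额\t净利润2\t经营活动产生的现金流量\t投资活动产生的现金流量\t筹资活动产生的现金流量\t现金净增减额\t'''
# Text_TAG2: net profit margin, operating profit margin, EBIT margin,
# return on average assets, return on average equity, number of employees.
Text_TAG2='''净利润率\t营业利润率\t息税前利润率\t平均资产回报率\t平均股本回报率\t员工人数'''
# Base path for storing the text files (not used below; write_data writes to the working directory)
PATH_HEAD='D:\\金融数据组\\黄冬\\美国新股\\数据爬取'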

class Crawer_data:
    
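    def Get_html(self,url):# Fetch the content of a US stock quote page
    
        try:
            content=urllib2.urlopen(url).read()
            return content
        except Exception:# Any fetch failure is reported and treated as an empty page
            print url,'Error: unable to fetch content.\n'
            return ''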
        
              
    def Analysis_html_1(self,url):# Parse the page and extract each financial-data field
        content=self.Get_html(url)
        if content=='':
            return ''
        if_find=content.find('财务数据')# '财务数据' is the "Financial data" heading on the page
        if if_find!=-1:# Found the start of the financial-data section
            data='财务数据:\n'+Text_TAG1+'\n'
            Tag=content.find('<th class="subtitle">',if_find)
            for j in range(13):# Extract the 2011 financial data
                first_pos=content.find('<td class="first">',Tag)
                if first_pos==-1:# No more cells; stop rather than slice garbage
                    break
                second_pos=first_pos+len('<td class="first">')
                third_pos=content.find('</td>',first_pos)
                temp=content[second_pos:third_pos]
                data=data+temp+'\t'
                Tag=content.find('<th class="subtitle">',third_pos)
            data=data+'\n'
            #print data
            Tag=content.find('<td class="first">',if_find)
            for j in range(13):# Extract the 2010 financial data
                first_pos=content.find('<td class="second">',Tag)
                if first_pos==-1:
                    break
                second_pos=first_pos+len('<td class="second">')
                third_pos=content.find('</td>',first_pos)
                temp=content[second_pos:third_pos]
                data=data+temp+'\t'
                Tag=content.find('<td class="first">',third_pos)
            data=data+'\n'
            #print data
            Tag=content.find('<td class="second">',if_find)
            for j in range(13):# Extract the 2009 financial data
                first_pos=content.find('<td>',Tag)
                if first_pos==-1:
                    break
                second_pos=first_pos+len('<td>')
                third_pos=content.find('</td>',first_pos)
                temp=content[second_pos:third_pos]
                data=data+temp+'\t'
                Tag=content.find('<td class="second">',third_pos)
            #print data
            return data
        return ''# The page has no financial-data section
    
    def Analysis_html_2(self,url): # Extract the key financial indicators
        content=self.Get_html(url)
        if content=='':
            return ''
        if_find=content.find('主要财务指标')# '主要财务指标' is the "Key financial indicators" heading
        if if_find!=-1:
            data='主要财务指标:\n'+Text_TAG2+'\n'
            Tag=content.find('<th class="subtitle">',if_find)
            for j in range(6):# Extract the indicators for Q3 2012
                first_pos=content.find('<td class="first1">',Tag)
                if first_pos==-1:# No more cells; stop rather than slice garbage
                    break
                second_pos=first_pos+len('<td class="first1">')
                third_pos=content.find('</td>',first_pos)
                temp=content[second_pos:third_pos]
                data=data+temp+'\t'
                Tag=content.find('<th class="subtitle">',third_pos)
            data=data+'\n'
            #print data
            Tag=content.find('<td class="first1">',if_find)
            for j in range(6):# Extract the indicators for 2011
                first_pos=content.find('<td class="second1">',Tag)
                if first_pos==-1:
                    break
                second_pos=first_pos+len('<td class="second1">')
                third_pos=content.find('</td>',first_pos)
                temp=content[second_pos:third_pos]
                data=data+temp+'\t'
                Tag=content.find('<td class="first1">',third_pos)
            #print data
            return data
        return ''# The page has no key-indicators section
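    def Get_data(self,url):
        # Combine both sections; a missing section comes back as '' or None,
        # so normalize to '' to avoid writing the literal text 'None'.
        str1=self.Analysis_html_1(url) or ''
        str2=self.Analysis_html_2(url) or ''
        if str1=='' and str2=='':
            return ''
        else:
            total_content=str1+'\n'+str2
            return total_content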
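    def write_data(self,content,symbol):
        # Write the data to <symbol>.txt in the working directory.
        # An existing file is left untouched, and empty content is never written.
        file_name=symbol+'.txt'
        if content!='' and not os.path.exists(file_name):
            fp=open(file_name,'w')
            try:
                fp.write(content)
            finally:
                fp.close()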
    def process_all(self,path):
        # path: text file listing one ticker symbol per line
        url_head='http://stock.finance.sina.com.cn/usstock/quotes/'
        content=open(path,'r')
        try:
            for line in content:
                line=line.strip()# Also drops '\r' and stray spaces
                if line=='':# Skip blank lines
                    continue
                total_url=url_head+line+'.html'
                data=self.Get_data(total_url)
                #print line,"'s data length:",len(data)
                if len(data)!=0:
                    self.write_data(data,line)
                    print line,"'s content:\n",data
                    #print line,"'s data written successfully."
                else:
                    print line," has no relevant data."
        finally:
            content.close()
                
if __name__=='__main__':# Run the crawl only when executed as a script
    demo=Crawer_data()
    demo.process_all('us_stock_list.txt')
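
The loops in Analysis_html_1 and Analysis_html_2 all repeat one step: find a cell marker such as `<td class="first">`, then slice up to the next `</td>`. Below is a minimal sketch of that step as a standalone helper. It is written for Python 3, which is an assumption, since the script above targets Python 2. The name `extract_cells` and the `SAMPLE` markup are hypothetical and invented for illustration; the real page layout may differ.

```python
def extract_cells(html, marker, start=0, limit=None):
    """Collect the text between each `marker` and the next '</td>'.

    Scans forward from `start` and stops after `limit` cells, or as soon
    as a marker or closing tag is missing, so a short table never yields
    a garbage slice.
    """
    cells = []
    pos = start
    while limit is None or len(cells) < limit:
        open_pos = html.find(marker, pos)
        if open_pos == -1:
            break
        value_start = open_pos + len(marker)
        close_pos = html.find('</td>', value_start)
        if close_pos == -1:
            break
        cells.append(html[value_start:close_pos].strip())
        pos = close_pos + len('</td>')
    return cells


# Invented sample markup with placeholder values (not real financial data).
SAMPLE = (
    '<tr><th class="subtitle">row one</th>'
    '<td class="first">111</td><td class="second">222</td><td>333</td></tr>'
    '<tr><th class="subtitle">row two</th>'
    '<td class="first">444</td><td class="second">555</td><td>666</td></tr>'
)

latest = extract_cells(SAMPLE, '<td class="first">')
previous = extract_cells(SAMPLE, '<td class="second">', limit=1)
print('\t'.join(latest))
```

Each column of the table would then need a single call with its own marker, and the explicit `-1` checks replace the fixed `range(13)` and `range(6)` counts with a stop condition driven by the page itself.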