Everyday Scripting with Ruby: Reading Notes (5)

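These notes show how to use Ruby regular expressions to extract specific information from a web page. The worked example pulls the class and name of each navigation list item from the ESPN home page. The notes also cover hashes and string slicing.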


[size=small]---
Scraping Web Pages with Regular Expressions
---[/size]

irb(main):001:0> require 'open-uri'
=> true
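[b][color=indigo]① Once open-uri is loaded, the open method can open URLs as well as files (in Ruby 3.0 and later, URLs need URI.open instead)[/color][/b]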
irb(main):002:0> page = open('http://espn.go.com')
=> #<File:C:/DOCUME~1/GRAY~1.LAN/LOCALS~1/Temp/open-uri.2544.0>
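[b][color=indigo]② With two statements on one line, irb prints only the value of the last one, so the trailing nil keeps the whole page text from being echoed[/color][/b]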
irb(main):003:0> text = page.read; nil
=> nil
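[b][color=indigo]③ The scan method returns every substring that matches the given regular expression; if the pattern has groups, each element is itself an array of the captured groups
④ A question mark after a quantifier makes it non-greedy[/color][/b]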
irb(main):014:0> text.scan(/<li\s+class=\"(lo.*?)\".*><a.+>(.+)<\/a>/)
=> [["lo", "ESPN"], ["lo", "Fantasy"], ["lo", "NFL"], ["lo solid", "MLB"], ["lo solid", "NBA"], ["lo solid", "NHL"], ["lo solid", "ESPNU"], ["lo solid", "College FB"], ["lo solid", "Men's BB"], ["lo solid", "Women's BB"], ["lo solid", "NASCAR"], ["lo solid", "Racing"], ["lo solid", "Golf"], ["lo solid", "Soccer"], ["lo solid", "High School"], ["lo solid", "Tennis"], ["lo solid", "Boxing"], ["lo solid", "More +"]]
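The behavior of scan with groups, and the effect of the non-greedy question mark, can be tried offline. The following is a minimal sketch, not the page's real markup: the HTML fragment and the names html, nav_items, greedy and lazy are made up for illustration, and the pattern is a tightened variant of the one above that uses negated character classes so each match stays inside one tag.

```ruby
# Hypothetical HTML fragment, shaped like the navigation list matched above.
html = <<~HTML
  <li class="lo"><a href="/nfl">NFL</a></li>
  <li class="lo solid"><a href="/mlb">MLB</a></li>
  <li class="hi"><a href="/other">Other</a></li>
HTML

# Two groups: the class attribute (must start with "lo") and the link text.
# [^"]*, [^>]* and [^<]+ stop at the next quote or angle bracket,
# so a match cannot run past the tag it started in.
nav_items = html.scan(/<li\s+class="(lo[^"]*)"[^>]*><a[^>]*>([^<]+)<\/a>/)
# nav_items => [["lo", "NFL"], ["lo solid", "MLB"]]

# Greedy versus non-greedy on the same input:
sample = '<b>one</b><b>two</b>'
greedy = sample.scan(/<b>(.+)<\/b>/)   # => [["one</b><b>two"]]
lazy   = sample.scan(/<b>(.+?)<\/b>/)  # => [["one"], ["two"]]
```

Because the pattern has two groups, each element of nav_items is a two-element array, just like the pairs in the irb output above.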

[b][color=indigo]Hashes[/color][/b]
irb(main):002:0> hash = {}
=> {}
irb(main):003:0> hash[:firstname] = "George"
=> "George"
irb(main):004:0> hash[:lastname] = "Hamilton"
=> "Hamilton"
irb(main):005:0> hash[:firstname]
=> "George"
irb(main):006:0> hash2 = {:firstname => "Tom", :lastname => "Johnson"}
=> {:firstname=>"Tom", :lastname=>"Johnson"}
irb(main):007:0> hash
=> {:firstname=>"George", :lastname=>"Hamilton"}
irb(main):008:0>
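One way the two ideas might fit together is to collect scraped pairs into a hash keyed by class. This is only a sketch under assumptions: the pairs are a shortened copy of the scan output shown earlier, and names_by_class is a hypothetical variable, not something from the book.

```ruby
# Pairs copied from the scan result shown earlier (shortened).
pairs = [["lo", "ESPN"], ["lo", "NFL"], ["lo solid", "MLB"], ["lo solid", "NBA"]]

# Hash with a default block: the first lookup of a missing key
# stores a fresh empty array under that key and returns it.
names_by_class = Hash.new { |hash, key| hash[key] = [] }
pairs.each { |css_class, name| names_by_class[css_class] << name }

names_by_class["lo"]        # => ["ESPN", "NFL"]
names_by_class["lo solid"]  # => ["MLB", "NBA"]
names_by_class.keys         # => ["lo", "lo solid"]
```

The keys here are strings rather than symbols because they come straight from the scraped text.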

[b][color=indigo]String slicing[/color][/b]
irb(main):008:0> "abcdefg"[0..2]
=> "abc"
irb(main):009:0> "abcdefg"[0...2]
=> "ab"
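The difference between the two calls is the range operator: two dots include the end index, three dots exclude it. A short sketch of these and two related forms follows; the variable names are illustrative only.

```ruby
word = "abcdefg"

first_three = word[0..2]    # inclusive range, indexes 0, 1, 2   => "abc"
first_two   = word[0...2]   # exclusive range, indexes 0, 1      => "ab"
last_two    = word[-2..-1]  # negative indexes count from the end => "fg"
by_length   = word[2, 3]    # start index plus length             => "cde"
```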