The steps are simple:
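1) Use Google's inlink search syntax, link:url, together with the num parameter, which sets how many results are shown on one page. From those results, collect the URLs of the external pages and exclude links that come from within the site itself.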
def search_inlinks_from_google(query, result_num)
  escaped_query = CGI.escape("link:#{query}")
  google_url = "http://www.google.com/search?q=#{escaped_query}&num=#{result_num}"
  inlinks = []
  # URI.open needs Ruby 2.5+; older versions call Kernel#open from open-uri instead
  Nokogiri::HTML(URI.open(google_url, :proxy => rand_proxy)).css('h3.r > a.l').each do |node|
    link = node['href']  # a plain String rather than a Nokogiri attribute object
    inlinks << link unless self_inlink?(query, link)
    puts "inlink: #{link}"
  end
  inlinks
end
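To make the query construction and the self-link filter concrete without touching the network, here is a small offline sketch. The helper names `build_google_url` and `same_site?` are made up for illustration and are not part of the script. `same_site?` mirrors the script's heuristic of dropping the first host label before comparing, which assumes both hosts carry a subdomain such as `www`.

```ruby
require 'cgi'
require 'uri'

# Hypothetical helper: builds the search URL the same way the script does.
def build_google_url(site, result_num)
  "http://www.google.com/search?q=#{CGI.escape("link:#{site}")}&num=#{result_num}"
end

# Hypothetical helper: true when both URLs share a domain once the first
# host label (for example "www" or "fuliang") is dropped.
def same_site?(a, b)
  strip_first_label = lambda { |url| URI.parse(url).host.sub(/^(.*?)\./, "") }
  strip_first_label.call(a) == strip_first_label.call(b)
end

puts build_google_url("http://www.iteye.com", 50)
puts same_site?("http://www.iteye.com", "http://fuliang.iteye.com/blog/1")
puts same_site?("http://www.iteye.com", "http://www.example.org/links")
```

The first comparison is treated as an internal link and would be skipped; the second is an external inlink and would be kept.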
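2) Fetch the page behind each URL in turn, find the links to the target site in the fetched page, record their anchor text, and count how often each text occurs.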
def find_site_names_by_inlinks(inlinks, site_query)
  name_count_map = Hash.new(0)
  inlinks.each do |link|
    puts "searching from #{link}..."
    Nokogiri::HTML(URI.open(link, fetch_options)).xpath("//a").each do |link_node|
      # substring match; interpolating the URL into a regex would treat "." as a wildcard
      name_count_map[link_node.text.strip.downcase] += 1 if link_node['href'].to_s.include?(site_query)
    end
  end
  p name_count_map
  # sort by count, descending, and keep only the names
  name_count_map.sort { |a, b| b[1] <=> a[1] }.collect { |name, count| name }
end
3) Sort by number of occurrences. The entries near the top are, for the most part, the name of the site.
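The counting and ranking in steps 2 and 3 can be tried on its own, without fetching anything. The sketch below works on a hand-written list of `[href, anchor text]` pairs. The sample data and the name `rank_anchor_texts` are assumptions for illustration only.

```ruby
# Hypothetical sample data: [href, anchor text] pairs as they might be
# extracted from the fetched pages.
anchors = [
  ["http://www.iteye.com", "JavaEye"],
  ["http://www.iteye.com/news", "  javaeye "],
  ["http://www.example.org", "Elsewhere"],
  ["http://www.iteye.com", "JavaEye Community"]
]

# Count normalized anchor texts of links pointing at site_query,
# then return the texts ordered by descending count.
def rank_anchor_texts(anchors, site_query)
  counts = Hash.new(0)
  anchors.each do |href, text|
    next unless href.include?(site_query)
    counts[text.strip.downcase] += 1
  end
  counts.sort_by { |_, count| -count }.map { |name, _| name }
end

p rank_anchor_texts(anchors, "http://www.iteye.com")
```

Stripping whitespace and lowercasing before counting is what lets "JavaEye" and "  javaeye " land in the same bucket.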
We use several proxies and pick one at random for each request, so that we do not get blocked.
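A minimal sketch of the proxy rotation, assuming a list of placeholder proxy hosts (the `proxyN.example` names and both helper names are hypothetical). One detail matters when the options hash is handed to open-uri: symbol keys such as `:proxy` are treated as options, while string keys are sent as HTTP header fields.

```ruby
# Hypothetical proxy list; the hosts are placeholders.
PROXIES = (1..3).map { |i| "http://proxy#{i}.example:1080" }

# Pick one proxy at random for each request.
def random_proxy(proxies)
  proxies.sample
end

# Build an options hash in the shape open-uri expects:
# string keys become request headers, symbol keys are options.
def build_fetch_options(proxies, user_agent)
  {
    "User-Agent" => user_agent,
    :proxy => random_proxy(proxies)
  }
end

p build_fetch_options(PROXIES, "Mozilla/5.0")
```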
The complete code:
#!/usr/bin/ruby
# author fuliang http://fuliang.iteye.com
require 'nokogiri'
require 'open-uri'
require 'net/http'
require 'cgi'

class SiteNameSearchAgent
  public

  def initialize
    @proxies = 1.upto(6).collect { |index| "http://l-crwl#{index}:1080" }
  end

  def get_site_names(site_query, max_result_num = 20)
    inlinks = search_inlinks_from_google(site_query, max_result_num)
    find_site_names_by_inlinks(inlinks, site_query)
  end

  private

  def search_inlinks_from_google(query, result_num)
    escaped_query = CGI.escape("link:#{query}")
    google_url = "http://www.google.com/search?q=#{escaped_query}&num=#{result_num}"
    inlinks = []
    # URI.open needs Ruby 2.5+; older versions call Kernel#open from open-uri instead
    Nokogiri::HTML(URI.open(google_url, :proxy => rand_proxy)).css('h3.r > a.l').each do |node|
      link = node['href']  # a plain String rather than a Nokogiri attribute object
      inlinks << link unless self_inlink?(query, link)
      puts "inlink: #{link}"
    end
    inlinks
  end

  # Compare the two hosts after dropping the first label (e.g. "www")
  def self_inlink?(query, link)
    query_domain, link_domain = [query, link].collect { |url| URI.parse(url).host.sub(/^(.*?)\./, "") }
    query_domain == link_domain
  end

  def find_site_names_by_inlinks(inlinks, site_query)
    name_count_map = Hash.new(0)
    inlinks.each do |link|
      puts "searching from #{link}..."
      Nokogiri::HTML(URI.open(link, fetch_options)).xpath("//a").each do |link_node|
        # substring match; interpolating the URL into a regex would treat "." as a wildcard
        name_count_map[link_node.text.strip.downcase] += 1 if link_node['href'].to_s.include?(site_query)
      end
    end
    p name_count_map
    # sort by count, descending, and keep only the names
    name_count_map.sort { |a, b| b[1] <=> a[1] }.collect { |name, count| name }
  end

  def rand_proxy
    @proxies.sample  # pick one proxy at random
  end

  def fetch_options
    user_agent = "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.2) Gecko/20061201 Firefox/2.0.0.2 (Ubuntu-feisty)"
    {
      "User-Agent" => user_agent,  # string keys are sent as HTTP headers
      :proxy => rand_proxy         # open-uri options must be symbol keys
    }
  end
end

search_agent = SiteNameSearchAgent.new
p search_agent.get_site_names("http://www.iteye.com", 50)
Result:
[quote]
{"javaeye/ago123456/123456"=>3, "05.javaeye技术网站"=>1, "javaeye"=>12, "http://www.iteye.com"=>2, "go"=>1, "www.iteye.com"=>1, "javaeye技术社区"=>4, "javaeye社区"=>1, "上海乐福狗信息技术有限公司:诚聘技术经理和开发工程师"=>1, "对require的疑惑--ruby -javaeye做最棒的软件开发交流社区"=>1, ""=>2, "访问此网站"=>1, "google reader"=>1, "javascriptå\u009Fºç¡\u0080ç\u009F¥è¯\u0086大é\u009B\u0086é\u0094¦(1)"=>1, "webå\u0089\u008D端æ\u008A\u0080æ\u009C¯è®ºå\u009D\u009Bæ\u009C\u0080æ\u0096°è®¨è®º - javaeye"=>1}
["javaeye", "javaeye技术社区", "javaeye/ago123456/123456", "http://www.iteye.com", "", "javaeye社区", "go", "www.iteye.com", "上海乐福狗信息技术有限公司:诚聘技术经理和开发工程师", "对require的疑惑--ruby -javaeye做最棒的软件开发交流社区", "05.javaeye技术网站", "访问此网站", "google reader", "javascriptå\u009Fºç¡\u0080ç\u009F¥è¯\u0086大é\u009B\u0086é\u0094¦(1)", "webå\u0089\u008D端æ\u008A\u0080æ\u009C¯è®ºå\u009D\u009Bæ\u009C\u0080æ\u0096°è®¨è®º - javaeye"]
[/quote]
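The last two entries in the output look like UTF-8 text that was decoded as Latin-1 somewhere along the way, which usually means the encoding of the fetched page was not detected correctly. Assuming that diagnosis is right, a string in that state can be recovered by turning it back into bytes and reading those bytes as UTF-8. The helper name `repair_latin1_mojibake` is hypothetical, and the sample string is a shortened, fully garbled variant of one of the entries.

```ruby
# Assumption: the text holds UTF-8 bytes that were misread as ISO-8859-1.
# Returns the input unchanged when that assumption does not hold.
def repair_latin1_mojibake(text)
  repaired = text.encode("ISO-8859-1").force_encoding("UTF-8")
  repaired.valid_encoding? ? repaired : text
rescue EncodingError
  # Characters outside Latin-1 mean the text was not garbled this way.
  text
end

garbled = "javascript\u00E5\u009F\u00BA\u00E7\u00A1\u0080"
puts repair_latin1_mojibake(garbled)
puts repair_latin1_mojibake("javaeye")
```

Plain ASCII anchor text passes through untouched, so the helper could be applied to every key before counting.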
The nokogiri gem needs to be installed. On Ubuntu:
[quote]
$ sudo apt-get install libxml2 libxml2-dev libxslt libxslt-dev
$ gem install nokogiri
[/quote]