Creating a real-time search engine with IndexTank and Heroku


Tags:

#search #url #api #rubygems #query #stream

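This article walks you through building a real-time photo search engine with existing cloud services: choosing a data source, parsing the data stream, setting up a Rails app, associating it with a Heroku app, and configuring an IndexTank index. By following the tutorial you can get real-time photo search working with little effort.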

In 1998 I created my first search engine. It was very simple: it would crawl sites for mp3 files and generate a search index twice every hour. It took me about six weeks to develop, and I also had to buy a Pentium server to host it. I paid $125/month to have it co-located with an ISP.

Today you could do the same thing for free and in a fraction of the time using existing cloud services. More importantly, you can get it started for free and only spend money to grow the service once it gains traction. I will show you how to build a real-time search engine using IndexTank as the back-end and Heroku as your application host.

The only requirement for this tutorial is to have a Heroku account with the capability of using add-ons (i.e. validated with your credit card). Of course, knowing Ruby and Rails will be helpful to understand what’s going on. Let’s do it!

First off, let’s choose something to search. Presumably we are interested in a real-time stream such as Twitter updates or blog content. Any text stream with a public API will do. For this example I chose Plixi, a social photo-sharing application. My idea was to search the text associated with the pictures that people post to the site.

You can read all about the Plixi API here. For this tutorial we are interested in the JSON version of the real-time photo stream:

http://api.plixi.com/api/tpapi.svc/json/photos?getuser=true

Here’s a snippet of code to parse that stream and extract some useful fields. Try it out (make sure you have ‘json’ installed in your gems):

fetcher.rb

 
  
 require 'rubygems'  
 require 'json'  
 require 'net/http'  
   
 plixi_url='http://api.plixi.com/api/tpapi.svc/json/photos?getuser=true'  
 photos = JSON.parse(Net::HTTP.get_response(URI.parse(plixi_url)).body)  
 count, list = photos['Count'], photos['List']  
 list.each_with_index do |p, i|  
     u = p['User']  
     #only want photos that come with some text  
     if p.has_key?('Message')  
         id = p['GdAlias']  
         text = p['Message']  
         timestamp = Integer(p['UploadDate'])  
         screen_name = u['ScreenName']  
         thumbnail_url = p['ThumbnailUrl']  
         printf "%s,%s,%s\n", id, screen_name, text  
     end  
 end  
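
The filtering step in fetcher.rb can also be pulled out into a small function, which makes it easy to try without calling the Plixi API. The sketch below is only an illustration: the extract_photos name and the sample payload are made up for this example, and the payload merely mimics the field names used above (List, Message, GdAlias, UploadDate, User, ScreenName, ThumbnailUrl).

```ruby
require 'json'

# Hypothetical helper (not part of the original fetcher): turns the parsed
# photo stream into an array of plain hashes, skipping photos without text.
def extract_photos(stream)
  (stream['List'] || []).each_with_object([]) do |p, out|
    next unless p.key?('Message')   # only want photos that come with some text
    user = p['User'] || {}
    out << {
      :plixi_id      => p['GdAlias'],
      :text          => p['Message'],
      :timestamp     => Integer(p['UploadDate']),
      :screen_name   => user['ScreenName'],
      :thumbnail_url => p['ThumbnailUrl']
    }
  end
end

# Made-up sample payload that mimics the shape of the real stream.
SAMPLE_JSON = <<-JSON
{
  "Count": 2,
  "List": [
    { "GdAlias": "abc123", "Message": "sunset at the beach",
      "UploadDate": "1290000000",
      "ThumbnailUrl": "http://example.com/t/abc123.jpg",
      "User": { "ScreenName": "alice" } },
    { "GdAlias": "def456", "UploadDate": "1290000100",
      "ThumbnailUrl": "http://example.com/t/def456.jpg",
      "User": { "ScreenName": "bob" } }
  ]
}
JSON

photos = extract_photos(JSON.parse(SAMPLE_JSON))
photos.each { |doc| puts [doc[:plixi_id], doc[:screen_name], doc[:text]].join(',') }
```

Only the first sample photo carries a Message, so only that one is kept. The hashes use the same field names that are sent to the index later in this tutorial.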

So we have an interesting data stream. How do we index it and search it? First, let’s create a new Rails app and associate it with a Heroku app. We are using Rails 3 here; the commands differ for older versions. Make sure the heroku gem is installed (if it is not: $ sudo gem install heroku).

$ rails new plixidemo
$ cd plixidemo

Associate this Rails app with a Heroku app:

$ git init
$ heroku create

Request the IndexTank add-on for this app and download the IndexTank gem for local testing:

$ heroku addons:add indextank:trial
$ gem install indextank

We need to add the following lines to plixidemo/Gemfile so that our app can use the IndexTank client:

gem 'indextank', '1.0.10'
gem 'json_pure', '1.4.6', :require => 'json'

Let’s create two models under app/models. This will be the heart of our application. First, a photo:

app/models/photo.rb

 
  
 class Photo  
   def initialize(data)  
     @data = data  
   end  
   
   def id  
     @data['Id']  
   end  
   
   def screen_name  
     self.user['ScreenName']  
   end  
   
   def to_document  
     {  
       :plixi_id      => self.id,  
       :text          => self.message,  
       :timestamp     => self.upload_date.to_i,  
       :screen_name   => self.screen_name,  
       :thumbnail_url => self.thumbnail_url  
     }  
   end  
   
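   def method_missing(name, *args, &block)
     # Map snake_case calls (e.g. upload_date) to the CamelCase keys of the Plixi JSON.
     # camelize is used because classify would also singularize the name.
     key = name.to_s.camelize
     @data.key?(key) ? @data[key] : super
   end

   # Keep respond_to? consistent with method_missing.
   def respond_to_missing?(name, include_private = false)
     @data.key?(name.to_s.camelize) || super
   end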
 end  
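
To see what to_document produces, the model can be exercised from a plain Ruby script. This is only a sketch: PhotoSketch is a trimmed stand-in for app/models/photo.rb that swaps ActiveSupport's string inflection for a small hand-written camelize_key helper (a hypothetical name) so that it runs without Rails. The sample hash is invented data that reuses the field names from the fetcher.

```ruby
# Stand-in for app/models/photo.rb that runs without Rails/ActiveSupport.
class PhotoSketch
  def initialize(data)
    @data = data
  end

  def id
    @data['Id']
  end

  def screen_name
    user['ScreenName']
  end

  def to_document
    {
      :plixi_id      => id,
      :text          => message,
      :timestamp     => upload_date.to_i,
      :screen_name   => screen_name,
      :thumbnail_url => thumbnail_url
    }
  end

  # Hypothetical helper: "upload_date" -> "UploadDate".
  def camelize_key(name)
    name.to_s.split('_').map { |part| part.capitalize }.join
  end

  # Unknown methods are looked up as CamelCase keys in the raw data.
  def method_missing(name, *args, &block)
    key = camelize_key(name)
    @data.key?(key) ? @data[key] : super
  end

  def respond_to_missing?(name, include_private = false)
    @data.key?(camelize_key(name)) || super
  end
end

# Invented sample record using the field names seen in the stream.
SAMPLE_PHOTO = {
  'Id'           => 42,
  'Message'      => 'sunset at the beach',
  'UploadDate'   => '1290000000',
  'ThumbnailUrl' => 'http://example.com/t/42.jpg',
  'User'         => { 'ScreenName' => 'alice' }
}

doc = PhotoSketch.new(SAMPLE_PHOTO).to_document
puts doc.inspect
```

Calls such as message or thumbnail_url are not defined on the class. They fall through to method_missing, which is what keeps the model this short.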

Second, a searcher that knows how to communicate with the IndexTank API:

app/models/photo_searcher.rb

 
  
 require 'open-uri'  
   
 class PhotoSearcher  
   def self.index
     # Memoize the client and the index so they are built once, not on every search
     @api   ||= IndexTank::Client.new(ENV['INDEXTANK_API_URL'] || 'http://your_api_url')
     @index ||= @api.indexes('idx')
   end
   
   # retrieve photos from IndexTank  
   def self.search(query)  
     index.search(query, :fetch=>'text,thumbnail_url,screen_name,plixi_id,timestamp')  
   end  
   
 end  
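
While working on the views it can be handy to run searches without hitting IndexTank at all. The sketch below is a hypothetical in-memory stand-in (FakeIndex is not part of the indextank gem). It assumes only what the rest of this tutorial relies on: search(query, options) returns a hash with a 'matches' count and a 'results' array of field hashes. Its naive substring match is nothing like IndexTank's real query handling.

```ruby
# Hypothetical in-memory stand-in for an IndexTank index, for local testing only.
class FakeIndex
  def initialize
    @docs = {}
  end

  # Store a document (a hash of fields) under a document id.
  def add(docid, fields)
    @docs[docid.to_s] = fields
  end

  # Naive case-insensitive substring match on the :text field.
  # Returns the overall shape that the search view in this tutorial reads.
  def search(query, options = {})
    wanted = (options[:fetch] || '').split(',')
    needle = query.to_s.downcase
    hits = @docs.values.select { |fields| fields[:text].to_s.downcase.include?(needle) }
    results = hits.map do |fields|
      row = {}
      wanted.each { |name| row[name] = fields[name.to_sym] }
      row
    end
    { 'matches' => results.length, 'results' => results }
  end
end

index = FakeIndex.new
index.add(0, :text => 'sunset at the beach', :screen_name => 'alice')
index.add(1, :text => 'my lunch', :screen_name => 'bob')

found = index.search('SUNSET', :fetch => 'text,screen_name')
puts found['matches']
found['results'].each { |doc| puts "#{doc['screen_name']}: #{doc['text']}" }
```

One possible use is to return a FakeIndex from PhotoSearcher.index in the test environment, so that controller and view tests do not need network access.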

Note: ‘your_api_url’ can be found on heroku.com:

My Apps -> [your app] -> Add-ons -> IndexTank Search

or, from the command line:

$ heroku config --long|grep INDEXTANK

And one controller, app/controllers/photos_controller.rb

 
  
 class PhotosController < ApplicationController  
   def index  
     @docs = PhotoSearcher.search(params[:query]) if params[:query].present?  
   end  
 end  

Of course, we need a view for our search page: app/views/photos/index.html.erb

 
  
 <%= form_tag photos_path, :method => :get do %>  
   <%= text_field_tag :query %>  
   <button type="submit">Search</button>  
 <% end %>  
   
 <% if @docs %>  
   <p id="result-count">Your search for "<%= params[:query] %>" returned <%= pluralize @docs['matches'], 'result' %></p>  
   
   <ul id="results">  
     <% @docs['results'].each do |doc| %>  
       <li>  
         <%= link_to "http://plixi.com/p/#{doc['plixi_id']}" do %>  
           <%= image_tag doc['thumbnail_url'] %>  
           <%= doc['screen_name'] %> -  
           <%= time_ago_in_words Time.at(doc['timestamp'].to_i) %> ago  
   
           <%= simple_format doc['text'] %>  
         <% end %>  
       </li>  
     <% end %>  
   </ul>  
 <% end %>  

And add the following to your config/routes.rb:

 
  
 resources :photos, :only => [:index]  
 root :to => 'photos#index'  

Remember to remove public/index.html, then launch your app locally (rails server). If you point your browser to http://localhost:3000, you should see a search box. Your index is empty, so all queries should return zero results.

It’s time to index some documents. Let’s go back to fetcher.rb from the beginning of the tutorial and add the following lines:

After rubygems:

 
 require 'indextank'
 

After the plixi_url line:

 
  
 api = IndexTank::Client.new(ENV['INDEXTANK_API_URL'] || '<API_URL>')  
 index = api.indexes 'idx'  

[note: you can find out your <API_URL> by selecting IndexTank Search from the add-ons pull-down menu for your Heroku app]

And the most important line of all, after the printf statement:

 
  
 index.document(i.to_s).add({:plixi_id => id, :text => text, :timestamp => timestamp, :screen_name => screen_name, :thumbnail_url => thumbnail_url})  

Now run fetcher.rb, and all the output you see will be indexed. Go back to your app and search for anything you just saw: it will show up in the search results right away!

We are ready to upload this to Heroku.

$ git add .
$ git commit -a
$ git push heroku master

You now have a real-time photo search engine running at [your_app].heroku.com! Next steps:

Change the fetcher to keep updating your index (remember that you can index up to 5000 documents). Be polite and do not hit Plixi more than a few times per minute!
Learn how to use our auto-complete service and build a pretty web app.
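
The first of these steps could look roughly like the loop below. It is a sketch under several assumptions. fetch_batch and indexer are hypothetical callables that you would wire to the Plixi request and to index.document(id).add(...) from fetcher.rb. The photo's own id is used as the document id, so that a new poll adds to the index instead of overwriting the documents stored under the positional ids 0, 1, 2, ... used in fetcher.rb. The 20-second pause is an illustrative value, and the cap reflects the 5000-document limit mentioned above.

```ruby
require 'set'

MAX_DOCS      = 5000  # document limit mentioned above
POLL_INTERVAL = 20    # seconds between requests (illustrative: 3 per minute)

# Hypothetical polling loop.
#   fetch_batch: callable returning an array of document hashes (with :plixi_id)
#   indexer:     callable taking (docid, document) and indexing it
#   rounds:      number of polls to run
#   pause:       callable used to wait between polls (swappable in tests)
def poll_stream(fetch_batch, indexer, rounds, pause = lambda { sleep POLL_INTERVAL })
  seen = Set.new
  rounds.times do |round|
    fetch_batch.call.each do |doc|
      docid = doc[:plixi_id].to_s
      next if seen.include?(docid)            # already indexed in an earlier poll
      return seen.size if seen.size >= MAX_DOCS
      indexer.call(docid, doc)
      seen << docid
    end
    pause.call unless round == rounds - 1     # no need to wait after the last poll
  end
  seen.size
end

# Demo with canned batches instead of real HTTP calls.
batches = [
  [{ :plixi_id => 'a1', :text => 'first' }, { :plixi_id => 'b2', :text => 'second' }],
  [{ :plixi_id => 'b2', :text => 'second' }, { :plixi_id => 'c3', :text => 'third' }]
]
fetch   = lambda { batches.shift || [] }
indexer = lambda { |docid, doc| puts "indexing #{docid}: #{doc[:text]}" }

total = poll_stream(fetch, indexer, 2, lambda { })  # no real sleeping in the demo
puts "#{total} documents indexed"
```

The demo indexes three documents, because b2 shows up in both batches and is skipped the second time. The set of seen ids lives only in memory, so a long-running fetcher would need to persist it or otherwise cope with restarts.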

Check out our demo app here:

http://plixitank.heroku.com/