WebDB and DistributedWebDB -- from wiki.apache.org/Nutch


WebDB

Michael Cafarella, February 7, 2004

The WebDB maintains all of Nutch's information about the web. It keeps information on all known pages and the links that connect them. Anytime a Nutch tool needs information about the entire web (such as during fetchlist generation, or web graph link analysis), that tool needs to use the WebDB.

The WebDB is the only Nutch data structure that is designed to be frequently updated and maintained, rather than rebuilt from scratch with every new index cycle. Clearly, the WebDB must also scale efficiently with the size of the document set. These two facts have made the WebDB more difficult to understand than the typical Nutch component.

This document describes the WebDB API, its implementation, and how the WebDB has been extended for distributed operation.

The IWebDBReader interface is used by any tool that needs only to read from the db. Here are the methods of IWebDBReader, with some commentary.


public Page getPage(String url) throws IOException;

The getPage() function searches by URL for a web page object in the WebDB. All web pages are guaranteed to be unique by URL, so this method can return at most one object. This method is useful for a number of tools.
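As a quick illustration, here is a minimal sketch of a URL lookup. Only the IWebDBReader methods documented here are taken as given; the WebDBReader constructor and the null-on-miss behavior are assumptions.

import java.io.File;
import java.io.IOException;

public class PageLookup {
    public static void main(String[] args) throws IOException {
        // Assumption: a reader implementation is opened over a local db directory.
        IWebDBReader reader = new WebDBReader(new File("db"));
        // URLs are unique keys, so at most one Page can come back.
        Page page = reader.getPage("http://www.example.com/");
        System.out.println(page != null ? "found: " + page : "no such page");  // assumes null on miss
    }
}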


public Page[] getPages(MD5Hash md5) throws IOException;

This getPages() function can return an arbitrary number of web Page objects. This method searches by an MD5 hash, computed over the contents of the page's HTML. Although web pages are always unique by URL, they may not be unique by MD5. Many web pages are simply copies; that is, the same content can be found at many different URLs.


public boolean pageExists(MD5Hash md5) throws IOException;

This method simply checks whether the WebDB contains a page with the given content checksum. If the WebDB contains one or more such pages, it returns true.


public Enumeration pages() throws IOException;

The function pages() will return an Enumeration, allowing the caller to efficiently iterate over every Page in the database. The results are sorted alphanumerically by URL.


public Enumeration pagesByMD5() throws IOException;

The function pagesByMD5() does exactly the same thing as pages(), except the results are sorted by the MD5 checksum. That means that all pages with identical content will appear consecutively in the Enumeration.
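Because identical content sorts together, a single pass over pagesByMD5() is enough to spot duplicates. A minimal sketch, assuming Page exposes getMD5() and getURL() accessors (which are not part of the interface shown above):

import java.io.IOException;
import java.util.Enumeration;

public class DuplicateScan {
    public static void scan(IWebDBReader reader) throws IOException {
        Enumeration pages = reader.pagesByMD5();
        Page prev = null;
        while (pages.hasMoreElements()) {
            Page cur = (Page) pages.nextElement();  // pre-generics Enumeration
            // Identical checksums are adjacent, so comparing neighbors suffices.
            if (prev != null && cur.getMD5().equals(prev.getMD5()))
                System.out.println("duplicate content at: " + cur.getURL());
            prev = cur;
        }
    }
}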


public Link[] getLinks(UTF8 url) throws IOException;

This getLinks() function returns all the WebDB's hyperlink objects that point to the given URL. Some pages have a very large number of incoming links, so it may not be wise to return all of these objects as an array. We will probably want to switch this to an Enumeration at some point in the future, so that all such Links need not be in memory at once.


public Link[] getLinks(MD5Hash md5) throws IOException;

This getLinks() function will return all the links that originate from the content identified by the given MD5 checksum.

Nutch can place a limit on the maximum number of outlinks a given webpage can have, so it's fine for this method to return an allocated array of Links.

Note that Link objects are considered to originate in some MD5-identified content, and point at a certain URL. That means that it is not quite right to say that a Nutch Link "links two web pages together." The source content might be duplicated under many different URLs.
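The two getLinks() overloads therefore answer different questions: inlinks are keyed by a page's URL, outlinks by the MD5 of its content. A minimal sketch, assuming UTF8 has a String constructor:

import java.io.IOException;

public class LinkNeighborhood {
    public static void show(IWebDBReader reader, String url, MD5Hash md5)
            throws IOException {
        Link[] inlinks = reader.getLinks(new UTF8(url));  // links pointing at this URL
        Link[] outlinks = reader.getLinks(md5);           // links emitted by this content
        System.out.println(inlinks.length + " inlinks, " + outlinks.length + " outlinks");
    }
}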

 

DistributedWebDB

More About WebDB

Michael Cafarella, February 15, 2004

This is the second of two documents that describe the Nutch WebDB. Part one covered the single-processor WebDB; we now turn to the distributed version.

We would like the distributed WebDB to be updated across multiple disks and machines simultaneously. The completed WebDB can exist across many disks, but we assume a single machine can mount and read all of them. The distributed WebDB exists so we can distribute the burdens of update over many machines.

The machines in a distributed WebDBWriter installation are connected either by NFS or by ssh/scp access to one another. (Our NFS implementation is a little more advanced right now.)

The distributed WebDBReader just loads in files from multiple locations. We aren't too worried about distributed reading; it can be implemented by WebDBReader clients instead of in the db itself.

Multiple WebDBWriter processes will each receive separate, possibly overlapping, sets of WebDBWriter API calls to apply to the db. It's OK for the WebDB to know in advance how many WebDBWriter processes will be involved in any update operation. All of these processes run concurrently, and the db update is not finished until every one of them completes.

Since we know the number of writers ahead of time, it's easy to partition the WebDB keyspace into k regions, one for each WebDBWriter process. MD5 keys can be partitioned evenly according to their first log2(k) bits. URL keys are divided evenly by a set of good breakpoint URLs derived from the data.
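A minimal sketch of the MD5 partitioning, assuming k is a power of two no larger than 256 and that the raw digest bytes are available; only the leading-bits idea comes from the text:

public class KeyPartitioner {
    private final int bits;  // log2(k) leading bits decide the partition

    public KeyPartitioner(int k) {
        this.bits = Integer.numberOfTrailingZeros(k);  // k must be a power of two
    }

    // Map an MD5 digest to a partition number in [0, k).
    public int partitionOf(byte[] digest) {
        int topByte = digest[0] & 0xFF;   // first byte of the 16-byte digest
        return topByte >>> (8 - bits);    // keep only the top log2(k) bits
    }
}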

It would be great if each WebDBWriter process could simply work on just its own part of the db, wholly ignorant of the other processes. Unfortunately, that's not possible. A WebDBWriter can receive an API call that might touch a record in another Writer's keyspace partition. Even worse, remember that every stage of WebDBWriter execution can emit edits to be processed in later stages. Each of these edits might involve a record in a different Writer's keyspace partition.

Take a moment to recall how the single-processor WebDBWriter applies changes to the WebDB:

  1. Every API call that modifies the db results in writing an "edit" to disk. No edits are applied until the WebDBWriter is closed. An edit might apply to one or more of the WebDB's tables: PagesByURL, PagesByMD5, LinksByURL, LinksByMD5.

  2. Upon close, sort the PagesByURL edits. Now read in two sorted streams: the sorted edits and the preexisting PagesByURL database. Emit a single stream: the new, sorted, and correct PagesByURL db. As a side effect, also emit edits that might apply to other tables.

  3. Sort the PagesByMD5 edits. Merge with preexisting PagesByMD5 table, emitting the brand-new PagesByMD5 table. Also emit edits to LinksByMD5 or LinksByURL.

  4. Apply same process to LinksByMD5. Emit any edits to LinksByURL.

  5. Apply same process to LinksByURL. No more edits will be emitted. (Actually, this isn't strictly true. In certain circumstances we may need to re-emit the LinksByMD5 edits, and then redo step 4. Don't worry about this for now.)
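The heart of steps 2 through 5 is an ordinary two-way merge of sorted streams. A minimal sketch of that merge, with records and edits reduced to plain sorted string keys for illustration (Nutch's real streams carry Page and Link records, plus the side-effect edits):

import java.util.ArrayList;
import java.util.List;

public class SortedMerge {
    // Merge a sorted table with sorted edits; an edit wins on a key collision.
    public static List<String> merge(List<String> table, List<String> edits) {
        List<String> out = new ArrayList<String>();
        int i = 0, j = 0;
        while (i < table.size() && j < edits.size()) {
            int cmp = table.get(i).compareTo(edits.get(j));
            if (cmp < 0) out.add(table.get(i++));       // untouched record passes through
            else if (cmp > 0) out.add(edits.get(j++));  // brand-new record inserted
            else { out.add(edits.get(j++)); i++; }      // edit replaces existing record
        }
        while (i < table.size()) out.add(table.get(i++));
        while (j < edits.size()) out.add(edits.get(j++));
        return out;
    }
}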

So a distributed WebDBWriter cannot labor in blissful ignorance. It needs to communicate with other writers, particularly when writing out edits to be performed. We modify the above single-processor algorithm to account for this inter-writer protocol.

The Distributed DB system works as follows:

  1. As before, every API call that a single WebDBWriter receives results in writing an "edit" to disk. However, there are now k different disk locations, corresponding to each of the k partitions of the web db. Each of the k WebDBWriters is writing to k different files, so there are k^2 edits files all told (see the sketch after this list). Say that a WebDBWriter has finished writing out all its edits. The WebDBWriter now communicates each partition's edits file to the corresponding WebDBWriter process. (That communication happens via our distributed filesystem, NutchFS, but that's not critical to the algorithm here.)

  2. Having been closed, each WebDBWriter should apply all relevant edits to its partition of the db. At this point, the Writer waits for each of (k-1) edits files to arrive from other machines. (Of course, one of the k partitions was generated locally.) After files for all k partitions are available, the WebDBWriter sorts the edits. The sorted edit stream is applied to the local sorted db segment, resulting in a brand-new db segment for the current partition. This process is carried out by all k WebDBWriters. The PagesByURL table is now correct. The data exists across k separate db segments, and the work was distributed across k machines. But as before, applying the PagesByURL edits does not just generate a brand-new PagesByURL table; it also generates a new set of edits, which may be applicable to any record in the entire db. Thus, as each PagesByURL table segment is being generated, the next set of edits is generated. When each PagesByURL table segment is complete, that Writer's emitted PagesByMD5 edits are complete as well, and so are sent to each of the (k-1) other Writers.

  3. Each of the k Writers waits for its incoming set of (k-1) PagesByMD5 edits files. When they arrive, a Writer sorts the full set of k files and applies the sorted edits. It thus generates a new PagesByMD5 table segment and a full set of k LinksByMD5 edits files.

  4. The same process happens across k processes for the LinksByMD5 table. Emit any edits to k edits files for LinksByURL.

  5. The same process happens across k processes for the LinksByURL table. However, this time no more edits are generated.

  6. The job is now done. The last step is for one of the WebDBWriters to total up the full size of the new WebDB by looking at each of the k brand-new segments. When this last WebDBWriter is finished, the entire update is complete.
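Here is a minimal sketch of the k^2 edits-file exchange from step 1. The file-naming scheme is hypothetical; the point is that writer i emits one edits file per partition and then gathers partition i's file from every writer, including itself:

public class EditExchange {
    // Hypothetical name for the edits file written by `from` for partition `to`.
    static String editsFile(String table, int from, int to) {
        return table + ".edits." + from + "-" + to;
    }

    public static void main(String[] args) {
        int k = 4;   // number of WebDBWriter processes
        int me = 2;  // this writer's partition
        for (int to = 0; to < k; to++)
            System.out.println("emit:  " + editsFile("pagesByURL", me, to));
        for (int from = 0; from < k; from++)
            System.out.println("await: " + editsFile("pagesByURL", from, me));
    }
}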

Reading from a distributed WebDB is comparatively simple. The WebDB now exists in k separate pieces. When the distributed WebDBReader needs to examine a certain key, it simply consults the relevant segment. When asked to enumerate all entries, it enumerates each of the k segments in order.
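A minimal sketch of that routing for URL keys, using the breakpoint-URL partitioning described earlier; the per-segment reader array and its construction are assumptions:

import java.io.IOException;

public class RoutedLookup {
    private final String[] breakpoints;    // k-1 sorted breakpoint URLs
    private final IWebDBReader[] segments; // one reader per db segment

    public RoutedLookup(String[] breakpoints, IWebDBReader[] segments) {
        this.breakpoints = breakpoints;
        this.segments = segments;
    }

    public Page getPage(String url) throws IOException {
        int seg = 0;
        // Advance past every breakpoint that sorts at or below the URL.
        while (seg < breakpoints.length && url.compareTo(breakpoints[seg]) >= 0)
            seg++;
        return segments[seg].getPage(url);  // only one segment can hold the key
    }
}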

It's worthwhile to spend a little time on how the distributed WebDBWriter's inter-process communication works. No process ever opens a socket to communicate directly with another. Rather, all data is communicated via files. However, since the WebDBWriters exist over many machines and filesystems, these files need to be copied back and forth.

We do that via the NutchFileSystem. Really, "filesystem" is overstating the case a little bit. Rather, it's a mechanism for a shared file namespace, with some automatic file copying between machines that announce interest in that namespace.

Every object under control of a NutchFileSystem machine group is represented with a "NutchFile" object. A NutchFile is named using three different parameters:

  • "dbName" indicates the overall database that the file belongs to. This exists to enable multiple Nutch instances simultaneously on the same machine set. All files within a given instance will have the same dbName.

  • "shareGroupName" is used to control where the NutchFile will be copied. Clients of the NutchFileSystem ask for NutchFiles via sharegroup. Each machine in a NutchFileSystem machine group is configured so that it knows the entire (sharegroup<->machine) mapping. For the purposes of the distributed WebDB, we create a sharegroup for each partition of the webdb. When a single WebDBWriter is emitting k separate edits files, it is writing to files in k different share groups. Having written everything out, the Writer demands to see any files meant for its particular segment's sharegroup. (We will also create a "master" sharegroup to contain the final db output.)

  • "name" is just an arbitrary slash-separated filename. It describes a directory/filename hierarchy for the NutchFile in question.

A NutchFile object can be resolved to a "real-world" disk File with the help of a local NutchFileSystem object. Each machine in a NutchFileSystem machine group has a NutchFileSystem object that handles configuration and other services. One such config value is the place on a local disk where the "root" of the NutchFileSystem is found. The disk File embodiment of a NutchFile is a combination of that root, the shareGroupName, and the name.

(Of course, not all sharegroups' files will be present on each machine. That depends on the (sharegroup<->machine) mapping.)
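A minimal sketch of that resolution; the exact directory layout is an assumption, since the text says only that the disk path combines the local root, the shareGroupName, and the name:

import java.io.File;

public class NutchFilePath {
    // Resolve a NutchFile to a local disk File under this machine's root.
    public static File resolve(File localRoot, String shareGroup, String name) {
        return new File(new File(localRoot, shareGroup), name);
    }

    public static void main(String[] args) {
        File f = resolve(new File("/nutch"), "partition-2", "pagesByURL/edits");
        System.out.println(f);  // /nutch/partition-2/pagesByURL/edits
    }
}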

The NutchFileSystem also takes care of file moves, deletes, locking, and guaranteed atomicity.

It should be clear that the NutchFileSystem can be implemented across any group of machines that have mutual remote-access rights. It can also be used across machines that share mutual Network File System mounts.
