COMP6714 2024T3 Project: Ranked Retrieval with Spelling Correction



Please read this spec carefully and completely before you start programming.
In this project, you are going to implement (using Python3 on CSE Linux machines) a simple search engine that ranks the output documents based on the proximity of the matching terms. It also supports spelling correction of query terms with a maximum edit distance of 2 per search term (assuming Insert, Delete and Replace operations and no transpose). A search query in this project is a list of space-separated search terms; each search term may contain any numeric digits or uppercase/lowercase letters, and will not contain any punctuation. You will need to implement an indexer (to index the files) and a search program (to search the files based on the index generated by your indexer). As a core requirement for this project, you must implement your solution using an inverted index with positional information (for example, the positional index described in Lecture Week 1). You may also implement any additional indexes as appropriate, if you wish.
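To illustrate the spelling-correction constraint, the following is a minimal sketch of an edit distance restricted to Insert, Delete and Replace (no transpose), together with a check against the limit of 2. The function names edit_distance and within_two_edits are hypothetical, and comparing a query term against every indexed term this way is only one possible approach, not something the spec prescribes.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using insert, delete and replace only (no transpose)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace ca with cb (or match)
        prev = curr
    return prev[-1]


def within_two_edits(query_term: str, index_term: str) -> bool:
    """Hypothetical helper: True if the terms are at most 2 edits apart."""
    # A length difference above 2 already needs more than 2 inserts/deletes.
    if abs(len(query_term) - len(index_term)) > 2:
        return False
    return edit_distance(query_term, index_term) <= 2
```

Because transposition is not an allowed operation, swapping two adjacent letters (for example "braech" for "breach") counts as two replacements and therefore uses the whole budget of 2.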
Given a search query, each matching document in the search result must contain terms that match all the search terms according to the following term matching rules:

· Search is case insensitive.

· Full stops for abbreviations are ignored. e.g., U.S., US are the same.

· Singular/Plural is ignored. e.g., cat, cats, cat's, cats' are all the same.

· Tense is ignored. e.g., breaches, breach, breached, breaching are all the same.

· Numeric tokens such as years, integers should be indexed accordingly and searchable.

· Commas in numeric tokens are ignored, e.g., 1,000,000 and 1000000 are the same.

· Numbers with decimal places are excluded from the index, as a decimal number is not a valid search term (since '.' is not allowed).

· Apart from the above, all other punctuation should be treated as token dividers.
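The punctuation and number rules above could be sketched as a small tokenizer like the one below. This is an illustration under assumptions, not the required implementation: the regular expressions and the name tokenize are hypothetical, and the singular/plural and tense rules are deliberately left out because they would need a stemmer or lemmatizer applied to each token afterwards (only the possessive 's is stripped here).

```python
import re
from typing import List

# Decimal numbers such as 3.14 or 1,234.56 are dropped entirely.
DECIMAL = re.compile(r"\d[\d,]*(?:\.\d+)+")
# Commas between digits are removed: 1,000,000 -> 1000000.
NUM_COMMA = re.compile(r"(?<=\d),(?=\d)")
# Abbreviations written as single letters with full stops: U.S. -> US.
ABBREV = re.compile(r"\b(?:[A-Za-z]\.){2,}")
# Possessive 's, so that cat's yields cat rather than cat + s.
POSSESSIVE = re.compile(r"'s\b")


def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer covering the case, abbreviation and number rules."""
    text = DECIMAL.sub(" ", text)
    text = NUM_COMMA.sub("", text)
    text = ABBREV.sub(lambda m: m.group(0).replace(".", ""), text)
    text = POSSESSIVE.sub("", text)
    # Everything that is not a letter or digit acts as a token divider.
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]
```

For example, tokenize("The U.S. paid 1,000,000 dollars, not 3.14, for the cat's toys.") yields the, us, paid, 1000000, dollars, not, for, the, cat, toys. Reducing toys to toy, or breached to breach, would be the job of the stemming step that this sketch omits.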

As a requirement of this project, the matching documents in a search result are ranked according to the distances between the matching terms in these documents, such that matching terms closer to each other will be ranked higher than those further apart. Further details are described below in the Ranking section.
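As a rough illustration of why positional information matters for this kind of ranking, the sketch below computes the smallest gap between the positions of two terms in one document and orders documents by that gap. The exact ranking rules are the ones given in the Ranking section; the function name min_distance and the sample postings are assumptions made only for this example.

```python
def min_distance(positions_a, positions_b):
    """Smallest absolute gap between two sorted position lists (None if either is empty)."""
    i = j = 0
    best = None
    while i < len(positions_a) and j < len(positions_b):
        gap = abs(positions_a[i] - positions_b[j])
        if best is None or gap < best:
            best = gap
        # Advance the pointer at the smaller position to look for a closer pair.
        if positions_a[i] < positions_b[j]:
            i += 1
        else:
            j += 1
    return best


# Hypothetical postings: document ID -> (positions of term 1, positions of term 2).
doc_positions = {"12": ([3, 40], [5]), "7": ([1], [30])}

# Documents whose matching terms sit closer together come first.
ranked = sorted(doc_positions, key=lambda d: min_distance(*doc_positions[d]))
```

Here document 12 has the two terms 2 positions apart and document 7 has them 29 apart, so ranked is ["12", "7"].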
You are provided with approximately 1000 small documents (named with their document IDs) available in ~cs6714/reuters/data. You can find these files by logging into CSE machines and going to folder ~cs6714/reuters/data. Your submitted project will be tested against a similar collection of up to 1000 documents (i.e., we may replace some of these documents to avoid any hard-coded solutions).
Your submission must include 2 main programs: index.py and search.py as described below. You may submit additional Python files alongside these 2 files. It is your responsibility to submit any other Python files required for the 2 main programs to work properly.
This project will be marked based on auto marking, and will then be checked manually for other requirements described in this specification (for example, whether a positional index is implemented). To ensure your project satisfies the input and output formatting requirements, a simple sanity test script is available in ~cs6714/reuters/sanity. You should run the sanity test script before you submit your solution. To run the sanity test script on a CSE Linux machine, simply go inside the folder that contains your index.py and search.py and type: ~cs6714/reuters/sanity which will run tests based on the examples presented below. Note that it is only a sanity test, primarily for formatting, and you are expected to test your project more thoroughly.

The Indexer

Your indexer is run by

python3 index.py [folder-of-documents] [folder-of-indexes]

where [folder-of-documents] is the path to the directory for the collection of documents to be indexed and [folder-of-indexes] is the path to the directory where the index file(s) should be created. All the files in [folder-of-documents] should be opened as read-only, as you may not have the write permission for these files. If [folder-of-indexes] does not exist, create a new directory as specified. You may create multiple index files although too many index files may slow down your performance. The total size of all your generated index files shall not exceed 20MB (which should be plenty for this project).
The following is an example of how the indexer is run:

$ python3 index.py ~cs6714/reuters/data ./MyTestIndex
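A minimal sketch of an index.py that follows this command-line contract is shown below. It opens the documents read-only, creates the index folder if it is missing, and writes a positional index (term -> document ID -> list of token positions). The single file name postings.json, the JSON format, and the simplified lowercase tokenization are all assumptions made for illustration; the spec does not prescribe an index format, and the full term matching rules are not applied here.

```python
import json
import os
import re
import sys
from collections import defaultdict


def build_positional_index(doc_dir):
    """Map each term to {document ID: [token positions]} for all files in doc_dir."""
    index = defaultdict(dict)
    for name in sorted(os.listdir(doc_dir)):
        path = os.path.join(doc_dir, name)
        if not os.path.isfile(path):
            continue
        # Open read-only: write permission on the documents is not guaranteed.
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            # Simplified tokenization (assumption): lowercase alphanumeric runs.
            tokens = re.findall(r"[a-z0-9]+", f.read().lower())
        for pos, tok in enumerate(tokens):
            index[tok].setdefault(name, []).append(pos)
    return index


def main(argv):
    doc_dir, index_dir = argv[1], argv[2]
    os.makedirs(index_dir, exist_ok=True)  # create the index folder if needed
    index = build_positional_index(doc_dir)
    # "postings.json" is a hypothetical file name chosen for this sketch.
    with open(os.path.join(index_dir, "postings.json"), "w", encoding="utf-8") as f:
        json.dump(index, f)


if __name__ == "__main__":
    if len(sys.argv) == 3:
        main(sys.argv)
    else:
        print("usage: python3 index.py [folder-of-documents] [folder-of-indexes]",
              file=sys.stderr)
```

JSON is convenient for a sketch because it is human-readable, but it is not compact; a real submission would need to check that the generated files stay under the 20MB limit.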

The Search

After the indexer is r