K most frequent words from a file

This article explores an efficient algorithm for finding the K most frequent words in a large text sequence. By using a hash table to record word frequencies and a partial heap sort to optimize the sorting phase, it achieves a time complexity of O(n + k*log(k)).

Input: A positive integer K and a big text. The text can be viewed as a word sequence, so we don't have to worry about how to break it down into words.
Output: The most frequent K words in the text.

My thinking is like this. 

  1. Use a hash table to record every word's frequency while traversing the whole word sequence. In this phase, the key is "word" and the value is "word-frequency". This takes O(n) time. 

  2. Sort the (word, word-frequency) pairs, with "word-frequency" as the key. This takes O(n*lg(n)) time with a normal sorting algorithm. 

  3. After sorting, we just take the first K words. This takes O(K) time. 

To summarize, the total time is O(n + n*lg(n) + K). Since K is surely smaller than N, it is actually O(n*lg(n)).

We can improve this. Actually, we only want the top K words; the other words' frequencies are of no concern to us. So, we can use "partial heap sorting". For steps 2) and 3), we don't do a full sort. Instead, we change them to:

2') build a heap of (word, word-frequency) pair with "word-frequency" as key. It takes O(n) time to build a heap;

3') extract top K words from the heap. Each extraction is O(lg(n)). So, total time is O(k*lg(n)).

To summarize, this solution costs O(n + k*lg(n)) time; a minimal code sketch of the whole approach is given below.
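
Here is a minimal sketch of this hash-plus-heap approach in C++ (the function `topKWords` and the small driver are illustrative names, not from any library; `std::make_heap`/`std::pop_heap` play the role of the partial heap sort):

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, int>>
topKWords(const std::vector<std::string>& words, std::size_t k) {
    // Step 1: count frequencies with a hash table, O(n).
    std::unordered_map<std::string, int> freq;
    for (const auto& w : words) ++freq[w];

    // Step 2': heapify all (frequency, word) pairs, O(m) for m distinct words.
    std::vector<std::pair<int, std::string>> heap;
    heap.reserve(freq.size());
    for (const auto& kv : freq) heap.emplace_back(kv.second, kv.first);
    std::make_heap(heap.begin(), heap.end());        // max-heap on frequency

    // Step 3': pop the k most frequent words, O(k * lg m).
    std::vector<std::pair<std::string, int>> result;
    for (std::size_t i = 0; i < k && !heap.empty(); ++i) {
        std::pop_heap(heap.begin(), heap.end());     // moves the current max to the back
        result.emplace_back(heap.back().second, heap.back().first);
        heap.pop_back();
    }
    return result;
}

int main() {
    std::vector<std::string> text = {"the", "cat", "sat", "on", "the", "mat", "the"};
    for (const auto& p : topKWords(text, 2))
        std::cout << p.first << " : " << p.second << '\n';
}

Note that the heap in step 2' is built over the m distinct words rather than over all n tokens, so the extraction cost is really O(k*lg(m)), which matches the point made in the comments about m being much smaller than n.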

This is just my thought. I haven't found a way to improve step 1).
I hope some information retrieval experts can shed more light on this question.

Would you use merge sort or quicksort for the O(n*logn) sort? –  committedandroider  Feb 27 '15 at 18:15
1 
For practical uses, Aaron Maenpaa's answer of counting on a sample is best. It's not like the most frequent words will hide from your sample. For you complexity geeks, it's O(1) since the size of the sample is fixed. You don't get the exact counts, but you're not asking for them either. –  Nikana Reklawyks  May 5 '15 at 22:00
 
If what you want is a review of your complexity analysis, then I'd better mention: if n is the number of words in your text and m is the number of different words (types, we call them), step 1 is O(n), but step 2 is O(m*lg(m)), and m << n (you may have billions of words and not reach a million types; try it out). So even with a dummy algorithm, it's still O(n + m*lg(m)) = O(n). –  Nikana Reklawyks  May 5 '15 at 22:40 

16 Answers

This can be done in O(n) time

Solution 1:

Steps:

  1. Count the words and hash them, which will end up in a structure like this:

    var hash = {
      "I" : 13,
      "like" : 3,
      "meow" : 3,
      "geek" : 3,
      "burger" : 2,
      "cat" : 1,
      "foo" : 100,
      ...
    };
    
  2. Traverse the hash and find the most frequently used word (in this case "foo" with 100), then create an array of that size.

  3. Then traverse the hash again and use each word's number of occurrences as the array index; if there is nothing at that index yet, create a list there, otherwise append the word to it. We end up with an array like:

      0    1      2            3                 100
    [[ ],[cat],[burger],[like, meow, geek],[]...[foo]]
    
  4. Then just traverse the array from the end and collect the k words (a sketch of the whole procedure follows these steps).
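
A compact C++ sketch of this counting-then-bucketing idea (assuming C++11; `topKByBucket` is an illustrative name):

#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>

std::vector<std::string>
topKByBucket(const std::vector<std::string>& words, std::size_t k) {
    // Step 1: count word frequencies in a hash table.
    std::unordered_map<std::string, int> freq;
    for (const auto& w : words) ++freq[w];

    // Step 2: find the highest frequency to size the bucket array.
    int maxFreq = 0;
    for (const auto& kv : freq) maxFreq = std::max(maxFreq, kv.second);

    // Step 3: bucket[f] holds every word that occurs exactly f times.
    std::vector<std::vector<std::string>> bucket(maxFreq + 1);
    for (const auto& kv : freq) bucket[kv.second].push_back(kv.first);

    // Step 4: walk the buckets from the highest frequency downwards
    // and collect the first k words encountered.
    std::vector<std::string> result;
    for (int f = maxFreq; f >= 1 && result.size() < k; --f)
        for (const auto& w : bucket[f])
            if (result.size() < k) result.push_back(w);
    return result;
}

As a later comment points out, the bucket array can be very sparse when one word dominates, so this trades extra memory for the O(n) time bound.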

Solution 2:

Steps:

  1. Same as above
  2. Use a min heap and keep its size at k. For each word in the hash, compare its count with the heap's minimum: if the heap is not yet of size k, simply insert the word; if the heap is full and the count is greater than the current minimum, remove the minimum and insert the new word; otherwise do nothing.
  3. After traversing the hash, convert the min heap to an array and return it (see the sketch after these steps).
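
A sketch of Solution 2 in C++, with `std::priority_queue` acting as the bounded min heap (again, the function name is only illustrative):

#include <functional>
#include <queue>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, int>>
topKWithBoundedHeap(const std::vector<std::string>& words, std::size_t k) {
    // Step 1: count frequencies, exactly as in Solution 1.
    std::unordered_map<std::string, int> freq;
    for (const auto& w : words) ++freq[w];

    // Step 2: min heap on frequency, never holding more than k entries.
    using Entry = std::pair<int, std::string>;            // (frequency, word)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (const auto& kv : freq) {
        if (heap.size() < k) {
            heap.emplace(kv.second, kv.first);            // heap not full: just insert
        } else if (kv.second > heap.top().first) {
            heap.pop();                                   // evict the current minimum
            heap.emplace(kv.second, kv.first);
        }                                                  // otherwise: do nothing
    }

    // Step 3: drain the heap; entries come out in ascending frequency order.
    std::vector<std::pair<std::string, int>> result;
    while (!heap.empty()) {
        result.emplace_back(heap.top().second, heap.top().first);
        heap.pop();
    }
    return result;
}

As noted in the comment below, this runs in O(n lg k): every word is examined once, and each heap operation touches at most k entries.
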
7 
Your solution (1) is an O(n) bucket sort replacing a standard O(n lg n) comparison sort. Your approach requires additional space for the bucket structure, but comparison sorts can be done in place. Your solution (2) runs in time O(n lg k) -- that is, O(n) to iterate over all words and O(lg k) to add each one into the heap. –  stackoverflowuser2010  Sep 29 '14 at 4:29
3 
The first solution does require more space, but it is important to emphasize that it is in fact O(n) in time. 1: Hash frequencies keyed by word, O(n); 2: Traverse frequency hash, create second hash keyed by frequency. This is O(n) to traverse the hash and O(1) to add a word to the list of words at that frequency. 3: Traverse hash down from max frequency until you hit k. At most, O(n). Total = 3 * O(n) = O(n). –  BringMyCakeBack Nov 5 '14 at 1:01 
1 
Typically when counting words, your number of buckets in solution 1 is widely overestimated (because the number one most frequent word is so much more frequent than the second and third best), so your array is sparse and inefficient. –  Nikana Reklawyks  May 5 '15 at 22:03

You're generally not going to get a better runtime than the solution you've described. You have to do at least O(n) work to evaluate all the words, and then O(k) extra work to find the top k terms.

If your problem set is really big, you can use a distributed solution such as map/reduce. Have n map workers count frequencies on 1/nth of the text each, and for each word, send it to one of m reducer workers calculated based on the hash of the word. The reducers then sum the counts. Merge sort over the reducers' outputs will give you the most popular words in order of popularity.
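
A single-process sketch of that flow (real worker parallelism and the network shuffle are omitted; `mapReduceCounts`, the shard layout, and the use of `std::hash` to route words are illustrative assumptions):

#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Simulates the map/reduce flow described above in one process.
// In a real cluster each "mapper" and "reducer" would be a separate worker;
// here they are just loops, to show how the word -> reducer routing works.
std::vector<std::unordered_map<std::string, long>>
mapReduceCounts(const std::vector<std::vector<std::string>>& shards, std::size_t numReducers) {
    std::hash<std::string> hasher;
    std::vector<std::unordered_map<std::string, long>> reducers(numReducers);

    // "Map" phase: each shard counts its own words locally.
    for (const auto& shard : shards) {
        std::unordered_map<std::string, long> local;
        for (const auto& w : shard) ++local[w];

        // "Shuffle": every word is routed to one reducer by hashing the word,
        // so all counts for the same word end up in the same place.
        for (const auto& kv : local)
            reducers[hasher(kv.first) % numReducers][kv.first] += kv.second;
    }
    // The "reduce" phase is the += above; each reducer now holds final counts
    // for its share of the vocabulary, ready for a per-reducer top-k pass.
    return reducers;
}

Because every word is routed to exactly one reducer, each reducer can run any of the in-memory top-k methods on its own partial vocabulary, and merging the per-reducer top-k lists gives the global answer.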


A small variation on your solution yields an O(n) algorithm if we don't care about ranking the top K, and an O(n + k*lg(k)) solution if we do. I believe both of these bounds are optimal within a constant factor.

The optimization here comes again after we run through the list, inserting into the hash table. We can use the median of medians algorithm to select the Kth largest element in the list. This algorithm is provably O(n).

After selecting the Kth largest element (by frequency), we partition the list around it just as in quicksort. This is obviously also O(n). Everything on the "left" side of the pivot (the higher-frequency side) is in our group of K elements, so we're done (we can simply throw away everything else as we go along).

So this strategy is:

  1. Go through each word and insert it into a hash table: O(n)
  2. Select the Kth largest element (by frequency): O(n)
  3. Partition around that element: O(n)

If you want to rank the K elements, simply sort them with any efficient comparison sort in O(k * lg(k)) time, yielding a total run time of O(n+k * lg(k)).
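
A C++ sketch of this select-then-partition strategy: `std::nth_element` performs the "find the Kth element and partition around it" step in expected O(m) time over the m distinct words (implementations typically use introselect rather than the median-of-medians algorithm cited above, but it is used the same way here; the function name is illustrative):

#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, int>>
topKBySelection(const std::vector<std::string>& words, std::size_t k, bool ranked = true) {
    // Step 1: count frequencies, O(n).
    std::unordered_map<std::string, int> freq;
    for (const auto& w : words) ++freq[w];

    std::vector<std::pair<std::string, int>> items(freq.begin(), freq.end());
    k = std::min(k, items.size());

    // Steps 2-3: select and partition in one call. Afterwards the k
    // highest-frequency pairs occupy items[0..k-1], in no particular order.
    auto byFreqDesc = [](const std::pair<std::string, int>& a,
                         const std::pair<std::string, int>& b) {
        return a.second > b.second;
    };
    std::nth_element(items.begin(), items.begin() + k, items.end(), byFreqDesc);
    items.resize(k);

    // Optional ranking step, O(k * lg k).
    if (ranked) std::sort(items.begin(), items.end(), byFreqDesc);
    return items;
}

As one comment below points out, the selection has to run over a flat copy of the (word, count) pairs, so it needs linear extra space on top of the hash table.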

The O(n) time bound is optimal within a constant factor because we must examine each word at least once. 

The O(n + k * lg(k)) time bound is also optimal because there is no comparison-based way to sort k elements in less than k * lg(k) time. 

When we select the Kth smallest element, what gets selected is the Kth smallest hash-key. It is not necessary that there are exactly K words in the left partition of Step 3. –  Prakash Murali  May 20 '12 at 15:10 
2 
You will not be able to run "median of medians" on the hash table directly, as it does swaps. You would have to copy the data from the hash table to a temp array, so O(n) storage will be required. –  user674669  Feb 20 '13 at 17:08 
 
I don't understand how you can select the Kth smallest element in O(n)? –  Michael Ho Chum  Mar 16 '15 at 16:18
 
Check this out for the algorithm for finding Kth smallest element in O(n) - wikiwand.com/en/Median_of_medians –  Piyush  Feb 2 at 14:19

If your "big word list" is big enough, you can simply sample and get estimates. Otherwise, I like hash aggregation.

Edit:

By sample I mean choose some subset of pages and calculate the most frequent word in those pages. Provided you select the pages in a reasonable way and select a statistically significant sample, your estimates of the most frequent words should be reasonable.

This approach is really only reasonable if you have so much data that processing it all is just kind of silly. If you only have a few megs, you should be able to tear through the data and calculate an exact answer without breaking a sweat rather than bothering to calculate an estimate.
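
One way to make the sampling idea concrete, sketched in C++ (the word-level uniform sample, the fixed `sampleSize`, and the seed are assumptions for illustration; the answer talks about sampling pages, which is the same idea at a coarser granularity):

#include <algorithm>
#include <random>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, int>>
estimateTopK(const std::vector<std::string>& corpus, std::size_t k,
             std::size_t sampleSize = 1000000) {
    if (corpus.empty() || k == 0) return {};

    // Draw a fixed-size sample of word positions, with replacement.
    std::mt19937 rng(12345);                      // fixed seed, for repeatability
    std::uniform_int_distribution<std::size_t> pick(0, corpus.size() - 1);

    std::unordered_map<std::string, int> freq;
    for (std::size_t i = 0; i < sampleSize; ++i)
        ++freq[corpus[pick(rng)]];

    // Rank the sampled counts and keep the k largest; for genuinely frequent
    // words the relative order is very likely to match the full corpus.
    std::vector<std::pair<std::string, int>> items(freq.begin(), freq.end());
    std::sort(items.begin(), items.end(),
              [](const std::pair<std::string, int>& a,
                 const std::pair<std::string, int>& b) { return a.second > b.second; });
    if (items.size() > k) items.resize(k);
    return items;
}

The returned counts are sample counts, not full-corpus counts; scale them by corpus.size()/sampleSize if you want estimated absolute frequencies.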

Sometimes you have to do this many times over, for example if you're trying to get the list of frequent words per website, or per subject. In that case, "without breaking a sweat" doesn't really cut it. You still need to find a way to do it as efficiently as possible. –  itsadok  Sep 16 '09 at 8:31
 
+1 for a practical answer that doesn't address the irrelevant complexity issues. @itsadok: For each run: if it's big enough, sample it; if it's not, then gaining a log factor is irrelevant. –  Nikana Reklawyks  May 5 '15 at 22:09

You can cut down the time further by partitioning using the first letter of words, then partitioning the largest multi-word set using the next character, until you have k single-word sets. You would use a sort of 256-way tree with lists of partial/complete words at the leaves. You would need to be very careful not to cause string copies everywhere.

This algorithm is O(m), where m is the number of characters. It avoids that dependence on k, which is very nice for large k [by the way your posted running time is wrong, it should be O(n*lg(k)), and I'm not sure what that is in terms of m].

If you run both algorithms side by side you will get what I'm pretty sure is an asymptotically optimal O(min(m, n*lg(k))) algorithm, but mine should be faster on average because it doesn't involve hashing or sorting.

7 
What you're describing is called a 'trie'. –  Nick Johnson  Oct 9 '08 at 7:53
 
Hi Strilanc. Can you explain the partitioning process in detail? –  Morgan Cheng  Oct 9 '08 at 11:39
1 
How does this not involve sorting? Once you have the trie, how do you pluck out the k words with the largest frequencies? Doesn't make any sense. –  ordinary  Nov 12 '13 at 7:38 

You have a bug in your description: counting takes O(n) time, but sorting takes O(m*lg(m)), where m is the number of unique words. This is usually much smaller than the total number of words, so you should probably just optimize how the hash is built.


Your problem is the same as this one: http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/

Use a Trie and a min heap to solve it efficiently.


If what you're after is the list of the k most frequent words in your text for any practical k and any natural language, then the complexity of your algorithm is not relevant. 

Just sample, say, a few million words from your text, process that with any algorithm in a matter of seconds, and the most frequent counts will be very accurate.

As a side note, the complexity of the dummy algorithm (1. count all 2. sort the counts 3. take the best) is O(n+m*log(m)), where m is the number of different words in your text. log(m) is much smaller than (n/m), so it remains O(n). 

In practice, the slow step is the counting.



Find the k most frequent words from a file

Given a book of words, and assuming you have enough main memory to hold all of them, design a data structure to find the top K most frequently occurring words. The data structure should be dynamic so that new words can be added. 

A simple solution is to use Hashing. Hash all words one by one in a hash table. If a word is already present, then increment its count. Finally, traverse through the hash table and return the k words with maximum counts.
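
A brief sketch of this simple hashing solution (using `std::partial_sort` for the final "k words with maximum counts" step; the function name is illustrative):

#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<std::pair<std::string, int>>
simpleTopK(const std::vector<std::string>& words, std::size_t k) {
    // Hash all words one by one; an existing word just has its count incremented.
    std::unordered_map<std::string, int> counts;
    for (const auto& w : words) ++counts[w];

    // Copy out the (word, count) pairs and sort only the first k of them.
    std::vector<std::pair<std::string, int>> items(counts.begin(), counts.end());
    k = std::min(k, items.size());
    std::partial_sort(items.begin(), items.begin() + k, items.end(),
                      [](const std::pair<std::string, int>& a,
                         const std::pair<std::string, int>& b) {
                          return a.second > b.second;
                      });
    items.resize(k);
    return items;
}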

We can use a Trie and a Min Heap to get the k most frequent words efficiently. The idea is to use the Trie for searching existing words and adding new words efficiently; the Trie also stores the count of occurrences of each word. A Min Heap of size k is used to keep track of the k most frequent words at any point in time (the Min Heap is used exactly as it is for finding the k largest elements of an array).
The Trie and the Min Heap are linked to each other: every Trie leaf stores an extra field, ‘indexMinHeap’, and every Min Heap node stores a pointer back to its Trie leaf (the ‘root’ field in the code below). ‘indexMinHeap’ is kept at -1 for words that are currently not in the Min Heap (i.e., currently not among the top k frequent words); for words that are in the Min Heap, it holds the word's index within the heap.

Following is the complete process to print k most frequent words from a file.

Read the words one by one. For every word, insert it into the Trie, increasing its counter if it already exists. The word then also has to be inserted into the min heap, and three cases arise:

1. The word is already present in the min heap. We just increase the corresponding frequency value in the min heap and call minHeapify() on the index stored in the word's ‘indexMinHeap’ field in the Trie. Whenever two min heap nodes are swapped, the corresponding ‘indexMinHeap’ values in the Trie are updated as well; remember that each min heap node also holds a pointer to its Trie leaf node.

2. The min heap is not yet full. We insert the new word at the end of the heap array, store its frequency and its Trie pointer there, record the heap index in the Trie leaf node, and then call buildMinHeap().

3. The min heap is full. Two sub-cases arise:

3.1 The frequency of the new word is less than or equal to the frequency of the word stored at the root of the min heap. Do nothing.

3.2 The frequency of the new word is greater than the frequency of the word stored at the root of the min heap. Replace the root with the new word and update its fields. Make sure to set the ‘indexMinHeap’ of the replaced word back to -1 in the Trie, since that word is no longer in the min heap.

Finally, the Min Heap holds the k most frequent words among all words present in the given file, so we just need to print every word in the Min Heap.

// A program to find k most frequent words in a file
#include <stdio.h>
#include <string.h>
#include <ctype.h>
 
# define MAX_CHARS 26
# define MAX_WORD_SIZE 30
 
// A Trie node
struct TrieNode
{
     bool isEnd; // indicates end of word
     unsigned frequency;  // the number of occurrences of a word
     int indexMinHeap; // the index of the word in minHeap
     TrieNode* child[MAX_CHARS]; // represents 26 slots each for 'a' to 'z'.
};
 
// A Min Heap node
struct MinHeapNode
{
     TrieNode* root; // indicates the leaf node of TRIE
     unsigned frequency; //  number of occurrences
     char * word; // the actual word stored
};
 
// A Min Heap
struct MinHeap
{
     unsigned capacity; // the total capacity of the min heap
     int count; // indicates the number of slots filled.
     MinHeapNode* array; //  represents the collection of minHeapNodes
};
 
// A utility function to create a new Trie node
TrieNode* newTrieNode()
{
     // Allocate memory for Trie Node
     TrieNode* trieNode = new TrieNode;
 
     // Initialize values for new node
     trieNode->isEnd = 0;
     trieNode->frequency = 0;
     trieNode->indexMinHeap = -1;
     for ( int i = 0; i < MAX_CHARS; ++i )
         trieNode->child[i] = NULL;
 
     return trieNode;
}
 
// A utility function to create a Min Heap of given capacity
MinHeap* createMinHeap( int capacity )
{
     MinHeap* minHeap = new MinHeap;
 
     minHeap->capacity = capacity;
     minHeap->count  = 0;
 
     // Allocate memory for array of min heap nodes
     minHeap->array = new MinHeapNode [ minHeap->capacity ];
 
     return minHeap;
}
 
// A utility function to swap two min heap nodes. This function
// is needed in minHeapify
void swapMinHeapNodes ( MinHeapNode* a, MinHeapNode* b )
{
     MinHeapNode temp = *a;
     *a = *b;
     *b = temp;
}
 
// This is the standard minHeapify function. It does one thing extra:
// it updates the indexMinHeap field in the Trie when two nodes are
// swapped in the min heap
void minHeapify( MinHeap* minHeap, int idx )
{
     int left, right, smallest;
 
     left = 2 * idx + 1;
     right = 2 * idx + 2;
     smallest = idx;
     if ( left < minHeap->count &&
          minHeap->array[ left ]. frequency <
          minHeap->array[ smallest ]. frequency
        )
         smallest = left;
 
     if ( right < minHeap->count &&
          minHeap->array[ right ]. frequency <
          minHeap->array[ smallest ]. frequency
        )
         smallest = right;
 
     if ( smallest != idx )
     {
         // Update the corresponding index in Trie node.
         minHeap->array[ smallest ]. root->indexMinHeap = idx;
         minHeap->array[ idx ]. root->indexMinHeap = smallest;
 
         // Swap nodes in min heap
         swapMinHeapNodes (&minHeap->array[ smallest ], &minHeap->array[ idx ]);
 
         minHeapify( minHeap, smallest );
     }
}
 
// A standard function to build a heap
void buildMinHeap( MinHeap* minHeap )
{
     int n, i;
     n = minHeap->count - 1;
 
     for ( i = ( n - 1 ) / 2; i >= 0; --i )
         minHeapify( minHeap, i );
}
 
// Inserts a word to heap, the function handles the 3 cases explained above
void insertInMinHeap( MinHeap* minHeap, TrieNode** root, const char * word )
{
     // Case 1: the word is already present in minHeap
     if ( (*root)->indexMinHeap != -1 )
     {
         ++( minHeap->array[ (*root)->indexMinHeap ]. frequency );
 
         // percolate down
         minHeapify( minHeap, (*root)->indexMinHeap );
     }
 
     // Case 2: Word is not present and heap is not full
     else if ( minHeap->count < minHeap->capacity )
     {
         int count = minHeap->count;
         minHeap->array[ count ]. frequency = (*root)->frequency;
         minHeap->array[ count ]. word = new char [ strlen ( word ) + 1];
         strcpy ( minHeap->array[ count ]. word, word );
 
         minHeap->array[ count ]. root = *root;
         (*root)->indexMinHeap = minHeap->count;
 
         ++( minHeap->count );
         buildMinHeap( minHeap );
     }
 
     // Case 3: Word is not present and heap is full. And frequency of word
     // is more than root. The root is the least frequent word in heap,
     // replace root with new word
     else if ( (*root)->frequency > minHeap->array[0]. frequency )
     {
 
         minHeap->array[ 0 ]. root->indexMinHeap = -1;
         minHeap->array[ 0 ]. root = *root;
         minHeap->array[ 0 ]. root->indexMinHeap = 0;
         minHeap->array[ 0 ]. frequency = (*root)->frequency;
 
         // delete the previously allocated memory and copy in the new word
         delete [] minHeap->array[ 0 ]. word;
         minHeap->array[ 0 ]. word = new char [ strlen ( word ) + 1];
         strcpy ( minHeap->array[ 0 ]. word, word );
 
         minHeapify ( minHeap, 0 );
     }
}
 
// Inserts a new word to both Trie and Heap
void insertUtil ( TrieNode** root, MinHeap* minHeap,
                         const char * word, const char * dupWord )
{
     // Base Case
     if ( *root == NULL )
         *root = newTrieNode();
 
     //  There are still more characters in word
     if ( *word != '\0' )
          insertUtil ( &((*root)->child[ tolower ( *word ) - 'a' ]),  // assumes alphabetic words only
                           minHeap, word + 1, dupWord );
     else // The complete word is processed
     {
         // word is already present, increase the frequency
         if ( (*root)->isEnd )
             ++( (*root)->frequency );
         else
         {
             (*root)->isEnd = 1;
             (*root)->frequency = 1;
         }
 
         // Insert in min heap also
         insertInMinHeap( minHeap, root, dupWord );
     }
}
 
 
// add a word to Trie & min heap.  A wrapper over the insertUtil
void insertTrieAndHeap( const char *word, TrieNode** root, MinHeap* minHeap)
{
     insertUtil( root, minHeap, word, word );
}
 
// A utility function to show results. The min heap
// contains the k most frequent words so far, at any time
void displayMinHeap( MinHeap* minHeap )
{
     int i;
 
     // print top K word with frequency
     for ( i = 0; i < minHeap->count; ++i )
     {
         printf ( "%s : %d\n" , minHeap->array[i].word,
                             minHeap->array[i].frequency );
     }
}
 
// The main function that takes a file as input, adds words to the heap
// and Trie, and finally shows the result from the heap
void printKMostFreq( FILE * fp, int k )
{
     // Create a Min Heap of Size k
     MinHeap* minHeap = createMinHeap( k );
    
     // Create an empty Trie
     TrieNode* root = NULL;
 
     // A buffer to store one word at a time
     char buffer[MAX_WORD_SIZE];
 
     // Read words one by one from file.  Insert the word in Trie and Min Heap
     while ( fscanf ( fp, "%s" , buffer ) != EOF )
         insertTrieAndHeap(buffer, &root, minHeap);
 
     // The Min Heap will have the k most frequent words, so print Min Heap nodes
     displayMinHeap( minHeap );
}
 
// Driver program to test above functions
int main()
{
     int k = 5;
     FILE *fp = fopen ( "file.txt" , "r" );
     if (fp == NULL)
         printf ( "File doesn't exist " );
     else
         printKMostFreq (fp, k);
     return 0;
}

Output:

your : 3
well : 3
and : 4
to : 4
Geeks : 6

The above output is for a sample input file.
