Longest Repeated Substring

最新推荐文章于 2021-02-27 21:28:39 发布

转载最新推荐文章于 2021-02-27 21:28:39 发布 · 1.1k 阅读

Algorithm 专栏收录该内容

37 篇文章

订阅专栏

http://nriverwang.blogspot.hk/

This problem is from codeeval.com.

Problem Description:

You are to find the longest repeated substring in a given text. Repeated substrings may not overlap. If more than one substring is repeated with the same length, print the first one you find.(starting from the beginning of the text). NOTE: The substrings can't be all spaces.

Input Sample:
Your program should accept as its first argument a path to a filename. The input file contains several lines. Each line is one test case. Each line contains a test string. eg.

banana
abc

Output Sample:
For each set of input produce a single line of output which is the longest repeated substring. If there is none, print out the string NONE. eg.

an
NONE

Analysis:

The input text is actually a string and suppose the length of string is n .

Brute Force:
The brute force method is easy: for each index pair [ i, j ] ( 1 <= i, j <= n ) of the string, count the length of non-overlapping repeated substring starting from this two indices . And then get the first repeated substring with the longest length. Although the non-overlapping constraint can help to reduce the character comparison times while checking two substrings, the running time in the worst case is still O(n^3) (can be proved, but omitted here). Obviously the brute force method is not a good choice.

Suffix Array:
Another method is to use suffix array. For the input string str , we first sort all of its suffixes in the suffix array. eg.

str = banana, its suffixes are:
banana
anana
nana
ana
na
a

after sort, the suffix array looks like:
a
ana
anana
banana
na
nana

Then for each two adjacent suffixes on the sorted suffix array, count the length of the common prefix (Note: should avoid counting the overlapped part). The longest repeated prefix appearing first on the original input string is the result. e.g.

a
ana
anana
banana
na
nana
The red prefixes an and green prefixes na both have the same longest length 2, but an is the result because it appears before na in str = banana.

Similar to brute force method, the suffixes comparison in sorting and repeated prefix length counting make the running time be O(n^2lgn) (suppose an O(nlgn) sorting algorithm is used). The length of suffix array is n and it may just need to store the indices pointing to the suffixes in the input string rather than the suffixes themselves. Therefore the space complexity is O(n) .

Dynamic Programming:
O(n^2lgn) looks good, however, we can solve this problem in O(n^2) time and O(n^2) space with DP. Suppose we first don't consider the constraint of no overlap. The idea is to find the longest repeated suffix for all pairs of prefixes of the string str . The length of the longest repeated suffix for prefixes str[1..i] and str[1..j] is:

c(i, j) = 1) c(i-1, j-1) + 1   if str[i] == str[j];
          2) 0                 otherwise.

The maximum of c(i, j) ( 1 <= i, j <= n ) is the length of longest repeated substring (suffix) of str . The longest repeated substring should be stored and dynamically updated during the calculating of c(i, j) .

Now consider the non-overlapping constraint, otherwise the above process will give us the string itself. Look at prefixes str[1..i] and str[1..j] , the end indices of their suffixes are i and j (for simplicity, suppose i <= j ), respectively. If their common suffix is longer than j-i , the two suffixes must overlap. The overlapped part should not be counted and the above recursion should be modified as:

c(i, j) = 1) c(i-1, j-1) + 1   if str[i] == str[j] && abs(i-j) > c(i-1, j-1);
          2) 0                 otherwise.

If more than one repeated substring have the same longest length, we just need to take the first one in the input string. More details can be found in the code.

Code (C++)

Bellow is the code of DP method:

std::string repeated_substring(std::string &str) {
    int len = str.length();

    int **c = new int*[len + 1];
    for (int i = 0; i <= len; ++i)
        c[i] = new int[len + 1];
    for (int i = 0; i <= len; ++i) {
        c[i][0] = 0;
        c[0][i] = 0;
    }

    int max_len = 0, index = len + 1;
    for (int i = 1; i <= len; ++i) {
        for (int j = 1; j <= len; ++j) {
            if (str[i-1] == str[j-1] && abs(i-j) > c[i-1][j-1]) {
                c[i][j] = c[i-1][j-1] + 1;
                if (c[i][j] > max_len) {
                    max_len = c[i][j];
                    index = std::min(i, j);
                }
            } else {
                c[i][j] = 0;
            }
        }
    }
    
    for (int i = 0; i <= len; ++i)
        delete[] c[i];
    delete[] c;
 
    if (max_len > 0) {
        std::string ret = str.substr(index - max_len, max_len);
        for (int i = 0; i < max_len; ++i)
            if(ret[i] != ' ')
                return ret;
    }

    return "NONE";
}