Longest Repeated Substring

http://nriverwang.blogspot.hk/


This problem is from codeeval.com.

Problem Description:

You are to find the longest repeated substring in a given text. Repeated substrings may not overlap. If more than one substring is repeated with the same length, print the first one you find.(starting from the beginning of the text). NOTE: The substrings can't be all spaces.

Input Sample:
Your program should accept as its first argument a path to a filename. The input file contains several lines. Each line is one test case. Each line contains a test string. eg.
banana
abc
Output Sample:
For each set of input produce a single line of output which is the longest repeated substring. If there is none, print out the string NONE. eg.
an
NONE

Analysis:

The input text is actually a string and suppose the length of string is  n .

Brute Force:
The brute force method is easy: for each index pair [ i, j ] ( 1 <= i, j <= n ) of the string, count the length of non-overlapping repeated substring starting from this two indices . And then get the first repeated substring with the longest length. Although the non-overlapping constraint can help to reduce the character comparison times while checking two substrings, the running time in the worst case is still  O(n^3)  (can be proved, but omitted here). Obviously the brute force method is not a good choice.

Suffix Array:
Another method is to use suffix array. For the input string  str , we first sort all of its suffixes in the suffix array. eg.
str = banana, its suffixes are:
banana
anana
nana
ana
na
a
after sort, the suffix array looks like:
a
ana
anana
banana
na
nana
Then for each two adjacent suffixes on the sorted suffix array, count the length of the common prefix (Note: should avoid counting the overlapped part). The longest repeated prefix appearing first on the original input string is the result. e.g.
a
ana
anana
banana
na
nana
The red prefixes an and green prefixes na both have the same longest length 2, but an is the result because it appears before na in str = banana. 
Similar to brute force method, the suffixes comparison in sorting and repeated prefix length counting make the running time be  O(n^2lgn)  (suppose an  O(nlgn)  sorting algorithm is used).  The length of suffix array is  n  and it may just need to store the indices pointing to the suffixes in the input string rather than the suffixes themselves. Therefore the space complexity is  O(n) .

Dynamic Programming:
O(n^2lgn)  looks good, however, we can solve this problem in  O(n^2)  time and  O(n^2)  space with DP. Suppose we first don't consider the constraint of no overlap. The idea is to find the longest repeated suffix for all pairs of prefixes of the string  str . The length of the longest repeated suffix for prefixes  str[1..i]  and  str[1..j]  is:
c(i, j) = 1) c(i-1, j-1) + 1   if str[i] == str[j];
          2) 0                 otherwise.
The maximum of  c(i, j)  ( 1 <= i, j <= n ) is the length of longest repeated substring (suffix) of  str . The longest repeated substring should be stored and dynamically updated during the calculating of  c(i, j) .

Now consider the non-overlapping constraint, otherwise the above process will give us the string itself. Look at prefixes  str[1..i]  and  str[1..j] , the end indices of their suffixes are  i  and  (for simplicity, suppose  i <= j ), respectively. If their common suffix is longer than  j-i , the two suffixes must overlap. The overlapped part should not be counted and the above recursion should be modified as:
c(i, j) = 1) c(i-1, j-1) + 1   if str[i] == str[j] && abs(i-j) > c(i-1, j-1);
          2) 0                 otherwise.
If more than one repeated substring have the same longest length, we just need to take the first one in the input string. More details can be found in the code.

Code (C++)

Bellow is the code of DP method: 
std::string repeated_substring(std::string &str) {
    int len = str.length();

    int **c = new int*[len + 1];
    for (int i = 0; i <= len; ++i)
        c[i] = new int[len + 1];
    for (int i = 0; i <= len; ++i) {
        c[i][0] = 0;
        c[0][i] = 0;
    }

    int max_len = 0, index = len + 1;
    for (int i = 1; i <= len; ++i) {
        for (int j = 1; j <= len; ++j) {
            if (str[i-1] == str[j-1] && abs(i-j) > c[i-1][j-1]) {
                c[i][j] = c[i-1][j-1] + 1;
                if (c[i][j] > max_len) {
                    max_len = c[i][j];
                    index = std::min(i, j);
                }
            } else {
                c[i][j] = 0;
            }
        }
    }
    
    for (int i = 0; i <= len; ++i)
        delete[] c[i];
    delete[] c;
 
    if (max_len > 0) {
        std::string ret = str.substr(index - max_len, max_len);
        for (int i = 0; i < max_len; ++i)
            if(ret[i] != ' ')
                return ret;
    }

    return "NONE";
}


评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值