Find the most frequent subarray in an array

Latest recommended article published on 2022-03-16 15:51:41


Given an array of ints, find the most frequent non-empty subarray in it. If more than one subarray has that highest frequency, return the longest one(s).

Note: Two subarrays are equal if they contain identical elements in the same order.

For example: if input = {4,5,6,8,3,1,4,5,6,3,1}
Result: {4,5,6}
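
To make the problem statement concrete, here is a minimal brute-force sketch that counts every contiguous subarray directly. It is only a reference for small inputs, not the approach described in this post; the function name `most_frequent_subarrays_bruteforce` is a hypothetical helper introduced for illustration, and overlapping occurrences are assumed to count separately.

```python
from collections import Counter


def most_frequent_subarrays_bruteforce(arr):
    """Return (frequency, longest subarrays that reach that frequency).

    Hypothetical reference helper: it enumerates all O(n^2) subarrays,
    so it is only suitable for small inputs.
    """
    # Count every non-empty contiguous subarray (tuples are hashable).
    counts = Counter(
        tuple(arr[i:j])
        for i in range(len(arr))
        for j in range(i + 1, len(arr) + 1)
    )
    if not counts:
        return 0, []
    max_freq = max(counts.values())
    # Among the most frequent subarrays, keep only the longest ones.
    max_len = max(len(s) for s, c in counts.items() if c == max_freq)
    winners = sorted(
        s for s, c in counts.items() if c == max_freq and len(s) == max_len
    )
    return max_freq, [list(s) for s in winners]


print(most_frequent_subarrays_bruteforce([4, 5, 6, 8, 3, 1, 4, 5, 6, 3, 1]))
```

For the example input this reports a frequency of 2 and the single longest subarray [4, 5, 6].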

1. Build the suffix array and sort it. Use two variables: one to track the length of the longest repeated subarray and the other to track its frequency.

2. Traverse the sorted array to find the most frequent, longest repeated subarray and return it.

The suffix array here is effectively a 2D array: each of its elements is itself an array, namely one suffix of the input. For the given array {4,5,6,8,3,1,4,5,6,3,1} it looks like this:

{4,5,6,8,3,1,4,5,6,3,1}
{5,6,8,3,1,4,5,6,3,1}
{6,8,3,1,4,5,6,3,1}
{8,3,1,4,5,6,3,1}
{3,1,4,5,6,3,1}
{1,4,5,6,3,1}
{4,5,6,3,1}
{5,6,3,1}
{6,3,1}
{3,1}
{1}
After sorting the suffixes (here in descending lexicographic order), you get:
{8,3,1,4,5,6,3,1}
{6,8,3,1,4,5,6,3,1}
{6,3,1}
{5,6,8,3,1,4,5,6,3,1}
{5,6,3,1}
{4,5,6,8,3,1,4,5,6,3,1}
{4,5,6,3,1}
{3,1,4,5,6,3,1}
{3,1}
{1,4,5,6,3,1}
{1}
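
The two listings above can be reproduced with a short sketch. It assumes the suffixes are stored as full copies (quadratic memory, acceptable for small inputs) and relies on Python's built-in lexicographic comparison of lists; `sorted_suffixes` is a hypothetical name, not something from the original answer.

```python
def sorted_suffixes(arr, descending=True):
    """Build every suffix of arr and sort the suffixes lexicographically.

    Hypothetical helper: descending=True matches the order shown in the
    listing above. Python compares lists element by element, and a list
    that is a proper prefix of another sorts before it.
    """
    suffixes = [arr[i:] for i in range(len(arr))]
    suffixes.sort(reverse=descending)
    return suffixes


for suffix in sorted_suffixes([4, 5, 6, 8, 3, 1, 4, 5, 6, 3, 1]):
    print(suffix)
```
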
Matching subarrays are easy to find in a sorted suffix array by comparing prefixes. If you traverse the sorted array above and compare adjacent elements, you will see that the prefix [4,5,6] occurs the maximum number of times (2) and is also the longest such prefix. Other subarrays, such as [6], [5,6], [3,1] and [1], also occur 2 times, but they are shorter than [4,5,6], which is the required answer.
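
The traversal can be sketched as follows. This is one possible variation rather than the exact procedure from the answer: it assumes the highest frequency of any subarray equals the highest frequency of a single element (a subarray cannot occur more often than its first element), and then slides a window of that many consecutive sorted suffixes, since the common prefix of a sorted window is the common prefix of its first and last suffix. The names `common_prefix_len` and `most_frequent_longest_subarray` are hypothetical, and on ties only the first longest subarray found is returned.

```python
from collections import Counter


def common_prefix_len(a, b):
    """Length of the longest common prefix of two lists."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def most_frequent_longest_subarray(arr):
    """Return (frequency, longest subarray occurring that many times)."""
    if not arr:
        return 0, []
    # Assumption: the most frequent subarray is as frequent as the most
    # frequent single element, so that count is the target frequency.
    max_freq = max(Counter(arr).values())
    suffixes = sorted(arr[i:] for i in range(len(arr)))
    best = []
    # A subarray occurring max_freq times is a common prefix of max_freq
    # consecutive suffixes in sorted order.
    for i in range(len(suffixes) - max_freq + 1):
        first = suffixes[i]
        last = suffixes[i + max_freq - 1]
        k = common_prefix_len(first, last)
        if k > len(best):
            best = first[:k]
    return max_freq, best


print(most_frequent_longest_subarray([4, 5, 6, 8, 3, 1, 4, 5, 6, 3, 1]))
```

Sorting full suffix copies keeps the sketch short but costs quadratic memory; for large inputs, sorting start indices instead would be the natural refinement.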

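(My own approach follows:

I think it is better to track both the frequency and the subarray itself: current_subarray holds the common prefix currently being counted, max_subarray holds the longest common prefix with the highest frequency so far, and currfreq and maxfreq are their respective frequency counters.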

When two adjacent suffixes share a common prefix:

If the common prefix equals current_subarray, increment currfreq and leave current_subarray unchanged.

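If the common prefix differs from current_subarray, the count for the previous current_subarray is complete. Compare current_subarray with max_subarray and update max_subarray if needed (skip the comparison if current_subarray is empty; the awkward case is when the two frequencies are equal, where the lengths must be compared). Then set current_subarray to the common prefix of the two suffixes and set currfreq to 2. Since the count for the new current_subarray may not be finished yet, do not update maxfreq here; a later step does that. One caveat: a short prefix can be shared by a run of suffixes whose adjacent common prefixes have different lengths (for example lengths 3 then 1), and in that case this bookkeeping splits its occurrences across separate counts and undercounts it.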

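If the two adjacent suffixes share no common prefix:

The count for current_subarray is complete. Compare it with max_subarray and update max_subarray and its counter if needed. Then set current_subarray to empty.

Keep examining adjacent suffixes through the last pair. As a final step, compare current_subarray with max_subarray once more and update if needed.)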