Question

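Given a sequence such as S = {1,8,2,1,4,1,2,9,1,8,4}, I need to find the minimum-length subsequence that contains all elements of S (no duplicates, order does not matter). How can I find this subsequence efficiently?

Note: S has 5 distinct elements: {1,2,4,8,9}. The minimum-length subsequence must contain all 5 of them.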

Answer 1

Algorithm:

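First, determine the number of distinct elements in the array; this is easily done in linear time. Let there be k distinct elements.

Allocate an array cur of size 10^5, where cur[x] counts how many times the element x occurs in the current subsequence (see below).

Keep a variable cnt that holds the number of distinct elements currently in the subsequence under consideration. Now take two indices, begin and end, and traverse the array as follows: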

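1. Initialize cnt and begin to 0 and end to -1 (so that it becomes 0 after the first increment).
2. Then repeat the following for as long as possible:

If cnt != k:

2.1. Increment end. If end has reached the end of the array, stop. If cur[array[end]] is zero, increment cnt. Increment cur[array[end]].

Otherwise:

2.2. Try to advance the begin index: while cur[array[begin]] > 1, decrement cur[array[begin]] and increment begin (cur[array[begin]] > 1 means that the current subsequence contains another copy of this element). After that, compare the interval [begin, end] with the best answer so far and store it if it is better.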

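Once no further progress is possible, you have the answer. The complexity is O(n): each of the two indices passes through the array only once.

Implementation in C++: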

#include <cstring>   // memset
#include <iostream>

using namespace std;

const int MAXSIZE = 10000;

int arr[ MAXSIZE ];
int cur[ MAXSIZE ];

int main ()
{
   int n; // the size of array
   // read n and the array

   cin >> n;
   for( int i = 0; i < n; ++i )
      cin >> arr[ i ];

   int k = 0;
   for( int i = 0; i < n; ++i )
   {
      if( cur[ arr[ i ] ] == 0 )
         ++k;
      ++cur[ arr[ i ] ];
   }

   // now k is the number of distinct elements

   memset( cur, 0, sizeof( cur )); // we need this array anew
   int begin = 0, end = -1; // to make it 0 after first increment
   int best = -1; // best answer currently found
   int ansbegin = 0, ansend = -1; // interval of the best answer found so far (empty until one is found)
   int cnt = 0; // distinct elements in current subsequence

   while(1)
   {
      if( cnt < k )
      {
         ++end;
         if( end == n )
            break;
         if( cur[ arr[ end ]] == 0 )
            ++cnt; // this element wasn't present in the current subsequence
         ++cur[ arr[ end ]];
         continue;
      }
      // if we're here it means that [begin, end] interval contains all distinct elements
      // try to shrink it from behind
      while( cur[ arr[ begin ]] > 1 ) // we have another such element later in the subsequence
      {
         --cur[ arr[ begin ]];
         ++begin;
      }
      // now, compare [begin, end] with the best answer found yet
      if( best == -1 || end - begin < best )
      {
         best = end - begin;
         ansbegin = begin;
         ansend = end;
      }
      // now drop the element at begin so that cnt < k, then resume advancing end
      --cur[ arr[ begin]];
      ++begin;
      --cnt;
   }

   // output the [ansbegin, ansend] interval as it's the answer to the problem

   cout << ansbegin << ' ' << ansend << endl;
   for( int i = ansbegin; i <= ansend; ++i )
      cout << arr[ i ] << ' ';
   cout << endl;

   return 0;
}
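
For comparison, the same two-index loop can be sketched in Python with a hash map in place of the fixed-size cur array, so element values no longer need to be smaller than MAXSIZE. The function name min_window and the (begin, end) return value are illustrative choices, not part of the answer above.

```python
from collections import defaultdict

def min_window(arr):
    """Return (begin, end), inclusive, of a shortest window containing every
    distinct value of arr, or None if arr is empty."""
    k = len(set(arr))            # number of distinct elements
    cur = defaultdict(int)       # occurrences of each value inside the window
    begin, cnt = 0, 0            # left index, distinct values inside the window
    best = None
    for end, x in enumerate(arr):
        if cur[x] == 0:
            cnt += 1             # x was not present in the window yet
        cur[x] += 1
        if cnt < k:
            continue             # window does not cover everything yet
        # shrink from the left while the leftmost value also occurs later
        while cur[arr[begin]] > 1:
            cur[arr[begin]] -= 1
            begin += 1
        if best is None or end - begin < best[1] - best[0]:
            best = (begin, end)
        # drop the leftmost element so that the search for the next window starts
        cur[arr[begin]] -= 1
        begin += 1
        cnt -= 1
    return best

print(min_window([1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4]))  # (6, 10)
```

On the sequence from the question this reports the interval 6 to 10, that is, the elements 2 9 1 8 4.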

Answer 2

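This can be solved by dynamic programming.

At each step k, we compute the shortest subsequence that ends at the k-th position of S and satisfies the requirement of containing all the unique elements of S.

Given the solution for step k (hereafter "the sequence"), computing the solution for step k+1 is easy: append the (k+1)-th element of S to the sequence, then remove, one by one, all elements at the start of the sequence that occur more than once in the extended sequence.

The solution to the overall problem is the shortest sequence found in any of the steps.

The initialization of the algorithm consists of two stages:

1. Scan S once, building the alphabet of unique values.
2. Find the shortest valid sequence whose first element is the first element of S; the last position of this sequence is the initial value of k.

All of the above can be done in O(n logn) worst-case time (let me know if this needs clarification).

Here is a complete implementation of the above algorithm in Python: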

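import collections

S = [1,8,2,1,4,1,2,9,1,8,4,2,4]

# initialization: stage 1
alphabet = set(S)                         # the unique values ("symbols") in S
count = collections.defaultdict(int)      # how many times each symbol appears in the sequence

# initialization: stage 2
start = 0
for end in range(len(S)):
    count[S[end]] += 1
    if len(count) == len(alphabet):       # seen all the symbols yet?
        break
end += 1

# trim duplicates at the front so that the initial sequence is also minimal
while count[S[start]] > 1:
    count[S[start]] -= 1
    start += 1

best_start = start
best_end = end

# the induction
while end < len(S):
    count[S[end]] += 1
    while count[S[start]] > 1:
        count[S[start]] -= 1
        start += 1
    end += 1
    if end - start < best_end - best_start:   # new shortest sequence?
        best_start = start
        best_end = end

print(S[best_start:best_end])

Notes:

1. The data structures I am using (dictionaries and sets) are based on hash tables; they have good average-case performance but can degrade to O(n) per operation in the worst case. If the worst case is what you care about, replacing them with tree-based structures gives the overall O(n logn) I promised above.
2. As pointed out by @biziclop, the first scan of S can be eliminated, making the algorithm suitable for streaming data.
3. If the elements of S are small non-negative integers, as your comments indicate, then count can be flattened into an integer array, bringing the overall complexity down to O(n).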
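
The streaming variant mentioned in the notes could look roughly like the sketch below. It is one possible reading of that remark, not code from the answer: the function name shortest_cover_streaming is hypothetical, and the idea of discarding the best candidate whenever a previously unseen symbol arrives is an assumption about how the first scan would be removed.

```python
from collections import defaultdict, deque

def shortest_cover_streaming(stream):
    """Consume an iterable once and return a shortest window (as a list) that
    contains every symbol seen in the stream, or None for an empty stream."""
    count = defaultdict(int)   # occurrences of each symbol inside the window
    window = deque()           # current window, always trimmed on the left
    best = None
    for x in stream:
        is_new = x not in count          # counts never drop to 0, so this means "never seen"
        count[x] += 1
        window.append(x)
        # remove leading elements that occur again later in the window
        while count[window[0]] > 1:
            count[window[0]] -= 1
            window.popleft()
        # a new symbol invalidates every earlier candidate
        if is_new or best is None or len(window) < len(best):
            best = list(window)
    return best

print(shortest_cover_streaming(iter([1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4])))  # [2, 9, 1, 8, 4]
```

Copying the window each time a better candidate appears keeps the sketch short; storing only the start and end positions would avoid that cost when the input is also kept in memory.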

Answer 3

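Here is an algorithm that requires O(N) time and O(N) space. It is similar to the one by Grigor Gevorgyan. It also uses an auxiliary O(N) array of flags. The algorithm finds the longest subsequence of unique elements. If bestLength < numUnique, then there is no duplicate-free subsequence containing all the unique elements. The algorithm assumes that the elements are positive numbers and that the largest element is less than the length of the sequence.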

#include <iostream>
using namespace std;

bool findLongestSequence() {
    // Data (adapt as needed)
    const int N = 13;
    char flags[N];
    int a[] = {1,8,2,1,4,1,2,9,1,8,1,4,1};

    // Number of unique elements
    int numUnique = 0;
    for (int n = 0; n < N; ++n) flags[n] = 0; // clear flags
    for (int n = 0; n < N; ++n) {
        if (a[n] < 0 || a[n] >= N) return false; // assumptions violated 
        if (flags[a[n]] == 0) {
            ++numUnique;
            flags[a[n]] = 1;
        }
    }

    // Find the longest sequence ("best")
    for (int n = 0; n < N; ++n) flags[n] = 0; // clear flags
    int bestBegin = 0, bestLength = 0;
    int begin = 0, end = 0, currLength = 0;
    for (; begin < N; ++begin) {
        while (end < N) {
            if (flags[a[end]] == 0) {
                ++currLength;
                flags[a[end]] = 1;
                ++end;
            }
            else {
                break; // end-loop
            }
        }
        if (currLength > bestLength) {
            bestLength = currLength;
            bestBegin = begin;
        }
        if (bestLength >= numUnique) {
            break; // begin-loop
        }
        flags[a[begin]] = 0; // reset
        --currLength;
    }

    cout << "numUnique = " << numUnique << endl;
    cout << "bestBegin = " << bestBegin << endl;
    cout << "bestLength = " << bestLength << endl;
    return true; // longest subsequence found
}
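
A minimal Python sketch of the same idea follows, using a set instead of the flags array so that the assumption about element values is not needed. The function name longest_distinct_run is hypothetical. Note what it computes: the longest run without repeated elements, which covers all unique elements only when its length equals the number of distinct values.

```python
def longest_distinct_run(a):
    """Return (best_begin, best_length) of the longest contiguous run of a
    that contains no repeated element."""
    seen = set()                 # values inside the current run a[begin:end]
    best_begin, best_len = 0, 0
    end = 0
    for begin in range(len(a)):
        # extend the run to the right while the next element is not a repeat
        while end < len(a) and a[end] not in seen:
            seen.add(a[end])
            end += 1
        if end - begin > best_len:
            best_begin, best_len = begin, end - begin
        seen.discard(a[begin])   # the run for the next begin starts one step later
    return best_begin, best_len

a = [1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 1, 4, 1]
begin, length = longest_distinct_run(a)
print(begin, length, len(set(a)))  # 1 4 5
```

For the array used in the C++ code the longest duplicate-free run has length 4 while there are 5 distinct values, so no duplicate-free run covers them all; for the sequence from the question the run starting at index 6 has length 5 and does.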

Answer 4

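I have an O(N*M) algorithm, where N is the length of S and M is the number of distinct elements (it tends to work better for small values of M, i.e. if there are very few duplicates it may be a bad algorithm with quadratic cost). Edit: in fact, it seems to be much closer to O(N) in practice. You only get O(N*M) in worst-case scenarios.

First go through the sequence and record all the elements of S. Let's call this set E.

We will work with a dynamic subsequence of S. Create an empty map M that associates each element with the number of times it occurs in the subsequence.

For example, if subSequence = {1,8,2,1,4} and E = {1, 2, 4, 8, 9}: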

M[9]==0
M[2]==M[4]==M[8]==1
M[1]==2

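You will need two indices, each pointing to an element of S. One of them is called L because it marks the left end of the subsequence formed by the two indices. The other is called R because it marks the right end of the subsequence.

Begin by initializing L=0, R=0 and M[S[0]]++.

The algorithm is: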

Repeat
{
  While(M does not contain all the elements of E)
  {
    if(R is the last index of S)
      stop: the shortest subsequence recorded so far is the answer
    R++
    M[S[R]]++
  }
  While(M contains all the elements of E)
  {
    if(the subsequence S[L->R] is the shortest one seen so far)
      Record it
    M[S[L]]--
    L++
  }
}

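To check whether M contains all the elements of E, you can keep a vector of booleans V, with V[i]==true if M[E[i]]>0 and V[i]==false if M[E[i]]==0. So you start with all values of V set to false; each time you do M[S[R]]++ you set V for that element to true, and each time you do M[S[L]]-- and M[S[L]]==0 afterwards, you set V for that element to false.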
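
The description above can be sketched in Python as follows. This is a rough rendering under a few assumptions: the two While loops are repeated until R reaches the end of S, V is kept as a dictionary keyed by element rather than a vector indexed by position in E, and the function name shortest_cover_lr is made up for the example. Checking all of V on every step is what gives the O(N*M) worst case.

```python
from collections import defaultdict

def shortest_cover_lr(S):
    """Return (L, R), inclusive, of a shortest window of S containing every
    distinct element of S, or None if S is empty."""
    if not S:
        return None
    E = set(S)
    M = defaultdict(int)          # element -> occurrences inside S[L..R]
    V = {e: False for e in E}     # element -> "present in S[L..R]"
    L = R = 0
    M[S[0]] += 1
    V[S[0]] = True
    best = None
    while True:
        # grow on the right until every element of E is present
        while not all(V.values()):
            if R == len(S) - 1:
                return best       # nothing left to add
            R += 1
            M[S[R]] += 1
            V[S[R]] = True
        # shrink on the left while every element of E is still present
        while all(V.values()):
            if best is None or R - L < best[1] - best[0]:
                best = (L, R)
            M[S[L]] -= 1
            if M[S[L]] == 0:
                V[S[L]] = False
            L += 1

print(shortest_cover_lr([1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4]))  # (6, 10)
```

Replacing the all(V.values()) check with a counter of elements that are currently present would remove the factor M.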

Answer 5

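If you need to do this often for the same sequence and different sets, you can use inverted lists. You prepare the inverted lists for your sequence and then collect all the offsets. Then you scan the results from the inverted lists for sequences of m consecutive numbers.

With n the length of the sequence and m the size of the query, preparation takes O(n). The response time for a query is in O(m^2), if I have not miscalculated the merge step.

If you need more details, have a look at the 2004 paper by Clausen/Kurth on algebraic databases ("Content-Based Information Retrieval by Group Theoretical Methods"). It sketches a general database framework that can be adapted to your task.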
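
As a rough illustration of the inverted-list idea, the sketch below builds the lists once and then answers each query by merging the position lists of the queried values and sliding a window over the merged positions. The merge-and-slide query step is an assumption made for this example, not the procedure from the cited paper, and the names build_inverted_lists and smallest_window are hypothetical.

```python
import heapq
from collections import defaultdict

def build_inverted_lists(seq):
    """Map each value to the sorted list of positions where it occurs (O(n))."""
    index = defaultdict(list)
    for pos, value in enumerate(seq):
        index[value].append(pos)
    return index

def smallest_window(index, query):
    """Return (first, last) positions of a shortest window containing every
    value in query, or None if some value never occurs."""
    lists = []
    for v in set(query):
        if v not in index:
            return None
        lists.append([(p, v) for p in index[v]])
    merged = list(heapq.merge(*lists))   # all relevant positions in increasing order
    need = len(lists)
    count = defaultdict(int)             # value -> occurrences inside the window
    have, left, best = 0, 0, None
    for pos, v in merged:
        if count[v] == 0:
            have += 1
        count[v] += 1
        while have == need:              # shrink from the left while still complete
            lpos, lv = merged[left]
            if best is None or pos - lpos < best[1] - best[0]:
                best = (lpos, pos)
            count[lv] -= 1
            if count[lv] == 0:
                have -= 1
            left += 1
    return best

index = build_inverted_lists([1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4])
print(smallest_window(index, {1, 2, 4, 8, 9}))  # (6, 10)
print(smallest_window(index, {1, 4}))           # (3, 4)
```

The index is built once per sequence; only the positions of the queried values are touched per query, which is the point of preparing the lists in advance.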

Answer 6

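I would say:

1. Build the set of elements D.
2. Keep an array of the same size as your sequence S.
3. Fill the array with indices from S, each indicating the latest start of a sequence that contains all elements of D and ends at that index.
4. Find the minimum length of the sequences in the array and save the positions of its start and end.

Obviously, only item 3 is tricky. I would use a priority queue/heap that assigns a key to each element of D and stores the element as the value. Besides that, you need a data structure that can access elements in the heap by their value (a map with pointers to the elements). The key should always be the last position in S at which the element occurred.

So you go through S, and for each element you read you do one setKey in O(log n), then look at the current minimum in O(1) and write it into the array.

This should be O(n * log n). I hope I did not miss anything. It just came to my mind, so take it with a grain of salt, or let the community point out the mistakes I may have made.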
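
A possible Python sketch of this approach is shown below. Python's heapq has no setKey operation, so the sketch swaps in lazy deletion: every occurrence is pushed, and outdated entries are popped when they reach the top. That substitution, the function name shortest_cover_heap, and the decision to track only the best interval instead of filling the whole array are assumptions made for the example.

```python
import heapq

def shortest_cover_heap(seq):
    """Return (start, end), inclusive, of a shortest window of seq containing
    every distinct element, or None if seq is empty."""
    D = set(seq)
    last = {}        # element -> last position at which it occurred
    heap = []        # (position, element); may hold outdated entries
    best = None
    for end, x in enumerate(seq):
        last[x] = end
        heapq.heappush(heap, (end, x))
        if len(last) < len(D):
            continue                     # not every element of D seen yet
        # discard entries whose position is no longer the element's last one
        while last[heap[0][1]] != heap[0][0]:
            heapq.heappop(heap)
        start = heap[0][0]               # latest start of a covering window ending at end
        if best is None or end - start < best[1] - best[0]:
            best = (start, end)
    return best

print(shortest_cover_heap([1, 8, 2, 1, 4, 1, 2, 9, 1, 8, 4]))  # (6, 10)
```

Each position is pushed and popped at most once, so the total cost stays within O(n * log n).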

Answer 7

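The solution above is correct; here is a Java version of the code above.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MinSequence {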

    public static void main(String[] args)
    {
        // sample input; replace with real data as needed
        final List<Integer> arr=new ArrayList<Integer>(4);
        Map<Integer, Integer> cur = new TreeMap<Integer, Integer>();
        arr.add(1);
        arr.add(2);
        arr.add(1);
        arr.add(3);
        int distinctcount=0;
        for (final Integer integer : arr)
        {
            if(cur.get(integer)==null)
            {
                cur.put(integer, 1);
                ++distinctcount;
            }else
            {
                cur.put(integer,cur.get(integer)+1);
            }
        }

        // now distinctcount is the number of distinct elements
        cur=new TreeMap<Integer,Integer>(); // start again with an empty map of counts
        int begin = 0, end = -1; // to make it 0 after first increment
        int best = -1; // best answer currently found
        int ansbegin = 0, ansend = 0; // interval of the best answer currently found
        int cnt = 0; // distinct elements in current subsequence
        final int inpsize = arr.size();
        while(true)
        {
            if( cnt < distinctcount )
            {
                ++end;
                if (end == inpsize) {
                    break;
                }
                if( cur.get(arr.get(end)) == null ) {
                    ++cnt;
                    cur.put(arr.get(end), 1);
                } // this element wasn't present in the current subsequence
                else
                {
                    cur.put(arr.get(end),cur.get(arr.get(end))+1);
                }
                continue;
            }
            // if we're here it means that [begin, end] interval contains all distinct elements
            // try to shrink it from behind
            while (cur.get(arr.get(begin)) != null && cur.get(arr.get(begin)) > 1) // we have another such element later in the subsequence
            {
                cur.put(arr.get(begin),cur.get(arr.get(begin))-1);
                ++begin;
            }
            // now, compare [begin, end] with the best answer found yet
            if( best == -1 || end - begin < best )
            {
                best = end - begin;
                ansbegin = begin;
                ansend = end;
            }
            // now drop the element at begin so that cnt < distinctcount, then resume advancing end
            // (its count is exactly 1 here; removing the key keeps the null checks above correct)
            cur.remove(arr.get(begin));
            ++begin;
            --cnt;
        }

        // output the [ansbegin, ansend] interval as it's the answer to the problem
        System.out.println(ansbegin+"--->"+ansend);
        for( int i = ansbegin; i <= ansend; ++i ) {
            System.out.println(arr.get(i));
        }
    }
}
