生成给定字符串的所有唯一子字符串

时间:2010-04-01 12:28:07

标签: algorithm language-agnostic

给定一个字符串s,生成一组所有唯一子串的最快方法是什么?

示例:对于str = "aba",我们会得到substrs={"a", "b", "ab", "ba", "aba"}

朴素算法将遍历整个字符串,在每次迭代中生成长度为1..n的子字符串,产生O(n^2)上限。

可能有更好的约束吗?

(这是技术上的功课,所以也欢迎指针)

14 个答案:

答案 0 :(得分:41)

正如其他海报所说,给定字符串可能存在O(n ^ 2)子串,因此打印它们不能比这更快。然而,存在可以在线性时间内构建的集合的有效表示:the suffix tree

答案 1 :(得分:13)

没有办法比O(n 2 )更快地执行此操作,因为字符串中总共有O(n 2 )子串,所以如果你必须生成它们全部,在最坏的情况下它们的数量将是n(n + 1) / 2,因此上限 O的下限(n 2 Ω(n 2 )。

答案 2 :(得分:6)

第一个是蛮力,其复杂度为O(N ^ 3),可以降低到O(N ^ 2 log(N))  第二个使用具有复杂度O(N ^ 2)的HashSet  第三个使用LCP,最初找到给定字符串的所有后缀,其中最坏情况为O(N ^ 2),最佳情况为O(N Log(N))。

第一个解决方案: -

import java.util.Scanner;

public class DistinctSubString {
  public static void main(String[] args) {
    Scanner in = new Scanner(System.in);
    System.out.print("Enter The string");
    String s = in.nextLine();
    long startTime = System.currentTimeMillis();
    int L = s.length();
    int N = L * (L + 1) / 2;
    String[] Comb = new String[N];
    for (int i = 0, p = 0; i < L; ++i) {
      for (int j = 0; j < (L - i); ++j) {
        Comb[p++] = s.substring(j, i + j + 1);
      }
    }
    /*
     * for(int j=0;j<N;++j) { System.out.println(Comb[j]); }
     */
    boolean[] val = new boolean[N];
    for (int i = 0; i < N; ++i)
      val[i] = true;
    int counter = N;
    int p = 0, start = 0;
    for (int i = 0, j; i < L; ++i) {
      p = L - i;
      for (j = start; j < (start + p); ++j) {
        if (val[j]) {
          //System.out.println(Comb[j]);
          for (int k = j + 1; k < start + p; ++k) {
            if (Comb[j].equals(Comb[k])) {
              counter--;
              val[k] = false;
            }
          }
        }

      }

      start = j;
    }
    System.out.println("Substrings are " + N
        + " of which unique substrings are " + counter);
    long endTime = System.currentTimeMillis();
    System.out.println("It took " + (endTime - startTime) + " milliseconds");
  }
}

第二种解决方案: -

import java.util.*;

public class DistictSubstrings_usingHashTable {

  public static void main(String args[]) {
    // create a hash set

    Scanner in = new Scanner(System.in);
    System.out.print("Enter The string");
    String s = in.nextLine();
    int L = s.length();
    long startTime = System.currentTimeMillis();
    Set<String> hs = new HashSet<String>();
    // add elements to the hash set
    for (int i = 0; i < L; ++i) {
      for (int j = 0; j < (L - i); ++j) {
        hs.add(s.substring(j, i + j + 1));
      }
    }
    System.out.println(hs.size());
    long endTime = System.currentTimeMillis();
    System.out.println("It took " + (endTime - startTime) + " milliseconds");
  }
}

第三种解决方案: -

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;

public class LCPsolnFroDistinctSubString {

  public static void main(String[] args) throws IOException {

    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    System.out.println("Enter Desired String ");
    String string = br.readLine();
    int length = string.length();
    String[] arrayString = new String[length];
    for (int i = 0; i < length; ++i) {
      arrayString[i] = string.substring(length - 1 - i, length);
    }

    Arrays.sort(arrayString);
    for (int i = 0; i < length; ++i)
      System.out.println(arrayString[i]);

    long num_substring = arrayString[0].length();

    for (int i = 0; i < length - 1; ++i) {
      int j = 0;
      for (; j < arrayString[i].length(); ++j) {
        if (!((arrayString[i].substring(0, j + 1)).equals((arrayString)[i + 1]
            .substring(0, j + 1)))) {
          break;
        }
      }
      num_substring += arrayString[i + 1].length() - j;
    }
    System.out.println("unique substrings = " + num_substring);
  }

}

第四种解决方案: -

  public static void printAllCombinations(String soFar, String rest) {
    if(rest.isEmpty()) {
        System.out.println(soFar);
    } else {
        printAllCombinations(soFar + rest.substring(0,1), rest.substring(1));
        printAllCombinations(soFar , rest.substring(1));
    }
}

Test case:-  printAllCombinations("", "abcd");

答案 3 :(得分:3)

对于大哦......你能做到的最好是O(n ^ 2)

不需要重新发明轮子,它不是基于弦乐,而是基于一组,所以你必须采取概念并将它们应用到你自己的情况。

算法

Really Good White Paper from MS

In depth PowerPoint

Blog on string perms

答案 4 :(得分:2)

好吧,因为有可能n*(n+1)/2个不同的子串(空子串+1),我怀疑你可能比O(n*2)更好(最糟糕的情况)。最简单的方法是生成它们并使用一些漂亮的O(1)查找表(例如hashmap)来找到它们时的重复项。

答案 5 :(得分:1)

class program
{

        List<String> lst = new List<String>();
        String str = "abc";
        public void func()
        {

            subset(0, "");
            lst.Sort();
            lst = lst.Distinct().ToList();

            foreach (String item in lst)
            {
                Console.WriteLine(item);
            }
        }
        void subset(int n, String s)
        {
            for (int i = n; i < str.Length; i++)
            {
                lst.Add(s + str[i].ToString());
                subset(i + 1, s + str[i].ToString());
            }
        }
}

答案 6 :(得分:1)

WebBrowser t = c as WebBrowser;
if(t == null)
    continue;

答案 7 :(得分:0)

只能在o(n ^ 2)时间内完成,因为字符串的唯一子串的总数将为n(n + 1)/ 2.

示例:

string s =“abcd”

传递0 :(所有字符串长度为1)

a,b,c,d = 4个字符串

传递1 :(所有字符串的长度均为2)

ab,bc,cd = 3字符串

传递2 :(所有字符串长度为3)

abc,bcd = 2个字符串

传递3 :(所有字符串长度为4)

abcd = 1个字符串

使用这个类比,我们可以编写具有o(n ^ 2)时间复杂度和恒定空间复杂度的解决方案。

源代码如下:

#include<stdio.h>

void print(char arr[], int start, int end)
{
    int i;
    for(i=start;i<=end;i++)
    {
        printf("%c",arr[i]);
    }
    printf("\n");
}


void substrings(char arr[], int n)
{
    int pass,j,start,end;
    int no_of_strings = n-1;

    for(pass=0;pass<n;pass++)
    {
        start = 0;
        end = start+pass;
        for(j=no_of_strings;j>=0;j--)
        {
            print(arr,start, end);
            start++;
            end = start+pass;
        }
        no_of_strings--;
    }

}

int main()
{   
    char str[] = "abcd";
    substrings(str,4);
    return 0;
}

答案 8 :(得分:0)

这是我在Python中的代码。它生成任何给定字符串的所有可能的子串。

def find_substring(str_in):
    substrs = []
    if len(str_in) <= 1:
        return [str_in]

    s1 = find_substring(str_in[:1])
    s2 = find_substring(str_in[1:])

    substrs.append(s1)
    substrs.append(s2)
    for s11 in s1:
        substrs.append(s11)            
        for s21 in s2:            
            substrs.append("%s%s" %(s11, s21))

    for s21 in s2:
        substrs.append(s21)

    return set(substrs)

如果将str_ =“abcdef”传递给该函数,则会生成以下结果:

a,ab,abc,abcd,abcde,abcdef,abcdf,abce,abcef,abcf,abd,abde,abdef,abdf,abe,abef,abf,ac,acd,acde,acdef,acdf,ace,acef ,acf,ad,ade,adef,adf,ae,aef,af,b,bc,bcd,bcde,bcdef,bcdf,bce,bcef,bcf,bd,bde,bdef,bdf,be,bef,bf,c ,cd,cde,cdef,cdf,ce,cef,cf,d,de,def,df,e,ef,f

答案 9 :(得分:0)

朴素算法需要O(n ^ 3)时间而不是O(n ^ 2)时间。 存在O(n ^ 2)个子串。 如果你放置O(n ^ 2)个子串,例如set, 然后设置比较每个字符串的O(lgn)比较,以检查集合中是否存在alrady。 此外,字符串比较需要O(n)时间。 因此,如果使用set,则需要O(n ^ 3 lgn)时间。如果使用哈希表而不是设置,则可以减少O(n ^ 3)时间。

重点是字符串比较而非数字比较。

因此,如果使用后缀数组和最长公共前缀(LCP)算法,最好的算法之一就是说,它会减少此问题的O(n ^ 2)时间。 使用O(n)时间算法构建后缀数组。 LCP的时间= O(n)时间。 由于对于后缀数组中的每对字符串,请执行LCP,因此总时间为O(n ^ 2)时间,以查找不同子字符串的长度

此外,如果您想打印所有不同的子串,则需要O(n ^ 2)时间。

答案 10 :(得分:0)

这会打印唯一的子串。 https://ideone.com/QVWOh0

def uniq_substring(test):
    lista=[]
    [lista.append(test[i:i+k+1]) for i in range(len(test)) for k in
    range(len(test)-i) if test[i:i+k+1] not in lista and 
    test[i:i+k+1][::-1] not in lista]
    print lista

uniq_substring('rohit')
uniq_substring('abab')

['r', 'ro', 'roh', 'rohi', 'rohit', 'o', 'oh', 'ohi', 'ohit', 'h',   
'hi', 'hit', 'i', 'it', 't']
['a', 'ab', 'aba', 'abab', 'b', 'bab']

答案 11 :(得分:0)

使用后缀数组和最长公共前缀尝试此代码。它还可以为您提供唯一子串的总数。代码可能会在visual studio中产生堆栈溢出,但在Eclipse C ++中运行良好。那是因为它返回了函数的向量。 Haven没有对极长的琴弦进行测试。将这样做并报告回来。

  // C++ program for building LCP array for given text
#include <bits/stdc++.h>
#include <vector>
#include <string>
using namespace std;

#define MAX 100000
int cum[MAX];


// Structure to store information of a suffix
struct suffix
{
    int index; // To store original index
    int rank[2]; // To store ranks and next rank pair
};

// A comparison function used by sort() to compare two suffixes
// Compares two pairs, returns 1 if first pair is smaller
int cmp(struct suffix a, struct suffix b)
{
    return (a.rank[0] == b.rank[0])? (a.rank[1] < b.rank[1] ?1: 0):
        (a.rank[0] < b.rank[0] ?1: 0);
}

// This is the main function that takes a string 'txt' of size n as an
// argument, builds and return the suffix array for the given string
vector<int> buildSuffixArray(string txt, int n)
{
    // A structure to store suffixes and their indexes
    struct suffix suffixes[n];

    // Store suffixes and their indexes in an array of structures.
    // The structure is needed to sort the suffixes alphabatically
    // and maintain their old indexes while sorting
    for (int i = 0; i < n; i++)
    {
        suffixes[i].index = i;
        suffixes[i].rank[0] = txt[i] - 'a';
        suffixes[i].rank[1] = ((i+1) < n)? (txt[i + 1] - 'a'): -1;
    }

    // Sort the suffixes using the comparison function
    // defined above.
    sort(suffixes, suffixes+n, cmp);

    // At his point, all suffixes are sorted according to first
    // 2 characters. Let us sort suffixes according to first 4
    // characters, then first 8 and so on
    int ind[n]; // This array is needed to get the index in suffixes[]
    // from original index. This mapping is needed to get
    // next suffix.
    for (int k = 4; k < 2*n; k = k*2)
    {
        // Assigning rank and index values to first suffix
        int rank = 0;
        int prev_rank = suffixes[0].rank[0];
        suffixes[0].rank[0] = rank;
        ind[suffixes[0].index] = 0;

        // Assigning rank to suffixes
        for (int i = 1; i < n; i++)
        {
            // If first rank and next ranks are same as that of previous
            // suffix in array, assign the same new rank to this suffix
            if (suffixes[i].rank[0] == prev_rank &&
                    suffixes[i].rank[1] == suffixes[i-1].rank[1])
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = rank;
            }
            else // Otherwise increment rank and assign
            {
                prev_rank = suffixes[i].rank[0];
                suffixes[i].rank[0] = ++rank;
            }
            ind[suffixes[i].index] = i;
        }

        // Assign next rank to every suffix
        for (int i = 0; i < n; i++)
        {
            int nextindex = suffixes[i].index + k/2;
            suffixes[i].rank[1] = (nextindex < n)?
                                suffixes[ind[nextindex]].rank[0]: -1;
        }

        // Sort the suffixes according to first k characters
        sort(suffixes, suffixes+n, cmp);
    }

    // Store indexes of all sorted suffixes in the suffix array
    vector<int>suffixArr;
    for (int i = 0; i < n; i++)
        suffixArr.push_back(suffixes[i].index);

    // Return the suffix array
    return suffixArr;
}

/* To construct and return LCP */
vector<int> kasai(string txt, vector<int> suffixArr)
{
    int n = suffixArr.size();

    // To store LCP array
    vector<int> lcp(n, 0);

    // An auxiliary array to store inverse of suffix array
    // elements. For example if suffixArr[0] is 5, the
    // invSuff[5] would store 0. This is used to get next
    // suffix string from suffix array.
    vector<int> invSuff(n, 0);

    // Fill values in invSuff[]
    for (int i=0; i < n; i++)
        invSuff[suffixArr[i]] = i;

    // Initialize length of previous LCP
    int k = 0;

    // Process all suffixes one by one starting from
    // first suffix in txt[]
    for (int i=0; i<n; i++)
    {
        /* If the current suffix is at n-1, then we don’t
        have next substring to consider. So lcp is not
        defined for this substring, we put zero. */
        if (invSuff[i] == n-1)
        {
            k = 0;
            continue;
        }

        /* j contains index of the next substring to
        be considered to compare with the present
        substring, i.e., next string in suffix array */
        int j = suffixArr[invSuff[i]+1];

        // Directly start matching from k'th index as
        // at-least k-1 characters will match
        while (i+k<n && j+k<n && txt[i+k]==txt[j+k])
            k++;

        lcp[invSuff[i]] = k; // lcp for the present suffix.

        // Deleting the starting character from the string.
        if (k>0)
            k--;
    }

    // return the constructed lcp array
    return lcp;
}

// Utility function to print an array
void printArr(vector<int>arr, int n)
{
    for (int i = 0; i < n; i++)
        cout << arr[i] << " ";
    cout << endl;
}

// Driver program
int main()
{
    int t;
    cin >> t;
    //t = 1;

    while (t > 0)  {

        //string str = "banana";
        string str;

        cin >> str; // >> k;

        vector<int>suffixArr = buildSuffixArray(str, str.length());
        int n = suffixArr.size();

        cout << "Suffix Array : \n";
        printArr(suffixArr, n);

        vector<int>lcp = kasai(str, suffixArr);

        cout << "\nLCP Array : \n";
        printArr(lcp, n);

        // cum will hold number of substrings if that'a what you want (total = cum[n-1]
        cum[0] = n - suffixArr[0];

    //  vector <pair<int,int>> substrs[n];

        int count = 1;


        for (int i = 1; i <= n-suffixArr[0]; i++)  {
            //substrs[0].push_back({suffixArr[0],i});
            string sub_str = str.substr(suffixArr[0],i);
            cout << count << "  "  <<  sub_str << endl;
            count++;
        }

        for(int i = 1;i < n;i++)  {

            cum[i] = cum[i-1] + (n - suffixArr[i] - lcp[i - 1]);

            int end = n - suffixArr[i];
            int begin = lcp[i-1] + 1;
            int begin_suffix = suffixArr[i];

            for (int j = begin, k = 1; j <= end; j++, k++)  {

                //substrs[i].push_back({begin_suffix, lcp[i-1] + k});
        //      cout << "i push "  << i  << "   " << begin_suffix << " " << k << endl;
                string sub_str = str.substr(begin_suffix, lcp[i-1] +k);
                cout << count << "  "  <<  sub_str << endl;
                count++;

            }

        }

        /*int count = 1;

        cout << endl;

        for(int i = 0; i < n; i++){

            for (auto it = substrs[i].begin(); it != substrs[i].end(); ++it )    {

                string sub_str = str.substr(it->first, it->second);
                cout << count << "  "  <<  sub_str << endl;
                count++;
            }

        }*/


        t--;

    }

    return 0;
}

这是一个更简单的算法:

#include <iostream>
#include <string.h>
#include <vector>
#include <string>
#include <algorithm>
#include <time.h>

using namespace std;


char txt[100000], *p[100000];
int m, n;

int cmp(const void *p, const void *q) {
    int rc = memcmp(*(char **)p, *(char **)q, m);
    return rc;
}
int main() {
    std::cin >> txt;
    int start_s = clock();


    n = strlen(txt);
    int k; int i;
    int count = 1;
    for (m = 1; m <= n; m++) {
        for (k = 0; k+m <= n; k++)
            p[k] = txt+k;
        qsort(p, k, sizeof(p[0]), &cmp);
        for (i = 0; i < k; i++) {
            if (i != 0 && cmp(&p[i-1], &p[i]) == 0){
                continue;
            }
            char cur_txt[100000];
            memcpy(cur_txt, p[i],m);
            cur_txt[m] = '\0';
            std::cout << count << "  "  << cur_txt << std::endl;
            count++;
        }
    }

    cout << --count  << endl;

    int stop_s = clock();
    float run_time = (stop_s - start_s) / double(CLOCKS_PER_SEC);
    cout << endl << "distinct substrings  \t\tExecution time = " << run_time << " seconds" << endl;
    return 0;
}

虽然这两种算法对于极长的字符串列出的速度太慢了。我用长度超过47,000的字符串测试算法,算法花了20多分钟才完成,第一个算法耗时1200秒,第二个算法占用1360秒,而且只计算独特的子串而不输出到终点站。因此,对于长度可达1000的字符串,您可能会得到一个可行的解决方案。两种解决方案都确实计算了相同的子串总数。我确实测试了两种算法的字符串长度为2000和10,000。时间是第一个算法:0.33秒和12秒;对于第二种算法,它是0.535秒和20秒。所以看起来通常第一种算法更快。

答案 12 :(得分:0)

许多答案包括2个for循环和一个.substring()调用,它们声明O(N ^ 2)时间复杂度。但是,必须注意,在Java中(.Java 7中的更新6),对.substring()调用的最坏情况是O(N)。因此,通过在代码中添加.substring()调用,N的顺序增加了一个。

因此,2个for循环和这些循环中的.substring()调用等于O(N ^ 3)时间复杂度。

答案 13 :(得分:-2)

你的程序没有给出独特的sbstrins。

请使用输入abab进行测试,输出应为aba,ba,bab,abab