Dissociated press algorithm using n-grams

Time: 2011-11-24 17:02:32

Tags: algorithm

I am working on this algorithm and am stuck on how it actually works.

The dissociated press algorithm: http://en.wikipedia.org/wiki/Dissociated_press

N-gram: http://en.wikipedia.org/wiki/N-gram

It builds random text out of runs of consecutive strings, so it should be possible to implement.

  

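The dissociated press algorithm starts by printing a random n-gram. Then it takes the last n-1 words it printed and chooses a random n-gram that starts with these n-1 words. It prints the last word of this n-gram, and repeats. So every n consecutive words of the output text are an n-gram of the original text. Sometimes it may happen that the original text contains no n-gram that starts with the n-1 words last printed. In this case, the algorithm stops.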

Actually, I don't understand how it terminates.

ngram(1,2)ngram(2,3)ngram(3,4)........ T T

Could someone give me an example? I can't follow the description.

2 answers:

Answer 0: (score: 1)

Well, first you split the text into n-grams:

  

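The dissociated press algorithm starts by printing a random n-gram.

becomes (for n = 4)

  • The dissociated press algorithm
  • dissociated press algorithm starts
  • press algorithm starts by
  • algorithm starts by printing
  • starts by printing a
  • by printing a random
  • printing a random n-gram

and so on. Then you start with any n-gram you like and keep adding words: each new word must complete the last n-1 words of the text built so far into a known n-gram. The text you create this way looks almost readable, and the larger n is, the more readable it becomes.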
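The splitting and generation steps described above can be sketched in Java roughly as follows. This is only a minimal illustration, not the answerer's code: the class name `DissociatedPressSketch`, the methods `ngrams` and `generate`, and the `maxWords` cap are all assumptions introduced here, and words are split on single spaces for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper class; all names are illustrative assumptions.
class DissociatedPressSketch {

    // Every run of n consecutive words, in order of appearance (assumes n >= 1).
    static List<List<String>> ngrams(List<String> words, int n) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++)
            result.add(new ArrayList<>(words.subList(i, i + n)));
        return result;
    }

    // Stops once the output reaches maxWords words (the first n-gram is always
    // printed whole), or earlier when no n-gram continues the text.
    static List<String> generate(List<String> words, int n, int maxWords, Random rng) {
        List<List<String>> grams = ngrams(words, n);
        List<String> out = new ArrayList<>();
        if (grams.isEmpty())
            return out;

        // Step 1: print a random n-gram.
        out.addAll(grams.get(rng.nextInt(grams.size())));

        while (out.size() < maxWords) {
            // Step 2: take the last n-1 words printed so far.
            List<String> tail = new ArrayList<>(out.subList(out.size() - (n - 1), out.size()));

            // Step 3: collect every n-gram that starts with those n-1 words.
            List<List<String>> candidates = new ArrayList<>();
            for (List<String> g : grams)
                if (g.subList(0, n - 1).equals(tail))
                    candidates.add(g);

            // Termination: the original text has no matching n-gram.
            if (candidates.isEmpty())
                break;

            // Step 4: print the last word of a randomly chosen candidate.
            List<String> pick = candidates.get(rng.nextInt(candidates.size()));
            out.add(pick.get(n - 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "The dissociated press algorithm starts by printing a random n-gram.".split(" "));
        ngrams(words, 4).forEach(g -> System.out.println(String.join(" ", g)));
        System.out.println(String.join(" ", generate(words, 4, 50, new Random())));
    }
}
```

Because every word in this sample sentence is distinct, each run simply prints the sentence from a random starting 4-gram up to its final word and then stops: no 4-gram begins with the last three words, which is exactly the termination case. On a longer text with repeated words, step 3 usually finds several candidates and the output starts to wander.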

Answer 1: (score: 0)

This is not a very complicated algorithm. The following version works quite well:

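import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Random;
import java.util.StringTokenizer;

public class Dissociator {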

// Required size of the overlap
int overlapSize = 8;

// Size of the fragment
int fragmentSize = overlapSize;

// The initial sequence to dissociate, characters or words (could also dissociate some other objects).
ArrayList<String> initial;

boolean space;
boolean wordMode;

Random r = new Random(System.currentTimeMillis());

// Dissociate the given string. 
public String dissociate(String in) {

    ArrayList<String> a;
    if (wordMode)
        a = wordBased(in);
    else
        a = charBased(in);

    ArrayList<String> out = dissociate(a);

    StringBuilder b = new StringBuilder(out.size());
    for (String s : out) {
        b.append(s);
        if (wordMode)
            b.append(' ');
    }

    return b.toString();
}

/**
 * Run dissociation algorithm
 * 
 * @param input the initial sequence
 * @return the dissociated sequence.
 */
public ArrayList<String> dissociate(ArrayList<String> input) {

    initial = input;
    ArrayList<String> out = new ArrayList<String>();

    while (out.size() < input.size()) {
        // Pick a random overlap length between 1 and overlapSize - 1.
        int size = Math.max(1, r.nextInt(overlapSize));
        ArrayList<String> tail = getTailOf(out, size);

        // Start scanning at a random position and wrap around the input once.
        int start = r.nextInt(input.size());
        int p = start;
        boolean ok = tail.isEmpty(); // Nothing to match on the first pass.

        if (!ok)
            for (int k = 0; k < input.size() && !ok; k++) {
                p = (start + k) % input.size();
                // The tail must match at p and leave at least one element to append.
                ok = p + tail.size() < input.size()
                        && input.subList(p, p + tail.size()).equals(tail);
            }

        // No acceptable continuation exists anywhere in the input: stop.
        if (!ok)
            break;

        // Append the fragment that follows the matched tail.
        int from = p + tail.size();
        for (int j = from; j < Math.min(from + fragmentSize, input.size()); j++)
            out.add(input.get(j));
    }
    return out;
}

//  Get the tail of the given size.
private ArrayList<String> getTailOf(ArrayList<String> out, int size) {
    if (size >= out.size())
        return out;
    else {
        ArrayList<String> r = new ArrayList<String>(size);
        for (int p = out.size() - size; p < out.size(); p++) {
            r.add(out.get(p));
        }
        return r;
    }
}

private static ArrayList<String> charBased(String in) {
    ArrayList<String> is = new ArrayList<String>();
    for (int i = 0; i < in.length(); i++)
        is.add(in.substring(i, i + 1));
    return is;
}

private static ArrayList<String> wordBased(String in) {
    ArrayList<String> is = new ArrayList<String>();
    StringTokenizer st = new StringTokenizer(in, " ,:()?!\"'");
    while (st.hasMoreTokens())
        is.add(st.nextToken());
    return is;
}

public static void main(String[] args) throws Exception {
    // Read the whole input file, joining its lines with single spaces.
    File f = new File(args[0]);
    StringBuilder bb = new StringBuilder((int) f.length());
    try (BufferedReader r = new BufferedReader(new FileReader(f))) {
        String sr;
        while ((sr = r.readLine()) != null) {
            bb.append(sr);
            bb.append(' ');
        }
    }

    Dissociator d = new Dissociator();
    System.out.println(d.dissociate(bb.toString()));
}
}