What pattern in a grep command will match each character at most once?

Time: 2015-07-28 22:44:15

Tags: regex grep

I want to find words in a document that use only the letters in a given pattern, where each of those letters can appear at most once. Suppose document.txt consists of "abcd abbcd". What pattern (and what concepts are involved in writing such a pattern) will return "abcd" and not "abbcd"?

4 answers:

Answer 0 (score: 1)

You could check whether a character appears more than once and then negate the result (in your source code):

  • split your document into words
  • check each word with ([a-z])[a-z]*\1 (that matches abbcd, but not abcd)
  • negate the result

Explanation:

  • ([a-z]) matches any single character
  • [a-z]* allows zero or more characters after the one matched above
  • \1 is a back reference to the character found at ([a-z])
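
The three steps above can be sketched in Python as follows. This is only a minimal illustration, assuming lowercase words separated by whitespace; the function name `words_without_repeats` is made up for the example. Note that it only rejects repeated letters and does not restrict words to a given letter set.

```python
import re

# A letter followed, anywhere later in the word, by the same letter again.
REPEATED = re.compile(r"([a-z])[a-z]*\1")

def words_without_repeats(text):
    """Return the words of `text` in which no letter occurs twice."""
    # Split into words, test each word for a repeat, then negate the result.
    return [w for w in text.split() if not REPEATED.search(w)]

print(words_without_repeats("abcd abbcd"))  # ['abcd']
```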

Answer 1 (score: 1)

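There are already some good ideas here, but I'd like to offer a sample implementation in Python. It is not necessarily optimal, but it should work. Usage: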

$ python find.py -p abcd < file.txt

The implementation of find.py is:

import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument('-p', required=True)  # the pattern letters, e.g. abcd
args = parser.parse_args()

for line in sys.stdin:
    for candidate in line.split():
        present = dict.fromkeys(args.p, 0)  # initialize dict of letter:count
        for ch in candidate:
            if ch in present:  # letters outside the pattern are not counted
                present[ch] += 1
        if all(x <= 1 for x in present.values()):
            print(candidate)

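This handles the requirement that each character in the pattern matches at most once, i.e. it allows zero matches. If you want each character to match exactly once, change the second-to-last line to: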

        if all(x == 1 for x in present.values()):

Answer 2 (score: 0)

Melpomene is right, regexps are not the best instrument for this task. A regexp is essentially a finite state machine. In your case the current state can be defined as the combination of presence flags for each of the letters of your alphabet. Thus the total number of internal states in the regex will be 2^N, where N is the number of allowed letters.

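The easiest way to define such a regex is to list all possible permutations of the available letters (and use ? to eliminate the necessity of listing shorter sequences). For three letters (a,b,c) the regex looks like:

a?b?c?|a?c?b?|b?a?c?|b?c?a?|c?a?b?|c?b?a?

For four letters (a,b,c,d) it becomes much longer:

a?b?c?d?|a?b?d?c?|a?c?b?d?|a?c?d?b?|a?d?b?c?|a?d?c?b?|b?a?c?d?|b?a?d?c?|b?c?a?d?|b?c?d?a?|b?d?a?c?|b?d?c?a?|c?a?b?d?|c?a?d?b?|c?b?a?d?|c?b?d?a?|c?d?a?b?|c?d?b?a?|d?a?b?c?|d?a?c?b?|d?b?a?c?|d?b?c?a?|d?c?a?b?|d?c?b?a?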

As you can see, not that convenient.
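
To see how quickly the alternation grows, it can be generated rather than typed out. The sketch below is only an illustration: the helper name `permutation_regex` is made up, and it assumes the letters are plain characters that need no escaping. The number of branches is N! for N letters.

```python
import re
from itertools import permutations

def permutation_regex(letters):
    """Join every ordering of `letters`, each letter made optional with '?'."""
    branches = ("".join(ch + "?" for ch in perm) for perm in permutations(letters))
    return "|".join(branches)

pattern = permutation_regex("abc")
print(pattern)  # a?b?c?|a?c?b?|b?a?c?|b?c?a?|c?a?b?|c?b?a?

# fullmatch requires the whole word to fit one of the orderings.
print(bool(re.fullmatch(pattern, "cab")))   # True
print(bool(re.fullmatch(pattern, "abb")))   # False
print(len(permutation_regex("abcd").split("|")))  # 24
```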

The solution without regexps depends on your toolset. I would write a simple program that processes the input text word by word. At the start of each word a bitmask is created, where each bit represents the presence of the corresponding letter of the desired alphabet. While traversing the word, if the bit that corresponds to the current letter is zero, it is set to one. If a letter whose bit is already set occurs, or the letter is not in the alphabet, the word is skipped. If the word is traversed completely, it is valid.
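
A minimal Python sketch of this word-by-word check might look like the following. The function name `uses_each_letter_at_most_once` and the sample text are assumptions made for the illustration, not part of the original answer.

```python
def uses_each_letter_at_most_once(word, alphabet):
    """Check `word` using one presence bit per letter of `alphabet`."""
    bit_of = {ch: 1 << i for i, ch in enumerate(alphabet)}
    seen = 0  # bitmask: a bit is set once its letter has occurred
    for ch in word:
        bit = bit_of.get(ch)
        if bit is None or seen & bit:
            return False  # letter outside the alphabet, or a repeated letter
        seen |= bit
    return True  # the whole word was traversed, so it is valid

text = "abcd abbcd abxd dcb"
print([w for w in text.split() if uses_each_letter_at_most_once(w, "abcd")])
# ['abcd', 'dcb']
```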

Answer 3 (score: 0)

grep -Pwo '[abc]+' | grep -Pv '([abc]).*\1' | awk 'length==3'

where:

  • first grep: words made up of the pattern letters...
  • second grep: ...with no repeated letters...
  • awk: ...whose length equals the number of letters
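
For comparison, the three stages can be mirrored roughly in Python. This is only a sketch: the function name `pipeline` and the sample text are made up, and `\b` is used as an approximation of grep's `-w` word matching.

```python
import re

def pipeline(text, letters="abc"):
    cls = "[" + re.escape(letters) + "]"
    # Stage 1 (first grep): whole words built only from the pattern letters.
    words = re.findall(r"\b" + cls + r"+\b", text)
    # Stage 2 (second grep): drop words in which some letter repeats.
    words = [w for w in words if not re.search("(" + cls + r").*\1", w)]
    # Stage 3 (awk): keep words as long as the pattern, so every letter is used.
    return [w for w in words if len(w) == len(letters)]

print(pipeline("abc cab abb ab abcd bca"))  # ['abc', 'cab', 'bca']
```

Dropping the last stage gives the "at most once" behaviour asked for in the question, since shorter words such as ab then pass as well.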