Question

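I am writing a program that determines whether particular strings are in the language of a regular expression specified in a text file. The regular expression uses substitute symbols for operators that are not conventional keyboard characters: e stands for epsilon, N for the empty set, U for union, o for concatenation, and * for the star operator. I can't figure out whether I need to write code that interprets these substitutions so that the program works correctly, for example for strings longer than one character. For instance, checking whether the regular expression contains U, then defining what it means, and so on.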

A sample text file looks like this:

12
(((1U(2o1))U(2o2))o((1U2)*))
1
21
22
1212121
21121212
22121212
e
2

My program so far:

'''
Searching text for strings that are in the language
of a regular expression
'''
import sys
import re



# Part 1
# Read from file (alphabet, regular expression, sequence of strings)

fileName = sys.argv[1] # open file
# outPut = sys.argv[2]
alphabet = []
inputs = []
strings = {}

# Read lines from text file and store alphabet and regX
f = open(fileName, 'r')
lines = f.readlines()
alphabet = lines[0].strip()
del lines[0]
regx = lines[0]
del lines[0]


print(alphabet, regx, sep='\n') # debug print statement

# Remaining lines are the test strings
# print(lines) # debug print statement

for line in lines:
    splitLine = line.split()
    strings[splitLine[0]] = ",".join(splitLine[0:])

print('Printing out the strings:', strings) # debug print statement

# Substitute for epsilon, concat, empty set, union and star



# Testing if the strings are part of the regular expression's language
for string in strings:
    if string not in regx:
        print('False', string)
    else:
        print('True', string)

Sample output for this particular text file:

12
(((1U(2o1))U(2o2))o((1U2)*))

Printing out the strings: {'1212121': '1212121', 'e': 'e', '22': '22'...}

False 1212121
False e
False 22
False 22121212
True 2
False 21
True 1
False 21121212

The correct output should be:

True 1
True 21
True 22
True 1212121
True 21121212
True 22121212
False e
False 2

Answer 1

Expanding on tekim's suggestion:

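The "o" in your regular expression can simply be removed, because in standard regular expression syntax placing two things next to each other already means concatenation. "U" becomes "|".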

Remove the strings dictionary; it serves no purpose and can only scramble the order of the strings.
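As a rough sketch of what dropping the dictionary could look like, the test strings can be kept in a plain list, which preserves file order and keeps duplicates. The helper name `split_input` is hypothetical, and the sketch assumes blank lines in the file carry no meaning (the empty string is written as e):

```python
def split_input(lines):
    """Split the file's lines into (alphabet, pattern, test_strings)."""
    # Assumption: blank lines are not meaningful and can be skipped.
    cleaned = [line.strip() for line in lines if line.strip()]
    alphabet, pattern, *tests = cleaned
    return alphabet, pattern, tests


sample = ["12\n", "(((1U(2o1))U(2o2))o((1U2)*))\n", "1\n", "21\n", "e\n", "2\n"]
alphabet, pattern, tests = split_input(sample)
print(tests)  # file order is preserved: ['1', '21', 'e', '2']
```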

Convert your regular expression to standard syntax:

regx = regx.strip().replace("U", "|").replace("o", "")+"$"
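The one-liner above leaves e and N untouched. If the expression itself can contain them, one possible extension is to translate character by character. This is only a sketch under assumptions: the function name `to_python_regex` is hypothetical, e is mapped to an empty group, N to a lookahead that can never succeed, and the alphabet is assumed not to contain the letters U, o, e or N:

```python
import re

# Hypothetical mapping from the file's notation to Python regex syntax.
# "(?:)" matches only the empty string; "(?!)" never matches anything.
SUBSTITUTIONS = {"U": "|", "o": "", "e": "(?:)", "N": "(?!)"}


def to_python_regex(expr):
    """Translate the file's notation into a pattern for the re module."""
    parts = []
    for ch in expr.strip():
        if ch in SUBSTITUTIONS:
            parts.append(SUBSTITUTIONS[ch])
        elif ch in "()*":
            parts.append(ch)             # grouping and star keep their meaning
        else:
            parts.append(re.escape(ch))  # alphabet symbols are taken literally
    return "".join(parts)


print(to_python_regex("(((1U(2o1))U(2o2))o((1U2)*))"))
# (((1|(21))|(22))((1|2)*))
```

Escaping the alphabet symbols matters only if the alphabet contains characters such as "." or "+" that have a special meaning in Python patterns.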

Loop over the sample strings and produce the correct output:

for line in lines:
    string = line.strip()
    if re.match(regx, string):
        print('True', string)
    else:
        print('False', string)
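A possible variation, sketched here rather than prescribed: `re.fullmatch` (available since Python 3.4) requires the whole string to match, so the trailing "$" is not needed. That also avoids a precedence pitfall when the expression has no outer parentheses, because in "1|2$" the "$" binds only to the second alternative. The helper name `in_language` is hypothetical, and treating a test line of "e" as the empty string is an assumption about the file format:

```python
import re

# The sample expression after replacing "U" with "|" and deleting "o".
pattern = "(((1|(21))|(22))((1|2)*))"
tests = ["1", "21", "22", "1212121", "21121212", "22121212", "e", "2"]


def in_language(pattern, s):
    """Return True if the whole of s matches pattern."""
    # Assumption: a test line consisting of "e" denotes the empty string.
    candidate = "" if s == "e" else s
    return re.fullmatch(pattern, candidate) is not None


for s in tests:
    print(in_language(pattern, s), s)

# Precedence pitfall with an appended "$" and no outer parentheses:
print(bool(re.match("1|2" + "$", "13")))  # True, "$" only applies to "2"
print(bool(re.fullmatch("1|2", "13")))    # False
```

For the sample file this prints True for the first six strings and False for e and 2, in line with the expected output given in the question.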
