Question

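This is a question I was asked during an on-site interview at a tech company, and I think it is what ultimately killed my chances.

You are given a sentence and a dictionary whose keys are words and whose values are parts of speech.

The goal is to write a function that, given a sentence, replaces each word in turn with the part of speech the dictionary gives for it. We can assume that everything in the sentence appears as a key in the dictionary.

For example, suppose we are given the following input: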

sentence='I am done; Look at that, cat!' 

dictionary={'!': 'sentinel', ',': 'sentinel', 
            'I': 'pronoun', 'am': 'verb', 
            'Look': 'verb', 'that': 'pronoun', 
             'at': 'preposition', ';': 'preposition', 
             'done': 'verb', ',': 'sentinel', 
             'cat': 'noun', '!': 'sentinel'}

output='pronoun verb verb sentinel verb preposition pronoun sentinel noun sentinel'

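The tricky part is catching the sentinels. Without sentinels among the parts of speech, this would be easy. Is there a simple way to do it? Is there a library for this?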

Answer 1

Python's regular expression package (re) can be used to split the sentence into tokens.

import re
sentence='I am done; Look at that, cat!' 

dictionary={'!': 'sentinel', ',': 'sentinel', 
            'I': 'pronoun', 'am': 'verb', 
            'Look': 'verb', 'that': 'pronoun', 
             'at': 'preposition', ';': 'preposition', 
             'done': 'verb', ',': 'sentinel', 
             'cat': 'noun', '!': 'sentinel'}

tags = list()
for word in re.findall(r"[A-Za-z]+|\S", sentence):
    tags.append(dictionary[word])

print (' '.join(tags))

Output

pronoun verb verb preposition verb preposition pronoun sentinel noun sentinel

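The regular expression [A-Za-z]+|\S matches either a run of one or more letters, upper or lower case ([A-Za-z]+), or, failing that (the | denotes alternation), any single non-whitespace character (\S). Each punctuation mark therefore becomes a token of its own.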
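To see what the pattern produces before any dictionary lookup happens, here is a minimal sketch that only tokenizes. The helper name tokenize is my own and is not part of the answer above.

```python
import re

def tokenize(sentence):
    # A run of ASCII letters is one token; any other non-whitespace
    # character (punctuation, digits, ...) becomes a token of its own.
    return re.findall(r"[A-Za-z]+|\S", sentence)

tokens = tokenize('I am done; Look at that, cat!')
print(tokens)
# ['I', 'am', 'done', ';', 'Look', 'at', 'that', ',', 'cat', '!']
```

Note that this pattern also splits digits and non-ASCII letters into single characters. That is harmless for the example sentence, but it may not be what you want for other inputs.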

Answer 2

Here is a less impressive but more explanatory solution:

Let's first define the example dictionary and sentence from the question:

sentence = 'I am done; Look at that, cat!' 

dictionary = {
    '!':    'sentinel', 
    ',':    'sentinel', 
    ',':    'sentinel', 
    'I':    'pronoun', 
    'that': 'pronoun', 
    'cat':  'noun', 
    'am':   'verb', 
    'Look': 'verb', 
    'done': 'verb', 
    'at':   'preposition', 
    ';':    'preposition', 
}

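For my solution I defined a recursive parsing function, aptly named parse. parse first splits the sentence into words on whitespace, then tries to classify each word by looking it up in the provided dictionary. If a word is not in the dictionary (because it still has punctuation attached, for example), parse splits the word into its component tokens and recursively parses the result.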

def parse(sentence, dictionary):
  # split the words apart by whitespace
  # some tokens may still be stuck together (e.g. "that,")
  words = sentence.split() 

  # this is a list of strings containing the 'category' of each word
  output = [] 

  for word in words:
    if word in dictionary:
      # base case, the word is in the dictionary
      output.append(dictionary[word])
    else:
      # recursive case, the word still has tokens attached

      # get all the tokens in the word
      tokens = [key for key in dictionary.keys() if key in word]

      # sort the tokens longest first, so that big words are more likely to be preserved (e.g. "scat" -> "s", "cat" rather than "sc", "at")
      tokens.sort(key=len, reverse=True)

      # this is where we'll store the output 
      sub_output = None

      # iterate through the tokens to find if there's a valid way to split the word
      for token in tokens:
        try: 

          # pad the tokens inside each word
          sub_output = parse(
            word.replace(token, f" {token} "), 
            dictionary
          )

          # if the word is parsable, no need to try other combinations
          break
        except AssertionError:
          pass # the word couldn't be split on this token

      # make sure that the word was split - if it wasn't it's not a valid word and the sentence can't be parsed
      assert sub_output is not None

      output.append(sub_output)

  # put it all together into a neat little string
  return ' '.join(output)

Here is how you would use it:

# usage of parse
output = parse(sentence, dictionary)

# display the example output
print(output)

I hope my answer gives you some insight into another way of solving this problem.

Tada!
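The idea of preferring longer tokens can also be expressed without recursion, by building one alternation pattern from the dictionary keys themselves. This is a different technique from parse, sketched here only for comparison; the function name tag_with_dictionary_regex is hypothetical, and the sketch assumes a non-empty dictionary.

```python
import re

def tag_with_dictionary_regex(sentence, dictionary):
    # Try longer keys first, so that 'that' is matched before 'at'.
    keys = sorted(dictionary, key=len, reverse=True)
    # re.escape keeps punctuation keys such as '!' literal in the pattern.
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    # findall returns the matched keys in sentence order.
    return " ".join(dictionary[token] for token in pattern.findall(sentence))

sentence = 'I am done; Look at that, cat!'
dictionary = {
    '!': 'sentinel', ',': 'sentinel', ';': 'preposition',
    'I': 'pronoun', 'that': 'pronoun', 'cat': 'noun',
    'am': 'verb', 'Look': 'verb', 'done': 'verb', 'at': 'preposition',
}

print(tag_with_dictionary_regex(sentence, dictionary))
# pronoun verb verb preposition verb preposition pronoun sentinel noun sentinel
```

Unlike parse, this version silently skips any text that is not a dictionary key instead of failing, so it is only appropriate under the question's assumption that everything in the sentence is in the dictionary.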

Answer 3

If you are looking for an approach that does not rely on regular expressions, you can try the following:

def tag_pos(sentence):
    # relies on the `dictionary` defined in the question
    output = []
    for word in sentence.split():
        if word in dictionary:
            output.append(dictionary[word])
            continue
        # the word has punctuation attached: separate the letters from the rest
        letters = ''.join(char for char in word if char.isalpha())
        literal = ''.join(char for char in word if not char.isalpha())
        if letters:
            output.append(dictionary[letters])
        # tag each punctuation character on its own
        # (assumes the punctuation follows the letters, as in "that,")
        for char in literal:
            output.append(dictionary[char])

    return " ".join(output)


output = tag_pos(sentence)
print(output)

Split a sentence into words and non-whitespace characters for POS tagging
