Getting the character index from a word index in a text

Time: 2013-06-24 04:03:41

Tags: python nltk

Given the index of a word in a text, I need to get the corresponding character index. For example, in the text below:

"The cat called other cats."

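the index of the word "cat" is 1. I need the index of the first character of "cat", i.e. "c", which is 4. I don't know whether it's relevant, but I'm using python-nltk to get the words. Right now, the only way I can think of doing this is: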

 - Get the first character, find the number of words in this piece of text
 - Get the first two characters, find the number of words in this piece of text
 - Get the first three characters, find the number of words in this piece of text
 - Repeat until we get to the required word.

But this would be very inefficient. Any ideas would be appreciated.

3 answers:

Answer 0 (score: 1)

You can use a dict here, mapping each word index to the word's start offset and text:

>>> import re
>>> r = re.compile(r'\w+')
>>> text = "The cat called other cats."
>>> dic = { i :(m.start(0), m.group(0)) for i, m in enumerate(r.finditer(text))}
>>> dic
{0: (0, 'The'), 1: (4, 'cat'), 2: (8, 'called'), 3: (15, 'other'), 4: (21, 'cats')}
>>> def char_index(char, word_ind):
...     start, word = dic[word_ind]
...     ind = word.find(char)
...     if ind != -1:
...         return start + ind
...
>>> char_index('c',1)
4
>>> char_index('c',2)
8
>>> char_index('c',3)    # 'other' contains no 'c', so None is returned and nothing is printed
>>> char_index('c',4)
21
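
If only the start of the word is needed (which is what the question asks for), the same regex idea can be wrapped in a function that does not depend on a module-level dict. This is a minimal sketch, not part of the original answer: `word_start` and its `pattern` parameter are names made up here, and it assumes that "word" means a match of `\w+`, which may not agree with what a given NLTK tokenizer produces.

```python
import re
from itertools import islice

def word_start(text, word_index, pattern=r'\w+'):
    """Return the character offset of the word_index-th match of pattern in text."""
    matches = re.finditer(pattern, text)
    # islice lazily skips the first word_index matches;
    # next() falls back to None when the text has fewer words than that
    match = next(islice(matches, word_index, None), None)
    if match is None:
        raise IndexError("word index out of range")
    return match.start()

print(word_start("The cat called other cats.", 1))  # prints 4
```

Passing `pattern=r'\S+'` treats every run of non-whitespace as a word, which is closer to what `str.split()` returns.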

Answer 1 (score: 0)

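import re
def char_index(sentence, word_index):
    # Parentheses keep the split characters, so words sit at even positions
    # (this assumes exactly one whitespace character between words)
    parts = re.split(r'(\s)', sentence)
    # The combined length of everything before the word is its start offset
    return len(''.join(parts[:word_index*2]))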

>>> s = 'The die has been cast'
>>> char_index(s,3)    #'been' has index 3 in the list of words
12
>>> s[12]
'b'

Answer 2 (score: 0)

Use enumerate():

>>> def obt(phrase, indx):
...     word = phrase.split()[indx]
...     e = list(enumerate(phrase))
...     for i, j in e:
...             if j == word[0] and ''.join(x for y, x in e[i:i+len(word)]) == word:
...                     return i
... 
>>> obt("The cat called other cats.", 1)
4
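
Note that the function above returns the first place where the word's text occurs, so it can point at an earlier occurrence when the same text appears before the requested word (for example, index 2 in "a cat a"). One possible way around this, sketched here under the assumption that every token appears verbatim in the text and in order, is to walk through the token list while keeping a running offset. `token_offsets` is a hypothetical helper, not something from NLTK; the token list could come from `str.split()` or from a tokenizer, as long as the tokenizer does not rewrite the tokens.

```python
def token_offsets(text, tokens):
    """Return the start offset of each token, searching left to right."""
    offsets = []
    pos = 0  # never search before the end of the previous token
    for token in tokens:
        start = text.find(token, pos)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {pos}")
        offsets.append(start)
        pos = start + len(token)
    return offsets

text = "a cat a"
print(token_offsets(text, text.split()))  # prints [0, 2, 6]
```

The offsets for all words are computed in one pass over the text, so looking up the character index for any word index afterwards is just a list access.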