How to use the NLTK Snowball stemmer to stem a list of Spanish words in Python

Date: 2015-03-21 16:01:47

Tags: python nltk

I'm trying to use the NLTK Snowball stemmer to stem Spanish text, and I've run into some encoding problems that I don't understand.

Here is an example sentence that I'm trying to process:

  

En diciembre, los precios de la energía subieron un 1,4 por ciento, los de la vivienda aumentaron un 0,1 por ciento y los precios de la vestimenta se mantuvieron sin cambios, mientras que los de los automóviles nuevos bajaron un 0,1 por ciento y los de los pasajes de avión cayeron el 0,7 por ciento.

First, I read the sentence from the XML file using this code:

from nltk.stem.snowball import SnowballStemmer
import xml.etree.ElementTree as ET

stemmer = SnowballStemmer("spanish")
sentence = ET.tostring(context, encoding='utf-8', method="text").lower()

Then, after tokenizing the sentence into a list of words, I try to stem each word:

stem = stemmer.stem(words[headIndex - index])

The error comes from this line:

Traceback (most recent call last):
  File "main.py", line 150, in <module>
    main()
  File "main.py", line 142, in main
    vectorDict, vocabulary = englishXml(language)
  File "main.py", line 86, in englishXml
    stem = stemmer.stem(words[headIndex - index])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/stem/snowball.py", line 3404, in stem
    r1, r2 = self._r1r2_standard(word, self.__vowels)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/stem/snowball.py", line 232, in _r1r2_standard
    if word[i] not in vowels and word[i-1] in vowels:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I also tried reading the sentence from the XML file without the 'utf-8' encoding, but the problem is that '.lower()' doesn't work in that case:

sentence = ET.tostring(context, method="text").lower()

In this case the error becomes:

Traceback (most recent call last):
  File "main.py", line 154, in <module>
    main()
  File "main.py", line 146, in main
    vectorDict, vocabulary = englishXml(language)
  File "main.py", line 63, in englishXml
    sentence = ET.tostring(context, method="text").lower()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 814, in write
    _serialize_text(write, self._root, encoding)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1006, in _serialize_text
    write(part.encode(encoding))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 18: ordinal not in range(128)

Thanks in advance!

3 Answers:

Answer 0 (score: 1)

Try adding this before stemming:

sentence = sentence.decode('utf8')
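
For context on why the decode helps: the first traceback appears to show the stemmer comparing characters of a UTF-8 byte string against its Unicode vowel set, which makes Python 2 attempt an implicit ASCII decode and fail on the first accented byte (0xc3). Below is a minimal sketch of the decode step. The `context` element is a hypothetical stand-in built inline, not the one from the question, and the whitespace split is only a placeholder for the real tokenizer.

```python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the `context` element from the question.
context = ET.fromstring(
    u"<context>Los precios de la ENERGÍA subieron</context>".encode("utf-8")
)

# With encoding='utf-8', tostring() returns a UTF-8 encoded byte string.
raw = ET.tostring(context, encoding="utf-8", method="text")

# Decode to text first, then lowercase, so accented capitals are handled too.
sentence = raw.decode("utf-8").lower()

# Placeholder tokenization; the question uses its own tokenizer here.
words = sentence.split()
print(words)
```

Decoding before lowercasing is a deliberate choice here: by default, '.lower()' on a byte string only changes ASCII letters, so an accented capital such as Í would be left as it is.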

Answer 1 (score: 0)

Confirming that the final code is:

from nltk.stem.snowball import SnowballStemmer
import xml.etree.ElementTree as ET

stemmer = SnowballStemmer("spanish")

sentence = ET.tostring(context, encoding='utf-8', method="text").lower()
sentence = sentence.decode('utf8')
stem = stemmer.stem(words[headIndex - index])

Answer 2 (score: 0)

from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer('spanish')
stemmed_spanish = [stemmer.stem(item) for item in spanish_words]
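
The snippet above assumes that `spanish_words` already holds text (Unicode) strings. A self-contained sketch that also tolerates UTF-8 byte strings might look like the following; the `to_text` helper and the sample word list are hypothetical, added only for illustration.

```python
# -*- coding: utf-8 -*-
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("spanish")


def to_text(word, encoding="utf-8"):
    """Return `word` as text, decoding it first if it is a byte string."""
    return word.decode(encoding) if isinstance(word, bytes) else word


# Hypothetical input: a mix of text and UTF-8 byte strings.
spanish_words = [u"precios", u"energía".encode("utf-8"), u"automóviles"]

# Normalize every item to text before handing it to the stemmer.
stemmed_spanish = [stemmer.stem(to_text(item)) for item in spanish_words]
print(stemmed_spanish)
```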