Question

I am trying to clean up some text by keeping only letters and digits. However, my text still contains other characters.

This is my function:

import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords


def review_to_wordlist(review, remove_stopwords=False, remove_numbers=False):
    # Function to convert a document to a sequence of words,
    # optionally removing stop words and numbers.  Returns a list of words.
    #
    # 1. Remove HTML
    review_text = BeautifulSoup(review, "html.parser").get_text()
    #
    # 2. Replace everything except ASCII letters (and digits, unless
    #    remove_numbers is set) with a space
    pattern = "[^a-zA-Z]" if remove_numbers else "[^a-zA-Z0-9]"
    review_text = re.sub(pattern, " ", review_text)
    #
    # 3. Convert words to lower case and split them
    words = review_text.lower().split()
    #
    # 4. Optionally remove stop words (false by default)
    if remove_stopwords:
        stops = set(stopwords.words("english"))
        words = [w for w in words if w not in stops]
    #
    # 5. Return a list of words
    return words

This is one of the results I get:

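NuTone Central Vacuum System 45 Ell Ohio Steel Tandem Natural and Synthetic Lawn Sweeping System Unique Home Designs 36 in. x 80 in. Su Casa Black Surface Mount Outswing Steel Security Door with Expanded Metal Screen Unique Home Designs 36 in. x 80 in. Su Casa Black Surface Mount Outswing Steel Security Door with Expanded Metal Screen Unique Home Designs 36 in. x 80 in. Su Casa Black Surface Mount Outswing Steel Security Door with Expanded Metal Screen MP Global Best 400 in. x 36 in. x 1/8 in. Acoustical Recycled Fiber Underlayment with Film for Laminate Wood MP Global Best 400 in. x 36 in. x 1/8 in. Acoustical Recycled Fiber Underlayment for Laminate Wood Grip #10-1/4 in. x 2-1/2 in. 8 Bright Steel Ring-Shank Common Nails (1 lb. Pack)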

The error I get is:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 5-6: unexpected end of data


676
Husky Pneumatic 3-1/2 in. 21� Full-Head Strip Framing Nailer
5157
RIDGID 3-1/2 in. 21� Round-Head Nailer
5158
RIDGID 3-1/2 in. 21� Round-Head Nailer

Answer 1

As noted in Padraic Cunningham's two comments, StatefulWidgets defaults to utf8 when reading. This can garble some characters, and it can be fixed by setting StatelessWidget in the call.
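
The general idea of decoding with an explicit encoding, rather than relying on a UTF-8 default, can be sketched with the standard library alone. This is only an illustration and not the exact call from the comments: it assumes the product titles were saved in a single-byte encoding such as Latin-1 (the question does not say which one), so that the degree sign is the single byte 0xB0. The sample bytes and the helper name `decode_title` are hypothetical.

```python
import re

# Hypothetical raw bytes for one product title. ASSUMPTION: the file was saved
# as Latin-1, where the degree sign is the single byte 0xB0.
raw_title = b"Husky Pneumatic 3-1/2 in. 21\xb0 Full-Head Strip Framing Nailer"


def decode_title(raw, encoding="latin-1"):
    """Decode raw bytes with an explicit encoding instead of a UTF-8 default."""
    return raw.decode(encoding)


# Decoding as UTF-8 fails, because a lone 0xB0 byte is not valid UTF-8.
try:
    raw_title.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Decoding with the assumed encoding succeeds and recovers the degree sign.
title = decode_title(raw_title)

# The same substitution as in the question now replaces the degree sign too.
words = re.sub("[^a-zA-Z0-9]", " ", title).lower().split()

print(utf8_failed)
print(words)
```

If the real data uses a different encoding, the `encoding` argument would need to change accordingly. The point is only that the bytes have to be decoded with the encoding they were written in before any character-level cleaning takes place.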

UnicodeDecodeError when cleaning text data
