Question

I have a string taken from an article with a few hundred sentences. I want to convert the string into a dataframe, with each sentence as a row. For example,

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'

I want it to become:

This is a book, to which I found exciting.
I bought it for my cousin.
He likes it.

As a Python newbie, I made an attempt of my own, but with my code all the sentences become column names. What I actually want is for them to be rows in a single column.

Answer 1

Don't use read_csv. Just split on '.' and use the standard pd.DataFrame:

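import pandas as pd

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
# Split on '.', strip surrounding whitespace, and drop empty fragments
sentences = [s.strip() for s in data.split('.') if s.strip()]
data_df = pd.DataFrame(sentences, columns=['sentences'])
print(data_df)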

#                                     sentences
#  0  This is a book, to which I found exciting
#  1                  I bought it for my cousin
#  2                                He likes it

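Keep in mind that this will break if any of the sentences contain floating-point numbers. In that case, you will need to change the format of the string (e.g. use '\n' instead of '.' to separate sentences).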
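To make the caveat concrete, here is a small sketch of how a decimal point breaks the '.' split, and how a newline-separated string avoids it. The sample strings are made up for illustration and are not from the question.

```python
# Hypothetical sample text containing a floating-point number
text_with_float = 'The book costs 3.50 dollars. I bought it anyway.'

# Splitting on '.' also cuts the number in half
naive = [s.strip() for s in text_with_float.split('.') if s.strip()]
print(naive)  # ['The book costs 3', '50 dollars', 'I bought it anyway']

# If the sentences are separated by '\n' instead, the number survives
newline_text = 'The book costs 3.50 dollars.\nI bought it anyway.'
sentences = [line.strip() for line in newline_text.splitlines() if line.strip()]
print(sentences)  # ['The book costs 3.50 dollars.', 'I bought it anyway.']
```

The resulting list can be passed to pd.DataFrame(sentences, columns=['sentences']) in the same way as above.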

Answer 2

This is a quick-and-dirty solution, but it solves your problem:

from io import StringIO  # read_csv expects a path or file-like object, not the raw text
data_df = pd.read_csv(StringIO(data), sep=".", header=None).T

Answer 3

You can achieve this with a list comprehension:

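import pandas as pd

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'

# Strip the final '.' before splitting so the last sentence does not end up with two periods
df = pd.DataFrame({'sentence': [s + '.' for s in data.rstrip('.').split('. ')]})

print(df)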

#                                      sentence
# 0  This is a book, to which I found exciting.
# 1                  I bought it for my cousin.
# 2                                He likes it.

Answer 4

What you are trying to do is called sentence tokenization. The easiest way is to use a text-mining library such as NLTK:

import pandas as pd
from nltk.tokenize import sent_tokenize  # needs NLTK's Punkt tokenizer data to be downloaded once
pd.DataFrame(sent_tokenize(data))

Otherwise you could try something like:

pd.DataFrame(data.split('. '))

However, this will fail if you encounter a sentence like this:

problem = 'Tim likes to jump... but not always!'
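If installing NLTK is not an option, one possible middle ground is a regular-expression split. The sketch below is only a heuristic, not a real tokenizer: it assumes a sentence ends with '.', '!' or '?' followed by whitespace and an uppercase letter. The helper name split_sentences is made up for this example.

```python
import re

def split_sentences(text):
    """Heuristic splitter: break after . ! or ? only when the next word is capitalized."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    return [p.strip() for p in parts if p.strip()]

data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
problem = 'Tim likes to jump... but not always!'

print(split_sentences(data))
# ['This is a book, to which I found exciting.', 'I bought it for my cousin.', 'He likes it.']
print(split_sentences(problem))
# ['Tim likes to jump... but not always!']
```

The returned list can be wrapped in pd.DataFrame(...) as in the snippets above. Abbreviations such as 'Dr. Smith' would still be split by this rule, which is the kind of case a trained tokenizer like sent_tokenize is meant to handle.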

Convert string to dataframe, separated by colons

4 answers: