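BERT tokenizer

Date: 2019-12-03 16:16:53

Tags: python

I have a column of cleaned sentences in an Excel file. I just want to load that specific column into a data frame and then wrap each sentence in the BERT marker tokens.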

import pandas as pd
df = pd.read_excel('blah.xlsx')
text = df["text_clean"].astype(str).tolist()
marked_text = "[CLS] " + str(text) + " [SEP]"
marked_text[:10211]

I do not get CLS and SEP after each sentence in the output. The output is

'[CLS] [\'I think in that case you might want to start stockpiling gin just so you re ready for Season 2 when it hits\', \'Caught up on Dynasties and now need a large gin and some ther...

No SEP appears at all. Note that in the output above, the first sentence is the first row, the second sentence is the second row, and so on.

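1 answer:

Answer 0: (score: 0)

The [SEP] is there, at the end of the stringified list. str(text) converts the whole list into a single string, so the markers wrap the list as a whole rather than each sentence. You can check by printing the last 10 characters of the string: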

print(marked_text[-10:])
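To make the cause concrete, here is a minimal sketch using a two-item list as a hypothetical stand-in for the Excel column (the sample sentences and variable names are assumptions, not the asker's data). It shows that calling str() on a list yields one string, so each marker appears exactly once:

```python
# Hypothetical stand-in for df["text_clean"].astype(str).tolist()
sentences = ["first sentence", "second sentence"]

# str() on a list returns the repr of the whole list as ONE string,
# including the brackets, quotes, and commas
whole = "[CLS] " + str(sentences) + " [SEP]"
print(whole)

# Each marker occurs once: at the very start and the very end
cls_count = whole.count("[CLS]")
sep_count = whole.count("[SEP]")
print(cls_count, sep_count)
```

Slicing such a string, as in marked_text[:10211], only shows its beginning, which is why the trailing [SEP] never shows up in the printed output.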

That said, I guess your expected result is

[CLS] I think in that case you might want to start stockpiling gin just so you re ready for Season 2 when it hits [SEP]
[CLS] Caught up on Dynasties and now need a large gin and some ther... [SEP]
...

To do that, apply the string concatenation to each entry of text:

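import pandas as pd

# Load the spreadsheet and pull the cleaned column out as a list of strings
df = pd.read_excel('blah.xlsx')
text = df["text_clean"].astype(str).tolist()

# Wrap every sentence individually instead of the list as a whole
marked_text = []
for e in text:
    marked_text.append("[CLS] " + str(e) + " [SEP]")

# Unpack the list so the entries print separated by spaces
print(*marked_text)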

Output:

[CLS] I think in that case you might want to start stockpiling gin just so you re ready for Season 2 when it hits [SEP] [CLS] Caught up on Dynasties and now need a large gin and some ther... [SEP]...
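The same per-sentence wrapping can also be written as a list comprehension. The sketch below is one possible variant, not the answer's own code: add_bert_markers is a hypothetical helper name, and the sample list stands in for the column read from Excel.

```python
def add_bert_markers(sentences, cls="[CLS]", sep="[SEP]"):
    """Return a new list with each sentence wrapped in the given markers."""
    return [f"{cls} {s} {sep}" for s in sentences]


# Hypothetical sample data in place of df["text_clean"].astype(str).tolist()
sample = ["first sentence", "second sentence"]
marked = add_bert_markers(sample)

# Print one marked sentence per line rather than space-separated
print("\n".join(marked))
```

Joining with a newline puts one marked sentence on each line, which may be easier to inspect than the space-separated output of print(*marked_text).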