Question

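I have about 100 text files, each containing 1-2 paragraphs of clinical notes. The files are named doc_1.txt through doc_179.txt. I want to save the text from each file into a .csv file with two columns and a header row (id, text). The id column holds the name of each file.

For example, doc_1 is the file name and becomes the id, and the text inside doc_1 is stored in the text column. The expected result looks like this: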


|   id  | text |
|:-----:|:----:|
| doc_1 | abcf |
| doc_2 | efrf |
| doc_3 | gvni |

So far I have only read through this article and have not worked out the best way to get this result.

Answer 1

Assuming you already have the list of files:

import pandas as pd # remove if already imported

# ...

files_list = ["doc_1.txt", "doc_2.txt", ..., "doc_179.txt"]

Create a DataFrame with the required columns:

df = pd.DataFrame(columns=["id", "text"])

Loop over the files to read each one's text, then save the result to a csv file:

for file in files_list:
    with open(file) as f:
        txt = f.read() # to retrieve the text in the file
    file_name = file.split(".")[0] # to remove file type
    # DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
    df.loc[len(df)] = [file_name, txt]


df.to_csv("result.csv", sep="|", index=False) # export DataFrame into csv file

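You can change the name of the output csv file (result.csv) and the character used for sep as you like.

It is strongly recommended not to pick a character that already appears in the text of any file. (For example, if any text file already contains a comma, do not use , as the value of sep.) pandas quotes any field that contains the separator, so the file still parses correctly, but a rarer character keeps it easier to read and safer for tools that split lines naively.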
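The separator advice above matters most for tools that split lines naively. As a minimal sketch, using Python's standard csv module rather than the pandas call from this answer, the following shows that a field containing commas, double quotes, and a line break is quoted on write and recovered intact on read. The sample note text is made up for illustration.

```python
import csv
import io

# Hypothetical note text containing a comma, double quotes, and a newline.
rows = [
    {"id": "doc_1", "text": 'Patient reports "mild" pain, no fever.\nFollow up in 2 weeks.'},
    {"id": "doc_2", "text": "No complaints"},
]

# Write to an in-memory buffer; newline="" lets the csv module control line endings.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "text"])
writer.writeheader()
writer.writerows(rows)

# Read the same data back and compare it with what was written.
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))
print(round_tripped == rows)
```

If the file will only ever be read by a real CSV parser, the default comma is therefore workable; a different sep is mainly a readability and safety margin.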

Answer 2

I would like to post an updated version of the solution I was given, which solved my problem.

import pandas as pd

import glob

# sorted() gives a repeatable order; glob returns matches in arbitrary order
files_list = sorted(glob.glob("*.txt"))

df = pd.DataFrame(columns=["id", "text"])

for file in files_list:
    with open(file) as f:
        txt = f.read() # to retrieve the text in the file
    file_name = file.split(".")[0] # to remove file type
    # DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
    df.loc[len(df)] = [file_name, txt]

df.to_csv("result.csv", index=False) # export DataFrame into csv file
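One thing the glob-based approach does not settle is row order: glob makes no promise about numeric order, and plain lexical sorting puts doc_10 before doc_2. The sketch below is one possible pandas-free variant, not the method from either answer. It sorts by the number in the file name and writes the rows with the standard csv module. The helper names doc_number and txt_files_to_csv, the doc_*.txt pattern, and the UTF-8 encoding are all assumptions made for illustration.

```python
import csv
import re
from pathlib import Path


def doc_number(path):
    """Return the trailing integer of a name like doc_12.txt, or -1 if there is none."""
    match = re.search(r"(\d+)$", path.stem)
    return int(match.group(1)) if match else -1


def txt_files_to_csv(src_dir, out_csv, pattern="doc_*.txt"):
    """Write one (id, text) row per matching file and return the number of rows."""
    # Sort numerically so doc_2 comes before doc_10.
    paths = sorted(Path(src_dir).glob(pattern), key=doc_number)
    # newline="" lets the csv module control line endings; UTF-8 is an assumption.
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "text"])
        for path in paths:
            # path.stem drops only the final extension: doc_1.txt -> doc_1
            writer.writerow([path.stem, path.read_text(encoding="utf-8")])
    return len(paths)
```

A call such as txt_files_to_csv(".", "result.csv") would process the current directory. Missing numbers in the doc_1 to doc_179 range are simply skipped, because only files that exist are matched.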

Extracting text from .txt files and saving it to a .csv file with columns and a header
