Renaming files based on Dataframe content with Python and Pandas

Date: 2019-04-30 14:03:42

Tags: python pandas

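I am trying to read an xlsx file, compare every reference number in one column against the files in a folder and, where they correspond, rename each file to the email address associated with that reference number.

The Excel file has the following fields: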

 Reference     EmailAddress
   1123        bob.smith@yahoo.com
   1233        john.drako@gmail.com
   1334        samuel.manuel@yahoo.com
   ...         .....

My folder applicantsCVs contains only .doc files named after the values in the Reference column (for example, 1233.doc).

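How can I compare the contents of the applicantsCVs folder with the Reference field in the Excel file and, where there is a match, rename each file to the corresponding email address?

Here is what I have tried so far: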

import os
import pandas as pd

dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")
references = dfOne['Reference']

emailAddress = dfOne['EmailAddress']

cleanedEmailList = [x for x in emailAddress if str(x) != 'nan']

print(cleanedEmailList)
excelArray = []
filesArray = []

for root, dirs, files in os.walk("applicantCVs"):
    for filename in files:
        print(filename) #Original file name with type 1233.doc
        reworkedFile = os.path.splitext(filename)[0]
        filesArray.append(reworkedFile)

for entry in references:
    excelArray.append(str(entry))

for i in excelArray:
    if i in filesArray:
        print(i, "corresponds to the file names")

I compare the reference names with the folder contents and print them when they are the same:

for i in excelArray:
    if i in filesArray:
        print(i, "corresponds to the file names")

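I tried renaming with os.rename(filename, cleanedEmailList), but that does not work because cleanedEmailList is a whole list of emails rather than a single name.

How can I match and rename the files?

6 answers:

Answer 0: (score: 2)

Based on the sample data: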

Reference     EmailAddress
   1123        bob.smith@yahoo.com
   1233        john.drako@gmail.com
   nan         jane.smith#example.com
   1334        samuel.manuel@yahoo.com

First, you assemble a dict with the references as keys and the new names as values:

references = dict(df.dropna(subset=["Reference","EmailAddress"]).set_index("Reference")["EmailAddress"])
{'1123': 'bob.smith@yahoo.com',
 '1233': 'john.drako@gmail.com',
 '1334': 'samuel.manuel@yahoo.com'}

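Note that the references here are str. If they are not str in your original data, you can convert them with astype(str).

Then you use pathlib.Path to find all the files in the data directory:

from pathlib import Path

files = Path("../data/renames").glob("*")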
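A minimal sketch of that conversion, assuming the Reference column was read as float because of the missing value. The DataFrame below is made-up sample data standing in for the Excel sheet, and `clean` is a name introduced only for this illustration:

```python
import pandas as pd

# Hypothetical sample data; the None in Reference makes pandas store the column as float64
df = pd.DataFrame({
    "Reference": [1123, 1233, None, 1334],
    "EmailAddress": ["bob.smith@yahoo.com", "john.drako@gmail.com",
                     "jane.smith#example.com", "samuel.manuel@yahoo.com"],
})

clean = df.dropna(subset=["Reference", "EmailAddress"]).copy()
# Go through int first so 1123.0 becomes '1123' rather than '1123.0'
clean["Reference"] = clean["Reference"].astype(int).astype(str)

references = clean.set_index("Reference")["EmailAddress"].to_dict()
print(references)
```

Converting straight from float to str would produce keys such as '1123.0', which would never equal a file stem like '1123'.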
[WindowsPath('../data/renames/1123.docx'),
 WindowsPath('../data/renames/1156.pptx'),
 WindowsPath('../data/renames/1233.txt')]

The renaming itself can be very simple:

for file in files:
    new_name = references.get(file.stem, file.stem )
    file.rename(file.with_name(f"{new_name}{file.suffix}"))

references.get looks up the new file name and falls back to the original stem if the reference is not found.

[WindowsPath('../data/renames/1156.pptx'),
 WindowsPath('../data/renames/bob.smith@yahoo.com.docx'),
 WindowsPath('../data/renames/john.drako@gmail.com.txt')]
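Putting the pieces of this answer together, here is a rough sketch that can be tried in a throwaway directory. The function name `rename_by_reference` and the skip-if-target-exists check are additions for illustration, not part of the answer above:

```python
import tempfile
from pathlib import Path


def rename_by_reference(folder, references):
    """Rename files in `folder` whose stem is a key of `references`.

    Returns a list of (old_name, new_name) pairs for the files that were renamed.
    """
    renamed = []
    # sorted() takes a snapshot, so renaming does not interfere with the iteration
    for file in sorted(Path(folder).glob("*")):
        new_stem = references.get(file.stem)
        if not file.is_file() or new_stem is None:
            continue  # no matching reference: leave the file alone
        target = file.with_name(f"{new_stem}{file.suffix}")
        if target.exists():
            continue  # assumption: never overwrite an existing file
        file.rename(target)
        renamed.append((file.name, target.name))
    return renamed


# Throwaway demo in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("1123.docx", "1156.pptx", "1233.txt"):
        (Path(tmp) / name).touch()
    mapping = {"1123": "bob.smith@yahoo.com", "1233": "john.drako@gmail.com"}
    result = rename_by_reference(tmp, mapping)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(result)
print(remaining)
```

Returning the list of renamed pairs makes it easy to log or review what changed before running the function on the real folder.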

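Answer 1: (score: 0)

How about adding the email address (which I guess is your new name) to a dictionary whose keys are your reference numbers? It could look something like this: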

cor_dict = {}

for i in excelArray:
    if i in filesArray:
        # Look up the email whose Reference (compared as a string) equals i
        match = dfOne.loc[dfOne['Reference'].astype(str) == i, 'EmailAddress']
        cor_dict[i] = match.iloc[0]


for ref, email in cor_dict.items():
    path = 'path to file...'
    filename = str(ref) + '.doc'
    new_filename = str(email).replace('@', '_') + '_.doc'

    filepath = os.path.join(path, filename)
    new_filepath = os.path.join(path, new_filename)

    # Rename using the full paths, not the bare file names
    os.rename(filepath, new_filepath)
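The answer above replaces '@' with '_'. The '@' character is allowed in file names on common file systems, so that replacement is a matter of taste. If the addresses might contain characters that Windows rejects in file names, a small helper along these lines could be used instead. `email_to_filename` is a hypothetical name introduced for this sketch:

```python
import re

# Characters that Windows does not allow in file names
_INVALID = re.compile(r'[<>:"/\\|?*]')


def email_to_filename(email, ext=".doc"):
    """Hypothetical helper: turn an email address into a safe file name."""
    safe = _INVALID.sub("_", str(email).strip())
    return safe + ext


print(email_to_filename("bob.smith@yahoo.com"))
print(email_to_filename("a/b@x.com", ".docx"))
```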

Answer 2: (score: 0)

You can do this directly on the dataframe with df.apply():

import glob
import os

#Filter out null addresses (copy so that a column can be added safely)
df2 = df.dropna(subset=['EmailAddress']).copy()

#Add a column listing the existing files that match each reference
df2['Existing_file'] = df2.apply(lambda row: glob.glob("applicantsCVs/{}.*".format(row['Reference'])), axis=1)

#Rename the first match to the email address, keeping the original extension
df2.apply(lambda row: os.rename(row.Existing_file[0], 'applicantsCVs/{}.{}'.format( row.EmailAddress, row.Existing_file[0].split('.')[-1])) if len(row.Existing_file) else None, axis = 1)
print(df2.Existing_file.map(bool).sum(), "existing files renamed")

Edit: this now works with any extension (.doc, .docx) by using the glob module.
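To illustrate how the glob pattern behaves, here is a small sketch run against a temporary directory. The file names are made up for the demo:

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    for name in ("1123.doc", "1233.docx", "11234.doc"):
        open(os.path.join(tmp, name), "w").close()

    # "1123.*" matches any extension, but not a longer stem such as 11234
    matches = glob.glob(os.path.join(tmp, "1123.*"))
    found = [os.path.basename(m) for m in matches]

    # splitext keeps the leading dot of the extension
    stem, ext = os.path.splitext(found[0])

print(found, stem, ext)
```

Because the pattern requires a dot right after the reference, a file whose name merely starts with the same digits is not picked up.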

Answer 3: (score: 0)

Here is one approach using simple iteration.

For example:

import os
import pandas as pd

#Sample Data#
#dfOne = pd.DataFrame({'Reference': [1123, 1233, 1334, 4444, 5555],'EmailAddress': ["bob.smith@yahoo.com", "john.drako@gmail.com", "samuel.manuel@yahoo.com", np.nan, "samuel.manuel@yahoo.com"]})
dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")
dfOne.dropna(inplace=True)  #Drop rows with NaN

for root, dirs, files in os.walk("applicantsCVs"):
    for file in files:
        file_name, ext = os.path.splitext(file)
        #Exact match on the reference; a substring test would let "112" match "1123"
        email = dfOne[dfOne['Reference'].astype(str) == file_name]["EmailAddress"]
        if not email.empty:
            os.rename(os.path.join(root, file), os.path.join(root, email.values[0]+ext))

Or, if you only have .docx files to rename:

import os
import re
import pandas as pd

dfOne = pd.read_excel('Book2.xlsx', na_values=['NA'], usecols = "A:D")

dfOne.dropna(inplace=True)  #Drop rows with NaN before converting to str
dfOne["Reference"] = dfOne["Reference"].astype(str)
ext = ".docx"
for root, dirs, files in os.walk("applicantsCVs"):
    #Group the alternatives so the word boundaries apply to every file name
    pattern = r"\b(?:" + "|".join(re.escape(os.path.splitext(i)[0]) for i in files) + r")\b"
    matched = dfOne[dfOne['Reference'].str.contains(pattern, regex=True)]
    for ref, email in matched[["Reference", "EmailAddress"]].values:
        os.rename(os.path.join(root, ref+ext), os.path.join(root, email+ext))
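One thing to watch when matching references against file names with str.contains is that it performs a substring search, so a short name can match a longer reference. A small sketch with made-up reference values shows the difference from an exact comparison:

```python
import pandas as pd

refs = pd.Series(["1123", "1233", "11234"])  # hypothetical reference values

file_stem = "1123"
# Substring search: also picks up '11234'
substring_hits = refs[refs.str.contains(file_stem, regex=False)].tolist()
# Exact comparison: only the identical reference
exact_hits = refs[refs == file_stem].tolist()

print(substring_hits)
print(exact_hits)
```

When the file names are expected to equal the references exactly, the equality test (or Series.isin for several names at once) avoids accidental matches.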

Answer 4: (score: 0)

Let us consider the following sample data in the Excel sheet:

Reference   EmailAddress
1123    bob.smith@yahoo.com
1233    john.drako@gmail.com
1334    samuel.manuel@yahoo.com
nan     python@gmail.com

Solving this problem involves the following steps.

Step 1

Import the data properly from the Excel sheet "my.xlsx". I am using the sample data here.

import pandas as pd
import os
#import data from excel sheet and drop rows with nan 
df = pd.read_excel('my.xlsx').dropna()
#check the head of data if the data is in desirable format
df.head() 

You will see here that the data type of the Reference column is float.

Step 2

Change the data type of the Reference column to integer and then to string.

#astype returns a new object and has no inplace argument
df['Reference'] = df['Reference'].astype(int)
df = df.astype(str)
df.head()

The data is now in the desired format.

Step 3

Rename the files in the desired folder. Zip the "Reference" and "EmailAddress" lists so they can be used together in a for loop.

#absolute path to the folder. I assume the folder "applicant cv" is in the home directory
path_to_files='/home/applicant cv/'
for ref,email in zip(list(df['Reference']),list(df['EmailAddress'])):
    try: 
        os.rename(path_to_files+ref+'.doc',path_to_files+email+'.doc')
    except FileNotFoundError:
        #no file is named after this reference, so there is nothing to rename
        print ("No file exists for this reference, I am leaving it as it is")

Answer 5: (score: 0)

Step 1: Import the data from the Excel sheet "Book1.xlsx"

import pandas as pd
df = pd.read_excel (r'path of your file here\Book1.xlsx')        
print (df)

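Step 2: Choose the path where the ".docx" files are located and store their names. Keep only the relevant part of the file name for the comparison.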

mypath = r'path of docx files\doc files'
from os import listdir,rename
from os.path import isfile, join
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
#print(onlyfiles)
currentfilename=onlyfiles[0].split(".")[0]


Step 3: Run a loop to check whether the name matches a reference, then simply use the rename(src, dst) function from os.

#currentfilename is the stem of the first file found in Step 2
for i in range(len(df)):
    #print(currentfilename,df['ref'][i])
    if str(currentfilename)==str(df['Reference'][i]):
        corresponding_email=df['EmailAddress'][i]
        #print(mypath+"\\"+corresponding_email)
        #rename inside the if, so it only runs when a match was found
        rename(join(mypath, str(currentfilename)+".docx"), join(mypath, corresponding_email+".docx"))

Check out the code with an example: https://github.com/Vineet-Dhaimodker