Get the latest file per category from a directory (more than one file)

Time: 2017-11-03 15:08:34

Tags: python pandas glob

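I am trying to pull the latest 'apple', 'pear' and other .csv files stored in a directory with Python. New files are saved with the same prefix but at different frequencies (for example, the apple_ file is updated every 5 days). I have looked at latestfile = max(filenames, key=os.path.getctime), but is there something like .startswith to do this per category? Specifically, if there is a melon_ csv, I want to pull it even if it was saved several months ago.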

"""
fileDir contains csv files such as:

pear_20171102_report2.csv
apple_20171027_report2.csv
orange_20171101_report2.csv
kiwi 20171102 report2.csv
pear_20171101_report2.csv
cherry 20171101 report2.csv
kiwi 20171101 report2.csv
cherry 20171031_report2.csv
mango 20171001 report2.csv
apple_20171101_report2.csv
apple_20171102_report2.csv
...
"""

import glob
import os
import re

fileDir = r'\\ac2knyc05\TestData/'

filenames = glob.glob(fileDir+'*')
regex = re.compile(r'\d{8}')
dates = []
prefix = []

for filename in filenames:
    try:
        date = regex.search(filename).group()
        dates.append(date)
        prefix.append(filename.split(date)[0])

    except AttributeError:
        print(filename)

latestfile = max(filenames, key=os.path.getctime)

print(set(prefix)) 

I am stuck here and not sure how to continue. Maybe pandas?

2 Answers:

Answer 0 (score: 2)

No need for pandas; you can use groupby from itertools:

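import os
from itertools import groupby

def key(filename):
    # Category = part of the base name before the first "_" or " ".
    # basename keeps "_" or " " in the directory path from interfering.
    return os.path.basename(filename).replace(" ", "_").split("_")[0]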

{k: max(g, key=os.path.getctime)
     for k, g in groupby(sorted(filenames, key=key), key)}

This gives you a dictionary mapping each category to its latest file.
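
A minimal sketch of why the sorted(...) call matters here: groupby only merges adjacent items with equal keys, so on the unsorted listing a category can show up in several separate runs, and a dict comprehension would then keep only the last run. The list `names` below simply copies the sample names from the question, and `category` is an illustrative helper, not part of the answer.

```python
from itertools import groupby

# Sample names copied from the question, in their original (unsorted) order.
names = [
    "pear_20171102_report2.csv",
    "apple_20171027_report2.csv",
    "orange_20171101_report2.csv",
    "kiwi 20171102 report2.csv",
    "pear_20171101_report2.csv",
    "cherry 20171101 report2.csv",
    "kiwi 20171101 report2.csv",
    "cherry 20171031_report2.csv",
    "mango 20171001 report2.csv",
    "apple_20171101_report2.csv",
    "apple_20171102_report2.csv",
]

def category(name):
    # Part of the name before the first "_" or " ".
    return name.replace(" ", "_").split("_")[0]

# Without sorting, each run of adjacent equal keys becomes its own group.
unsorted_runs = [k for k, _ in groupby(names, key=category)]

# Sorting by the same key first puts every category in one run.
sorted_runs = [k for k, _ in groupby(sorted(names, key=category), key=category)]

print(len(unsorted_runs))  # 10 runs for only 6 categories
print(sorted_runs)         # ['apple', 'cherry', 'kiwi', 'mango', 'orange', 'pear']
```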

Note: you can also get this in a single pass with a for loop:

res = {}
for f in filenames:
    k, t = key(f), os.path.getctime(f)
    if k not in res:
        res[k] = f, t
    else:
        _, t_ = res[k]
        if t > t_:
            res[k] = f, t

[f for f, _ in res.values()]  # list of the latest file for each category
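
One caveat with os.path.getctime: on Unix it reports the time of the last metadata change rather than the creation time, and copying files can reset it. As a hedged alternative, the sketch below runs the same single-pass loop but compares the 8-digit date embedded in each name, reusing the \d{8} pattern from the question. It assumes that this date is the one you care about; `category`, `name_date` and `latest` are illustrative names.

```python
import re

DATE = re.compile(r"\d{8}")  # same pattern as in the question

names = [
    "pear_20171102_report2.csv",
    "apple_20171027_report2.csv",
    "orange_20171101_report2.csv",
    "kiwi 20171102 report2.csv",
    "pear_20171101_report2.csv",
    "cherry 20171101 report2.csv",
    "kiwi 20171101 report2.csv",
    "cherry 20171031_report2.csv",
    "mango 20171001 report2.csv",
    "apple_20171101_report2.csv",
    "apple_20171102_report2.csv",
]

def category(name):
    # Part of the name before the first "_" or " ".
    return name.replace(" ", "_").split("_")[0]

def name_date(name):
    # YYYYMMDD strings compare in date order; names without a date sort first.
    match = DATE.search(name)
    return match.group() if match else ""

latest = {}
for name in names:
    cat = category(name)
    # Keep the name with the greatest embedded date seen so far.
    if cat not in latest or name_date(name) > name_date(latest[cat]):
        latest[cat] = name

for cat in sorted(latest):
    print(cat, "->", latest[cat])
```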

Answer 1 (score: 1)

No need for pandas. You can simply put the file names into a dictionary of lists:

filenames = """pear_20171102_report2.csv
apple_20171027_report2.csv
orange_20171101_report2.csv
kiwi 20171102 report2.csv
pear_20171101_report2.csv
cherry 20171101 report2.csv
kiwi 20171101 report2.csv
cherry 20171031_report2.csv
mango 20171001 report2.csv
apple_20171101_report2.csv
apple_20171102_report2.csv"""

categories = {}
for filename in filenames.split("\n"):
    start_with = filename.split(' ')[0].split('_')[0]
    categories.setdefault(start_with, []).append(filename)

print(categories)
# {'pear': ['pear_20171102_report2.csv', 'pear_20171101_report2.csv'], 'apple': ['apple_20171027_report2.csv', 'apple_20171101_report2.csv', 'apple_20171102_report2.csv'], 'orange': ['orange_20171101_report2.csv'], 'kiwi': ['kiwi 20171102 report2.csv', 'kiwi 20171101 report2.csv'], 'cherry': ['cherry 20171101 report2.csv', 'cherry 20171031_report2.csv'], 'mango': ['mango 20171001 report2.csv']}

For each category, you now have a list that you can sort by ctime.
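
A rough sketch of that last step, assuming the names in `categories` are bare file names that live in a single directory. The function name `latest_per_category` and its parameters are hypothetical. Taking max() is enough when only the newest file is needed, so no full sort is done.

```python
import os

def latest_per_category(categories, directory, key=os.path.getctime):
    """Return {category: full path of the newest file} for a dict of name lists.

    `key` defaults to os.path.getctime, which needs the files to exist on disk;
    any other function from path to a sortable value can be passed instead.
    """
    return {
        cat: max((os.path.join(directory, name) for name in names), key=key)
        for cat, names in categories.items()
    }
```

For example, latest_per_category(categories, fileDir) would return one full path per category, provided the files actually exist under fileDir.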