Matlab - Parallel file search using afterEach

Time: 2021-06-06 00:02:58

Tags: performance matlab parallel-processing

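I am trying to implement a function in Matlab that searches for files and runs in parallel to speed up the process. I have already done this successfully with the following function: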

function matches = searchfordata(starting_path, search_depth, checkFunction)

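arguments
    % A validator must throw on bad input; isfolder only returns a logical,
    % so mustBeFolder (R2020b or later) is used here instead.
    starting_path {mustBeFolder}
    search_depth int64
    checkFunction function_handle
end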
tic;

folders = struct('name', '' , 'folder', starting_path);

dataMap = containers.Map('KeyType', 'double', 'ValueType', 'any');


next_folders = [];
matches = [];
no_folders = height(folders);
current_depth = 0;
total_folders = 0;

if search_depth < 0
    search_depth = 9001;
end

while current_depth <= search_depth && no_folders > 0
    
    
    total_folders = total_folders + no_folders;
    
    
    
    
    parfor n = 1:no_folders
        
        
        path = strcat(folders(n).folder, filesep, folders(n).name);
        [files, cfolders] = filesandfolders(path);
        
        if height(files) > 0
            check = checkFunction(files);
        else
            check = [];
        end
        
        matches = [matches;check];
        next_folders = [next_folders; cfolders];
        
        
    end
    
    
    if height(matches) > 0
        dataMap(current_depth) = matches;
        matches = [];
    end
    
    folders = next_folders;
    no_folders = height(folders);
    next_folders = [];
    
    current_depth = current_depth + 1;
    
end

matches = dataMap;
toc;
end

Other functions/classes involved:

function [files, folders] = filesandfolders(path)
%FILESANDFOLDERS List the contents of PATH, split into files and folders.
%   Both outputs are dir-style struct arrays; '.' and '..' are removed from FOLDERS.
directory_contents = dir(path);
files = directory_contents(~[directory_contents.isdir]);
folders = directory_contents([directory_contents.isdir]);
folders = folders(~ismember({folders.name}, {'.', '..'}));
end

This is basically a dir call that splits the result into files and folders and removes '.' and '..' from the folder results.
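As a quick check of that splitting logic, here is a minimal sketch that runs the same steps on a throwaway folder. The scratch location and the file and folder names (note.txt, sub) are made up for illustration; the body of filesandfolders is inlined so the snippet runs on its own.

```matlab
% Build a small scratch tree: one file and one subfolder (names are illustrative).
root = tempname;
mkdir(fullfile(root, 'sub'));
fclose(fopen(fullfile(root, 'note.txt'), 'w'));

% Same steps as filesandfolders, inlined here.
directory_contents = dir(root);
files   = directory_contents(~[directory_contents.isdir]);
folders = directory_contents([directory_contents.isdir]);
folders = folders(~ismember({folders.name}, {'.', '..'}));

disp({files.name});     % file names only
disp({folders.name});   % subfolder names, without '.' and '..'

rmdir(root, 's');       % remove the scratch tree again
```

Both outputs stay dir-style struct arrays, which is why the calling code can read the .folder and .name fields directly.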

function boolobject = checkFiles(files)
%CHECKFILES Checks given files for powercycler files
%   Returns a FolderData flagging which file types were found, or [] if none match.
%cyclingregex = 'cycling_parameters\.xml';
%transientregex = '\.(pol|par|raw)$';
cyclingregex = '\.txt$';
transientregex = '\.txt$';
matching_cycling = regexpi({files.name}, cyclingregex, 'Match');
matching_transient = regexpi({files.name}, transientregex, 'Match');
cycling_indices = ~cellfun(@isempty, matching_cycling);
transient_indices = ~cellfun(@isempty, matching_transient);


boolobject = FolderData(files(1).folder);
boolobject.cyclingData = any(cycling_indices);
boolobject.rthData = any(transient_indices);


if boolobject.cyclingData || boolobject.rthData
    
    return
else
    boolobject = [];
end


end
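The filtering step in checkFiles can be tried in isolation. The sketch below applies the same regexpi and cellfun combination to a hand-written list of file names; the names are invented for the example.

```matlab
% Hypothetical file names, chosen to show the case-insensitive, end-anchored match.
names   = {'log.txt', 'README.TXT', 'cycle.pol', 'notes.txt.bak'};
pattern = '\.txt$';

% regexpi returns one cell per name; the cell is empty when nothing matched.
matching = regexpi(names, pattern, 'match');
is_match = ~cellfun(@isempty, matching);

disp(names(is_match));   % names whose extension is .txt, ignoring case
```

Because the pattern is anchored with $, a name such as notes.txt.bak is not counted as a match.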

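This takes the file list returned by filesandfolders as input and filters it for the files I am searching for. I changed the patterns to txt for better reproducibility. The output of this function is an instance of this class: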

classdef FolderData < handle
    %FOLDERDATA Records which kinds of data files were found in a folder.
    
    properties
        folder %folderpath
        rthData %bool, true if this folder contains rth-data files
        cyclingData %bool, true if this folder contains cycling-data files
    end
    
    methods
        function this = FolderData(path)
            this.rthData = false;
            this.cyclingData = false;
            this.folder = path;
        end
                
    end
end

It simply records which kinds of files were found and in which folder.

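The actual search function at the top takes 8-30 seconds on my drive and works. Now I thought I could speed this up with afterEach. The basic idea: if the folders processed in one parfor loop differ greatly in how much they contain, a single large folder blocks the whole process, because the parfor loop has to finish before the function can continue with the next level.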
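For reference, the basic DataQueue pattern looks roughly like this: workers call send, and the function registered with afterEach is invoked on the client for each item received. This is only a minimal sketch; it assumes Parallel Computing Toolbox is available, and the message text is made up.

```matlab
% Queue created on the client; workers can send data to it.
q = parallel.pool.DataQueue;

% Callback registered on the client; it is called once per item received.
afterEach(q, @(msg) fprintf('client received: %s\n', msg));

parfor n = 1:4
    send(q, sprintf('item %d', n));   % sent from a worker back to the client
end
```

The order in which the items are printed is not fixed, since it depends on how the iterations are scheduled across the workers.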

To try the afterEach idea, I created the following script:

clc;
clear all;

path = 'D:\';
if isempty(gcp('nocreate'))
    parpool(4);
end
fun = @checkFiles;

output = searchparforae(path, fun);
%output = searchforae(path, fun);
%output = searchfordata(path, 100, fun);



function matches = searchparforae(starting_path, checkFunction)

tic;


folder_que = parallel.pool.DataQueue;
matches = [];

listener = afterEach(folder_que, @search_folder);
starting_path = struct('folder', starting_path, 'name', '');
search_folder(starting_path);


    function search_folder(input)
        
        parfor n = 1:height(input)
            folder_path = strcat(input(n).folder, filesep, input(n).name);
            % Print the path as data, not as the format string, so backslashes are not treated as escapes.
            fprintf(1, '%s\n', folder_path);
            [files, folders] = filesandfolders(folder_path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            
            matches = [matches;check];
            send(folder_que, folders);
        end
    end
toc;
end

function matches = searchforae(starting_path, checkFunction)

tic;


folder_que = parallel.pool.DataQueue;
matches = [];

listener = afterEach(folder_que, @search_folder);
starting_path = struct('folder', starting_path, 'name', '');
search_folder(starting_path);


    function search_folder(input)
        
        for n = 1:height(input)
            folder_path = strcat(input(n).folder, filesep, input(n).name);
            % Print the path as data, not as the format string, so backslashes are not treated as escapes.
            fprintf(1, '%s\n', folder_path);
            [files, folders] = filesandfolders(folder_path);
            if height(files) > 0
                check = checkFunction(files);
            else
                check = [];
            end
            
            matches = [matches;check];
            send(folder_que, folders);
        end
    end
toc;
end

The two functions "searchforae" and "searchparforae" are identical except for the loop. As the names suggest, "searchforae" uses a for loop, while "searchparforae" uses a parfor loop.

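Now, searchforae does not work at all. The printout shows that searchforae only processes the initially given directory and the directories directly below it. Printout: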

D:\
D:\$RECYCLE.BIN
D:\Downloads
D:\OneDriveTemp
D:\Programme
D:\Repositories
D:\Sonstiges
D:\Spiele
D:\System Volume Information
D:\Uni
D:\Uni2
D:\Users
D:\Zwischenablage

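In contrast, the searchparforae function works just like the searchfordata function at the top, but it takes 5-10 minutes instead of 8-30 seconds. Am I using afterEach wrong? Why does it take so long? And why does searchforae not work properly, even though the only difference from searchparforae is the for loop instead of the parfor loop?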
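For comparison, the "do not wait for the slowest folder" idea can also be sketched without a DataQueue, using parfeval and fetchNext: each folder becomes its own task, and new tasks are submitted as soon as any task finishes. This is a hedged sketch, not a tested replacement for the functions above. It assumes Parallel Computing Toolbox, only walks the folder tree (no checkFunction call), and all variable names and the scratch tree are invented for the example.

```matlab
% Sketch only: requires Parallel Computing Toolbox; all names are illustrative.
pool = gcp();                                    % reuse the current pool or start one

% Scratch tree so the sketch is reproducible: root/a/b and root/c.
root = tempname;
mkdir(fullfile(root, 'a', 'b'));
mkdir(fullfile(root, 'c'));

% Return the subfolders of p as a cell array of full paths, skipping '.' and '..'.
keepDirs   = @(d) d([d.isdir] & ~ismember({d.name}, {'.', '..'}));
subfolders = @(p) arrayfun(@(e) fullfile(e.folder, e.name), ...
                           keepDirs(dir(p)), 'UniformOutput', false);

pending   = parfeval(pool, subfolders, 1, root); % one task per folder
visited   = {root};
completed = 0;
while completed < numel(pending)
    [~, found] = fetchNext(pending);             % next finished task, in any order
    completed = completed + 1;
    for k = 1:numel(found)
        visited{end+1} = found{k};                                %#ok<SAGROW>
        pending(end+1) = parfeval(pool, subfolders, 1, found{k}); %#ok<SAGROW>
    end
end

fprintf('%d folders visited\n', numel(visited));
rmdir(root, 's');                                % remove the scratch tree again
```

One task per folder carries scheduling and data-transfer overhead of its own, so whether this beats the level-by-level parfor version would have to be measured on the actual drive.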

0 answers