Filter all files in a directory to find words matching multiple regular expressions

Time: 2018-12-17 17:38:37

Tags: python regex glob pypdf2 os.path

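I am trying to filter all the files in a directory (pdf, txt, csv, ipynb, etc.) to find words that match my regular expressions. So far I have written a program (shown below) that reads csv and pdf files, but the else branch (which should read all other file types) always gives me an error (described below). Am I writing something wrong after the else: statement? I have tried everything, to no avail.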

I get an error message saying IsADirectoryError: [Errno 21] Is a directory. Do you know why this error keeps appearing every time I run my code?

import glob
import re
import PyPDF2
#-------------------------------------------------Input----------------------------------------------------------------------------------------------
folder_path = "/home/"
file_pattern = "/*"
folder_contents = glob.glob(folder_path + file_pattern)

#Search for Emails
regex1= re.compile(r'\S+@\S+')
#Search for Phone Numbers
regex2 = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')
#Search for Locations
regex3 =re.compile("([A-Z]\w+), ([A-Z]{2})")


for file in folder_contents:

    if re.search(r".*(?=pdf$)",file):
        #this is pdf
        with open(file, 'rb') as pdfFileObj:
            pdfReader = PyPDF2.PdfFileReader(pdfFileObj) 
            pageObj = pdfReader.getPage(0)  
            read_file = pageObj.extractText() 
            #print("{}".format(file))
    elif re.search(r".*(?=csv$)",file):
        #this is csv
        with open(file,"r+",encoding="utf-8") as csv:
            read_file = csv.read()
    else:
            with open(file,"rt", encoding='latin-1') as allOtherFiles:
                continue
    if regex1.findall(read_file) or regex2.findall(read_file) or regex3.findall(read_file):
        print ("YES, This file containts PHI")
        print(file)
    else:
        print("No, This file DOES NOT contain PHI")
        print(file)

1 Answer:

Answer 0: (score: 1)

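Could you try changing the with open(file,"rt") as allOtherFiles: statement to

with open(file,"rt", encoding='latin-1') as allOtherFiles:

and run the code again to see whether you get the same error? If the error persists, we will have to try other encodings.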
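
The idea of trying other encodings could be sketched roughly as follows. The helper name `read_text_with_fallback` and the default encoding list are assumptions made for illustration and do not appear in the original code. Note that latin-1 maps every possible byte to a character, so it never raises a decode error and can serve as a last resort.

```python
def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Return the file's text, decoded with the first encoding that works.

    Hypothetical helper: the name and the default encodings are assumptions.
    """
    last_error = None
    for encoding in encodings:
        try:
            with open(path, "rt", encoding=encoding) as handle:
                return handle.read()
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next encoding
    if last_error is None:
        raise ValueError("no encodings were given")
    raise last_error
```

In the else branch this would be used as `read_file = read_text_with_fallback(file)`.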

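Edit: To resolve the next error:

IsADirectoryError: [Errno 21] Is a directory: /home/e136320/jupyter_shared_notebooks

This is caused by the folder named jupyter_shared_notebooks inside the directory being searched.
glob returns directories as well as files, and open() cannot open a directory, so Python raises this error.
To work around it, you can skip any path that contains no dot (that is, no file extension). This is only a heuristic: it also skips regular files without an extension, and a directory whose path contains a dot would still trigger the error.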

if '.' not in file:
    # no extension: most likely a directory, so skip it
    continue
else:
    with open(file,"rt", encoding='latin-1') as allOtherFiles:
        read_file = allOtherFiles.read()
        #rest of your code here
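
A more direct way to skip directories is to ask the operating system whether each path is a regular file, using os.path.isfile, instead of looking for a dot in the name. The sketch below shows this under some assumptions: the names `contains_phi` and `scan_folder` are hypothetical, the three patterns are copied from the question, and only the plain-text branch is covered (the pdf and csv handling from the question is left out for brevity).

```python
import glob
import os
import re

# Patterns copied from the question
EMAIL = re.compile(r'\S+@\S+')
PHONE = re.compile(r'\d\d\d[-]\d\d\d[-]\d\d\d\d')
LOCATION = re.compile(r"([A-Z]\w+), ([A-Z]{2})")


def contains_phi(text):
    """Return True if any of the three patterns matches the text."""
    return any(rx.search(text) for rx in (EMAIL, PHONE, LOCATION))


def scan_folder(folder_path):
    """Map each regular file in folder_path to True/False; skip directories.

    Hypothetical helper for illustration; reads every file as latin-1 text.
    """
    results = {}
    for path in glob.glob(os.path.join(folder_path, "*")):
        if not os.path.isfile(path):
            continue  # directories cannot be opened with open()
        with open(path, "rt", encoding="latin-1") as handle:
            results[path] = contains_phi(handle.read())
    return results
```

With this check, a folder such as jupyter_shared_notebooks is skipped, while a regular file without an extension is still read.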