Extracting what each package provides from a text file using regex in Python

Time: 2016-04-04 17:51:02

Tags: python regex python-3.4

It has been a while since I started working on this, but I can't figure out how to resolve my problem.

I have multiple paragraphs like those in the Packages.gz file available at http://fr.archive.ubuntu.com/ubuntu/dists/trusty-security/main/binary-amd64/

I would like your help processing it with a regular expression so that I end up with a dictionary whose keys are the packages and whose values are lists of the packages they provide.

As you can see, some packages provide one or more packages and others don't. My best regular expression so far is the following:

    ((?<=Package: ).*)|((?<=Provides: )(?:[, ]*[a-zA-Z0-9-+.]*))

It stops at the first package on the "Provides:" line, but I need all of them, without the ", " separators.

Any help is appreciated. Thank you.

2 answers:

Answer 0 (score: 1)

You don't need to reinvent the wheel here. The python-apt library already implements the text file parsing you want. I recommend using it. It will give you the list of provides for a package.
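
For illustration only, here is a standard-library sketch of the stanza-by-stanza parsing that such a library performs. It is not python-apt's API: `parse_stanzas` and `provides_map` are hypothetical names, the sample stanzas are made up, and the sketch assumes stanzas are separated by blank lines and continuation lines start with whitespace.

```python
def parse_stanzas(text):
    """Yield one dict of field -> value per blank-line-separated stanza."""
    for block in text.strip().split('\n\n'):
        fields = {}
        key = None
        for line in block.splitlines():
            if line[:1] in (' ', '\t') and key is not None:
                # Continuation line: append it to the previous field.
                fields[key] += '\n' + line.strip()
            elif ':' in line:
                key, _, value = line.partition(':')
                fields[key] = value.strip()
        if fields:
            yield fields


def provides_map(text):
    """Map each package name to the list of names in its Provides field."""
    result = {}
    for fields in parse_stanzas(text):
        if 'Package' not in fields:
            continue  # Skip stanzas that do not describe a package.
        provides = fields.get('Provides', '')
        result[fields['Package']] = [
            name.strip() for name in provides.split(',') if name.strip()
        ]
    return result


# Made-up sample data, not taken from the real Packages file.
sample = (
    "Package: foo\n"
    "Provides: bar, baz\n"
    "Description: example\n"
    " more text\n"
    "\n"
    "Package: qux\n"
    "Description: nothing provided\n"
)
print(provides_map(sample))  # {'foo': ['bar', 'baz'], 'qux': []}
```

Parsing whole stanzas into field dictionaries first makes it easy to read other fields later, which is the main convenience a dedicated library gives you.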

Answer 1 (score: 0)

Here is a program that builds a dict object to map "package" lines to lists representing "provides" lines.

It uses a regular expression, and re.findall, as requested.

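    import re
    from pprint import pprint

    with open('Packages', encoding='utf-8') as fp:
        text = fp.read()

    # Flags are passed as an argument: an inline (?smx) placed after leading
    # whitespace in the pattern is rejected by newer Python versions (3.11+).
    pairs = re.findall(
        r'''
        ^Package:\s*(.*?)$      # The package line
        .*?                     # Arbitrary lines
        (?:
            ^Provides:\s*(.*?)$ # The provides line
            |                   # OR
            ^$                  # an empty line
        )
        ''',
        text,
        re.S | re.M | re.X)     # Dot matches all, Multiline, Verbose

    # Split each "Provides" value on commas and strip the surrounding spaces.
    data = {k: [p.strip() for p in v.split(',')] if v else [] for k, v in pairs}

    pprint(data)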
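
To try this approach without downloading the file, here is a small sketch that runs on an inline sample. The sample stanzas and the name `provides_by_regex` are made up for illustration. A lookbehind such as `(?<=Provides: )` only anchors the text directly after the field name, so it can match just the first item; capturing the whole line and then splitting it with `re.split` returns every provided name.

```python
import re

PATTERN = re.compile(
    r'''
    ^Package:\s*(.*?)$          # group 1: the package name
    .*?                         # lazily skip the other lines of the stanza
    (?:
        ^Provides:[ \t]*(.*?)$  # group 2: the whole Provides line
        |
        ^$                      # or the blank line that ends the stanza
    )
    ''',
    re.S | re.M | re.X,
)


def provides_by_regex(text):
    """Map each package to the list of names on its Provides line."""
    result = {}
    for package, provides in PATTERN.findall(text):
        # An unmatched group 2 comes back as '', meaning "provides nothing".
        result[package] = re.split(r'\s*,\s*', provides) if provides else []
    return result


# Made-up sample data, not taken from the real Packages file.
sample = (
    "Package: foo\n"
    "Version: 1.0\n"
    "Provides: bar, baz\n"
    "\n"
    "Package: qux\n"
    "Version: 2.0\n"
    "\n"
)
print(provides_by_regex(sample))  # {'foo': ['bar', 'baz'], 'qux': []}
```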

Alternatively, here is a solution that does not use a regular expression. It runs slightly faster on my PC, on your 70,000-line file. The speed difference is largely irrelevant, however: it is less than 0.02 seconds.

    from pprint import pprint

    def gen():
        package = None  # No stanza seen yet
        with open('Packages', encoding='utf-8') as fp:
            for line in fp:
                if line.startswith('Package:'):
                    package = line.split(':', 1)[1].strip()
                elif package and line.startswith('Provides:'):
                    # Strip the spaces that follow each comma
                    names = line.split(':', 1)[1].split(',')
                    yield package, [name.strip() for name in names]
                    package = None
                elif package and line == '\n':
                    # End of a stanza that has no Provides line
                    yield package, []
                    package = None
            if package:
                yield package, []

    data = dict(gen())

    pprint(data)
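
Since the question mentions Packages.gz, here is a sketch of the same line-by-line idea reading the compressed file directly, so no separate decompression step is needed. The names `iter_provides` and `load_provides` are hypothetical, the file is assumed to be UTF-8, and the demo writes a throwaway file with made-up stanzas.

```python
import gzip
import os
import tempfile


def iter_provides(lines):
    """Yield (package, provides_list) pairs from an iterable of lines."""
    package = None
    for line in lines:
        if line.startswith('Package:'):
            package = line.split(':', 1)[1].strip()
        elif package and line.startswith('Provides:'):
            names = line.split(':', 1)[1].split(',')
            yield package, [name.strip() for name in names]
            package = None
        elif package and not line.strip():
            # Blank line: the stanza ended without a Provides line.
            yield package, []
            package = None
    if package:
        yield package, []


def load_provides(path):
    # Mode 'rt' decompresses and decodes to text in one step.
    with gzip.open(path, 'rt', encoding='utf-8') as fp:
        return dict(iter_provides(fp))


# Demo on a throwaway compressed file with made-up stanzas.
sample = "Package: foo\nProvides: bar, baz\n\nPackage: qux\nVersion: 1\n"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'Packages.gz')
    with gzip.open(path, 'wt', encoding='utf-8') as fp:
        fp.write(sample)
    data = load_provides(path)

print(data)  # {'foo': ['bar', 'baz'], 'qux': []}
```

Because `iter_provides` takes any iterable of lines, the same function also works on an already opened plain-text file or on `text.splitlines(True)`.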