Question

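I have a very large list of lists (13 lists, each with about 41 million entries, so roughly 500 million entries in total, and every entry is a short string). I need to take two of the sublists and find their union, i.e. all the unique elements across the two, and save them to a new list in the most memory-efficient way. Order does not matter. One way to do it is: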

c = a[1] + a[2]
c = set(c)

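But is this the most memory-efficient way? A further complication is that some entries in a[1] or a[2] may contain more than one element (i.e. look like a[1] = [['val1'], ['val2', 'val3'], ...]). How can I best handle this so that val2 and val3 appear as separate entries in the final result?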
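To make the nested-entry case concrete, here is a minimal sketch that builds the union directly in a set, without first concatenating the two sublists into a temporary list. The function name union_flat is hypothetical, and the sketch assumes every entry is either a plain string or a list/tuple of strings, as in the example above.

```python
def union_flat(*sublists):
    """Collect the unique strings found in the given sublists.

    Assumption: each entry is either a str or a list/tuple of str.
    """
    unique = set()
    for sub in sublists:
        for entry in sub:
            if isinstance(entry, str):
                # A bare string is a single value.
                unique.add(entry)
            else:
                # A nested list such as ['val2', 'val3'] contributes
                # each of its items as a separate value.
                unique.update(entry)
    return unique


a1 = [['val1'], ['val2', 'val3'], 'val4']
a2 = [['val3'], ['val5']]
c = union_flat(a1, a2)
```

Because the set is filled item by item, the only large object created is the result itself; wrap it in list(...) at the end if a list is required.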

Answer 1

I am not 100% sure this is the most efficient way, but I find it the simplest:

l3 = set(l1)
l3.update(l2)
l3 = list(l3)

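This version should not allocate more memory than necessary, though every membership test scans the whole list, so it is quadratic in time and very slow for lists this large: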

l3 = []
for i in l1:
    if i not in l3:
        l3.append(i)
for i in l2:
    if i not in l3:
        l3.append(i)
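If the order of first appearance matters, a common variation is to keep a set next to the output list, so that each membership test takes roughly constant time. This is only a sketch: the name ordered_union is hypothetical, and the extra set costs additional memory, which is the trade-off against the list-only loop.

```python
def ordered_union(l1, l2):
    """Return the unique items of l1 followed by those of l2, in first-seen order."""
    seen = set()     # fast membership checks
    result = []      # preserves the order of first appearance
    for source in (l1, l2):
        for item in source:
            if item not in seen:
                seen.add(item)
                result.append(item)
    return result


l3 = ordered_union(['a', 'b', 'a'], ['b', 'c'])
```

Since the question says order is not required, the plain set version stays the simpler choice; this variant only helps when the order has to be kept.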

Answer 2

For short strings, sets are more efficient than numpy. With this data:


Just do this:

from random import randint

N = 13
M = 100000
# Example data: N lists, each holding M random 4-letter uppercase strings.
ll = [["".join('ABCDEFGHIJKLMNOPQRSTUVWXYZ'[randint(0, 25)] for k in range(4))
       for l in range(M)] for h in range(N)]

Provided the data fits in memory, the result for M = 41 000 000 should take a few minutes.
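The set operation applied to the example data is not shown here. The following is a minimal sketch, assuming it is a plain union of two sublists built with set and update; the sizes are deliberately far smaller than the real data, and the timing is only illustrative.

```python
from random import randint
from time import perf_counter

N, M = 13, 1000  # assumption: small sizes so the sketch runs quickly

# Same shape as the example data: N lists of M random 4-letter strings.
ll = [["".join(chr(ord('A') + randint(0, 25)) for _ in range(4))
       for _ in range(M)] for _ in range(N)]

start = perf_counter()
unique = set(ll[1])    # one set built from the first sublist
unique.update(ll[2])   # second sublist added in place, no temporary list
elapsed = perf_counter() - start

print(len(unique), "unique strings in", elapsed, "seconds")
```

Using update rather than ll[1] + ll[2] avoids building a concatenated list of both sublists before the duplicates are dropped.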

Find unique values in two very large lists in the most memory-efficient way
