
Question

I have a list of points that looks like this:

points = [(54592748,54593510),(54592745,54593512), ...]

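Many of these points are similar, in the sense that points[n][0] is almost equal to points[m][0] and points[n][1] is almost equal to points[m][1], where "almost equal" means within whatever integer I decide. I want to filter all the similar points out of the list and keep only one of them.

Here is my code.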

points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10 # max distance allowed between two points
to_compare = points[:] # make a list of item to compare
to_remove = set() # keep track of items to be removed

for point in points:
    to_compare.remove(point) # do not compare with itself
    for other_point in to_compare:
        if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
            to_remove.add(other_point)

for point in to_remove:
    points.remove(point)

It works...

>>> points
[(54592748, 54593510), (117628626, 117630648), (1354358, 1619520)]

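But I am looking for a faster solution, because my list is millions of items long.

PyPy helped a lot, speeding the whole process up 6 times, but there is probably a more efficient way to do this in the first place, isn't there?

Any help is very welcome.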

=======

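Update

I tested some of the answers with a list of points that you can pickle.load() from here: https://mega.nz/#!TVci1KDS!tE5fTnjpPwbvpFTmW1TLsVXDvYHbRF8F7g10KGdOPCs

My code takes 1104 seconds and reduces the list to 96428 points (from 99920). David's code does the job in 14 seconds! But it misses some, leaving 96431 points. Martin's code takes 0.06 seconds!! But it also misses some, leaving 96462 points.

Any clue as to why the results are not the same?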

Answer 1

Depending on how accurate you need to be, the following approach should work well:

points = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648), (1354358, 1619520), (54592746, 54593509)]
d = 20
hpoints = {((x - (x % d)), (y - (y % d))) : (x,y) for x, y in points}

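for x in hpoints.values():
    print(x)

This turns each point into a dictionary key, with each x and y coordinate rounded down to the nearest multiple of d. The result is a dictionary that holds the last point seen in each d-by-d cell. For the data you provided, this prints the following (the order of the lines can differ depending on how your Python version orders dictionaries):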

(117628626, 117630648)
(54592746, 54593509)
(1354358, 1619520)

Answer 2

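Sorting the list first avoids the inner for loop, and with it the n^2 running time. I am not sure whether it is actually faster, since I don't have your full data. Try this (on your sample points it gives the same output as far as I can see, just in a different order).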

points = [(54592748,54593510),(54592745,54593512),(117628626,117630648),(1354358,1619520),(54592746,54593509)]
md = 10  # max distance allowed between two points
points.sort()
to_remove = set()  # keep track of items to be removed

for i, point in enumerate(points):
    if i == len(points) - 1:
        break
    other_point = points[i+1]
    if abs(point[0]-other_point[0]) <= md and abs(point[1]-other_point[1]) <= md:
        to_remove.add(point)

for point in to_remove:
    points.remove(point)

print(points)
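
A possible variation on the same idea, as a sketch only: it builds a new list in one pass instead of calling points.remove() for every dropped point. The function name drop_adjacent_near_duplicates is made up for illustration, and its handling of exact duplicates may differ slightly from the loop above. The second call uses made-up points to show a limitation of comparing only sort-order neighbours: two points can be close to each other without being adjacent after sorting.

```python
def drop_adjacent_near_duplicates(points, md):
    """Sort, then drop a point when the next point in sort order is within md."""
    ordered = sorted(points)
    kept = []
    for i, point in enumerate(ordered):
        if i + 1 < len(ordered):
            nxt = ordered[i + 1]
            if abs(point[0] - nxt[0]) <= md and abs(point[1] - nxt[1]) <= md:
                continue  # a close neighbour follows, so skip this one
        kept.append(point)
    return kept

sample = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648),
          (1354358, 1619520), (54592746, 54593509)]
print(drop_adjacent_near_duplicates(sample, 10))

# Made-up points: (0, 0) and (2, 1) are close, but (1, 100) sorts between
# them, so the adjacent-only comparison never sees that pair.
print(drop_adjacent_near_duplicates([(0, 0), (1, 100), (2, 1)], 10))
```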

Answer 3

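This function for getting the unique items from a list (it is not mine; I found it a while ago) loops over the list only once (plus dictionary lookups).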

def unique(seq, idfun=None): 
  # order preserving
  if idfun is None:
      def idfun(x): return x
  seen = {}
  result = []
  for item in seq:
      marker = idfun(item)
      # in old Python versions:
      # if seen.has_key(marker)
      # but in new ones:
      if marker in seen: continue
      seen[marker] = 1
      result.append(item)
  return result

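The id function takes some cleverness. point[0] is divided by error and floored to an integer, so every point with x*error <= point[0] < (x+1)*error gets the same value, and the same goes for point[1].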

def id(point):
   error = 4
   x = point[0]//error
   y = point[1]//error
   idValue = str(x)+"//"+str(y)
   return idValue

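So these functions collapse all points that lie between consecutive multiples of error into a single point. The good news is that it touches the original list only once, plus the dictionary lookups. The bad news is that this id function misses some cases: for example, 15 and 17 should count as the same, but 15 reduces to 3 and 17 reduces to 4. With some cleverness this problem can probably be solved.

[Note: I originally used exponents of primes for idValue, but the exponents would get too large. If you can make idValue an int, that would speed up the lookups.]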
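
One way the boundary problem (15 versus 17) might be handled is to also look in the neighbouring cells before accepting a point. The following is a minimal sketch, not part of the original answer: the name dedupe_with_neighbors and the choice of md as the cell size are assumptions. It keeps a point only if no already kept point is within md on both axes, which is not guaranteed to give exactly the same result as the nested loop in the question.

```python
from collections import defaultdict

def dedupe_with_neighbors(points, md):
    """Keep a point only if no already-kept point is within md on both axes."""
    cell = max(md, 1)          # assumed cell size; avoids division by zero
    grid = defaultdict(list)   # (cell_x, cell_y) -> kept points in that cell
    kept = []
    for x, y in points:
        cx, cy = x // cell, y // cell
        # Two values at most md apart fall in the same or an adjacent cell,
        # so checking the 3x3 block of cells around (cx, cy) is enough.
        close = any(
            abs(x - ox) <= md and abs(y - oy) <= md
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            for ox, oy in grid.get((cx + dx, cy + dy), ())
        )
        if not close:
            kept.append((x, y))
            grid[(cx, cy)].append((x, y))
    return kept

sample = [(54592748, 54593510), (54592745, 54593512), (117628626, 117630648),
          (1354358, 1619520), (54592746, 54593509)]
print(dedupe_with_neighbors(sample, 10))
print(dedupe_with_neighbors([(15, 15), (17, 17)], 4))  # the 15 vs 17 case
```

Each point is still visited once, and each visit inspects at most nine cells, so the work stays roughly linear as long as cells do not hold many kept points.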

Best way to remove similar points from a list of points
