Fast calculation of the Pareto front in Python

Time: 2015-09-25 23:12:36

Tags: python numpy

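I have a set of points in 3D space, from which I need to find the Pareto frontier. Execution speed is very important here, and the running time grows very quickly as I add points to test.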

The set of points looks like this:

[[0.3296170319979843, 0.0, 0.44472108843537406], [0.3296170319979843,0.0, 0.44472108843537406], [0.32920760896951373, 0.0, 0.4440408163265306], [0.32920760896951373, 0.0, 0.4440408163265306], [0.33815192743764166, 0.0, 0.44356462585034007]]

Right now, I'm using this algorithm:

def dominates(row, candidateRow):
    return sum([row[x] >= candidateRow[x] for x in range(len(row))]) == len(row) 

def simple_cull(inputPoints, dominates):
    paretoPoints = set()
    candidateRowNr = 0
    dominatedPoints = set()
    while True:
        candidateRow = inputPoints[candidateRowNr]
        inputPoints.remove(candidateRow)
        rowNr = 0
        nonDominated = True
        while len(inputPoints) != 0 and rowNr < len(inputPoints):
            row = inputPoints[rowNr]
            if dominates(candidateRow, row):
                # If it is worse on all features remove the row from the array
                inputPoints.remove(row)
                dominatedPoints.add(tuple(row))
            elif dominates(row, candidateRow):
                nonDominated = False
                dominatedPoints.add(tuple(candidateRow))
                rowNr += 1
            else:
                rowNr += 1

        if nonDominated:
            # add the non-dominated point to the Pareto frontier
            paretoPoints.add(tuple(candidateRow))

        if len(inputPoints) == 0:
            break
    return paretoPoints, dominatedPoints

Found here: http://code.activestate.com/recipes/578287-multidimensional-pareto-front/

What is the fastest way to find the set of non-dominated solutions in an ensemble of solutions? Or, in short, can Python do better than this algorithm?

6 answers:

Answer 0: (score: 18)

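If you're worried about actual speed, you definitely want to use numpy (clever algorithmic tweaks probably have far less effect than the gains to be had from array operations). Here are three solutions that all compute the same function. The is_pareto_efficient_dumb solution is slower in most situations but becomes faster as the number of costs increases, the is_pareto_efficient_simple solution is much more efficient than the dumb solution for many points, and the final is_pareto_efficient function is less readable but the fastest (so all are Pareto efficient!).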

import numpy as np


# Very slow for many datapoints.  Fastest for many costs, most readable
def is_pareto_efficient_dumb(costs):
    """
    Find the pareto-efficient points
    :param costs: An (n_points, n_costs) array
    :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient
    """
    is_efficient = np.ones(costs.shape[0], dtype = bool)
    for i, c in enumerate(costs):
        is_efficient[i] = np.all(np.any(costs[:i]>c, axis=1)) and np.all(np.any(costs[i+1:]>c, axis=1))
    return is_efficient


# Fairly fast for many datapoints, less fast for many costs, somewhat readable
def is_pareto_efficient_simple(costs):
    """
    Find the pareto-efficient points
    :param costs: An (n_points, n_costs) array
    :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient
    """
    is_efficient = np.ones(costs.shape[0], dtype = bool)
    for i, c in enumerate(costs):
        if is_efficient[i]:
            is_efficient[is_efficient] = np.any(costs[is_efficient]<c, axis=1)  # Keep any point with a lower cost
            is_efficient[i] = True  # And keep self
    return is_efficient


# Faster than is_pareto_efficient_simple, but less readable.
def is_pareto_efficient(costs, return_mask = True):
    """
    Find the pareto-efficient points
    :param costs: An (n_points, n_costs) array
    :param return_mask: True to return a mask
    :return: An array of indices of pareto-efficient points.
        If return_mask is True, this will be an (n_points, ) boolean array
        Otherwise it will be a (n_efficient_points, ) integer array of indices.
    """
    is_efficient = np.arange(costs.shape[0])
    n_points = costs.shape[0]
    next_point_index = 0  # Next index in the is_efficient array to search for
    while next_point_index<len(costs):
        nondominated_point_mask = np.any(costs<costs[next_point_index], axis=1)
        nondominated_point_mask[next_point_index] = True
        is_efficient = is_efficient[nondominated_point_mask]  # Remove dominated points
        costs = costs[nondominated_point_mask]
        next_point_index = np.sum(nondominated_point_mask[:next_point_index])+1
    if return_mask:
        is_efficient_mask = np.zeros(n_points, dtype = bool)
        is_efficient_mask[is_efficient] = True
        return is_efficient_mask
    else:
        return is_efficient

Profiling tests (using points drawn from a normal distribution):

With 10,000 samples, 2 costs:

is_pareto_efficient_dumb: Elapsed time is 1.586s
is_pareto_efficient_simple: Elapsed time is 0.009653s
is_pareto_efficient: Elapsed time is 0.005479s

With 1,000,000 samples, 2 costs:

is_pareto_efficient_dumb: Really, really, slow
is_pareto_efficient_simple: Elapsed time is 1.174s
is_pareto_efficient: Elapsed time is 0.4033s

With 10,000 samples, 15 costs:

is_pareto_efficient_dumb: Elapsed time is 4.019s
is_pareto_efficient_simple: Elapsed time is 6.466s
is_pareto_efficient: Elapsed time is 6.41s

Note that if efficiency is a concern, you can get a further speedup of roughly 2x by reordering your data beforehand; see here.
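
The linked post is not reproduced here, so the ordering below is an assumption rather than the method it describes. The sketch sorts the points by the sum of their standardized costs, so that likely dominators come first and the simple loop can discard more points per iteration. The wrapper name `pareto_mask_presorted` is hypothetical; `is_pareto_efficient_simple` is repeated from the answer so the block runs on its own.

```python
import numpy as np


def is_pareto_efficient_simple(costs):
    """Boolean mask of Pareto-efficient rows (minimisation), as in the answer."""
    is_efficient = np.ones(costs.shape[0], dtype=bool)
    for i, c in enumerate(costs):
        if is_efficient[i]:
            # Keep any point with a strictly lower cost in at least one column
            is_efficient[is_efficient] = np.any(costs[is_efficient] < c, axis=1)
            is_efficient[i] = True  # And keep self
    return is_efficient


def pareto_mask_presorted(costs):
    """Hypothetical wrapper: sort by summed standardized cost, then filter."""
    scale = costs.std(axis=0)
    scale[scale == 0] = 1.0  # avoid dividing by zero for constant columns
    # Assumed ordering: smallest total standardized cost first
    order = np.argsort(((costs - costs.mean(axis=0)) / scale).sum(axis=1))
    mask_sorted = is_pareto_efficient_simple(costs[order])
    mask = np.empty_like(mask_sorted)
    mask[order] = mask_sorted  # undo the sort so the mask lines up with `costs`
    return mask
```

Whether this helps, and by how much, depends on the data; it is worth timing both variants on your own points before adopting it.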

Answer 1: (score: 7)

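Edit

I took another look at this problem recently and found a useful heuristic that works well when there are many independently distributed points and few dimensions.

The idea is to compute the convex hull of the points. With few dimensions and independently distributed points, the number of vertices of the convex hull will be small. Intuitively, we can expect some vertices of the convex hull to dominate many of the original points. Moreover, if a point in the convex hull is not dominated by any other point in the convex hull, then it is also not dominated by any point in the original set.

This gives a simple iterative algorithm. We repeatedly:

  1. Compute the convex hull.
  2. Save the Pareto-undominated points from the convex hull.
  3. Filter the points to remove those dominated by elements of the convex hull.

I added a few benchmarks for dimension 3. It seems that for some distributions of points this approach yields better asymptotics.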

    import numpy as np
    from scipy import spatial
    from functools import reduce
    
    # test points
    pts = np.random.rand(10_000_000, 3)
    
    
    def filter_(pts, pt):
        """
        Get all points in pts that are not Pareto dominated by the point pt
        """
        weakly_worse   = (pts <= pt).all(axis=-1)
        strictly_worse = (pts < pt).any(axis=-1)
        return pts[~(weakly_worse & strictly_worse)]
    
    
    def get_pareto_undominated_by(pts1, pts2=None):
        """
        Return all points in pts1 that are not Pareto dominated
        by any points in pts2
        """
        if pts2 is None:
            pts2 = pts1
        return reduce(filter_, pts2, pts1)
    
    
    def get_pareto_frontier(pts):
        """
        Iteratively filter points based on the convex hull heuristic
        """
        pareto_groups = []
    
        # loop while there are points remaining
        while pts.shape[0]:
            # brute force if there are few points:
            if pts.shape[0] < 10:
                pareto_groups.append(get_pareto_undominated_by(pts))
                break
    
            # compute vertices of the convex hull
            hull_vertices = spatial.ConvexHull(pts).vertices
    
            # get corresponding points
            hull_pts = pts[hull_vertices]
    
            # get points in pts that are not convex hull vertices
            nonhull_mask = np.ones(pts.shape[0], dtype=bool)
            nonhull_mask[hull_vertices] = False
            pts = pts[nonhull_mask]
    
            # get points in the convex hull that are on the Pareto frontier
            pareto   = get_pareto_undominated_by(hull_pts)
            pareto_groups.append(pareto)
    
            # filter remaining points to keep those not dominated by
            # Pareto points of the convex hull
            pts = get_pareto_undominated_by(pts, pareto)
    
        return np.vstack(pareto_groups)
    
    # --------------------------------------------------------------------------------
    # previous solutions
    # --------------------------------------------------------------------------------
    
    def is_pareto_efficient_dumb(costs):
        """
        :param costs: An (n_points, n_costs) array
        :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient
        """
        is_efficient = np.ones(costs.shape[0], dtype = bool)
        for i, c in enumerate(costs):
            is_efficient[i] = np.all(np.any(costs>=c, axis=1))
        return is_efficient
    
    
    def is_pareto_efficient(costs):
        """
        :param costs: An (n_points, n_costs) array
        :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient
        """
        is_efficient = np.ones(costs.shape[0], dtype = bool)
        for i, c in enumerate(costs):
            if is_efficient[i]:
                is_efficient[is_efficient] = np.any(costs[is_efficient]<=c, axis=1)  # Remove dominated points
        return is_efficient
    
    
    def dominates(row, rowCandidate):
        return all(r >= rc for r, rc in zip(row, rowCandidate))
    
    
    def cull(pts, dominates):
        dominated = []
        cleared = []
        remaining = pts
        while remaining:
            candidate = remaining[0]
            new_remaining = []
            for other in remaining[1:]:
                [new_remaining, dominated][dominates(candidate, other)].append(other)
            if not any(dominates(other, candidate) for other in new_remaining):
                cleared.append(candidate)
            else:
                dominated.append(candidate)
            remaining = new_remaining
        return cleared, dominated
    
    # --------------------------------------------------------------------------------
    # benchmarking
    # --------------------------------------------------------------------------------
    
    # to accommodate the original non-numpy solution
    pts_list = [list(pt) for pt in pts]
    
    import timeit
    
    # print('Old non-numpy solution:\t{}'.format(
        # timeit.timeit('cull(pts_list, dominates)', number=3, globals=globals())))
    
    print('Numpy solution:\t{}'.format(
        timeit.timeit('is_pareto_efficient(pts)', number=3, globals=globals())))
    
    print('Convex hull heuristic:\t{}'.format(
        timeit.timeit('get_pareto_frontier(pts)', number=3, globals=globals())))
    

Results

    # >>= python temp.py # 1,000 points
    # Old non-numpy solution:      0.0316428339574486
    # Numpy solution:              0.005961259012110531
    # Convex hull heuristic:       0.012369581032544374
    # >>= python temp.py # 1,000,000 points
    # Old non-numpy solution:      70.67529802105855
    # Numpy solution:              5.398462114972062
    # Convex hull heuristic:       1.5286884519737214
    # >>= python temp.py # 10,000,000 points
    # Numpy solution:              98.03680767398328
    # Convex hull heuristic:       10.203076395904645
    

Original post

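I took a crack at rewriting the same algorithm with a couple of tweaks. I think most of your problems come from inputPoints.remove(row). This requires searching through the list of points; removing by index would be much more efficient. I also modified the dominates function to avoid some redundant comparisons. This could be handy in higher dimensions.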

    def dominates(row, rowCandidate):
        return all(r >= rc for r, rc in zip(row, rowCandidate))
    
    def cull(pts, dominates):
        dominated = []
        cleared = []
        remaining = pts
        while remaining:
            candidate = remaining[0]
            new_remaining = []
            for other in remaining[1:]:
                [new_remaining, dominated][dominates(candidate, other)].append(other)
            if not any(dominates(other, candidate) for other in new_remaining):
                cleared.append(candidate)
            else:
                dominated.append(candidate)
            remaining = new_remaining
        return cleared, dominated
    

Answer 2: (score: 1)

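The definition of dominates is incorrect. A dominates B if and only if A is better than or equal to B in all dimensions and strictly better in at least one dimension.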
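
A minimal sketch of that definition, assuming larger values are better (matching the `>=` comparison in the question's `dominates`); the parameter name `candidate_row` is only illustrative:

```python
def dominates(row, candidate_row):
    """Return True if `row` Pareto-dominates `candidate_row` (larger is better)."""
    pairs = list(zip(row, candidate_row))
    # Better than or equal in every dimension...
    at_least_as_good = all(r >= c for r, c in pairs)
    # ...and strictly better in at least one
    strictly_better = any(r > c for r, c in pairs)
    return at_least_as_good and strictly_better
```

With this version a point no longer dominates an exact copy of itself, so duplicate points are not culled as "dominated".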

Answer 3: (score: 1)

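Peter, nice response.

I just wanted to generalise it for those who want to choose between maximisation and your default of minimisation. It's a trivial fix, but nice to document here: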

def is_pareto(costs, maximise=False):
    """
    :param costs: An (n_points, n_costs) array
    :param maximise: boolean. True for maximising, False for minimising
    :return: A (n_points, ) boolean array, indicating whether each point is Pareto efficient
    """
    is_efficient = np.ones(costs.shape[0], dtype = bool)
    for i, c in enumerate(costs):
        if is_efficient[i]:
            if maximise:
                is_efficient[is_efficient] = np.any(costs[is_efficient]>=c, axis=1)  # Remove dominated points
            else:
                is_efficient[is_efficient] = np.any(costs[is_efficient]<=c, axis=1)  # Remove dominated points
    return is_efficient
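
An alternative sketch, assuming every column is to be maximised: negating the values turns maximisation into minimisation, so a minimising routine can be reused unchanged. The names `is_pareto_min` and `is_pareto_max` are hypothetical, and the minimising loop uses the strict `<` comparison with an explicit "keep self" step rather than the `<=` shown above.

```python
import numpy as np


def is_pareto_min(costs):
    """Boolean mask of Pareto-efficient rows when every column is minimised."""
    is_efficient = np.ones(costs.shape[0], dtype=bool)
    for i, c in enumerate(costs):
        if is_efficient[i]:
            # Keep points that are strictly lower than c in at least one column
            is_efficient[is_efficient] = np.any(costs[is_efficient] < c, axis=1)
            is_efficient[i] = True  # keep the point itself
    return is_efficient


def is_pareto_max(values):
    """Hypothetical wrapper: maximising `values` is minimising `-values`."""
    return is_pareto_min(-np.asarray(values, dtype=float))


values = np.array([[1.0, 5.0], [2.0, 4.0], [1.5, 3.0], [3.0, 1.0]])
print(is_pareto_max(values))  # [ True  True False  True]
```

Here (1.5, 3.0) is dropped because (2.0, 4.0) is larger in both columns; the other three points each win on at least one column.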

Answer 4: (score: 1)

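I might be a bit late here, but I tried the proposed solutions and it seems they fail to return all the Pareto points. I wrote a recursive implementation (noticeably faster) that finds the Pareto front; you can find it at https://github.com/Ragheb2464/preto-front.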

Answer 5: (score: 0)

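Just to clarify the example above: the function that obtains the Pareto front differs slightly from the code above. It should use only a < rather than a <=, and looks like this: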

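def is_pareto(costs):
    is_efficient = np.ones(costs.shape[0], dtype=bool)

    for i, c in enumerate(costs):  # iterate over the points, not over the mask
        if is_efficient[i]:
            # Keep only points that are strictly lower than c on at least one cost
            is_efficient[is_efficient] = np.any(costs[is_efficient] < c, axis=1)
            is_efficient[i] = True  # c < c is never true, so re-mark the point itself

    return is_efficient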

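Disclaimer: this is only partially correct, since dominance itself is defined as <= on all costs and < on at least one. But in most cases it should be sufficient.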
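
To illustrate the caveat, here is a sketch that applies the full definition (`<=` on every cost and `<` on at least one) instead of the `<`-only shortcut. The function name `is_pareto_strict` is hypothetical, and minimisation is assumed. The visible difference is that exact duplicates of an efficient point are all kept, because none of them dominates another.

```python
import numpy as np


def is_pareto_strict(costs):
    """Mask of rows not dominated by any other row (minimisation)."""
    costs = np.asarray(costs)
    is_efficient = np.ones(costs.shape[0], dtype=bool)
    for i, c in enumerate(costs):
        if is_efficient[i]:
            remaining = costs[is_efficient]
            # c dominates a row if c <= row everywhere and c < row somewhere
            dominated = np.all(c <= remaining, axis=1) & np.any(c < remaining, axis=1)
            is_efficient[is_efficient] = ~dominated
    return is_efficient


costs = np.array([[1.0, 2.0], [1.0, 2.0], [1.0, 3.0], [2.0, 1.0]])
print(is_pareto_strict(costs))  # [ True  True False  True]
```

Both copies of (1.0, 2.0) survive, while (1.0, 3.0) is removed because it ties on the first cost and loses on the second.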