Question

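I want to blend a video on top of another one using an alpha video. This is my code. It works, but the problem is that it is not efficient at all, and that is because of the /255 parts. It is slow and laggy.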

Is there a standard and efficient way of doing this? I want the results to be real-time. Thanks

import cv2
import numpy as np

def main():
    foreground = cv2.VideoCapture('circle.mp4')
    background = cv2.VideoCapture('video.MP4')
    alpha = cv2.VideoCapture('circle_alpha.mp4')

    while foreground.isOpened():
        fr_foreground = foreground.read()[1]/255
        fr_background = background.read()[1]/255     
        fr_alpha = alpha.read()[1]/255

        cv2.imshow('My Image',cmb(fr_foreground,fr_background,fr_alpha))

        if cv2.waitKey(1) == ord('q'): break

    cv2.destroyAllWindows

def cmb(fg,bg,a):
    return fg * a + bg * (1-a)

if __name__ == '__main__':
    main()

Answer 1

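Let's get a few obvious problems out of the way first - foreground.isOpened() will return true even after you have reached the end of the video, so your program will end up crashing at that point. The solution is twofold. First of all, test all 3 VideoCapture instances right after you create them, using something like: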

if not foreground.isOpened() or not background.isOpened() or not alpha.isOpened():
    print("Unable to open input videos.")
    return

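That will make sure that all of them opened properly. The next part is to correctly handle reaching the end of the video. That means either checking the first of the two return values of read(), which is a boolean flag representing success, or testing whether the frame is None.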

while True:
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
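The same end-of-video check can be wrapped in a generator so the main loop becomes a plain for loop. This is only a sketch: read_frames and the FakeCapture stand-in are hypothetical names introduced for illustration, and the only assumption about a capture object is that it has a read() method returning (ok, frame) the way cv2.VideoCapture does.

```python
def read_frames(*captures):
    """Yield one tuple of frames per step until any capture runs out."""
    while True:
        frames = []
        for cap in captures:
            ok, frame = cap.read()
            if not ok or frame is None:
                return  # end of (at least) one video
            frames.append(frame)
        yield tuple(frames)


class FakeCapture(object):
    """Hypothetical stand-in for cv2.VideoCapture, for demonstration only."""
    def __init__(self, frames):
        self._frames = list(frames)

    def read(self):
        if self._frames:
            return True, self._frames.pop(0)
        return False, None


# The shorter "video" decides when the loop stops.
pairs = list(read_frames(FakeCapture([1, 2, 3]), FakeCapture([10, 20])))
```

With real inputs the loop would read: for fr_foreground, fr_background, fr_alpha in read_frames(foreground, background, alpha).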

Additionally, it seems that you don't actually call cv2.destroyAllWindows() - the () is missing. Not that this really matters.

To help investigate and optimize this, I've added some detailed timing, using the timeit module and a couple of convenience functions:

from timeit import default_timer as timer

def update_times(times, total_times):
    for i in range(len(times) - 1):
        total_times[i] += (times[i+1]-times[i]) * 1000

def print_times(total_times, n):
    print("Iterations: %d" % n)
    for i in range(len(total_times)):
        print("Step %d: %0.4f ms" % (i, total_times[i] / n))
    print("Total: %0.4f ms" % (np.sum(total_times) / n))

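And modified the main() function to measure the time taken by each logical step - read, scale, blend, show, waitKey. To do this, I split the division into separate statements. I also made a slight modification that makes this work in Python 2.x as well (there, /255 is interpreted as integer division and yields wrong results).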

times = [0.0] * 6
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video
    times[1] = timer()
    fr_foreground = fr_foreground / 255.0
    fr_background = fr_background / 255.0
    fr_alpha = fr_alpha / 255.0
    times[2] = timer()
    result = cmb(fr_foreground,fr_background,fr_alpha)
    times[3] = timer()
    cv2.imshow('My Image', result)
    times[4] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[5] = timer()
    update_times(times, total_times)
    n += 1

print_times(total_times, n)
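The integer-division pitfall mentioned above can be reproduced on Python 3 with the floor-division operator, which behaves the way Python 2's / does on integer arrays. A minimal sketch; the four sample pixel values are made up for illustration.

```python
import numpy as np

frame = np.array([0, 127, 254, 255], dtype=np.uint8)

# Floor division: what `frame / 255` amounts to under Python 2 rules.
floor_scaled = frame // 255

# True division by a float: what the blend actually needs.
true_scaled = frame / 255.0
```

Every pixel below 255 collapses to 0 in floor_scaled, which is why dividing by 255.0 instead of 255 matters on Python 2.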

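When I run this using 1280x800 mp4 videos as input, I notice that it really is rather sluggish, and that it only uses 15% CPU on my 6 core machine. The timing of the sections is as follows: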

Iterations: 1190
Step 0: 11.4385 ms
Step 1: 37.1320 ms
Step 2: 39.4083 ms
Step 3: 2.5488 ms
Step 4: 10.7083 ms
Total: 101.2358 ms

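This suggests that the biggest bottlenecks are both the scaling step and the blending step. The low CPU usage is also suboptimal, but let's focus on the low-hanging fruit first.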

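Let's look at the data types of the numpy arrays we use. read() gives us arrays with dtype np.uint8 - 8 bit unsigned integers. However, the floating point division (as written) will yield an array with dtype np.float64 - 64 bit floating point values. We don't really need this level of precision for our algorithm, so we'd be better off using only 32 bit floats - it would mean that if any of the operations are vectorized, we can potentially do twice as many calculations in the same amount of time.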
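To make the dtype difference concrete, here is a small check of what each kind of division produces. It is a sketch on a zero-filled stand-in frame of the 1280x800 size mentioned earlier, not one of the real video frames.

```python
import numpy as np

# Stand-in for a frame returned by read(): height x width x 3, uint8.
frame = np.zeros((800, 1280, 3), dtype=np.uint8)

as_f64 = frame / 255.0              # Python float divisor -> float64 result
as_f32 = frame / np.float32(255.0)  # float32 divisor -> float32 result

# Memory footprint per frame, in bytes.
sizes = (frame.nbytes, as_f32.nbytes, as_f64.nbytes)
```

The float64 result takes 8 bytes per value against 4 for float32, so every later step has twice as much memory to read and write per frame.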

There are two options here. We could simply cast the divisor to np.float32, which will cause numpy to give us the result with the same dtype:

fr_foreground = fr_foreground / np.float32(255.0)
fr_background = fr_background / np.float32(255.0)
fr_alpha = fr_alpha / np.float32(255.0)

Which gives us the following timings:

Iterations: 1786
Step 0: 9.2550 ms
Step 1: 19.0144 ms
Step 2: 21.2120 ms
Step 3: 1.4662 ms
Step 4: 10.8889 ms
Total: 61.8365 ms

Or we could first cast the array to np.float32, and then do the scaling in-place.

fr_foreground = np.float32(fr_foreground)
fr_background = np.float32(fr_background)
fr_alpha = np.float32(fr_alpha)

fr_foreground /= 255.0
fr_background /= 255.0
fr_alpha /= 255.0

Which gives the following timings (step 1 is now split into conversion (1) and scaling (2), and the rest shifts by 1):

Iterations: 1786
Step 0: 9.0589 ms
Step 1: 13.9614 ms
Step 2: 4.5960 ms
Step 3: 20.9279 ms
Step 4: 1.4631 ms
Step 5: 10.4396 ms
Total: 60.4469 ms

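Both are roughly equivalent, running at ~60% of the original time. I'll stick with the second option, since it will become useful in the later steps. Let's see what else we can improve.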

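From the previous timings, we can see that the scaling is not the bottleneck any more, but an idea still comes to mind - division is generally slower than multiplication, so what if we multiplied by the reciprocal?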

fr_foreground *= 1/255.0
fr_background *= 1/255.0
fr_alpha *= 1/255.0

Indeed, this does gain us a millisecond - nothing spectacular, but it was easy, so we might as well go with it:

Iterations: 1786
Step 0: 9.1843 ms
Step 1: 14.2349 ms
Step 2: 3.5752 ms
Step 3: 21.0545 ms
Step 4: 1.4692 ms
Step 5: 10.6917 ms
Total: 60.2097 ms
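Swapping the division for a multiplication by the reciprocal is not guaranteed to be bit-for-bit identical in float32, so it is worth checking how far apart the two results can be. A minimal sketch over all 256 possible pixel values:

```python
import numpy as np

# Every possible uint8 pixel value, as a small 16x16 "frame".
frame = np.arange(256, dtype=np.uint8).reshape(16, 16)

divided = np.float32(frame)
divided /= 255.0            # in-place division, stays float32

multiplied = np.float32(frame)
multiplied *= 1 / 255.0     # in-place multiplication by the reciprocal

# Largest difference between the two approaches.
max_diff = float(np.max(np.abs(divided - multiplied)))
```

Any difference stays at the level of float32 rounding, far below what is visible once the result is converted back to 8 bits.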

Now the blending function is the biggest bottleneck, followed by the typecast of all 3 arrays. If we look at what the blending operation does:

foreground * alpha + background * (1.0 - alpha)

we can observe that for the math to work out, the only value that needs to be in the range (0.0, 1.0) is alpha.
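A quick numerical check of that observation: scaling all three inputs and scaling only alpha give the same blend, up to a constant factor of 255. This is a sketch on small random arrays (the shapes and the random seed are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
fg = rng.integers(0, 256, (4, 4, 3), dtype=np.uint8)
bg = rng.integers(0, 256, (4, 4, 3), dtype=np.uint8)
alpha_raw = rng.integers(0, 256, (4, 4, 3), dtype=np.uint8)

a = alpha_raw.astype(np.float32) * np.float32(1 / 255.0)

# Everything scaled to 0..1: the result is in 0..1.
all_scaled = (fg / 255.0) * a + (bg / 255.0) * (1 - a)

# Only alpha scaled: the result is already in 0..255.
alpha_only = fg * a + bg * (1 - a)

same = np.allclose(all_scaled * 255, alpha_only, atol=1e-2)
```

Because the blend is linear in the foreground and background, their 1/255 factor can be pulled out of the whole expression, and it then cancels against the scaling back to 0..255 for display.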

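What if we scaled only the alpha image? Also, since multiplication by a floating point value promotes the result to floating point, what if we also skipped the type conversion? This would mean that cmb() would have to return an np.uint8 array: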

def cmb(fg,bg,a):
    return np.uint8(fg * a + bg * (1-a))

We would have

    #fr_foreground = np.float32(fr_foreground)
    #fr_background = np.float32(fr_background)
    fr_alpha = np.float32(fr_alpha)

    #fr_foreground *= 1/255.0
    #fr_background *= 1/255.0
    fr_alpha *= 1/255.0

The timings for that are

Step 0: 7.7023 ms
Step 1: 4.6758 ms
Step 2: 1.1061 ms
Step 3: 27.3188 ms
Step 4: 0.4783 ms
Step 5: 9.0027 ms
Total: 50.2840 ms

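Obviously, steps 1 and 2 are much faster, since we only do 1/3 of the work. imshow also speeds up, since it doesn't have to convert from floating point. Inexplicably, the reads also got faster (I guess we avoid some under-the-hood reallocations, since fr_foreground and fr_background always contain the pristine frame). We do pay the price of an additional cast in cmb(), but overall this seems like a win - we're at 50% of the original time.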

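To continue, let's get rid of the cmb() function, move its functionality into main() and split it up to measure the cost of each of the operations. Let's also try to reuse the result of alpha.read() (since we recently saw that improvement in read() performance):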

times = [0.0] * 11
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video

    times[1] = timer()
    fr_alpha = np.float32(fr_alpha_raw)
    times[2] = timer()
    fr_alpha *= 1/255.0
    times[3] = timer()
    fr_alpha_inv = 1.0 - fr_alpha
    times[4] = timer()
    fr_fg_weighed = fr_foreground * fr_alpha
    times[5] = timer()
    fr_bg_weighed = fr_background * fr_alpha_inv
    times[6] = timer()
    sum = fr_fg_weighed + fr_bg_weighed
    times[7] = timer()
    result = np.uint8(sum)
    times[8] = timer()
    cv2.imshow('My Image', result)
    times[9] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[10] = timer()
    update_times(times, total_times)
    n += 1

New timings:

Iterations: 1786
Step 0: 6.8733 ms
Step 1: 5.2742 ms
Step 2: 1.1430 ms
Step 3: 4.5800 ms
Step 4: 7.0372 ms
Step 5: 7.0675 ms
Step 6: 5.3082 ms
Step 7: 2.6912 ms
Step 8: 0.4658 ms
Step 9: 9.6966 ms
Total: 50.1372 ms

We didn't really gain anything overall, but the reads got noticeably faster.

This leads to another idea - what if we try to minimize allocations and reuse the arrays in subsequent iterations?

We can pre-allocate the necessary arrays in the first iteration (using numpy.zeros_like), after we read the first set of frames:

if n == 0: # Pre-allocate
    fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
    fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
    fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
    sum = np.zeros_like(fr_alpha_raw, np.float32)
    result = np.zeros_like(fr_alpha_raw, np.uint8)

Now, we can use

numpy.add for addition
numpy.subtract for subtraction
numpy.multiply for multiplication
numpy.copyto for type conversion

We can also merge steps 1 and 2 together, using a single numpy.multiply.

times = [0.0] * 10
total_times = [0.0] * (len(times) - 1)
n = 0
while True:
    times[0] = timer()
    r_fg, fr_foreground = foreground.read()
    r_bg, fr_background = background.read()
    r_a, fr_alpha_raw = alpha.read()
    if not r_fg or not r_bg or not r_a:
        break # End of video

    if n == 0: # Pre-allocate
        fr_alpha = np.zeros_like(fr_alpha_raw, np.float32)
        fr_alpha_inv = np.zeros_like(fr_alpha_raw, np.float32)
        fr_fg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        fr_bg_weighed = np.zeros_like(fr_alpha_raw, np.float32)
        sum = np.zeros_like(fr_alpha_raw, np.float32)
        result = np.zeros_like(fr_alpha_raw, np.uint8)

    times[1] = timer()
    np.multiply(fr_alpha_raw, np.float32(1/255.0), fr_alpha)
    times[2] = timer()
    np.subtract(1.0, fr_alpha, fr_alpha_inv)
    times[3] = timer()
    np.multiply(fr_foreground, fr_alpha, fr_fg_weighed)
    times[4] = timer()
    np.multiply(fr_background, fr_alpha_inv, fr_bg_weighed)
    times[5] = timer()
    np.add(fr_fg_weighed, fr_bg_weighed, sum)
    times[6] = timer()
    np.copyto(result, sum, 'unsafe')
    times[7] = timer()
    cv2.imshow('My Image', result)
    times[8] = timer()
    if cv2.waitKey(1) == ord('q'): break
    times[9] = timer()
    update_times(times, total_times)
    n += 1

This gives us the following timings:

Iterations: 1786
Step 0: 7.0515 ms
Step 1: 3.8839 ms
Step 2: 1.9080 ms
Step 3: 4.5198 ms
Step 4: 4.3871 ms
Step 5: 2.7576 ms
Step 6: 1.9273 ms
Step 7: 0.4382 ms
Step 8: 7.2340 ms
Total: 34.1074 ms

Significant improvement in all the steps we modified. We're down to ~35% of the time needed by the original implementation.
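Stripped of the timing code, the pre-allocated version can be packaged as a small helper. This is a sketch, not the code that was timed above: the class name BlendBuffers and its attribute names are made up for illustration, and it assumes all three frames share one shape.

```python
import numpy as np


class BlendBuffers(object):
    """Hypothetical helper owning the scratch arrays for one frame size."""

    def __init__(self, shape):
        self.alpha = np.zeros(shape, np.float32)
        self.alpha_inv = np.zeros(shape, np.float32)
        self.fg_weighed = np.zeros(shape, np.float32)
        self.bg_weighed = np.zeros(shape, np.float32)
        self.total = np.zeros(shape, np.float32)
        self.result = np.zeros(shape, np.uint8)

    def blend(self, fg, bg, alpha_raw):
        # Each ufunc writes into an existing array, so nothing is allocated.
        np.multiply(alpha_raw, np.float32(1 / 255.0), out=self.alpha)
        np.subtract(1.0, self.alpha, out=self.alpha_inv)
        np.multiply(fg, self.alpha, out=self.fg_weighed)
        np.multiply(bg, self.alpha_inv, out=self.bg_weighed)
        np.add(self.fg_weighed, self.bg_weighed, out=self.total)
        # 'unsafe' casting truncates the float values towards zero.
        np.copyto(self.result, self.total, casting='unsafe')
        return self.result
```

The returned array is the same object on every call, so a caller that wants to keep a frame beyond the next blend() has to copy it.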

Minor update:

Based on Silencer's answer, I measured cv2.convertScaleAbs as well. It actually runs a bit faster:

Step 6: 1.2318 ms

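This gave me another idea - we could take advantage of cv2.add, which lets us specify the destination data type and does a saturating cast as well. This would allow us to combine steps 5 and 6 together.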

cv2.add(fr_fg_weighed, fr_bg_weighed, result, dtype=cv2.CV_8UC3)

comes out at

Step 5: 3.3621 ms

Again, a little gain (previously we were at around 3.9 ms).

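Following on from this, cv2.subtract and cv2.multiply are further candidates. We need to use a 4-element tuple to define a scalar (an intricacy of the Python bindings), and we need to explicitly define an output data type for the multiplication.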

    cv2.subtract((1.0, 1.0, 1.0, 0.0), fr_alpha, fr_alpha_inv)
    cv2.multiply(fr_foreground, fr_alpha, fr_fg_weighed, dtype=cv2.CV_32FC3)
    cv2.multiply(fr_background, fr_alpha_inv, fr_bg_weighed, dtype=cv2.CV_32FC3)

Timings:

Step 2: 2.1897 ms
Step 3: 2.8981 ms
Step 4: 2.9066 ms

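This seems to be about as far as we can get without some parallelization. We're already taking advantage of whatever OpenCV may provide in terms of individual operations, so we should focus on pipelining our implementation.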

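To help me figure out how to partition the code between the different pipeline stages (threads), I made a chart that shows all the operations, our best times for them, as well as the inter-dependencies of the calculations.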

WIP - see the comments for additional info while I write this up.
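The write-up stops before showing a pipelined version. Purely as an illustration of the idea, and not the author's design, a stage-per-thread pipeline with bounded queues could be sketched as follows. The name run_pipeline and the toy stages are hypothetical; in the real program the stages would be the read, blend and display steps, and threads only help to the extent that the heavy NumPy and OpenCV calls release the GIL, which as far as I know most of them do.

```python
import queue
import threading

_SENTINEL = object()  # marks the end of the stream


def run_pipeline(source, stages, maxsize=4):
    """Run each stage in its own thread; return the results in order."""
    queues = [queue.Queue(maxsize) for _ in range(len(stages) + 1)]

    def feeder():
        for item in source:
            queues[0].put(item)
        queues[0].put(_SENTINEL)

    def worker(fn, q_in, q_out):
        # No error handling: an exception in fn would stall the pipeline.
        while True:
            item = q_in.get()
            if item is _SENTINEL:
                q_out.put(_SENTINEL)
                return
            q_out.put(fn(item))

    threads = [threading.Thread(target=feeder)]
    for i, fn in enumerate(stages):
        threads.append(threading.Thread(
            target=worker, args=(fn, queues[i], queues[i + 1])))
    for t in threads:
        t.daemon = True
        t.start()

    results = []
    while True:
        item = queues[-1].get()  # the caller's thread drains the last queue
        if item is _SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


# Toy usage: two stages applied to the numbers 0..4.
demo = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 2])
```

The bounded queues keep a fast reader from running arbitrarily far ahead of a slow blend or display stage, and one thread per stage preserves the frame order.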

Answer 2

I'm using OpenCV 4.00-pre and Python 3.6.

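1. There is no need to do three xxx/255 operations. Doing it just for the alpha is enough.

2. Take care with the type conversion: prefer cv2.convertScaleAbs(xxx) or np.uint8(xxx) over np.copyto(xxx,yyy, "unsafe").

3. Pre-allocating memory should be better.

I use #2, cv2.convertScaleAbs, to avoid underflow/overflow, with the range being [0,255]. For example: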

>>> x = np.array([[-1,256]])
>>> y = np.uint8(x)
>>> z = cv2.convertScaleAbs(x)
>>> x
array([[ -1, 256]])
>>> y
array([[255,   0]], dtype=uint8)
>>> z
array([[  1, 255]], dtype=uint8)
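The wraparound of a plain cast versus saturation can also be reproduced with NumPy alone. This is a sketch using np.clip as the saturating step; it is not what cv2.convertScaleAbs does internally, and the 1 it returns for -1 above is consistent with it taking the absolute value first, so it is not a plain clamp either.

```python
import numpy as np

x = np.array([[-1, 256]])

# A plain integer cast wraps around modulo 256.
wrapped = x.astype(np.uint8)

# Clamping to 0..255 before the cast saturates instead.
saturated = np.clip(x, 0, 255).astype(np.uint8)
```

For blended frames the values are already non-negative, so the difference between clamping and taking the absolute value should not come into play there.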

##! 2018/05/09 13:54:34

import cv2
import numpy as np
import time

def cmb(fg,bg,a):
    return fg * a + bg * (1-a)

def test2():
    cap = cv2.VideoCapture(0)
    ret, prev_frame = cap.read()
    """
    foreground = cv2.VideoCapture('circle.mp4')
    background = cv2.VideoCapture('video.MP4')
    alphavideo = cv2.VideoCapture('circle_alpha.mp4')
    """
    while cap.isOpened():
        ts = time.time()
        ret, fg = cap.read()
        alpha = fg.copy()
        bg = prev_frame
        """
        ret, fg = foreground.read()
        ret, bg = background.read()
        ret, alpha = alphavideo.read()
        """

        alpha = np.multiply(alpha, 1.0/255)
        blended = cv2.convertScaleAbs(cmb(fg, bg, alpha))
        te = time.time()
        dt = te-ts
        fps = 1/dt
        print("{:.3}ms, {:.3} fps".format(1000*dt, fps))
        cv2.imshow('Blended', blended)

        if cv2.waitKey(1) == ord('q'):
            break

    cv2.destroyAllWindows()

if __name__ == "__main__":
    test2()

Some output:

39.0ms, 25.6 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
38.0ms, 26.3 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
37.0ms, 27.0 fps
...

Answer 3

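If it's just blend, render and forget, then it makes sense to do it on the GPU. Among many others, VTK (The Visualization Toolkit) (https://www.vtk.org) can do it for you instead of imshow. VTK is already known from the OpenCV 3D Visualizer module (https://docs.opencv.org/3.2.0/d1/d19/group__viz.html), so it should not add much of a dependency.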

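After this, the whole computational part (excluding the reading of video frames) comes down to cv2.mixChannels and the transport of pixel data to the two renderers, and on my computer it takes about 5ms per iteration for a 1280x720 video.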

import sys
import cv2
import numpy as np
import vtk
from vtk.util import numpy_support
import time

class Renderer:
    # VTK renderer with two layers
    def __init__( self ):
        self.layer1 = vtk.vtkRenderer()
        self.layer1.SetLayer(0)
        self.layer2 = vtk.vtkRenderer()
        self.layer2.SetLayer(1)
        self.renWin = vtk.vtkRenderWindow()
        self.renWin.SetNumberOfLayers( 2 )
        self.renWin.AddRenderer(self.layer1)
        self.renWin.AddRenderer(self.layer2)
        self.iren = vtk.vtkRenderWindowInteractor()
        self.iren.SetRenderWindow(self.renWin)
        self.iren.Initialize()      
    def Render( self ):
        self.iren.Render()

# set background image to a given renderer (resets the camera)
# from https://www.vtk.org/Wiki/VTK/Examples/Cxx/Images/BackgroundImage
def SetBackground( ren, image ):    
    bits = numpy_support.numpy_to_vtk( image.ravel() )
    bits.SetNumberOfComponents( image.shape[2] )
    # integer division keeps the tuple count an int on Python 3 as well
    bits.SetNumberOfTuples( bits.GetNumberOfTuples()//bits.GetNumberOfComponents() )

    img = vtk.vtkImageData()
    img.GetPointData().SetScalars( bits )
    img.SetExtent( 0, image.shape[1]-1, 0, image.shape[0]-1, 0,0 )
    origin = img.GetOrigin()
    spacing = img.GetSpacing()
    extent = img.GetExtent()

    actor = vtk.vtkImageActor()
    actor.SetInputData( img )

    ren.RemoveAllViewProps()
    ren.AddActor( actor )
    camera = vtk.vtkCamera()
    camera.ParallelProjectionOn()
    xc = origin[0] + 0.5*(extent[0] + extent[1])*spacing[0]
    yc = origin[1] + 0.5*(extent[2] + extent[3])*spacing[1]
    yd = (extent[3] - extent[2] + 1)*spacing[1]
    d = camera.GetDistance()
    camera.SetParallelScale(0.5*yd)
    camera.SetFocalPoint(xc,yc,0.0)
    camera.SetPosition(xc,yc,-d)
    camera.SetViewUp(0,-1,0)
    ren.SetActiveCamera( camera )
    return img

# update the scalar data without bounds check
def UpdateImageData( vtkimage, image ):
    bits = numpy_support.numpy_to_vtk( image.ravel() )
    bits.SetNumberOfComponents( image.shape[2] )
    # integer division keeps the tuple count an int on Python 3 as well
    bits.SetNumberOfTuples( bits.GetNumberOfTuples()//bits.GetNumberOfComponents() )
    vtkimage.GetPointData().SetScalars( bits )

r = Renderer()
r.renWin.SetSize(1280,720)
cap = cv2.VideoCapture('video.mp4')
image = cv2.imread('hello.png',1)
alpha = cv2.cvtColor(image,cv2.COLOR_RGB2GRAY )
ret, alpha = cv2.threshold( alpha, 127, 127, cv2.THRESH_BINARY )
alpha = np.reshape( alpha, (alpha.shape[0],alpha.shape[1], 1 ) )

src1=None
src2=None
overlay=None
c=0
while ( 1 ):
    # read the data
    ret, mat = cap.read()
    if ( not ret ):
        break
    #TODO ret, image = cap2.read() #(rgb)
    #TODO ret, alpha = cap3.read() #(mono)

    # alpha blend
    t=time.time()
    if ( overlay is None ):
        # allocate the 4-channel overlay buffer once
        overlay = np.zeros( [image.shape[0],image.shape[1],4], np.uint8 )
    # copy the 3 colour channels and the alpha channel into the overlay
    cv2.mixChannels( [image, alpha], [overlay], [0,0,1,1,2,2,3,3] )
    if ( src1 is None ):
        src1 = SetBackground( r.layer1, mat )
    else:
        UpdateImageData( src1, mat )
    if ( src2 is None ):
        src2 = SetBackground( r.layer2, overlay )
    else:
        UpdateImageData( src2, overlay )
    r.Render()
    # blending done
    t = time.time()-t

    if ( c % 10 == 0 ):
        print(1000*t)  # elapsed milliseconds, printed every 10th frame
    c = c+1
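For readers unfamiliar with cv2.mixChannels: the index list [0,0,1,1,2,2,3,3] is a sequence of (input channel, output channel) pairs, with the input channels numbered across the whole input list, so the image contributes channels 0 to 2 and the single-channel alpha contributes channel 3. A NumPy-only sketch of the same packing follows; pack_bgra is a made-up name, and this version allocates a new array each call, unlike the reused overlay buffer in the loop above.

```python
import numpy as np


def pack_bgra(image, alpha):
    """Pack an HxWx3 image and an HxWx1 alpha into one HxWx4 array."""
    overlay = np.empty(image.shape[:2] + (4,), np.uint8)
    overlay[..., :3] = image        # pairs 0->0, 1->1, 2->2
    overlay[..., 3] = alpha[..., 0]  # pair 3->3
    return overlay


# Toy usage on a 2x2 image with a constant alpha of 127.
image = np.arange(12, dtype=np.uint8).reshape(2, 2, 3)
alpha = np.full((2, 2, 1), 127, np.uint8)
overlay = pack_bgra(image, alpha)
```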

Alpha blending video with OpenCV
