Question

我找到了一个帖子，说明如何告诉bsub在运行here之前等待一组指定的作业完成，但是这只有在事先了解作业数量时才有效。

我想运行任意数量的工作，并运行＆＃34;包装＆＃34;我的所有工作完成后的工作

这是我的剧本：

#!/bin/bash
for file in dir/*; do # I don't know how many jobs will be created
    bsub "./do_it_once.sh $file"
done

bsub -w "done(1) && done(2) && done(3)" merge_results.sh

当提交了3个作业时，上述脚本将起作用，但是如果有n个作业怎么办？如何指定我想等待所有工作完成？

Answer 1

修改请参阅kamula's answer了解实际效果:)。

原始答案

从未使用bsub，但快速浏览the man page，我认为可能会这样做：

#!/bin/bash
jobnum=0
for file in src/*; do # I don't know how many jobs will be created
    bsub -J "myjobs[$jobnum]" "./do_it_once.sh $file"
    jobnum=$((jobnum + 1))
done

bsub -w "done(myjobs[*])" merge_results.sh

使用bsub变量myjobs[]，在名为bash的{{1}}数组中使用顺序索引命名作业。然后最后jobnum等待所有bsub个工作完成。

YMMV！

哦 - 此外，您可能需要使用myjobs[]（-J "\"myjobs[...]\""）。手册页说要用双引号括起作业名称，但是我不知道是否有\"要求，或者他们是否假设您将使用扩展未引用文本的shell。

Answer 2

基于cxw's reply，我得到了一些工作。它不使用数组。但是，-w命令可以使用通配符，因此我以相似的方式命名每个作业。

仍然不确定这是否是调用bsub的正确方法，因为每次都需要调用一次，但它有效。

从cxw编辑：

#!/bin/bash
jobnum=0
for file in src/*; do # I don't know how many jobs will be created
    bsub -J "myjobs${jobnum}" "./do_it_once.sh $file"
    jobnum=$((jobnum + 1))
done

bsub -w "done(myjobs*)" merge_results.sh

Answer 3

这是我的完整解决方案，它增加了时间控制并提供了失败作业的数量。如果需要，还要注意杀死失败工作的孩子，并处理僵尸或不间断的过程：

function Logger {
    echo "$1"
}

# Portable child (and grandchild) kill function tester under Linux, BSD and MacOS X
function KillChilds {
    local pid="${1}" # Parent pid to kill childs
    local self="${2:-false}" # Should parent be killed too ?


    if children="$(pgrep -P "$pid")"; then
            KillChilds "$child" true
        done
    fi
        # Try to kill nicely, if not, wait 15 seconds to let Trap actions happen before killing
    if ( [ "$self" == true ] && kill -0 $pid > /dev/null 2>&1); then
        kill -s TERM "$pid"
        if [ $? != 0 ]; then
            sleep 15
            Logger "Sending SIGTERM to process [$pid] failed."
            kill -9 "$pid"
            if [ $? != 0 ]; then
                Logger "Sending SIGKILL to process [$pid] failed."
                return 1
            fi
        else
            return 0
        fi
    else
        return 0
    fi
}

function WaitForTaskCompletion {
    local pids="${1}" # pids to wait for, separated by semi-colon
    local soft_max_time="${2}" # If program with pid $pid takes longer than $soft_max_time seconds, will log a warning, unless $soft_max_time equals 0.
    local hard_max_time="${3}" # If program with pid $pid takes longer than $hard_max_time seconds, will stop execution, unless $hard_max_time equals 0.
    local caller_name="${4}" # Who called this function
    local counting="${5:-true}" # Count time since function has been launched if true, since script has been launched if false
    local keep_logging="${6:-0}" # Log a standby message every X seconds. Set to zero to disable logging

    local soft_alert=false # Does a soft alert need to be triggered, if yes, send an alert once
    local log_ttime=0 # local time instance for comparaison

    local seconds_begin=$SECONDS # Seconds since the beginning of the script
    local exec_time=0 # Seconds since the beginning of this function

    local retval=0 # return value of monitored pid process
    local errorcount=0 # Number of pids that finished with errors

    local pid   # Current pid working on
    local pidCount # number of given pids
    local pidState # State of the process

    local pidsArray # Array of currently running pids
    local newPidsArray # New array of currently running pids

    IFS=';' read -a pidsArray <<< "$pids"
    pidCount=${#pidsArray[@]}

    WAIT_FOR_TASK_COMPLETION=""

    while [ ${#pidsArray[@]} -gt 0 ]; do
        newPidsArray=()

        Spinner
        if [ $counting == true ]; then
            exec_time=$(($SECONDS - $seconds_begin))
        else
            exec_time=$SECONDS
        fi

        if [ $keep_logging -ne 0 ]; then
            if [ $((($exec_time + 1) % $keep_logging)) -eq 0 ]; then
                if [ $log_ttime -ne $exec_time ]; then # Fix when sleep time lower than 1s
                    log_ttime=$exec_time
                fi
            fi
        fi

        if [ $exec_time -gt $soft_max_time ]; then
            if [ $soft_alert == true ] && [ $soft_max_time -ne 0 ]; then
                Logger "Max soft execution time exceeded for task [$caller_name] with pids [$(joinString , ${pidsArray[@]})]."
                soft_alert=true
                SendAlert true

            fi
            if [ $exec_time -gt $hard_max_time ] && [ $hard_max_time -ne 0 ]; then
                Logger "Max hard execution time exceeded for task [$caller_name] with pids [$(joinString , ${pidsArray[@]})]. Stopping task execution."
                for pid in "${pidsArray[@]}"; do
                    KillChilds $pid true
                    if [ $? == 0 ]; then
                        Logger "Task with pid [$pid] stopped successfully." "NOTICE"
                    else
                        Logger "Could not stop task with pid [$pid]." "ERROR"
                    fi
                done
                SendAlert true
                errrorcount=$((errorcount+1))
            fi
        fi

        for pid in "${pidsArray[@]}"; do
            if [ $(IsNumeric $pid) -eq 1 ]; then
                if kill -0 $pid > /dev/null 2>&1; then
                    # Handle uninterruptible sleep state or zombies by ommiting them from running process array (How to kill that is already dead ? :)
                    #TODO(high): have this tested on *BSD, Mac & Win
                    pidState=$(ps -p$pid -o state= 2 > /dev/null)
                    if [ "$pidState" != "D" ] && [ "$pidState" != "Z" ]; then
                        newPidsArray+=($pid)
                    fi
                else
                    # pid is dead, get it's exit code from wait command
                    wait $pid
                    retval=$?
                    if [ $retval -ne 0 ]; then
                        errorcount=$((errorcount+1))
                        Logger "${FUNCNAME[0]} called by [$caller_name] finished monitoring [$pid] with exitcode [$retval]. "DEBUG"
                        if [ "$WAIT_FOR_TASK_COMPLETION" == "" ]; then
                            WAIT_FOR_TASK_COMPLETION="$pid:$retval"
                        else
                            WAIT_FOR_TASK_COMPLETION=";$pid:$retval"
                        fi
                    fi
                fi

            fi
        done

        pidsArray=("${newPidsArray[@]}")
        # Trivial wait time for bash to not eat up all CPU
        sleep .05
    done

    # Return exit code if only one process was monitored, else return number of errors
    if [ $pidCount -eq 1 ] && [ $errorcount -eq 0 ]; then
        return $errorcount
    else
        return $errorcount
    fi
}

用法：

让我们做3个睡眠工作，获取他们的pid并将它们发送给WaitforTaskCompletion：

sleep 10 &
pids="$!"
sleep 15 &
pids="$pids;$!"
sleep 20 &
pids="$pids;$!"

WaitForTaskCompletion $pids 1800 3600 ${FUNCNAME[0]} false 1800

前面的示例会警告您执行是否超过1小时，2小时后停止执行，并每半小时发送一条“活动”日志消息。

Answer 4

由于bjobs的输出在没有作业挂起/正在运行时为1行（No unfinished job found），在至少有1个作业在挂起/正在运行时为2行：

JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
25156   awesome RUN   best_queue superhost   30*host     cool_name  Jun 16 05:38

您可以使用以下方法在bjobs | wc -l上循环播放：

for job in $some_jobs; 
    bsub < $job

    # Waiting for jobs to complete
    while [[ `bjobs | wc -l` -ge 2 ]] ; do \
        sleep 15
    done
done

此技术的一个好处是，您可以启动多个作业，而不管您需要运行多少个作业。只是在等待之前循环播放它们。显然，这不是最清洁的方法，但目前可以使用。

for some_jobs in $job_groups; do \
    for job in $some_jobs; do \
        bsub < $job
    done

    # Waiting for jobs to complete
    while [[ `bjobs | wc -l` -ge 2 ]] ; do \
        sleep 15
    done
done

只有在我之前的所有工作完成后才能运行工作

4 个答案:

原始答案