Create a file to track task progress

Time: 2019-05-14 11:18:56

Tags: python python-3.x postgresql file

I am running a script that transfers a dataset over ssh; it will take nearly 3-4 months to complete. Unfortunately, the connection drops after 6-8 days, so the script has to be restarted.

The script:

import psycopg2
from time import sleep
from config import config
from tqdm import tqdm
import requests
import json
import subprocess

subprocess.call("./airquality.sh", shell=True)

def val_json():
    db = "select to_json(d) from (  select \
        a.particles_data as particles, \
        a.o3_data as \"O3\", \
        to_timestamp(a.seconds) as \"dateObserved\", \
        l.description as name, \
            json_build_object( \
                'coordinates', \
                json_build_array(l.node_lon, l.node_lat) \
            ) as location \
        from airquality as a \
            inner join deployment as d on \
                d.deployment_id = a.deployment_id \
            inner join location as l on \
                l.location_id = d.location_id \
    ) as d"
    return db

def main():

    url = 'http://localhost:1026/v2/entities/003/attrs?options=keyValues'
    headers = {"Content-Type": "application/json", \
               "fiware-service": "urbansense",  \
               "fiware-servicepath": "/basic"}
    conn = None
    try:
        params = config()
        with psycopg2.connect(**params) as conn:
            with conn.cursor(name='my_cursor') as cur:
                cur.itersize = 2000
                cur.execute(val_json())
                # row = cur.fetchone()
                for row in tqdm(cur):
                    jsonData = json.dumps(row)
                    if jsonData.startswith('[') and jsonData.endswith(']'):
                        jsonData = jsonData[1:-1]
                        print(jsonData)
                    requests.post(url, data=jsonData, headers=headers)
                    sleep(1)

                cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()

if __name__ == '__main__':
    main()

How can I create a file and track the transfer progress, so that when this script is run again (after the connection drops) it picks up the dataset from where it previously stopped?

EDIT:

Oops! I'm lost. I managed to get the script running and writing its progress to a text file (air.txt), which I created manually with a content of 0 (otherwise the script would not run at all). While the script runs, the content of air.txt is updated with the cursor position value.

Problem:

My problem now is that when I stop the script (as a way of checking) and restart it to make sure it picks up from the previous position, the script overwrites the previous value and starts again from 0 (beginning a new count instead of reading the stored value as the starting position). Below is my updated script:

def val_json():
    db = "select to_json(d) from (  select \
        a.particles_data as particles, \
        a.o3_data as \"O3\", \
        to_timestamp(a.seconds) as \"dateObserved\", \
        l.description as name, \
            json_build_object( \
                'coordinates', \
                json_build_array(l.node_lon, l.node_lat) \
            ) as location \
        from airquality as a \
            inner join deployment as d on \
                d.deployment_id = a.deployment_id \
            inner join location as l on \
                l.location_id = d.location_id \
    ) as d"
    return db

def main():
    RESTART_POINT_FILE = 'air.txt'
    conn = None
    try:
        params = config()
        with open(RESTART_POINT_FILE) as fd:
            rows_to_skip = int(next(fd))
    #except OSError:
        rows_to_skip = 0
        with psycopg2.connect(**params) as conn:
            with conn.cursor(name='my_cursor') as cur:
                cur.itersize = 2000
                cur.execute(val_json())

                for processed_rows, row in enumerate(tqdm(cur)):
                    if processed_rows < rows_to_skip: continue
                    jsonData = json.dumps(row)
                    if jsonData.startswith('[') and jsonData.endswith(']'):
                        jsonData = jsonData[1:-1]

                        print('\n', processed_rows, '\t', jsonData)
                    #update progress file...
                    with open(RESTART_POINT_FILE, "w") as fd:
                        print(processed_rows, file=fd)
                    sleep(1)

                cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)

    finally:
        if conn is not None:
            conn.close()

if __name__ == '__main__':
    main()

3 Answers:

Answer 0 (score: 1):

A simple way is to use a dedicated file in a well-known location.

The file either contains a single line with the number of successfully processed rows, or it does not exist at all.

At startup, if the file does not exist, the number of records to skip is 0; if it does exist, the number to skip is the one on its first line. The loop should be changed to skip those records and to keep track of the number of the last processed record.

On successful termination the file should be deleted; on error, the number of the last successfully processed record should be written to it.

Skeleton code:

import os  # needed for os.remove at the end

RESTART_POINT_FILE = ... # full path of the restart point file

# begin: read the file:
try:
    with open(RESTART_POINT_FILE) as fd:
        rows_to_skip = int(next(fd))
except OSError:
    rows_to_skip = 0

# loop:

                for processed_rows, row in enumerate(tqdm(cur)):
                    if processed_rows < rows_to_skip: continue
                    ...

# end
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
        # on error: write the file so the next run can resume
        with open(RESTART_POINT_FILE, "w") as fd:
            print(processed_rows, file=fd)
    else:
        # successful termination: remove the file if it exists
        try:
            os.remove(RESTART_POINT_FILE)
        except OSError:
            pass
    finally:
        if conn is not None:
            conn.close()

Note: untested...
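
A possible refinement of the skeleton above: with enumerate the server still streams every skipped row over the network, only for the client to throw it away. The skip can instead be pushed into SQL, so already-processed rows are never fetched at all. A minimal sketch, assuming it is acceptable to sort this dataset by the subquery's "dateObserved" column (without a deterministic order by, offset may skip different rows on each run):

# replaces the plain cur.execute(val_json()) call shown earlier
cur.execute(val_json() + ' order by d."dateObserved" offset %s',
            (rows_to_skip,))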

Answer 1 (score: 0):

Try using a while loop around the connection, with a True/False state: while the connection is down (False), wait until it comes back up (True) again.
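
The answer gives no code; a minimal sketch of the idea, where do_transfer() is a hypothetical wrapper around the transfer loop from the question that raises psycopg2.OperationalError when the connection drops:

import time

import psycopg2

def run_with_retry():
    while True:
        try:
            do_transfer()   # hypothetical: the actual transfer loop
            break           # connection stayed up and the work finished
        except psycopg2.OperationalError as error:
            print(error)
            time.sleep(60)  # connection is down: wait, then try again

Note that this only helps if do_transfer() itself resumes from a saved position, as in answer 0; otherwise every retry starts the transfer over.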

Answer 2 (score: 0):

If your problem is really just the ssh remote terminal timing out, the simple answer is: use a terminal multiplexer running on the remote machine, such as tmux or screen; it will keep your program running even when the session times out, and you only need to reconnect at your convenience and reattach the terminal to watch the progress. You could even use a "terminal detacher" such as nohup (but then you will need to redirect standard output to a file where necessary).

However, this will not save you from the occasional OOM kill, server reboot, and so on; for that, regular serialization of the program state, with a mechanism to reload it, is a good idea.
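
A minimal sketch of such a serialize/reload mechanism, assuming the only state worth keeping is the number of processed rows (the file name checkpoint.json is made up for illustration):

import json
import os

CHECKPOINT = 'checkpoint.json'  # hypothetical file name

def save_state(state):
    # write to a temp file and rename it over the old one, so a crash
    # mid-write cannot leave a corrupted checkpoint behind
    tmp = CHECKPOINT + '.tmp'
    with open(tmp, 'w') as fd:
        json.dump(state, fd)
    os.replace(tmp, CHECKPOINT)

def load_state():
    try:
        with open(CHECKPOINT) as fd:
            return json.load(fd)
    except (OSError, ValueError):
        return {'processed_rows': 0}  # no usable checkpoint: start over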