I am running a script for a dataset transfer over ssh, which will take nearly 3-4 months to complete. Unfortunately, the connection drops after 6-8 days, so the script has to be restarted.
The script:
import psycopg2
from time import sleep
from config import config
from tqdm import tqdm
import requests
import json
import subprocess

subprocess.call("./airquality.sh", shell=True)


def val_json():
    db = "select to_json(d) from ( select \
        a.particles_data as particles, \
        a.o3_data as \"O3\", \
        to_timestamp(a.seconds) as \"dateObserved\", \
        l.description as name, \
        json_build_object( \
            'coordinates', \
            json_build_array(l.node_lon, l.node_lat) \
        ) as location \
        from airquality as a \
        inner join deployment as d on \
            d.deployment_id = a.deployment_id \
        inner join location as l on \
            l.location_id = d.location_id \
    ) as d"
    return db


def main():
    url = 'http://localhost:1026/v2/entities/003/attrs?options=keyValues'
    headers = {"Content-Type": "application/json",
               "fiware-service": "urbansense",
               "fiware-servicepath": "/basic"}
    conn = None
    try:
        params = config()
        with psycopg2.connect(**params) as conn:
            with conn.cursor(name='my_cursor') as cur:
                cur.itersize = 2000
                cur.execute(val_json())
                # row = cur.fetchone()
                for row in tqdm(cur):
                    jsonData = json.dumps(row)
                    if jsonData.startswith('[') and jsonData.endswith(']'):
                        jsonData = jsonData[1:-1]
                    print(jsonData)
                    requests.post(url, data=jsonData, headers=headers)
                    sleep(1)
                cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()


if __name__ == '__main__':
    main()
How can I create a file to track the transfer progress, so that when the script is run again (after the connection drops), it picks up the dataset from where it previously stopped?
Edit:
Oops! I am stuck.
I managed to get the script running and writing its progress to a text file (air.txt), which I had to create manually with the content 0 (otherwise the script would not run at all). While the script runs, the contents of air.txt are updated with the cursor position value.
Problem:
My problem now is that when I stop the script (as a way of checking) and restart it to verify that it picks up from the previous position, it starts over from 0 and overwrites the previous value (beginning a new count instead of reading the stored value as the starting position).
Here is my updated script:
def val_json():
    db = "select to_json(d) from ( select \
        a.particles_data as particles, \
        a.o3_data as \"O3\", \
        to_timestamp(a.seconds) as \"dateObserved\", \
        l.description as name, \
        json_build_object( \
            'coordinates', \
            json_build_array(l.node_lon, l.node_lat) \
        ) as location \
        from airquality as a \
        inner join deployment as d on \
            d.deployment_id = a.deployment_id \
        inner join location as l on \
            l.location_id = d.location_id \
    ) as d"
    return db


def main():
    RESTART_POINT_FILE = 'air.txt'
    conn = None
    try:
        params = config()
        with open(RESTART_POINT_FILE) as fd:
            rows_to_skip = int(next(fd))
        #except OSError:
        rows_to_skip = 0
        with psycopg2.connect(**params) as conn:
            with conn.cursor(name='my_cursor') as cur:
                cur.itersize = 2000
                cur.execute(val_json())
                for processed_rows, row in enumerate(tqdm(cur)):
                    if processed_rows < rows_to_skip: continue
                    jsonData = json.dumps(row)
                    if jsonData.startswith('[') and jsonData.endswith(']'):
                        jsonData = jsonData[1:-1]
                    print('\n', processed_rows, '\t', jsonData)
                    # update progress file...
                    with open(RESTART_POINT_FILE, "w") as fd:
                        print(processed_rows, file=fd)
                    sleep(1)
                cur.close()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()


if __name__ == '__main__':
    main()
Answer 0 (score: 1)
A simple approach is to use a dedicated file in a well-known place.
That file either contains a single line with the number of rows successfully processed, or does not exist.
At startup, if the file does not exist, the number of records to skip is 0; if it exists, it is the number on the first line of the file. The loop should be changed to skip those records and to keep track of the number of the last processed record.
On successful termination the file should be deleted; on a write error, the number of the last successfully processed record should be written to it.
Skeleton code:
RESTART_POINT_FILE = ...  # full path of the restart point file

# begin: read the file:
try:
    with open(RESTART_POINT_FILE) as fd:
        rows_to_skip = int(next(fd))
except OSError:
    rows_to_skip = 0

# loop:
for processed_rows, row in enumerate(tqdm(cur)):
    if processed_rows < rows_to_skip: continue
    ...

# end
except (Exception, psycopg2.DatabaseError) as error:
    print(error)
    # write the file
    with open(RESTART_POINT_FILE, "w") as fd:
        print(processed_rows, file=fd)
finally:
    if conn is not None:
        conn.close()
    # try to remove the file if it exists
    try:
        os.remove(RESTART_POINT_FILE)
    except OSError:
        pass
Note: not tested...
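To make the pattern concrete, here is a minimal, self-contained sketch of the restart-point file idea, with a plain list standing in for the server-side cursor and an appended list standing in for the requests.post call (the function names and the list data are placeholders, not part of the original script):

```python
import os


def load_restart_point(path):
    """Return the number of rows already processed, or 0 if the file is absent/invalid."""
    try:
        with open(path) as fd:
            return int(next(fd))
    except (OSError, ValueError, StopIteration):
        return 0


def save_restart_point(path, rows_done):
    """Record how many rows have been fully processed so far."""
    with open(path, "w") as fd:
        print(rows_done, file=fd)


def process(rows, path):
    """Send each row, skipping rows already completed in a previous run."""
    rows_to_skip = load_restart_point(path)
    sent = []
    for processed_rows, row in enumerate(rows):
        if processed_rows < rows_to_skip:
            continue  # already sent in a previous run
        sent.append(row)  # stand-in for the real requests.post(...) call
        # write the count of rows *completed* (index + 1), so a restart
        # does not re-send the last processed row
        save_restart_point(path, processed_rows + 1)
    # successful termination: remove the file so a fresh run starts from 0
    try:
        os.remove(path)
    except OSError:
        pass
    return sent
```

Note the off-by-one choice: saving `processed_rows + 1` pairs correctly with the `processed_rows < rows_to_skip` skip test, whereas saving the bare index (as the updated script in the question does) would re-send one row after every restart.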
Answer 1 (score: 0)
Try using a while loop on the connection status, True or False, and while the connection is False, wait until it becomes True again.
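This answer is terse, but the idea of looping until a connection attempt succeeds can be sketched as a generic retry wrapper (the function and its parameters are illustrative, not part of the original script):

```python
import time


def retry(func, attempts=5, delay=1.0, exceptions=(Exception,)):
    """Call func(), retrying up to `attempts` times on the given exceptions."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of retries: propagate the last error
            time.sleep(delay)  # back off before trying again


# Example: a call that fails twice before succeeding.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("not yet")
    return "ok"

print(retry(flaky, attempts=5, delay=0.01))  # succeeds on the third call
```

In the question's script, `func` would wrap the `psycopg2.connect(...)` or `requests.post(...)` call; note that this alone does not resume a dropped server-side cursor, so it complements rather than replaces the restart-point file.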
Answer 2 (score: 0)
If your problem is caused solely by the ssh remote terminal timing out, the simple answer is: use a terminal multiplexer that runs on the remote machine, such as tmux or screen. It keeps the program running even when the session times out; you just reconnect at your convenience and reattach the terminal to watch its progress. Even a "terminal detacher" like nohup will do (but then you will need to redirect standard output to a file).
However, that will not save you from an occasional OOM kill, a server restart, and so on. For that, regular serialization of the program state together with a reload mechanism is a good idea.
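For the nohup and tmux variants, the invocations might look like this (the script name, session name, and log path are placeholders):

```shell
# Start the transfer detached from the controlling terminal; an SSH
# disconnect no longer kills it. Stdout/stderr are redirected to a log file.
nohup python3 transfer.py > transfer.log 2>&1 &
echo "transfer running as PID $!"

# With tmux instead, the session survives disconnects and can be reattached
# later to inspect the tqdm progress bar:
#   tmux new-session -d -s airquality 'python3 transfer.py'
#   tmux attach -t airquality
```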