Question

Sometimes in our lab, our postgres 8.3 database gets orphaned from its pid file, and we get this message when trying to shut down the database:

Error: pid file is invalid, please manually kill the stale server process postgres

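When this happens, we immediately run pg_dump so we can restore the database later. But if we just kill -9 the orphaned postgres process and then start it again, the database comes up with only the data from the last successful shutdown. If you connect with psql before killing it, though, the data is all there, which is why pg_dump works.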

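Is there a way to gracefully shut down the orphaned postgres process so we don't have to go through pg_dump and restore? Or is there a way to have the database recover after killing the orphaned process?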

Answer 1

According to the documentation, you can send either SIGTERM or SIGQUIT. SIGTERM is preferred. Either way, never use SIGKILL (as you know from personal experience).
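As a rough sketch of the SIGTERM approach: the helper below sends SIGTERM to a PID and then polls until the process has exited or a timeout passes. The name graceful_stop and the 60-second default are my own assumptions, not anything PostgreSQL ships. Since the pid file is invalid in your case, you would have to look up the PID of the parent postgres process yourself (for example with ps).

```shell
#!/bin/bash
# Hypothetical helper (not part of PostgreSQL): ask a process to shut down
# with SIGTERM and wait for it to exit. Never escalates to SIGKILL.
graceful_stop() {
    local pid="$1" timeout="${2:-60}" waited=0

    # kill -0 sends no signal; it only checks whether the process exists.
    kill -0 "$pid" 2>/dev/null || return 0

    kill -TERM "$pid" || return 1

    # Poll once per second until the process is gone or the timeout passes.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "PID $pid still running after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

Usage would look like `graceful_stop 12345 120`, where 12345 stands in for the PID you found. If it times out, the server is probably still waiting on open sessions, so investigate those rather than reaching for kill -9.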

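Edit: on the other hand, what you're experiencing is not normal and could indicate a misconfiguration or a bug. Please ask for help on the pgsql-admin mailing list.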

Answer 2

Never use kill -9.

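I would strongly advise you to figure out exactly how this happens. Where exactly does the error message come from? It is not a PostgreSQL error message. Are you by any chance mixing different ways of starting and stopping the server (sometimes init scripts, sometimes pg_ctl)? That could cause things to get out of sync.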

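But to answer the direct question - use a regular kill (no -9) on the process to shut it down. If more than one postgres process is running, make sure you terminate all of them.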

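The database always performs automatic recovery when it starts up after an unclean shutdown. This should happen after kill -9 as well - any data that was committed should still be there. This almost sounds like you have two different data directories mounted on top of each other, or something like that - this has been a known issue with NFS before, at least.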

Answer 3

I use a script like the following, run by cron every minute.

#!/bin/bash

DB="YOUR_DB"

# Here's a snippet to watch how long each connection to the db has been open:
#     watch -n 1 'ps -o pid,cmd,etime -C postgres | grep $DB'

# This program kills any postgres workers/connections to the specified database
# which have been running for 2 or 3 minutes. It actually kills workers which
# have an elapsed time including "02:" or "03:". That'll be anything running
# for at least 2 minutes and less than 4. It'll also cover anything that
# managed to stay around until an hour and 2 or 3 minutes, etc.
#
# Run this once a minute via cron and it should catch any connection open
# between 2 and 3 minutes. You can temporarily disable it if you need to run
# a long connection once in a while.
#
# The check for "03:" is in case there's a little lag starting the cron job and
# the timing is really bad and it never sees a worker in the 1 minute window
# when it's got "02:".
old=$(ps -o pid,cmd,etime -C postgres | grep "$DB" | grep -E '0[23]:')
if [ -n "$old" ]; then
    echo "Killing:"
    echo "$old"
    echo "$old" | awk '{print $1}' | xargs -I {} kill {}
fi
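The script above matches the text "02:" or "03:" rather than an actual duration. If you want an exact threshold instead, one possible variation (not what the script above does) is to convert the etime column, which ps prints as [[dd-]hh:]mm:ss, into seconds and compare numerically. The function name etime_to_seconds is just something I made up for this sketch.

```shell
#!/bin/bash
# Hypothetical helper: convert a ps etime value ([[dd-]hh:]mm:ss) to seconds.
etime_to_seconds() {
    local t="$1" days=0 h=0 m=0 s=0
    local -a parts

    # Split off an optional "dd-" prefix.
    if [[ "$t" == *-* ]]; then
        days="${t%%-*}"
        t="${t#*-}"
    fi

    # The remainder is either hh:mm:ss or mm:ss.
    IFS=: read -r -a parts <<< "$t"
    case "${#parts[@]}" in
        3) h="${parts[0]}"; m="${parts[1]}"; s="${parts[2]}" ;;
        2) m="${parts[0]}"; s="${parts[1]}" ;;
        *) return 1 ;;
    esac

    # 10# forces base 10 so that "08" and "09" are not parsed as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}
```

With that, a check such as `[ "$(etime_to_seconds "$etime")" -ge 120 ]` would select anything open for two minutes or longer, regardless of how the time happens to be printed.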

How do I gracefully kill a stale postgres server process?
