PostgreSQL: SSL SYSCALL error: EOF detected

Question

First, I searched and found several posts about this error. Most of them point to either a RAM problem or an SSL problem, so I tried to rule out SSL by adding sslmode=disable on the command line:

 psql -U waypoint -d waypoint -W -c "alter table telemetria_data.historico alter clase type smallint, alter valor type real[], alter power type smallint, alter voltaje type real;" -h localhost -v sslmode=disable

But I got the same message:

SSL SYSCALL error: EOF detected
connection to server was lost

As for a possible memory problem, I don't know how to troubleshoot it.

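The data structure is the one described in this question. When I run the command, the query runs for a very long time, because it performs a full ALTER TABLE on every inherited table.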

OS：

Linux ip-10-1-0-9 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u2 (2016-10-19) x86_64 GNU/Linux

PostgreSQL：

PostgreSQL 9.4.9 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit

Update 2017-06-01 13:50 GMT

I changed the command (as recommended by @DanielVérité) to:

time PGSSLMODE=disable psql -U waypoint -d waypoint -W -c "alter table telemetria_data.historico alter clase type smallint, alter valor type real[], alter power type smallint, alter voltaje type real;" -h localhost
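Why the first attempt made no difference: in psql, -v sslmode=disable (the same as --set) only assigns a psql variable named sslmode; it is not handed to libpq as a connection parameter, so the client still negotiated SSL. The PGSSLMODE environment variable used above does reach libpq, as does a keyword/value connection string passed in place of the database name, e.g. psql "host=localhost dbname=waypoint user=waypoint sslmode=disable" -W -c "...". The sketch below builds such a string following libpq's quoting rules (empty values and values containing spaces are single-quoted; ' and \ are backslash-escaped). build_conninfo is a hypothetical helper, not part of psql or any driver.

```python
def build_conninfo(**params: str) -> str:
    """Build a libpq keyword/value connection string (hypothetical helper)."""
    parts = []
    for key, value in params.items():
        # libpq: backslashes and single quotes inside a value are escaped with a backslash.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        # Quote empty values and values with whitespace; quoting escaped values too is harmless.
        if value == "" or any(ch.isspace() for ch in value) or escaped != value:
            escaped = f"'{escaped}'"
        parts.append(f"{key}={escaped}")
    return " ".join(parts)


conninfo = build_conninfo(host="localhost", dbname="waypoint",
                          user="waypoint", sslmode="disable")
print(conninfo)  # host=localhost dbname=waypoint user=waypoint sslmode=disable
```

The resulting string can be passed as psql's first non-option argument; setting PGSSLMODE in the environment, as in the updated command, is equivalent for this purpose.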

The error then actually changed to:

server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
connection to server was lost

Update 2017-06-01 15:34 GMT

I found some log entries like the following (in /var/log/postgresql/postgresql-9.4-main.log):

2017-06-01 13:48:49 UTC [22899-357] LOG: checkpoints are occurring too frequently (19 seconds apart)
2017-06-01 13:48:49 UTC [22899-358] HINT: Consider increasing the configuration parameter "checkpoint_segments".

So I'll follow the suggested hint.
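For context on that warning: in 9.4 a checkpoint starts whenever checkpoint_segments WAL segments (16 MB each by default) have been written since the previous one, or when checkpoint_timeout expires, and the warning is logged when segment-driven checkpoints come closer together than checkpoint_warning (30 s by default). The interval therefore gives a rough WAL write rate. This sketch assumes the setting was still at the 9.4 default of checkpoint_segments = 3, which the question does not state.

```python
WAL_SEGMENT_MB = 16  # default WAL segment size in PostgreSQL 9.4


def wal_rate_mb_per_s(checkpoint_segments: int, seconds_apart: float) -> float:
    """Approximate WAL write rate implied by segment-triggered checkpoints."""
    return checkpoint_segments * WAL_SEGMENT_MB / seconds_apart


# Assumption: checkpoint_segments was at the 9.4 default of 3 when the warning fired.
rate = wal_rate_mb_per_s(3, 19)
print(f"about {rate:.2f} MB/s of WAL")  # about 2.53 MB/s of WAL
```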

I also found this group of entries, which actually shows a crash and the subsequent recovery:

2017-06-01 13:49:04 UTC [4982-17] LOG: server process (PID 6569) was terminated by signal 9: Killed
2017-06-01 13:49:04 UTC [4982-18] DETAIL: Failed process was running: alter table telemetria_data.historico alter clase type smallint, alter valor type real[], alter power type smallint, alter voltaje type real;
2017-06-01 13:49:04 UTC [4982-19] LOG: terminating any other active server processes
2017-06-01 13:49:04 UTC [22902-2] WARNING: terminating connection because of crash of another server process
2017-06-01 13:49:04 UTC [22902-3] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-06-01 13:49:04 UTC [22902-4] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2017-06-01 13:49:04 UTC [16383-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:04 UTC [16384-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:04 UTC [16386-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:04 UTC [16385-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:04 UTC [4982-20] LOG: all server processes terminated; reinitializing
2017-06-01 13:49:05 UTC [16402-1] LOG: database system was interrupted; last known up at 2017-06-01 13:48:45 UTC
2017-06-01 13:49:05 UTC [16403-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:05 UTC [16404-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:05 UTC [16414-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:05 UTC [16415-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16452-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16453-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16462-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16463-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16472-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16473-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16482-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:06 UTC [16483-1] waypoint@waypoint FATAL: the database system is in recovery mode
2017-06-01 13:49:09 UTC [16402-2] LOG: database system was not properly shut down; automatic recovery in progress
2017-06-01 13:49:09 UTC [16402-3] LOG: redo starts at 11EC/9960F440
2017-06-01 13:49:21 UTC [16402-4] LOG: unexpected pageaddr 11E6/52726000 in log segment 00000001000011EC000000C9, offset 7495680
2017-06-01 13:49:21 UTC [16402-5] LOG: redo done at 11EC/C9723D60
2017-06-01 13:49:32 UTC [16402-6] LOG: MultiXact member wraparound protections are now enabled
2017-06-01 13:49:32 UTC [4982-21] LOG: database system is ready to accept connections

Any suggestions about this last log section?
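The key line in that excerpt is the first one: the backend running the ALTER TABLE (PID 6569) died from signal 9, SIGKILL, which a process cannot catch or ignore. Because a backend dying that way may have left shared memory inconsistent, the postmaster terminates the other backends, rejects new connections with "the database system is in recovery mode", and replays WAL from the last checkpoint, which is what the remaining lines show. A small sketch for pulling such lines out of a PostgreSQL log and naming the signal with Python's signal module; find_killed_backends is a hypothetical helper.

```python
from __future__ import annotations

import re
import signal

# Postmaster message reporting that a backend was killed by a signal.
KILLED_RE = re.compile(r"server process \(PID (\d+)\) was terminated by signal (\d+)")


def find_killed_backends(log_text: str) -> list[tuple[int, str]]:
    """Return (pid, signal name) for each backend reported as killed by a signal."""
    found = []
    for match in KILLED_RE.finditer(log_text):
        pid, signum = int(match.group(1)), int(match.group(2))
        try:
            name = signal.Signals(signum).name
        except ValueError:  # signal number unknown on this platform
            name = f"signal {signum}"
        found.append((pid, name))
    return found


line = ("2017-06-01 13:49:04 UTC [4982-17] LOG: server process (PID 6569) "
        "was terminated by signal 9: Killed")
print(find_killed_backends(line))  # on Linux: [(6569, 'SIGKILL')]
```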

The OOM killer is enabled; this is the output from /var/log/messages:

Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.672817] psql invoked oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.675652] psql cpuset=/ mems_allowed=0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.677274] CPU: 1 PID: 16367 Comm: psql Not tainted 3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680406] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] 0000000000000000 ffffffff815123b5 ffff88003dcda1d0 0000000000000000
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] ffffffff8150ff8d 0000000000000000 ffffffff810d6e3f 0000000000000000
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] ffffffff81516d2e 0000000000000200 ffffffff810689d3 ffffffff810c43e4
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] Call Trace:
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff815123b5>] ? dump_stack+0x5d/0x78
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff8150ff8d>] ? dump_header+0x76/0x1e8
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff810d6e3f>] ? smp_call_function_single+0x5f/0xa0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81516d2e>] ? mutex_lock+0xe/0x2a
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff810689d3>] ? put_online_cpus+0x23/0x80
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff810c43e4>] ? rcu_oom_notify+0xc4/0xe0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81153d1c>] ? do_try_to_free_pages+0x4ac/0x520
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff811427dd>] ? oom_kill_process+0x21d/0x370
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff8114239d>] ? find_lock_task_mm+0x3d/0x90
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81142f43>] ? out_of_memory+0x473/0x4b0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81148e0f>] ? __alloc_pages_nodemask+0x9ef/0xb50
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81065c86>] ? copy_process.part.25+0x116/0x1c50
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffffa00f6bba>] ? call_filldir+0x9a/0x160 [ext4]
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81058301>] ? __do_page_fault+0x1d1/0x4f0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff811ac3f9>] ? get_empty_filp+0xc9/0x1c0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff81067990>] ? do_fork+0xe0/0x3d0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff811c6a1c>] ? __alloc_fd+0x7c/0x120
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff815188f9>] ? stub_clone+0x69/0x90
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.680557] [<ffffffff8151858d>] ? system_call_fast_compare_end+0x10/0x15
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.734210] Mem-Info:
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.735151] Node 0 DMA per-cpu:
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.736512] CPU 0: hi: 0, btch: 1 usd: 0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.738300] CPU 1: hi: 0, btch: 1 usd: 0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.740072] Node 0 DMA32 per-cpu:
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.741335] CPU 0: hi: 186, btch: 31 usd: 0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.743092] CPU 1: hi: 186, btch: 31 usd: 0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] active_anon:370484 inactive_anon:549110 isolated_anon:24
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] active_file:240 inactive_file:1425 isolated_file:0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] unevictable:0 dirty:173 writeback:0 unstable:0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] free:20427 slab_reclaimable:9729 slab_unreclaimable:3425
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] mapped:567547 shmem:587500 pagetables:4209 bounce:0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.745210] free_cma:0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.760872] Node 0 DMA free:15224kB min:184kB low:228kB high:276kB active_anon:228kB inactive_anon:188kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:0kB dirty:0kB writeback:0kB mapped:144kB shmem:204kB slab_reclaimable:4kB slab_unreclaimable:80kB kernel_stack:80kB pagetables:4kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.777646] lowmem_reserve[]: 0 3757 3757 3757
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.779553] Node 0 DMA32 free:69956kB min:44868kB low:56084kB high:67300kB active_anon:1481708kB inactive_anon:2196252kB active_file:1772kB inactive_file:1748kB unevictable:0kB isolated(anon):96kB isolated(file):0kB present:3915776kB managed:3849676kB mlocked:0kB dirty:0kB writeback:0kB mapped:2267676kB shmem:2349796kB slab_reclaimable:38712kB slab_unreclaimable:13620kB kernel_stack:2032kB pagetables:16832kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:14 all_unreclaimable? no
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.795757] lowmem_reserve[]: 0 0 0 0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.797650] Node 0 DMA: 11*4kB (EM) 8*8kB (EM) 1*16kB (E) 2*32kB (UE) 1*64kB (E) 1*128kB (E) 2*256kB (UE) 2*512kB (EM) 3*1024kB (UEM) 3*2048kB (EMR) 1*4096kB (M) = 15228kB
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.804011] Node 0 DMA32: 13752*4kB (UEM) 85*8kB (EM) 54*16kB (M) 43*32kB (M) 17*64kB (M) 15*128kB (M) 10*256kB (M) 3*512kB (M) 2*1024kB (M) 0*2048kB 1*4096kB (R) = 71176kB
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.811528] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.815461] 588017 total pagecache pages
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.816967] 0 pages in swap cache
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.818205] Swap cache stats: add 0, delete 0, find 0/0
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.820215] Free swap = 0kB
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.821475] Total swap = 0kB
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.822543] 982941 pages RAM
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.823640] 0 pages HighMem/MovableOnly
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.825110] 16525 pages reserved
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.826310] 0 pages hwpoisoned
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.827473] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.830523] [ 159] 0 159 8242 800 21 0 0 systemd-journal
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.833717] [ 162] 0 162 10200 135 22 0 -1000 systemd-udevd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.836679] [ 316] 0 316 6351 1726 14 0 0 dhclient
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.839533] [ 351] 0 351 7181 72 18 0 0 cron
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.842112] [ 353] 0 353 4964 68 14 0 0 systemd-logind
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.844894] [ 362] 107 362 10531 96 26 0 -900 dbus-daemon
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.847504] [ 376] 106 376 8345 154 21 0 0 ntpd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.850085] [ 377] 0 377 65721 457 30 0 0 rsyslogd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.852755] [ 388] 0 388 3909 39 12 0 0 agetty
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.855305] [ 389] 0 389 3864 40 13 0 0 agetty
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.857914] [ 451] 0 451 13796 168 29 0 -1000 sshd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.860562] [ 481] 1002 481 26362 5081 54 0 0 perfmon_loop.rb
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.863918] [ 486] 1002 486 15211 3146 31 0 0 cht_perfmon
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.867273] [ 625] 0 625 9560 144 22 0 0 master
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.869878] [ 630] 108 630 10164 234 24 0 0 qmgr
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.872356] [ 3443] 0 3443 20130 213 41 0 0 sshd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.874903] [ 3445] 1000 3445 20164 222 39 0 0 sshd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.877451] [ 3446] 1000 3446 3176 43 9 0 0 sftp-server
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.880261] [ 4982] 105 4982 614831 42946 132 0 -900 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.882988] [ 6634] 0 6634 1570 23 9 0 0 collectdmon
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.885784] [ 6635] 0 6635 174485 156 36 0 0 collectd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.888347] [22899] 105 22899 615399 541666 1105 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.890970] [22900] 105 22900 615395 14251 88 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.893836] [22901] 105 22901 615088 4252 53 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.896365] [22902] 105 22902 615305 1316 60 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.899083] [22903] 105 22903 21336 378 40 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.901954] [ 2946] 108 2946 10076 137 22 0 0 pickup
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.904462] [ 6376] 0 6376 20130 213 42 0 0 sshd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.907440] [ 6378] 1000 6378 20130 209 40 0 0 sshd
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.910534] [ 6379] 1000 6379 5795 151 16 0 0 bash
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.913144] [ 6382] 0 6382 11515 107 28 0 0 sudo
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.915635] [ 6383] 0 6383 11895 96 27 0 0 su
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.918149] [ 6384] 105 6384 5796 139 16 0 0 bash
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.920900] [ 6561] 105 6561 18289 236 40 0 0 psql
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.923394] [ 6569] 105 6569 925161 853454 1718 0 0 postgres
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.926097] [16319] 0 16319 10865 95 25 0 0 cron
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.928611] [16320] 0 16320 10865 95 25 0 0 cron
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.931063] [16321] 0 16321 10865 95 25 0 0 cron
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.933758] [16322] 0 16322 10865 95 25 0 0 cron
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.936197] [16323] 1000 16323 1084 20 7 0 0 sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.938762] [16324] 1000 16324 1084 20 7 0 0 sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.941450] [16325] 1000 16325 1084 21 7 0 0 sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.943849] [16326] 1000 16326 1084 21 6 0 0 sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.946409] [16327] 1000 16327 3612 54 12 0 0 telemetria.sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.949258] [16328] 1000 16328 3613 57 12 0 0 instantaneo.sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.952633] [16329] 1000 16329 21335 4808 48 0 0 mon-put-instanc
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.955744] [16330] 1000 16330 3612 54 12 0 0 conexiones.sh
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.958555] [16366] 1000 16366 10744 1513 26 0 0 psql
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.961216] [16367] 1000 16367 10735 1512 26 0 0 psql
Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.963722] [16368] 1000 16368 7725 1044 19 0 0 aws
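To read the task table in that dump: the rss column is counted in pages, assumed here to be 4 kB each (the x86_64 default). Converted, PID 6569, the same backend PostgreSQL reported as killed, had about 3.3 GB resident on a machine with 982941 pages (about 3.75 GB) of RAM and no swap. Shared-memory pages a process has touched most likely count toward its rss as well (shared_buffers is 2 GB here), so these figures overstate private usage. A hypothetical top_rss helper that ranks the rows:

```python
import re

PAGE_KB = 4  # assumption: 4 kB pages, the x86_64 default

# One row of the kernel OOM task table:
# [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
ROW_RE = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)"
)


def top_rss(log_text: str, n: int = 3) -> list:
    """Return (pid, name, rss in MB) for the n rows with the largest rss."""
    rows = []
    for m in ROW_RE.finditer(log_text):
        pid, rss_pages, name = int(m.group(1)), int(m.group(5)), m.group(9)
        rows.append((rss_pages * PAGE_KB / 1024, pid, name))
    rows.sort(reverse=True)
    return [(pid, name, round(mb, 1)) for mb, pid, name in rows[:n]]


sample = """\
kernel: [260956.880261] [ 4982] 105 4982 614831 42946 132 0 -900 postgres
kernel: [260956.888347] [22899] 105 22899 615399 541666 1105 0 0 postgres
kernel: [260956.923394] [ 6569] 105 6569 925161 853454 1718 0 0 postgres
"""
print(top_rss(sample, 2))  # [(6569, 'postgres', 3333.8), (22899, 'postgres', 2115.9)]
```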

Update 2017-06-01 16:19 GMT

I changed the configuration to:

checkpoint_segments = 100   # in logfile segments, min 1, 16MB each
checkpoint_timeout = 30s    # range 30s-1h

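And I filled up the hard drive :( I raised checkpoint_segments generously without first checking the available space. Fortunately, I'm testing this procedure in a non-production environment. I may have to clone the production server again; or is there a way to free the temporary space that was used?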

ERROR: could not extend file "base/16384/3940428": No space left on device
HINT: Check free disk space.
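A rough estimate of why raising checkpoint_segments filled the disk: the 9.4 documentation says pg_xlog normally holds no more than (2 + checkpoint_completion_target) * checkpoint_segments + 1 or checkpoint_segments + wal_keep_segments + 1 segment files, whichever is larger, at 16 MB each. With checkpoint_segments = 100 and the default checkpoint_completion_target of 0.5, that is 251 files, roughly 3.9 GB of WAL, on top of the new copy of each table that ALTER TABLE ... TYPE writes before the old one is removed. The sketch assumes checkpoint_completion_target and wal_keep_segments were left at their defaults (0.5 and 0), which the question does not show.

```python
SEGMENT_MB = 16  # default WAL segment size


def expected_max_wal_mb(checkpoint_segments: int,
                        completion_target: float = 0.5,
                        wal_keep_segments: int = 0) -> float:
    """Normal upper bound on pg_xlog size in 9.4, per the WAL configuration docs."""
    files = max((2 + completion_target) * checkpoint_segments + 1,
                checkpoint_segments + wal_keep_segments + 1)
    return files * SEGMENT_MB


print(expected_max_wal_mb(100))  # 4016.0 (MB), roughly 3.9 GB
```

This is the normal ceiling, not a hard limit; short bursts of WAL can temporarily exceed it.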

Following @deszo's question, these are the memory overcommit settings:

vm.nr_overcommit_hugepages = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50
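With vm.overcommit_memory = 0 the kernel uses heuristic overcommit, so most allocations succeed up front and a real shortage is resolved later by the OOM killer, which fits what happened here. The PostgreSQL documentation suggests vm.overcommit_memory = 2 on Linux; in that mode the kernel refuses allocations beyond CommitLimit = (RAM - huge pages) * overcommit_ratio / 100 + swap, or overcommit_kbytes + swap when that is non-zero. The sketch below does that arithmetic, assuming usable RAM of (982941 - 16525) pages of 4 kB taken from the OOM dump and no swap: with overcommit_ratio = 50 the limit comes to about 1.84 GiB, less than shared_buffers = 2GB alone, so the ratio would likely need raising before switching modes.

```python
def commit_limit_kb(ram_kb: int, swap_kb: int, overcommit_ratio: int,
                    overcommit_kbytes: int = 0, hugetlb_kb: int = 0) -> int:
    """CommitLimit enforced under vm.overcommit_memory = 2 (integer kB, like /proc/meminfo)."""
    if overcommit_kbytes:
        # vm.overcommit_kbytes, when set, replaces the ratio-based term.
        return overcommit_kbytes + swap_kb
    return (ram_kb - hugetlb_kb) * overcommit_ratio // 100 + swap_kb


# Assumption: usable RAM taken from the OOM dump as (982941 - 16525) pages of 4 kB.
ram_kb = (982941 - 16525) * 4
limit = commit_limit_kb(ram_kb, swap_kb=0, overcommit_ratio=50)
shared_buffers_kb = 2 * 1024 * 1024
print(limit, limit < shared_buffers_kb)  # 1932832 True
```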

Update 2017-06-01 18:107 GMT

The server instance is an AWS c4.large (2 vCPUs, 3.75 GB RAM).

Some more parameters from postgresql.conf:

shared_buffers = 2GB        # min 128kB
work_mem = 32MB             # min 64kB
max_connections = 800       # (change requires restart)
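A quick back-of-the-envelope check of these settings against the 3.75 GB of RAM: work_mem is a per-operation limit (each sort or hash step in a query may use up to that much), so shared_buffers + max_connections * work_mem is a common, deliberately rough worst-case estimate. Here it comes to 2048 MB + 800 * 32 MB = 27648 MB, about seven times the machine's memory. The sketch only does that arithmetic; it ignores maintenance_work_mem, which governs index builds, and per-process overhead.

```python
def worst_case_memory_mb(shared_buffers_mb: int, work_mem_mb: int,
                         max_connections: int, ops_per_connection: int = 1) -> int:
    """Crude upper bound: shared memory plus work_mem for every operation in every backend."""
    return shared_buffers_mb + work_mem_mb * max_connections * ops_per_connection


ram_mb = 3.75 * 1024  # AWS c4.large
budget = worst_case_memory_mb(2048, 32, 800)
print(budget, round(budget / ram_mb, 1))  # 27648 7.2
```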

Evan Carroll · Answer

My guess is that you're running PostgreSQL on AWS. If that's the case, you should say so in your question and tags.

The last process mentioned in the kernel dump is actually aws, so that seems like a safe bet:

Jun 1 13:49:04 ip-10-1-0-9 kernel: [260956.963722] [16368] 1000 16368 7725 1044 19 0 0 aws

As far as PostgreSQL is concerned, this simply means the connection failed.

As for troubleshooting what is triggering the oom_killer on AWS, you have two options: