mount.ocfs2: Transport endpoint is not connected while mounting...?

Question

I replaced a dead node in a DRBD dual-primary setup running OCFS2. All the steps work:

/proc/drbd

version: 8.3.13 (api:88/proto:86-96)
GIT-hash: 83ca112086600faacab2f157bc5a9324f7bd7f77 build by mockbuild@builder10.centos.org, 2012-05-07 11:56:36
 1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
    ns:81 nr:407832 dw:106657970 dr:266340 al:179 bm:6551 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

until I try to mount the volume:

mount -t ocfs2 /dev/drbd1 /data/webroot/
mount.ocfs2: Transport endpoint is not connected while mounting /dev/drbd1 on /data/webroot/. Check 'dmesg' for more information on this error.

/var/log/kern.log

kernel: (o2net,11427,1):o2net_connect_expired:1664 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_try_to_join_domain:1210 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_join_domain:1488 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):dlm_register_domain:1754 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):ocfs2_dlm_init:2808 ERROR: status = -107
kernel: (mount.ocfs2,12037,1):ocfs2_mount_volume:1447 ERROR: status = -107
kernel: ocfs2: Unmounting device (147,1) on (node 1)
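For reference, the -107 in these messages is a negated errno value. Below is a minimal Python sketch that pulls the status codes out of such log lines and looks them up in the local errno table. The helper names extract_status_codes and describe are made up for illustration. On Linux, 107 is ENOTCONN, which matches the "Transport endpoint is not connected" text printed by mount.ocfs2.

```python
import errno
import os
import re

# Matches the "status = -N" suffix used by the dlm/ocfs2 error lines above.
STATUS_RE = re.compile(r"status = -(\d+)")


def extract_status_codes(log_text):
    """Return the distinct positive errno values found in the log, in order."""
    seen = []
    for match in STATUS_RE.finditer(log_text):
        code = int(match.group(1))
        if code not in seen:
            seen.append(code)
    return seen


def describe(code):
    """Format an errno value with its symbolic name and message on this OS."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{code} {name}: {os.strerror(code)}"


sample = (
    "kernel: (mount.ocfs2,12037,1):dlm_request_join:1036 ERROR: status = -107\n"
    "kernel: (mount.ocfs2,12037,1):dlm_join_domain:1488 ERROR: status = -107\n"
)

for code in extract_status_codes(sample):
    print(describe(code))
```

The errno numbering is platform specific, so the lookup is only meaningful when run on the same kind of system that produced the log.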

Below is the kernel log on node 0 (192.168.3.145):

kernel: : (swapper,0,7):o2net_listen_data_ready:1894 bytes: 0
kernel: : (o2net,4024,3):o2net_accept_one:1800 attempt to connect from unknown node at 192.168.2.93:43868
kernel: : (o2net,4024,3):o2net_connect_expired:1664 ERROR: no connection established with node 1 after 30.0 seconds, giving up and returning errors.
kernel: : (o2net,4024,3):o2net_set_nn_state:478 node 1 sc: 0000000000000000 -> 0000000000000000, valid 0 -> 0, err 0 -> -107

I'm sure that /etc/ocfs2/cluster.conf is identical on both nodes:

/etc/ocfs2/cluster.conf

node:
    ip_port = 7777
    ip_address = 192.168.3.145
    number = 0
    name = SVR233NTC-3145.localdomain
    cluster = cpc

node:
    ip_port = 7777
    ip_address = 192.168.2.93
    number = 1
    name = SVR022-293.localdomain
    cluster = cpc

cluster:
    node_count = 2
    name = cpc

and that they can reach each other on the cluster port:

# nc -z 192.168.3.145 7777
Connection to 192.168.3.145 7777 port [tcp/cbt] succeeded!

But the O2CB heartbeat is not active on the new node (192.168.2.93):

/etc/init.d/o2cb status

Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster cpc: Online
Heartbeat dead threshold = 31
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Not active

Below is the output of tcpdump on node 0 while starting ocfs2 on node 1:

1 0.000000 192.168.2.93 -> 192.168.3.145 TCP 70 55274 > cbt [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSval=690432180 TSecr=0
2 0.000008 192.168.3.145 -> 192.168.2.93 TCP 70 cbt > 55274 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSval=707657223 TSecr=690432180
3 0.000223 192.168.2.93 -> 192.168.3.145 TCP 66 55274 > cbt [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSval=690432181 TSecr=707657223
4 0.000286 192.168.2.93 -> 192.168.3.145 TCP 98 55274 > cbt [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=32 TSval=690432181 TSecr=707657223
5 0.000292 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181
6 0.000324 192.168.3.145 -> 192.168.2.93 TCP 66 cbt > 55274 [RST, ACK] Seq=1 Ack=33 Win=5792 Len=0 TSval=707657223 TSecr=690432181

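The RST flag is sent on every sixth packet: node 0 accepts the TCP connection, acknowledges the first 32 bytes from node 1, and then resets the connection.

What else can I do to debug this case?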

PS:

OCFS2 version on node 0:

ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-274.12.1.el5-1.4.7-1.el5

OCFS2 version on node 1:

ocfs2-tools-1.4.4-1.el5
ocfs2-2.6.18-308.el5-1.4.7-1.el5

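UPDATE 1 - Sun Dec 23 18:15:07 ICT 2012

Are both nodes on the same LAN segment? No routers etc.?

No, they are two VMware servers on different subnets.

Oh, while I remember: are hostnames/DNS all set up and working correctly?

Sure, I added both the hostname and the IP address of each node to /etc/hosts: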

192.168.2.93 SVR022-293.localdomain
192.168.3.145 SVR233NTC-3145.localdomain

And they can connect to each other by hostname:

# nc -z SVR022-293.localdomain 7777
Connection to SVR022-293.localdomain 7777 port [tcp/cbt] succeeded!
# nc -z SVR233NTC-3145.localdomain 7777
Connection to SVR233NTC-3145.localdomain 7777 port [tcp/cbt] succeeded!

UPDATE 2 - Mon Dec 24 18:32:15 ICT 2012

Found a clue: a colleague edited /etc/ocfs2/cluster.conf by hand while the cluster was running. As a result, the dead node's information is still held in /sys/kernel/config/cluster/:

# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR150-4107.localdomain
drwxr-xr-x 2 root root 0 Dec 24 18:21 SVR233NTC-3145.localdomain

(SVR150-4107.localdomain in this case)
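This kind of drift between the file and the running cluster can be checked mechanically. The following is a rough Python sketch, not an official tool: parse_node_names, compare_nodes and live_node_names are hypothetical helpers, and the parser assumes the simple "stanza header followed by key = value lines" layout shown in the cluster.conf above. The demo uses the data from this post rather than reading a real system.

```python
import os


def parse_node_names(conf_text):
    """Collect the 'name = ...' values that belong to 'node:' stanzas."""
    names = []
    in_node = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # A stanza header such as "node:" or "cluster:".
            in_node = line == "node:"
            continue
        key, sep, value = line.partition("=")
        if sep and in_node and key.strip() == "name":
            names.append(value.strip())
    return names


def compare_nodes(conf_names, live_names):
    """Return (stale, missing): live-only nodes and conf-only nodes."""
    conf, live = set(conf_names), set(live_names)
    return sorted(live - conf), sorted(conf - live)


def live_node_names(cluster, root="/sys/kernel/config/cluster"):
    """List node directories registered in configfs (empty if not present)."""
    path = os.path.join(root, cluster, "node")
    return sorted(os.listdir(path)) if os.path.isdir(path) else []


conf_text = """\
node:
    ip_port = 7777
    ip_address = 192.168.3.145
    number = 0
    name = SVR233NTC-3145.localdomain
    cluster = cpc

node:
    ip_port = 7777
    ip_address = 192.168.2.93
    number = 1
    name = SVR022-293.localdomain
    cluster = cpc

cluster:
    node_count = 2
    name = cpc
"""

# Directory names copied from the ls output above.
live = ["SVR150-4107.localdomain", "SVR233NTC-3145.localdomain"]

stale, missing = compare_nodes(parse_node_names(conf_text), live)
print("registered in configfs but not in cluster.conf:", stale)
print("in cluster.conf but not registered in configfs:", missing)
```

On a real node, live_node_names("cpc") would replace the hard-coded list.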

I tried to stop the cluster in order to delete the dead node, but got the following error:

# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active

I'm sure the ocfs2 service is already stopped:

# mounted.ocfs2 -f
Device      FS     Nodes
/dev/sdb    ocfs2  Not mounted
/dev/drbd1  ocfs2  Not mounted

There are no references left either:

# ocfs2_hb_ctl -I -u 12963EAF4E16484DB81ECB0251177C26
12963EAF4E16484DB81ECB0251177C26: 0 refs

I also unloaded the ocfs2 kernel module to make sure:

# ps -ef | grep [o]cfs2
root 12513 43 0 18:25 ? 00:00:00 [ocfs2_wq]
# modprobe -r ocfs2
# ps -ef | grep [o]cfs2
# lsof | grep ocfs2

But nothing changes:

# /etc/init.d/o2cb offline
Stopping O2CB cluster cpc: Failed
Unable to stop cluster as heartbeat region still active

So the final question is: how do I delete the dead node information without rebooting?

UPDATE 3 - Mon Dec 24 22:41:51 ICT 2012

Below are all the running heartbeat threads:

# ls -l /sys/kernel/config/cluster/cpc/heartbeat/ | grep '^d'
drwxr-xr-x 2 root root 0 Dec 24 22:18 72EF09EA3D0D4F51BDC00B47432B1EB2

Reference count for this heartbeat region:

# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 7 refs

Trying to kill it:

# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

Any ideas?

quanta · Accepted Answer

Oh yeah! Problem solved.

Pay attention to the UUID:

# mounted.ocfs2 -d
Device      FS     Stack  UUID                              Label
/dev/sdb    ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
/dev/drbd1  ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1

But:

# ls -l /sys/kernel/config/cluster/cpc/heartbeat/
drwxr-xr-x 2 root root 0 Dec 24 22:53 72EF09EA3D0D4F51BDC00B47432B1EB2

This probably happened because I "accidentally" force-reformatted the OCFS2 volume. The problem I'm facing is similar to this one on the Ocfs2-user mailing list.

This is also the reason for the error below:

ocfs2_hb_ctl: File not found by ocfs2_lookup while stopping heartbeat

because ocfs2_hb_ctl cannot find a device with the UUID 72EF09EA3D0D4F51BDC00B47432B1EB2 in /proc/partitions.
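The mismatch can be spotted by comparing the two listings. Here is a small Python sketch under the assumption that the mounted.ocfs2 -d output has the column layout shown above. The functions device_uuids and orphaned_regions are illustrative names, not part of ocfs2-tools.

```python
import re

# OCFS2 UUIDs appear in these listings as 32 upper-case hex digits.
UUID_RE = re.compile(r"[0-9A-F]{32}")


def device_uuids(mounted_output):
    """Map each /dev/... device in 'mounted.ocfs2 -d' output to its UUID."""
    uuids = {}
    for line in mounted_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("/dev/"):
            continue  # skip the prompt, the header row and blank lines
        for field in fields[1:]:
            if UUID_RE.fullmatch(field):
                uuids[fields[0]] = field
                break
    return uuids


def orphaned_regions(region_names, mounted_output):
    """Heartbeat regions whose UUID matches no listed device."""
    known = set(device_uuids(mounted_output).values())
    return [name for name in region_names if name not in known]


mounted_output = """\
Device      FS     Stack  UUID                              Label
/dev/sdb    ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
/dev/drbd1  ocfs2  o2cb   12963EAF4E16484DB81ECB0251177C26  ocfs2_drbd1
"""

# Directory name seen under /sys/kernel/config/cluster/cpc/heartbeat/.
regions = ["72EF09EA3D0D4F51BDC00B47432B1EB2"]

print("regions with no matching device:", orphaned_regions(regions, mounted_output))
```

A non-empty result means a heartbeat region is still registered for a UUID that no current volume carries, which is the situation described here.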

Can I change the UUID of an OCFS2 volume?

Looking through the tunefs.ocfs2 man page:

Usage: tunefs.ocfs2 [options] <device> [new-size]
       tunefs.ocfs2 -h|--help
       tunefs.ocfs2 -V|--version
[options] can be any mix of:
        -U|--uuid-reset[=new-uuid]

So I ran the following command:

# tunefs.ocfs2 --uuid-reset=72EF09EA3D0D4F51BDC00B47432B1EB2 /dev/drbd1
WARNING!!! OCFS2 uses the UUID to uniquely identify a file system. Having two OCFS2 file systems with the same UUID could, in the least, cause erratic behavior, and if unlucky, cause file system damage. Please choose the UUID with care.
Update the UUID ?yes

Verify:

# tunefs.ocfs2 -Q "%U\n" /dev/drbd1
72EF09EA3D0D4F51BDC00B47432B1EB2

Then I tried to kill the heartbeat region again to see what happens:

# ocfs2_hb_ctl -K -u 72EF09EA3D0D4F51BDC00B47432B1EB2
# ocfs2_hb_ctl -I -u 72EF09EA3D0D4F51BDC00B47432B1EB2
72EF09EA3D0D4F51BDC00B47432B1EB2: 6 refs

Keep killing until you see 0 refs, then take the cluster offline:

# /etc/init.d/o2cb offline cpc
Stopping O2CB cluster cpc: OK
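The "keep killing until 0 refs" step can be scripted. This is only a sketch in Python: it assumes each ocfs2_hb_ctl -K -u call drops one reference (as the 7 to 6 change above suggests) and that -I -u prints "<uuid>: N refs". The names parse_refs, run_hb_ctl and drain_region are invented for this example, and the demo at the bottom uses a simulated runner so that nothing touches a real cluster.

```python
import re
import subprocess

REFS_RE = re.compile(r":\s*(\d+)\s+refs")


def parse_refs(info_output):
    """Extract N from a line like '<uuid>: N refs'."""
    match = REFS_RE.search(info_output)
    if match is None:
        raise ValueError(f"no ref count found in: {info_output!r}")
    return int(match.group(1))


def run_hb_ctl(flag, uuid):
    """Run 'ocfs2_hb_ctl <flag> -u <uuid>' and return its stdout."""
    result = subprocess.run(
        ["ocfs2_hb_ctl", flag, "-u", uuid],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def drain_region(uuid, run=run_hb_ctl, max_kills=64):
    """Issue -K until -I reports 0 refs; return the number of kills issued."""
    kills = 0
    while parse_refs(run("-I", uuid)) > 0:
        if kills >= max_kills:
            raise RuntimeError("ref count did not reach 0; giving up")
        run("-K", uuid)
        kills += 1
    return kills


# Simulated run: a fake runner that starts at 6 refs, as in the output above.
state = {"refs": 6}


def fake_run(flag, uuid):
    if flag == "-K":
        state["refs"] -= 1
        return ""
    return f"{uuid}: {state['refs']} refs\n"


kills_needed = drain_region("72EF09EA3D0D4F51BDC00B47432B1EB2", run=fake_run)
print("kills issued:", kills_needed)
```

The max_kills cap is there so that a region whose count never drops does not loop forever. Calling drain_region with the default runner requires root and the ocfs2-tools package.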

And stop it:

# /etc/init.d/o2cb stop
Stopping O2CB cluster cpc: OK
Unloading module "ocfs2": OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK

Start it again to see whether the new node has been picked up:

# /etc/init.d/o2cb start
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Starting O2CB cluster cpc: OK
# ls -l /sys/kernel/config/cluster/cpc/node/
total 0
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR022-293.localdomain
drwxr-xr-x 2 root root 0 Dec 26 19:02 SVR233NTC-3145.localdomain

OK, on the peer node (192.168.2.93), I tried to start OCFS2:

# /etc/init.d/ocfs2 start
Starting Oracle Cluster File System (OCFS2) [ OK ]

Thanks to Sunil Mushran, because this thread helped me solve the problem.

The lessons are:

The IP address, port, ... can only be changed while the cluster is offline. See the FAQ.
Never force-reformat an OCFS2 volume.