1)         背景介绍:

最近在vmware中搭建了一个oracle10g RAC的双节点实验平台并将oracle RAC从10.2.0.1升级到10.2.0.5,后来发现两台linux经常自动重启;
 
 
2)         平台信息:
vmware7 + OEL5.7X64 + ASMLib2.0 + ORACLE10.2.0.5
 
 
3)         /var/log/message日志:
Ø NODE1:Linux1
Apr 18 20:44:18 Linux1 syslogd 1.4.1: restart.
Apr 18 20:44:18 Linux1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpuset
Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpu
Apr 18 20:44:18 Linux1 kernel: Linux version 2.6.32-200.13.1.el5uek (mockbuild@ca-build9.us.oracle.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Wed Jul 27 21:02:33 EDT 2011
Apr 18 20:44:18 Linux1 kernel: Command line: ro root=/dev/VolGroup00/LogVol00 rhgb quiet
Apr 18 20:44:18 Linux1 kernel: KERNEL supported cpus:
Apr 18 20:44:18 Linux1 kernel:    Intel GenuineIntel
Apr 18 20:44:18 Linux1 kernel:    AMD AuthenticAMD
Apr 18 20:44:18 Linux1 kernel:    Centaur CentaurHauls
Apr 18 20:44:18 Linux1 kernel: BIOS-provided physical RAM map:
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000ca000 - 00000000000cc000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000dc000 - 00000000000e4000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000000100000 - 00000000bfef0000 (usable)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bfef0000 - 00000000bfeff000 (ACPI data)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bfeff000 - 00000000bff00000 (ACPI NVS)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bff00000 - 00000000c0000000 (usable)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fffe0000 - 0000000100000000 (reserved)
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000100000000 - 0000000140000000 (usable)
Apr 18 20:44:18 Linux1 kernel: DMI present.
Ø NODE2:Linux2
Apr 18 20:43:35 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 has been idle for 30.0 seconds, shutting it down.
Apr 18 20:43:35 Linux2 kernel: (swapper,0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr 1334752985.559806 now 1334753015.306532 dr 1334752985.559360 adv 1334752985.559806:1334752985.559807 func (b651ea27:504) 1334752951.27068:1334752951.27323)
Apr 18 20:43:35 Linux2 kernel: o2net: no longer connected to node Linux1 (num 0) at 192.168.3.131:7777
Apr 18 20:43:56 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7
Apr 18 20:44:05 Linux2 kernel: (o2net,3480,0):o2net_connect_expired:1659 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Apr 18 20:44:24 Linux2 avahi-daemon[4341]: Registering new address record for 192.168.0.136 on eth0.
Apr 18 20:44:26 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7
Apr 18 20:44:28 Linux2 last message repeated 2 times
Apr 18 20:44:28 Linux2 kernel: (o2hb-9938799A41,3564,1):o2dlm_eviction_cb:267 o2dlm has evicted node 0 from group 9938799A418642218A66FE77029DE473
Apr 18 20:44:28 Linux2 kernel: (ocfs2rec,19793,1):ocfs2_replay_journal:1605 Recovering node 0 from slot 0 on device (8,65)
Apr 18 20:44:30 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 8
Apr 18 20:44:31 Linux2 kernel: (ocfs2rec,19793,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 0
Apr 18 20:44:31 Linux2 kernel: (ocfs2_wq,3567,1):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 0
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_get_lock_resource:836 9938799A418642218A66FE77029DE473:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_get_lock_resource:870 9938799A418642218A66FE77029DE473: recovery map is not empty, but must master $RECOVERY lock now
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_do_recovery:523 (3573) Node 1 is the Recovery Master for the Dead Node 0 for Domain 9938799A418642218A66FE77029DE473
以上信息在两台机器中会交换出现,说明并不是总是固定的一台机器对另外一台超时。
 
 
4)         根据message信息报错,应该是o2cb的idle时间超限导致的,系统中O2CB服务的状态为:
[oracle@Linux1]
service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 301
 Network idle timeout: 30000                            /此处单位为毫秒,正式message中报的30秒
 Network keepalive delay: 2000
 Network reconnect delay: 2000
Checking O2CB heartbeat: Active
 
 
5)         /etc/sysconfig/o2cb 中心跳的超时阈值已经修改为了301秒,但idle的时间使用了缺省的30000毫秒
[root@Linux1 ~]# more /etc/sysconfig/o2cb
#
# This is a configuration file for automatic startup of the O2CB
# driver. It is generated by running /etc/init.d/o2cb configure.
# On Debian based systems the preferred method is running
# 'dpkg-reconfigure ocfs2-tools'.
#
 
# O2CB_ENABLED: 'true' means to load the driver on boot.
O2CB_ENABLED=true
 
# O2CB_STACK: The name of the cluster stack backing O2CB.
O2CB_STACK=o2cb
 
# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2
 
# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=301                            /此处单位为秒
 
# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=30000                                  /此处单位为毫秒,正式message中报的30秒
 
# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=2000                               /此处单位为毫秒
 
# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=2000                            /此处单位为毫秒
 
 
6)        可能由于是在pc机中的vmware搭建的多个虚拟机进行的实验,系统负载较重,导致各个节点的idel时间较长引起,因此计划将/etc/init.d/o2cb进行重新配置,将network相关的配置翻倍:
[root@Linux1 ~]# /etc/init.d/o2cb configure
Configuring the O2CB driver.
 
This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on
boot. The current values will be shown in brackets ('[]'). Hitting
<ENTER> without typing an answer will keep that current value. Ctrl-C
will abort.
 
Load O2CB driver on boot (y/n) [y]:
Cluster stack backing O2CB [o2cb]:
Cluster to start on boot (Enter "none" to clear) [ocfs2]:
Specify heartbeat dead threshold (>=7) [301]:                                      /*此处单位为秒
Specify network idle timeout in ms (>=5000) [30000]: 60000       /*此后三行单位为毫秒
Specify network keepalive delay in ms (>=1000) [2000]: 4000
Specify network reconnect delay in ms (>=2000) [2000]: 4000
Writing O2CB configuration: OK
Cluster ocfs2 already online
[root@Linux1 ~]# exit
在各个节点分别重新启动o2cb服务:
[root@Linux2 ~]# service o2cb stop
Stopping O2CB cluster ocfs2: Failed
Unable to stop cluster as heartbeat region still active                   /*此时ocfs2文件系统在加载状态,不能停o2cb服务,需要先umount ocfs2
[root@Linux2 ~]# umount /u02
[root@Linux2 ~]# service o2cb stop
Stopping O2CB cluster ocfs2: OK
Unloading module "ocfs2": OK
Unmounting ocfs2_dlmfs filesystem: OK
Unloading module "ocfs2_dlmfs": OK
Unloading module "ocfs2_stack_o2cb": OK
Unmounting configfs filesystem: OK
Unloading module "configfs": OK
[root@Linux2 ~]# service o2cb start
Loading filesystem "configfs": OK
Mounting configfs filesystem at /sys/kernel/config: OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Setting cluster stack "o2cb": OK
Starting O2CB cluster ocfs2: OK
[root@Linux2 ~]# mount /u02
[root@Linux2 ~]# mount
/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
/dev/sda1 on /boot type ext3 (rw)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
192.168.2.110:/mnt/nfs4backup/nfs4backup/nfs4backup on /mnt/share type nfs (rw,hard,nointr,tcp,noac,nfsvers=3,timeo=600,rsize=32768,wsize=32768,addr=192.168.2.110)
oracleasmfs on /dev/oracleasm type oracleasmfs (rw)
configfs on /sys/kernel/config type configfs (rw)
ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)
/dev/sdf1 on /u02 type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
启动crs
[root@Linux2 ~]# /u01/app/crs/bin/crsctl start crs
Attempting to start CRS stack
The CRS stack will be started shortly
[oracle@Linux2] crs_stat -t
Name            Type           Target    State     Host       
------------------------------------------------------------
ora....SM1.asm application     ONLINE    ONLINE    linux1     
ora....X1.lsnr application     ONLINE    ONLINE   linux1     
ora.linux1.gsd application     ONLINE    ONLINE    linux1     
ora.linux1.ons application     ONLINE    ONLINE    linux1     
ora.linux1.vip application     ONLINE    ONLINE    linux1     
ora....SM2.asm application     ONLINE    ONLINE    linux2     
ora....X2.lsnr application     ONLINE    ONLINE    linux2     
ora.linux2.gsd application     ONLINE    ONLINE    linux2     
ora.linux2.ons application     ONLINE    ONLINE    linux2     
ora.linux2.vip application     ONLINE    ONLINE    linux2     
ora.racdb.db    application    ONLINE    ONLINE    linux2     
ora....b1.inst application     ONLINE    ONLINE    linux1     
ora....b2.inst application     ONLINE    ONLINE    linux2     
ora...._taf.cs application     OFFLINE   OFFLINE              
ora....db1.srv application     OFFLINE   OFFLINE              

ora....db2.srv application    OFFLINE   OFFLINE 

经过观察,系统正常运行,问题圆满解决。