
		OCFS2 - Frequently Asked Questions
		==================================

General
-------

Q01	How do I get started?
A01	a) Download and install the module and tools rpms.
	b) Create cluster.conf and propagate to all nodes.
	c) Configure and start the O2CB cluster service.
	d) Format the volume.
	e) Mount the volume.

Q02	How do I find out which version is running?
A02	# cat /proc/fs/ocfs2/version
	OCFS2 1.0.0 Tue Aug  2 17:38:59 PDT 2005 (build e7bd36709a2c1cb875cf2d533a018f20)

Q03	How do I configure my system to auto-reboot after a panic?
A03	To auto-reboot the system 60 secs after a panic, do:
	# echo 60 > /proc/sys/kernel/panic
	To enable the above on every reboot, add the following to
	/etc/sysctl.conf:
	kernel.panic = 60
==============================================================================

Download and Install
--------------------

Q01	How do I download the rpms?
A01	If you are on Novell's SLES9, upgrade to SP2 and you will have the
	required module installed. However, you will be required to install
	ocfs2-tools and ocfs2console rpms from the distribution.
	If you are on Red Hat's EL4, download and install the appropriate module
	rpm and the two tools rpms, ocfs2-tools and ocfs2console. The appropriate
	module is the one matching the kernel flavor: uniprocessor, smp or
	hugemem.

Q02	How do I install the rpms?
A02	You can install all three rpms in one go using:
		# rpm -ivh ocfs2-tools-X.i386.rpm ocfs2console-X.i386.rpm \
			ocfs2-2.6.9-11.ELsmp-X.i686.rpm
	If you need to upgrade, do:
		# rpm -Uvh ocfs2-2.6.9-11.ELsmp-Y.i686.rpm

Q03	Do I need to install the console?
A03	No, the console is recommended but not required.

Q04	What are the dependencies for installing ocfs2console?
A04	ocfs2console requires e2fsprogs, glib2 2.2.3 or later, vte 0.11.10 or
	later, pygtk2 (EL4) or python-gtk (SLES9) 1.99.16 or later,
	python 2.3 or later and ocfs2-tools.

Q05	What modules are installed with the OCFS2 package?
A05	a) configfs.ko
	b) ocfs2.ko
	c) ocfs2_dlm.ko
	d) ocfs2_dlmfs.ko
	e) ocfs2_nodemanager.ko

Q06	What tools are installed with the tools package?
A06	a) mkfs.ocfs2
	b) fsck.ocfs2
	c) tunefs.ocfs2
	d) debugfs.ocfs2
	e) mount.ocfs2
	f) mounted.ocfs2
	g) ocfs2cdsl
	h) ocfs2_hb_ctl
	i) o2cb_ctl
	j) ocfs2console - installed with the console package
==============================================================================

Configure
---------

Q01	How do I populate /etc/ocfs2/cluster.conf?
A01	If you have installed the console, use it to create this
	configuration file. For details, refer to the user's guide.
	If you do not have the console installed, check the Appendix in the
	User's guide for a sample cluster.conf and the details of all the
	components. 
	Do not forget to copy this file to all the nodes in the cluster.
	If you ever edit this file on any node, ensure the other nodes are
	updated as well.

Q02	Should the IP interconnect be public or private?
A02	Using a private interconnect is recommended. While OCFS2 does not
	take much bandwidth, it does require the nodes to be alive on the
	network and sends regular keepalive packets to ensure that they are.
	A private interconnect helps avoid a network delay being interpreted
	as a node disappearing from the network, which would lead to a
	STONITH. One could use the same interconnect for Oracle RAC and OCFS2.
==============================================================================

O2CB Cluster Service
--------------------

Q01	How do I configure the cluster service?
A01	# /etc/init.d/o2cb configure
	Enter 'y' if you want the service to load on boot and the name of
	the cluster (as listed in /etc/ocfs2/cluster.conf).

Q02	How do I start the cluster service?
A02	a) To load the modules, do:
		# /etc/init.d/o2cb load
	b) To bring it online, do:
		# /etc/init.d/o2cb online [cluster_name]
	If you have configured the cluster to load on boot, you could
	combine the two as follows:
		# /etc/init.d/o2cb start [cluster_name]
	The cluster name is not required if you have specified the name
	during configuration.

Q03	How do I stop the cluster service?
A03	a) To take it offline, do:
		# /etc/init.d/o2cb offline [cluster_name]
	b) To unload the modules, do:
		# /etc/init.d/o2cb unload
	If you have configured the cluster to load on boot, you could
	combine the two as follows:
		# /etc/init.d/o2cb stop [cluster_name]
	The cluster name is not required if you have specified the name
	during configuration.

Q04	How can I learn the status of the cluster?
A04	To learn the status of the cluster, do:
		# /etc/init.d/o2cb status

Q05	I am unable to get the cluster online. What could be wrong?
A05	Check whether the node name in cluster.conf exactly matches the
	hostname. One of the nodes listed in cluster.conf needs to be in the
	cluster for the cluster to be online.
==============================================================================

Format
------

Q01	How do I format a volume?
A01	You could either use the console or use mkfs.ocfs2 directly to format
	the volume.  For console, refer to the user's guide.
		# mkfs.ocfs2 -L "oracle_home" /dev/sdX
	The above formats the volume with default block and cluster sizes,
	which are computed based upon the size of the volume.
		# mkfs.ocfs2 -b 4K -C 32K -L "oracle_home" -N 4 /dev/sdX
	The above formats the volume for 4 nodes with a 4K block size and a
	32K cluster size.

Q02	What does the number of node slots during format refer to?
A02	The number of node slots specifies the number of nodes that can
	concurrently mount the volume. This number is specified during
	format and can be increased using tunefs.ocfs2. This number cannot
	be decreased.

Q03	What should I consider when determining the number of node slots?
A03	OCFS2 allocates system files, like the journal, for each node slot.
	So as not to waste space, one should specify a number within the
	ballpark of the actual number of nodes. Also, as this number can be
	increased later, there is no need to specify a number much larger
	than the number of nodes one plans to mount the volume on.

Q04	Does the number of node slots have to be the same for all volumes?
A04	No. This number can be different for each volume.

Q05	What block size should I use?
A05	The block size is the smallest unit of space addressable by the file
	system. OCFS2 supports block sizes of 512 bytes, 1K, 2K and 4K.
	The block size cannot be changed after the format. For most volume
	sizes, a 4K block size is recommended. A 512-byte block size, on the
	other hand, is never recommended.

Q06	What cluster size should I use?
A06	A cluster size is the smallest unit of space allocated to a file to
	hold the data. OCFS2 supports cluster sizes of 4K, 8K, 16K, 32K,
	64K, 128K, 256K, 512K and 1M. For database volumes, a cluster size
	of 128K or larger is recommended. For Oracle home, 32K to 64K.

Q07	Any advantage of labelling the volumes?
A07	In a shared disk environment, the disk name (/dev/sdX) for a
	particular device may be different on different nodes, so labelling
	becomes a must for easy identification.
	You could also use labels to identify volumes during mount.
		# mount -L "label" /dir
	The volume label is changeable using the tunefs.ocfs2 utility.
==============================================================================

Mount
-----

Q01	How do I mount the volume?
A01	You could either use the console or use mount directly. For console,
	refer to the user's guide.
		# mount -t ocfs2 /dev/sdX /dir
	The above command will mount device /dev/sdX on directory /dir.

Q02	How do I mount by label?
A02	To mount by label do:
		# mount -L "label" /dir

Q03	What entry do I add to /etc/fstab to mount an ocfs2 volume?
A03	Add the following:
		/dev/sdX	/dir	ocfs2	noauto,_netdev	0	0
	The _netdev option indicates that the device needs to be mounted after
	the network is up.

Q04	What do I need to do to automount OCFS2 volumes on boot?
A04	a) Enable o2cb service using:
		# chkconfig --add o2cb
	b) Configure o2cb to load on boot using:
		# /etc/init.d/o2cb configure
	c) Add entries into /etc/fstab as follows:
		/dev/sdX	/dir	ocfs2	_netdev	0	0

Q05	How do I know my volume is mounted?
A05	a) Enter mount without arguments, or
		# mount
	b) List /etc/mtab, or
		# cat /etc/mtab
	c) List /proc/mounts
		# cat /proc/mounts
	The mount command reads /etc/mtab to show the information.

Q06	What are the /config and /dlm mountpoints for?
A06	OCFS2 comes bundled with two in-memory filesystems, configfs and
	ocfs2_dlmfs, mounted at /config and /dlm respectively. configfs is
	used by the ocfs2 tools to communicate to the in-kernel node manager
	the list of nodes in the cluster and to the in-kernel heartbeat
	thread the resource to heartbeat on. ocfs2_dlmfs is used by the ocfs2
	tools to communicate with the in-kernel dlm to take and release
	clusterwide locks on resources.

Q07	Why does it take so much time to mount the volume?
A07	It takes around 5 secs for a volume to mount. The delay lets the
	heartbeat thread stabilize. In a later release, we plan to add
	support for a global heartbeat, which will make most mounts instant.
==============================================================================

Oracle RAC
----------

Q01	Any special flags to run Oracle RAC?
A01	OCFS2 volumes containing the Voting diskfile (CRS), Cluster registry
	(OCR), Data files, Redo logs, Archive logs and Control files must
	be mounted with the "datavolume" and "nointr" mount options. The
	"datavolume" option ensures that the Oracle processes open these files
	with the O_DIRECT flag. The "nointr" option ensures that the I/Os
	are not interrupted by signals.
		# mount -o datavolume,nointr -t ocfs2 /dev/sda1 /u01/db

Q02	What about the volume containing Oracle home?
A02	Oracle home volume should be mounted normally, that is, without the
	"datavolume" and "nointr" mount options. These mount options are only
	relevant for Oracle files listed above.
		# mount -t ocfs2 /dev/sdb1 /software/orahome

Q03	Does that mean I cannot have my data file and Oracle home on the
	same volume?
A03	Yes. The Oracle data files, redo logs, etc. should never be on the
	same volume as the distribution (including trace logs such as
	alert.log).
==============================================================================

Moving data from OCFS (Release 1) to OCFS2
------------------------------------------

Q01	Can I mount OCFS volumes as OCFS2?
A01	No. OCFS and OCFS2 are not on-disk compatible. We had to break the
	compatibility in order to add many of the new features. At the same
	time, we have added enough flexibility in the new disk layout so as to
	maintain backward compatibility in the future.

Q02	Can OCFS volumes and OCFS2 volumes be mounted on the same machine
	simultaneously?
A02	No. OCFS only works on 2.4 Linux kernels (Red Hat's AS2.1/EL3 and SuSE's
	SLES8).  OCFS2, on the other hand, only works on the 2.6 kernels
	(Red Hat's EL4 and SuSE's SLES9).

Q03	Can I access my OCFS volume on 2.6 kernels (SLES9/RHEL4)?
A03	Yes, you can access the OCFS volume on 2.6 kernels using FSCat
	tools, fsls and fscp. These tools can access the OCFS volumes at the
	device layer, to list and copy the files to another filesystem.
	FSCat tools are available on oss.oracle.com.

Q04	Can I in-place convert my OCFS volume to OCFS2?
A04	No. The on-disk layouts of OCFS and OCFS2 are sufficiently different
	that it would require a third disk (as a temporary buffer) in order to
	in-place upgrade the volume. With that in mind, it was decided not to
	develop such a tool but instead provide tools to copy data from OCFS
	without one having to mount it.

Q05	What is the quickest way to move data from OCFS to OCFS2?
A05	Quickest would mean having to perform the minimal number of copies.
	If you have a current backup on a non-OCFS volume accessible from
	the 2.6 kernel install, then all you would need to do is to restore
	the backup on the OCFS2 volume(s). If you do not have a backup but
	have a setup in which the system containing the OCFS2 volumes can
	access the disks containing the OCFS volume, you can use the FSCat
	tools to extract data from the OCFS volume and copy onto OCFS2.
==============================================================================

Coreutils
---------

Q01	Like with OCFS (Release 1), do I need to use O_DIRECT enabled tools
	to perform cp, mv, tar, etc.?
A01	No. OCFS2 does not need the O_DIRECT enabled tools. The file system
	allows processes to open files in both O_DIRECT and buffered mode
	concurrently.
==============================================================================

Troubleshooting
---------------

Q01	How do I enable and disable tracing?
A01	To list all the debug bits along with their statuses, do:
		# cat /proc/fs/ocfs2_nodemanager/log_mask
	To enable tracing the bit SUPER, do:
		# echo "SUPER allow" > /proc/fs/ocfs2_nodemanager/log_mask
	To disable tracing the bit SUPER, do:
		# echo "SUPER off" > /proc/fs/ocfs2_nodemanager/log_mask
	To totally turn off tracing the SUPER bit, as in, turn off
	tracing even if some other bit is enabled for the same, do:
		# echo "SUPER deny" > /proc/fs/ocfs2_nodemanager/log_mask

Q02	Is there a more convenient way to enable and disable tracing?
A02	Yes, using debugfs.ocfs2.
	To list all the debug bits along with their statuses, do:
		# debugfs.ocfs2 -l
	To enable heartbeat tracing, do:
		# debugfs.ocfs2 -l HEARTBEAT ENTRY EXIT allow 
	To disable heartbeat tracing, do:
		# debugfs.ocfs2 -l HEARTBEAT off ENTRY EXIT deny
==============================================================================

Limits
------

Q01	Is there a limit to the number of subdirectories in a directory?
A01	Yes. OCFS2 currently allows up to 32000 subdirectories. While this
	limit could be increased, we will not be doing it till we
	implement some kind of efficient name lookup (htree, etc.).

Q02	Is there a limit to the size of an ocfs2 file system?
A02	Yes, current software addresses block numbers with 32 bits.  So the
	file system device is limited to (2 ^ 32) * blocksize (see mkfs -b).
	With a 4KB block size this amounts to a 16TB file system.  This block
	addressing limit will be relaxed in future software.  At that point
	the limit becomes addressing clusters of 1MB each with 32 bits which
	leads to a 4PB file system.
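
	As a worked check of the arithmetic above, the following Python
	sketch computes both limits. The function name max_fs_bytes and
	its parameters are illustrative only and are not part of any
	OCFS2 tool.

```python
KB = 1024
TB = 1024 ** 4
PB = 1024 ** 5


def max_fs_bytes(unit_size_bytes, address_bits=32):
    """Largest addressable size when units are numbered with address_bits bits."""
    return (2 ** address_bits) * unit_size_bytes


# Current limit: 32-bit block numbers with a 4KB block size.
print(max_fs_bytes(4 * KB) // TB, "TB")      # 16 TB

# Future limit: 32-bit cluster numbers with a 1MB cluster size.
print(max_fs_bytes(1024 * KB) // PB, "PB")   # 4 PB
```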

==============================================================================

System Files
------------

Q01	What are system files?
A01	System files are used to store standard filesystem metadata like
	bitmaps, journals, etc. Storing this information in files in a
	directory allows OCFS2 to be extensible. These system files
	can be accessed using debugfs.ocfs2.

	To list the system files, do:
	# echo "ls -l //" | debugfs.ocfs2 /dev/sdX
        	18              16   1    2  .
        	18              16   2    2  ..
        	19              24   10   1  bad_blocks
        	20              32   18   1  global_inode_alloc
        	21              20   8    1  slot_map
        	22              24   9    1  heartbeat
        	23              28   13   1  global_bitmap
        	24              28   15   2  orphan_dir:0000
        	25              32   17   1  extent_alloc:0000
        	26              28   16   1  inode_alloc:0000
        	27              24   12   1  journal:0000
        	28              28   16   1  local_alloc:0000
        	29              3796 17   1  truncate_log:0000
	The first column lists the block number.

Q02	Why do some files have numbers at the end?
A02	There are two types of files, global and local. Global files are
	for all the nodes, while local, like journal:0000, are node specific.
	The set of local files used by a node is determined by the slot
	mapping of that node. The number at the end of the system file
	name is the slot number.

	To list the slot maps, do:
	# echo "slotmap" | debugfs.ocfs2 -n /dev/sdX
        	Slot#   Node#
	            0      39
        	    1      40
	            2      41
        	    3      42
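
	The relation between the slot map and the local system file names
	can be sketched in Python as below. The helper names are
	hypothetical, and the four-digit zero-padded decimal suffix is an
	assumption inferred from names such as journal:0000 in the listing
	above.

```python
# Slot map copied from the sample debugfs.ocfs2 output above: slot -> node.
SLOT_MAP = {0: 39, 1: 40, 2: 41, 3: 42}


def slot_for_node(node_number, slot_map=SLOT_MAP):
    """Return the slot held by node_number, or None if it holds no slot."""
    for slot, node in slot_map.items():
        if node == node_number:
            return slot
    return None


def local_system_file(base_name, slot):
    """Build a local system file name (assumed format: name:NNNN)."""
    return "%s:%04d" % (base_name, slot)


# Node 41 holds slot 2, so it would use journal:0002.
print(local_system_file("journal", slot_for_node(41)))   # journal:0002
```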
==============================================================================

Heartbeat
---------

Q01	How does the disk heartbeat work?
A01	Every node writes every two secs to its block in the heartbeat
	system file. The block offset is equal to its global node
	number. So node 0 writes to the first block, node 1 to the
	second, etc. All the nodes also read the heartbeat sysfile every
	two secs. As long as the timestamp is changing, that node is
	deemed alive.

Q02	When is a node deemed dead?
A02	An active node is deemed dead if it does not update its
	timestamp for O2CB_HEARTBEAT_THRESHOLD (default=7) loops.
	This value can be configured by adding it to /etc/sysconfig/o2cb
	and restarting the O2CB cluster. This value should be the SAME
	on ALL the nodes in the cluster. Once a node is deemed dead, the
	surviving node which manages to cluster lock the dead node's journal
	recovers it by replaying the journal.
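
	The rule above can be sketched in Python as follows. The helper
	first_dead_read is hypothetical, and the list of timestamps stands
	in for what one node reads from another node's heartbeat block
	every two secs. How the real heartbeat thread counts loops may
	differ in detail.

```python
O2CB_HEARTBEAT_THRESHOLD = 7   # default from this FAQ


def first_dead_read(observed_timestamps, threshold=O2CB_HEARTBEAT_THRESHOLD):
    """Return the index of the read at which the node is deemed dead.

    The node is deemed dead once its timestamp has stayed unchanged for
    `threshold` consecutive reads. Returns None if that never happens.
    """
    unchanged = 0
    for i in range(1, len(observed_timestamps)):
        if observed_timestamps[i] != observed_timestamps[i - 1]:
            unchanged = 0          # timestamp moved, node is alive
        else:
            unchanged += 1
            if unchanged >= threshold:
                return i
    return None


# The node stops updating after timestamp 104.
reads = [100, 102, 104] + [104] * 7
print(first_dead_read(reads))   # 9
```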
	
Q03	What about self fencing?
A03	A node self-fences if it fails to update its timestamp for
	((O2CB_HEARTBEAT_THRESHOLD - 1) * 2) secs. The [o2hb-xx] kernel
	thread, after every timestamp write, sets a timer to panic the system
	after that duration. If the next timestamp is written within that
	duration, as it should be, the thread first cancels that timer before
	setting up a new one. This ensures the system will self-fence if for
	some reason the [o2hb-xx] kernel thread is unable to update the
	timestamp and the node is thus deemed dead by other nodes in the
	cluster.
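
	A minimal Python sketch of the timer logic described above. The
	function names are hypothetical, and treating a gap exactly equal
	to the timeout as "within that duration" is an assumption.

```python
HEARTBEAT_INTERVAL_SECS = 2


def self_fence_timeout_secs(threshold=7):
    """Timeout armed after each timestamp write: (threshold - 1) * 2 secs."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def would_self_fence(write_times_secs, threshold=7):
    """True if any gap between consecutive writes exceeds the timeout.

    Each write cancels the previous timer and arms a new one, so only the
    gap between neighbouring writes matters.
    """
    timeout = self_fence_timeout_secs(threshold)
    pairs = zip(write_times_secs, write_times_secs[1:])
    return any(later - earlier > timeout for earlier, later in pairs)


print(self_fence_timeout_secs())        # 12
print(would_self_fence([0, 2, 4, 6]))   # False
print(would_self_fence([0, 2, 20]))     # True
```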

Q04	What if a node umounts a volume?
A04	During umount, the node will broadcast to all the nodes that
	have mounted that volume to drop that node from their node maps.
	As the journal is shut down before this broadcast, any node crash
	after this point is ignored as there is no need for recovery.
==============================================================================

Quorum and Fencing
------------------

Q01	What is a quorum?
A01	A quorum is a designation given to a group of nodes in a cluster which
	are still allowed to operate on shared storage.  It comes up when
	there is a failure in the cluster which breaks the nodes up
	into groups which can communicate in their groups and with the
	shared storage but not between groups.


Q02	How does OCFS2's cluster services define a quorum? 
A02	The quorum decision is made by a single node based on the number
	of other nodes that are considered alive by heartbeating and the number
	of other nodes that are reachable via the network.

	A node has quorum when:
	* it sees an odd number of heartbeating nodes and has network
	  connectivity to more than half of them.
		or 
	* it sees an even number of heartbeating nodes and has network
	  connectivity to at least half of them *and* has connectivity to
	  the heartbeating node with the lowest node number. 
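
	The two conditions above can be written as a small Python sketch.
	This follows the wording of this answer literally and is not the
	kernel implementation. The function name has_quorum is
	hypothetical, and it is an assumption that the local node counts
	itself as both heartbeating and connected.

```python
def has_quorum(heartbeating, connected):
    """Apply the quorum rule to sets of node numbers.

    heartbeating: nodes seen alive via disk heartbeat (assumed to include self)
    connected:    nodes reachable over the network (assumed to include self)
    """
    total = len(heartbeating)
    if total == 0:
        return False
    reachable = heartbeating & connected
    if total % 2 == 1:
        # Odd count: need connectivity to more than half.
        return 2 * len(reachable) > total
    # Even count: need at least half, plus the lowest-numbered node.
    return 2 * len(reachable) >= total and min(heartbeating) in reachable


# Two-node cluster with a broken interconnect, as seen from each side.
print(has_quorum({0, 1}, {0}))   # True:  node 0 is the lowest node number
print(has_quorum({0, 1}, {1}))   # False: node 1 cannot reach node 0
```

	In the two-node split shown, only the side holding the lowest node
	number keeps quorum, so the tie is broken deterministically.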

Q03	What is fencing?
A03	Fencing is the act of forcefully removing a node from a cluster.
	A node with OCFS2 mounted will fence itself when it realizes that it
	doesn't have quorum in a degraded cluster.  It does this so that other
	nodes won't get stuck trying to access its resources.  Currently OCFS2
	will panic the machine when it realizes it has to fence itself
	off from the cluster.  As described in Q02, it will do this when it
	sees more nodes heartbeating than it has connectivity to and fails
	the quorum test.

Q04	How does a node decide that it has connectivity with another?
A04	When a node sees another come to life via heartbeating it will try
	and establish a TCP connection to that newly live node.  It considers
	that other node connected as long as the TCP connection persists and
	the connection is not idle for 10 seconds.  Once that TCP connection
	is closed or idle it will not be reestablished until heartbeat thinks
	the other node has died and come back alive.

Q05	How long does the quorum process take?
A05	First a node will realize that it doesn't have connectivity with
	another node.  This can happen immediately if the connection is closed
	but can take a maximum of 10 seconds of idle time.  Then the node
	must wait long enough to give heartbeating a chance to declare the
	node dead.  It does this by waiting two iterations longer than 
	the number of iterations needed to consider a node dead (see Q02 in
	the Heartbeat section of this FAQ).  The current default of 7
	iterations of 2 seconds results in waiting for 9 iterations or 18
	seconds.  By default, then, a maximum of 28 seconds can pass from the
	time a network fault occurs until a node fences itself.
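
	The 28 second figure can be reproduced with the short Python
	sketch below. The function and parameter names are illustrative
	only. The defaults are the values quoted in this answer.

```python
def max_secs_to_fence(threshold=7, interval_secs=2, idle_timeout_secs=10):
    """Worst-case time from a network fault to a self-fence.

    idle_timeout_secs: longest wait before the connection counts as idle
    threshold + 2:     heartbeat iterations waited before the quorum decision
    """
    quorum_wait_secs = (threshold + 2) * interval_secs
    return idle_timeout_secs + quorum_wait_secs


print(max_secs_to_fence())   # 28
```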
