Johann Schmitz

During my daily job (or whenever I work with more than "a few" servers) I face a problem regularly: some folder needs to be in sync on all servers or strange things happen. One time it's a common scripts folder, another time a folder with configuration files. Sometimes these are "utility" files, which can easily be written to and read from a network share. But sometimes the program which needs the files has to stay alive even if the fileserver is not available, so the files need to be placed on local storage. Often this is achieved with rsync+ssh, which pushes the changes from one master server to the clients. But what happens if the master server goes down? You probably end up with read-only slaves. Most commonly used software (DNS, DHCP, databases) has cluster or failover support built in, which is probably better than syncing the raw data files. For everything else, a clustered filesystem can be used. Here are my experiences with GlusterFS on Gentoo.

Installation and Configuration

GlusterFS is in the portage tree, so you can simply run emerge sys-cluster/glusterfs on all nodes. We will use the latest stable version 3.1.2. For this example, I will use two boxes for my testing: node1 and node2 (called "bricks" in the GlusterFS jargon). The only dependency I can see is FUSE support in the kernel. Next, start the glusterfs daemon on both nodes via /etc/init.d/glusterd start. To make the nodes known to each other, open a gluster shell: gluster. The help command gives a list of all available commands. We use the peer probe command to connect the nodes (make sure your node names are resolvable via DNS or /etc/hosts):
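In short, the setup on each node boils down to something like this (a sketch; the rc-update line is optional and assumes you want glusterd started at boot under OpenRC):

emerge sys-cluster/glusterfs
/etc/init.d/glusterd start
rc-update add glusterd default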

node1 root # gluster
gluster> peer probe node2
Probe successful

and on node2

node2 root # gluster
gluster> peer probe node1
Probe successful

Running peer status inside the gluster shell should now show the other node on both boxes.
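On node1, the output should look roughly like this (the Uuid will of course differ on your boxes):

node1 root # gluster peer status
Number of Peers: 1

Hostname: node2
Uuid: 97929b17-85a2-4aec-96f6-0790110490d5
State: Peer in Cluster (Connected)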

Replicating files via GlusterFS

To share a folder across all nodes, we have to create and configure a volume. We'll use /tmp/gluster-test as the volume's storage folder (the place where the files are actually stored) and /tmp/gluster-data as the mount point for the gluster volume.

On both nodes:

mkdir /tmp/gluster-test /tmp/gluster-data

On one server:

node1 root # gluster volume create my-volume replica 2 transport tcp node1:/tmp/gluster-test node2:/tmp/gluster-test
Creation of volume my-volume has been successful. Please start the volume to access data.

node1 root # gluster volume start my-volume
Starting volume my-volume has been successful

This command creates and starts a new volume named my-volume with two replicas which communicate via TCP.
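You can verify the volume's settings with volume info; the output should look roughly like this (a sketch, not a verbatim transcript):

node1 root # gluster volume info my-volume

Volume Name: my-volume
Type: Replicate
Status: Started
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: node1:/tmp/gluster-test
Brick2: node2:/tmp/gluster-test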

As the next step, we can mount the GlusterFS volume like any other filesystem:

node1:

node1 root # mount -t glusterfs node1:/my-volume /tmp/gluster-data/ -o auto,rw,allow_other,default_permissions,max_read=131072

node2:

node2 root # mount -t glusterfs node2:/my-volume /tmp/gluster-data/ -o auto,rw,allow_other,default_permissions,max_read=131072
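If the volume should be mounted at boot time, an /etc/fstab entry like the following should do the trick (a sketch for node1; _netdev is meant to delay the mount until the network is up, adjust the options to your taste):

node1:/my-volume   /tmp/gluster-data   glusterfs   defaults,_netdev   0 0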

Now we have a clustered file system! Let's try it:

node1 root # touch /tmp/gluster-data/test

node2 root # ll /tmp/gluster-data/
total 8.0K
drwxr-xr-x 2 root root 4.0K Sep 27 18:55 .
drwxrwxrwt 7 root root 4.0K Sep 27 18:54 ..
-rw-r--r-- 1 root root    0 Sep 27 18:55 test

Testing the emergency

Now, let's test the worst case: one of the nodes goes down (planned or unplanned). When the node comes back, it should sync the files.

I'll use the following simple bash loop to demonstrate continuous writes on one node during the outage of the other:

node1 root # while true; do echo "$(hostname) - $(date)" | tee -a /tmp/gluster-data/dates.txt; sleep 5; done

Start this command on the first node. Next, we'll simulate a network outage between the two nodes:

node1 root # iptables -I OUTPUT 1 -d 192.168.0.2 -p tcp ! --dport 22 -j REJECT

If you run gluster peer status on the first node, you should see the second one as Disconnected (and vice versa):

node1 root # gluster peer status
Number of Peers: 1

Hostname: node2
Uuid: 97929b17-85a2-4aec-96f6-0790110490d5
State: Peer in Cluster (Disconnected)

The /tmp/gluster-data/dates.txt on the first node should contain a few timestamps now. Let's bring back the second one and see what happens:

node1 root # iptables -D OUTPUT 1

node2 root # wc -l /tmp/gluster-data/dates.txt
85 /tmp/gluster-data/dates.txt

Success! But what happens if we write to the same file on both systems during the outage (aka a split-brain)? Let's start the bash loop on both systems and issue the iptables command again. As expected, both systems show the other as "Disconnected". Wait a few seconds and remove the block rule from the iptables chain.

Accessing the data now gives us an Input/Output error. The logfile /var/log/glusterfs/tmp-gluster-data.log contains more information:

Unable to self-heal contents of '/dates.txt' (possible split-brain). Please delete the file from all but the preferred subvolume.

So in this case, we have to fix the problem manually. If we delete the file /tmp/gluster-data/dates.txt from the second node, the process recovers itself but leaves us with a truncated file (because the file gets temporarily deleted on all nodes). So we'd better keep a copy of the "correct" file before fixing the problem and restore it afterwards. After all, this isn't GlusterFS's fault - successfully recovering from a split-brain is a hard job, and almost impossible if you don't have a majority in your cluster that agrees on the content of an object.
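Put into commands, the workaround described above could look roughly like this (a sketch; I take the backup from node1's brick directory, since the split-brained file cannot be read through the mount point):

node1 root # cp /tmp/gluster-test/dates.txt /root/dates.txt.backup
node2 root # rm /tmp/gluster-data/dates.txt
node1 root # cp /root/dates.txt.backup /tmp/gluster-data/dates.txt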