During my daily job (or whenever I work with more than "a few" servers) I regularly face the same problem: some folder needs to be in sync on all servers, or strange things happen. One time it's a common scripts folder, another time a folder with configuration files. Sometimes these are "utility" files which can easily be written to and read from a network share. But sometimes the program which needs the files has to stay alive even when the file server is not available, so the files need to live on local storage. Often this is achieved with
rsync+ssh, which pushes the changes from one master server to the clients. But what happens if the master server goes down? You probably end up with read-only slaves. Most commonly used software (DNS, DHCP, databases) has cluster or failover support built in, which is probably better than syncing the raw data files. For everything else, a clustered filesystem can be used. Here are my experiences with GlusterFS on Gentoo.
Installation and Configuration
GlusterFS is in the portage tree, so you can simply run
emerge sys-cluster/glusterfs on all nodes. We will use the latest stable version 3.1.2.
For this example, I will use two boxes for my testing:
node1 and node2 (called "bricks" in the GlusterFS jargon). The only dependency I can see is the FUSE support in the kernel.
Next, start the glusterfs daemon on both nodes via
/etc/init.d/glusterd start. To make the nodes known to each other, open a gluster shell:
gluster. The help command gives a list of all commands available. We use the
peer probe command to connect the nodes (make sure your node names are resolvable via DNS or /etc/hosts):
node1 root # gluster
gluster> peer probe node2
Probe successful
and on node2
node2 root # gluster
gluster> peer probe node1
Probe successful
peer status inside the gluster shell should now show the other node on both boxes.
Replicating files via GlusterFS
To share a folder across all nodes, we have to create and configure a volume. We'll use
/tmp/gluster-test as the volume's backing folder (the place where the files are actually stored) and
/tmp/gluster-data as the mount point where the gluster volume will be mounted.
On both nodes, create the two directories:
mkdir /tmp/gluster-test /tmp/gluster-data
Then, on one server:
node1 root # gluster volume create my-volume replica 2 transport tcp node1:/tmp/gluster-test node2:/tmp/gluster-test
Creation of volume my-volume has been successful. Please start the volume to access data.
node1 root # gluster volume start my-volume
Starting volume my-volume has been successful
These commands create and start a new volume named
my-volume with two replicas which communicate via TCP.
As the next step, we can mount the GlusterFS volume like every other filesystem:
node1 root # mount -t glusterfs node1:/my-volume /tmp/gluster-data/ -o auto,rw,allow_other,default_permissions,max_read=131072
node2 root # mount -t glusterfs node2:/my-volume /tmp/gluster-data/ -o auto,rw,allow_other,default_permissions,max_read=131072
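To make the mount survive a reboot, an /etc/fstab entry can be used instead of mounting by hand. A sketch (the _netdev option is a generic hint to the init system that this mount needs the network to be up first):

```
node1:/my-volume  /tmp/gluster-data  glusterfs  defaults,_netdev  0  0
```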
Now we have a clustered filesystem! Let's try it:
node1 root # touch /tmp/gluster-data/test
node2 root # ll /tmp/gluster-data/
total 8.0K
drwxr-xr-x 2 root root 4.0K Sep 27 18:55 .
drwxrwxrwt 7 root root 4.0K Sep 27 18:54 ..
-rw-r--r-- 1 root root    0 Sep 27 18:55 test
Testing a node failure
Now, let's test the worst case: one of the nodes goes down (planned or unplanned). When the node comes back, it should sync the files:
I'll use the following simple bash loop to demonstrate continuous writes on one node during the outage of the other:
node1 root # while true; do echo "`hostname` - `date`" | tee -a /tmp/gluster-data/dates.txt; sleep 5; done
Start this command on the first node and as the next step we'll simulate a network outage between the two nodes:
node1 root # iptables -I OUTPUT 1 -d 192.168.0.2 -p tcp ! --dport 22 -j REJECT
If you run
gluster peer status on the first node, you should see the second one as
Disconnected (and vice versa):
node1 root # gluster peer status
Number of Peers: 1

Hostname: node2
Uuid: 97929b17-85a2-4aec-96f6-0790110490d5
State: Peer in Cluster (Disconnected)
The file /tmp/gluster-data/dates.txt on the first node should contain a few timestamps by now. Let's bring back the second one and see what happens:
node1 root # iptables -D OUTPUT 1
node2 root # wc -l /tmp/gluster-data/dates.txt
85 /tmp/gluster-data/dates.txt
Success! But what happens if we write to the same file on both systems during the outage (aka a split-brain)? Let's start the bash loop on both systems and issue the
iptables command again. As expected, both systems show the other as "Disconnected". Wait a few seconds and remove the block rule from the iptables chain.
Accessing the data now gives us an Input/Output error. The logfile
/var/log/glusterfs/tmp-gluster-data.log contains more information:
Unable to self-heal contents of '/dates.txt' (possible split-brain). Please delete the file from all but the preferred subvolume.
So in this case, we have to fix the problem manually. If we delete the file
/tmp/gluster-data/dates.txt from the second node, the process recovers itself but leaves us with a truncated file (because the file is temporarily deleted on all nodes). So we'd better keep a copy of the "correct" file before fixing the problem and restore it afterwards. After all, this isn't GlusterFS's fault: successfully recovering from a split-brain is hard, and almost impossible if you don't have a majority in your cluster that agrees on the content of an object.
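The manual recovery can be sketched like this. To stay runnable outside a real cluster, the sketch first creates stand-in files for the brick directory (/tmp/gluster-test on node2) and the mount point (/tmp/gluster-data); on a live volume those already exist, and step 3 (reading through the mount) is what triggers the self-heal:

```shell
#!/bin/sh
# Stand-in files so the sketch runs outside a cluster; on a real setup
# these already exist (brick on node2, mounted volume on node1).
mkdir -p /tmp/gluster-test /tmp/gluster-data
echo "node1 - good content" > /tmp/gluster-data/dates.txt
echo "node2 - conflicting"  > /tmp/gluster-test/dates.txt

# 1. keep a safety copy of the version we consider correct
cp /tmp/gluster-data/dates.txt /tmp/dates.txt.good

# 2. delete the conflicting copy from the brick on the non-preferred node
rm /tmp/gluster-test/dates.txt

# 3. on a live volume, accessing the file through the mount point
#    triggers the self-heal
ls -l /tmp/gluster-data/

# 4. if the healed file ends up truncated, restore the safety copy
cp /tmp/dates.txt.good /tmp/gluster-data/dates.txt
```

The important part is step 1: make the safety copy before touching anything, because the heal can briefly remove the file everywhere.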