Gluster, Scale Disperse Volume by One Node – Part-2

Physical movement of disks supported by external tool
In the last blog, I described a scenario where we have to scale our disperse volume by only one node, and outlined two different approaches to do so. Here, I will continue with the second approach, which involves physical migration of drives/bricks from one server/node to another. This approach is supported by a tool called pos_main.py.

A little background:
When we add new bricks to an existing volume, we use the “gluster volume add-brick” command. It is our responsibility to make sure that the bricks are spread across different nodes, ideally one per node, and in any case with no more than the redundancy count of bricks of any sub volume on a single node.

Before using this tool, let’s look at two important conditions which should always hold during this migration of drives from one server to another.
1 – We should not migrate more than the redundancy count (let’s call it R) of bricks of any old sub volume to the new node.
2 – At the end, no old node should hold more than R of the bricks coming from the new node.
For example, in a 4+2 configuration R = 2, so at most 2 bricks of any sub volume may end up on a single node.

Let’s move on to the usage of the tool, pos_main.py.
First, the new node has to be added to the trusted storage pool which hosts our existing disperse volume. We can do so by using gluster peer probe <new node IP or hostname>.
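For example, with the new node used later in this post (node-4), it would look something like this:

# gluster peer probe node-4
peer probe: success
# gluster peer status

The second command is only there to confirm that node-4 now shows up in the pool.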

pos_main.py needs to be executed from any one of the servers in the cluster pool. We need to provide three inputs to this tool after we run it.
1 – Name of the volume
2 – Name of the new server
3 – Name of a file containing the list of new drives, one per line, in the form “node:<mount-point of the drive>”
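For the run shown below, new-node.txt would simply contain one such entry per line:

node-4:/root/brick-new-1
node-4:/root/brick-new-2
node-4:/root/brick-new-3
node-4:/root/brick-new-4
node-4:/root/brick-new-5
node-4:/root/brick-new-6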

# python3 pos_main.py
Enter name of the volume: vol
Enter file name which contains new bricks: new-node.txt
Enter IP/hostname of new node: node-4
volname = vol
filename = new-node.txt
hostname = node-4
New Bricks***
node-4:/root/brick-new-1
node-4:/root/brick-new-2
node-4:/root/brick-new-3
node-4:/root/brick-new-4
node-4:/root/brick-new-5
node-4:/root/brick-new-6

As soon as we provide these inputs, the script validates them and checks whether the volume can be scaled. The following checks are done to make sure that scaling can happen without any issue –
1 – Is the existing volume which we are trying to scale healthy?
2 – Do we have enough new bricks to scale?
3 – Is the existing volume well spread out, with a fault-tolerant layout?
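These are essentially the same things one would verify by hand before any maintenance. Assuming the volume from the run above is named vol, a manual spot check would look something like this:

# gluster volume status vol
# gluster volume heal vol info
# gluster volume info vol

The first command shows whether all bricks and the self-heal daemon are online, the second should report no pending heal entries, and the third shows the current brick layout and disperse configuration.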

Once all the conditions are met and the checks pass, the script does the calculations and provides a map of the drives, listing the old bricks and the respective new bricks which should be swapped. An example:
old_brick=apandey:/home/apandey/bricks/gluster/vol-6 and new brick=node-4:/root/brick-new-6
old_brick=apandey:/home/apandey/bricks/gluster/vol-18 and new brick=node-4:/root/brick-new-5
All together : 1
One brick at a time : 0
How do you want to swap device: (0 or 1)

The map contains all the old and new bricks which need to be swapped between the old and new servers. From here on, the user can swap the drives all in one shot or one by one, as per the choice given to the script.
If the user opts for “One brick at a time“, the script waits for the user to swap the drive it asked for. Once the swap is confirmed, the drive on the new node is included in the existing disperse sub volume and a heal is triggered to make sure the volume becomes healthy.
On the other hand, if the user goes for the other option, i.e. “All together“, the tool waits for the user to swap all the drives and confirm it. At the end, all the bricks are included in the existing volume and a heal is again triggered to make sure we don’t have any inconsistency in our volume.
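In both cases, the heal triggered by the script can also be monitored manually; for the example volume above:

# gluster volume heal vol info

Once this reports zero pending entries for every brick, the sub volume is consistent again.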
After successful swapping, the new bricks end up on different servers. Now we can add them using the “gluster volume add-brick” command, and all the bricks will be properly spread out, giving us a fault-tolerant disperse volume.
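As a rough sketch of that final step (the hostnames and brick paths here are illustrative, not the ones printed by the tool), the expansion of the volume by one more 4+2 sub volume after the swap would be something like:

# gluster volume add-brick vol \
    node-1:/bricks/new-b1 node-1:/bricks/new-b2 \
    node-2:/bricks/new-b3 node-3:/bricks/new-b4 \
    node-4:/bricks/new-b5 node-4:/bricks/new-b6

No node contributes more than R = 2 bricks to the new sub volume, so losing any single node still leaves it usable. gluster may ask for “force” here because some bricks of the new sub volume still share a host.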

Yeah, it does look like a complex procedure just to scale our volume by one node. However, it is the fastest and a very reliable way of adding a single new node which already has the drives attached.

Please drop a comment if you have any queries or a better approach to scale the setup by one node. Don’t hesitate to modify the tool to make it more robust, faster and better.

Gluster, Scale Disperse Volume by One Node – Part-1

I need to scale out our gluster disperse volume to increase storage capacity; however, our IT department does not have the budget and infrastructure to provide more than one server.
If you are one of the gluster users who has landed in this situation, read on. This post has a solution to scale your disperse (erasure-coded) volume by just one node/server.

When we set up a disperse volume of a given configuration, say 4+2, we have to make sure that at most 2 bricks (the redundancy count) are placed on any one server, to provide redundancy at the node level. The best setup would be to host one brick per server; however, for a reasonably fault-tolerant disperse volume we need at least 3 servers. For this blog, let’s assume we have a 3-node setup.
Let’s say we created a disperse volume of 3 * (4+2), where each server hosts 6 bricks and we have 3 disperse sub volumes. Each sub volume has 2 bricks on each of the 3 nodes.
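As a sketch of how such a layout could be created (volume name, hostnames and brick paths are illustrative), note that gluster groups bricks into sub volumes in the order they are listed, so each consecutive group of 6 bricks below becomes one 4+2 sub volume with 2 bricks on every node:

# gluster volume create vol disperse 6 redundancy 2 \
    node-1:/bricks/b1 node-1:/bricks/b2 node-2:/bricks/b1 node-2:/bricks/b2 node-3:/bricks/b1 node-3:/bricks/b2 \
    node-1:/bricks/b3 node-1:/bricks/b4 node-2:/bricks/b3 node-2:/bricks/b4 node-3:/bricks/b3 node-3:/bricks/b4 \
    node-1:/bricks/b5 node-1:/bricks/b6 node-2:/bricks/b5 node-2:/bricks/b6 node-3:/bricks/b5 node-3:/bricks/b6 \
    force

“force” is needed here because gluster normally warns whenever more than one brick of a disperse sub volume sits on the same server.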

Initial Setup

After using this volume for some time, we reached the maximum capacity, and now we want to increase the storage capacity of this volume. We could just add another 4+2 = 6 bricks to this volume using the “gluster volume add-brick” command, but all the disk slots on the old servers are occupied, so we cannot attach any additional disks to the existing nodes. We asked our IT team for 3 new servers with 2 new disks attached to each, but all they provided is one server with 6 disks attached to it.

New Node N4 added to the cluster

If we add all these 6 bricks to the existing EC volume, the new sub volume will have all of its bricks on one node, which could be very dangerous. If we lose the connection to that node, or the node crashes, we will end up losing the whole sub volume and the data stored on it.
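Concretely, the tempting but risky expansion would have been something like this (volume name and paths are illustrative):

# gluster volume add-brick vol \
    node-4:/bricks/b1 node-4:/bricks/b2 node-4:/bricks/b3 \
    node-4:/bricks/b4 node-4:/bricks/b5 node-4:/bricks/b6

gluster would normally warn about this layout and only proceed with “force”; the whole new sub volume would then depend on node N4 alone.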

So we have to move some of the disks placed on the new node to the existing old nodes so that the disks are properly distributed.

Setup after Disk Migration

There are two approaches to perform this migration of bricks –
1 – Software driven movement of data using replace-bricks
2 – Physical movement of disks supported by external tool

1 – Software driven movement of data using replace-bricks
In this approach, a module (yet to be implemented) in gluster will identify the disks which need to be replaced and trigger the data movement. This data migration might happen using the gluster heal feature or some other data copying method like xfs_copy or rsync. Although this approach does not require any human interaction after adding the new node, it is considerably slower when there is a huge amount of data on the volume, because the data movement happens over the wire, which is a time-consuming process.
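Until such a module exists, the same data movement can be done by hand with the existing replace-brick command. A rough, per-brick sketch (volume name, hostnames and paths are illustrative) for moving one old brick’s data onto one of the new drives on N4 would be:

# gluster volume replace-brick vol node-1:/bricks/b2 node-4:/bricks/new-b1 commit force

This swaps the brick in the volume layout immediately, and the self-heal daemon then rebuilds its data from the remaining bricks of that sub volume; the freed disk on the old node can afterwards be used for the new sub volume. Repeating this for every brick that has to move is exactly the over-the-wire data transfer that makes this approach slow for large volumes.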

2 – Physical movement of disks supported by external tool
In this approach, we will physically swap the disks between the old servers and the new server. This physical swapping of the disks will be done manually and will be supported by a tool which will identify and inform the user about the disks that need to be swapped. The detailed steps and functionality of this approach are explained in Part 2 of this blog…