Gluster Disperse Volume Troubleshooting – HEAL

This blog is part of an effort to understand how to debug issues when things go wrong in Gluster. Slow performance, inaccessible data, heal not happening, or heal taking a long time are some of the problems we could face while using Gluster volumes. In this blog, I will explain an important piece of GlusterFS functionality, healing, and try to explain how to deal with it.

A Disperse or Erasure Coded (EC) volume encodes the incoming data and stores it on distributed storage exports, referred to as bricks in Gluster terminology.
A simple equation to understand this is –
N = K + M
N = Total number of bricks required to store the encoded data.
K = Data bricks – the minimum number of bricks required to read/write data on the volume.
M = Redundancy bricks – the maximum number of bricks which can go down without hampering any file operations. In other words, we can tolerate the failure of up to any M bricks.
For this write-up, we will take a (4 + 2) EC volume configuration into consideration; K = 4 and M = 2.
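For illustration, such a (4 + 2) volume could be created as shown below. The host names and brick paths here are hypothetical; the only requirement is that the total brick count equals K + M = 6.
gluster v create ecvol disperse-data 4 redundancy 2 server{1..6}:/bricks/brick1/ecvol
gluster v start ecvol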

What is Healing in Disperse volume?
For humans, healing can be understood as getting back to a normal working state after falling sick or getting injured. Gluster volumes are just like that: they can also fall sick or end up in a bad state after getting injured (say, a server crash or a network disconnection). The healing process repairs the volume in such a way that data is always available and, if enough time is given to heal, the volume can sustain further failures or problems.

If a brick goes down or a write operation fails on a brick, it should be repaired as soon as possible after the failed brick comes back online, so that any further failure can be handled in the future. If a brick is up and running but not healthy, i.e. it is missing one or more data fragments of a file, EC should be able to reconstruct them and keep the system in a healthy state.

How can an EC volume turn to a bad state?
First of all, by calling a volume unhealthy, we mean that at least one file on that volume is unhealthy. A file/directory may end up in an unhealthy state for the following reasons –
1 – A brick was down during an update/write file operation on a file.
2 – A brick on which healthy data was present got corrupted.
3 – A client cannot reach a brick because of network connectivity issues.
4 – A replace-brick operation is in progress to decommission a server.
5 – Any other case where a brick has stale data.
There could be other reasons as well, but the above are the most common ones.
To check if an EC volume is healthy or not, we can use the following command –
gluster v heal <volume name> info
This will list all the entries, from all the bricks, which may require heal.
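The output lists, per brick, the entries pending heal on it; it looks roughly like the following (brick host, path and entry names are illustrative only):
Brick <node1>:<brick-path>
/dir1/file1
Status: Connected
Number of entries: 1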

There are a few extended attributes which decide whether the fragments of a file are healthy on all the bricks or not.
1 – trusted.ec.version – It has two parts, one for metadata fops and the other for data fops.
2 – trusted.ec.size (only for files).
The values of the above two attributes should be the same on all the bricks of the volume. If not, it signifies that the odd-one-out fragment[s] need heal.
3 – trusted.ec.config – Currently it has a fixed value.
4 – trusted.ec.dirty – It also has two parts, one for metadata fops and the other for data fops. If either part of this attribute has a non-zero value, it signifies that something is wrong with some fragments of the file/directory and they should be healed.

No. of bytes           |------ 8 bytes -------|------ 8 bytes -------|
trusted.ec.version   0x|   0000000000000001   |   0000000000000001   |
trusted.ec.dirty     0x|   0000000000000001   |   0000000000000001   |
Which fop?             |-------- Data --------|------ Metadata ------|

In addition to the extended attributes, we also check other file attributes, like the size of the fragment on disk. If a file/directory is healthy, all its fragments, on all the bricks, should have the same values for the respective extended attributes. If any fragment has attribute values which do not match those of K other fragments, that file/dir will be marked as dirty. In other words, that entry will be marked as ‘Heal Required’.
We can use the following command to get the extended attributes of any file of the volume.
getfattr -m. -d -e hex <brick path on a node>/file
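On a healthy 4 + 2 volume the EC related portion of the output looks roughly like this; all the values below are made up for illustration, and what matters is that they match across the bricks:
getfattr -m. -d -e hex /bricks/brick1/ecvol/dir1/file1
# file: bricks/brick1/ecvol/dir1/file1
trusted.ec.config=0x0000080602000200
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x0000000000100000
trusted.ec.version=0x00000000000000010000000000000001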

How and When EC marks an entry as ‘Heal Required’
1 – When a client (mount point) sends a file operation (read or write) on a file or directory, EC takes a lock on that entry and fetches the extended attributes and stats to decide whether all the fragments placed on the different bricks of the volume are healthy or not. If a fragment is out of sync, i.e. the above extended attributes do not match across all the bricks, it marks that file as unhealthy by setting the “trusted.ec.dirty” extended attribute on all the other fragments. At the same time it serves the request from the healthy bricks.
2 – If an update fop fails on some of the bricks, but succeeds on at least the minimum number (K) of bricks, EC marks that file as dirty by setting trusted.ec.dirty = 1 (for the respective fop type).
3 – replace-brick – During replace-brick, EC marks the root of the new brick as dirty on all the other bricks and triggers the heal. It also marks the dirty xattrs on the entries it finds while crawling the volume. This is just to make sure that even if this heal attempt fails (in case the heal process gets killed on the way), these entries will be picked up again in the future when the heal daemon comes back online.
4 – Marking a file as dirty, or setting the trusted.ec.dirty xattr, creates an entry in .glusterfs/indices/xattrop/ with the gfid of the entry as its name, as shown below.
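To see these pending entries on a brick we can simply list that directory; for regular files the gfid can also be mapped back to a path through the .glusterfs hard link (brick path and gfid are placeholders here, and directory gfids are symlinks rather than hard links under .glusterfs):
ls <brick-path>/.glusterfs/indices/xattrop/
find <brick-path> -samefile <brick-path>/.glusterfs/<first two chars of gfid>/<next two chars>/<gfid>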

How to start the Heal Process?
There are two ways to start a heal process for a file/directory – Server side heal and Client side heal.
1 – Server side heal – is triggered by the self heal daemon (SHD). The SHD, running on each node involved in the volume setup, scans the <brick-path>/.glusterfs/indices/xattrop/ directory every 10 minutes and checks if any entry is present which needs to be healed. If an entry is present, it triggers heal for that entry. We can also trigger server side heal manually and have two options to do so – Index heal or Full heal. These options just decide where and how to find the entries which need to be healed.
Index Heal – In this option, the SHD scans <brick-path>/.glusterfs/indices/xattrop/ to check if there are entries which need to be healed. If so, it picks up an entry and triggers the heal for that entry. An important point to note is that the SHD just does a readdir on this location and triggers heal for whatever entry it finds; it does not enforce any order while picking entries from this location. So it might happen that we pick a file whose parent directory also needs heal but will only be picked up later. In this case, the SHD will not be able to heal this file and will skip it in this round.
Once the entry for the directory is picked and that directory is healed, the file will also be healed in the next round. Index heal can be triggered using the following command –
gluster v heal <volume name>
Full Heal – Once a full heal is triggered, the SHD starts from the root of the brick and checks each and every entry present in the tree structure. In this case, it examines every entry, even those which may not need heal, and triggers heal, which internally checks whether all the extended attributes for that entry are the same or not, i.e. whether the entry is healthy or not. If heal is needed, that file/dir will be marked as dirty, which creates an entry in <brick-path>/.glusterfs/indices/xattrop/, and heal will be triggered for that entry.
gluster v heal <volume name> full

2 – Client side heal – is started by a file operation if it finds that some fragments of a file/entry are bad. The entry will also be marked as dirty so that a record remains that this file needs heal even if this heal attempt fails. Client side heal of a specific file/dir can also be triggered by executing the following command on the mount point –
getfattr -n trusted.ec.heal <volume mount point>/<file path>

Heal internals
Once the heal for an entry starts, whether from the client side or the server side, the heal process goes through different phases in the following order –
1 – Name heal
2 – Metadata heal
3 – Data or Entry heal
It might happen that healing is not actually required for an entry. In such a scenario, the attributes are compared and no actual heal of data happens for that entry.
Based on the values of the extended attributes and the on-disk attributes, the heal process finds out the sink brick[s] and the source bricks. In the case of metadata heal, it simply copies whatever values it finds on the source bricks onto the sink bricks.
For data heal, the SHD reads the data from K (here 4) source bricks, decodes it, encodes it again to create N fragments and sends the writes of the respective fragments to the sink bricks. It only writes fragments to the bricks which need to be healed, not to the ones which are already healthy.
At the end of this cycle, it updates the size and version xattrs for that file on the sink brick[s].

Common issues and Debugging Steps related to Heal for EC volume
Heal is not happening
1 – First of all, check whether self heal (server side heal) and background heal (client side heal) are enabled for the volume. Use the following –
gluster v get <volname> cluster.disperse-self-heal-daemon
gluster v get <volname> disperse.background-heals – should be non-zero
2 – We can also use ‘gluster v status’ to check that an entry for the self heal daemon is listed and that the SHD has a process ID.
3 – To heal an entry, we should have at least K source bricks up and running, and at the same time the brick which needs to be healed should also be up and running. So, check whether the bricks are in a running state and reachable.
gluster v status <volname> – all the bricks should be up and running with a PID
4 – Manually check the extended attributes of the files on all the bricks using the following command.
getfattr -m. -d -e hex <brick-path>/<path of the file which we want to examine>
At least the minimum number (K) of bricks should have matching xattrs. These bricks will be used as source bricks.
5 – Check the glustershd.log file to see if there are any logs which indicate that heal failed.
6 – Check the mount (client) log file on the client machine to see if a client side heal started but failed. Example commands for both are shown after this list.
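Assuming default log locations, something like the following can help spot heal failures; the client side log file name is derived from the mount path, so adjust it for your setup:
grep -iE "heal.*fail" /var/log/glusterfs/glustershd.log – failures reported by the self heal daemon
grep -iE "heal.*fail" /var/log/glusterfs/<mount-point-log>.log – failures reported by the client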

Heal info showing entries to be healed while IO is going on
1 – Check if all the bricks are up and running.
2 – While an update fop is going on for a file/dir, based on the option settings, we mark the trusted.ec.dirty xattr of that file/dir as 1 even though it does not actually require heal. If we run ‘gluster v heal info’ at this point and such entries are being reported, that could be a bug.
3 – Get the extended attributes of one of the files and see whether these attributes match on all the bricks. If they do, this could be a bug in the ‘gluster v heal <volname> info’ utility.

Heal info showing entries which do not require heal
1 – If all the xattrs of an entry are the same on all the bricks and heal info is still showing the entry while no IO is going on, that suggests that somehow the entries in .glusterfs/indices/xattrop/ are not getting purged.
2 – We should check the glustershd.log file for logs related to heal failures.

Heal daemon is taking a lot of CPU cycles
Consider a 4 + 2 configuration with all the bricks on different nodes.
When a brick or node remains down for some time and lots of files have been created and written to during this period, we see a lot of files which need to be healed in “gluster v heal info“.
When this brick/node comes up, all the heal daemons on the different nodes start scanning their local .glusterfs/indices/xattrop/ folders and start healing those entries. This triggers heal on thousands of files across all the nodes. Every heal requires a read-decode-encode-write computation which consumes a lot of CPU cycles. That may cause a jump in CPU usage, sometimes even to 500% or 600% or so. This also impacts the response time of the node and other activities. This issue can be handled by an existing script which provides an option to cap the maximum CPU usage of the self heal daemon.
https://github.com/gluster/glusterfs/blob/master/extras/control-cpu-load.sh

What can impact heal speed for disperse volume
Healing works fine, but when we have lots of files to be healed, it takes its own time to bring the volume back to a healthy state. Xavi Hernandez, who initially designed, coded and implemented the disperse volume in Gluster, replied to a user on the gluster-users mailing list when this question was asked. I am copying and pasting his reply, which explains what can impact heal performance for an EC volume.

The following are the 3 factors which should be considered while measuring/debugging heal speed in an EC volume:
1. Erasure code algorithm
2. Small I/O blocks
3. Latencies
Some testing on 1) shows that a single core of an Intel Xeon E5-2630L 2 GHz, can encode and decode at ~400 MB/s using a 16+4 configuration (using latest version without any CPU extensions. Using SSE, it’s ~3 times faster). When self-heal is involved, to heal a single brick we need to read data from healthy bricks and decode it. If we suppose that we are healing a file of 2GB, since we can decode at ~400 MB/s, it will take ~5 seconds to recover the original data. Now we reencode it to generate the missing fragment. Since we only need to generate a single fragment, the encoding speed is much faster. Normally it would take 5 seconds to encode the whole file, but in this case it takes ~1/20 of the time, so ~250ms. So we need 5.25 seconds to heal the file. The total written data to the damaged brick will be 1/16 of the total size, so 128 MB. This gives an effective speed of ~24 MB/s on the brick being healed.

By default the healing process uses blocks of 128KB. On a 16+4 configuration this means that 16 fragments of 8KB are read from 16 healthy bricks and a single fragment of 8KB is written to the damaged brick for each block of 128KB of the original file. So we need to write 128 MB in blocks of 8 KB. This means that ~16000 I/O operations are needed on each brick to heal the file. Due to the serialization of reads and writes, it’s less probable that the target brick can merge write requests to improve performance, though some are indeed merged. Giving exact numbers here depends on a lot of factors. Just as a reference number, if we suppose that we are able to write at 24 MB/s and no merges are done, it would mean ~3000 IOPS. This is a good number of IOPS.

The heal process involves multiple operations that need to be serialized. This means that multiple network latencies are added to heal a single block of the damaged file. Basically we read a block from healthy bricks and then it’s written to damaged bricks. An 8 KB packet has a latency of ~0.15 ms on a 10 Gbps ethernet. Reads from disk can be in the order of some ms, though disk read-ahead can reduce the amortized read time to less than a ms (say ~0.1 ms). Writes can take longer, but let’s suppose that they are cached and immediately acknowledged. In this case we need 0.15 ms * 2 (read and write network latency) + 0.1 ms (disk read latency amortized) = ~0.4 ms per block. 16000 I/O requests * 0.4 ms = 6.5 seconds. If we add the time for decoding/encoding to the time of latency, we get ~11.5 seconds. 128 MB / 11.5 s = ~11 MB/s.
Note that the numbers are not taken from real tests. They are only reasonable theoretical numbers. Also note that even if the write speed on the damaged brick is only 11 MB/s, the amount of data processed of the original file is 176 MB/s (writing 128 MB on the damaged brick means that the whole 2GB of the file have been processed in 11.5 seconds).
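To recap the arithmetic above in one place (same assumptions as in the text: a 16+4 volume, a 2 GB file, ~400 MB/s decode speed, 128 KB heal blocks and ~0.4 ms latency per block):
decode time          = 2 GB / 400 MB/s   = ~5 s
re-encode time       = ~5 s / 20         = ~0.25 s
I/O requests         = 128 MB / 8 KB     = ~16000
total latency        = 16000 * 0.4 ms    = ~6.5 s
total heal time      = ~5.25 s + ~6.5 s  = ~11.5 s
write speed on sink  = 128 MB / 11.5 s   = ~11 MB/s
effective heal speed = 2 GB / 11.5 s     = ~176 MB/s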

To improve these numbers we need to use more cores to encode/decode faster and reduce the number of I/O operations needed. This first one can be done with parallel self-heal. Using CPU extensions like SSE or AVX can improve performance by x3 or x6 also. The second one is being worked on to allow self-heal to be configured to use bigger blocks to heal, reducing the number of network round-trips, thus the total latency.
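Depending on the GlusterFS release in use, part of this tuning is already exposed as volume options. The options below are the ones I am aware of for disperse volumes; treat them as pointers and verify the exact names and value ranges with ‘gluster volume set help’ on your version:
gluster v set <volname> disperse.shd-max-threads 4 – number of entries the self heal daemon heals in parallel
gluster v set <volname> disperse.self-heal-window-size 8 – number of 128KB blocks self heal works on at a time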

I hope this blog gives a good insight into the heal process and how to debug it. I will try to keep adding information as and when I find something new. Let me know your thoughts on this and feel free to ask any questions regarding this topic.
