
Rebuild data and restart services after a disk failure


After a disk failure, all the services hosted on the disk are automatically stopped or marked as broken.

If the disk hosted data, metadata, account, or directory services, their scores are automatically lowered to zero.
After 5 days, these services are removed from the conscience listing.

In this topic, we assume that the name of your namespace is OPENIO.

How to restore a disk with rawx (data) services

Replace the failed disk, then format it and mount it at the same location.
Manually force the rawx score to zero to prevent writes when the service starts.
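Before going further, it can help to confirm that the new disk is really mounted where the rawx expects it. Below is a minimal sketch, assuming a Linux node where the kernel mount table is readable at `/proc/mounts`; the function name `is_mounted` is hypothetical and not part of OpenIO.

```shell
# Hypothetical helper (not an OpenIO tool): succeed if DIR is a mount point
# listed in the mount table. The second argument defaults to /proc/mounts.
is_mounted() {
  local dir="$1"
  local mounts="${2:-/proc/mounts}"
  # In /proc/mounts, field 2 of each line is the mount point.
  awk -v d="$dir" '$2 == d { found = 1 } END { exit !found }' "$mounts"
}

# Usage, with the rawx volume path used in this topic:
# is_mounted /var/lib/oio/sds/OPENIO/rawx-1 || echo "rawx-1 volume is not mounted" >&2
```

Reading the mount table directly avoids depending on extra tools; the optional second argument makes it easy to try the function against a saved copy of the table.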

To retrieve information about this rawx, run the following on the node:

```
# gridinit_cmd status2 @rawx | grep "BROKEN"
OPENIO-rawx-1 BROKEN -1 6 5 ---------- -------- OPENIO,rawx,rawx-1 /usr/sbin/httpd -D FOREGROUND -f /etc/oio/sds/OPENIO/rawx-1/rawx-1-httpd.conf
```
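If several rawx services share the node, a small filter can keep only the broken ones. This sketch assumes the column layout shown above (service name first, state second); `broken_services` is a hypothetical helper name.

```shell
# Hypothetical helper: read `gridinit_cmd status2` output on stdin and print
# the name (column 1) of every service whose state (column 2) is BROKEN.
broken_services() {
  awk '$2 == "BROKEN" { print $1 }'
}

# Usage on the node:
# gridinit_cmd status2 @rawx | broken_services
```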

To force the rawx score to zero, we need its IP:port, which is set in the `VirtualHost` directive of its httpd configuration:

```
# grep -i "<virtualhost" /etc/oio/sds/OPENIO/rawx-1/rawx-1-httpd.conf
<VirtualHost>
```
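The address can also be extracted with a short helper. This is a sketch assuming the configuration contains a standard Apache `<VirtualHost ADDRESS:PORT>` line; `rawx_address` is a hypothetical name, and the address in the usage comment (`192.0.2.10:6200`) is a made-up documentation value.

```shell
# Hypothetical helper: print the ADDRESS:PORT of the first <VirtualHost ...>
# line in the given httpd configuration file.
rawx_address() {
  local conf="$1"
  # Case-insensitive match on the directive, like `grep -i` above; then strip
  # the closing '>' (and anything after it) from the address field.
  awk 'tolower($1) == "<virtualhost" { sub(/>.*/, "", $2); print $2; exit }' "$conf"
}

# Usage on the node:
# rawx_address /etc/oio/sds/OPENIO/rawx-1/rawx-1-httpd.conf
# prints something like 192.0.2.10:6200 (made-up address)
```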

Then we can force the score:

```
# openio cluster lock -s 0 rawx --oio-ns OPENIO
+------+----------------+-------------+
| Type | Service        | Result      |
+------+----------------+-------------+
| rawx |                | locked to 0 |
+------+----------------+-------------+
```

Recreate the service

Reapply the Puppet manifest used for the deployment. It recreates the directories for the rdir and rawx services and re-enables the services:

```
# puppet apply --no-stringify_facts openio.pp
```

Repair the rawx

Check that gridinit now reports the rawx as UP:

```
# gridinit_cmd status @rawx
OPENIO-rawx-1             UP        22589 OPENIO,rawx,rawx-1
```
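To script this check, the following sketch succeeds only when the given service is reported as `UP`. It assumes the `gridinit_cmd status` layout shown above; `service_is_up` is a hypothetical helper.

```shell
# Hypothetical helper: read `gridinit_cmd status` output on stdin and succeed
# only if service NAME is listed with state UP.
service_is_up() {
  local name="$1"
  awk -v n="$name" '$1 == n && $2 == "UP" { up = 1 } END { exit !up }'
}

# Usage on the node:
# gridinit_cmd status @rawx | service_is_up OPENIO-rawx-1 && echo "OPENIO-rawx-1 is UP"
```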

Check that the rawx service is up but scored to zero (Up is True and Score is 0):

```
# openio cluster list --oio-ns OPENIO rawx
| Type | Id             | Volume                         | Location   | Slots | Up   | Score |
| rawx |                | /var/lib/oio/sds/OPENIO/rawx-1 | openio01.1 | n/a   | True |     0 |
```
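The same check can be scripted before starting the rebuild. This is a sketch assuming the table layout shown above (columns Type, Id, Volume, Location, Slots, Up, Score, separated by `|`); `rawx_ready_for_rebuild` is a hypothetical name.

```shell
# Hypothetical helper: read `openio cluster list ... rawx` output on stdin and
# succeed only if the rawx serving VOLUME has Up = True and Score = 0.
rawx_ready_for_rebuild() {
  local volume="$1"
  awk -F'|' -v vol="$volume" '
    # Remove the padding spaces around each table cell.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    trim($2) == "rawx" && trim($4) == vol {
      if (trim($7) == "True" && trim($8) == "0") ok = 1
    }
    END { exit !ok }'
}

# Usage:
# openio cluster list --oio-ns OPENIO rawx | rawx_ready_for_rebuild /var/lib/oio/sds/OPENIO/rawx-1
```

Matching on the volume path rather than on the Id keeps the check usable even when you have not noted the rawx address.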

Launch the data rebuilder

Set the incident date on the volume:
```
# openio volume admin incident --oio-ns OPENIO
| Volume         |       Date |
|                | 1508836327 |
```
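The `Date` value appears to be a Unix timestamp (seconds since the epoch). A sketch for reading it as a UTC date, assuming GNU `date` (its `-d @SECONDS` syntax is not available in BSD `date`); `incident_to_iso` is a hypothetical helper:

```shell
# Hypothetical helper: print a Unix timestamp as an ISO 8601 UTC date.
# Requires GNU date (the `-d @SECONDS` form).
incident_to_iso() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# incident_to_iso 1508836327   # prints 2017-10-24T09:12:07Z
```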

Check that it is set properly:
```
# openio volume admin show --oio-ns OPENIO
| Field               | Value          |
| volume              |                |
| admin|incident_date | 1508836327     |
```

Then launch the rebuild. The `--allow-same-rawx` option lets each chunk be rebuilt at its original position, on the same rawx:

```
# oio-blob-rebuilder OPENIO --volume --chunks-per-second 2 --report-interval 5 --allow-same-rawx
```
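Since `--chunks-per-second` throttles the rebuilder, the run time is roughly the number of chunks divided by that rate. The sketch below does that arithmetic, rounding up; it assumes you already have an estimate of the chunk count, and `estimate_rebuild_seconds` is a hypothetical name. The sample run shown later in this topic (41 chunks at 2 chunks per second, about 20 seconds elapsed) is consistent with it.

```shell
# Hypothetical helper: rough rebuild duration, in seconds, for CHUNKS chunks
# throttled at RATE chunks per second (integer division rounded up).
estimate_rebuild_seconds() {
  local chunks="$1" rate="$2"
  echo $(( (chunks + rate - 1) / rate ))
}

# estimate_rebuild_seconds 41 2   # prints 21
```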

When the rebuild completes, all the data has been reconstructed in the volume.

The rebuilder writes its log to `/var/log/oio/sds/NAMESPACE/blob-rebuilder-IPRAWX`, here `/var/log/oio/sds/OPENIO/blob-rebuilder-`.
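When following a long rebuild, the log path can be built from the pattern above. This sketch assumes that `IPRAWX` is the rawx address in the same form passed to `--volume`; that is an assumption, as are the hypothetical helper name `rebuilder_log_path` and the made-up sample address.

```shell
# Hypothetical helper: build the rebuilder log path from the namespace and the
# rawx address, following /var/log/oio/sds/NAMESPACE/blob-rebuilder-IPRAWX.
# Assumption: IPRAWX is the address exactly as given to --volume.
rebuilder_log_path() {
  local ns="$1" rawx="$2"
  printf '/var/log/oio/sds/%s/blob-rebuilder-%s\n' "$ns" "$rawx"
}

# Follow a running rebuild (made-up address):
# tail -f "$(rebuilder_log_path OPENIO 192.0.2.10:6200)"
```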

The output looks like this:
```
[root@relex01 ~]# oio-blob-rebuilder OPENIO --volume --chunks-per-second 2 --report-interval 5 --allow-same-rawx
1811 26B72D0 log INFO Rebuilding (container 06D488BF46F311A2E1C9E348A39F3296B72DD1C2F8C5C28BB640BD8D0166E1B7, content CCC3470C4B5C0500C2A059C7F70BB58B, chunk 07F5110F92004A98094E56E0C735DEFC51D162E02A976B93D5CB2FDCD4A8620F)
1811 26B72D0 log INFO RUN started=2017-10-24T13:40:22 passes=1 errors=0 chunks=1 42.81/s bytes=111 4751.90B/s elapsed=0.02 (rebuilder: 100.00%)
1811 26B72D0 log INFO Rebuilding (container 12CDA24A919AFF3685DA7953BD892F51EACEBED16F52EBA69A0B19128E0917BC, content D3E0BC0B4B5C0500D90688AF59030D36, chunk 8F2FECCCD103CA7499BA6D4CD57CB3F04340DB07F8A55E9FD6634B6C844F3040)
1811 26B72D0 log INFO Rebuilding (container 1338678BB1E1DE70EE3379D9ACA65745355A2328906C0C72E433EA7257B25777, content D4A4A10B4B5C0500D011AC20F70BE58B, chunk D4D44F281686519DD5CA1BF6CA4B13E31F14696D817A3239B7A19D04399008AB)
1811 26B72D0 log INFO Rebuilding (container F4959647838E8D8D6CE1E8B8552782F0C7D25A71512E02D6F6004F66A4F84DDC, content D13BE60B4B5C0500687586C1340353D2, chunk E05B75F44F9A359667B888AAAA0E7DF5841228525C2D28781F27D9900E48A772)
1811 26B72D0 log INFO RUN started=2017-10-24T13:40:37 passes=10 errors=0 chunks=41 2.00/s bytes=4551 222.00B/s elapsed=20.02 (rebuilder: 100.00%)
1811 26B72D0 log INFO DONE started=2017-10-24T13:40:22 ended=2017-10-24T13:40:42 passes=0 elapsed=20.03 errors=0 chunks=41 2.05/s bytes=4551 227.25B/s elapsed=20.02 (rebuilder: 100.00%)
```

At the end of the file, you can find a short report of the rebuild (the `DONE` line).
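To check the outcome without reading the whole log, the counters of the last `DONE` line can be extracted. A sketch assuming the `key=value` report format shown above; `rebuild_report` is a hypothetical helper.

```shell
# Hypothetical helper: print the errors= and chunks= counters from the last
# DONE line of a rebuilder log file.
rebuild_report() {
  grep ' DONE ' "$1" | tail -n 1 | awk '{
    out = ""
    for (i = 1; i <= NF; i++)
      if ($i ~ /^(errors|chunks)=/) out = out (out == "" ? "" : " ") $i
    print out
  }'
}

# For the run shown above, this prints: errors=0 chunks=41
```

A non-zero `errors=` value is the signal to look back through the `Rebuilding` lines for the chunks that failed.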