Join the Community
and take part in the story

`openio object list` fails with an `Election failed` message


#1

Hi, folks!

It happened recently during an integration test. While trying to list the content of the container, the subsequent error occured.

# openio --oio-ns OPENIO --oio-account ACCT object list JFS
Election failed [9006CE70B59E5777D6BB410C57944812EB05FCDB5BA85D520A14B3051D1D094F.1][meta2] (HTTP 503) (STATUS 503)

Repeating the command shown an interesting fact.

# openio --oio-ns OPENIO --oio-account ACCT object list JFS
+-------+------+----------------------------------+------------------+
| Name  | Size | Hash                             |          Version |
+-------+------+----------------------------------+------------------+
| magic |  111 | 272913026300E7AE9B5E2D51F138E674 | 1507062080069088 |
+-------+------+----------------------------------+------------------+

# openio --oio-ns OPENIO --oio-account ACCT object list JFS
+-------+------+----------------------------------+------------------+
| Name  | Size | Hash                             |          Version |
+-------+------+----------------------------------+------------------+
| magic |  111 | 272913026300E7AE9B5E2D51F138E674 | 1507062080069088 |
+-------+------+----------------------------------+------------------+

# openio --oio-ns OPENIO --oio-account ACCT object list JFS
Election failed [9006CE70B59E5777D6BB410C57944812EB05FCDB5BA85D520A14B3051D1D094F.1][meta2] (HTTP 503) (STATUS 503)

Election errors are known to occur when the platform is overloaded.
A traditional system looks like this during a listing of ojects, once the meta2 pers are known:

[client] -> [oio-proxy] -> [meta2] <--------------v
                            | +-> [meta2] <---> [Zookeeper]
                            +-> [meta2] <---------A

A timeout may occurs during any network connection, either between the client and the proxy, between the proxy and the meta2 services, among the meta2 or between the meta2 and zookeeper.

Unfortunately, the platform was slightly loaded, nearly idle.

An interesting troubleshooting information came from a proxy command (soon to be exposed as an openio election sub-command).
If we consider the proxy isbound to the network address 127.0.0.1:6000, we issued the following command:

NS=OPENIO
ACCOUNT=ACCT
USER=JFS
curl -X POST "http://127.0.0.1:6000/v3.0/${NS}/admin/debug?acct=${ACCOUNT}&ref=${USER}&type=meta2"

The output, once pretty printed, was the following:

{
    "127.0.0.1:6001": {
        "body": {
            "#": {
                "P": 0,
                "R": 77,
                "V": 0
            },
            "base": {
                "name": "9006CE70B59E5777D6BB410C57944812EB05FCDB5BA85D520A14B3051D1D094F.1",
                "type": "meta2",
                "zk": "7BEBC0303D9426162F852314A82FAA959178A7CAE7C1D3FEBD44F5727CC6DC30"
            },
            "local": {
                "id": 4680,
                "state": "MASTER",
                "url": "127.0.0.1:6001"
            },
            "log": [
                "CHECKING_SLAVES:GETVERS_OK:MASTER",
                "CHECKING_SLAVES:GETVERS_OK:CHECKING_SLAVES",
                "LISTING:LIST_OK:CHECKING_SLAVES",
                "WATCHING:EXISTS_OK:LISTING",
                "CREATING:CREATE_OK:WATCHING",
                "NONE:NONE:CREATING"
            ],
            "master": {
                "id": 4680,
                "url": null
            },
            "peers": [
                "127.0.0.1:6002",
                "127.0.0.1:6001"
            ]
        },
        "status": {
            "message": "OK",
            "status": 200
        }
    },
    "127.0.0.1:6002": {
        "body": {
            "#": {
                "P": 0,
                "R": 6,
                "V": 0
            },
            "base": {
                "name": "9006CE70B59E5777D6BB410C57944812EB05FCDB5BA85D520A14B3051D1D094F.1",
                "type": "meta2",
                "zk": "7BEBC0303D9426162F852314A82FAA959178A7CAE7C1D3FEBD44F5727CC6DC30"
            },
            "local": {
                "id": 4682,
                "state": "SLAVE",
                "url": "127.0.0.1:6002"
            },
            "log": [
                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",
                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",
                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",
                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",
                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",
                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",
                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",
                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",
                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",
                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",
                "ASKING:MASTER_OK:LISTING"
            ],
            "master": {
                "id": 4680,
                "url": "127.0.0.1:6002"
            },
            "peers": [
                "127.0.0.1:6001",
                "127.0.0.1:6003"
            ]
        },
        "status": {
            "message": "OK",
            "status": 200
        }
    },
    "127.0.0.1:6003": {
        "body": {
            "#": {
                "P": 0,
                "R": 6,
                "V": 0
            },
            "base": {
                "name": "9006CE70B59E5777D6BB410C57944812EB05FCDB5BA85D520A14B3051D1D094F.1",
                "type": "meta2",
                "zk": "7BEBC0303D9426162F852314A82FAA959178A7CAE7C1D3FEBD44F5727CC6DC30"
            },
            "local": {
                "id": 4681,
                "state": "SLAVE",
                "url": "127.0.0.1:6003"
            },
            "log": [
                "CHECKING_MASTER:GETVERS_OK:SLAVE",
                "ASKING:MASTER_OK:CHECKING_MASTER",
                "LISTING:LIST_OK:ASKING",
                "WATCHING:EXISTS_OK:LISTING",
                "CREATING:CREATE_OK:WATCHING",
                "NONE:NONE:CREATING",
                "LEAVING:LEAVE_OK:NONE",
                "LEAVING:LEFT_SELF:LEAVING",
                "MASTER:LEAVE_REQ:LEAVING",
                "CHECKING_SLAVES:GETVERS_OK:MASTER",
                "CHECKING_SLAVES:GETVERS_KO:CHECKING_SLAVES",
                "LISTING:LIST_OK:CHECKING_SLAVES",
                "SLAVE:LEFT_MASTER:LISTING",
                "CHECKING_MASTER:GETVERS_OK:SLAVE",
                "ASKING:MASTER_OK:CHECKING_MASTER",
                "LISTING:LIST_OK:ASKING"
            ],
            "master": {
                "id": 4680,
                "url": "127.0.0.1:6001"
            },
            "peers": [
                "127.0.0.1:6001",
                "127.0.0.1:6002"
            ]
        },
        "status": {
            "message": "OK",
            "status": 200
        }
    }
}

We got it! It is important here to mention the sequence of transitions

                "ASKING:MASTER_OK:LISTING",
                "LISTING:LIST_OK:ASKING",

The FSM state machine has a leak, or a loop, call as you wish, this was a bug. The FSM changed between the versions 3.x and the 4.x of openio-sds. Versions prior to 4.0.0 are not affected, and versions starting at 4.1.4 received the fix (https://github.com/open-io/oio-sds/pull/1203).


#2

Though, a workaround exists, and it is very useful to unlock an election stuck between ASKING and LISTING!

You can ask the election to explicitly exit, this will send a signal that will preempt the loop. The nodes running the election will join again at the next request.

$ NS=OPENIO
$ ACCOUNT=ACCT
$ USER=JFS
$ curl -X POST 'http://127.0.0.1:6000/v3.0/${NS}/admin/leave?acct=${ACCOUNT}&ref=${USER}&type=meta2'
{"127.0.0.1:6015":{"status":{"status":200,"message":"OK"},"body":""},"127.0.0.1:6017":{"status":{"status":200,"message":"OK"},"body":""},"127.0.0.1:6018":{"status":{"status":200,"message":"OK"},"body":""}}