Ernestas Poškus

Technical blog

"We must view with profound respect the infinite capacity of the human mind to resist the introduction of useful knowledge." - Thomas R. Lounsbury

| github | goodreads | linkedin | twitter |

ansible 2 / elasticsearch 2 / kernel 2 / leadership 1 / linux 2 / mnemonics 1 / nginx 1 / paper 40 / personal 5 / rust 1 / tools 2 /

Swim Scalable Weakly Consistent Infection Style Process Group Membership Protocol

WC 228 / RT 2min


Distributed peer-to-peer applications require weakly-consistent knowledge of process group membership information at all participating processes.

SWIM separates the failure detection and membership update dissemination functionalities of the membership protocol.

Swim focus on a weaker variant of group membership, where membership lists at different members need not be consistent across the group at the same (causal) point in time.

The design of a distributed membership algorithm has traditionally been approached through the technique of heart-beating.

Popular class of all-to-all heart-beating protocols arises from the implicit decision therein to fuse the two principal functions of the membership problem specification:

  1. Membership update Dissemination: propagating membership updates arising from processes joining, leaving or failing
  2. Failure detection: detecting failures of existing members.

SWIM, provides a membership substrate that:

(1) imposes a constant message load per group member; (2) detects a process failure in an (expected) constant time at some non-faulty process in the group; (3) provides a deterministic bound (as a function of group size) on the local time that a non-faulty process takes to detect failure of another process; (4) propagates membership updates, including information about failures, in infection-style (also gossip-style or epidemic-style); the dissemination latency in the group grows slowly (logarithmically) with the number of members; (5) provides a mechanism to reduce the rate of false positives by “suspecting” a process before “declaring” it as failed within the group.