Fri. Sep 18th, 2020

Why are Distributed Systems So Hard?

3 min read
Why are Distributed Systems So Hard?

Isn't it true you can just write deterministic code and if you do it right and work to fix all the bugs, eventually you'll have a simple that never does the wrong thing? If that's not the case then why not? Computers are deterministic – they're predictable and they only ever do exactly what you ask them to do and nothing more…right?

It's complicated.

On a single node (computer) most failures mean the entire node completely stops working – for example the power supply fails, the disk dies, or the motherboard gets fried. These types of failures are easy to detect because the node won't be in a 'partially failed' state where sometimes it performs the functions it's asked and sometimes it doesn't.

In a network, you have multiple nodes – possibly hundreds or thousands. When one node fails, it is impossible to be 100% sure that node is completely down. And if you can't be sure that a node won't come back to life, then you can't simply give its responsibilities away to another node. In addition, if one node fails, for many important algorithms, you have to make sure all the other nodes are aware of the fact.

But why is it so hard to be sure? Computers do a lot of weird things. For example, programming languages use garbage collection to clear out items from memory that are no longer used. This can sometimes be a long process, especially when there are lots of large items that need to be garbage collected. But it's non-deterministic how long a garbage collection run will take. When garbage collections happen, the process running on that node goes through a pause – a period of time when no other work gets done. During this time the node can look as if it's down and out. Since the node won't respond to requests during this time, an outside observer might conclude after some time that it's dead. But after the garbage collection finishes, the process will come back to life and start responding to requests and doing useful work.

In addition, each node has its own system clock. That clock can have bugs, or become slow or fast. So when you're measuring how long it's been since a request to another node was sent, the measurement you get has the potential to be wrong.

Networks themselves can be finnicky too. Nodes that are only accessible through a single edge can easily be cutoff if that edge goes down. When that happens it's called a network partition. The link will eventually be restored, but during the time it's down, it will be impossible to tell from one side the state of the nodes on the other side.

So you have to make guesses about the state of each node with varying levels of confidence. You use timeouts, retries, locks and more complex consensus algorithms to manage the uncertainty about the state of each and every node in the network at any given point in time.

In short, the reason distributed systems are hard is because of non-determinism caused by process pauses, requests with no response and out-of-sync system clocks.

For more and to get familiar with the fundamentals of DS check out the blog epochsystems

submitted by /u/masudio

Leave a Reply

Your email address will not be published. Required fields are marked *