I've been thinking about this a lot - nearly every problem these days is a synchronisation problem. You're regularly downloading something from an API? Thats a sync. You've got a distributed database? Sync problem. Cache Invalidation? Basically a sync problem. You want online and offline functionality? sync problem. Collaborative editing? sync problem.
And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download whats marked as changed' and then get stuck in the quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates etc).
The one piece of discussion or attempt at a systematic approach I've seen to 'synchronisation' recently is to do with Conflict-free Replicated Data Types https://crdt.tech which is essentially restricting your data and the rules for dealing with conflicts to situations that are known to be resolvable and then packaging it all up into an object.
> The one piece of discussion or attempt at a systematic approach I've seen to 'synchronisation' recently is to do with Conflict-free Replicated Data Types https://crdt.tech
I will go against the grain and say CRDTs have been a distraction and the overfocus on them have been delaying real progress. They are immature and highly complex and thus hard to debug and understand, and have extremely limited cross-language support in practice - let alone any indexing or storage engine support.
Yes, they are fascinating and yes they solve real problems but they are absolute overkill to your problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership, (single writer) or last write wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases it’s necessary like Google Docs or Figma I would be very surprised if they use off-the-shelf CRDT libs – I would bet they have an extremely bespoke and domain-specific data structures that are inspired by CRDTs.)
Instead, what I believe we need is end-to-end bidirectional stream based data communication, simple patch/replace data structures to efficiently notify of updates, and standard algorithms and protocols for processing it all. Basically adding async reactivity on the read path of existing data engines like SQL databases. I believe even this is a massive undertaking, but feasible, and delivers lasting tangible value.
I agree! Lots more things are sync. Also: the state of my source files -> my compiler (in watch mode), about 20 different APIs in the kernel - from keyboard state to filesystem watching to process monitoring to connected USB devices.
Also, http caching is sort of a special case of sync - where the cache (say, nginx) is trying to keep a synchronised copy of a resource from the backend web server. But because there’s no way for the web server to notify nginx that the resource has changed, you get both stale reads and unnecessary polling. Doing fan-out would be way more efficient than a keep alive header if we had a way to do it!
CRDTs are cool tech. (I would know - I’ve been playing with them for years). But I think it’s worth dividing data interfaces into two types: owned data and shared data. Owned data has a single owner (eg the database, the kernel, the web server) and other devices live down stream of that owner. Shared data sources have more complex systems - eg everyone in the network has a copy of the data and can make changes, then it’s all eventually consistent. Or raft / paxos. Think git, or a distributed database. And they can be combined - eg, the app server is downstream of a distributed database. GitHub actions is downstream of a git repo.
I’ve been meaning to write a blog post about this for years. Once you realise how ubiquitous this problem is, you see it absolutely everywhere.
> And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download whats marked as changed' and then get stuck in the quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates etc).
I've spent 16 years working on a sync engine and have worked with hundreds of enterprises on sync use cases during this time. I've seen countless cases of developers underestimating the complexity of sync. In most cases it happens exactly as you said: start with a naive approach and then the fractal complexity spiral starts. Even if the team is able to do the initial implementation, maintaining it usually turns into a burden that they eventually find too big to bear.
CRDTs work well for linear data structures, but there are known issues with hierarchical ones. For instance, if you have a tree, then two clients could send a transaction that would cause a node to be a parent of itself.
That said, there’s work that has been done towards fixing some of those issues.
Evan Wallace (I think he’s the CTO of Figma) has written about a few solutions he tried for Figma’s collaborative features. And then Martin Kleppmann has a paper proposing a solution:
https://martin.kleppmann.com/papers/move-op.pdf
I've looked at CRDTs, and the concept really appeals to me in the general case, but in the specific cases, my design always ends up being "keep-all-the-facts" about a particular item. But then you defer the problem of 'which facts can I throw away?'. It's like inventing a domain-specific GC.
I'd love to hear about any success cases people have had with CRDTs.
Absolutely. My current product relies heavily on a handful of partner systems and, adds an opinionated layer on top of these systems, and propagates data to CRM, DW, and other analytical systems.
One early insight was that we needed a representation of partner data in our database (and the downstream systems need a representation of our opinionated view as well). This is clearly an (eventually consistent) synchronization problem.
We also realized that we often either fail to sync (due to bugs, timing, or whatever) and need a regular process to resync data.
We've ended up with a homegrown framework that does both things, such that the same business logic gets used in both cases. This also makes it easy to backfill data if a chosen representation changes)
We're now on the third or fourth iteration of this system and I'm pretty happy with it.
> Cache Invalidation? Basically a sync problem.
Does naming things and off-by-one errors also count?
UI is also a sync problem if you squint a bit. React like systems are an attempt to be a sync engine between model and view in a sense.
Multiplayer games too.