90 comments
  • codeulike1m

    I've been thinking about this a lot - nearly every problem these days is a synchronisation problem. You're regularly downloading something from an API? That's a sync. You've got a distributed database? Sync problem. Cache Invalidation? Basically a sync problem. You want online and offline functionality? Sync problem. Collaborative editing? Sync problem.

    And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download what's marked as changed' and then get stuck in the quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates etc.).

    The one piece of discussion or attempt at a systematic approach I've seen to 'synchronisation' recently is to do with Conflict-free Replicated Data Types https://crdt.tech which is essentially restricting your data and the rules for dealing with conflicts to situations that are known to be resolvable and then packaging it all up into an object.

    • klabb31m

      > The one piece of discussion or attempt at a systematic approach I've seen to 'synchronisation' recently is to do with Conflict-free Replicated Data Types https://crdt.tech

      I will go against the grain and say CRDTs have been a distraction and the overfocus on them has been delaying real progress. They are immature and highly complex and thus hard to debug and understand, and have extremely limited cross-language support in practice - let alone any indexing or storage engine support.

      Yes, they are fascinating and yes they solve real problems, but they are absolute overkill for your problems (except collab editing), at least currently. Why? Because they are all about conflict resolution. You can get very far without addressing this problem: for instance a cache, like you mentioned, has no need for conflict resolution. The main data store owns the data, and the cache follows. If you can have single ownership (single writer), or last write wins, or similar, you can drop a massive pile of complexity on the floor and not worry about it. (In the rare cases it's necessary, like Google Docs or Figma, I would be very surprised if they use off-the-shelf CRDT libs - I would bet they have extremely bespoke, domain-specific data structures that are inspired by CRDTs.)
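
      To make that concrete, here's a minimal last-write-wins merge sketch in TypeScript (row shape and field names are made up for illustration; it assumes a trustworthy per-row timestamp from a single writer or a synced clock):

        // Sketch: last-write-wins per row, assuming a trustworthy updatedAt timestamp.
        interface Row { id: string; value: unknown; updatedAt: number }

        // Keep whichever copy was written last; ties go to the remote copy.
        function mergeLWW(local: Row, remote: Row): Row {
          return remote.updatedAt >= local.updatedAt ? remote : local;
        }

        // Merging two replicas is then just a per-id fold -- no CRDT machinery involved.
        function mergeStores(local: Map<string, Row>, remote: Map<string, Row>): Map<string, Row> {
          const merged = new Map(local);
          for (const [id, row] of remote) {
            const mine = merged.get(id);
            merged.set(id, mine ? mergeLWW(mine, row) : row);
          }
          return merged;
        }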

      Instead, what I believe we need is end-to-end bidirectional stream based data communication, simple patch/replace data structures to efficiently notify of updates, and standard algorithms and protocols for processing it all. Basically adding async reactivity on the read path of existing data engines like SQL databases. I believe even this is a massive undertaking, but feasible, and delivers lasting tangible value.
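
      A rough sketch of what I mean by patch/replace update messages (the shape is invented for illustration, not any existing protocol):

        // Hypothetical wire format: replace a whole row, patch some fields, or delete.
        type Update =
          | { kind: "replace"; table: string; id: string; row: Record<string, unknown> }
          | { kind: "patch"; table: string; id: string; fields: Record<string, unknown> }
          | { kind: "delete"; table: string; id: string };

        // Apply a stream of updates to a local copy keyed by table and id.
        function apply(store: Map<string, Map<string, Record<string, unknown>>>, u: Update): void {
          const table = store.get(u.table) ?? new Map<string, Record<string, unknown>>();
          store.set(u.table, table);
          if (u.kind === "replace") table.set(u.id, u.row);
          else if (u.kind === "patch") table.set(u.id, { ...(table.get(u.id) ?? {}), ...u.fields });
          else table.delete(u.id);
        }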

    • josephg1m

      I agree! Lots more things are sync. Also: the state of my source files -> my compiler (in watch mode), and about 20 different APIs in the kernel - from keyboard state to filesystem watching to process monitoring to connected USB devices.

      Also, http caching is sort of a special case of sync - where the cache (say, nginx) is trying to keep a synchronised copy of a resource from the backend web server. But because there’s no way for the web server to notify nginx that the resource has changed, you get both stale reads and unnecessary polling. Doing fan-out would be way more efficient than a keep alive header if we had a way to do it!

      CRDTs are cool tech. (I would know - I’ve been playing with them for years). But I think it’s worth dividing data interfaces into two types: owned data and shared data. Owned data has a single owner (eg the database, the kernel, the web server) and other devices live downstream of that owner. Shared data sources have more complex systems - eg everyone in the network has a copy of the data and can make changes, then it’s all eventually consistent. Or raft / paxos. Think git, or a distributed database. And they can be combined - eg, the app server is downstream of a distributed database. GitHub actions is downstream of a git repo.

      I’ve been meaning to write a blog post about this for years. Once you realise how ubiquitous this problem is, you see it absolutely everywhere.

    • ochiba1m

      > And 'synchronisation' as a practice gets very little attention or discussion. People just start with naive approaches like 'download what's marked as changed' and then get stuck in the quagmire of known problems and known edge cases (handling deletions, handling transport errors, handling changes that didn't get marked with a timestamp, how to repair after a bad sync, dealing with conflicting updates etc.).

      I've spent 16 years working on a sync engine and have worked with hundreds of enterprises on sync use cases during this time. I've seen countless cases of developers underestimating the complexity of sync. In most cases it happens exactly as you said: start with a naive approach and then the fractal complexity spiral starts. Even if the team is able to do the initial implementation, maintaining it usually turns into a burden that they eventually find too big to bear.

    • danielvaughn1m

      CRDTs work well for linear data structures, but there are known issues with hierarchical ones. For instance, if you have a tree, two clients could each send a transaction that, combined, would cause a node to become a parent of itself.

      That said, there’s work that has been done towards fixing some of those issues.

      Evan Wallace (I think he’s the CTO of Figma) has written about a few solutions he tried for Figma’s collaborative features. And then Martin Kleppmann has a paper proposing a solution:

      https://martin.kleppmann.com/papers/move-op.pdf
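
      To make the cycle problem concrete, here's a toy TypeScript sketch (not the algorithm from the paper): each replica's move is fine locally, but applying both without a check would make a node its own ancestor.

        // Toy tree as a child-id -> parent-id map.
        type Tree = Map<string, string>;

        // Walk up from the proposed parent; if we reach the node being moved, the move would form a cycle.
        function wouldCreateCycle(tree: Tree, node: string, newParent: string): boolean {
          for (let cur: string | undefined = newParent; cur !== undefined; cur = tree.get(cur)) {
            if (cur === node) return true;
          }
          return false;
        }

        const tree: Tree = new Map([["B", "A"], ["C", "A"]]);
        tree.set("B", "C");                             // replica 1: move B under C -- fine on its own
        console.log(wouldCreateCycle(tree, "C", "B"));  // true: replica 2's concurrent "move C under B" must be rejected or reordered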

    • mrkeen1m

      I've looked at CRDTs, and the concept really appeals to me in the general case, but in the specific cases, my design always ends up being "keep-all-the-facts" about a particular item. But then you defer the problem of 'which facts can I throw away?'. It's like inventing a domain-specific GC.

      I'd love to hear about any success cases people have had with CRDTs.

    • jbmsf1m

      Absolutely. My current product relies heavily on a handful of partner systems, adds an opinionated layer on top of these systems, and propagates data to CRM, DW, and other analytical systems.

      One early insight was that we needed a representation of partner data in our database (and the downstream systems need a representation of our opinionated view as well). This is clearly an (eventually consistent) synchronization problem.

      We also realized that we often fail to sync (due to bugs, timing, or whatever) and need a regular process to resync data.

      We've ended up with a homegrown framework that does both things, such that the same business logic gets used in both cases. (This also makes it easy to backfill data if a chosen representation changes.)

      We're now on the third or fourth iteration of this system and I'm pretty happy with it.

    • pwdisswordfishz1m

      > Cache Invalidation? Basically a sync problem.

      Does naming things and off-by-one errors also count?

    • mattnewport1m

      UI is also a sync problem if you squint a bit. React-like systems are an attempt to be a sync engine between model and view, in a sense.

      Multiplayer games too.

  • mackopes1m

    I'm not convinced that there is one generalised solution to sync engines. To make them truly performant at large scale, engineers need a deep understanding of the underlying technology - query performance, the database, networking - and need to build a custom sync engine around their product and their data.

    Abstracting all of this complexity away in one general tool/library and pretending that it will always work is snake oil. There are no shortcuts to building truly high quality product at a large scale.

    • wim1m

      We've built a sync engine from scratch. Our app is a multiplayer "IDE" but for tasks/notes [1], so it's important to have a fast local-first/offline experience like other editors, and have changes sync in the background.

      I definitely believe sync engines are the future as they make it so much easier to enable things like no-spinners browsing your data, optimistic rendering, offline use, real-time collaboration and so on.

      I'm also not entirely convinced yet though that it's possible to get away with something that's not custom-built, or at least large parts of it. There were so many micro decisions and trade-offs going into the engine: what is the granularity of updates (characters, rows?) that we need and how does that affect the performance. Do we need a central server for things like permissions and real-time collaboration? If so do we want just deltas or also state snapshots for speedup. How much versioning do we need, what are implications of that? Is there end-to-end-encryption, how does that affect what the server can do. What kind of data structure is being synced, a simple list/map, or a graph with potential cycles? What kind of conflict resolution business logic do we need, where does that live?

      It would be cool to have something general purpose so you don’t need to build any of this, but I wonder how much time it will save in practice. Maybe the answer really is to have all kinds of different sync engines to pick from and then you can decide whether it's worth the trade-off not having everything custom-built.

      [1] https://thymer.com

    • tonsky1m

      - You can have many sync engines

      - Sync engines might only solve small and medium scale; that would be a huge win even without large scale

    • thr0w1m

      > Abstracting all of this complexity away in one general tool/library and pretending that it will always work is snake oil.

      Remember Meteor?

    • xg151m

      That might be true, but you might not have those engineers or they might be busy with higher-priority tasks:

      > It’s also ill-advised to try to solve data sync while also working on a product. These problems require patience, thoroughness, and extensive testing. They can’t be rushed. And you already have a problem on your hands you don’t know how to solve: your product. Try solving both, fail at both.

      Also, you might not have that "large scale" yet.

      (I get that you could also make the opposite case, that the individual requirements for your product are so special that you cannot factor out any common behavior. I'd see that as a hypothesis to be tested.)

  • tbrownaw1m

    > decoupled from the horrors of an unreliable network

    The first rule of network transparency is: the network is not transparent.

    > Or: I’ve yet to see a code base that has maintained a separate in-memory index for data they are querying

    Is boost::multi_index_container no longer a thing?

    Also there's SQLite with the :memory: database.

    And this ancient 4gl we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or not indexes) as a basic language feature.

    • anonyfox1m

      In Elixir/Erlang thats quite common I think, at least I do this for when performance matters. Put the specific subset of commonly used data into a ETS table (= in memory cache, allowing concurrent reads) and have a GenServer (who owns that table) listen to certain database change events to update the data in the table as needed.

      Helps a lot with high read situations and takes considerable load off the database with probably 1 hour of coding effort if you know what you're doing.

    • TeMPOraL1m

      > Is boost::multi_index_container no longer a thing?

      Depends on the shop. I haven't seen one in production so far, but I don't doubt some people use it.

      > Also there's SQLite with the :memory: database.

      Ah, now that's cheating. I know, because I did that too. I did that because of the realization that half the members I'm stuffing into classes to store my game state are effectively poor man's hand-rolled tables, indices and spatial indices, so why not just use a proper database for this?.
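
      For the curious, a tiny sketch of the shape of that (TypeScript with better-sqlite3, just to show the idea; table and column names invented):

        import Database from "better-sqlite3";

        // In-memory database: real tables and indexes instead of hand-rolled maps in every class.
        const db = new Database(":memory:");
        db.exec(`
          CREATE TABLE units (id INTEGER PRIMARY KEY, x REAL, y REAL, hp INTEGER);
          CREATE INDEX units_pos ON units (x, y);
        `);

        db.prepare("INSERT INTO units (x, y, hp) VALUES (?, ?, ?)").run(10, 20, 100);

        // Queries replace the ad-hoc lookup structures; the engine maintains the index.
        const nearby = db
          .prepare("SELECT * FROM units WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ?")
          .all(0, 50, 0, 50);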

      > And this ancient 4gl we use at work has in-memory tables (as in database tables, with typed columns and any number of unique or not indexes) as a basic language feature.

      Which one is this? I've argued in the past that this is a basic feature missing from 4GL languages, and a lot of work in every project is wasted on hand-rolling in-memory databases left and right, without realizing it. It would seem I've missed a language that recognized this fact?

      (But then, so did most of the industry.)

    • aiono1m

      > The first rule of network transparency is: the network is not transparent.

      That's precisely why the current request model is painful.

  • ximm1m

    > have a theory that every major technology shift happened when one part of the stack collapsed with another.

    If that was true, we would ultimately end up with a single layer. Instead I would say that major shifts happen when we move the boundaries between layers.

    The author here proposes to replace servers by synced client-side data stores.

    That is certainly a good idea for some applications, but it also comes with drawbacks. For example, it would be easier to avoid stale data, but it would be harder to enforce permissions.

    • worthless-trash1m

      I feel like this is the "serverless" discussion all over again.

      There was still a server, it's just not YOUR server. In this case, there will still be servers, just maybe not something that you need to manage state on.

      This misnaming creates endless conflict when trying to communicate this with hyper excited management who want to get on the latest trend.

      Can't wait to be in the meeting and hear: "We don't need servers when we migrate to client-side data stores".


  • mentalgear1m

    Honourable mentions of some more excellent fully open-source sync engines:

    - Zero Sync: https://github.com/rocicorp/mono

    - Triplit: https://github.com/aspen-cloud/triplit

  • zx80801m

    > decoupled from the horrors of an unreliable network

    There's no such thing as a reliable network in the world. The world is network-connected; there are almost no local-only systems anymore (and haven't been for a long, long time).

    Some engineers dream that there are cases when the network is reliable, like when a system fully lives in the same region and a single AZ. But even then it's actually not reliable and can have glitches quite frequently (like once per month or so, depending on luck).

    • 01HNNWZ0MV43FF1m

      True. Even the network between the CPU and an SD card or USB drive is not reliable

    • jimbokun1m

      I believe the point is that given an unreliable network, it's nice to have access to all the data available locally up to the point when you had a network issue. And then when the network is working again, your data comes up to date with no extra work on the application developer's part.

    • tonsky1m

      > There's no such thing as a reliable network in the world

      I’m not saying there is


  • myflash131m

    Locally synced databases seem to be a new trend. Another example is Turso, which works by maintaining a sort of SQLite-DB-per-tenant architecture. Couple that with WASM and we’ve basically come full circle back to old school desktop apps (albeit with sync-on-load). Fat client thin client blah blah.

  • PaulHoule1m

    Lotus Notes was a product far ahead of its time (nearly forgotten today): an object database with synchronization semantics. They made a lot of decisions that seem really strange today, like building an email system around it, but that empowered it for long-running business workflows. It's something everybody in the low-code/no-code space really needs to think about.

    • ddrdrck_1m

      No one that has ever had to work with Lotus Notes could forget it. It was atrocious. Maybe the sync engine was great but I really do not know what it was used for ...

  • skybrian1m

    This is also a tricky UI problem. Live updates, where web pages move around on you while you’re reading them, aren’t always desirable. When you’re collaborating with someone you know on the same document, you want to see edits immediately, but what about a web forum? Do you really need to see the newest responses, or is this a distraction? You might want a simple indicator that a reload will show a change, though.

    A white paper showing how Instant solves synchronization problems might be nice.

  • slifin1m

    I'm surprised to see Tonsky here

    Mostly because I consider the state of the art on this to be Clojure Electric and he presumably is aware of it at least to some degree but does not mention it

    • tonsky1m

      Clojure Electric is different. It’s not really sync, it’s more of a thin client. It relies on having a fast connection to the server at all times, and re-fetches everything all the time. Their innovation is that they found a really, really ergonomic way to do it

    • mananaysiempre1m

      I’m also surprised, but more because I remember very vividly his previous post on sync[1] which described a much more user-friendly (andm much less startup-friendly) system.

      [1] https://tonsky.me/blog/crdt-filesync/

    • profstasiak1m

      thank you for mentioning! I have been reading a lot about sync engines and never saw Clojure Electric being mentioned here on HN!

  • ForTheKidz1m

    > You’ll get your data synced for you

    How does this happen without an interface for conflict resolution? That's the hard part.

    • phito1m

      Right, the first thing I did after opening the article was Ctrl-F for "conflict", and got zero results. How are they not talking about the only real problem with the local-first approach? The rest is just boilerplate code.

    • Sammi1m

      All this recent hype about sync engines and local-first applications completely disregards conflict resolution. It's the reason why syncing isn't mainstream already: it isn't solved and arguably cannot be.

      Imagine if git just on its own picked what to keep and what to throw away when there's a conflict. You fundamentally need the user to make the choice.

    • avodonosov1m

      They elaborate on the conflicts in the "80/20 for Multiplayer" section of this essay: https://www.instantdb.com/essays/next_firebase

      (make sure to also read the footnote [28] there).

    • tonsky1m

      Ah, no. Not really. People sometimes think about conflict resolution as a problem that needs to be solved. But it’s not solvable, not really. It’s part of the domain, it’s not going anywhere, it’s irreducible complexity.

      You _will_ have conflicts (because your app is distributed and there are concurrent writes). They will happen on semantic level, so only you (app developer) _will_ be able to solve them. Database (or any other magical tool) can’t do it for you.

      Another misconception is that conflict resolution needs to be “solved” perfectly before any progress can be made. That is not true either. You might have unhandled conflicts in your system and still have a working, useful, successful product. Conflicts might be rare, insignificant, or people (your users) will just correct for/work around them.

      I am not saying “drop data on the floor”, of course, if you can help it. But try not to overthink it, either.
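
      A tiny made-up example of what "semantic level" means in practice - only the app can know that "done" should win, while a real title conflict should stay visible to the user:

        // Hypothetical domain type -- only the app knows how its fields should merge.
        interface Task { title: string; done: boolean }

        // Three-way merge against the last common version.
        function mergeTask(base: Task, a: Task, b: Task): Task {
          const title =
            a.title === b.title || b.title === base.title ? a.title :  // only a changed (or no change)
            a.title === base.title ? b.title :                         // only b changed
            `${a.title} / ${b.title} (conflict)`;                      // both changed: surface it, don't drop it
          return { title, done: a.done || b.done };
        }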

  • iansinnott1m

    Have been using Instant for a few side projects recently and it has been a phenomenal experience. 10/10, would build with it again. I suspect this is also at least partially true of client-server sync engines in general.

    • kenrick951m

      I concur with this. Been using it on my side project that only has a front-end. The "back-end" is 100% InstantDB. Although for me, I found the permissions part a bit hard to understand, especially when it involves linking to other namespaces. Haven't checked them for a while, maybe they've improved on this...

  • zelon881m

    Here's an idea.... Stop putting your critical business data on disparate third party systems that you don't have access to. Problem solved!

  • joeeverjk1m

    If sync really is the future, do you think devs will finally stop pretending local-first apps are some niche thing and start building around sync as the core instead of the afterthought? Or are we doomed to another decade of shitty conflict resolution hacks?

    • Zanfa1m

      > Or are we doomed to another decade of shitty conflict resolution hacks?

      Conflict resolution is never going away. It's important to distinguish between syntactic and semantic conflicts though; the first can be solved automatically, but the other will always require manual intervention.

    • Tobani1m

      I think this makes sense for applications that are just managing data, maybe? But if your application needs to do things when you change that data (like call a third-party system)... syncing is maybe not the solution. What happens when the total dataset is large - do you need to download 6gb of data every time you log in? Now you've blown up the quota on local storage. How do you make sure the appropriate data is downloaded, or enough data? How do you prioritize the data you need NOW instead of waiting for the last byte of the 6gb to download?

      It seems like a useful tool, but not the only future.

  • paduc1m

    Before I write anything to the DB, I validate with business logic.

    Should I write this logic in the DB itself ? Seems impractical.

    • TeMPOraL1m

      > Should I write this logic in the DB itself ?

      Yes?

      If it sounds impractical, it's because the whole industry got used to not learning databases beyond the most basic SQL, and doing everything by hand in application code itself. But given how much of the code in most applications is just an ad-hoc reimplementation of databases, and how much of the business logic is tied to data and not application-specific things, I can't help but wonder - maybe a better way would be to treat the RDBMS as an application framework and have the application itself be a thin UI layer on top?

      On paper it definitely sounds like grouping concerns better.

    • tonsky1m

      If you think of an existing database, like Postgres, sure. It’s not very convenient.

      What I am saying is, in a perfect world, the database and the server would be one thing and run code _and_ data at the same time. There's really no good reason why they are separated, and it causes a lot of inconvenience right now.

    • Terr_1m

      > logic in the DB

      Something similar but in the opposite direction of lessening DB-responsibilities in favor of logic-layer ones: Driving everything from an event log. (Related to CQRS, Event-Sourcing.)

      It means a bit less focus on "how do I ensure this data-situation never ever ever happens" logic, and a bit more "how shall I model escalation and intervention when weird stuff happens anyway."

      This isn't as bad as it sounds, because any sufficiently old/large software tends to accrue a bunch of informal tinkering processes anyway. It's what drives the unfortunate popularity of DB rows with a soft-deleted mark (that often require manual tinkering to selectively restore) because somebody always wants a special undo which is never really just one-time-only.
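
      A bare-bones sketch of that shape (event names invented): state is just a fold over the log, and the "intervention" case is one more event type rather than a constraint the store must never violate.

        // Hypothetical append-only event log for an account.
        type Event =
          | { type: "Deposited"; amount: number }
          | { type: "Withdrew"; amount: number }
          | { type: "CorrectionApplied"; amount: number; reason: string }; // the escalation/intervention case

        // Current state is derived by replaying the log; nothing is mutated in place.
        function replay(log: Event[]): { balance: number } {
          let balance = 0;
          for (const e of log) {
            if (e.type === "Deposited") balance += e.amount;
            else if (e.type === "Withdrew") balance -= e.amount;
            else balance += e.amount; // CorrectionApplied
          }
          return { balance };
        }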

    • scotty791m

      I think that's the main issue. It's not enough to have a database that can automatically sync between frontend and backend. It would also need to be complex enough to keep some logic just on the backend (because you don't want to reveal it or entrust adherence to the client) and to reject changes made on the frontend if they are invalid. The database would become the app itself.

  • quantadev1m

    IPFS is a technology very helpful for syncing. One way it's being used in a modern context (although only sub-parts of the IPFS stack) is how BlueSky engineers, during their design process a few years ago, accepted my proposal that for a new social media protocol, each user should have his own "Repository" (basically a Merkle tree) of everything he's ever posted. Then there's just a "Sync" up to some master service provider node (a decentralized set of nodes/servers) for the rest of the world to consume.

    Merkle-tree-based syncing is as performant as you can possibly get (used by the Git protocol too, I believe) because you can tell if the root of a tree structure is identical to some other remote tree structure just by comparing the hash strings. And this can be recursively applied down any "changed branches" of a tree to implement very fast syncing mechanisms.
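
    The recursion looks roughly like this (a TypeScript sketch of the idea, not IPFS's or git's actual formats):

      import { createHash } from "node:crypto";

      interface TreeNode { hash: string; children: Map<string, TreeNode>; data?: string }

      // A node's hash covers its own data plus its children's hashes ("pervasive hashing").
      function hashNode(data: string, children: Map<string, TreeNode>): string {
        const h = createHash("sha256").update(data);
        for (const [name, child] of children) h.update(name).update(child.hash);
        return h.digest("hex");
      }

      // Diff two trees by descending only into branches whose hashes differ.
      function changedPaths(local: TreeNode | undefined, remote: TreeNode, path = ""): string[] {
        if (local && local.hash === remote.hash) return [];   // identical subtree: skip it entirely
        const out: string[] = remote.children.size === 0 ? [path] : [];
        for (const [name, child] of remote.children) {
          out.push(...changedPaths(local?.children.get(name), child, `${path}/${name}`));
        }
        return out;
      }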

    I think we need a NEW INTERNET (i.e. Web3, and dare I say Semantic Web built in) where everyone's basically got their own personal "Tree of Stuff" they can publish to the world, all natively built into some new kind of tree-structure-based killer app. Like imagine having Jupyter Notebooks in tree form, where everything on them (that you want to be) is published to the web.

  • Nelkins1m

    Discussion of sync engines typically goes hand in hand with local-first software. But it seems to be limited to use cases where the amount of data is on the smaller side. For example, can anyone imagine how there might be a local-first version of a recommendation algorithm (I'm thinking something TikTok-esque)? This would be a case where the determination of the recommendation relies on a large amount of data.

    Or think about any kind of large-ish scale enterprise SaaS. One of the clients I'm working with currently sells a Transportation Management Software system (think logistics, truck loads, etc). There are very small portions of the app that I can imagine relying on a sync engine, but being able to search over hundreds of thousands of truck loads, their contents, drivers, etc seems like it would be infeasible to do via a sync engine.

    I mention this because it seems that sync engines get a lot of hype and interest these days, but they apply to a relatively small subset of applications. Which may still be a lot, but it's a bit much to say they're the future (I'm inferring "of application development"--which is what I'm getting from this article).

  • jiggawatts1m

    > Such a library would be called a database. But we’re used to thinking of a database as something server-related, a big box that runs in a data center. It doesn’t have to be like that! Databases have two parts: a place where data is stored and a place where data is delivered. That second part is usually missing.

    Yes! A thousand times this!

    Databases can't just "live on a server somewhere", their code should extend into the clients. The client isn't just a network protocol parser / serialiser, it should implement what is essentially an untrusted, read-only replica. For writes, it should implement what is essentially a local write-ahead log (WAL) either in-memory and optionally fsync-d to local storage. All of this should use the same codebase as the database engine, or machine-generated in multiple languages from some sort of formal specification.

  • spankalee1m

    The problem I have with "moving the database to the client" is the same one I have in practice with CRDTs: In my apps, I need to preserve the history of changes to documents, and I need to validate and authenticate based on high-level change descriptions, not low-level DB access.

    This always leads me back to operational transforms. Operations, being reified changes, function as undo records; a log of changes; and a narrower, semantically meaningful API amenable to validation and authz.

    For the Roam Firebase example: this only works if you can either trust the client to always perform valid actions, or you can fully validate with Firebase's security rules.

    OT has critiques, but almost all of them fall away in my experience when you have a star topology with a central service that mediates everything - defining the canonical order of operations, performing validation & auth, and recording the operation log.
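
    Concretely, the shape I mean is something like this (all names hypothetical): clients submit semantically meaningful operations, and one central service authorizes, sequences, and logs them before broadcasting.

      // Hypothetical high-level operations -- the API surface is the operation set, not raw row writes.
      type Op =
        | { kind: "renamePage"; pageId: string; title: string }
        | { kind: "movePage"; pageId: string; newParentId: string };

      interface LogEntry { seq: number; userId: string; op: Op }

      const log: LogEntry[] = [];

      // The central service is the single place that validates, orders, and records operations.
      function submit(userId: string, op: Op, canEdit: (userId: string, pageId: string) => boolean): LogEntry {
        if (!canEdit(userId, op.pageId)) throw new Error("forbidden");
        const entry = { seq: log.length, userId, op };
        log.push(entry);        // the log doubles as history and undo material
        return entry;           // ...and is then broadcast to other clients in seq order
      }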

  • Phelinofist1m

    The largest feature my team develops is a sync engine. We have a distributed speech assistant app (multiple embedded instances [think car and smartphone] & cloud) that utilizes the Blackboard pattern. The sync engine keeps the blackboards on all instances in sync.

    It is based on gRPC and uses a state machine on all instances that transitions through different states for connection setup, "bulk sync", "live sync" and connection wind down.

    Bulk sync is the state that is used when an instance comes online and needs to catch up on any missed changes. It is also the self-heal mechanism if something goes wrong.
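
    Roughly, the state machine looks like this (heavily simplified, not the real protocol):

      // Simplified per-instance sync state machine.
      type SyncState = "connecting" | "bulkSync" | "liveSync" | "closing";

      function next(state: SyncState, event: "connected" | "caughtUp" | "error" | "shutdown"): SyncState {
        switch (state) {
          case "connecting": return event === "connected" ? "bulkSync" : state;
          case "bulkSync":   return event === "caughtUp" ? "liveSync"
                                  : event === "error" ? "connecting" : state;
          case "liveSync":   return event === "error" ? "connecting"   // self-heal via another bulk sync
                                  : event === "shutdown" ? "closing" : state;
          case "closing":    return state;
        }
      }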

    Unfortunately some embedded instances have super unreliable clocks that drift quite a bit (in both directions). We're considering switching to a logical clock.

    We have quite a bit of code that deals with conflicts.

    I inherited this from my predecessor. Nowadays I would probably not implement something like this again, as it is quite complex.

  • erichocean1m

    I designed the sync engine for Things Cloud [0] over a decade ago. It seems to have worked out pretty well for them. (The linked page has some details about what it can do.)

    When sync Just Works™, it's a magical thing.

    One of the reasons my design has been reliable from its very first release, even across multiple refactors/rewrites (I believe it's currently on its third, this time in Swift), is that it uses a Git-like model internally with pervasive hashing. It's almost impossible for sync to work incorrectly (if it works at all).

    [0] https://culturedcode.com/things/cloud/

  • voidpointer1m

    Probably a silly question, but if you take this all the way and treat everything as a DB that is synchronized in the background, how do you manage access control when not every user/client is supposed to have access to every object represented in the DB? Where does that logic go? If you do it on the document level like figma or canvas, every document is a DB and you sync the changes that happen to the document, but first you need access to the document/DB. But doesn't this whole idea break apart if you need to do access control on individual parts of what you treat as the DB? You would need to have that logic on the client, which could never be secure...

  • theanirudh1m

    How do sync engines address issues where we need something to be more dynamic? Currently I'm building a language learning app and we need to display your "learning path" - what lessons you have finished and what your next lessons are. The next lessons aren't fixed/the same for everyone; they change depending on the scores of completed lessons. Is any query language dynamic enough to support use cases like this? Or is it expected to recalculate the next lessons whenever the user completes a lesson and write them out to a table which can then be queried easily?

  • profstasiak1m

    so... what do people that want to have sync engines do?

    I want to try it for a hobby project and I think I will go the route of just one-way sync (from database to clients) using ElectricSQL, and I will have writes done in a traditional way (POST requests).

    I like the idea of having server db and local db in sync, but what happens with writes? I know people say CRDT etc... but they are solving conflicts in unintuitive ways...

    I know I probably sound uneducated, but I think the biggest part of this is still solving conflicts in a good way, and I don't really see how you can solve those in a way that works for all different domains and have it "collapsed" as the author says

  • Pamar1m

    Maybe I am just dumb but I really cannot see how data synch could solve what (in my kind of business) is a real problem.

    Example: you develop a web app to book for flights online.

    My browser points to it and I log in. Should synchronization start right now? Before I even input my departure point and date?

    Ok, no. I write NYC -> BER, and a dep date.

    Should I start synching now?

    Let's say I do. Is this really more efficient than querying a webservice?

    Ok, now all data are synched. Even potentially the ones for business class, even if I just need economy.

    You know, I could always change my mind later. Or find out that on the day I need to travel no economy seats are available anymore.

    Whatever. I have all the inventory data that I need. Raw.

    Guess what? As a LH frequent flyer I get special treatment in terms of price. Not just for LH, but most Business Alliance airlines.

    This logic is usually on the server, because airlines want maximum creativity and flexibility in handling inventory.

    Should we just synch data and make the offer selection algorithm run on the webserver instead?

    Let's say it does not matter... I somehow have in front of me all the options for my trip. So I call my wife to confirm she agrees with my choice. I explain the alternatives to her... this takes 5 minutes.

    In this period, 367 other people are buying/cancelling trips to Europe. So I either see my selection constantly change (yay! Synchronization!!!) or I press confirm, and if my choice is gone I get a warning message and I repeat my query.

    Now add two elements: airlines prefer not to show real numbers of available seats - they will usually send you a single digit from 1 to 9 or a "*" to mean "10 or more".

    So just synching raw data and letting the combinatorial engine work in the browser is not a very good idea.

    Also, I see the potential to easily mount DDoS attacks if every client is constantly being synchronized by copying high-contention tables in real time.

    What am I missing here?

  • loquisgon1m

    The local-first people (https://localfirstweb.dev/) have some cool ideas about how to solve the data sync problem. Check it out.

  • qudat1m

    The problem with sync engines is needing full-stack buy-in in order for them to work properly. Having a separate backend-for-frontend service defeats the purpose in my mind. So what do you do when a company already has an API and other clients beyond a web app? The web app has to accommodate. I see this as the major downside of sync engines.

    I've been using `starfx` which is able to "sync" with APIs using structured concurrency: https://github.com/neurosnap/starfx

  • zareith1m

    I think an underappreciated library in this space is Logux [1]

    It requires deeper (and more) integration work compared to solutions that sync your state for you, but is a lot more flexible wrt. the backend technology choices.

    At its core, it is an action synchronizer. You manage both your local state and remote state through redux-style actions, and the library takes care of syncing and resequencing them (if needed) so that all clients converge at the same state.

    [1] https://logux.org/

  • onion2k1m

    Isn't this what CouchDB/PouchDB solves in quite a nice way?

  • fxnn1m

    The author would be excited to learn that CouchDB has been solving this problem for 20 years.

    The use case the article describes is exactly the idea behind CouchDB: a database that is at the same time the server, and that's made to be synced with the client.

    You can even put your frontend code into it and it will happily serve it (aka CouchApp).

    https://couchdb.apache.org

  • rockmeamedee1m

    Idk man. It's a nice idea, but it has to be 10x better than what we currently have to overcome the ecosystem advantages of the existing tech. In practice, people in the frontend world already use Apollo/Relay/Tanstack Query to do data caching and querying, and don't worry too much about the occasional overfetching/unoptimized-ness of the setup. If they need to do a complex join they write a custom API endpoint for it. It works fine. Everyone here is very wary of a "magic data access layer" that will fix all of our problems. Serverless turned out to be a nightmare because it only partially solves the problem.

    At the same time, I had a great time developing on Meteorjs a decade ago, which used Mongo on the backend and then synced the DB to the frontend for you. It was really fluid. So I look forward to things like this being tried. In the end though, Meteor is essentially dead today, and there's nothing to replace it. I'd be wary of depending so fully on something so important. Recently Faunadb (a "serverless database") went bankrupt and is closing down after only a few years.

    I see the product being sold is pitched as a "relational version of firebase", which I think is a good idea. It's a good idea for starter projects/demos all the way up to medium-sized apps (and might even scale further than firebase by being relational), but it's not "The Future" of all app development.

    Also, I hate to be that guy, but the SQL in the example could be simpler: when aggregating into JSON it's nice to use a LATERAL join, which essentially turns the join into a for loop and synthesises rows "on demand":

      SELECT g.*, 
             COALESCE(t.todos, '[]'::json) as todos
      FROM goals g
      LEFT JOIN LATERAL (
        SELECT json_agg(t.*) as todos
        FROM todos t
        WHERE t.goal_id = g.id
      ) t ON true
    
    That still proves the author's point that SQL is a very complicated tool, but I will say the query itself looks simpler (only 1 join vs 2 joins and a group by) if you know what you're doing.
  • avodonosov1m

    Why he hasn't implemented a full Datomic Peer for his DataScript, I've never understood.

    Having a datalog query engine, supplying it with data from Datomic indexes - b-tree-like collections storing entity-attribute-value records - seems simple. Updating the local index cache from the log is also simple.

    And that gets you a db in the browser.

  • finolex1m

    If anyone would be kind enough to give feedback on the local-first x data-ownership db we're building, I would really appreciate it! https://docs.basic.tech/

    Will do my best to take action on any feedback I receive here

  • shikhar1m

    We have had interest in using our serverless stream API (https://s2.dev/) to power sync engines. Very excited about these kinds of use cases, email in profile if anyone wants to chat.

  • beders1m

    I found it quite disappointing that this is a marketing piece from Niki.

    It is full of general statements that are only true for a subset of solutions. Enterprise solutions in particular are vastly more complex and can't be magically made simple by a syncing database. (no solution comes even close to "99% business code". Not unless you re-define what business code is)

    It is astounding how many senior software engineers or architects don't understand that their stack contains multiple data models and even in a greenfield project you'll end up with 3 or more. Reducing this to one is possible for simple cases - it won't scale up. (Rama's attempt is interesting and I hope it proves me wrong)

    From: "yeah, now you don't need to think about the network too much" to "humbug, who even needs SQL"

    I've seen much bigger projects fail because they fell for one or both of these ideas.

    While I appreciate some magic on the front-end/back-end gap, being explicit (calling endpoints, receiving server-side-events) is much easier to reason about. If we have calls failing, we know exactly where and why. Sprinkle enough magic over this gap and you'll end up in debugging hell.

    Make this a laser focused library and I might still be interested because it might remove actual boilerplate. Turn it into a full-stack and your addressable market will be tiny.

  • asdffdasy1m

    > Such a library would be called a database.

    bold of them to assume a database can manage even the most trivial of conflicts.

    There's a reason you bombard all your writes to a "main/master/etc"

  • hamilyon21m

    I am feeling a bit confused. Isn't the stated problem 99.9% solved by decades-old, battle-proven optimistic locking and some careful retries?
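
    For reference, the pattern is roughly this (a sketch against a hypothetical query helper, not any particular driver):

      // Optimistic locking: every row carries a version; an update only succeeds if the
      // version hasn't moved since we read it, otherwise we re-read and retry.
      declare function query(sql: string, params: unknown[]): Promise<{ rowCount: number; rows: any[] }>;

      async function updateTitle(id: string, newTitle: string, retries = 3): Promise<void> {
        for (let attempt = 0; attempt < retries; attempt++) {
          const { rows } = await query("SELECT version FROM docs WHERE id = $1", [id]);
          const { version } = rows[0];
          const res = await query(
            "UPDATE docs SET title = $1, version = version + 1 WHERE id = $2 AND version = $3",
            [newTitle, id, version],
          );
          if (res.rowCount === 1) return;   // success
          // someone else wrote in between: loop and retry against the fresh version
        }
        throw new Error("too many concurrent updates");
      }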

  • arkh1m

    The future of webapps: wasm in the browser, direct SQL for the API.

    Main problem? No result caching but that's "just" a middleware to implement.

  • mike_hearn1m

    I recently took a part time role at Oracle Labs and have been learning PL/SQL as part of a project. Seeing as Niki is shilling for his employer, perhaps it's OK for me to do the same here :) [1]. HN discourse could use a bit of a shakeup when it comes to databases anyway. This may be of only casual interest to most readers, but some HN readers work at places with Oracle licenses and others might be surprised to discover it can be cheaper than an AWS managed Postgres [2].

    It has a couple of features relevant to this blog post.

    The first: Niki points out that in standard SQL producing JSON documents from relational tables is awkward and the syntax is terrible. This is true, so there's a better syntax:

        CREATE JSON RELATIONAL DUALITY VIEW dept_w_employees_dv AS
        SELECT JSON {'_id'            : d.deptno,
                     'departmentName' : d.dname,
                     'location'       : d.loc,
                     'employees'      :
                         [ SELECT JSON {'employeeNumber' :e.empno,
                                        'name' : e.ename}
                           FROM employee e
                           WHERE e.deptno = d.deptno ]
                    }
        FROM department d WITH UPDATE INSERT DELETE;
    
    It makes compound JSON documents from data stored relationally. This has three advantages: (1) JSON documents get materialized on demand by the database instead of requiring frontend code to do it, (2) the ORDS proxy server can serve these over HTTP via generic authenticated endpoints (e.g. using OAuth or cookie based auth) so you may not need to write any code beyond SQL to get data to the browser, and (3) the JSON documents produced can be written to, not only read.

    The second feature is query change notifications. You can issue a command on a connection that starts recording the queries issued on it and then get a callback or a message posted to an MQ when the results change (without polling). The message contains some info about what changed. So by wiring this up to a web socket, which is quite easy, the work of an hour or two in most web frameworks, then you can stream changes to the client directly from the database without needing much logic or third party integrations. You either use the notification to trigger a full requery and send the entire result json back to the browser, or you can get fancier and transform the deltas to json subsets.
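
    The web-socket half of that really is small. Something like the following sketch, where onResultChange stands in for whatever the driver's change-notification hook is (hypothetical, not the actual driver or ORDS API):

      import { WebSocketServer, WebSocket } from "ws";

      // Hypothetical hook: the DB driver invokes the callback when a registered query's results change.
      declare function onResultChange(sql: string, cb: (rows: unknown[]) => void): void;

      const wss = new WebSocketServer({ port: 8080 });

      // Fan the fresh result set out to every connected browser.
      onResultChange("SELECT * FROM dept_w_employees_dv", rows => {
        const payload = JSON.stringify(rows);
        for (const client of wss.clients) {
          if (client.readyState === WebSocket.OPEN) client.send(payload);
        }
      });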

    It'd be neat if there was a way to join these two features together out of the box, but AFAIK if you want full streaming of document deltas to the browser and reconstituting them there, it would need a bit more on top.

    Again, you may feel this is irrelevant because doesn't every self-respecting HN reader use Postgres for everything, but it's worth knowing what's out there. Especially as the moment you decide to pay a cloud for hosting your DB you have crossed the Rubicon anyway (all the hosted DBs are proprietary forks of Postgres), so you might as well price out alternatives.

    [1] and you know the drill, views are my own and nobody has reviewed this post.

    [2] https://news.ycombinator.com/item?id=42855546

  • ltbarcly31m

    This has been solved every 5 years or so, and along the way people learn why this solution doesn't actually work.

  • sreekanth8501m

    We use IndexedDB and SignalR for real-time sync. What is new about this?

  • delusional1m

    > I’ve yet to see a code base that has maintained a separate in-memory index for data they are querying

    Define "separate" but my old X11 compositor project neocomp I did something like that with a series of AOS arrays and bitfields that combined to make a sort of entity manager. Each index in the arrays was an entity, and each array held a data associated with a "type" of entity. An entity could hold multiple types that would combine to specify behavior. The bitfield existed to make it quick to query.

    It waaay too complicated for what it was, but it was fun to code and worked well enough. I called it a "swiss" (because it was full of holes). It's still online on github (https://github.com/DelusionalLogic/NeoComp/blob/master/src/s...) even though I don't use it much anymore.

  • ativzzz1m

    I've always wondered, how do applications with more stringent security requirements handle this?

    Assume that permissions to any row in the DB can be removed at any time. If we store the data offline, this security measure is already violated. If you don't care about a user potentially storing data they no longer have access to, when they come online, any operations they make are invalid and that's fine

    But, if security access is part of your business logic, and is complex enough to the point where it lives in your app and not in your DB (other than using DB tools like RLS), how do you verify that the user still has access to all cached data? Wouldn't you need to re-query every row every time?

    I'm still uncertain how these sync engines can be secured properly

  • wslh1m

    Sync, in general, is a very complex topic. There are past examples, such as just trying to sync contacts across different platforms where no definitive solution emerged. One fundamental challenge is that you can’t assume all endpoints behave fairly or consistently, so error propagation becomes a core issue to address.

    Returning to the contacts example, Google Contacts attempts to mitigate error propagation by introducing a review stage, where users can decide how to handle duplicates (e.g., merge contacts that contain different information).

    In the broader context of sync, this highlights the need for policies to handle situations where syncing is simply not possible beyond all the smart logic we may implement.

  • hyperbolablabla1m

    How does this compare to supabase?

  • VikingCoder1m

    There are two hard problems:

    1. Naming things

    2. Caching

    3. Off-by-one errors

  • keizo1m

    didn't know that about roam research. I was a user, but also that app convinced me that front-end went in the wrong direction for a decade...

    The Rocicorp Zero Sync / InstantDB / Linear-app-like trend is great -- sync will be big. I hope a lot of the SPA slop gets fixed!

  • theamk1m

    TL/DR:

    > If your database is smart enough and capable enough, why would you even need a server? Hosted database saves you from the horrors of hosting and lets your data flow freely to the frontend.

    (this is a blog of one such hosted database provider)

  • DeathArrow1m

    I've solved data sync in distributed apps a long time ago. I send outgoing data to /dev/null and receive incoming data from /dev/zero. This way data is always consistent. That also helps with availability and partition tolerance.