• Encrypt-Keeper@lemmy.world
    14 hours ago

    It was a DNS issue with DynamoDB; the load balancer issue was a knock-on effect after the DNS issue was resolved. But the problem is that it was a ~15 hour outage, and a big reason for that was the massive load in that region. Signal could very well have had their infrastructure in more than one availability zone, but since the outage affected the entire region, they would still have been screwed.

    You’re right that this can be somewhat mitigated by having infrastructure in multiple regions, but if they don’t, the reason is cost. Multi-region redundancy costs an arm and a leg. You can get the same redundancy with colo DCs for a fraction of the cost, and when you do fix the root issue, you won’t then have your load balancers fail on you because, in addition to your own systems, half the internet is trying to push its backlog of traffic through at once.

    • sugar_in_your_tea@sh.itjust.works
      11 hours ago

      Multi-region redundancy costs an arm and a leg

      Yes, if you buy an off the shelf solution, it’ll be expensive.

      I’m suggesting treating VPS instances like you would a colo setup: let cloud providers manage the hardware, and keep the load balancing in house. For Signal, this can be as simple as client-side latency/load checks (see the sketch below). You can still colo in locations with heavier load; that’s how some Linux distros handle repo mirrors, and it works well. Signal’s data needs should be low enough that simple DB replicas are sufficient.
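
      As a rough illustration of the client-side check idea, here is a minimal sketch in TypeScript of a client picking the lowest-latency healthy endpoint from a list of regions. The hostnames, the /health path, and the timeout are invented for the example; they are not anything Signal actually exposes.

      ```typescript
      // Hypothetical regional endpoints -- not Signal's real hostnames.
      const ENDPOINTS = [
        "https://us-east.example-chat.org",
        "https://eu-west.example-chat.org",
        "https://ap-south.example-chat.org",
      ];

      // Probe one endpoint with a lightweight request and return its latency in ms,
      // or Infinity if it is unhealthy, unreachable, or too slow.
      async function probe(url: string, timeoutMs = 2000): Promise<number> {
        const started = Date.now();
        const controller = new AbortController();
        const timer = setTimeout(() => controller.abort(), timeoutMs);
        try {
          const res = await fetch(`${url}/health`, { signal: controller.signal });
          if (!res.ok) return Infinity;
          return Date.now() - started;
        } catch {
          return Infinity; // timeout or network error
        } finally {
          clearTimeout(timer);
        }
      }

      // Probe all regions in parallel and pick the fastest healthy one.
      async function pickEndpoint(): Promise<string> {
        const latencies = await Promise.all(ENDPOINTS.map((url) => probe(url)));
        const ranked = ENDPOINTS
          .map((url, i) => ({ url, latency: latencies[i] }))
          .filter((e) => e.latency !== Infinity)
          .sort((a, b) => a.latency - b.latency);
        if (ranked.length === 0) throw new Error("no region reachable");
        return ranked[0].url;
      }
      ```

      A real client would cache the chosen endpoint, re-probe periodically, and fall back to the next region when requests start failing, which is what gets you multi-region behavior without paying for a managed multi-region load balancer.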