Last weekend, we experienced the most significant service outage in our company’s history. For nearly 24 hours, customers were unable to access Crew, disrupting your operations and violating the trust you’ve placed in our service.
We know how essential Crew is to your operations, which is why we take this incident very seriously. Whether it is a restaurant team managing last-minute schedule changes or an emergency team coordinating a response effort, Crew must work reliably. On behalf of all of us at Crew, I am deeply sorry for letting you down last weekend.
You are our most important stakeholders, and I think it’s important you know what happened and the steps we’re taking to ensure it doesn’t happen again.
What happened
A data request is sent to Crew each time a message is sent, a schedule is created, a person is invited, or any other action is performed. These requests are added to a queue in the order they are received and then “processed” (e.g. the message that was sent is delivered, the schedule that was created is distributed, etc.).
On November 10th at 9:22 AM PST, a data request exposed a bug in a third-party software library that Crew depends on. This bug prevented the system from processing new requests in the queue, effectively rendering Crew unusable for the duration of the outage. No customer data was compromised as part of this incident.
Our engineering team was notified within seconds and immediately began a comprehensive investigation. This investigation continued for 23 hours until the issue was finally resolved. Throughout that time, Crew users were still attempting to send messages and perform other actions in the app. Each of these actions added to an enormous backlog of data requests that still needed to be processed once the issue was resolved. It took Crew an additional 10 hours to process these requests. During this period, service was degraded and the app appeared to be only partially working.
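For readers who want a more concrete picture, here is a minimal sketch of how a single failing request at the front of a first-in, first-out queue can hold up everything behind it while new requests keep piling up. It is an illustration only, not Crew's actual code: the function names, the "bad" request, and the error type are all hypothetical.

```python
from collections import deque


def buggy_library_call(request):
    # Hypothetical stand-in for a third-party library that fails on one input.
    if request == "bad":
        raise ValueError("library bug triggered")
    return f"processed {request}"


def drain(queue):
    """Process requests in arrival order; stop at the first one that fails."""
    done = []
    while queue:
        request = queue[0]  # look at the oldest request without removing it
        try:
            done.append(buggy_library_call(request))
        except ValueError:
            break  # the head of the queue is stuck, so nothing behind it runs
        queue.popleft()  # remove the request only after it succeeds
    return done


queue = deque(["msg-1", "bad", "msg-2"])
processed = drain(queue)          # only "msg-1" gets through
queue.extend(["msg-3", "msg-4"])  # users keep sending while the queue is stuck
processed_again = drain(queue)    # still blocked by "bad"
backlog = len(queue)              # requests waiting once the bug is fixed
```

In this toy version, every request that arrives after the failing one simply waits, which is why the backlog keeps growing until the underlying bug is fixed.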
What we’re doing next
We are firm believers that you can’t fix something unless you put it out in the open, talk about it, and understand it fully.
Over the course of this past week, we conducted a thorough postmortem to identify the areas where we fell short in our response to this outage. While we can’t guarantee that Crew will never again experience downtime, we can commit to you that we will not repeat the mistakes that were made in this effort.
Communication
Recovery