Crew Outage: What we're doing next

Written by Broc Miramontes | November 21, 2018

Last weekend, we experienced the most significant service outage in our company’s history. For nearly 24 hours, customers were unable to access Crew, disrupting your operations and violating the trust you’ve placed in our service.

We know how essential Crew is to your operations, which is why we take this incident very seriously. Whether it is a restaurant team managing last-minute schedule changes or an emergency team coordinating a response effort, Crew must work reliably. On behalf of all of us at Crew, I am deeply sorry for letting you down last week.

You are our most important stakeholders, and I think it’s important you know what happened and the steps we’re taking to ensure it doesn’t happen again.

What happened
A data request is sent to Crew each time a message is sent, a schedule is created, a person is invited, or any other action is performed. These requests are added to a queue in the order they are received and then “processed” (e.g. the message that was sent is delivered, the schedule that was created is distributed, etc.).
On November 10th at 9:22 AM PST, a data request exposed a bug in a third-party software library that Crew depends on. This bug prevented the system from processing new requests in the queue, effectively rendering Crew unusable for the duration of the outage. No customer data was compromised as part of this incident.
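
For readers who want a more concrete picture, here is a minimal sketch of how a single request can stall an in-order queue. It is an illustration only, not Crew's actual code: the function names, the "trigger" request, and the retry limit are all hypothetical.

```python
from collections import deque


def buggy_library_call(request):
    """Hypothetical stand-in for the third-party library; it fails on one input."""
    if request == "trigger":
        raise ValueError("library bug")
    return f"processed {request}"


def drain(queue, max_attempts=3):
    """Process requests in order; a request that keeps failing stays at the head."""
    done = []
    attempts = 0
    while queue and attempts < max_attempts:
        try:
            done.append(buggy_library_call(queue[0]))
            queue.popleft()  # only remove a request once it has succeeded
            attempts = 0
        except ValueError:
            attempts += 1  # the head request is retried, blocking everything behind it
    return done


queue = deque(["message", "trigger", "schedule", "invite"])
processed = drain(queue)
```

In this sketch only the first request is processed. The "trigger" request stays at the front of the queue, and the schedule and invite behind it wait even though nothing is wrong with them.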

Our engineering team was notified within seconds and immediately began a comprehensive investigation. This investigation continued for 23 hours until the issue was finally resolved. At the same time, Crew users were attempting to send messages and perform other actions in the app. Each of these actions contributed to an enormous backlog of data requests that still needed to be processed. It took Crew an additional 10 hours to process these requests. During this time, the service was degraded and appeared to users to be only partially working.

What we’re doing next
We are firm believers that you can’t fix something unless you put it out in the open, talk about it, and understand it fully.

Over the course of this past week, we conducted a thorough postmortem to identify the areas where we fell down in our response to this outage. While we can’t guarantee that Crew will never again experience downtime, we can commit to you that we will not repeat the mistakes that were made in this effort.

Communication

Where we failed: Crew’s primary method of communicating with its users is the Crew app. An outage of this magnitude blocked us from using that channel in real time and left our users in the dark.
What we are doing: We will soon be deploying a status page to communicate current service levels across Crew’s product suite. We’re also exploring strategies to guide users to more information in-app and allow them to opt in to more proactive communications around service outages via email, SMS, and other channels.

Investigation Time

Where we failed: Simply put, this problem took too long to solve. Our on-call team did not have the resources available to diagnose the problem quickly.
What we are doing: We are making historical information about previous issues and their resolutions more accessible to our on-call engineering team, and we are establishing concrete escalation paths so that urgent issues get the resources they need faster.

Recovery

Where we failed: Once the issue was resolved, we did not have the right strategies in place to bring the systems back online in a timely manner. Ten hours is unacceptable.
What we are doing: The hard part of speeding up recovery time is doing so in a way that does not put customer data at risk, and the only way to do this properly is to test. To this end, we are establishing a regular cadence of testing faster data recovery strategies so that if a worst case scenario ever does happen again, we are ready for it.

I can’t emphasize enough how important it is to us that we provide you with a service that is both trustworthy and reliable. I know it takes action to rebuild trust, and I hope that in the coming weeks and months you see that trust restored.

If you have any questions or feedback in the meantime, please don’t hesitate to write in to support@crewapp.com or message Crew Support directly in the Crew app.

Thank you,
Broc
