Amazon Simple Storage Service (S3) is a web service providing online storage. This cloud computing solution has been used successfully by many organizations. However, it has experienced some problems, including failing for much of the day on July 20th.
Amazon S3 Availability Event
We’ve now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect. We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers’ objects. However, we didn’t have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn’t detect it and it spread throughout the system causing the symptoms described above. We hadn’t encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
During our post-mortem analysis we’ve spent quite a bit of time evaluating what happened, how quickly we were able to respond and recover, and what we could do to prevent other unusual circumstances like this from having system-wide impacts. Here are the actions that we’re taking: (a) we’ve deployed several changes to Amazon S3 that significantly reduce the amount of time required to completely restore system-wide state and restart customer request processing; (b) we’ve deployed a change to how Amazon S3 gossips about failed servers that reduces the amount of gossip and helps prevent the behavior we experienced on Sunday; (c) we’ve added additional monitoring and alarming of gossip rates and failures; and, (d) we’re adding checksums to proactively detect corruption of system state messages so we can log any such messages and then reject them.
Finally, we want you to know that we are passionate about providing the best storage service at the best price so that you can spend more time thinking about your business rather than having to focus on building scalable, reliable infrastructure. Though we’re proud of our operational performance in operating Amazon S3 for almost 2.5 years, we know that any downtime is unacceptable and we won’t be satisfied until performance is statistically indistinguishable from perfect.
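To make action (d) concrete, here is a minimal sketch of how a checksum lets a receiver reject a state message with a single flipped bit. This is my own illustration, not Amazon's implementation: the message layout (hex MD5 digest, a newline, then a JSON payload) and the names `wrap`, `unwrap`, and `flip_bit` are all assumptions made up for the example.

```python
import hashlib
import json


def wrap(state: dict) -> bytes:
    """Serialize a state message and prepend the hex MD5 digest of its payload.

    The digest-newline-payload layout is a hypothetical format for this sketch.
    """
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload


def unwrap(message: bytes) -> dict:
    """Verify the digest before trusting the payload; reject on any mismatch."""
    digest, _, payload = message.partition(b"\n")
    if hashlib.md5(payload).hexdigest().encode("ascii") != digest:
        # A real system would presumably also log the rejected message here.
        raise ValueError("state message failed checksum; rejecting")
    return json.loads(payload)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate in-transit corruption by flipping exactly one bit."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Hypothetical state message about one server.
message = wrap({"server": "node-17", "status": "healthy"})
assert unwrap(message)["status"] == "healthy"

try:
    unwrap(flip_bit(message, 300))  # bit 300 falls inside the payload
except ValueError as err:
    print(err)
```

Without the check, a flipped bit that still leaves the message intelligible would be accepted and passed along, which is the behavior the statement describes. With it, the corrupted message is dropped at the first server that receives it.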
The failure was significant, but in my view the advantages of Amazon S3 are still very significant. A huge advantage is how quickly you can scale if need be. If your application is not hosted on Amazon S3 and it grows enormously, you have to physically deal with buying servers, installing them, installing software… All this takes time. On Amazon S3, when you need the bandwidth you can get it; when you don’t need it, you don’t have it sitting around unused. In that way it is very lean, it seems to me.
And while server infrastructure failures are bad, for most organizations the choice is not between Amazon S3 and some solution that is 100% reliable. Currently it is difficult to keep IT infrastructure online, operating, and coping with shifting demand… For many situations Amazon S3 seems to be a great resource. They need to keep improving, and they seem to be doing so. Being open and honest about the challenges is a good sign. And improving the system, rather than blaming a person, is another good sign.
Related: Bezos on the Internet Boom – Amazon’s Amazing Achievement – Bezos on Lean Thinking – CERN Pressure Test Failure – 12 Stocks for 10 Years Update (June 2008), Amazon is up 116% in the portfolio since 2005, just behind Google and ahead of Petro China
Keeping Good Employees
Understanding Why Good Workers Quit
Good advice. I like direct, simple questions. What can we do to keep you? What do you enjoy about your job? What do you dislike? What can I do to increase your joy in work? What one thing would you most like to see changed? What do you want to see continue? Would you like help in some aspect of your career development? What can I do better? Am I providing too much oversight, or not enough?
Give honest, straightforward answers to questions. If someone wants to move ahead and needs to work harder to advance their career, tell them that. If they need to be more cooperative, develop certain skills… tell them. The idea is not just to make the person happy in that meeting. If they need to work on certain things to get where they want to be, then help them do that. Give your best advice and say what they can do to improve.
Related: People are Our Most Important Asset – What 1 Thing Can We Improve? – IT Talent Shortage, or Management Failure? – Silicon Valley Style Hiring – How to Improve – Respect for People, Understanding Psychology – The Joy of Work