
I think one factor people don't consider enough is the tradeoffs you need to make in order to give your app incredible reliability. This article and others talk about a bunch of work you can do to help ensure your application stays up in rare events. However, maybe for your particular product, having 99.5% uptime and offering a ton of features is going to make you more successful than 99.9% uptime would.

When you are a Google or a Twitter or an Amazon, you lose lots of money per minute of downtime, so economically speaking it makes sense for them to invest in this. However, for an average startup, I don't think having a couple of hours of downtime per month is actually going to be that big of a deal. Of course you need to ensure your data is always safe, and your machine configuration should be easy to redeploy (via AMIs or something like Puppet) so you can get back up and running in catastrophic cases. But at the end of the day, having a good "We're currently having technical issues" screen could very well be a better investment than shooting for a technical setup that stays up during catastrophic events.



Google incurred that cost early, but for Twitter the tipping point where engineering for uptime became 'worth it' came pretty late in the game.

Interestingly, AWS appears to be in the camp that isn't heavily focused on uptime either.

A pretty universal truth is that you can afford to be down more than you expect. Even when Amazon's store is down and you can measure the immense cost by the second, a total cost analysis may well show that it was a 'cheaper' event than engineering it away would have been.


I don't know about you guys, but most of the time Twitter doesn't work for me at all. I would say 50-60% of the time. I mean the page loads (v-e-r-y slowly), but the tweets don't. I almost always have to click the "reload tweets" button to see tweets. Several times.

I don't think I have ever seen such a clusterfuck of performance on any major site for such a long time.

It has come to the point that I don't even bother clicking twitter links.


Yep, there will always be a tipping point when your downtime becomes worth engineering out - this obviously varies per app. Understanding what's available to you is also important - I've met quite a few people who have no idea about some of the features available on AWS. I'm not sure if this is just a problem with AWS's docs / console, or just knowing what to look for in the first place.


Uptime is something users love regardless of the size of the company whose product they're trying to use. Investing in uptime is always worth it.


Investing in uptime is always worth it.

That simply can't be true. There is always going to be a point where an extra decimal place of reliability is too costly.


There's always a trade-off between the cost of failing and the cost of engineering it out. The problem comes from a lack of understanding about where and how apps and infrastructure fail and how to avoid it. If you misunderstand the problem, you'll probably misjudge it.


People should absolutely at least be doing some back of the envelope math on this before choosing a strategy.

If you're at N DAU, then a 12h downtime will affect a bit more than N/2 users, and some percentage of those users will become ex-users - you can run a small split test to figure out how many if you don't already have data on that. You'll also lose a direct half day of revenue. This type of thing will happen somewhere between once a year and once every couple of months, as low and high estimates.

Crunch those numbers, and you'll have an order of magnitude estimate of what downtime actually costs you, and what you can actually afford to spend to minimize it. Keep in mind that engineering and ops time costs quite a bit of money, and that you'll be slowing down other feature development by wasting time on HA.

For instance, let's say you're running a game with 1M DAU, and 5M total active users, making $10k per day (not sure if that's reasonable, but let's pretend), and you've figured out that 12h of downtime makes you lose approximately 10% of the users that log in during that period. In that case, 12h of downtime costs you a one-time "fee" of $5k, and also pushes away ~1% of your total users, which will cost you $100 per day as an ongoing "cost".

If we assume this happens exactly once, and that a mitigation strategy would work with 100% effectiveness, then you should be willing to spend up to $100 extra per day to implement that strategy; the $5k up-front loss is not nothing, but we can probably assume it'll get eaten up by engineering time to implement that strategy. If such a strategy would cost significantly more than $100 per day over your current costs, then by pursuing it you're assuming that "oh shit it's all gone to hell!" AWS events are likely to affect you multiple times over the period in question.
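
To put that arithmetic in one place, here's a rough sketch in Python (all of the numbers are the made-up ones from above, not real data, and the model is only as sound as the reasoning it comes from):

    # Back-of-envelope downtime cost model (hypothetical numbers from above).
    DAU = 1_000_000            # daily active users
    total_active = 5_000_000   # total active users
    revenue_per_day = 10_000   # dollars per day
    outage_hours = 12
    churn_rate = 0.10          # fraction of affected users who leave

    # Roughly half the DAU see a 12-hour outage.
    affected_users = DAU * (outage_hours / 24)
    lost_users = affected_users * churn_rate                # ~50,000 users
    one_time_loss = revenue_per_day * (outage_hours / 24)   # ~$5,000 of revenue

    # Ongoing loss, assuming revenue scales with the active user base.
    ongoing_loss_per_day = revenue_per_day * (lost_users / total_active)  # ~$100/day

    print(f"one-time loss: ${one_time_loss:,.0f}")
    print(f"ongoing loss:  ${ongoing_loss_per_day:,.0f}/day")
    # If a mitigation strategy costs much more than ~$100/day over what you
    # spend now, you're implicitly betting on multiple such outages.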

I'm not saying these numbers are realistic in any way, or that the method I've shown is 100% sound (I'm on an iPhone, so I haven't edited or reread any of it); I'm just saying that whether you pursue a mitigation strategy or not, it's not terribly difficult to ground your decision in numbers. They do tend to be right on the edge of reasonable for a lot of people, so it's worth thinking about them (good) or, better, measuring them.


I agree with the first sentence, uptime is a very nice thing that users will notice and appreciate over time.

However, I strongly disagree with the second sentence. Investing in uptime is not always worth it. Taken to its logical extreme, imagine 2 potential websites. One of them is incredibly useful but only up 80% of the time. The other one is a blank HTML page, but it is the most reliable website in history, with 0 seconds of downtime in the past 10 years. If I surveyed users of both websites, I think the preference for the useful website that is only up sometimes would be almost unanimous.

Startups have limited time and resources, and in practice getting 99% uptime is relatively easy, whereas 99.9% uptime is relatively hard. That is a difference of ~7 hours of downtime per month. Yes, it sucks when your website is down, but it also sucks when there are features you can't build because you don't have the time, or your technical infrastructure won't allow it, since you spent both chasing ultra-high reliability. Obviously this depends on your industry - e.g. if you are a payment processor you had better have super high uptime or you aren't going to have any customers - but realistically most companies will not lose that many customers if they are up >99% of the time.
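
For reference, a quick sketch of how much downtime each uptime target actually allows (assuming a ~730-hour month):

    # Allowed downtime per month at various uptime targets (730-hour month).
    for uptime in (0.99, 0.995, 0.999, 0.9999):
        hours_down = 730 * (1 - uptime)
        print(f"{uptime:.2%} uptime -> {hours_down:.1f} hours of downtime/month")
    # 99.00% -> 7.3 h/month, 99.90% -> 0.7 h/month, 99.99% -> 0.1 h/month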


There are also risks inherent in a more complicated system.

You can engineer a more complicated system with the goal of avoiding downtime, but this added complexity may end up with unexpected corner-cases and cause a net decrease in uptime, at least in the short term.

It's often better to concentrate on improving mean time to repair (MTTR).



