While RapGenius is right to point out that Heroku was incorrect in describing how their load balancing works, it seems off to blame that exclusively for their performance problems. Plenty of high-traffic sites not on Heroku operate just fine using nginx's upstream load balancing, which is simple round-robin balancing that pays no attention to how many requests a backend is currently handling.
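For reference, nginx's upstream module does exactly this by default: with no balancing directive, requests rotate round-robin across backends regardless of how busy each one is. A minimal config (hostnames hypothetical):

```nginx
upstream app_backends {
    # No balancing directive: plain round-robin, with no notion of
    # which backend is currently busy.
    server app1.example.com:8080;
    server app2.example.com:8080;
    server app3.example.com:8080;
}

server {
    listen 80;
    location / {
        proxy_pass http://app_backends;
    }
}
```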
What seems particularly odd is that Rap Genius appears to be using their Heroku dynos to serve their CSS, JavaScript, and certain images. So loading the homepage appears to make about 12 requests that hit their dynos, rather than 1.
If you were looking for low-hanging fruit to reduce load on the dynos for a high-traffic site, this seems like an obvious place to start, rather than jumping straight to a 'smart' load-balancing solution.
> It seems off to blame this exclusively for their performance problems
There are definitely ways to improve the performance, but you can't measure them, since Heroku gives you no way of determining how much time requests spend in the in-dyno queue. Though you can modify your app to get New Relic to display this info: http://rapgenius.com/1506509
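For what it's worth, the measurement itself is simple: the Heroku router stamps each request with an X-Request-Start header on ingress, so app-side middleware can log the gap between that stamp and the moment a worker actually picks the request up. A WSGI-flavored sketch of the idea (the actual Rails/New Relic wiring is in the linked post; header units vary by stack, milliseconds assumed here):

```python
import time

def queue_time_middleware(app):
    # Wrap a WSGI app so each request records how long it sat queued
    # between the router's timestamp and the worker picking it up.
    def wrapped(environ, start_response):
        stamp = environ.get("HTTP_X_REQUEST_START")
        if stamp:
            # Header may look like "t=1360000000000"; strip any prefix.
            queued_ms = time.time() * 1000.0 - float(stamp.lstrip("t="))
            environ["app.queue_time_ms"] = max(queued_ms, 0.0)
        return app(environ, start_response)
    return wrapped
```

The recorded value can then be reported to whatever metrics service you use; the point is that the router-side queue is invisible unless you compute it yourself from the timestamp.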
> What seems particularly odd is that Rap Genius appears to be using their Heroku dynos to serve their CSS, JavaScript, and certain images. So loading the homepage appears to make about 12 requests that hit their dynos, rather than 1.
These assets are cached in Varnish; we serve them without hitting dynos at all.
With the Bamboo stack, there's a Varnish in front of your servers that you can use to cache responses. So it's part of Heroku. Cedar no longer has Varnish.
Click on the yellow/green bits. Rapgenius is a lyrics site that allows people to give context to lyrical meanings inline. See alexdevkar's comment below.
It works terribly for responses to blog posts, so as someone who doesn't use it already and whose first exposure to it is this Heroku issue, now I think "Rap Genius works terribly".
I think UX success is not measured in how many people you've already inflicted a bad design on, but instead by how many you avoid hurting in the first place.
Why not make the rap explanations look different from the style lots of sites use for regular links? (Perhaps a big square around the section or something similar.)
There are green badges next to their added comments, and a big bold part at the top that says "Click the green links below to see our responses", though I'm not sure whether it was there from the start or was added after they saw these reactions.
edit: I agree with you on the UX part. Just wanted to point out that there were some context clues.
A system appropriate for annotating music is not necessarily appropriate for blogging. In this case, Rap Genius is a horrifically bad blogging platform.
I noticed that on the original blog post as well. Very clever. Though it seems that on the blog posts some people maybe don't understand what Rap Genius does and end up saying things like 'WHAT ARE THESE BOXES WHY AM I HERE WHY DO I NEED TO FILL THIS OUT???' in the context notes. But clever nonetheless.
Not sure if this is called for. Heroku has a performance issue and their documentation had a mistake. They accepted everything, apologized and are working to resolve it. What am I missing here?
There are two ways to look at this, and depending on your point of view you might be upset. The root of the dispute is how they scale and how that affects latency.
According to these write-ups, Heroku scales performance by doing dynamic scheduling on an array of identical servers (called 'dynos'). The documentation talks about a feature named "Intelligent Routing" which only sends work to a dyno which is available to do work.
That is a pretty ideal setup, because in practice it means you get linear scaling by adding server instances, and since costs are based on total server instances, scale and cost increase linearly together.
However, there is a very classical problem, first noted by Gene Amdahl, about how the serial portion of a workload (here, the cost of deciding how best to dispatch each request in a stream) limits the speedup you can get from adding parallel capacity. It became known as "Amdahl's Law"[1]. It limits the practical scalability of a lot of systems.
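Concretely, Amdahl's law says that if a fraction P of the work can be spread across N workers while the remaining (1 - P) stays serial, the overall speedup is 1 / ((1 - P) + P/N), which can never exceed 1/(1 - P) no matter how many workers you add. A quick sketch (numbers are illustrative):

```python
# Amdahl's law: a workload where a fraction p parallelizes perfectly
# across n workers, and the remaining (1 - p) must run serially.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# With a 5% serial fraction, adding workers hits a hard ceiling of 20x:
for n in (10, 100, 10_000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

The serial fraction here is the routing decision itself: however many dynos you add, every request still passes through it.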
So at some point Heroku got big enough that the cost of figuring out which server instance wasn't busy was taking "too long" (that cost is the (1 - P) part in the Amdahl equation), so they decided to reduce the cost of making the choice by replacing a "data driven choice" with a "statistics driven choice". This too is actually a pretty well known way of doing things (Google and Blekko use it to send search queries to a bunch of waiting backends). But unlike the 'idealized' case, where every server instance handles at most one queue, what you get is a probability that the server instance is either not busy, or that its current transaction will finish quickly. This works well for systems where the cost of every transaction is nearly identical (you just add servers until the 90th percentile of requests hits your target), but poorly for systems where each request involves a variable amount of work.
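To see why the switch hurts workloads with variable request costs, here's a toy simulation (all numbers invented) of single-threaded dynos under random assignment versus routing each request to the dyno that will free up first:

```python
import random

random.seed(1)
DYNOS = 50
# (arrival time, service time): 10% of requests are 20x slower.
reqs = [(i * 1.0, random.choice([5.0] * 9 + [100.0])) for i in range(5000)]

def simulate(pick):
    free_at = [0.0] * DYNOS            # when each dyno next becomes idle
    waits = []
    for arrival, service in reqs:
        d = pick(free_at)
        start = max(arrival, free_at[d])
        waits.append(start - arrival)  # time spent queued behind the dyno
        free_at[d] = start + service
    waits.sort()
    return waits[len(waits) * 95 // 100]  # 95th-percentile queue time

p95_random = simulate(lambda free_at: random.randrange(DYNOS))
p95_smart = simulate(lambda free_at: free_at.index(min(free_at)))
```

With uniform service times the two strategies behave almost identically; it's the variance that makes random assignment queue fast requests behind slow ones, which is exactly the behavior Rap Genius complained about.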
I spent a number of years studying these sorts of systems while solving scalability issues at Network Appliance. A file system built out of a distributed set of nodes providing access to a single file system image needs to know a priori the cost of each transaction flowing through it in order to optimize scheduling. Similarly, RAID subsystems need to know which disk I/Os are going to land in the cache or on the disk, and if they land on the disk, whether they will result in a seek. You end up with a directed graph of weighted probabilities being shoved through a channel of fixed bandwidth. It's all amazingly fun until someone says "I have to get data back from the disk in no more than 10ms every time" (databases would say stuff like that), and then, as they say, you start trading dollars for milliseconds.
So Heroku changed their algorithm, didn't tell anyone, and the systemic behavior changed in a very user-visible way for large users (in this case Rap Genius). The folks at Rap Genius were pissed off that Heroku made this change without informing them, and Rap Genius looked bad to their customers because of it. Nobody in operations wants to say "Uh, I don't know why your experience with our service is currently sucking."
I can see why Rap Genius is mad, and I can see that Heroku might not have fully thought through the ramifications of their algorithm change.
I think the explanation for the change is a bit simpler than that. Someone can correct me if I've misunderstood, but AFAIK it isn't that "Intelligent Routing" was too expensive, but that it depended on simplifying assumptions that stopped being true.
Originally, Heroku was only for Ruby, and it depended on the assumption that a server could only handle one request at a time. All the talk of Intelligent Routing seems to date from this time, so it appears that Intelligent Routing just meant they never routed more than one request at a time to a server. But then Heroku wanted to add support for things like Java and Node.js, which can support multiple requests per server. This meant the simplifying assumption of "1 dyno = 1 request" baked into the old routing algorithm was no longer valid, so they had to switch to something else or they'd be crippling pretty much everything but Ruby.
That would make sense if Heroku didn't know which requests were Ruby and which weren't. But it seems like they did, if only as a set of VIPs (virtual IPs) landing on the router tagged 'for Ruby' or 'not Ruby', which could pick the appropriate routing algorithm.
Understand that keeping millisecond-accurate state on 10 machines is doable; on 100 machines it's hard; and on a few thousand machines it really starts to break down. One way I've seen it done is that on ingress a request is wrapped in a message for server X, which is taken out of the 'free' pool, and then when server X returns the answer back through the router it gets added back into the free pool. But the next-order effect is that list insertion/removal has different sorts of behaviors: if you push frees onto the end of the list and pop them from the front (a round-robin approach), you get good distribution but sometimes send things 'far' away when they could be served locally. If you push and pop from the front, you get some really hot servers and some really cold servers. Early on, Google played some games designed to maximize the use of available network backbone bandwidth (it's always oversubscribed from the server to the 'net). Like any of the more interesting problems, it starts off easy and then gets harder and harder.
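The FIFO-versus-LIFO behavior of the free pool is easy to see in a toy model (8 hypothetical servers, responses returned instantly, which real routers with in-flight requests obviously don't do):

```python
from collections import deque

def distribute(n_servers, n_requests, fifo):
    # Free-pool router sketch: check a server out of the free list on
    # ingress, check it back in when the response comes back (instantly,
    # in this toy). Count how many requests each server handles.
    free = deque(range(n_servers))
    counts = [0] * n_servers
    for _ in range(n_requests):
        s = free.popleft() if fifo else free.pop()
        counts[s] += 1
        free.append(s)
    return counts

# FIFO (take from the front, return to the back) degenerates to
# round-robin: every server handles exactly 1/8 of the traffic.
even = distribute(8, 800, fifo=True)

# LIFO (take back the server you just returned) hammers one hot server
# while the rest sit cold.
hot = distribute(8, 800, fifo=False)
```

Once responses take nonzero and variable time, both policies get messier, but the basic hot/cold-versus-even tension stays the same.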
In a later post, Rap Genius quotes Adam Wiggins, the CTO of Heroku, with this bit:
" There are a lot of reasons, but the two big ones are 1) the "intelligent routing" doesn't scale, since it relies on distribution locking which effectively destroys parallelism, and 2) it's incompatible with the evented and realtime apps which are increasingly common on the modern web. "Intelligent routing" sounds good, but in the end it wasn't good for our customers."
So, in short they ran into Amdahl's law and changed the way they do things.
The fact that they apologized and accepted the blame does not change the fact that they knowingly degraded performance of their oldest and most loyal customers and forced them to pay much higher costs for years.
1) Understates the problem – It affects not only Bamboo, but all thin (and other non-concurrent) web servers on Cedar. And since thin is the default on Cedar, the problem affects all apps on Cedar by default.
2) Understates how long they knew about it – I notified Heroku about the problem, INCLUDING the simulation results, 3 days before the blog post came out, and yet the apology claims they didn't know about the problem until seeing the blog post.
Transparency is definitely called for. Especially for a use case that's getting a lot of attention. Heroku's lack of transparency is what caused this problem in the first place.
But that's exactly what their product is trying to do! They are saying you can use their tech to comment on and explain anything, not just song lyrics. I like that they're eating their own dog food. (I can't say the same for their color choices.)
Isn't this just a fancy way of embedding footnotes in text inline rather than at the bottom of the page? I don't see this catching on as some sort of new blogging platform.
What does that mean? Look, I love Rap Genius for the lyrics, and I am as much a fan of eating your own dog food as anyone else. But hopefully they are going to come full circle with the philosophy and also listen to user feedback. This user is very confused.
It's definitely confusing – we've got a long way to go before Rap Genius is the perfect platform for non-musical textual analysis! (But I still think this is a good way to present our reply, since I want to comment on 2 of Heroku's specific claims.)
More like, stay away if you are running an app that has limited concurrency per worker dyno (either because your framework is single-threaded, or because your workers are CPU-bound) and you need predictable low latency.
Bear in mind that it's not like Heroku does anything particularly terrible in that use case. As far as I can see, it works about as well as any standard round-robin load balancer would. It's just that if those are your requirements, you have a problem that Heroku can't magically solve for you.
Nah. Heroku is really nice for rapid development. Some people think Heroku sucks for production, but IMHO there were more outages back in the day... and I assumed it was 'common knowledge' that scaling up to 99 dynos was a big no-no... that's a lot of VC cheddar...
More like "don't say Rap Genius told you about a problem Yesterday when they actually told you about it last week and you refused to respond until they took the issue public"
The most amazing part of this situation, to me at least, is that a site devoted to explaining rap lyrics is lucrative enough to pay $20k/month in hosting. Imagine if they actually had a product!
But then you find that apparently some very smart people put (other people's) money into the business, and you're left scratching your head: "What am I missing?"
Not sure how this works but...
Is the actual response in the green highlights?
Are all the green highlights from the same person? I think they are.
But then, who made the yellow highlights?
I don't think this was necessary. The first post by RapGenius was great and was a really good way to point out an issue with a commonly used product. However, now that Heroku has come to them hat in hand and promised some sort of resolution, this seems unnecessarily petty. Especially the one line about the date of the report; that's just pedantic.
At this point in the issue RG and Heroku should be communicating privately, not via blog post.
> I’m convinced that the best path forward is for one of your developers to work closely with [redacted] to modernize and optimize your web stack. If you invest this time I think it’s very likely you’ll end up with an app that performs the way you want it to at a price within your budget
So... Heroku's CTO acknowledged there was a problem with their stack, and offered to help RapGenius modernize and optimize it. And RapGenius quibbles on the word "yesterday"?
While I do appreciate RapGenius raising the issue publicly to bring better accountability to Heroku, their response to Heroku's response ought to have been along the lines of "Hurray! Everyone's boat is rising with the tide." Not this.
> Their response to Heroku's response ought to have been along the lines of: "Hurray! Everyone's boat is rising with the tide." Not this.
If Heroku actually knew about the problem for a long time and yet didn't officially respond or apologize until we published that post, doesn't that make you take their apology less seriously?
Not really, no. They offered to help you solve your problem. You received a response from the CTO, one that is favorable to your position. How many other CTOs would have done that? What more did you expect from them?