I don't necessarily think this is even breaking the HTTP standard. While '+' should not be interpreted as a space in the path portion of a URL, the HTTP spec doesn't specify or care what file that URL may map to on a server.
Edit: As mentioned below, this isn't correct since URLs should be able to be escaped and return the same resource, and an escaped + differs from an unescaped + on S3.
Exactly! The OP's point is summarized in this sentence:
> My point is that the spec requires + to be escaped only inside the querystring.
So what? What the standard mandates for query strings is irrelevant here. It's up to the server how to interpret and map the URLs. "Unconventional and unfortunate" - yes, but breaking the HTTP spec? No.
Please read the actual spec before telling people whether something is conforming to it or not. Just making stuff up is exactly how this mess gets created. The relevant section in this case:
It breaks the HTTP spec because S3 is internally decoding the URL incorrectly. This matters because anything that speaks HTTP is free to percent-encode the plus sign in a path, or not, and the resource identified should not differ. If it mapped even an escaped plus to a space, that would at least be consistent, though still questionable, behavior.
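To make the equivalence concrete, here's a small sketch in Python. The `s3_style_decode` helper is a hypothetical illustration of the behavior being described, not S3's actual code: it applies form-style decoding (plus-as-space) to the path, which is only valid for query strings.

```python
# Per RFC 3986, a literal "+" in a path and its percent-encoded form "%2B"
# identify the same octet once decoded.
from urllib.parse import unquote

assert unquote("/bucket/a%2Bb") == "/bucket/a+b"  # both forms name key "a+b"

# Hypothetical sketch of the plus-as-space decoding being described:
def s3_style_decode(path):
    return unquote(path.replace("+", " "))

assert s3_style_decode("/bucket/a+b") == "/bucket/a b"    # key "a b", not "a+b"
assert s3_style_decode("/bucket/a%2Bb") == "/bucket/a+b"  # escaped form differs
```

So the two encodings of the same key end up naming different objects, which is the canonicalization problem.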
Amazon has a difficult time with the HTTP standard sometimes. Last time I had to touch an AWS project we discovered a bug[1] in the C++ code backing a Java library (sic).
They had implemented their own HTTP client, but forgot to add the "Host" header to requests which is required by HTTP 1.1.
Interestingly this client sent requests only to their own services, which means that they either released that without testing it or the backend once accepted faulty requests.
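For reference, a minimal conforming HTTP/1.1 request is just the request line plus the mandatory Host header; this sketch assumes example.com as the target:

```python
# The smallest conforming HTTP/1.1 request. RFC 7230 requires the Host
# header; a server may reject the request with 400 if it's missing.
request = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"   # the line the client in question wasn't sending
    "Connection: close\r\n"
    "\r\n"
)
assert "Host: example.com\r\n" in request
```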
It's common for HTTP servers to accept requests without a host header. It's not usually needed by the server unless you're hardening it (I don't class it as a security issue but some security audits will flag it up if you don't force the server to reject invalid host headers) or running named virtual hosts (which is more common than it used to be thanks to SNI but you still often see a 1:1 relationship between (virtual) hosts and IPs). So Amazon could easily have tested their client on 3rd party servers and still not spotted the problem.
As an anecdote, about 15-20 years ago I wrote my own web browser. Obviously it was highly rudimentary, though browsers were much easier to implement back then anyway. I was too lazy to read the HTTP spec (it was a hobby project and I was young and impatient), so a lot of what I did was trial and error. I too wasn't sending a Host header, but it took a long while before I ran into any sites that rejected my HTTP requests. The web landscape was very different back then, though, and IPs were plentiful, but it just goes to show how servers have coded around bad clients for years.
Perhaps I don't understand the issue you're discussing but how would the client working on 3rd party services be a red flag when that is the desired behavior?
Sorry if this was unclear: It's a client that they specifically wrote to talk to their own services, and they're releasing it to their customers as an official way to talk to their own services. It could not talk to their own service.
Their own documentation refers to that library (or did at that point in time, not sure about now).
Anyone that's dealt with S3 in any capacity should be aware of this, it's literally one of the first encoding problems to come up when dealing with signing requests.
This burned me and because of it I can't host a specific static site on S3 because it requires plus signs. Can't change the files being uploaded due to the system generating them... tried to rig up some sort of Akamai rewrite rule to change it at the CDN level but couldn't get it to work.
(Update - the original title mentioned AWS has been breaking standards since 2010. The new title is fine. Thanks for updating it.)
Little bit of hyperbole in the title imo. S3 has generally been very good at embracing the fundamental principles of HTTP and REST, leaving aside corner cases like this.
tl;dr: A legacy behavior is to treat + as a space. When you've been around a long time, you need to keep backwards compatibility.
URLs and URIs have separate standards from HTTP and they have changed over time (been replaced by newer ones).
Many years ago it was common to encode a space as a + sign. For example, the PHP function urlencode[1] encodes a space as a + sign. If you're a PHP user, don't use this function unless you know you need to; rawurlencode, which uses %20, is usually what you want now.
When was + treated as space in the path part of the URL? Sure it's been treated as space in the query part, but that would be a weird breaking change if early web treated path and query the same way, and then later standards made them different.
At the time S3 launched the URL spec was RFC 1738 and we had HTML 4.01[2]. And, the URI syntax (all the way back in 1998) noted to use %20 for a space[3].
As far as I can tell, this traces its history back to encoding for forms[4]. It's been used far beyond the encoding for forms and maybe someone can explain why.
It's also not just PHP whose function is that way. In Python, urllib.urlencode encodes a space as a + (at least in 2.7.x).
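The split is still visible in Python 3's urllib.parse, as a quick sketch: urlencode (via quote_plus) produces form/query encoding with +, while quote produces the path-style encoding the URI spec describes.

```python
# Query-string encoding vs path encoding in Python 3's urllib.parse.
from urllib.parse import urlencode, quote, quote_plus

assert urlencode({"q": "a b"}) == "q=a+b"  # form/query encoding: space -> +
assert quote_plus("a b") == "a+b"          # same rule, single value
assert quote("a b") == "a%20b"             # path encoding: space -> %20
assert quote("a+b") == "a%2Bb"             # a literal "+" must be escaped here
```

In other words, + only means "space" in the form-encoded query part, which is exactly the distinction at issue in this thread.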
I remember working on the web many years ago where "+" is what was used. This may have been a spec misinterpretation or something else. In any case, it was common enough.
Note, I'm not saying it was right. Just not uncommon.
They could make it configurable on a per bucket basis (perhaps defaulting to the old behaviour if necessary; ideally you would make the conformant behaviour the default, of course).
That way you could opt in to the standard conformant behaviour if you require it, but they can still keep backward compatibility.
I'm not familiar with how S3 works in detail, but I imagine this could require additional API calls in the backend which increases the latency and resource usage of API requests. In the worst case, such a change could easily require Amazon to purchase dozens, if not hundreds of additional servers.
That would require synchronisation, potentially between multiple servers. Doing this efficiently, without race conditions could be very tricky at their size.
Should this really be considered to be a spec violation? It's a restriction, sure, but S3 is to be considered a specific application with specific constraints.
In response to the reported RFC violation, elving@AWS writes: "I agree that's unconventional and unfortunate." My corporate bullshit detector is off the scale.
In earlier times, we would have both the ability and the balls to treat that unwillingness to uphold the rules we all set out with as damage to the Internet, and route around it. But sadly, AWS has become too big to fail, so the engineers introduce special cases into their products and deploy them.
To the contrary, I think it's actually a refreshingly honest response. A "corporate bullshit" response would be to ignore it altogether, try to argue it's a feature not a bug, or give a canned statement about how we respect the environment and want the world to be a better place.
The AWS support is explicitly acknowledging it's an issue, while giving a rational reason why it probably won't be fixed (even if you disagree with the reason). The back-compat concern is unfortunate but a good argument can be made it's not in users' interests either (beyond being just a cost to AWS to implement the change).
They could, by versioning the API (e.g. add a /v2/ to all paths), but that would benefit no-one and should only be done alongside any number of much more important changes.
That opens a whole new can of worms. They have never deprecated a single part of the S3 API, ever (okay, I'm partially lying: the SOAP API is now officially deprecated in the sense that they won't add new features to it).
They're not going to change it just for the sake of some minor path issues that have a workaround. (Side note: they tend to use headers rather than paths for API declarations.) I have personally been bit by this same issue, but I would never recommend they change pieces of the service to accommodate it. It's handled easily in client code. What they could do is make an obvious "gotchas" section of documentation.
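The client-side handling amounts to percent-encoding the key before it goes into the request path, so S3 decodes it back to a literal +. A minimal sketch (the bucket name and URL layout here are made up for illustration):

```python
# Percent-encode the object key before building the URL so that a "+" in
# the key becomes %2B and can't be mistaken for a form-encoded space.
from urllib.parse import quote

def s3_object_url(bucket, key):
    # quote() leaves "/" alone but escapes "+" as %2B.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"

url = s3_object_url("my-bucket", "reports/2010+Q1.pdf")
assert url == "https://my-bucket.s3.amazonaws.com/reports/2010%2BQ1.pdf"
```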
AWS has done a tremendous job of getting things right-enough the first time. They have never killed an AWS service. They'd need a much bigger reason to version an API.
Eh, URL/URI escaping is an interesting example, because people have been doing inconsistent and sometimes standards-problematic things with it for pretty much as long as URLs have existed, and it's been a perpetual pain and problem. (For just one example, read up on `&` versus `;` as query-parameter separators, and whether `&` can/should/must be escaped in which contexts; `+` is another long-standing one.) So it's not a great example of how everyone used to be consistently standards-compliant in "earlier times"; more like a counter-example. And I don't think it's unique: my experience with the web is not that everything used to be more consistent and standards-compliant in "earlier times" than it is now. If anything, the reverse.
When were these "earlier times" for the web? Not during browser wars, that's for sure. Not when web2.0 started with crazy ideas about rest. Not during flash-everywhere era. Etc...