S3: Plus sign is interpreted as space in the path part of URLs (amazon.com)
97 points by ysh7 on Oct 4, 2017 | hide | past | favorite | 46 comments


I don't necessarily think this is even breaking the HTTP standard. While '+' should not be interpreted as a space in the path part of a URL, the HTTP spec doesn't specify / care what file that URL may map to on a server.

Edit: As mentioned below, this isn't correct since URLs should be able to be escaped and return the same resource, and an escaped + differs from an unescaped + on S3.


Sure but /%2B should resolve to the same thing as /+


Ah, fair enough, that's a good point.


Exactly! The OP's point is summarized in this sentence:

> My point is that the spec requires + to be escaped only inside the querystring.

So what? What the standard mandates for query strings is irrelevant here. It's up to the server how to interpret and map the URLs. "Unconventional and unfortunate" - yes, but breaking the HTTP spec? No.


Please read the actual spec before telling people whether something conforms to it or not. Just making stuff up is exactly how this mess is created. The relevant section in this case:

https://tools.ietf.org/html/rfc3986#section-6.2.2.2


It breaks the HTTP spec because it internally is decoding the URL wrong. This is important because things that speak HTTP are free to choose to percent encode, or not, the plus sign in a path, and the canonical URL should not differ. If it mapped even an escaped plus to a space, it'd be consistent, though still questionable, behavior.
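The distinction can be sketched in Python's standard library, assuming nothing beyond `urllib.parse`: percent-decoding a path treats an escaped and a literal plus identically, while the '+'-means-space rule only belongs to form-encoded query strings.

```python
from urllib.parse import unquote, unquote_plus, parse_qs

# In a path, percent-decoding is the only decoding step:
# an escaped plus and a literal plus name the same resource.
assert unquote("/a%2Bb") == "/a+b"
assert unquote("/a+b") == "/a+b"   # '+' is NOT a space in a path

# The '+'-means-space rule comes from form encoding and applies
# only to query strings:
assert unquote_plus("a+b") == "a b"
assert parse_qs("q=a+b")["q"] == ["a b"]
```

S3's behavior corresponds to applying `unquote_plus` to the path, which is why `/a+b` and `/a%2Bb` end up naming different objects.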


Amazon has a difficult time with the HTTP standard sometimes. Last time I had to touch an AWS project we discovered a bug[1] in the C++ code backing a Java library (sic).

They had implemented their own HTTP client, but forgot to add the "Host" header to requests which is required by HTTP 1.1.
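For reference, here's what a minimal conforming request looks like, sketched as a small Python helper (the function name is illustrative, not from the Kinesis library):

```python
# A minimal HTTP/1.1 request as raw bytes. HTTP/1.1 requires the
# Host header; a strict server should answer 400 when it is missing,
# though many servers have historically been lenient.
def build_request(host: str, path: str = "/") -> bytes:
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",        # the header the buggy client omitted
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")

req = build_request("example.com")
```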

Interestingly this client sent requests only to their own services, which means that they either released that without testing it or the backend once accepted faulty requests.

[1]: https://github.com/awslabs/amazon-kinesis-producer/issues/61


It's common for HTTP servers to accept requests without a Host header. The server doesn't usually need it unless you're hardening it (I don't class this as a security issue, but some security audits will flag it if you don't force the server to reject invalid Host headers) or running name-based virtual hosts (which is more common than it used to be thanks to SNI, but you still often see a 1:1 relationship between (virtual) hosts and IPs). So Amazon could easily have tested their client against third-party servers and still not spotted the problem.

As an anecdote, about 15-20 years ago I wrote my own web browser. It was obviously something highly rudimentary, although browsers were much easier to implement back then anyway. I was too lazy to read the HTTP spec (it was a hobby project and I was young and impatient), so a lot of what I did was trial and error. I too wasn't sending a Host header, but it took a long while before I ran into any sites that rejected my HTTP requests. The web landscape was very different back then, though, and IPs were plentiful; it just goes to show how servers have coded around bad clients for years.


> So Amazon could easily have tested their client on 3rd party servers and still not spotted the problem

This would still be a red flag, as the service in question is their instance metadata service that provides authentication tokens.

Something that important should be integration-tested with the actual service.


> This would still be a red flag,

Perhaps I don't understand the issue you're discussing but how would the client working on 3rd party services be a red flag when that is the desired behavior?


Sorry if this was unclear: It's a client that they specifically wrote to talk to their own services, and they're releasing it to their customers as an official way to talk to their own services. It could not talk to their own service.

Their own documentation refers to that library (or did at that point in time, not sure about now).


Ahh I did misunderstand you then. Sorry. Yeah that does sound bad.


Does anyone know if this behavior persists when the bucket is served as a website?


Anyone that's dealt with S3 in any capacity should be aware of this, it's literally one of the first encoding problems to come up when dealing with signing requests.

@dang can you please add (2010) to the title?


Funnily enough, "URLs and plus signs" is still my most upvoted question on Stack Overflow ( https://stackoverflow.com/questions/1005676/urls-and-plus-si... ) -- same a+b example too. 7 years later, it seems even the big names have issues with this.


This burned me, and because of it I can't host a specific static site on S3, because it requires plus signs. I can't change the files being uploaded due to the system generating them. I tried to rig up some sort of Akamai rewrite rule to change it at the CDN level, but couldn't get it to work.


(Update - the original title mentioned AWS has been breaking standards since 2010. The new title is fine. Thanks for updating it.)

Little bit of hyperbole in the title imo. S3 has generally been very good at embracing the fundamental principles of HTTP and REST, leaving aside corner cases like this.


I don't see hyperbole. The title is technically correct (the best kind of correct), although questionable from a grammar standpoint.


I changed the "+" into its escape code (%2B) :-) It helped.


tl;dr: A legacy behavior is to treat + as a space. When you've been around a while, you need to keep backwards compatibility.

URLs and URIs have separate standards from HTTP and they have changed over time (been replaced by newer ones).

Many years ago it was common to encode a space as a + sign. For example, the PHP function urlencode[1] still encodes a space as a + sign. If you're a PHP user, don't use this function unless you know you need to. There are better functions now.

[1] http://php.net/manual/en/function.urlencode.php
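The same split exists in Python's standard library, which makes for an easy sketch of the two encoding styles (quote_plus behaves like PHP's urlencode, quote like rawurlencode):

```python
from urllib.parse import quote, quote_plus

# Form-style encoding: space becomes '+', so a literal '+' must
# be escaped to %2B to stay unambiguous.
assert quote_plus("a b+c") == "a+b%2Bc"

# RFC 3986-style encoding: space becomes %20, and '+' is also
# percent-encoded; nothing in the result means "space" except %20.
assert quote("a b+c") == "a%20b%2Bc"
```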


When was + treated as space in the path part of the URL? Sure it's been treated as space in the query part, but that would be a weird breaking change if early web treated path and query the same way, and then later standards made them different.


At the time S3 launched, the URL spec was RFC 1738[1] and we had HTML 4.01[2]. And the URI syntax spec (all the way back in 1998) said to use %20 for a space[3].

As far as I can tell, this traces its history back to encoding for forms[4]. It's been used far beyond the encoding for forms and maybe someone can explain why.

It's also not just PHP whose function works that way. In Python, urlencode also encodes a space as a + (at least in 2.7.x).

I remember working on the web many years ago where "+" is what was used. This may have been a spec misinterpretation or something else. In any case, it was common enough.

Note, I'm not saying it was right. Just not uncommon.

[1] https://www.ietf.org/rfc/rfc1738.txt [2] https://www.w3.org/TR/html401/ [3] https://www.ietf.org/rfc/rfc2396.txt [4] https://www.w3.org/TR/html4/interact/forms.html#h-17.13.4.1


> If you're a PHP user, don't use this function unless you know you need to. There are better functions now.

Don't leave me hanging! What are the better functions now?


`rawurlencode()` is what you're after.

And here is where you'd ask that question, a coding forum https://stackoverflow.com/questions/996139/urlencode-vs-rawu...


Thanks!

And here is where you'd answer that question, a coding forum https://stackoverflow.com/questions/996139/urlencode-vs-rawu...

;)


Knowing PHP's standard library, probably something like "urlencode_safe_for_real_this_time".

Kidding aside, IIRC "rawurlencode" is the RFC compliant one.


It's reasonable that they don't want to fix it, because it would break existing URLs. Welcome to the ugly world of backwards compatibility.


They could make it configurable on a per bucket basis (perhaps defaulting to the old behaviour if necessary; ideally you would make the conformant behaviour the default, of course).

That way you could opt in to the standard conformant behaviour if you require it, but they can still keep backward compatibility.


I'm not familiar with how S3 works in detail, but I imagine this could require additional API calls in the backend which increases the latency and resource usage of API requests. In the worst case, such a change could easily require Amazon to purchase dozens, if not hundreds of additional servers.


With the rate of AWS growth, they probably bought dozens more servers in the time it took you to write out your response. :)


Likely, but Amazon didn't get where they are by ignoring small costs.


Or use a different domain for the same buckets, and resolve the name correctly on this new domain.


They could compromise by adding a few more lines of code and having '+' resolve to ' ' if and only if the file can't be found with '+', or vice versa.

Immediately mark this behaviour as deprecated and switch over to proper '+' == '+' behaviour later.
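That fallback could be sketched like this in Python (hypothetical names throughout; this is not the real S3 API or its internals, just the lookup logic):

```python
# Hypothetical fallback lookup: try the key as given, then retry
# with '+' and ' ' swapped before giving up.
class NotFound(KeyError):
    pass

def get_object(store: dict, key: str) -> bytes:
    if key in store:
        return store[key]
    # Retry with the legacy interpretation of the key.
    fallback = key.replace("+", " ") if "+" in key else key.replace(" ", "+")
    if fallback in store:
        return store[fallback]
    raise NotFound(key)

store = {"a+b.txt": b"plus"}
assert get_object(store, "a b.txt") == b"plus"  # legacy decode still resolves
```

As the replies point out, this becomes ambiguous as soon as both variants of a key exist, and doing it race-free across a distributed store is its own problem.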

edit: LiquidFire's idea is better.


That would require synchronisation, potentially between multiple servers. Doing this efficiently, without race conditions could be very tricky at their size.


It also makes lookups ambiguous if you have both. :)


Should this really be considered to be a spec violation? It's a restriction, sure, but S3 is to be considered a specific application with specific constraints.


Does S3 use HTTP? If so, it's a violation of the specification of S3 by way of incorporation of the HTTP specification.

Otherwise, if S3 does not use HTTP, we would need to see the S3 specification to determine if it (the implementation Amazon uses) is in violation.


In response to the reported RFC violation, elving@AWS writes: "I agree that's unconventional and unfortunate." My corporate bullshit detector is off the scale.

In earlier times, we would have both the ability and the balls to treat that unwillingness to uphold the rules we all set out with as damage to the Internet, and route around it. But sadly, AWS has become too big to fail, so the engineers introduce special cases into their products and deploy them.


To the contrary, I think it's actually a refreshingly honest response. A "corporate bullshit" response would be to ignore it altogether, try to argue it's a feature not a bug, or give a canned statement about how we respect the environment and want the world to be a better place.

The AWS support is explicitly acknowledging it's an issue, while giving a rational reason why it probably won't be fixed (even if you disagree with the reason). The back-compat concern is unfortunate but a good argument can be made it's not in users' interests either (beyond being just a cost to AWS to implement the change).


But can they even change it without risking breaking tens of thousands of websites?


They could, by versioning the API (e.g. add a /v2/ to all paths), but that would benefit no-one and should only be done alongside any number of much more important changes.


That opens a whole new can of worms. They have not deprecated a single part of the s3 API ever, in history (okay I'm partially lying, the SOAP API is now officially deprecated in the sense that they won't add new features to it).

They're not going to change it just for the sake of some minor path issues that have a workaround. (Side note: they tend to use headers rather than paths for API declarations.) I have personally been bitten by this same issue, but I would never recommend they change pieces of the service to accommodate it. It's handled easily in client code. What they could do is add an obvious "gotchas" section to the documentation.

AWS has done a tremendous job of getting things right-enough the first time. They have never killed an AWS service. They'd need a much bigger reason to version an API.


Isn't the API so huge that you could just configure a bucket (default-off) to behave properly?


Eh, URL/URI escaping is an interesting example, because people have been doing inconsistent and sometimes standards-problematic things with it for pretty much as long as URLs have existed, and it's been a perpetual pain and problem. (For just one example, read up on `&` and `;`, and whether `&` can/should/must be escaped in which contexts; that's not the only one, and `+` is another long-standing case.) So it's not a great example of how everyone used to be consistently standards-compliant in "earlier times"; it's more like a counter-example. And I don't think it's unique: my experience is not that everything on the web used to be more consistent and standards-compliant in "earlier times" than it is now. If anything, the reverse.


How is this 'corporate bullshit'? Corporate BS is about giving vague circumlocutionary responses that try to just press all the right PR buttons.

This is the opposite of that.


When were these "earlier times" for the web? Not during browser wars, that's for sure. Not when web2.0 started with crazy ideas about rest. Not during flash-everywhere era. Etc...



