Conversation

@pserwylo (Contributor) commented Nov 13, 2022

Fixes an issue in External Storage when using an S3 backend, whereby some providers allow Nextcloud to upload, but not subsequently download, files.

I witnessed this using Cloudflare R2, a relatively new (but well-priced) S3-compatible storage system that has recently come out of beta.

To reproduce

  • Add a Cloudflare R2 bucket as an External Storage folder in Nextcloud via the admin "External Storage" GUI. Note that the region must be set to auto for R2 buckets. Everything else, such as access keys, hosts, and bucket names, is no different from any other S3-compatible storage, and the details can be found in your Cloudflare account.
  • Upload one or more files to Nextcloud.
    • Works: They successfully end up in the R2 bucket, and can be downloaded via the Cloudflare R2 web UI or via any other S3 client.
    • Works: Nextcloud UI shows the file name and metadata when browsing files and folders (but no thumbnails).
    • Broken: Nextcloud is unable to download any files from the R2 bucket.

Description of the issue

The error log shows the following when trying to download a file from R2 via Nextcloud:

fopen(httpseek://):
failed to open stream: "OC\\Files\\Stream\\SeekableHttpStream::stream_open"
call failed at /var/www/html/lib/private/Files/Stream/SeekableHttpStream.php#67

This error isn't particularly useful, though, because the actual issue is that the SeekableHttpStream::reconnect function returns false, which triggers the fopen to fail with the above message.
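
To illustrate why the message is so generic: any PHP stream wrapper whose stream_open() returns false produces the same warning, regardless of what went wrong inside the wrapper. A minimal demonstration (illustrative code, not the actual Nextcloud wrapper):

```php
<?php
// Any wrapper whose stream_open() returns false yields the same generic
// fopen() warning, so the log alone can't say *why* the open failed.
class FailingWrapper {
    public $context; // set by PHP; declared to avoid the dynamic-property deprecation

    public function stream_open(string $path, string $mode, int $options, ?string &$opened_path): bool {
        // In SeekableHttpStream, this is roughly where reconnect() fails.
        return false;
    }
}

stream_wrapper_register('failseek', FailingWrapper::class);
fopen('failseek://', 'r');
// Warning: fopen(failseek://): failed to open stream:
//          "FailingWrapper::stream_open" call failed
```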

After investigating, it turned out that this is because the code assumes S3 servers will always return a 206 Partial Content response when requesting a file with the request header Range: bytes=0-.

This assumption does not hold for Cloudflare R2 (and potentially other providers), which interprets Range: bytes=0- as "please send me the whole file" and returns a 200 OK response.

Strictly speaking, a 200 response is okay. Although RFC 7233 doesn't specify anything about the edge case of bytes=0- (which effectively means the entire file), it does say:

A server MAY ignore the Range header field

It doesn't say that a server MUST respond with 206 Partial Content when handling Range: request headers. Given this, Cloudflare's response seems similar in spirit to ignoring the Range: header and returning 200, and is thus within spec. It therefore seemed appropriate to handle this gracefully in the Nextcloud code by continuing to fetch the file rather than failing.
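
A tiny standalone probe can show the difference between providers (the URL below is a placeholder; any object you can GET works). It issues a Range: bytes=0- request the way PHP's http stream wrapper does and prints the status line:

```php
<?php
// Placeholder URL: point this at any object you can GET.
$url = 'https://example.com/some-object';

$ctx = stream_context_create([
    'http' => ['method' => 'GET', 'header' => "Range: bytes=0-\r\n"],
]);
$fh = fopen($url, 'r', false, $ctx);

// $http_response_header is populated by the http(s) wrapper after fopen().
echo $http_response_header[0], "\n";
// AWS S3:        HTTP/1.1 206 Partial Content
// Cloudflare R2: HTTP/1.1 200 OK   (at the time of this PR)
fclose($fh);
```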

Solution

Add a special case to check for a 200 OK response to a Range: bytes=0- request, and if it occurs, request a range that encompasses the entire length of the file (as defined by the Content-Length header from the initial HEAD request).
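
For reference, here is a minimal sketch of the special case (function and variable names are illustrative, not the actual patch). This version interprets the 200 response directly as the full range, using the Content-Length from the initial HEAD request, rather than issuing a second request:

```php
<?php
// Sketch only: interpretRangeResponse() is illustrative, not the real
// SeekableHttpStream code.

/**
 * Interpret the response to a "Range: bytes=$start-" request.
 * Returns [firstByte, lastByte] of the delivered range, or null on failure.
 */
function interpretRangeResponse(string $statusLine, array $headers, int $start): ?array {
    // Normal case: the server honoured the range request with a 206.
    if (preg_match('#^HTTP/[\d.]+\s+206#', $statusLine)) {
        if (preg_match('#bytes\s+(\d+)-(\d+)/#', $headers['Content-Range'] ?? '', $m)) {
            return [(int)$m[1], (int)$m[2]];
        }
        return null;
    }

    // Special case: a 200 OK to "Range: bytes=0-" means the server ignored
    // the Range header and is sending the whole file, which is semantically
    // equivalent. Treat it as the full range, using the Content-Length
    // known from the initial HEAD request.
    if ($start === 0 && preg_match('#^HTTP/[\d.]+\s+200#', $statusLine)) {
        $length = (int)($headers['Content-Length'] ?? 0);
        return $length > 0 ? [0, $length - 1] : null;
    }

    return null; // Anything else still fails the open, as before.
}
```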

Notes about this PR

Only for Range: bytes=0-

This only special-cases Range: bytes=0- requests, not other range requests which may potentially be ignored by the server.

I feel this is okay for a couple of reasons.

Firstly - the spec saying "servers MAY ignore Range requests" is probably there to cater for old HTTP servers which don't know how to send partial responses. S3 is relatively modern compared to HTTP, and thus I imagine all S3 servers actually support range requests for parts of a file correctly by returning a 206 Partial Content, as Nextcloud currently expects. Indeed, this is the case with Cloudflare, which does correctly return 206 Partial Content responses for range requests that do not start at 0.

Secondly - this ensures that the semantics of "please open a file starting at this point and read until the end" are not changed. If we added similar behaviour for, e.g., a Range: bytes=1000- request to which the server returned an HTTP/1.1 200 OK response, it would probably stuff up the rest of the Nextcloud server, which would be expecting only a partial piece of the file.

Notes on testing

This is my first PR to nextcloud/server, and I seem to have chosen a curly issue to do with a specific downstream provider that has nothing to do with Nextcloud. I understand there are some integration tests which work with Azure and AWS, but I am not sure what infrastructure would be required on the Nextcloud side to get a working PoC up with Cloudflare R2. I believe the integration tests for S3 storage use a mock S3 storage, which likely doesn't exhibit the same behaviour as Cloudflare R2. Furthermore, Cloudflare R2 buckets cannot be made public in the same way as S3 buckets (if made public, they only expose an HTTP endpoint for accessing files, not a full public S3-compatible API endpoint), so I can't even offer up my own R2 PoC bucket (and you probably wouldn't want to depend on a random stranger's bucket for your integration tests anyway).

Having said that, I did hack the ./tests/lib/preseed-config.php to include the following CloudflareR2 config (for my private bucket):

+  'objectstore' => [
+      'class' => 'OC\\Files\\ObjectStore\\S3',
+      'arguments' => [
+          'bucket' => 'nextcloud-unittest',
+          'key' => 'MY_KEY',
+          'secret' => 'MY_SECRET',
+          'hostname' => 'MY_ACCOUNT.r2.cloudflarestorage.com',
+          'region' => 'auto',
+          'port' => '443',
+          'use_ssl' => true,
+          'use_path_style' => false,
+      ],
+  ],

and as expected, when run on master, I get 11 test failures in ./tests/lib/Files/ObjectStore/S3Test.php with the following error:

fopen(httpseek://): failed to open stream: "OC\Files\Stream\SeekableHttpStream::stream_open" call failed

When I run the same tests on this branch, they all pass, and the coverage includes my new block of code, indicating that Cloudflare R2 did indeed work as expected.

Please let me know if there is anything I can do with regards to writing tests for this case.

Feedback requested on HTTP 200 check

The regex used to check for HTTP 200 responses seems relatively naïve, but to the best of my knowledge it is to spec for HTTP/1.1. I'm unsure about HTTP/2 or HTTP/3, but empirically the S3 clients seem to use HTTP/1.1 at this point anyway. Feedback is welcome if you'd like me to change this in any way - though in practice it works for Cloudflare R2 according to my manual testing.

Quirky PHP/Cloudflare R2 bug still to be solved

When running the Nextcloud unit tests, I noticed that a few of the tests hit a quirky intermittent failure when downloading byte ranges, which resulted in some bytes being truncated from the response 😮.

I'm still investigating with this PoC: https://github.com/pserwylo/cloudflare-r2-php-fopen-bug, but I have noticed the following:

  • It is intermittent, but generally fails approx 30% of the time.
  • It is specific to Cloudflare R2, not AWS S3.
  • It is specific to byte ranges. Downloading entire files works fine.
  • It is specific to byte ranges where the start of the range is below a specific power-of-2 boundary and the end of the range is above it. In my testing, I observed it at the 65536-byte (2^16) boundary, when requesting bytes 65525-65555 (sketched below, after this list).
  • When failing, it only returns the bytes from before the power-of-2 boundary, and truncates all bytes afterwards.
  • It is unique to fopen("https://...") (i.e. it doesn't happen if I perform the same requests via Guzzle).
  • I can't intercept the network traffic (e.g. using Wireshark) to test because it is HTTPS, and when I MITM myself using a proxy such as mitmproxy, the AWS signature verification picks up that a MITM has taken place and seems to reject the request.
  • I am now at the crazy point of compiling PHP myself with debug symbols, and trying to use GDB to watch what happens as I run my unit test from https://github.com/pserwylo/cloudflare-r2-php-fopen-bug, but the intermittent nature of the error is making this extremely difficult to pin down.
  • After a bit more testing, I will share my PoC with the Cloudflare R2 team and see if they are able to figure it out. The PoC is inspired by the Nextcloud implementation (I've copied code from the ObjectStorage portion of Nextcloud), but it is independent of Nextcloud, so it is either a bug in Cloudflare R2 or in PHP itself :(
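
For the curious, the shape of the repro is roughly the following (the URL is a placeholder for a pre-signed R2 object URL, and the object is assumed to be larger than 64 KiB; the real repro lives in the repo linked above):

```php
<?php
// Placeholder: a pre-signed URL for an R2 object larger than 64 KiB.
$url = 'https://MY_ACCOUNT.r2.cloudflarestorage.com/bucket/object?signed-params';

$start = 65525;
$end = 65555; // the range crosses the 65536 (2^16) boundary

$ctx = stream_context_create([
    'http' => ['method' => 'GET', 'header' => "Range: bytes=$start-$end\r\n"],
]);
$fh = fopen($url, 'r', false, $ctx);
$body = stream_get_contents($fh);
fclose($fh);

$expected = $end - $start + 1; // 31 bytes
if (strlen($body) !== $expected) {
    // When it fails (~30% of the time), only the bytes below the 65536
    // boundary arrive; everything after is silently truncated.
    echo 'truncated: got ' . strlen($body) . " of $expected bytes\n";
}
```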

A massive, massive thanks to @juliushaertl for the foresight to choose the following file sizes in the S3Test: https://github.com/pserwylo/nextcloud-server/blame/e8ee363138eee7703cc236a4f40276090eb3a997/tests/lib/Files/ObjectStore/S3Test.php#L145. Without these specific byte sizes, which cross a specific power-of-2 boundary, I would not have observed this issue.

@pserwylo force-pushed the fix-cloudflare-r2-external-storage branch from e8ee363 to e9318c0 on November 13, 2022 at 10:17
@szaimen added the "3. to review (Waiting for reviews)" label on Nov 13, 2022
@szaimen added this to the Nextcloud 26 milestone on Nov 13, 2022
@come-nc (Contributor) commented Nov 15, 2022

@pserwylo Please run composer run cs:fix if possible to fix code style.

@pserwylo force-pushed the fix-cloudflare-r2-external-storage branch from e9318c0 to 5330e98 on November 15, 2022 at 12:27
@pserwylo (Contributor, Author) commented Nov 15, 2022

@come-nc: Thanks for the prompt review and the pointer to the cs:fix composer script. I did note the lint failures but wasn't quite sure how to resolve them. Lint fixed and commit amended.

A question regarding the target release: although I'm happy for it to go with Nextcloud 26, it is probably unsurprising that I would selfishly like to have it in a patch release of Nextcloud 25 so it is available earlier. Out of curiosity, is the rationale that it is kind of introducing a new feature (e.g. support for Cloudflare R2), as opposed to a bugfix in the existing S3 storage functionality?

Nevertheless, thanks for reviewing.

@come-nc (Contributor) commented Nov 15, 2022

> A question regarding the target release: although I'm happy for it to go with Nextcloud 26, it is probably unsurprising that I would selfishly like to have it in a patch release of Nextcloud 25 so it is available earlier. Out of curiosity, is the rationale that it is kind of introducing a new feature (e.g. support for Cloudflare R2), as opposed to a bugfix in the existing S3 storage functionality?

Target release is 26 because the target branch is master.
The question is then whether to backport it.
I would indeed treat this as a new feature rather than a bugfix, but I have no strong opinion.
@PVince81 @icewind1991 thoughts?

@PVince81 (Member) commented

considering the code change, it looks more like a bugfix than a new feature, to accommodate quirks

if the code doesn't cause any harm to non-Cloudflare servers then it would be ok to backport to 25

@PVince81 (Member) commented

would be good to hear what @icewind1991 and/or @juliushaertl think

@pserwylo (Contributor, Author) commented Nov 16, 2022

> considering the code change, it looks more like a bugfix than a new feature, to accommodate quirks
>
> if the code doesn't cause any harm to non-Cloudflare servers then it would be ok to backport to 25

Great. I'll wait to hear from others, and if anyone else agrees, it seems the right thing is to ask the bot to backport (based on looking at other PRs). Is that something you would normally do, or should I?

I also note a couple of other CI jobs failing. One is due to PostgreSQL not starting as expected: is this known to be flaky, or is it something I should investigate? The other is the "Performance" job, which is failing because it assumes PRs come from the nextcloud/server repo and not from forks. I've filed PR #35192 to fix this separately.

@Frederik-Baetens commented Nov 17, 2022

Hi @pserwylo, I'm from the R2 team and would like to thank you for the very thorough investigation into this. In cases like this, we're generally open to changing our implementation to more closely match other S3-compatible providers.

We'll be releasing a fix for this compatibility issue with the status code of range requests later this week, or early next week. I'll post an update when that change goes live.

As for your issue with range requests being truncated, I'm very interested in looking into this with you. If it's convenient for you, you can reach me on the Cloudflare Discord, or if you prefer another communication method, that can be arranged as well.

@Frederik-Baetens commented Nov 17, 2022

This fix has been released now. GetObject with a range set now unconditionally returns a 206 response with Content-Range set. Let me know if that didn't address the issue, and we can keep looking into compatibility between Nextcloud & R2.

@pserwylo closed this on Nov 18, 2022
@pserwylo (Contributor, Author) commented

Thanks @Frederik-Baetens - great to see someone from the team here, and such a quick turnaround on the fix. For the record, I still believe you were not strictly doing anything incorrect, but fixing it on your end means older versions of Nextcloud get it for free without needing a workaround in this code, and other S3 clients with similar assumptions may benefit too.

I've closed this PR now, as there is no need to work around this quirk. I can confirm that my Nextcloud 24.0.7 (obviously without this fix) can now download files from R2 as expected. I appreciate the reviews from others on this thread too, even though they turned out not to be required.

Regarding the quirk with truncated files that I first noticed when running unit tests on this change: I will try to find you on Discord when I get some time next week. In the meantime, I've done my best to document the issue and make it reproducible in my pserwylo/cloudflare-r2-php-fopen-bug repo. The README documents in detail my (unsuccessful) efforts to track it down.

@PVince81 (Member) commented

glad to hear that it got sorted out upstream, thanks a lot 😄

@Frederik-Baetens commented Nov 18, 2022

@pserwylo Thanks for confirming that Nextcloud works with R2 now 🎉

I've been able to reproduce the truncated-files issue with the helpful reproduction repo you provided, but given that it seems to be specific to PHP's fopen(), getting to the bottom of it requires more PHP knowledge than I have. So I'm looking forward to looking into this with you when you find some time next week!
