Skip to content

Conversation

@ogoffart
Copy link
Contributor

The server sorts the chunk by name alphabetically. So if we want to keep
the chunk in order, we need to add a few '0' in front of the chunk name

@mention-bot
Copy link

@ogoffart, thanks for your PR! By analyzing the history of the files in this pull request, we identified @ckamm, @jturcotte and @danimo to be potential reviewers.

@ogoffart
Copy link
Contributor Author

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean "leading" 0

@guruz
Copy link
Contributor

guruz commented Jan 18, 2017

I think it's misleading if you really need to do that.

@dragotin @moscicki @PVince81

@moscicki
Copy link
Contributor

moscicki commented Jan 18, 2017

What is the purpose of this? Why do you care on the client side in which order the server keeps chunks internally? If you need on the client to establish order based on list of chunk names returned by the server then convert to int and sort. Adding leading 0 so the server by side effects gets alphabetical sorting right -- what a horrible idea. Will break the moment server decides to change ordering for any reason. Also, don't you need more zeros to prepend when sorting numbers with more than 2 digits?

Likewise on the server. If you need to establish the order for concatenating the chunks => convert to int.

BTW: The client should keep the state of the transfer based on the server responses. Normally there should be no reason to inquire server state. That's also recommended by S3 multipart upload (http://docs.aws.amazon.com/AmazonS3/latest/dev/mpuoverview.html)

BTW2. I think you really need to think of this API one level higher: you get the list of chunks uploaded to the server. Not files uploaded to the server. There may be different implementations. These webdav calls could be also REST APIs. The API structure is the same -- just the HTTP verb is different. I think it makes no sense that client thinks of these chunks as individual files on the server.

The server sorts the chunk by name alphabetically. So if we want to keep
the chunk in order, we need to add a few '0' in front of the chunk name
@ogoffart
Copy link
Contributor Author

@moscicki

Why do you care on the client side in which order the server keeps chunks internally?

I care in which order the chunks will be assembled: They will be assembled in alphabetical order.

So if i upload chunks 0,1,2,3,4,5,6,7,8,9,10,11 I want them to be assembled in that order. Not 0,1,10,11,2,3,... That would be a file corruption.

don't you need more zeros to prepend when sorting numbers with more than 2 digits?

The "rightJustified" function add a variable number of '0' depending of the size of the string.

BTW: The client should keep the state of the transfer based on the server responses.

That's what the old chunking do. The advantage of querrying the server is that we really know what are the chunks on the servers.

BTW2. [...] These webdav calls could be also REST APIs. [...] I think it makes no sense that client thinks of these chunks as individual files on the server.

That's correct. The client don't think they are files, just entries in a webdav reply.
Is your point that your server will not have these individual file and it will therefore be hard for the server to remember the list of chunks with their name?
But that's part of the new chunking protocol: that you can query the existing chunks with PROPFIND.

@DeepDiver1975
Copy link
Member

CC\ @DeepDiver1975 The server sorts the chunk alphabeticaly (10 < 2) (cf. the strcmp in https://github.com/owncloud/core/blob/5ce66c4d5f9a895bec41f56f7cc96661eec00ae6/apps/dav/lib/Upload/AssemblyStream.php#L69 )
Is that intentional, if yes, we need to document it on https://github.com/cernbox/smashbox/blob/master/protocol/chunking.md

The original idea of the protcol was to use integer numbers - it still should be from my pov.
Maybe the strcmp is simply wrong - I need to rethink on this - next week.

@moscicki
Copy link
Contributor

@ogoffart: Isn't it up to the server to assure correct chunk ordering based on the value you pass? You are passing an integer number (as string due to the way it is encoded in HTTP request). The server should take this number and do whatever necessary to assure the correct sequence. They may pad it with zeros or convert to int or whatever. I really think this should be fixed in the server for this special endpoint. I do not think that the client should assume anything how the server internally stores these chunks and if it does alphabetic sorting internally. From a point of view of the protocol this is a low-level implementation detail that makes the system very fragile in case of future changes.

I completely agree with you that we should make sure the protocol doc reflect any changes or additional assumptions. The current description at https://github.com/cernbox/smashbox/blob/master/protocol/chunking.md is not very complete and I can help fixing this but I would like to know what's described on Dragotin's blog is still valid or if some parts are outdated. For example, do you use the PROPFIND and do you assume anything about ordering of the chunk list that server provides to you?

@moscicki
Copy link
Contributor

About your last question: in our implementation we do remember which chunks have been uploaded. However, ideally, the client should manage this state completely.There might be corner cases where you want to inquire about the status of chunks on the server but they should be quite rare I believe (if needed at all)?

@guruz
Copy link
Contributor

guruz commented Jan 19, 2017

@moscicki Because of the server being able to timeout chunks we had decided that the state is managed on the server and the client inquires with a PROPFIND when an upload is started/resumed.

I agree that the server should be fixed here, not the client.

@DeepDiver1975
Copy link
Member

I agree that the server should be fixed here, not the client.

as said I'll have a look at this next week

@guruz
Copy link
Contributor

guruz commented Jan 19, 2017

@DeepDiver1975 don't worry, not meant as a blame, more like a -1 to @ogoffart patch.

@moscicki
Copy link
Contributor

Likewise here: no offense meant to anyone, please.

For our endpoint implementation we will not care about leading zeros because we will convert to int. But for the sanity of the protocol I would suggest to make minimal assumptions and separate protocols from implementation side effects.

@jturcotte
Copy link
Member

jturcotte commented Jan 19, 2017

It's possible that we'd like to encode more info in the name in the future, like the start offset in the file or the total number of chunks. For example if we want to use the same end point for delta sync.

It could make sense to parse the current chunk number already up to a delimiter (e.g. a ".") so that we have some freedom regarding older implementations in the future.

Or do we assume that we'll pass everything else through HTTP headers?

@moscicki
Copy link
Contributor

How is the file offset provided today? I guess a header?

@jturcotte
Copy link
Member

jturcotte commented Jan 19, 2017

Currently chunks are just concatenated at MOVE time by the server, and the client calculates the source offset of the next chunks by doing an initial PROPFIND of any existing temp upload folder (e.g. when restarted during upload).

@moscicki
Copy link
Contributor

How does the client know in this case what was the chunk size used for this interrupted upload? If you only get the chunk number via propfind without the offset you must assume the chunk size is constant, don't you?

About previous question on special headers, according to the doc below the chunk offset should be sent by the client as a header: https://github.com/cernbox/smashbox/blob/master/protocol/chunking.md

Could someone that knows the current client implementation verify if the doc is up-to-date and if it is complete? Thank you.

@jturcotte
Copy link
Member

jturcotte commented Jan 19, 2017

Chunks, like files, are atomically added to WebDAV. Unfinished chunks aren't kept and that means that we can just use the size of each chunk as returned by PROPFIND, and add them up to the point of the offset of the next-to-do chunk (afterward just using the chunk size in the client's configuration). Different chunks can have different sizes for that reason, as long as chunks of a single file are uploaded serially, in order.

@moscicki
Copy link
Contributor

Okey, that's clear. But why do you require sequential upload of the chunks?

@jturcotte
Copy link
Member

jturcotte commented Jan 19, 2017

If we want parallel chunks, we'd actually need the offset in order to patch potential holes in the chain and avoid having to restart from the first missing chunk.

If a following chunk has the time to finish but not its predecessor, we need the offset of the next chunk if we can't know the used chunk size of the missing chunk.

In any case, we're still doing parallel upload of files (chunked or not), so it would mainly be a win when a single big file is being uploaded (which probably does happen), but with a big-enough chunk size, it probably won't upload faster in parallel anyway.

@moscicki
Copy link
Contributor

Don't you have the chunk offset to be sent in OC-Chunk-Offset header? According to spec.

The previous version (V1) of chunking allowed for parallel upload. This one does not. If you have a hole in transfer then you should also have a hole in the index number? Why not just calculate the offsets correctly from the PROPFIND and send the missing chunk again?

@ogoffart
Copy link
Contributor Author

We can do parallel upload, that's not a problem.
When we resume, we resume from the first chunk that was not successfull.

@ogoffart
Copy link
Contributor Author

@guruz:

more like a -1 to @ogoffart patch.

I think we should merge this patch anyway before the beta if the server is not fixed yet.
We can revert it later if released version of the server don't have this problem.

@ckamm
Copy link
Contributor

ckamm commented Jan 20, 2017

@moscicki @ogoffart

Here are a bunch of other things I noticed while comparing the client behavior to the protocol:

  1. If there is a chunk ordering through the numbers as well as the OC-Chunk-Offset header, what should happen during MOVE precisely? Assembly based on the chunk ordering, using the chunk offset values for validation? (the current oc server version just assembles based on order and doesn't care about chunk-offset)
  2. Is there a particular use for OC-Total-Length? I propose it should be used for detecting missing data in ChunkingNG: Missing chunks are silently ignored core#26988.
  3. The spec mentions If-Destination-Match. The actual implementation sends If: <$dest>([ $match ])
  4. The client currently sometimes uses DELETE on chunks when resuming. There's no mention of that in the protocol document.

@guruz guruz added this to the 2.3.0 milestone Jan 20, 2017
@guruz guruz merged commit d6fdda8 into master Jan 20, 2017
@guruz guruz deleted the chunking-ng branch January 20, 2017 15:04
@guruz
Copy link
Contributor

guruz commented Jan 20, 2017

@ogoffart You are right, merging this until the server is fixed.

I think the server needs to use https://secure.php.net/manual/en/function.strnatcmp.php .. I'll try.

@moscicki
Copy link
Contributor

You may also consider to activate it now via a server capability. You will need this anyway when the server fixes the issue such that you do not break older servers. It is cleaner to do it now via a capability such as "ChunkingNG.ZeroPaddedChunkNames" = 1. Otherwise you will have to come up with a capability to disable this in the future - not nice.

guruz added a commit to owncloud/core that referenced this pull request Jan 20, 2017
For owncloud/client#5476

Before this, the assembly could be bogusly in the order 0,1,10,11,2,3 etc.

As per the spec "The name of every chunk should be its chunk number."
https://github.com/cernbox/smashbox/blob/master/protocol/chunking.md
@guruz
Copy link
Contributor

guruz commented Jan 20, 2017

-> owncloud/core#26995

@moscicki
Copy link
Contributor

moscicki commented Jan 20, 2017

@ckamm: thanks a lot for this investigation.

  1. Very good point. I would assume that the order goes by chunk number and that all Chunk-Offset+Content-Lenght must match the entire file range (as declared by OC-Total-Length). Essentially if this is not met then it means that some bytes were not sent and this is a clear error. This error should result in wrong checksum. Do we want to add a possibility to return 400 Bad Request to MOVE in case not all bytes were sent?

  2. Detecting missing data is one application. Another is that you may want to reserve space in your storage system upfront knowing the size. Another is that you can check quota and reject the initial request straightaway. BTW: do we report quota errors on PUT or in this case MKCOL?

  3. Okey, something to be fixed. BTW. Couldn't we use If-Match like for the normal upload?

  4. Why would the client DELETE already uploaded chunks? Does it make any sense?

Please note that, following idea from @PVince81 I added 400 Bad Request specification: https://github.com/cernbox/smashbox/blob/master/protocol/chunking.md#put

Any comments on this?

@guruz
Copy link
Contributor

guruz commented Jan 26, 2017

The spec mentions If-Destination-Match. The actual implementation sends If: <$dest>([ $match ])
The client currently sometimes uses DELETE on chunks when resuming. There's no mention of that in the protocol document.

@moscicki Would you adjust https://github.com/cernbox/smashbox/blob/master/protocol/chunking.md to allow for DELETE on chunks? The client can decide to do this if he decides that he aborts the upload or that he thinks he's better off with another chunk size etc. e.g. #5480

MorrisJobke pushed a commit to nextcloud/server that referenced this pull request Mar 17, 2017
For owncloud/client#5476

Before this, the assembly could be bogusly in the order 0,1,10,11,2,3 etc.

As per the spec "The name of every chunk should be its chunk number."
https://github.com/cernbox/smashbox/blob/master/protocol/chunking.md
MorrisJobke pushed a commit to nextcloud/server that referenced this pull request Mar 17, 2017
For owncloud/client#5476

Before this, the assembly could be bogusly in the order 0,1,10,11,2,3 etc.

As per the spec "The name of every chunk should be its chunk number."
https://github.com/cernbox/smashbox/blob/master/protocol/chunking.md

Signed-off-by: Morris Jobke <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants