This reverts commit 6353b4f.
cpp/include/lance/arrow/updater.h
/// Update the values to new values, presented in the array.
/// The array must has the same length as the batch returned previously via `Next()`.
::arrow::Status Update(const std::shared_ptr<::arrow::Array>& arr);
Maybe call this UpdateBatch to make it clear that the input is supposed to line up with the batch?
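To illustrate the contract being discussed (the array passed in must line up with the batch last returned by `Next()`), here is a minimal sketch. It is not the Lance implementation: `SketchUpdater` is a hypothetical class, plain `std::vector<int>` stands in for Arrow arrays and batches, and an optional error string stands in for `::arrow::Status`.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the updater; plain vectors replace Arrow arrays.
class SketchUpdater {
 public:
  explicit SketchUpdater(std::vector<std::vector<int>> batches)
      : batches_(std::move(batches)) {}

  // Returns the next batch, or nullptr once the input is exhausted.
  const std::vector<int>* Next() {
    last_ = pos_ < batches_.size() ? &batches_[pos_++] : nullptr;
    return last_;
  }

  // Appends new values for the batch last returned by Next().
  // Returns an error message on misuse, or std::nullopt on success.
  std::optional<std::string> UpdateBatch(const std::vector<int>& values) {
    if (last_ == nullptr) {
      return std::string("no current batch; call Next() first");
    }
    if (values.size() != last_->size()) {
      return std::string("array length does not match the current batch");
    }
    new_column_.insert(new_column_.end(), values.begin(), values.end());
    return std::nullopt;
  }

  const std::vector<int>& new_column() const { return new_column_; }

 private:
  std::vector<std::vector<int>> batches_;
  std::size_t pos_ = 0;
  const std::vector<int>* last_ = nullptr;
  std::vector<int> new_column_;
};
```

With a name like `UpdateBatch`, the pairing with `Next()` is visible at the call site, and the length check turns a misaligned input into an error instead of silently shifting rows.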
const std::shared_ptr<format::Schema>& LanceFragment::schema() const { return manifest_->schema(); }

const std::shared_ptr<lance::format::DataFragment>& LanceFragment::data_fragment() const {
I keep forgetting what DataFragment is vs. LanceFragment.
DataFragment is the POD, which maps to what is serialized in the manifest / protobuf.
LanceFragment is a subclass of arrow::Fragment, which offers the I/O functions, e.g., ScanBatchAsync, CountRows, etc.
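The split described above can be sketched roughly as follows. All names here (`DataFileSketch`, `DataFragmentSketch`, `FragmentSketch`, `RowCounter`) are hypothetical and exist only for illustration; the real classes build on Arrow and protobuf types. The sketch also assumes every data file in a fragment holds the same number of rows, so counting rows only needs to look at one file.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Plain metadata, analogous to what would be serialized in the manifest.
struct DataFileSketch {
  std::string path;
  std::vector<int> field_ids;  // columns stored in this file
};

struct DataFragmentSketch {
  std::vector<DataFileSketch> files;
};

// Hypothetical callback that opens a file and reports its row count.
using RowCounter = std::function<std::int64_t(const std::string&)>;

// I/O-facing wrapper: holds the metadata and knows how to read through it.
class FragmentSketch {
 public:
  FragmentSketch(std::shared_ptr<const DataFragmentSketch> meta, RowCounter counter)
      : meta_(std::move(meta)), counter_(std::move(counter)) {}

  const std::shared_ptr<const DataFragmentSketch>& data_fragment() const { return meta_; }

  // Assumption: all files in a fragment are row-aligned, so one file suffices.
  std::int64_t CountRows() const {
    if (meta_->files.empty()) return 0;
    return counter_(meta_->files.front().path);
  }

 private:
  std::shared_ptr<const DataFragmentSketch> meta_;
  RowCounter counter_;
};
```

The metadata struct stays trivially serializable, while anything that touches storage lives in the wrapper.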
auto num_batches = metadata_->num_batches();
auto num_columns = manifest_->schema()->GetFieldsCount();
auto num_columns = ranges::max(manifest_->schema()->GetFieldIds()) + 1;
We don't have to take care of it now, but how would this work for column deletions?
Yeah, so for now we don't reclaim deleted column IDs, so they still exist in the lookup table but no longer exist in the schema.
The simplest solution could be to just write a new snapshot of everything and reclaim the column IDs.
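A small sketch of why sizing by the largest field ID tolerates deleted columns, whereas a plain field count would not. The helper names (`LookupTableSize`, `BuildLookup`) are hypothetical, and the sketch assumes field IDs are non-negative integers used directly as table indices.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

// Number of slots needed when the table is indexed directly by field ID.
inline std::size_t LookupTableSize(const std::vector<int>& field_ids) {
  if (field_ids.empty()) return 0;
  return static_cast<std::size_t>(*std::max_element(field_ids.begin(), field_ids.end())) + 1;
}

// Maps field ID -> position in the current schema.
// IDs of deleted columns keep their slot but hold no value.
inline std::vector<std::optional<std::size_t>> BuildLookup(const std::vector<int>& field_ids) {
  std::vector<std::optional<std::size_t>> table(LookupTableSize(field_ids));
  for (std::size_t i = 0; i < field_ids.size(); ++i) {
    table[static_cast<std::size_t>(field_ids[i])] = i;
  }
  return table;
}
```

For a schema whose live IDs are 0, 1, and 3 (ID 2 was deleted), the table has four slots and slot 2 is empty; a table sized by the field count (three) would have no slot for ID 3.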
/// The schema of the newly generated dataset.
std::shared_ptr<lance::format::Schema> full_schema_;
/// The schema of the column to updated/added
std::shared_ptr<lance::format::Schema> column_schema_;
so this can support adding multiple columns at a time?
In theory, yes.
This is an internal API; we could change it to shared_ptr<lance::Field> if we'd like to be stricter for now.
auto fut = batch_generator_();
ARROW_ASSIGN_OR_RAISE(last_batch_, fut.result());
if (!last_batch_) {
ARROW_RETURN_NOT_OK(writer_->Finish().result());
I thought this was reading batches from the original dataset. Why is a writer involved here?
An updater is basically an iterator that holds a "Batch Generator" and a "Batch Writer". The two are opened and closed together so that the data stays aligned across "Data Files".
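The pairing described here can be sketched as below. This is an illustration under assumptions, not the code in this PR: `Batch`, `BatchGenerator`, `WriterSketch`, `PairedUpdater`, and `MakeGenerator` are hypothetical names, the generator is synchronous rather than future-based, and the writer only records batches in memory.

```cpp
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

using Batch = std::vector<int>;
// Yields batches until exhausted, then std::nullopt.
using BatchGenerator = std::function<std::optional<Batch>()>;

// Hypothetical writer that records what it receives.
struct WriterSketch {
  std::vector<Batch> written;
  bool finished = false;
  void Write(const Batch& b) { written.push_back(b); }
  void Finish() { finished = true; }
};

// Holds a generator and a writer, and advances them in lockstep.
class PairedUpdater {
 public:
  PairedUpdater(BatchGenerator gen, WriterSketch* writer)
      : gen_(std::move(gen)), writer_(writer) {}

  // Pulls the next input batch; once the generator is exhausted,
  // the writer is finished as well so both sides close together.
  const std::optional<Batch>& Next() {
    last_ = gen_();
    if (!last_ && !writer_->finished) writer_->Finish();
    return last_;
  }

  // Writes new values for the current batch; rejects misaligned input.
  bool Update(const Batch& values) {
    if (!last_ || values.size() != last_->size()) return false;
    writer_->Write(values);
    return true;
  }

 private:
  BatchGenerator gen_;
  WriterSketch* writer_;
  std::optional<Batch> last_;
};

// Builds a generator over an in-memory list of batches.
inline BatchGenerator MakeGenerator(std::vector<Batch> batches) {
  return [batches = std::move(batches), i = std::size_t{0}]() mutable -> std::optional<Batch> {
    if (i >= batches.size()) return std::nullopt;
    return batches[i++];
  };
}
```

Because every written batch must match the length of the batch just read, the new column's file ends up with the same batch boundaries as the existing files.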
Minimal support of appending a column in C++
LanceDataset