[C++] Append column#299

Merged
eddyxu merged 24 commits into main from lei/update_col
Nov 18, 2022

Member

@eddyxu eddyxu commented Nov 7, 2022

Minimal support for appending a column to the C++ LanceDataset.

@eddyxu eddyxu added WIP work in progress donotmerge Do not merge labels Nov 7, 2022
@eddyxu eddyxu self-assigned this Nov 7, 2022
@eddyxu eddyxu force-pushed the lei/update_col branch 2 times, most recently from bf6a89e to fa26e36 Compare November 9, 2022 18:14
@eddyxu eddyxu force-pushed the lei/update_col branch 4 times, most recently from bb5f003 to dcd08d1 Compare November 16, 2022 21:10
@eddyxu eddyxu changed the title Append column [C++] Append column Nov 18, 2022
@eddyxu eddyxu requested a review from changhiskhan November 18, 2022 07:26
@eddyxu eddyxu added c++ C++ issues and removed WIP work in progress donotmerge Do not merge labels Nov 18, 2022
@eddyxu eddyxu marked this pull request as ready for review November 18, 2022 07:29
Contributor

@changhiskhan changhiskhan left a comment

minor questions


/// Update the values to new values, presented in the array.
/// The array must have the same length as the batch returned previously via `Next()`.
::arrow::Status Update(const std::shared_ptr<::arrow::Array>& arr);
Contributor

Maybe call this UpdateBatch to make it clear the input is supposed to line up with the batch?

Member Author

sgtm.


const std::shared_ptr<format::Schema>& LanceFragment::schema() const { return manifest_->schema(); }

const std::shared_ptr<lance::format::DataFragment>& LanceFragment::data_fragment() const {
Contributor

I keep forgetting what DataFragment is vs. LanceFragment.

Member Author

@eddyxu eddyxu Nov 18, 2022

DataFragment is the POD that maps to what is serialized in the manifest / protobuf.

LanceFragment is a subclass of arrow::dataset::Fragment, which offers the I/O functions, e.g., ScanBatchesAsync, CountRows, etc.
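
To illustrate the split described above, here is a minimal standalone sketch. It does not use Arrow or Lance; `DataFileSketch`, `DataFragmentSketch`, `FragmentViewSketch`, and `NumDataFiles` are hypothetical names invented for this example, and the fields are assumptions rather than the actual Lance layout. The point is only that one type is plain serializable data while the other wraps it and carries behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for serialized metadata: plain data, no behavior.
struct DataFileSketch {
  std::string path;
  std::vector<int32_t> field_ids;
};

// Hypothetical stand-in for the POD that would be stored in the manifest.
struct DataFragmentSketch {
  uint64_t id = 0;
  std::vector<DataFileSketch> files;
};

// Hypothetical stand-in for the I/O-facing class. It only wraps the POD;
// the real class would subclass the Arrow fragment type and do actual I/O.
class FragmentViewSketch {
 public:
  explicit FragmentViewSketch(std::shared_ptr<DataFragmentSketch> data)
      : data_(std::move(data)) {}

  // Accessor for the underlying POD, mirroring data_fragment() in the diff.
  const std::shared_ptr<DataFragmentSketch>& data_fragment() const { return data_; }

  // Placeholder for an I/O method; a real one would open the data files.
  std::size_t NumDataFiles() const { return data_->files.size(); }

 private:
  std::shared_ptr<DataFragmentSketch> data_;
};
```

In this sketch the POD can be copied or serialized freely, while anything that touches files goes through the wrapper.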


  auto num_batches = metadata_->num_batches();
- auto num_columns = manifest_->schema()->GetFieldsCount();
+ auto num_columns = ranges::max(manifest_->schema()->GetFieldIds()) + 1;
Contributor

don't have to take care of it now but how would this work for column deletions?

Member Author

Yeah, so for now we don't reclaim the deleted column IDs, so they still exist in the lookup table but do not exist in the schema.

The simplest solution could be to just take a new snapshot of everything and reclaim the column IDs.
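
A small standalone sketch of why the table is sized by the largest field ID rather than by the field count. `LookupTableColumns` is a hypothetical helper written for this example, it uses `std::max_element` instead of the `ranges::max` call in the diff, and it assumes field IDs are non-negative.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of slots a lookup table indexed by field ID needs.
// Assumes IDs are non-negative; an empty schema needs no slots.
inline std::size_t LookupTableColumns(const std::vector<int32_t>& field_ids) {
  if (field_ids.empty()) return 0;
  auto max_id = *std::max_element(field_ids.begin(), field_ids.end());
  return static_cast<std::size_t>(max_id) + 1;
}
```

For example, if a schema once had IDs {0, 1, 2}, column 1 was deleted, and a new column got ID 3, the schema has three fields but the table still needs four slots, with slot 1 left unused until IDs are reclaimed.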

/// The schema of the newly generated dataset.
std::shared_ptr<lance::format::Schema> full_schema_;
/// The schema of the column to be updated/added.
std::shared_ptr<lance::format::Schema> column_schema_;
Contributor

so this can support adding multiple columns at a time?

Member Author

In theory, yes.

This is an internal API; we could change it to shared_ptr<lance::Field> if we'd like to be stricter for now.

auto fut = batch_generator_();
ARROW_ASSIGN_OR_RAISE(last_batch_, fut.result());
if (!last_batch_) {
ARROW_RETURN_NOT_OK(writer_->Finish().result());
Contributor

I thought this was reading batches from the original dataset. Why is a writer involved here?

Member Author

An updater is basically an iterator that holds a "Batch Generator" and a "Batch Writer"; the two are opened and closed together so that the data stays aligned between "Data Files".
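
A rough standalone sketch of that pairing, not the actual implementation: `Batch`, `ColumnWriterSketch`, and `UpdaterSketch` are hypothetical stand-ins that use plain `std::vector<int>` instead of Arrow arrays and exceptions instead of `arrow::Status`. It assumes the generator signals end of input with an empty optional, similar to the null `last_batch_` check in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

using Batch = std::vector<int>;  // hypothetical stand-in for a batch / array

// Hypothetical writer: collects the new column's values batch by batch.
class ColumnWriterSketch {
 public:
  void Write(const Batch& values) { batches_.push_back(values); }
  void Finish() { finished_ = true; }
  bool finished() const { return finished_; }
  const std::vector<Batch>& batches() const { return batches_; }

 private:
  std::vector<Batch> batches_;
  bool finished_ = false;
};

// Hypothetical updater: owns the read side and drives the write side.
class UpdaterSketch {
 public:
  using Generator = std::function<std::optional<Batch>()>;

  UpdaterSketch(Generator gen, ColumnWriterSketch* writer)
      : gen_(std::move(gen)), writer_(writer) {}

  // Returns the next existing batch, or an empty optional at end of input.
  // Reaching the end also finishes the writer, so both sides close together.
  std::optional<Batch> Next() {
    last_batch_ = gen_();
    if (!last_batch_) writer_->Finish();
    return last_batch_;
  }

  // Writes new values for the batch most recently returned by Next().
  // The values must have the same length as that batch.
  void UpdateBatch(const Batch& values) {
    if (!last_batch_ || values.size() != last_batch_->size()) {
      throw std::invalid_argument("values must line up with the last batch");
    }
    writer_->Write(values);
  }

 private:
  Generator gen_;
  ColumnWriterSketch* writer_;
  std::optional<Batch> last_batch_;
};
```

A caller would loop on `Next()`, compute one new value per row of the returned batch, and pass them to `UpdateBatch()`. Because each written batch has the same length as the batch it was derived from, the new column stays row-aligned with the existing data.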

@eddyxu eddyxu merged commit 837e2f3 into main Nov 18, 2022
@eddyxu eddyxu deleted the lei/update_col branch November 18, 2022 21:32
@eddyxu eddyxu mentioned this pull request Nov 18, 2022