-
Notifications
You must be signed in to change notification settings - Fork 192
Add BeginInsert/InsertData/EndInsert flow #443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
clickhouse/client.cpp
Outdated
|
|
||
| void FinishInsert(); | ||
|
|
||
| void SendData(const Block& block); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to move this to public so that PreparedInsert can call it. Not in the header file, though, so shouldn't matter.
clickhouse/client.h
Outdated
| public: | ||
| Block * GetBlock(); | ||
| void Execute(); | ||
| // XXX This shouldn't be public. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't figure out how to make this private. Suggestions appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice if it worked declared public in the .cpp file, but I think I could also use an Impl class like Client does to hide such things.
ade33f4 to
51d8216
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a PreparedInsert pattern for more memory-efficient bulk data insertion. Instead of accumulating all data before sending, users can now prepare an INSERT statement once and execute multiple smaller batches within a single ClickHouse operation.
Key Changes:
- Added
PreparedInsertclass withGetBlock(),Execute(), andFinish()methods for iterative data insertion - Implemented
PrepareInsert()methods in Client for initiating prepared inserts - Added comprehensive unit test demonstrating the prepared insert workflow
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| clickhouse/client.h | Declared PreparedInsert nested class and PrepareInsert() methods with detailed documentation |
| clickhouse/client.cpp | Implemented PreparedInsert class methods, ReceivePreparePackets(), and refactored insert finalization logic |
| clickhouse/block.h | Fixed spelling in comments ("Convinience" → "Convenience") |
| ut/client_ut.cpp | Added PrepareInsert test case and fixed spelling in existing comment ("Spontaneosly" → "Spontaneously") |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
slabko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for contributing this feature. It has been on the list for quite some time, and I’m glad someone has started looking into it.
However, I have a few remarks.
In general, if you look at the codebase, there is no manual memory management, that is, instead of using new and delete, we rely on std::unique_ptr and std::shared_ptr to manage heap-allocated resources. In fact, the delete keyword is never used anywhere in the project. Using manual memory management of the PreparedInsert class introduces a very bad situation where PreparedInsert can be inadvertently copied.The compiler will automatically generate the copy assignment and the copy constructor operators, which could lead to shallow copies of pointers and ultimately a double-free error, if users are not careful. This can easily happen by accident.
My second remark is a bit tougher. I know you’ve put thought and care into this design, but I’ll have to ask for large changes. The PreparedInsert is not needed here, and the API is simpler without it. The insert operation should be simple and not require many visible moving parts. Ideally, I would approach it like this:
Block block = client.BeginInsert("INSERT INTO test_clickhouse_cpp_insert VALUES");
for (const auto& td : TEST_DATA) {
id->Append(td.id);
name->Append(td.name);
f->Append(td.f);
}
client.SendData(block);
...
client.SendData(block);
...
client.SendData(block);
client.EndInsert();The main points here are:
BeginInsertandEndInsertclearly form a pair and serve one another.- It’s unambiguous that no other insert or select statements should occur between them. The current
PreparedInsertdesign creates room for sharing thePreparedInsertaround, which risks losing the connection state and start using the client object for something else in the meantime. The proposed pattern enforces a clear principle: one operation → one connection → one client object. Need another parallel operation - create another client. - Here the
Blockobject is detached, and ownership is passed to the user code. The user knows it’s not an internal part ofPreparedInsertand can freely modify it if needed. - You can still preserve automatic
EndInsertbehavior when the client goes out of scope by tracking its state - if it’s in insert mode, callEndInsertin the destructor. - I would avoid using the word
Prepare...here, because it seem to have a bit different idea than what we are trying achiave here.
Thank you again for your work. Please let me know if you’d like any help, I’d be happy to assist.
|
Thank you for the design suggestions. I'll work on them this afternoon. |
|
Done in a91ff8a. |
clickhouse/client.h
Outdated
| */ | ||
| std::unique_ptr<Block> BeginInsert(const std::string& query); | ||
| std::unique_ptr<Block> BeginInsert(const std::string& query, const std::string& query_id); | ||
| void InsertData(Block& block); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Holler if you'd rather pass a std::unique_ptr<Block>. Seems okay to me to pass a *block instead, but I'm not yet up to snuff on idiomatic C++.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does BeginInsert have to return std::unique_ptr? It seems to me that it doesn't to be a pointer at all, i.e.:
Block BeginInsert(const std::string& query);
InsertData looks good, except it should be
void InsertData(const Block& block);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But InsertData does modify the block, by design:
void Client::Impl::InsertData(Block& block) {
assert(inserting);
block.RefreshRowCount();
SendData(block);
block.Clear();
}Would you rather that refreshing the count and clearing be done by the caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, only new Block() would need delete. Destructor handles cleaning up vector's heap allocation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to void InsertData(const Block& block); in 17467b7.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@serprex Yeah, with this change I had to update clickhouse_fdw to make copies of all the columns, since it couldn't hold on to a reference to the block. A bit annoying to be creating them all twice, but not awful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, hey, was able to switch from copying to moving the block in bb8b5a4. Sweet!
d3b7dcd to
bb8b5a4
Compare
Add a new pattern for "prepared inserts". It works like this:
* Call `BeginInsert` with an `INSERT` query with optional columns
and ending in `VALUES`. No values should be included in the string.
* It returns a `Block` pre-configured with columns as
declared in the `INSERT` statement
* Add data to the block and periodically call `InsertData` to insert
data and clear the block.
* Call `EndInsert()` or just let the `Client` object go out of scope
to signal the server that it's done inserting.
This allows one to send smaller batches of blocks, thereby using less
memory, but still in a single ClickHouse `INSERT` operation.
Expected to be useful in the Postgres foreign data wrapper insert API,
where multiple rows can be inserted at once but its API handles
one-at-a-time insertion. It will also support the FDW COPY API, which
can submit huge batches of data to insert, as well.
| { } | ||
| Client::Impl::~Impl() { | ||
| // Wrap up an insert if one is in progress. | ||
| EndInsert(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this is the right place but it seems this is dangerous in the situation where the database stops when inserting a batch of records.
With v2.6.0 my app correctly recovers if Clickhouse stops in the middle of an insert but after this was added to the destructor, it crashes with this stack:
terminate called after throwing an instance of 'std::system_error'
what(): fail to send 38 bytes of data: Broken pipe
#10 0x000055555561b866 in clickhouse::Client::Impl::EndInsert()
#11 0x0000555555c4823c in clickhouse::Client::Impl::~Impl()
#12 0x0000555555c4857b in clickhouse::Client::~Client()
I reproduce by issuing this simple command during a big insert loop:
service clickhouse-server stop
I think we need at least a try/catch around EndInsert() in the destructor to mitigate the issue but I'm no C++ expert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @madshell
That’s a very good point! It really does make it difficult to handle these kinds of exceptions (and can create a bunch of problems). Putting a try/catch around EndInsert() is probably the best solution, especially since no data loss is involved.
Thank you very much for pointing this out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a new pattern for "prepared inserts". It works like this:
BeginInsertwith anINSERTquery with optional columns and ending inVALUES. No values should be included in the string.Blockpre-configured with columns as declared in theINSERTstatementInsertDatato insert data and clear the block.EndInsert()or just let theClientobject go out of scope to signal the server that it's done inserting.This allows one to send smaller batches of blocks, thereby using less memory, but still in a single ClickHouse
INSERToperation.Expected to be useful in the Postgres foreign data wrapper insert API, where multiple rows can be inserted at once but its API handles one-at-a-time insertion. It will also support the FDW COPY API, which can submit huge batches of data to insert, as well.