Skip to content

Conversation

@chl-wxp
Copy link
Contributor

@chl-wxp chl-wxp commented Dec 2, 2025

Issue

#10129

Check list

  • Accuracy of file segmentation logic
  • Unit test coverage of file splitting
  • The rationality of the code architecture of the file splitting strategy

@chl-wxp chl-wxp marked this pull request as draft December 2, 2025 10:26
@LiJie20190102
Copy link
Contributor

LGTM

@chl-wxp chl-wxp marked this pull request as ready for review December 5, 2025 13:53
@chl-wxp
Copy link
Contributor Author

chl-wxp commented Dec 9, 2025

@zhangshenghang @corgy-w

Copy link
Contributor

@corgy-w corgy-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First, you should update the documentation.
After enabling enable_split_file, the original skip_header behavior for the file no longer works as before. Is there any way to keep it compatible?

@chl-wxp
Copy link
Contributor Author

chl-wxp commented Dec 11, 2025

First, you should update the documentation. After enabling enable_split_file, the original skip_header behavior for the file no longer works as before. Is there any way to keep it compatible?

  1. Both Chinese and English documents have been added;
  2. Regarding the previous logic of skipping a few lines in csv and text, I made compatibility in the code. When single file sharding is not enabled, the original skip logic will still be used. Although somewhat redundant, this is the best solution currently.

@chl-wxp chl-wxp requested a review from corgy-w December 11, 2025 03:19
URL url = getClass().getClassLoader().getResource("test_split_csv_data.csv");
String realPath = Paths.get(url.toURI()).toString();
final List<FileSourceSplit> splits = localFileSplitStrategy.split("test.table", realPath);
Assertions.assertEquals(4, splits.size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add line count check

URL url = getClass().getClassLoader().getResource("test_split_csv_data.csv");
String realPath = Paths.get(url.toURI()).toString();
final List<FileSourceSplit> splits = localFileSplitStrategy.split("test.table", realPath);
Assertions.assertEquals(2, splits.size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add line count check

import java.nio.file.Paths;
import java.util.List;

public class SplitFileStrategyTest {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add escape characters to test: when the field content contains a newline character, the line is not truncated.

e.g(csv):

1,t1,"a
b"
2,t2,"c
d"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add escape characters to test: when the field content contains a newline character, the line is not truncated.

e.g(csv):

1,t1,"a
b"
2,t2,"c
d"

please see: #10185

Copy link
Contributor

@corgy-w corgy-w left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your change not only involves localfile, but also other files related documents need to be changed.

@chl-wxp
Copy link
Contributor Author

chl-wxp commented Dec 12, 2025

Your change not only involves localfile, but also other files related documents need to be changed.

Connectors of different file types have different logic. Currently, only local implements AccordingToSplitSizeSplitStrategy, so only the localfile document is modified.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants