Calculate VARCHAR(N) on schema generation to support text longer than 256 characters #37
Conversation
These configuration parameters are still a work in progress.
FYI I made a similar change in #46.
@JoshRosen Sure, have those changes been merged into master?
Yep, the changes in #46 were merged to master; GitHub doesn't say so, though, because I used Spark's merge script instead of the GitHub merge button (this squashes the commits and helps to keep the history clean).
In the interest of managing risk and complexity, I would be in favor of splitting the ability to configure sizes on a per-column basis from the mechanism that automatically picks the appropriate sizes. If you do plan to pick up work on this, I'd like to first split the manual control of sizes off into a separate, smaller PR, test that really well, and then revisit the accumulators approach once we have a manual workaround in place.
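For readers skimming this thread later, here is a minimal sketch of what an accumulator-based approach could look like under the Spark 1.x `Accumulator` API that this PR targets. The names and wiring are illustrative, not the code in this diff:

```scala
import org.apache.spark.{Accumulator, AccumulatorParam, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Merge per-partition results by taking the element-wise maximum.
object MaxLengthsParam extends AccumulatorParam[Array[Int]] {
  def zero(initial: Array[Int]): Array[Int] = Array.fill(initial.length)(0)
  def addInPlace(a: Array[Int], b: Array[Int]): Array[Int] =
    a.zip(b).map { case (x, y) => math.max(x, y) }
}

// Observe per-column maximum string lengths while the rows are scanned.
// In a real integration the update would live inside the existing pass
// that serializes rows for the S3 upload; a standalone foreach is shown
// here only to keep the sketch short.
def trackMaxLengths(sc: SparkContext,
                    rows: RDD[Row],
                    stringColIndexes: Array[Int]): Array[Int] = {
  val acc: Accumulator[Array[Int]] =
    sc.accumulator(Array.fill(stringColIndexes.length)(0))(MaxLengthsParam)
  rows.foreach { row =>
    acc += stringColIndexes.map { i =>
      if (row.isNullAt(i)) 0 else row.getString(i).length
    }
  }
  acc.value // safe to read on the driver once the action has completed
}
```

Because the merge is an element-wise max, re-executed tasks cannot inflate the result, which sidesteps the usual double-counting caveat with accumulators.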
I'm going to close this pull request for now, since #54 has added column metadata support for configuring string column lengths. If you still want to add automatic inference of string column sizes, please re-open this PR after addressing the merge conflicts. Thanks!
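For anyone landing here from search, a minimal sketch of the manual workaround that #54 enables, assuming string lengths are configured through a `maxlength` column-metadata key (the helper name, column name, and length below are illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.MetadataBuilder

// Hypothetical helper: tag a string column with an explicit maximum length
// so that the generated Redshift DDL uses VARCHAR(n) instead of the
// default VARCHAR(256).
def withMaxLength(df: DataFrame, column: String, n: Long): DataFrame = {
  val metadata = new MetadataBuilder().putLong("maxlength", n).build()
  df.withColumn(column, df(column).as(column, metadata))
}

// e.g. allow up to 4096 characters in a "comment" column before writing:
// val prepared = withMaxLength(df, "comment", 4096L)
```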
Initial work towards supporting long text fields when loading data into Redshift, addressing the issue reported in #29.
Adds an additional step during the table-load phase that dynamically calculates the maximum string length within each column of the DataFrame, and replaces Spark SQL's default schema generation to allow more flexibility and control over how the Redshift schema is generated.
This is an initial, functional approach; it hasn't yet been profiled or benchmarked for performance.
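To make the description above concrete, here is a minimal sketch of the inference step under this PR's assumptions; it is not the exact code in the diff. One aggregation pass finds the longest value observed in each string column, and the schema generator can then emit `VARCHAR(n)` for those columns:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{length, max}
import org.apache.spark.sql.types.{StringType, StructField}

// Compute the maximum observed string length for every string column in a
// single aggregation job. Columns containing only nulls fall back to
// Redshift's 256-character default.
def inferVarcharLengths(df: DataFrame): Map[String, Int] = {
  val stringCols = df.schema.fields.collect {
    case StructField(name, StringType, _, _) => name
  }
  if (stringCols.isEmpty) {
    Map.empty
  } else {
    val maxima = stringCols.map(c => max(length(df(c))))
    val row = df.agg(maxima.head, maxima.tail: _*).collect().head
    stringCols.zipWithIndex.map { case (name, i) =>
      name -> (if (row.isNullAt(i)) 256 else row.getInt(i))
    }.toMap
  }
}
```

The schema generator would then substitute `VARCHAR(n)` with the inferred lengths in place of the default `StringType` mapping. The extra aggregation is the part that still needs profiling, since it adds a full scan of the DataFrame before the write begins.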