Skip to content

Conversation

@Aerlinger
Copy link

Initial work towards supporting long text fields when loading data into Redshift to address the issue reported at #29.

Adds an additional step during the table load phase to dynamically calculate the maximum string lengths within each column of the dataframe. Also replaces the default schema generation by Spark SQL in order to support more flexibility and control in how the Redshift schema is generated.

This is an initial, functional approach but hasn't yet been profiled or benchmarked for performance.

@codecov-io
Copy link

Current coverage is 85.51%

Merging #37 into master will increase coverage by +0.82% as of 2e92905

@@            master     #37   diff @@
======================================
  Files           10      10       
  Stmts          307     345    +38
  Branches        67      69     +2
  Methods          0       0       
======================================
+ Hit            260     295    +35
  Partial          0       0       
- Missed          47      50     +3

Review entire Coverage Diff as of 2e92905

Powered by Codecov. Updated on successful CI builds.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These configuration parameters are still a work in progress.

@JoshRosen
Copy link
Contributor

FYI I made a similar change to schemaString in #46. Could you take a look and see which parts of this patch are still necessary and should be re-implemented on top of the master which incorporates my change?

@Aerlinger
Copy link
Author

@JoshRosen Sure, have those changes been merged into master?

@JoshRosen
Copy link
Contributor

Yep, the changes in #46 were merged to master; GitHub doesn't say so, though, because I used Spark's merge script instead of the GitHub merge button (this squashes the commits and helps to keep the history clean).

@JoshRosen
Copy link
Contributor

In the interest of managing risk / complexity, I would be in favor of splitting the ability to configure the size on a per-column basis from the mechanism to automatically pick the appropriate sizes. If you do plan to pick up work on this, I'd like to first split off the manual control of sizes into a separate smaller PR, test that really well, then revisit the accumulators approach once we have a more manual workaround in place.

@JoshRosen
Copy link
Contributor

I'm going to close this pull request for now, since #54 has added the column metadata support for configuring string column lengths. If you still want to add automatic inference of the string column sizes then please re-open this PR after addressing the merge conflicts. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants