Calculate `VARCHAR(N)` on schema generation to support text longer than the 256 characters #37

Aerlinger · 2015-08-04T04:59:03Z

Initial work towards supporting long text fields when loading data into Redshift to address the issue reported at #29.

Adds a step to the table load phase that dynamically calculates the maximum string length within each column of the DataFrame. Also replaces Spark SQL's default schema generation to allow more flexibility and control over how the Redshift schema is generated.
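The length calculation described above can be sketched roughly as follows. This is not the code from this PR: it uses plain Scala collections instead of a DataFrame, and the names `VarcharSketch`, `maxByteLengths`, and `schemaString` are hypothetical. It assumes lengths should be measured in UTF-8 bytes, since Redshift sizes `VARCHAR(N)` in bytes and caps it at 65535.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object VarcharSketch {
  // Redshift's upper bound for VARCHAR, in bytes.
  val MaxVarcharBytes = 65535

  // Hypothetical helper: longest UTF-8 encoded value in each column.
  // Rows are modelled as Seq[Option[String]]; None stands in for NULL.
  def maxByteLengths(columns: Seq[String],
                     rows: Seq[Seq[Option[String]]]): Map[String, Int] =
    columns.zipWithIndex.map { case (name, i) =>
      val longest = rows.foldLeft(0) { (acc, row) =>
        math.max(acc, row(i).map(_.getBytes(UTF_8).length).getOrElse(0))
      }
      name -> longest
    }.toMap

  // Hypothetical helper: render the column list of a CREATE TABLE statement,
  // clamping each length to the range 1..MaxVarcharBytes.
  def schemaString(columns: Seq[String], lengths: Map[String, Int]): String =
    columns.map { name =>
      val n = math.min(math.max(lengths.getOrElse(name, 0), 1), MaxVarcharBytes)
      s"$name VARCHAR($n)"
    }.mkString(", ")
}
```

For two columns `id` and `body` whose longest values are 3 and 300 bytes, `schemaString` would yield `id VARCHAR(3), body VARCHAR(300)`. A real implementation would need a full pass over the data before the table is created, which is where the performance concern comes from.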

This is an initial, functional approach but hasn't yet been profiled or benchmarked for performance.

…class

codecov-io · 2015-08-04T05:01:32Z

Current coverage is `85.51%`

Merging #37 into master will increase coverage by +0.82% as of 2e92905

@@            master     #37   diff @@
======================================
  Files           10      10       
  Stmts          307     345    +38
  Branches        67      69     +2
  Methods          0       0       
======================================
+ Hit            260     295    +35
  Partial          0       0       
- Missed          47      50     +3

Review entire Coverage Diff as of 2e92905

Powered by Codecov. Updated on successful CI builds.

Aerlinger · 2015-08-04T05:02:31Z

src/main/scala/com/databricks/spark/redshift/Parameters.scala

These configuration parameters are still a work in progress.

JoshRosen · 2015-08-20T21:34:05Z

FYI I made a similar change to `schemaString` in #46. Could you take a look and see which parts of this patch are still necessary and should be re-implemented on top of master, which now incorporates my change?

Aerlinger · 2015-08-22T02:56:25Z

@JoshRosen Sure, have those changes been merged into master?

JoshRosen · 2015-08-22T02:58:47Z

Yep, the changes in #46 were merged to master; GitHub doesn't say so, though, because I used Spark's merge script instead of the GitHub merge button (this squashes the commits and helps to keep the history clean).

JoshRosen · 2015-08-25T17:46:32Z

In the interest of managing risk / complexity, I would be in favor of splitting the ability to configure the size on a per-column basis from the mechanism to automatically pick the appropriate sizes. If you do plan to pick up work on this, I'd like to first split off the manual control of sizes into a separate smaller PR, test that really well, then revisit the accumulators approach once we have a more manual workaround in place.
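As an illustration only, the manual half of that split could look something like the sketch below, where explicit per-column sizes are supplied by the caller and nothing is inferred from the data. The names `ManualVarcharSizes` and `columnType` are hypothetical and are not part of this project's API; the default of 256 is assumed from the limit named in the PR title.

```scala
object ManualVarcharSizes {
  // Assumption: 256 is the default string column length this PR works around.
  val DefaultLength = 256

  // Hypothetical helper: an explicit per-column size wins;
  // any column without an override keeps the default length.
  def columnType(name: String, overrides: Map[String, Int]): String =
    s"VARCHAR(${overrides.getOrElse(name, DefaultLength)})"
}
```

Keeping this path free of any data scan means it can be tested on its own, with automatic inference layered on later as a second source of the `overrides` map.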

JoshRosen · 2015-08-26T23:58:40Z

I'm going to close this pull request for now, since #54 has added the column metadata support for configuring string column lengths. If you still want to add automatic inference of the string column sizes then please re-open this PR after addressing the merge conflicts. Thanks!

Aerlinger added 8 commits August 3, 2015 05:44

Initial spec for stringLength params

ef96912

initial functionality and specs for Schema inference

1fec190

Fixed bug where df could be nil when calculating string lengths

4843185

WIP: First pass at function to generate schema for "VARCHAR(N)"

1f75db8

Add StringMetaSchema object to calculate string lengths across columns

cad5ecb

Clean up tests and extract common DB mocking functionality to parent class

1fc2789

Extract repeated code in specs to utility and parent classes

8e34670

Add inline documentation to MetaSchema object

4ce2d5c


Clean up and DRY specs, and add spec for schema inference

a6c8d65

JoshRosen closed this Aug 26, 2015

JoshRosen added the stale / awaiting update label Aug 26, 2015
