Use maxlength metadata to configure VARCHAR column lengths #54
Whoops, this should be 256.
Also, should this be configurable?
Should we just use TEXT in this case, which essentially defers the default to redshift?
This is a minimum viable patch to illustrate the basic idea and to gather feedback. I'll add more thorough tests tomorrow.
@marmbrus, one question on naming: throughout
Another question: in the future, is there ever a case where we'd want to generate
@Aerlinger, @jaley, @traviscrawford, and @eduardoramirez, you might also want to follow this PR to make sure that it addresses your respective use-cases.
I believe metadata is case-sensitive (which is unfortunate, since options are not). Lowercase looks good to me. We should perhaps check what the convention in MLlib is, as it is the heaviest user of this feature.
It turns out that the user-facing APIs for manipulating existing columns' metadata are somewhat incomplete in the Scala API and are missing in the Python, SQL, and R language APIs. I don't think that this is a blocker for merging this patch, however, as I feel that the right approach is to improve Spark's APIs for working with column metadata. In follow-up PRs, we can consider whether we want to add additional configurations for changing the default size or enabling truncation to work around errors caused by limited column width.
This patch allows users to specify a maxlength column metadata entry for string columns in order to control the width of VARCHAR columns in generated Redshift table schemas. This is necessary in order to support string columns that are wider than 256 characters. In addition, this configuration can be used as an optimization to achieve space savings in Redshift. For more background on the motivation for this feature, see #29.
See also: #53 to improve error reporting when LOAD fails.
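To make the idea concrete, here is a minimal, library-independent sketch of how a maxlength metadata entry could drive the VARCHAR width in a generated schema. It is not this patch's implementation: the names redshift_type, schema_ddl, and the (name, type, metadata) column tuples are hypothetical, and the 256 fallback is assumed from the review comment above.

```python
from typing import Any, Dict, List, Tuple

# Assumed fallback width, taken from the review discussion ("this should be 256").
DEFAULT_VARCHAR_LENGTH = 256

# Hypothetical column description: (name, logical type, metadata dict).
Column = Tuple[str, str, Dict[str, Any]]


def redshift_type(logical_type: str, metadata: Dict[str, Any]) -> str:
    """Map a logical column type to a Redshift column type (illustrative only)."""
    if logical_type == "string":
        # Metadata keys are matched case-sensitively, so only "maxlength" counts.
        length = metadata.get("maxlength", DEFAULT_VARCHAR_LENGTH)
        if not isinstance(length, int) or isinstance(length, bool) or length <= 0:
            raise ValueError(f"maxlength must be a positive integer, got {length!r}")
        return f"VARCHAR({length})"
    if logical_type == "long":
        return "BIGINT"
    raise ValueError(f"unsupported type: {logical_type}")


def schema_ddl(columns: List[Column]) -> str:
    """Render the column list of a CREATE TABLE statement."""
    return ", ".join(
        f'"{name}" {redshift_type(logical_type, metadata)}'
        for name, logical_type, metadata in columns
    )


columns = [
    ("id", "long", {}),
    ("title", "string", {}),                   # no metadata: falls back to 256
    ("body", "string", {"maxlength": 4096}),   # wider than the default
]
ddl = schema_ddl(columns)
print(ddl)
```

In this sketch, a string column without the entry keeps the default width, while one carrying maxlength gets exactly the requested width, which covers both the wider-than-256 case and the space-saving case described above.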