-
Notifications
You must be signed in to change notification settings - Fork 1.5k
PARQUET-1388: Nanosecond precision time and timestamp - parquet-mr #519
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I would suggest implement the refactor part as a separate JIRA. It is more a follow-up task for PARQUET-1253. |
|
Ok, created a separate PR for the refactor: #520 |
Rewrite Pig primitive type creation to builders
With wrapper RuntimeException the underlying checked FrontendException could be rethrown outside of the of the visitor, this way retaining backward compatibility
| "TIMESTAMP_NANOS_STRINGIFIER", "yyyy-MM-dd'T'HH:mm:ss.SSS") { | ||
| @Override | ||
| public String stringify(long value) { | ||
| return super.stringify(value) + String.format("%06d", Math.abs(value % 1000_000)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: 1_000_000 (also a few lines below)
| cal.clear(); | ||
| cal.set(2053, Calendar.JULY, 10, 22, 13, 24); | ||
| cal.set(Calendar.MILLISECOND, 84); | ||
| long micros = cal.getTimeInMillis() * 1000_000 + 1900; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename the variable to nanos and use 1_000_000 instead of 1000_000.
| cal.set(1848, Calendar.MARCH, 15, 9, 23, 59); | ||
| cal.set(Calendar.MILLISECOND, 765); | ||
| micros = cal.getTimeInMillis() * 1000_000 - 1; | ||
| assertEquals("1848-03-15T09:23:59.765000001", stringifier.stringify(micros)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The timestamp from the calendar is 1848-03-15T09:23:59.765. You subtract 1 nanosec from it, so it should be 1 nanosec earlier than 1848-03-15T09:23:59.765, which is 1848-03-15T09:23:59.764999999 and not 1848-03-15T09:23:59.765000001. Why does this pass?
This PR introduces nanosecond precision time and timestamp types to parquet-mr. It depends on apache/parquet-format#102, should be committed once this format change is released.
Nanosecond precision is introduced only into the new logical type API, the old version (ConvertedType in format and OriginalType in mr) doesn't have a corresponding enum value for nanoseconds. Therefore in the Thrift schema these fields will be null, only new releases can interpret nanosecond precision, older readers can only see the physical type.
In addition to nanosecond precision, I also refactored the modules to use the new logical type API for internal decisions (e.g.: replaced the OriginalType-based switch cases, for example in type builder when checking if the proper annotation is present on physical type). This part (
commit with title: Refactor modules to use the new logical type API) doesn't need new format release, and can be split out from this PR if needed.