Conversation

@dongjoon-hyun (Member) commented Jan 5, 2019:

What changes were proposed in this pull request?

Currently, Spark cannot handle the numeric[] type with decimal data because both precision and scale are reported as 0.

postgres=# CREATE TABLE t (v numeric[], d  numeric);
CREATE TABLE
postgres=# INSERT INTO t VALUES('{1111.222,2222.332}', 222.4555);
INSERT 0 1
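
The options value passed to spark.read.jdbc below is not shown in the description. A minimal sketch of one way to set it up (the connection properties here are assumptions for illustration, not taken from the PR):

scala> import java.util.Properties
scala> val options = new Properties()  // hypothetical credentials below
scala> options.setProperty("user", "postgres")
scala> options.setProperty("password", "postgres")
scala> options.setProperty("driver", "org.postgresql.Driver")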

BEFORE

scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
pgTable: org.apache.spark.sql.DataFrame = [v: array<decimal(0,0)>, d: decimal(38,18)]

scala> pgTable.show
19/01/04 18:07:38 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 0

AFTER

scala> val pgTable = spark.read.jdbc("jdbc:postgresql:postgres", "t", options)
pgTable: org.apache.spark.sql.DataFrame = [v: array<decimal(38,18)>, d: decimal(38,18)]

scala> pgTable.show()
+--------------------+--------------------+
|                   v|                   d|
+--------------------+--------------------+
|[1111.22200000000...|222.4555000000000...|
+--------------------+--------------------+
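
For context, the fix amounts to guarding the numeric/decimal mapping in the Postgres dialect so that a reported (precision, scale) of (0, 0) falls back to Spark's default decimal type. A minimal standalone sketch of that idea (the actual change uses Spark's internal DecimalType.bounded helper, as shown in the reviewed diff further down):

import org.apache.spark.sql.types.DecimalType

// Hypothetical helper: bound a JDBC-reported precision/scale, falling back
// to decimal(38,18) when the driver reports (0, 0) for array elements.
def pgDecimalType(precision: Int, scale: Int): DecimalType =
  if (precision != 0 || scale != 0) {
    DecimalType(math.min(precision, DecimalType.MAX_PRECISION),
      math.min(scale, DecimalType.MAX_SCALE))
  } else {
    DecimalType.SYSTEM_DEFAULT // decimal(38,18), matching the AFTER schema
  }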

How was this patch tested?

Manual. Ran the PostgresIntegrationSuite.

Closes #23456

@dongjoon-hyun (Member Author):

cc @maropu

@maropu (Member) commented Jan 5, 2019:

Thanks, @dongjoon-hyun! I’ll check tonight (away from a keyboard now)

@maropu (Member) left a comment:

Is this behavior the same as in v1.6?

@dongjoon-hyun (Member Author):

Yes. Exactly. It uses decimal(38,18).

@maropu (Member) commented Jan 5, 2019:

Also, did you check that PostgresIntegrationSuite passed in your environment? It seems Jenkins does not run the integration tests.

@dongjoon-hyun (Member Author):

Yes. I ran it inside IntelliJ manually.

@dongjoon-hyun (Member Author):

Also, cc @gatorsmile and @mgaido91 .

@SparkQA commented Jan 5, 2019:

Test build #100764 has finished for PR 23458 at commit 9e1b8ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jan 5, 2019:

Test build #100765 has finished for PR 23458 at commit 74215de.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

@SparkQA commented Jan 5, 2019:

Test build #100772 has finished for PR 23458 at commit 74215de.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Retest this please.

case "timestamp" | "timestamptz" | "time" | "timetz" => Some(TimestampType)
case "date" => Some(DateType)
case "numeric" | "decimal" => Some(DecimalType.bounded(precision, scale))
case "numeric" | "decimal" if precision != 0 || scale != 0 =>
Member:

Do we need to check scale != 0, too?

Member Author:

Member:

Aha, I see. Thanks.

Contributor:
Not a very important point, but I think we'd best use precision > 0 || scale > 0. I mean, it may happen that scale is negative; in that case, precision must be > 0. In the above code, a numeric(0, -10), which is an invalid combination, is parsed as DECIMAL(0, -10). We can probably rely on this case never happening, but I think having code that is also robust to error conditions is better.

@dongjoon-hyun (Member Author), Jan 5, 2019:

Good point. If we consider negative scale cases, we should change all of the existing code together to be consistent.

Contributor:

+1 on this. We can probably also do that in a follow-up PR.
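
To make the difference between the two guards concrete, a tiny standalone check (the values are hypothetical, not from the PR):

// With the != 0 guard, an invalid pair like (precision = 0, scale = -10)
// still matches and would build a decimal(0,-10); with > 0 it falls through
// to the default branch instead.
val guardNe = (p: Int, s: Int) => p != 0 || s != 0
val guardGt = (p: Int, s: Int) => p > 0 || s > 0
assert(guardNe(0, -10))   // matches under the guard in the diff
assert(!guardGt(0, -10))  // rejected under the stricter guard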

@maropu (Member) commented Jan 5, 2019:

LGTM except for one question.

@maropu (Member) commented Jan 5, 2019:

Can you also add 'Closes #23456' to the description? I found the duplicate JIRA ticket.

@SparkQA commented Jan 5, 2019:

Test build #100777 has finished for PR 23458 at commit 74215de.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author):

Oh, I didn't notice that. I see. Thanks.

@dongjoon-hyun (Member Author):

Retest this please

@maropu (Member) commented Jan 5, 2019:

My bad, I didn't notice the existing JIRA ticket. Sorry.

@dongjoon-hyun (Member Author):

BTW, we need to backport this to older branches, right?

@maropu (Member) commented Jan 5, 2019:

Yea, I think so. Since v1.6 accepts this query, this is a kind of regression?

@dongjoon-hyun (Member Author):

Yep. I agree.

@mgaido91 (Contributor) left a comment:

I am just not sure why we have this PR duplicating #23456. Is that PR stale? It was opened a few hours ago, so I wouldn't consider it stale. Probably I am missing something. Thanks.

case "timestamp" | "timestamptz" | "time" | "timetz" => Some(TimestampType)
case "date" => Some(DateType)
case "numeric" | "decimal" => Some(DecimalType.bounded(precision, scale))
case "numeric" | "decimal" if precision != 0 || scale != 0 =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a very important point, but I think we best use precision > 0 || scale > 0. I mean, there may happen that scale is negative. In that case, precision must be > 0. In the above code a numeric(0, -10) which is an invalid combination is parsed as DECIMAL(0, -10). Probably, we can rely that this case never happens, but I think having code robust also for error conditions is better.

@dongjoon-hyun (Member Author):

Actually, the two JIRA issues were created within a short time, and I received a call on SPARK-26540. So I didn't check the other JIRA and the other PR.

@dongjoon-hyun (Member Author) commented Jan 5, 2019:

If @mgaido91 has an objection to this PR, I'll close this one. I'm okay with that; it's fair. Just tell him to pick up this code as well.

dongjoon-hyun deleted the SPARK-26540-DECIMAL-ARRAY branch, January 5, 2019 15:45
@dongjoon-hyun (Member Author):

@maropu, I closed this PR and the JIRA issue together~

@mgaido91 (Contributor) left a comment:

Mine was just a question, @dongjoon-hyun. I was wondering whether the other PR had something wrong that I couldn't figure out. I'd really appreciate it if you could provide your feedback and help there, then. Thanks.

case "timestamp" | "timestamptz" | "time" | "timetz" => Some(TimestampType)
case "date" => Some(DateType)
case "numeric" | "decimal" => Some(DecimalType.bounded(precision, scale))
case "numeric" | "decimal" if precision != 0 || scale != 0 =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this. We can probably also do that in a followup PR.

@SparkQA commented Jan 5, 2019:

Test build #100783 has finished for PR 23458 at commit 74215de.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
