Conversation

@sperlingxx (Collaborator) commented Sep 10, 2021

Signed-off-by: sperlingxx [email protected]

Fixes #3383

The current PR also fixes cases in CastOpSuite on casting string to date/timestamp which involve special dates. However, these cases won't be entirely fixed until we fix #3382 by supporting the full range of dates/timestamps on the GPU, as SPARK-35780 does.
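
For illustration of the behavior change being tracked here (an assumed example, not code from this PR; it follows the Spark 3.2 rule quoted later in this thread that special datetime strings are only recognized in typed literals):

spark.sql("SELECT CAST('epoch' AS DATE)").show()
// Spark 3.1.x : 1970-01-01  (the special string is resolved during the cast)
// Spark 3.2.0+: null        (special values are only honored in typed literals such as DATE 'epoch')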

Signed-off-by: sperlingxx <[email protected]>
Signed-off-by: sperlingxx <[email protected]>
Signed-off-by: sperlingxx <[email protected]>
@sperlingxx (Collaborator, Author)

build

@sperlingxx sperlingxx requested a review from revans2 September 10, 2021 06:05
// handle special dates like "epoch", "now", etc.
val finalResult = specialDates.foldLeft(converted)((prev, specialDate) =>
specialTimestampOr(sanitizedInput, specialDate._1, specialDate._2, prev))
val finalResult = withResource(daysEqual(sanitizedInput, DateUtils.EPOCH)) { isEpoch =>
@firestarman (Collaborator) commented Sep 10, 2021

NIT: It would be good to move this boilerplate code into a method.

def daysScalarDays(name: String): Scalar = ShimLoader.getSparkVersion match {
  // In Spark 3.2, special datetime values such as `epoch`, `today`, `yesterday`, `tomorrow`,
  // and `now` are supported in typed literals only
  case version if version >= "3.2" =>
Collaborator:

It would be better to add shims for this.

Contributor:

If we do keep the version check here (which I am ok with personally because it reduces complexity in the shim layers) then we should use more robust logic for the version check. We have some code in the test suite that could be moved elsewhere.

https://github.com/NVIDIA/spark-rapids/blob/branch-21.10/tests/src/test/scala/com/nvidia/spark/rapids/SparkQueryCompareTestSuite.scala#L1848-L1849
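
For illustration, a version check that compares numerically rather than lexically could look like the sketch below; the helper name and the parsing details are assumptions, not the test suite's actual code:

// Hedged sketch, not the plugin's actual helper: compare the Spark version numerically so that,
// for example, "3.10.0" is not treated as older than "3.2.0" the way a plain string compare would.
def isSparkVersionAtLeast(major: Int, minor: Int): Boolean = {
  // Drop any vendor suffix (e.g. "3.2.0-databricks") before parsing the numeric components.
  val parts = ShimLoader.getSparkVersion.split("[.-]")
  val (curMajor, curMinor) = (parts(0).toInt, parts(1).toInt)
  curMajor > major || (curMajor == major && curMinor >= minor)
}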

Collaborator:

+1 to keep the version check. I would suggest moving @andygrove's implementation to ShimVersion to make shim versions comparable.

Collaborator:

The reason to use a shim over a version check is mostly databricks and other vendors that pull back "fixes" not directly related to version numbers. I doubt it will happen in this case, but similar things have happened in the past.

@gerashegalov (Collaborator) commented Sep 10, 2021

I think it's something we will find out via the test being broken on Databricks. If we go with the ShimVersion comparison implementation, the fix is either an additional condition on the Databricks shim version comparison, or it's more involved, at which point we will have to add a new shim method.

@sperlingxx (Collaborator, Author):

I replaced the version dispatch with a shim method.

// handle special dates like "epoch", "now", etc.
specialDates.foldLeft(converted)((prev, specialDate) =>
specialDateOr(sanitizedInput, specialDate._1, specialDate._2, prev))
withResource(daysEqual(sanitizedInput, DateUtils.EPOCH)) { isEpoch =>
Collaborator:

This change feels like a step backwards. Before, with the folding, the maximum extra memory needed on the GPU was one boolean column and the new output column. With this change we now have 5 boolean columns and 5 temporary output columns.
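
For context, a rough sketch of the folding pattern being described; the helper names (specialDateOr, daysEqual) come from the snippets in this thread, while the exact signatures, the cudf calls, and the use of the plugin's withResource helper are assumptions rather than a verbatim copy of the plugin:

import ai.rapids.cudf.{ColumnVector, Scalar}

// Each step closes the previous candidate result, so at any point only one boolean column
// and one candidate output column are alive on the GPU in addition to the input.
def specialDateOr(
    input: ColumnVector,
    special: String,
    value: Int,
    orElse: ColumnVector): ColumnVector = {
  withResource(orElse) { prev =>                          // close the previous candidate
    withResource(daysEqual(input, special)) { isSpecial => // one boolean column at a time
      withResource(Scalar.timestampDaysFromInt(value)) { specialValue =>
        isSpecial.ifElse(specialValue, prev)
      }
    }
  }
}

// Fold over the special dates, threading the candidate result through each step.
val result = specialDates.foldLeft(converted) { case (prev, (name, value)) =>
  specialDateOr(sanitizedInput, name, value, prev)
}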

@sperlingxx (Collaborator, Author):

Yes, it was not optimal. I reworked this part, cleaning up temporary GPU resources as early as possible.

val startTimeSeconds = System.currentTimeMillis()/1000L
val cpuNowSeconds = withCpuSparkSession(now).collect().head.toSeq(1).asInstanceOf[Long]
val gpuNowSeconds = withGpuSparkSession(now).collect().head.toSeq(1).asInstanceOf[Long]
assert(cpuNowSeconds >= startTimeSeconds)
Contributor:

What is the reason for removing these assertions? Are they no longer valid?

@sperlingxx (Collaborator, Author):

Yes, under Spark 3.2+, the result will be zero instead of the current time, since NOW is no longer being parsed in 3.2.

Collaborator:

Then have the test check the version and have it check that now is replaced with a 0 instead. Or just have the test skip entirely if it is 3.2+.
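
A minimal sketch of that version-gated test, assuming a helper along the lines of cmpSparkVersion from the test suite (the helper name and its return convention are assumptions):

// Hypothetical sketch: gate the assertions on the Spark version under test.
if (cmpSparkVersion(3, 2, 0) < 0) {
  // Before 3.2 the special value "now" is parsed during the cast, so the result
  // must be at least the time recorded before running the query.
  assert(cpuNowSeconds >= startTimeSeconds)
  assert(gpuNowSeconds >= startTimeSeconds)
} else {
  // On 3.2+ "now" is no longer a special cast value, so the parsed result comes back as 0.
  assert(cpuNowSeconds == 0)
  assert(gpuNowSeconds == 0)
}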

@sperlingxx (Collaborator, Author)

build

@sperlingxx (Collaborator, Author)

build

@pxLi (Member) commented Sep 15, 2021

The CI run was assigned to non-reserved instances, which caused the PVC driver failure. I will report it to blossom.

@pxLi (Member) commented Sep 15, 2021

build

// `converted` will be closed in replaceSpecialDates. We wrap it with closeOnExcept in case
// of exception before replaceSpecialDates.
val finalResult = closeOnExcept(converted) { timeStampVector =>
val specialDates = Seq(DateUtils.EPOCH, DateUtils.NOW, DateUtils.TODAY,
Contributor:

It seems expensive to replace the special dates with null on Spark 3.2 rather than just skipping them entirely. They will already be ignored by is_timestamp.

I will pull this PR locally today and experiment with it and come back with more detailed suggestions.

@andygrove (Contributor)

I looked at how I would approach this and I don't think there is a need to make any shim changes here. Spark 3.2 doesn't support the special dates so it seems confusing to add related logic there. There are also no changes in Spark APIs that would cause us to need shims.

I think this issue can be resolved more simply with some version checks in a couple of places:

In DateUtils, where we return maps of special dates to literal values we can return empty maps with Spark 3.2:

def specialDatesDays: Map[String, Int] = if (spark320orLater) {
  Map.empty
} else {
  val today = currentDate()
  Map(
    EPOCH -> 0,
    NOW -> today,
    TODAY -> today,
    YESTERDAY -> (today - 1),
    TOMORROW -> (today + 1)
  )
}

(and the same for specialDatesSeconds and specialDatesMicros).

Then in org.apache.spark.sql.rapids.GpuToTimestamp#parseStringAsTimestamp we can do:

if (spark320orLater) {
  withResource(isTimestamp(lhs.getBase, sparkFormat, strfFormat)) { isTimestamp =>
    withResource(asTimestamp(lhs.getBase, strfFormat)) { converted =>
      withResource(Scalar.fromNull(dtype)) { nullValue =>
        isTimestamp.ifElse(converted, nullValue)
      }
    }
  }
} else {
 // original complex logic that handles special dates
}

Signed-off-by: sperlingxx <[email protected]>
@sperlingxx (Collaborator, Author)

I fully agree on this approach, and I've made corresponding changes. Thanks for such a detailed explanation!

@sperlingxx (Collaborator, Author)

build

@sperlingxx (Collaborator, Author)

build

Signed-off-by: sperlingxx <[email protected]>
@sperlingxx (Collaborator, Author)

build

@tgravescs (Collaborator)

I pulled this PR and we still have CastOpSuite test failures. Was this intended to handle all of those? It's referenced in the description, so I was assuming so.

@sperlingxx (Collaborator, Author) commented Sep 17, 2021

I pulled this PR and we still have CastOpSuite test failures. Was this intended to handle all of those? It's referenced in the description, so I was assuming so.

Hi @tgravescs, this PR cannot fix some of the failing cases in CastOpSuite, which are due to #3382. I am going to simply disable those cases for Spark 3.2+ after finishing this PR. It will be a temporary solution, since we need native support to fix #3382 completely.

@revans2 (Collaborator) commented Sep 17, 2021

It will be a temporary solution, since we need native support to fix #3382 completely.

We have to fall back to the CPU for this case unless the user opts into letting us do the cast anyway. Disabling the tests is not a solution.

@sperlingxx (Collaborator, Author) commented Sep 17, 2021

It will be a temporary solution, since we need native support to fix #3382 completely.

We have to fall back to the CPU for this case unless the user opts into letting us do the cast anyway. Disabling the tests is not a solution.

Oh, I thought we already had corresponding options to disable casting from string to date/timestamp by default. But I just found that I was wrong: we only have the config RapidsConf.ENABLE_CAST_STRING_TO_TIMESTAMP. IIUC, we also need to add RapidsConf.ENABLE_CAST_STRING_TO_DATE?

@revans2 (Collaborator) commented Sep 17, 2021

We should add RapidsConf.ENABLE_CAST_STRING_TO_DATE and only use it for 3.2.0+. We also need to make it clear in its docs that it is only for 3.2.0+ because of 7 digit year support. We should also update the docs for RapidsConf.ENABLE_CAST_STRING_TO_TIMESTAMP to include information about the 7 digit year support in 3.2.0+.
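
For illustration, such a config could be declared in RapidsConf along the lines of the existing ENABLE_CAST_STRING_TO_TIMESTAMP entry; the key name and doc text below are assumptions, not the final implementation:

// Hedged sketch of a RapidsConf entry; key name and wording are assumed, not from this PR.
val ENABLE_CAST_STRING_TO_DATE = conf("spark.rapids.sql.castStringToDate.enabled")
  .doc("When set to true, casting from string to date is enabled on the GPU. This config " +
    "only matters on Spark 3.2.0+, where the GPU does not yet match the CPU for inputs such " +
    "as 7 digit years, so it is off by default.")
  .booleanConf
  .createWithDefault(false)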

@revans2 (Collaborator) left a comment

I am not thrilled with VersionUtils moving to the plugin as I explained here. But just to get something shipped I am going to approve this and we can revisit the topic later on.

@revans2 (Collaborator) commented Sep 17, 2021

That said we do need to fix the 7 digit year parsing as a separate issue.

@revans2 revans2 merged commit 695f82b into NVIDIA:branch-21.10 Sep 17, 2021
@revans2 (Collaborator) commented Sep 17, 2021

Just FYI I filed #3530 to handle the 7 digit year fallback issue. I will take a crack at it because hopefully I can get it done today.

@sperlingxx sperlingxx deleted the stopTransferSpecialDates_320 branch December 2, 2021 02:41