
Conversation

@cloud-fan
Contributor

@cloud-fan cloud-fan commented May 19, 2020

What changes were proposed in this pull request?

This PR allows missing year and hour fields when parsing date/timestamp strings, with 1970 as the default year and 0 as the default hour.

Why are the changes needed?

To keep backward compatibility with Spark 2.4.

Does this PR introduce any user-facing change?

Yes.

Spark 2.4:

scala> sql("select to_timestamp('16', 'dd')").show
+------------------------+
|to_timestamp('16', 'dd')|
+------------------------+
|     1970-01-16 00:00:00|
+------------------------+


scala> sql("select to_date('16', 'dd')").show
+-------------------+
|to_date('16', 'dd')|
+-------------------+
|         1970-01-16|
+-------------------+


scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+----------------------------------+
|to_timestamp('2019 40', 'yyyy mm')|
+----------------------------------+
|               2019-01-01 00:40:00|
+----------------------------------+


scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+----------------------------------------------+
|to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')|
+----------------------------------------------+
|                           2019-01-01 10:10:10|
+----------------------------------------------+

In branch 3.0:

scala> sql("select to_timestamp('16', 'dd')").show
+--------------------+
|to_timestamp(16, dd)|
+--------------------+
|                null|
+--------------------+


scala> sql("select to_date('16', 'dd')").show
+---------------+
|to_date(16, dd)|
+---------------+
|           null|
+---------------+

scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+------------------------------+
|to_timestamp(2019 40, yyyy mm)|
+------------------------------+
|           2019-01-01 00:00:00|
+------------------------------+


scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+------------------------------------------+
|to_timestamp(2019 10:10:10, yyyy hh:mm:ss)|
+------------------------------------------+
|                       2019-01-01 00:00:00|
+------------------------------------------+

After this PR, the behavior becomes the same as 2.4.

How was this patch tested?

new tests

case _: Throwable => None
}
if (res.nonEmpty) {
case e: DateTimeException if SQLConf.get.legacyTimeParserPolicy == EXCEPTION =>
Contributor Author

@cloud-fan cloud-fan May 19, 2020

When the field value exceeds the valid range, the JDK throws DateTimeException, and we should catch it too. DateTimeParseException extends DateTimeException.
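For reference, a minimal Scala sketch of the JDK exception relationship this relies on (the values are illustrative only, not from the PR):

```scala
import java.time.DateTimeException
import java.time.format.DateTimeParseException

// DateTimeParseException extends DateTimeException, so a single catch clause on
// DateTimeException covers both parse failures and out-of-range field values.
val err: DateTimeException = new DateTimeParseException("field out of range", "2019 40", 5)
println(err.getMessage)
```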

.parseDefaulting(ChronoField.MONTH_OF_YEAR, 1)
.parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
.parseDefaulting(ChronoField.MINUTE_OF_HOUR, 0)
.parseDefaulting(ChronoField.SECOND_OF_MINUTE, 0)
Contributor Author

It's too fragile to use these default values, as parsing may fail due to conflicts. This PR decides the default values manually, after parsing. See https://github.com/apache/spark/pull/28576/files#diff-3e4b85f54d2e75f1b5b845279749024eR42
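For context, a minimal sketch of that post-parsing defaulting approach, using the field names that appear in the diff (the exact PR code differs in details such as error handling):

```scala
import java.time.LocalDate
import java.time.temporal.{ChronoField, TemporalAccessor}

// Instead of registering parseDefaulting() fields on the builder (which can conflict
// with fields that the pattern actually parses), query the parsed accessor and fall
// back to a hard-coded default only when the field is absent.
def getOrDefault(accessor: TemporalAccessor, field: ChronoField, default: Int): Int =
  if (accessor.isSupported(field)) accessor.get(field) else default

def toLocalDate(accessor: TemporalAccessor): LocalDate = {
  val year = getOrDefault(accessor, ChronoField.YEAR, 1970)
  val month = getOrDefault(accessor, ChronoField.MONTH_OF_YEAR, 1)
  val day = getOrDefault(accessor, ChronoField.DAY_OF_MONTH, 1)
  LocalDate.of(year, month, day)
}
```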

@cloud-fan
Contributor Author

@SparkQA

SparkQA commented May 19, 2020

Test build #122832 has finished for PR 28576 at commit 086eeb5.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.


// Converts the parsed temporal object to ZonedDateTime. It sets time components to zeros
// if they do not exist in the parsed object.
protected def toZonedDateTime(
Member

I guess the additional function calls will slow down overall performance. I think it makes sense to re-run the datetime benchmarks.

Contributor Author

I agree. Can you help re-generate the benchmark numbers and send me a PR? You generated them last time, and I'm afraid rerunning them locally on my side won't show the difference.

}

protected def toLocalDate(temporalAccessor: TemporalAccessor): LocalDate = {
val year = getOrDefault(temporalAccessor, ChronoField.YEAR, 1970)
Member

How often do users expect 1970 as the default year, do you think? I would propose the current year, as we do in stringToTimestamp:

Contributor Author

@cloud-fan cloud-fan May 19, 2020

It's from 2.4. I don't care too much about it as long as we have a default value.

Member

This is somewhat inconsistent behaviour between CAST and parsing via patterns (in to_timestamp, for example). Should we document the default year somewhere?

val hour = if (temporalAccessor.isSupported(ChronoField.HOUR_OF_DAY)) {
temporalAccessor.get(ChronoField.HOUR_OF_DAY)
} else if (temporalAccessor.isSupported(ChronoField.HOUR_OF_AMPM)) {
// When we reach here, is mean am/pm is not specified. Here we assume it's am.
Member

is mean am/pm is not specified. -> it means that am/pm is not specified.
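For readers following the diff, a minimal sketch of the hour-defaulting logic that the hunk above implements (assuming the surrounding accessor handling is as shown; not the PR's exact code):

```scala
import java.time.temporal.{ChronoField, TemporalAccessor}

// Prefer HOUR_OF_DAY (24-hour patterns). If only HOUR_OF_AMPM was parsed, i.e. the
// pattern had an hour field but no am/pm marker, assume AM and use the value as-is.
// If no hour field was parsed at all, default to 0.
def resolveHour(accessor: TemporalAccessor): Int =
  if (accessor.isSupported(ChronoField.HOUR_OF_DAY)) {
    accessor.get(ChronoField.HOUR_OF_DAY)
  } else if (accessor.isSupported(ChronoField.HOUR_OF_AMPM)) {
    accessor.get(ChronoField.HOUR_OF_AMPM)
  } else {
    0
  }
```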

}

protected def toLocalDate(temporalAccessor: TemporalAccessor): LocalDate = {
val year = getOrDefault(temporalAccessor, ChronoField.YEAR, 1970)
Member

What about the case when ERA is set? In that case, the year will be in the YEAR_OF_ERA field, right?

Contributor Author

TemporalAccessor.get is not simply reading a stored field; it's actually a query. So for yyyy G, we can still query the YEAR field because both YEAR_OF_ERA and ERA are available.
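To illustrate the point about TemporalAccessor.get behaving like a query, a standalone JDK sketch (not PR code):

```scala
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.ChronoField
import java.util.Locale

// The pattern parses YEAR_OF_ERA and ERA, not YEAR, yet YEAR can still be queried
// because the two parsed fields together determine the proleptic year.
val formatter = new DateTimeFormatterBuilder()
  .appendPattern("yyyy G")
  .toFormatter(Locale.US)
val parsed = formatter.parse("2019 AD")
println(parsed.get(ChronoField.YEAR)) // 2019
```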


val micros1 = formatter.parse("2009-12-12 00")
assert(micros1 === TimeUnit.SECONDS.toMicros(
LocalDateTime.of(2009, 12, 12, 0, 0, 0).toEpochSecond(ZoneOffset.UTC)))
Member

Can you use DateTimeTestUtils.days

// Returns microseconds since epoch for the given date
def date(
year: Int,
month: Byte = 1,
day: Byte = 1,
hour: Byte = 0,
minute: Byte = 0,
sec: Byte = 0,
micros: Int = 0,
zid: ZoneId = ZoneOffset.UTC): Long = {
val nanos = TimeUnit.MICROSECONDS.toNanos(micros).toInt
val localDateTime = LocalDateTime.of(year, month, day, hour, minute, sec, nanos)
localDateTimeToMicros(localDateTime, zid)
}
?
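For illustration, the quoted assertion could then be written with that helper (a hypothetical rewrite; `formatter` is the one from the test above and `date` is the pasted helper with its UTC default):

```scala
// Hypothetical rewrite of the quoted assertion using the DateTimeTestUtils helper.
val micros1 = formatter.parse("2009-12-12 00")
assert(micros1 === date(2009, 12, 12))
```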

test("parsing hour with various patterns") {
def createFormatter(pattern: String): TimestampFormatter = {
// Use `SIMPLE_DATE_FORMAT`, so that the legacy parser also fails with invalid value range.
TimestampFormatter(pattern, ZoneOffset.UTC, LegacyDateFormats.SIMPLE_DATE_FORMAT, false)
Member

You can import UTC from DateTimeTestUtils instead of spreading ZoneOffset.UTC everywhere in tests.


val micros2 = formatter.parse("2009-12-12 15")
assert(micros2 === TimeUnit.SECONDS.toMicros(
LocalDateTime.of(2009, 12, 12, 15, 0, 0).toEpochSecond(ZoneOffset.UTC)))
Member

@MaxGekk MaxGekk May 19, 2020

I would propose improving test coverage and testing different years (-100, 0001, 1582, 1900, 1970, 2037, 2100) instead of checking only one year in all tests.

Contributor Author

It's a unit test, which is whitebox testing. It's clear from the code that time parsing is unrelated to the value of the year.

}

protected def toLocalDate(temporalAccessor: TemporalAccessor): LocalDate = {
val year = getOrDefault(temporalAccessor, ChronoField.YEAR, 1970)
Member

Maybe leave a comment here to explain that the default year/month/day is for compatibility with Spark 2.4; it's a little strange to hard-code the value here.

@bart-samwel

Note that if we were designing this fresh, the year 1970 would not be a great choice when it would be used to parse arbitrary Month/Day combinations. The reason being that it is not a leap year, which means that it would never parse Feb 29.

This behavior seems to be pretty broken. We should consider forbidding missing fields in the middle / from the top, i.e., you have to have a prefix of (year, month, day, hour, minute, second, millis/micros/nanos). If anybody wants the defaults, they can simulate them by prefixing with their own values, e.g. if you want to parse hh:mm:ss then you have to use to_timestamp(concat('1970-01-01 ', the_thing_we_want_to_parse), 'yyyy-MM-dd hh:mm:ss'). We can even hint at this in the error message if we forbid this by default, with a legacy config that people can use as a fallback.
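As a concrete illustration of that workaround, in the style of the spark-shell examples above (a sketch, not part of the PR; HH is used for a 24-hour clock):

```scala
// Parse a time-only value by prepending a fixed date to both the input and the pattern.
// This should display 1970-01-01 10:10:10 regardless of the missing-field handling.
sql("select to_timestamp(concat('1970-01-01 ', '10:10:10'), 'yyyy-MM-dd HH:mm:ss')").show
```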

@cloud-fan
Contributor Author

The reason being that it is not a leap year, which means that it would never parse Feb 29.

This is a good point. Now I agree 1970 is not a good default. I'll fail it by default, with a legacy config to use 1970 as default.

1970
} else {
throw new SparkUpgradeException("3.0",
"Year must be given in the date/timestamp string to be parsed. You can set " +


Maybe also suggest the alternative workaround, e.g. prepending '1970 ' to the string-to-be-parsed and prepending 'yyyy ' to the format string. That one works without the legacy setting so I'd say it's preferred.

Contributor Author

A problem is that the pattern is flexible and the year may not be at the beginning, e.g. MM/dd yyyy .... I don't think there is a workaround, unless we provide an API to allow users to set the default values.


That's actually great -- if any part is missing from the format and the input, they can just provide a default by adding a constant to the input and a matching constant to the format string. I.e., if the real format string is "MM dd" or "dd MM" or anything arbitrary but you're missing the year, you can always prepend "yyyy " to the format string and "2009 " to the input value, and that fills out that part of the value.
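A sketch of that prepend trick for a pattern that lacks the year entirely (again in the spark-shell style of the examples above; not part of the PR):

```scala
// The real pattern is 'MM dd'; prepending 'yyyy ' to the pattern and '2009 ' to the
// input supplies the missing year, so this should display 2009-12-31 00:00:00.
sql("select to_timestamp(concat('2009 ', '12 31'), 'yyyy MM dd')").show
```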

SQLConf.LEGACY_ALLOW_MISSING_YEAR_DURING_PARSING.key + " to true, to pick 1970 as " +
"the default value of year.", null)
}
val month = getOrDefault(accessor, ChronoField.MONTH_OF_YEAR, 1)


Are we also going to error out if they specify the day but not the month? Really, the only formats that make sense are the ones where a full prefix is given in the y-m-d h-m-s sequence, and all others are likely to be a case where they made a mistake (e.g. asked for "mm" twice where they meant MM).

Contributor Author

January should be fine? It accepts any DAY_OF_MONTH.

Contributor Author

asked for "mm" twice where they meant MM

This will error out unless the two mm fields parse to the same value, e.g. mm mm yyyy with the value 10 10 1990.


I'm saying it works (i.e., it accepts all valid values) but it's nonsensical; it'll never return the complete timestamp that people intended.

That said, I can see cases where people want to parse contiguous ranges (e.g. just a time, or just a month, or just a month + day) and then extract the parsed components using EXTRACT / date_part, or just to format them differently.

So I guess the most permissive but still safe thing here would be to forbid missing "year" only if someone also requests "day" and "month", because that's the situation where "year" affects the valid range of "day". But if they don't request year and they don't request month, then month would default to 1 and there would be no impact on the valid range of days; it would always go up to 31. I.e.:

"yyyy MM dd" -> OK
"MM dd" -> fail, need year to know valid range of dd
"MM" -> OK
"dd" -> OK
"hh:mm:ss" -> OK
"MM dd hh:mm:ss" -> fail, need year to know valid range of dd

Contributor Author

ah this looks better!

I'm wondering if we still need the legacy config though. It's very hard to hit the missing year exception, and if it happens it's very likely to be a data error. Users should fix the data or change the pattern and provide the default year.

Contributor Author

ok so we should go back to the first version, which just uses 1970 as the default value and fails for invalid dates at runtime 😂

Contributor Author

BTW the current framework just returns null if the parser fails. Shall we treat this case specially and throw an error that fails the query?


Error out in ANSI mode only, I'd say, just like with all other parse failures (which should fail in ANSI mode -- I hope they already do).

FWIW, going back to the first version is not a shame now that we know it's the right thing to do. :) Maybe we can leave some comments in the code to explain the reasoning why we don't error out if things are missing.

Contributor Author

Unfortunately, datetime functions don't respect the ANSI flag at all. Maybe we should keep everything the same as Spark 2.4 (return null when parsing Feb 29, because 1970-02-29 is invalid), and implement the ANSI behavior in Spark 3.1?


Yikes! It seems terrible that we have to do that later, because then it can break existing workloads. Launching the ANSI behavior was our chance to do the right thing here. If we can still fix this in 3.0 then that would of course be preferable, but that seems like a stretch...

val LEGACY_ALLOW_MISSING_YEAR_DURING_PARSING =
buildConf("spark.sql.legacy.allowMissingYearDuringParsing")
.internal()
.doc("When true, DateFormatter/TimestampFormatter allows parsing date/timestamp string " +


Here too, you could suggest the alternative workaround that doesn't require setting the legacy flag.

@SparkQA

SparkQA commented May 19, 2020

Test build #122851 has finished for PR 28576 at commit 58443e2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 20, 2020

Test build #122880 has finished for PR 28576 at commit b17a648.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 20, 2020

Test build #122877 has finished for PR 28576 at commit afb070c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member

MaxGekk commented May 20, 2020

I have re-generated the results of DateTimeBenchmark, CSVBenchmark and JsonBenchmark on JDK 8 and 11. @cloud-fan Please review and merge this PR: cloud-fan#17

@cloud-fan cloud-fan force-pushed the bug branch 2 times, most recently from 8c2ce05 to 9f03a47 on May 21, 2020 09:35
@SparkQA

SparkQA commented May 21, 2020

Test build #122923 has finished for PR 28576 at commit 8de3bda.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented May 21, 2020

Test build #122932 has finished for PR 28576 at commit 161223b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@HyukjinKwon HyukjinKwon left a comment

LGTM - I am just going to merge and ignore one nit.

import org.apache.spark.sql.internal.SQLConf.LegacyBehaviorPolicy._

trait DateTimeFormatterHelper {
private def getOrDefault(accessor: TemporalAccessor, field: ChronoField, default: Int): Int = {
Member

nit: I would follow the naming of getOrElse though ..

@HyukjinKwon
Member

HyukjinKwon commented May 22, 2020

Merged to master.

@HyukjinKwon
Member

HyukjinKwon commented May 22, 2020

@cloud-fan, there's a conflict in the SQL file output against branch-3.0. It's trivial, but can you open a PR to backport it since we're running RCs?

cloud-fan added a commit to cloud-fan/spark that referenced this pull request May 22, 2020
…p string

This PR allows missing hour fields when parsing date/timestamp strings, with 0 as the default value.

If the year field is missing, this PR still fails the query by default, but provides a new legacy config to allow it and use 1970 as the default value. It's not a good default value, as 1970 is not a leap year, which means Feb 29 would never parse. We pick it only for backward compatibility.

To keep backward compatibility with Spark 2.4.

Yes.

Spark 2.4:
```
scala> sql("select to_timestamp('16', 'dd')").show
+------------------------+
|to_timestamp('16', 'dd')|
+------------------------+
|     1970-01-16 00:00:00|
+------------------------+

scala> sql("select to_date('16', 'dd')").show
+-------------------+
|to_date('16', 'dd')|
+-------------------+
|         1970-01-16|
+-------------------+

scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+----------------------------------+
|to_timestamp('2019 40', 'yyyy mm')|
+----------------------------------+
|               2019-01-01 00:40:00|
+----------------------------------+

scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+----------------------------------------------+
|to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')|
+----------------------------------------------+
|                           2019-01-01 10:10:10|
+----------------------------------------------+
```

in branch 3.0
```
scala> sql("select to_timestamp('16', 'dd')").show
+--------------------+
|to_timestamp(16, dd)|
+--------------------+
|                null|
+--------------------+

scala> sql("select to_date('16', 'dd')").show
+---------------+
|to_date(16, dd)|
+---------------+
|           null|
+---------------+

scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+------------------------------+
|to_timestamp(2019 40, yyyy mm)|
+------------------------------+
|           2019-01-01 00:00:00|
+------------------------------+

scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+------------------------------------------+
|to_timestamp(2019 10:10:10, yyyy hh:mm:ss)|
+------------------------------------------+
|                       2019-01-01 00:00:00|
+------------------------------------------+
```

After this PR, the behavior becomes the same as 2.4, if the legacy config is enabled.

new tests

Closes apache#28576 from cloud-fan/bug.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
cloud-fan added a commit to cloud-fan/spark that referenced this pull request May 22, 2020
…p string

cloud-fan pushed a commit that referenced this pull request May 25, 2020
… results

### What changes were proposed in this pull request?
Re-generate results of:
- DateTimeBenchmark
- CSVBenchmark
- JsonBenchmark

in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

### Why are the changes needed?
1. The PR #28576 changed the date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing.
2. CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as the other benchmarks to have a baseline for future optimizations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running benchmarks via the script:
```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #28613 from MaxGekk/missing-hour-year-benchmarks.

Authored-by: Max Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request May 25, 2020
… results

(cherry picked from commit 92685c0)
Signed-off-by: Wenchen Fan <[email protected]>