Updated documentation and fixed tests
Jonathancui123 committed Jul 12, 2022
commit 638064bb847396c749b0cf81df33c2ce03d3af46
6 changes: 6 additions & 0 deletions docs/sql-data-sources-csv.md
@@ -108,6 +108,12 @@ Data source options of CSV can be set via:
<td>Infers the input schema automatically from data. It requires one extra pass over the data. CSV built-in functions ignore this option.</td>
<td>read</td>
</tr>
<tr>
<td><code>inferDate</code></td>
<td>false</td>
<td>Whether to infer columns whose values satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. Legacy date formats in <code>Timestamp</code> columns cannot be parsed with this option.</td>
<td>read</td>
</tr>
<tr>
<td><code>enforceSchema</code></td>
<td>true</td>
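As a rough standalone illustration of the documented semantics (not Spark's actual code path; the helper name is invented): a field should only infer as `Date` when the whole value satisfies `dateFormat`, so a timestamp-like value must be rejected.

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// Hypothetical helper mirroring the documented behavior: a field is a
// candidate for Date inference only if it parses completely against the
// dateFormat pattern ("yyyy-MM-dd" is Spark's documented default).
val dateFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd")

def matchesDateFormat(field: String): Boolean =
  try { LocalDate.parse(field, dateFormat); true }
  catch { case _: DateTimeParseException => false }
```

`LocalDate.parse` requires the entire input to be consumed, so `"2001-09-08 12:00:00"` fails while `"2001-09-08"` succeeds.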
@@ -141,8 +141,18 @@ class CSVInferSchema(val options: CSVOptions) extends Serializable {
case _: DecimalType => tryParseDecimal(field)
case DoubleType => tryParseDouble(field)
case DateType => tryParseDateTime(field)
A Member commented:
It seems changing the method tryParseDouble should be enough

@Jonathancui123 (Contributor, Author) replied on Jun 23, 2022:
I think this change is necessary: consider a column with a TimestampType entry followed by a DateType entry. We would expect this column to be inferred as a TimestampType column.
`typeSoFar` will be Timestamp when `inferField` is called on the second entry, which is DateType, so `inferField` needs logic to try parsing a Date even when `typeSoFar` is Timestamp.
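The widening this comment describes can be sketched standalone with toy types (`DateT`/`TsT` stand in for Spark's `DateType`/`TimestampType`; this is an illustration, not the PR's actual compatibility logic):

```scala
// Toy model: a Date entry seen while typeSoFar is Timestamp widens to
// Timestamp (a date is representable as a midnight timestamp); mismatched
// types fall back to String.
sealed trait T
case object DateT extends T
case object TsT extends T
case object StrT extends T

def merge(typeSoFar: T, entry: T): T = (typeSoFar, entry) match {
  case (a, b) if a == b            => a
  case (TsT, DateT) | (DateT, TsT) => TsT
  case _                           => StrT
}

// A Timestamp entry followed by a Date entry infers Timestamp overall.
val inferred = Seq[T](TsT, DateT).reduce(merge)
```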

case TimestampNTZType => tryParseTimestampNTZ(field)
case TimestampType => tryParseTimestamp(field)
case TimestampNTZType =>
if (options.inferDate) {
tryParseDateTime(field)
} else {
tryParseTimestampNTZ(field)
}
case TimestampType =>
if (options.inferDate) {
tryParseDateTime(field)
} else {
tryParseTimestamp(field)
}
case BooleanType => tryParseBoolean(field)
case StringType => StringType
case other: DataType =>
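The new dispatch above can be approximated standalone: with `inferDate` set, a field is tried as a date first and then as a timestamp (toy formatter patterns; Spark's real implementation goes through its own parsers and options):

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

sealed trait Inferred
case object DateT extends Inferred
case object TimestampT extends Inferred
case object StringT extends Inferred

val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")
val tsFmt   = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// Approximation of tryParseDateTime: prefer Date, fall back to Timestamp,
// then String, matching the dispatch in the diff when inferDate is true.
def tryParseDateTime(field: String): Inferred =
  if (Try(LocalDate.parse(field, dateFmt)).isSuccess) DateT
  else if (Try(LocalDateTime.parse(field, tsFmt)).isSuccess) TimestampT
  else StringT
```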
@@ -19,12 +19,12 @@ package org.apache.spark.sql.catalyst.csv

import java.math.BigDecimal
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.time.{ZoneId, ZoneOffset}
import java.time.{ZoneOffset}
import java.util.{Locale, TimeZone}

import org.apache.commons.lang3.time.FastDateFormat
import org.apache.spark.SparkFunSuite

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.plans.SQLHelper
import org.apache.spark.sql.catalyst.util.DateTimeConstants._
@@ -367,11 +367,15 @@ class UnivocityParserSuite extends SparkFunSuite with SQLHelper {
"timestampNTZFormat" -> "dd-MM-yyyy HH:mm", "dateFormat" -> "dd_MM_yyyy"),
A Member commented:
One test we might need would be:

`"timestampFormat" -> "dd/MM/yyyy HH:mm"` and `"dateFormat" -> "dd/MM/yyyy"`, to make sure timestamps are not parsed as date types without conflicting.

@bersprockets (Contributor) replied on Jun 16, 2022:

> to make sure timestamps are not parsed as date types without conflicting.

That's actually what happens:

Before this PR:

scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: timestamp]

scala> df.printSchema
root
 |-- _c0: integer (nullable = true)
 |-- _c1: timestamp (nullable = true)

scala> 

After this PR:

scala> val csvInput = Seq("0,2012-01-01 12:00:00", "1,2021-07-01 15:00:00").toDS()
csvInput: org.apache.spark.sql.Dataset[String] = [value: string]

scala> val df = spark.read.option("inferSchema", "true").csv(csvInput)
df: org.apache.spark.sql.DataFrame = [_c0: int, _c1: date]

scala> df.printSchema
root
 |-- _c0: integer (nullable = true)
 |-- _c1: date (nullable = true)

scala>

It looks like some tests fail too, like CSVInferSchemaSuite and CSVv1Suite, possibly others (I ran these two suites on my laptop; for some reason, the GitHub Actions didn't run tests for this PR. Maybe @Jonathancui123 needs to turn them on in his fork?).

> We should probably 1. add either SQL configuration or an option e.g., infersDate

I think you would need something like that: when set, the date formatter could use the slower, more strict method of parsing (so "2012-01-01 12:00:00" wouldn't parse as a date).

Edit: To do strict parsing, one might need to use ParsePosition and check that the whole date/time value string was consumed. Even after setting `lenient = false`, `SimpleDateFormat.parse` didn't complain about extra characters that weren't consumed.
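The observation in that edit can be reproduced with a small standalone check: `SimpleDateFormat.parse(String)` stops at the first character it cannot parse and still succeeds, while tracking a `ParsePosition` reveals whether the whole input was consumed.

```scala
import java.text.{ParsePosition, SimpleDateFormat}

val fmt = new SimpleDateFormat("yyyy-MM-dd")
fmt.setLenient(false)

// parse(String) succeeds on "2012-01-01 12:00:00": the trailing time is
// silently ignored, which is why a timestamp could infer as a date.
val loose = fmt.parse("2012-01-01 12:00:00") // no exception thrown

// Strict variant: require the parse to consume the entire input.
def parsesCompletely(s: String): Boolean = {
  val pos = new ParsePosition(0)
  val d = fmt.parse(s, pos)
  d != null && pos.getErrorIndex == -1 && pos.getIndex == s.length
}
```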

@Jonathancui123 (Contributor, Author) replied:
I addressed the inference mistakes in the following code snippet and comment.

false, "UTC")
val dateString = "08_09_2001"
val expected = date(2001, 9, 8, 0, 0, 0, 0, ZoneOffset.UTC)
val expected = dataType match {
case TimestampType | TimestampNTZType => date(2001, 9, 8, 0, 0, 0, 0, ZoneOffset.UTC)
case DateType => days(2001, 9, 8)
}
val parser = new UnivocityParser(new StructType(), timestampsOptions)
assert(parser.makeConverter("d", dataType).apply(dateString) == expected)
}
checkDate(TimestampType)
checkDate(TimestampNTZType)
checkDate(DateType)
}
}
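For context on the expected values in this test: Spark's test helpers `date(...)` and `days(...)` produce an epoch-based timestamp and an epoch-day count respectively. A minimal standalone stand-in for `days` (assuming that semantics; not Spark's actual helper) is:

```scala
import java.time.LocalDate

// Hypothetical stand-in for the days(...) test helper: the number of
// calendar days since the Unix epoch (1970-01-01) for the given date.
def days(year: Int, month: Int, day: Int): Int =
  LocalDate.of(year, month, day).toEpochDay.toInt
```

This is why the expected value differs by `dataType`: a `DateType` result is compared against an epoch-day count, while the timestamp types are compared against a full instant.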