Skip to content
Closed
Changes from 1 commit
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Prev Previous commit
Next Next commit
Move out column name handling logic.
  • Loading branch information
dongjoon-hyun committed Nov 29, 2017
commit cc40fba4e4f12d7f65dd7f426fc20f3b5c2a7b6e
Original file line number Diff line number Diff line change
Expand Up @@ -40,14 +40,19 @@ private[orc] class OrcDeserializer(

private[this] val length = requiredSchema.length

private[this] val unwrappers = requiredSchema.map(_.dataType).map(unwrapperFor).toArray
private[this] val unwrappers = requiredSchema.map { f =>
if (missingColumnNames.contains(f.name)) {
(value: Any, row: InternalRow, ordinal: Int) => row.setNullAt(ordinal)
} else {
unwrapperFor(f.dataType)
}
}.toArray

def deserialize(orcStruct: OrcStruct): InternalRow = {
var i = 0
val names = orcStruct.getSchema.getFieldNames
while (i < length) {
val name = requiredSchema(i).name
val writable = if (missingColumnNames.contains(name)) {
val fieldRefs = requiredSchema.map { f =>
val name = f.name
if (missingColumnNames.contains(name)) {
null
} else {
if (names.contains(name)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this kind of thing in the loop is pretty slow, especially on this critical path. Let's pre-compute it before the loop. Again, PLEASE follow the previous code, which did this check when creating the unwrapper.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW do we really need to handle missing columns for nested fields?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I moved out the column-name logic from while, but it still requires OrcStruct because we don't have StructObjectInspector. So, we cannot move out this from iter.map.
  2. This is about top-level fields. Could you tell me what do you mean by nested fields here?

Expand All @@ -56,6 +61,11 @@ private[orc] class OrcDeserializer(
orcStruct.getFieldValue("_col" + dataSchema.fieldIndex(name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, do we really need missingCols? If the name is missing, orcStruct.getFieldValue will just return null, isn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OrcStruct will raise Exception here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then how can we guarantee this line won't throw exception?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We handled them in a missing columns in line 55.

}
}
}.toArray

var i = 0
while (i < length) {
val writable = fieldRefs(i)
if (writable == null) {
mutableRow.setNullAt(i)
} else {
Expand Down