
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Mar 11, 2018

What changes were proposed in this pull request?

The mapping from a UTF-8 character's first byte to the character's size does not cover the whole range 0-255. It is defined only for 0-253:
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L60-L65
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L190

If the first byte of a character is in the range 253-255, an IndexOutOfBoundsException is thrown. Besides that, the values for 244-252 are not correct according to the current Unicode standard for UTF-8: http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

As a consequence of the exception above, the length of a UTF-8 encoded input string cannot be calculated if the string contains characters whose first byte is 253 or greater. On the user's side this shows up, for example, as a crash while inferring the schema of a CSV file that contains such characters, even though the same file can be read if the schema is specified explicitly or if the mode is set to multiline.

The proposed changes build a correct mapping from the first byte of a UTF-8 character to its size (it now covers all cases) and skip disallowed characters (counting each one as a single octet).
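
A minimal sketch of the idea (illustrative names, not the actual Spark patch): the lookup table covers every possible first-byte value 0x00-0xFF, a disallowed first byte maps to 0, and the size function counts it as a single octet instead of throwing:

```java
// Sketch of a first-byte-to-size mapping covering the full 0x00..0xFF range.
public final class Utf8FirstByteSketch {
  private static final byte[] BYTES_FOR_FIRST_BYTE = new byte[256];
  static {
    for (int b = 0x00; b <= 0x7F; b++) BYTES_FOR_FIRST_BYTE[b] = 1; // 1-byte (ASCII)
    // 0x80..0xC1 stay 0: continuation bytes and the overlong prefixes 0xC0, 0xC1
    for (int b = 0xC2; b <= 0xDF; b++) BYTES_FOR_FIRST_BYTE[b] = 2; // 2-byte sequences
    for (int b = 0xE0; b <= 0xEF; b++) BYTES_FOR_FIRST_BYTE[b] = 3; // 3-byte sequences
    for (int b = 0xF0; b <= 0xF4; b++) BYTES_FOR_FIRST_BYTE[b] = 4; // 4-byte sequences
    // 0xF5..0xFF stay 0: never valid as a first byte in UTF-8
  }

  /** Size of the code point starting with b; a disallowed byte is counted as one octet. */
  static int numBytesForFirstByte(byte b) {
    byte numBytes = BYTES_FOR_FIRST_BYTE[b & 0xFF];
    return numBytes == 0 ? 1 : numBytes; // never throws, even for 0xFD..0xFF
  }
}
```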

How was this patch tested?

Added a test and a file containing a byte that is disallowed in UTF-8: 0xFF.

@maropu
Member

maropu commented Mar 12, 2018

Could you add more tests covering the whole invalid range 253-255 in UTF8StringSuite (see the sketch below)?

@HyukjinKwon Could you trigger this test? Also, kindly pinging you since this is probably your domain.
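
A hedged sketch of the kind of test requested above (the class and method names are illustrative, not necessarily what the patch adds): every invalid first byte in 253-255 should be counted as a single character instead of throwing:

```java
import static org.junit.Assert.assertEquals;

import org.apache.spark.unsafe.types.UTF8String;
import org.junit.Test;

public class InvalidFirstByteSketchSuite {
  @Test
  public void countInvalidFirstByteAsOneChar() {
    // 0xFD..0xFF previously caused IndexOutOfBoundsException in numChars()
    for (int b = 253; b <= 255; b++) {
      byte[] bytes = { (byte) b };
      assertEquals(1, UTF8String.fromBytes(bytes).numChars());
    }
  }
}
```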

@HyukjinKwon
Member

ok to test

@HyukjinKwon
Member

test this please

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Mar 12, 2018

Test build #88174 has finished for PR 20796 at commit d6c5f02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

add to whitelist

@SparkQA

SparkQA commented Mar 13, 2018

Test build #88196 has finished for PR 20796 at commit d6c5f02.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -0,0 +1,3 @@
channel,code
United,123
ABGUN�,456 No newline at end of file
Contributor

how did you create this file?

@cloud-fan
Contributor

the fix LGTM, we can add more tests for different ranges of the invalid chars.

@HyukjinKwon
Member

LGTM for the fix. +1 for more tests.


assert(df.schema == expectedSchema)

val badStr = new String("ABGUN".getBytes :+ 0xff.toByte)
Member

Shall we explicitly specify the encoding?

@SparkQA

SparkQA commented Mar 16, 2018

Test build #88313 has finished for PR 20796 at commit c65f827.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Mar 16, 2018

test this please

@SparkQA

SparkQA commented Mar 16, 2018

Test build #88316 has finished for PR 20796 at commit 27e5a5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 16, 2018

Test build #88320 has finished for PR 20796 at commit 27e5a5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk changed the title [SPARK-23649][SQL] Prevent crashes on schema inferring of CSV containing wrong UTF-8 chars [SPARK-23649][SQL] Skipping chars disallowed in UTF-8 Mar 17, 2018
@SparkQA

SparkQA commented Mar 17, 2018

Test build #88340 has finished for PR 20796 at commit 8a501e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Mar 17, 2018

@HyukjinKwon @maropu @cloud-fan @gatorsmile Please review it.

* Binary Hex Comments
* 0xxxxxxx 0x00..0x7F Only byte of a 1-byte character encoding
* 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes)
* 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding
Member

hmm, is this 0xC2..0xDF?

Contributor

Yea, it seems we need to list 0xC0, 0xC1 here.

Member Author

Actually this table is from the Unicode standard (10.0, Table 3-6, page 126):
[screenshot: Unicode 10.0 Table 3-6, page 126]

0xC0 and 0xC1 are first bytes of 2-byte characters that are disallowed by UTF-8 (for now), since they could only start overlong encodings.

Member Author

Here is the table of allowed first bytes:
[screenshot: table of allowed first bytes from the Unicode standard]

Member

@viirya viirya Mar 20, 2018

Yes, it looks a bit inconsistent with the content of bytesOfCodePointInUTF8. I agree with @cloud-fan that we should list 0xC0, 0xC1 here.

Member Author

I added a comment about the first bytes disallowed by UTF-8. The comment describes where the byte ranges and restrictions come from; otherwise the comments would just duplicate the implementation.

* 10xxxxxx 0x80..0xBF Continuation bytes (1-3 continuation bytes)
* 110xxxxx 0xC0..0xDF First byte of a 2-byte character encoding
* 1110xxxx 0xE0..0xEF First byte of a 3-byte character encoding
* 11110xxx 0xF0..0xF4 First byte of a 4-byte character encoding
Contributor

and also 0xF5..0xFF

Member Author

I will add an additional comment about which bytes are not allowed according to the table:
[screenshot: table of allowed first bytes from the Unicode standard]


@Test
public void skipWrongFirstByte() {
int[] wrongFirstBytes = {
Contributor

What will happen if we print a UTF8String with invalid bytes?

Member Author

The bytes are not filtered by UTF8String methods. For instance, in the case of the CSV datasource the invalid bytes are just passed through to the final result. See https://issues.apache.org/jira/browse/SPARK-23649

I have created a separate ticket to fix the issue: https://issues.apache.org/jira/browse/SPARK-23741 .

I am not sure that the issue of outputting wrong UTF-8 chars should be addressed by this PR (this PR just fixes crashes on wrong input) because it could impact users and other Spark components. We need to discuss and test it carefully.

@cloud-fan
Contributor

LGTM, pending jenkins

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88384 has finished for PR 20796 at commit 5557a80.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Mar 20, 2018

retest this please.

return (offset >= 0) ? bytesOfCodePointInUTF8[offset] : 1;
final int offset = b & 0xFF;
byte numBytes = bytesOfCodePointInUTF8[offset];
return (numBytes == 0) ? 1: numBytes; // Skip the first byte disallowed in UTF-8
Member

Is the comment valid? Do we skip it? Don't we still count the disallowed byte as one code point in numChars?

Member Author

I think so. We jump over (skip, by definition) such bytes and count each one as a single entity. If we didn't count the bytes, we would break substring, toUpperCase, toLowerCase, trimRight/trimLeft, etc. The point of the changes is to not crash on bad input: previously we threw an IndexOutOfBoundsException on some wrong chars but could pass (count as 1) other wrong chars. This PR covers the whole range. I believe ignoring/removing wrong chars should be addressed in the changes for https://issues.apache.org/jira/browse/SPARK-23741
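
A self-contained illustration (not Spark code; the helper below is a simplified stand-in for numBytesForFirstByte) of why counting the disallowed byte keeps offsets consistent: the test file's "ABGUN" plus the 0xFF byte comes out as 6 characters rather than an exception:

```java
import java.nio.charset.StandardCharsets;

public class CountDisallowedByteSketch {
  // Simplified stand-in for the first-byte-to-size mapping.
  static int sizeOfFirstByte(int b) {
    if (b <= 0x7F) return 1;
    if (b >= 0xC2 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF4) return 4;
    return 1; // continuation or disallowed first byte: count it as one octet
  }

  public static void main(String[] args) {
    byte[] abgun = "ABGUN".getBytes(StandardCharsets.UTF_8);
    byte[] bytes = new byte[abgun.length + 1];
    System.arraycopy(abgun, 0, bytes, 0, abgun.length);
    bytes[abgun.length] = (byte) 0xFF; // the disallowed byte from the test CSV

    int numChars = 0;
    for (int i = 0; i < bytes.length; i += sizeOfFirstByte(bytes[i] & 0xFF)) {
      numChars++;
    }
    System.out.println(numChars); // 6: five ASCII chars plus the 0xFF octet
  }
}
```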

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88394 has finished for PR 20796 at commit 5557a80.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Mar 20, 2018

retest this please.

@SparkQA

SparkQA commented Mar 20, 2018

Test build #88413 has finished for PR 20796 at commit 5557a80.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Mar 20, 2018

LGTM

asfgit pushed a commit that referenced this pull request Mar 20, 2018
## What changes were proposed in this pull request?

The mapping from a UTF-8 character's first byte to the character's size does not cover the whole range 0-255. It is defined only for 0-253:
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L60-L65
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L190

If the first byte of a character is in the range 253-255, an IndexOutOfBoundsException is thrown. Besides that, the values for 244-252 are not correct according to the current Unicode standard for UTF-8: http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf

As a consequence of the exception above, the length of a UTF-8 encoded input string cannot be calculated if the string contains characters whose first byte is 253 or greater. On the user's side this shows up, for example, as a crash while inferring the schema of a CSV file that contains such characters, even though the same file can be read if the schema is specified explicitly or if the mode is set to multiline.

The proposed changes build a correct mapping from the first byte of a UTF-8 character to its size (it now covers all cases) and skip disallowed characters (counting each one as a single octet).

## How was this patch tested?

Added a test and a file containing a byte that is disallowed in UTF-8: 0xFF.

Author: Maxim Gekk <[email protected]>

Closes #20796 from MaxGekk/skip-wrong-utf8-chars.

(cherry picked from commit 5e7bc2a)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

thanks, merging to master/2.3/2.2!

asfgit pushed a commit that referenced this pull request Mar 20, 2018
@asfgit asfgit closed this in 5e7bc2a Mar 20, 2018
MatthewRBruce pushed a commit to Shopify/spark that referenced this pull request Jul 31, 2018
peter-toth pushed a commit to peter-toth/spark that referenced this pull request Oct 6, 2018
@MaxGekk MaxGekk deleted the skip-wrong-utf8-chars branch August 17, 2019 13:34