[SPARK-23649][SQL] Skipping chars disallowed in UTF-8 #20796
Conversation
Could you add more tests for all the invalid ranges? @HyukjinKwon, could you trigger this test? Also kindly pinging you since this is probably your domain.

ok to test

test this please

retest this please

Test build #88174 has finished for PR 20796 at commit

add to whitelist

Test build #88196 has finished for PR 20796 at commit
@@ -0,0 +1,3 @@
channel,code
United,123
ABGUN�,456
\ No newline at end of file
how did you create this file?
the fix LGTM, we can add more tests for different ranges of the invalid chars.

LGTM for the fix. +1 for more tests.
assert(df.schema == expectedSchema)

val badStr = new String("ABGUN".getBytes :+ 0xff.toByte)
Shall we explicitly give the encoding?
Test build #88313 has finished for PR 20796 at commit

test this please

Test build #88316 has finished for PR 20796 at commit

Test build #88320 has finished for PR 20796 at commit

Test build #88340 has finished for PR 20796 at commit

@HyukjinKwon @maropu @cloud-fan @gatorsmile Please review it.
* Binary    Hex          Comments
* 0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
* 10xxxxxx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
* 110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
hmm, is this 0xC2..0xDF?
Yeah, it seems we need to list 0xC0, 0xC1 here.
Yes, it looks a bit inconsistent with the content of bytesOfCodePointInUTF8. I agree with @cloud-fan that we should list 0xC0, 0xC1 here.
I added a comment about the first bytes disallowed by UTF-8. The comment describes where the byte ranges and restrictions come from; otherwise the comments would just duplicate the implementation.
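For reference, a hedged sketch of the ranges discussed in this thread, per the Unicode 10.0 standard (the class and method names here are illustrative, not part of the merged code):

```java
// Illustrative helper, not part of the PR: first-byte values that UTF-8 disallows.
final class DisallowedFirstBytes {
  static boolean isDisallowedFirstByte(int b) {
    return (b >= 0xC0 && b <= 0xC1)   // would only start over-long 2-byte encodings
        || (b >= 0xF5 && b <= 0xFF);  // beyond U+10FFFF, obsolete 5/6-byte lead bytes, 0xFE, 0xFF
  }
}
```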
* 10xxxxxx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
* 110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
* 1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
* 11110xxx  0xF0..0xF4   First byte of a 4-byte character encoding
and also 0xF5..0xFF
@Test
public void skipWrongFirstByte() {
  int[] wrongFirstBytes = {
What will happen if we print a UTF8String with invalid bytes?
The bytes are not filtered by UTF8String methods. For instance, in the case of the CSV datasource the invalid bytes are just passed through to the final result. See https://issues.apache.org/jira/browse/SPARK-23649
I have created a separate ticket for that issue: https://issues.apache.org/jira/browse/SPARK-23741.
I am not sure the output of invalid UTF-8 chars should be addressed by this PR (this PR just fixes crashes on bad input), because it could impact users and other Spark components. It needs to be discussed and tested carefully.
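A minimal sketch of that pass-through behavior, using only the public UTF8String.fromBytes/getBytes API (the wrapper class name is made up for illustration):

```java
import java.util.Arrays;
import org.apache.spark.unsafe.types.UTF8String;

// Illustrative only: UTF8String keeps its backing bytes as-is, so a disallowed
// byte survives a round trip unchanged instead of being dropped or replaced.
public class PassThroughSketch {
  public static void main(String[] args) {
    byte[] input = {'A', 'B', 'G', 'U', 'N', (byte) 0xFF};
    UTF8String s = UTF8String.fromBytes(input);
    System.out.println(Arrays.equals(s.getBytes(), input)); // prints "true"
  }
}
```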
LGTM, pending jenkins

Test build #88384 has finished for PR 20796 at commit

retest this please.
return (offset >= 0) ? bytesOfCodePointInUTF8[offset] : 1;
final int offset = b & 0xFF;
byte numBytes = bytesOfCodePointInUTF8[offset];
return (numBytes == 0) ? 1: numBytes; // Skip the first byte disallowed in UTF-8
Is the comment valid? Do we skip it? Don't we still count the disallowed byte as one code point in numChars?
I think so. We jump over (skip, by definition) such bytes and count each of them as one entity. If we didn't count the bytes, we would break substring, toUpperCase, toLowerCase, trimRight/trimLeft, etc. The point of the changes is to not crash on bad input: previously we threw IndexOutOfBoundsException on some invalid chars but could pass (count as 1) other invalid chars. This PR covers the whole range. I believe ignoring/removing invalid chars should be addressed in the changes for https://issues.apache.org/jira/browse/SPARK-23741
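A small sketch of that skip-but-count behavior at the UTF8String level (illustrative; the class name is made up, and the expected output assumes this PR's fix):

```java
import org.apache.spark.unsafe.types.UTF8String;

// Illustrative only: a disallowed first byte advances the scan by exactly one byte
// and contributes exactly one character, so offsets used by substring/trim stay consistent.
public class SkipButCountSketch {
  public static void main(String[] args) {
    byte[] bytes = {'a', (byte) 0xFF, 'b'};
    UTF8String s = UTF8String.fromBytes(bytes);
    // With the fix this prints 3 ('a', the 0xFF byte counted as one char, 'b');
    // before the fix, 0xFF led to an IndexOutOfBoundsException.
    System.out.println(s.numChars());
  }
}
```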
Test build #88394 has finished for PR 20796 at commit

retest this please.

Test build #88413 has finished for PR 20796 at commit

LGTM
thanks, merging to master/2.3/2.2!
What changes were proposed in this pull request?
The mapping of a UTF-8 char's first byte to the char's size doesn't cover the whole range 0-255. It is defined only for 0-253:
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L60-L65
https://github.com/apache/spark/blob/master/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L190
If the first byte of a char is 253-255, an IndexOutOfBoundsException is thrown. Besides that, the values for 244-252 are not correct according to the recent Unicode standard for UTF-8: http://www.unicode.org/versions/Unicode10.0.0/UnicodeStandard-10.0.pdf
As a consequence of the exception above, the length of an input string in UTF-8 encoding cannot be calculated if the string contains chars starting from code 253. On the user's side this shows up as, for example, schema inference crashing on a CSV file that contains such chars, even though the file can be read if the schema is specified explicitly or if the mode is set to multiline.
The proposed changes build a correct mapping of the first byte of a UTF-8 char to its size (it now covers all cases) and skip disallowed chars (counting each as one octet).
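For illustration, a standalone sketch of such a mapping, assuming a 256-entry table where 0 marks first bytes disallowed by UTF-8 (the real table and method live in UTF8String.java; the names and exact table contents here are illustrative, not the merged code):

```java
// Illustrative sketch of the approach, not the merged implementation.
final class Utf8FirstByteSketch {
  // One entry per possible first-byte value 0x00..0xFF; 0 marks a disallowed byte.
  private static final byte[] BYTES_OF_CODE_POINT = new byte[256];
  static {
    for (int b = 0x00; b <= 0x7F; b++) BYTES_OF_CODE_POINT[b] = (byte) 1; // 1-byte (ASCII)
    // 0x80..0xC1 stay 0: continuation bytes and over-long 2-byte lead bytes
    for (int b = 0xC2; b <= 0xDF; b++) BYTES_OF_CODE_POINT[b] = (byte) 2; // 2-byte lead
    for (int b = 0xE0; b <= 0xEF; b++) BYTES_OF_CODE_POINT[b] = (byte) 3; // 3-byte lead
    for (int b = 0xF0; b <= 0xF4; b++) BYTES_OF_CODE_POINT[b] = (byte) 4; // 4-byte lead
    // 0xF5..0xFF stay 0: disallowed first bytes
  }

  static int numBytesForFirstByte(final byte b) {
    final int offset = b & 0xFF;                    // covers the full 0-255 range
    final byte numBytes = BYTES_OF_CODE_POINT[offset];
    return (numBytes == 0) ? 1 : numBytes;          // count a disallowed byte as one char
  }
}
```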
How was this patch tested?
Added a test and a file containing a char that is disallowed in UTF-8: 0xFF.