Closed
36 commits
685fd07
use UTF8String instead of String for StringType
Mar 31, 2015
21f67c6
cleanup
Mar 31, 2015
4699c3a
use Array[Byte] in UTF8String
Mar 31, 2015
d32abd1
fix utf8 for python api
Mar 31, 2015
a85fb27
refactor
Mar 31, 2015
6b499ac
fix style
Apr 1, 2015
5f9e120
fix sql tests
Apr 1, 2015
38c303e
fix python sql tests
Apr 1, 2015
c7dd4d2
fix some catalyst tests
Apr 1, 2015
bb52e44
fix scala style
Apr 1, 2015
8b45864
fix codegen with UTF8String
Apr 1, 2015
23a766c
refactor
Apr 1, 2015
9dc32d1
fix some hive tests
Apr 2, 2015
73e4363
Merge branch 'master' of github.com:apache/spark into string
Apr 2, 2015
956b0a4
fix hive tests
Apr 2, 2015
9f4c194
convert data type for data source
Apr 2, 2015
537631c
some comment about Date
Apr 2, 2015
28d6f32
refactor
Apr 2, 2015
28f3d81
Merge branch 'master' of github.com:apache/spark into string
Apr 3, 2015
e5fa5b8
remove clone in UTF8String
Apr 3, 2015
8d17f21
fix hive compatibility tests
Apr 3, 2015
fd11364
optimize UTF8String
Apr 3, 2015
ac18ae6
address comment
Apr 3, 2015
2089d24
add hashcode check back
Apr 3, 2015
13d9d42
Merge branch 'master' of github.com:apache/spark into string
Apr 3, 2015
867bf50
fix String filter push down
Apr 4, 2015
1314a37
address comments from Yin
Apr 8, 2015
5116b43
rollback unrelated changes
Apr 8, 2015
08d897b
Merge branch 'master' of github.com:apache/spark into string
Apr 9, 2015
b04a19c
add comment for getString/setString
Apr 10, 2015
744788f
Merge branch 'master' of github.com:apache/spark into string
Apr 13, 2015
341ec2c
turn off scala style check in UTF8StringSuite
Apr 13, 2015
59025c8
address comments from @marmbrus
Apr 15, 2015
6d776a9
Merge branch 'master' of github.com:apache/spark into string
Apr 15, 2015
2772f0d
fix new test failure
Apr 15, 2015
3b7bfa8
fix schema of AddJar
Apr 15, 2015
optimize UTF8String
Davies Liu committed Apr 3, 2015
commit fd113643c48b633eb505540a13b8fd4798c0197d
@@ -30,7 +30,7 @@ import java.util.Arrays

final class UTF8String extends Ordered[UTF8String] with Serializable {

-private var bytes: Array[Byte] = _
+private[this] var bytes: Array[Byte] = _
Contributor

I had assumed that we would want to use bytes: Array[Byte] + length: Int so that the same byte array could be reused multiple times for different values. It seems that allocating and zeroing out the byte arrays could actually be pretty expensive.

Contributor

Ah, okay talked to @rxin and we are going to try and do this later?

Contributor Author

Right now, UTF8String will take bytes from Binary.getBytes or String.getBytes, no copy, until we call copy() explicitly.

Contributor

What is the Binary that you are referring to here? Also, can you explain what you mean by "no copy" here?

Contributor Author

The Binary is parquet.io.api.Binary. When we create a UTF8String from Binary.getBytes, we do not need to make another copy of the bytes.

Before this patch, we would create a copy as a String.
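
The "no copy" behaviour discussed in this thread can be illustrated with a minimal sketch. `Utf8Wrapper` and `ZeroCopyDemo` are hypothetical names used only for illustration; they are not the PR's `UTF8String`, and only the array-sharing aspect is modelled here.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical stand-in for a byte-backed string; not the PR's UTF8String.
final class Utf8Wrapper(private val bytes: Array[Byte]) {
  // The wrapper keeps a reference to the caller's array: nothing is allocated,
  // but any later mutation of that array is visible through the wrapper.
  def sharesArrayWith(other: Array[Byte]): Boolean = bytes eq other

  // Explicit defensive copy, analogous to calling copy() in the discussion.
  def copy(): Utf8Wrapper = new Utf8Wrapper(bytes.clone())

  override def toString: String = new String(bytes, UTF_8)
}

object ZeroCopyDemo {
  // Returns (wrapper shares the source array, copy shares the source array).
  def demo(): (Boolean, Boolean) = {
    val raw = "hello".getBytes(UTF_8)
    val wrapped = new Utf8Wrapper(raw)
    val copied = wrapped.copy()
    (wrapped.sharesArrayWith(raw), copied.sharesArrayWith(raw))
  }
}
```

The trade-off is the usual one for shared buffers: wrapping is cheap, but if the producer reuses its array, the value must be copied before it is retained.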


/**
* Update the UTF8String with String.
@@ -48,6 +48,12 @@ final class UTF8String extends Ordered[UTF8String] with Serializable {
this
}

+@inline
+private[this] def numOfBytes(b: Byte): Int = {
Contributor

Maybe add a comment here to explain it?

+val offset = (b & 0xFF) - 192
+if (offset >= 0) UTF8String.tailBytesOfUTF8(offset) else 1
+}
+
/**
* Return the number of code points in it.
*
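
As one possible answer to the request for an explanatory comment on `numOfBytes`, the lookup can be written out as explicit range checks. This is a sketch equivalent to the table used after this commit, not the PR's code; `Utf8LeadByte` and `sequenceLength` are hypothetical names.

```scala
object Utf8LeadByte {
  // Length in bytes of the UTF-8 sequence introduced by lead byte `b`.
  def sequenceLength(b: Byte): Int = {
    val u = b & 0xFF          // JVM bytes are signed; widen to 0..255
    if (u < 0xC0) 1           // 0xxxxxxx (ASCII), or a stray 10xxxxxx byte stepped over singly
    else if (u < 0xE0) 2      // 110xxxxx
    else if (u < 0xF0) 3      // 1110xxxx
    else if (u < 0xF8) 4      // 11110xxx
    else if (u < 0xFC) 5      // 111110xx, part of the original UTF-8 design, not valid under RFC 3629
    else 6                    // 1111110x and above, likewise not valid under RFC 3629
  }
}
```

Subtracting 192 (0xC0) in the diff simply shifts the lead-byte range 192 to 255 onto table indices 0 to 63; anything below 192 falls through to a length of 1.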
@@ -57,11 +63,7 @@ final class UTF8String extends Ordered[UTF8String] with Serializable {
var len = 0
var i: Int = 0
while (i < bytes.length) {
-val b = bytes(i) & 0xFF
-i += 1
-if (b >= 192) {
-i += UTF8String.tailBytesOfUTF8(b - 192)
-}
+i += numOfBytes(bytes(i))
len += 1
}
len
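
The loop in this hunk counts code points by jumping from lead byte to lead byte. A different technique, shown here only as a cross-check and not as what the PR does, is to count every byte that is not a continuation byte (`10xxxxxx`). `Utf8Length` and `codePoints` are hypothetical names, and the sketch assumes well-formed UTF-8 input.

```scala
object Utf8Length {
  // Counts code points by counting bytes that do not have the form 10xxxxxx.
  // Assumes well-formed UTF-8: each code point then contributes exactly one such byte.
  def codePoints(bytes: Array[Byte]): Int = {
    var len = 0
    var i = 0
    while (i < bytes.length) {
      if ((bytes(i) & 0xC0) != 0x80) len += 1
      i += 1
    }
    len
  }
}
```

This version touches every byte, so the lead-byte jump in the diff does less work on multi-byte text; the two agree on valid input.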
@@ -84,35 +86,47 @@ final class UTF8String extends Ordered[UTF8String] with Serializable {
var c = 0
var i: Int = 0
while (c < start && i < bytes.length) {
-val b = bytes(i) & 0xFF
-i += 1
-if (b >= 192) {
-i += UTF8String.tailBytesOfUTF8(b - 192)
-}
+i += numOfBytes(bytes(i))
c += 1
}
var j = i
while (c < until && j < bytes.length) {
-val b = bytes(j) & 0xFF
-j += 1
-if (b >= 192) {
-j += UTF8String.tailBytesOfUTF8(b - 192)
-}
+j += numOfBytes(bytes(j))
c += 1
}
UTF8String(Arrays.copyOfRange(bytes, i, j))
}

def contains(sub: UTF8String): Boolean = {
-bytes.containsSlice(sub.bytes)
+val b = sub.getBytes
+if (b.length == 0) {
+return true
+}
+var i: Int = 0
+while (i <= bytes.length - b.length) {
+// In worst case, it's O(N*K), but should works fine with SQL
+if (bytes(i) == b(0) && Arrays.equals(Arrays.copyOfRange(bytes, i, i + b.length), b)) {
+return true
+}
+i += 1
+}
+false
}

def startsWith(prefix: UTF8String): Boolean = {
-bytes.startsWith(prefix.bytes)
+val b = prefix.getBytes
+if (b.length > bytes.length) {
+return false
+}
+Arrays.equals(Arrays.copyOfRange(bytes, 0, b.length), b)
}

def endsWith(suffix: UTF8String): Boolean = {
-bytes.endsWith(suffix.bytes)
+val b = suffix.getBytes
+if (b.length > bytes.length) {
+return false
+}
+Arrays.equals(Arrays.copyOfRange(bytes, bytes.length - b.length, bytes.length), b)
}

def toUpperCase(): UTF8String = {
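
The new `contains`, `startsWith`, and `endsWith` each build a temporary array with `Arrays.copyOfRange` before comparing. A possible alternative, sketched below under the assumption that allocation matters on this path, compares the region in place. `ByteRegion` and `matchAt` are hypothetical helpers operating on plain byte arrays, not part of the PR.

```scala
object ByteRegion {
  // True when `sub` occurs in `bytes` starting at index `pos`.
  // Compares in place, so no temporary array is allocated.
  def matchAt(bytes: Array[Byte], sub: Array[Byte], pos: Int): Boolean = {
    if (pos < 0 || pos + sub.length > bytes.length) return false
    var i = 0
    while (i < sub.length) {
      if (bytes(pos + i) != sub(i)) return false
      i += 1
    }
    true
  }

  def startsWith(bytes: Array[Byte], prefix: Array[Byte]): Boolean =
    matchAt(bytes, prefix, 0)

  def endsWith(bytes: Array[Byte], suffix: Array[Byte]): Boolean =
    matchAt(bytes, suffix, bytes.length - suffix.length)

  // Still O(N*K) in the worst case, like the version in the diff.
  def contains(bytes: Array[Byte], sub: Array[Byte]): Boolean = {
    var i = 0
    while (i <= bytes.length - sub.length) {
      if (matchAt(bytes, sub, i)) return true
      i += 1
    }
    false
  }
}
```

An empty `sub` matches at position 0, which mirrors the explicit `b.length == 0` early return in the diff.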
@@ -133,12 +147,13 @@ final class UTF8String extends Ordered[UTF8String] with Serializable {

override def compare(other: UTF8String): Int = {
var i: Int = 0
-while (i < bytes.length && i < other.bytes.length) {
-val res = bytes(i).compareTo(other.bytes(i))
+val b = other.getBytes
+while (i < bytes.length && i < b.length) {
+val res = bytes(i).compareTo(b(i))
if (res != 0) return res
i += 1
}
-bytes.length - other.bytes.length
+bytes.length - b.length
}

override def compareTo(other: UTF8String): Int = {
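
One point that may be worth checking in `compare`: JVM bytes are signed, so comparing them directly orders every non-ASCII lead byte (0x80 and above, which is negative as a `Byte`) before ASCII. The sketch below masks each byte to its unsigned value first, which gives the same order as comparing code points for well-formed UTF-8. `UnsignedCompare` is a hypothetical helper on plain arrays, not the PR's method, and whether the PR intends signed ordering is not stated in the source.

```scala
object UnsignedCompare {
  // Lexicographic comparison of two byte arrays, treating each byte as 0..255.
  def compare(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val res = (a(i) & 0xFF) - (b(i) & 0xFF)
      if (res != 0) return res
      i += 1
    }
    a.length - b.length
  }
}
```

For example, "a" (0x61) sorts before "é" (lead byte 0xC3) under this comparison, whereas a signed comparison of the same two bytes puts 0xC3 first.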
@@ -147,7 +162,7 @@ final class UTF8String extends Ordered[UTF8String] with Serializable {

override def equals(other: Any): Boolean = other match {
case s: UTF8String =>
-Arrays.equals(bytes, s.bytes)
+Arrays.equals(bytes, s.getBytes)
case s: String =>
// fail fast
bytes.length >= s.length && length() == s.length && toString() == s
Contributor

When do we need this?

Contributor Author

For tests.

Contributor

So, we do not expect that other is a String in a real use case, right? If so, why not change the test and get rid of it?

Contributor Author

I tried to remove this, but then we need to convert String into UTF8String manually in many cases, especially for a String inside a Map/Struct/Array.

Keeping it simplifies the tests a lot, so I'd like to keep this.
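
A consequence of accepting `String` in `equals` is that the relation becomes one-directional, because `java.lang.String.equals` only ever returns true for another `String`. The sketch below demonstrates this with a hypothetical `Utf8Like` class that models only the equality logic; it is not the PR's `UTF8String`.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.util.Arrays

// Hypothetical byte-backed string whose equals also accepts java.lang.String.
final class Utf8Like(private val bytes: Array[Byte]) {
  override def equals(other: Any): Boolean = other match {
    case s: Utf8Like => Arrays.equals(bytes, s.bytes)
    case s: String   => new String(bytes, UTF_8) == s
    case _           => false
  }
  override def hashCode(): Int = Arrays.hashCode(bytes)
}

object EqualsAsymmetry {
  // Returns (wrapper == string, string == wrapper).
  def demo(): (Boolean, Boolean) = {
    val u: Any = new Utf8Like("abc".getBytes(UTF_8))
    val s: Any = "abc"
    (u == s, s == u)
  }
}
```

In tests this means the byte-backed value has to be on the left-hand side of the comparison for the shortcut to work.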

@@ -163,10 +178,12 @@ final class UTF8String extends Ordered[UTF8String] with Serializable {
object UTF8String {
// number of tailing bytes in a UTF8 sequence for a code point
// see http://en.wikipedia.org/wiki/UTF-8, 192-256 of Byte 1
-private[types] val tailBytesOfUTF8: Array[Int] = Array(1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
-1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,
-3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)
+private[types] val tailBytesOfUTF8: Array[Int] = Array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
+2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
+3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
+4, 4, 4, 4, 4, 4, 4, 4,
+5, 5, 5, 5,
+6, 6, 6, 6)

/**
* Create a UTF-8 String from String
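
After this commit the table entries hold the total length of each sequence, lead byte included, rather than the number of trailing bytes, even though the name and comment still say "tailing bytes". The same 64 values can be built from the run lengths instead of typed out, as in the sketch below. `Utf8Table` and `bytesOfCodePoint` are hypothetical names, not identifiers from the PR.

```scala
object Utf8Table {
  // Index = (lead byte & 0xFF) - 192; value = total bytes in the sequence.
  // 32 two-byte leads, 16 three-byte, 8 four-byte, then 4 + 4 for the
  // historical five- and six-byte forms (not valid under RFC 3629).
  val bytesOfCodePoint: Array[Int] =
    Array.fill(32)(2) ++ Array.fill(16)(3) ++ Array.fill(8)(4) ++
      Array.fill(4)(5) ++ Array.fill(4)(6)
}
```

For example, lead byte 0xE2 maps to index 226 - 192 = 34, which falls in the run of threes.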