Add string_view and binary_view Logical types
#5817
I guess this raises the question of what it means for a logical type to be added to Lance. Does it mean that engines will need to support this type? Do Spark and/or Ray support the view types?
Now that I think about it more, I have a new question: when do we need to add a new logical type? Off the top of my head, I can think of two potential criteria.
So far, the motivation for our types has mostly been category 1. I think this is more traditional for databases; it is rare for a database to accept category 2 as a valid justification for a new type. For example, there is no Dictionary type in Snowflake, Postgres, or Databricks. By this rationale, I would argue that we should not create string_view and binary_view logical types. What if, instead, we had the following...
So, for example:

The user asks for a string column and says they will accept any encoding. The query engine looks at the plan and doesn't see any fancy operations. The query engine asks the storage layer for the array and says it will accept any encoding. The storage layer just gives back the default string array, and the compute layer passes that to the user.

The user asks for a string column and says they will accept any encoding. The query engine looks at the plan, sees a sort on the string column, and knows it will be faster as string view. The query engine asks the storage layer for the array but indicates it would prefer the string view encoding. The storage layer decodes directly into string view and passes the data back that way. The string view array is given back to the user.

The user asks for a string column and says they want a basic Arrow string array. The query engine then looks at the plan (maybe there is a sort) and has to make a judgement call to do

Anyways, the point I am getting at here is that it should be the query engine (and perhaps the storage layer) that picks between the various encodings. These should not be logical types. (Disclaimer: this is basically the same as the Substrait rules for data types too.)
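The scenarios above can be sketched as a small decision function. This is only an illustration of the idea, not anything that exists in Lance: `Encoding`, `VIEW_FRIENDLY_OPS`, and `negotiate` are hypothetical names, and the rule that a sort prefers string view is taken from the second scenario. The judgement call in the third scenario is deliberately not modeled; the sketch simply forwards the user's choice.

```python
from enum import Enum


class Encoding(Enum):
    ANY = "any"                  # caller accepts whatever it is given
    STRING = "string"            # basic Arrow string array
    STRING_VIEW = "string_view"  # Arrow string view array


# Hypothetical: plan operations assumed to run faster on string view.
VIEW_FRIENDLY_OPS = {"sort"}


def negotiate(user_pref, plan_ops):
    """Return (encoding requested from storage, encoding handed to the user)."""
    if user_pref is Encoding.ANY:
        if VIEW_FRIENDLY_OPS & set(plan_ops):
            # Scenario 2: the engine prefers string view, storage decodes
            # directly into it, and the user receives it unchanged.
            return Encoding.STRING_VIEW, Encoding.STRING_VIEW
        # Scenario 1: nothing fancy in the plan, so storage is free to
        # return its default, assumed here to be the plain string array.
        return Encoding.ANY, Encoding.STRING
    # Scenario 3: the user pinned an encoding, so the output must match it.
    # What the engine does internally is a judgement call left out here.
    return user_pref, user_pref


if __name__ == "__main__":
    print(negotiate(Encoding.ANY, ["filter"]))
    print(negotiate(Encoding.ANY, ["sort"]))
    print(negotiate(Encoding.STRING, ["sort"]))
```

The point of the sketch is that the column's logical type is "string" in all three calls; only the physical encoding that is negotiated changes.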
I'd like to add new logical types:
string_view: Utf8View
binary_view: BinaryView

These would be encoded by default the same as LargeUtf8 and LargeBinary (using 64-bit offsets), and they could be decoded from either 32-bit or 64-bit offsets. The default encoding of 64-bit offsets is chosen to be safe for large values (> 2GB).

Proposed spec changes: wjones127#5
In progress library changes: #5685
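To make the offset-width reasoning concrete, here is a rough sketch of the arithmetic. It relies on the fact that Arrow's Utf8 and Binary layouts use signed 32-bit offsets, so a single array can address at most 2^31 - 1 bytes of value data; `min_offset_bits` is a hypothetical helper for illustration, not part of Lance or Arrow.

```python
# Largest byte position a signed 32-bit offset can represent.
INT32_MAX = 2**31 - 1


def min_offset_bits(value_lengths):
    """Smallest offset width (32 or 64) able to index all the value bytes.

    Offsets are cumulative byte positions, so the final offset equals the
    total number of bytes across every value in the array.
    """
    total_bytes = sum(value_lengths)
    return 32 if total_bytes <= INT32_MAX else 64


if __name__ == "__main__":
    print(min_offset_bits([10, 20, 30]))     # small strings
    print(min_offset_bits([2**30, 2**30]))   # 2 GiB in total
```

Defaulting to 64-bit offsets avoids having to make this check at write time, at the cost of wider offsets for small data.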