
Commit 0526805

Infer 'mixed' types as strings when using Arrow serialization (#702)

* Add failing test
* More generously infer columns as strings if Arrow thought they were binary
* Add failing test for true binary values
* Infer 'mixed' columns by checking the first value
* Update FORK.md

1 parent fdcc7b9 · commit 0526805

File tree

3 files changed: +29 −4 lines changed


FORK.md

Lines changed: 3 additions & 2 deletions

@@ -24,9 +24,10 @@
 # Added

 * Gradle plugin to easily create custom docker images for use with k8s
-* Filter rLibDir by exists so that daemon.R references the correct file [460](https://github.com/palantir/spark/pull/460)
+* Filter rLibDir by exists so that daemon.R references the correct file [(#460)](https://github.com/palantir/spark/pull/460)
 * Implementation of the shuffle I/O plugins from SPARK-25299 that asynchronously backs up shuffle files to remote storage
-* Add pre-installed conda configuration and use to find rlib directory [700](https://github.com/palantir/spark/pull/700)
+* Add pre-installed conda configuration and use to find rlib directory [(#700)](https://github.com/palantir/spark/pull/700)
+* Supports Arrow-serialization of Python 2 strings [(#678)](https://github.com/palantir/spark/pull/678)

 # Reverted

 * [SPARK-25908](https://issues.apache.org/jira/browse/SPARK-25908) - Removal of `monotonicall_increasing_id`, `toDegree`, `toRadians`, `approxCountDistinct`, `unionAll`

python/pyspark/sql/tests/test_arrow.py

Lines changed: 14 additions & 0 deletions

@@ -357,6 +357,20 @@ def test_createDataFrame_with_str_struct_col(self):
         df, df_arrow = self._createDataFrame_toggle(pdf)
         self.assertEqual(df.schema, df_arrow.schema)

+    def test_createDataFrame_with_str_binary_mixed(self):
+        import pandas as pd
+        pdf = pd.DataFrame({"a": [u"unicode-value", "binary-under-python-2"]})
+
+        df, df_arrow = self._createDataFrame_toggle(pdf)
+        self.assertEqual(df.schema, df_arrow.schema)
+
+    def test_createDataFrame_with_real_binary(self):
+        import pandas as pd
+        pdf = pd.DataFrame({"a": [bytearray(b"a"), bytearray(b"c")]})
+
+        df, df_arrow = self._createDataFrame_toggle(pdf)
+        self.assertEqual(df.schema, df_arrow.schema)
+
     def test_createDataFrame_fallback_enabled(self):
         with QuietTest(self.sc):
             with self.sql_conf({"spark.sql.execution.arrow.fallback.enabled": True}):
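The two new tests hinge on what `pd.api.types.infer_dtype` reports for a column's values. A small standalone check (outside Spark) illustrates the three cases the patch distinguishes; the inputs here are illustrative, with Python 3 `bytes` standing in for the Python 2 `str` values the commit targets:

```python
import pandas as pd

# infer_dtype is the pandas helper the patch consults per column.
print(pd.api.types.infer_dtype(["a", "b"], skipna=True))    # string
print(pd.api.types.infer_dtype([b"a", b"b"], skipna=True))  # bytes
print(pd.api.types.infer_dtype(["a", b"b"], skipna=True))   # mixed
```

The 'mixed' result is why the patch needs the extra first-value check: a column mixing unicode and byte strings cannot be classified by dtype alone.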

python/pyspark/sql/types.py

Lines changed: 12 additions & 2 deletions

@@ -1694,10 +1694,20 @@ def from_arrow_schema(arrow_schema):
 def _infer_binary_columns_as_arrow_string(schema, pandas_df):
     import pandas as pd
     import pyarrow as pa
+    import six

     for field_index, field in enumerate(schema):
-        if field.type == pa.binary() and \
-                pd.api.types.infer_dtype(pandas_df.iloc[:, field_index], skipna=True) == "string":
+        if not field.type == pa.binary():
+            continue
+
+        inferred_dtype = pd.api.types.infer_dtype(pandas_df.iloc[:, field_index], skipna=True)
+        if inferred_dtype == 'string':
+            is_string_column = True
+        elif inferred_dtype == 'mixed' and len(pandas_df.index) > 0:
+            first_value = pandas_df.iloc[0, field_index]
+            is_string_column = isinstance(first_value, six.string_types)
+
+        if is_string_column:
             field_as_string = pa.field(field.name, pa.string())
             schema = schema.set(field_index, field_as_string)
