@Xuanwo
Collaborator

@Xuanwo Xuanwo commented Aug 27, 2025

Part of #4516

This PR adds JSON UDFs so that users can query JSON columns like this:

ds_path = Path(tmpdir) / "nested_json_test.lance"
lance.write_dataset(table, ds_path, data_storage_version="2.2")
dataset = lance.dataset(ds_path)

# Access nested fields using json_get recursively
# First get user, then profile, then name
result = dataset.to_table(
    filter="json_get_string(json_get(json_get(data, 'user'), 'profile'), 'name') = 'Alice'"
)
assert result.num_rows == 1
assert result["id"][0].as_py() == 1

# Or use JSONPath for deep access
result = dataset.to_table(
    filter="json_extract(data, '$.user.profile.settings.theme') = '\"dark\"'"
)
assert result.num_rows == 1
assert result["id"][0].as_py() == 1
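
To make the difference between the two filters concrete, here is a minimal pure-Python sketch of the semantics they appear to rely on. It is not the implementation in this PR: the helper bodies, the sample row, and the assumption that `json_extract` returns JSON-encoded text (which is why the filter compares against `'"dark"'` with the quotes included) are inferred from the example above.

```python
import json

def json_get(doc, key):
    """Return the JSON text of doc[key], or None if absent (illustrative only)."""
    if doc is None:
        return None
    value = json.loads(doc)
    if isinstance(value, dict) and key in value:
        return json.dumps(value[key])
    return None

def json_get_string(doc, key):
    """Return doc[key] as a plain str, or None if it is missing or not a string."""
    if doc is None:
        return None
    value = json.loads(doc)
    if isinstance(value, dict) and isinstance(value.get(key), str):
        return value[key]
    return None

def json_extract(doc, path):
    """Follow a dotted '$.a.b.c' path and return the result as JSON text."""
    value = json.loads(doc)
    for part in path.split(".")[1:]:  # skip the leading '$'
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return json.dumps(value)

# Hypothetical row shaped to match the paths used in the filters above.
row = json.dumps(
    {"user": {"profile": {"name": "Alice", "settings": {"theme": "dark"}}}}
)

name = json_get_string(json_get(json_get(row, "user"), "profile"), "name")
theme = json_extract(row, "$.user.profile.settings.theme")
print(name)   # plain string value
print(theme)  # JSON-encoded string, quotes included
```

Under these assumptions the typed getter yields a bare string, while the path-based extractor yields JSON text, so the right-hand side of each comparison has to be written accordingly.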

This PR was primarily authored with Claude Code using Opus 4.1 and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off and I'll fix it quickly.

@github-actions github-actions bot added the enhancement and python labels Aug 27, 2025
@codecov-commenter

codecov-commenter commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 57.99770% with 365 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.85%. Comparing base (e2f08db) to head (c59315f).
⚠️ Report is 3 commits behind head on main.

Files with missing lines                 Patch %   Lines
rust/lance-datafusion/src/udf/json.rs    63.57%    167 Missing and 61 partials ⚠️
rust/lance-arrow/src/json.rs             48.32%    75 Missing and 2 partials ⚠️
rust/lance/src/dataset/utils.rs          23.07%    60 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4577      +/-   ##
==========================================
- Coverage   80.99%   80.85%   -0.15%     
==========================================
  Files         314      314              
  Lines      115727   116609     +882     
  Branches   115727   116609     +882     
==========================================
+ Hits        93735    94283     +548     
- Misses      18682    18952     +270     
- Partials     3310     3374      +64     
Flag Coverage Δ
unittests 80.85% <57.99%> (-0.15%) ⬇️

Member

@westonpace westonpace left a comment

Quick initial question. How do these compare to https://github.com/datafusion-contrib/datafusion-functions-json ?

@Xuanwo
Collaborator Author

Xuanwo commented Aug 27, 2025

> Quick initial question. How do these compare to https://github.com/datafusion-contrib/datafusion-functions-json ?

datafusion-functions-json has:

json_get::json_get_udf(),
json_get_bool::json_get_bool_udf(),
json_get_float::json_get_float_udf(),
json_get_int::json_get_int_udf(),
json_get_json::json_get_json_udf(),
json_get_array::json_get_array_udf(),
json_as_text::json_as_text_udf(),
json_get_str::json_get_str_udf(),
json_contains::json_contains_udf(),
json_length::json_length_udf(),
json_object_keys::json_object_keys_udf(),

Perhaps it would be better to implement them directly on jsonb rather than converting to JSON and using those UDFs. What do you think?

Member

@westonpace westonpace left a comment

Very cool, I have some suggestions but no major concerns.

.iter()
.map(|arg| match arg {
ColumnarValue::Array(arr) => arr.clone(),
ColumnarValue::Scalar(scalar) => scalar.to_array().unwrap(),
Member

It would be great if we could avoid broadcasting the scalars here. Let's create a follow-up?
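
As a language-neutral illustration of this suggestion (not code from this PR), the sketch below contrasts expanding a scalar argument to one copy per row with branching once on scalar versus array. The `Columnar` alias and both function names are hypothetical stand-ins for the `ColumnarValue` handling in the snippet above, and plain Python lists stand in for Arrow arrays.

```python
import json
from typing import List, Optional, Union

# Hypothetical stand-in for a columnar argument: a list is an "array",
# anything else is a "scalar" shared by every row.
Columnar = Union[List[Optional[str]], Optional[str]]

def _lookup(doc, key):
    """Return doc[key] for a JSON object, or None for nulls and non-objects."""
    if doc is None or key is None:
        return None
    value = json.loads(doc)
    return value.get(key) if isinstance(value, dict) else None

def get_field_broadcast(docs, key: Columnar):
    """Baseline: expand a scalar key to one copy per row, then zip."""
    keys = key if isinstance(key, list) else [key] * len(docs)
    return [_lookup(d, k) for d, k in zip(docs, keys)]

def get_field_no_broadcast(docs, key: Columnar):
    """Alternative: branch once on scalar vs. array and reuse the scalar."""
    if isinstance(key, list):
        return [_lookup(d, k) for d, k in zip(docs, key)]
    return [_lookup(d, key) for d in docs]

docs = ['{"a": 1}', '{"a": 2}', None]
scalar_result = get_field_no_broadcast(docs, "a")
array_result = get_field_no_broadcast(docs, ["a", "b", "a"])
```

Both variants return the same values; the second one simply never allocates the per-row copy of the key when the argument is a scalar.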

Comment on lines 280 to 292
match raw_jsonb.get_by_name(key, false) {
Ok(Some(value)) => return Ok(Some(value.as_raw().as_ref().to_vec())),
Ok(None) => {}
Err(e) => {
return Err(datafusion::error::DataFusionError::Execution(format!(
"Failed to get field: {}",
e
)))
}
}

// Try as array index
if let Ok(index) = key.parse::<usize>() {
Member

It might be more efficient to determine whether the key is a field name or an integer index once, instead of for each value? Although I guess it could be an integer-named field and so vary from value to value 😛
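
A small sketch of the trade-off described here, using parsed JSON rather than JSONB and a hypothetical `get_by_key` helper: the key is parsed as an integer once, outside the loop, but whether that integer is used still has to be decided per value, because an object may legitimately have a field named "0".

```python
import json

def get_by_key(docs, key):
    """Look up `key` in each JSON document, as an object field or an array index."""
    # Parse the key once; whether the index applies is still decided per value.
    index = int(key) if key.isascii() and key.isdigit() else None
    out = []
    for doc in docs:
        value = json.loads(doc)
        if isinstance(value, dict):
            out.append(value.get(key))  # "0" may be a real field name
        elif isinstance(value, list) and index is not None and index < len(value):
            out.append(value[index])
        else:
            out.append(None)
    return out

docs = ['{"0": "field"}', '["elem"]', '{"x": 1}']
result = get_by_key(docs, "0")
print(result)
```

The same key resolves to a field in the first document and to an array element in the second, which is the per-value variation the comment points out.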

Xuanwo added 3 commits August 28, 2025 17:08
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
@Xuanwo
Collaborator Author

Xuanwo commented Aug 28, 2025

I caught some bugs on the Python side and am working on them.

@Xuanwo
Collaborator Author

Xuanwo commented Aug 28, 2025

I really don't like how I'm handling JSON and JSONB conventions in the code. Maybe I optimized too early, adding unnecessary complexity just so that we can run UDFs on JSONB instead of on the entire JSON.

I'm starting to reconsider @westonpace's suggestion to keep JSONB as part of the file format instead of treating it as a logical type. While it's working for now, we might as well keep moving forward and see how it holds up in the long run. We can always make adjustments later.

I'll merge this PR first and make any other changes in follow-up PRs. This will also unblock us to implement indexes over JSON.

Xuanwo added 5 commits August 29, 2025 03:58
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
@Xuanwo Xuanwo merged commit a65857a into main Aug 29, 2025
30 checks passed
@Xuanwo Xuanwo deleted the add-udf-for-json branch August 29, 2025 04:01
@Xuanwo
Collaborator Author

Xuanwo commented Aug 29, 2025

The follow-up is tracked at #4595.
