Skip to content

tbl() -> collect() workflow does not work in networks where blob.core.windows.net is blocked #205

@thothal

Description

@thothal

Issue

In dbSendQuery the disposition parameter is hardcoded and set to EXTERNAL_LINKS (for dbGetQuery on the other hand we can set the disposition ourselves). That is, in an environment where blob.core.windows.net is blocked by teh firewall, we cannot use dbSendQuery.

The problem becomes apparent if we want to use tbl, because under the hood it will eventually call dbSendQuery.

Call Chain: tbl()dbSendQuery()

tbl(con, "table")
  └─ dbplyr::tbl.DBIConnection()
       └─ dbplyr::tbl_sql()
            └─ builds a lazy "tbl_lazy" object
                 └─ NO SQL executed yet — only triggered by collect()!

collect(lazy_tbl)
  └─ dbplyr::collect.tbl_sql()
       └─ dbplyr::db_collect()
            └─ DBI::dbFetch()
                 └─ DBI::dbSendQuery()
                      └─ brickster:::dbSendQuery,DatabricksConnection
                           └─ db_sql_exec_query(disposition = "EXTERNAL_LINKS") ← 💥
Step Package What happens
tbl() dbplyr Builds a query plan only, no SQL sent
collect() dbplyr Triggers actual execution
dbFetch() DBI Fetches the results
dbSendQuery() DBI → brickster Sends SQL to Databricks — this is where the hardcoded disposition = "EXTERNAL_LINKS" is applied, with no way to override it

tbl() itself does not interact with the database at all. The critical point is collect(), which triggers the full dbplyr → DBI → brickster chain — and there is no way to pass disposition = "INLINE" through this path.

Desired Behaviour

Ideally there would be an option, with which we could override the hard coded disposition to allow for using tbl also on systems with no connection to blob.core.windows.net.

Critical Code Part

#' Send query to Databricks (asynchronous)
#' @param conn A DatabricksConnection object
#' @param statement SQL statement to execute
#' @param ... Additional arguments (ignored)
#' @returns A DatabricksResult object
#' @export
setMethod(
"dbSendQuery",
signature = c(conn = "DatabricksConnection", statement = "character"),
function(conn, statement, ...) {
db_assert_statement(statement)
# Execute query asynchronously
resp <- db_sql_exec_query(
warehouse_id = conn@warehouse_id,
statement = statement,
catalog = if (nzchar(conn@catalog)) conn@catalog else NULL,
schema = if (nzchar(conn@schema)) conn@schema else NULL,
disposition = "EXTERNAL_LINKS",
format = "ARROW_STREAM",
wait_timeout = "0s", # Async execution
host = conn@host,
token = conn@token
)
# Create result object
new(
"DatabricksResult",
statement_id = resp$statement_id,
statement = statement,
connection = conn,
completed = FALSE,
rows_fetched = 0
)
}
)

Error Message

library(DBI)
library(brickster)
library(dbplyr)
library(dplyr)

warehouse_id <- 12345
con <- dbConnect(
  DatabricksSQL(),
  warehouse_id = warehouse_id
)

con %>% 
  tbl(I("my_catalog.my_schema.my_table")) %>% 
  head(5L)

## Error in `collect()`:
## ! Failed to collect lazy table.
## Caused by error in `httr2::req_perform()`:
## ! Failed to perform HTTP request.
## Caused by error in `curl::curl_fetch_memory()`:
## ! SSL connect error [dbstoragejcmu2vp2x6lnw.blob.core.windows.net]:
## Recv failure: Connection was reset

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions