
Crashes while handling non-select result set (DataFrame) #428

@joaoe

What happens?

Hi.

Problem

When calling spark.sql("non-select"), where non-select is any SQL statement that is not a SELECT (e.g., USE, INSERT, DROP, CREATE), the sql() function correctly returns an empty DataFrame, which matches the behavior of the pyspark API.

However, that object crashes when any of its APIs is used, because the internal relation object is None. The same applies when trying to create an empty DataFrame without columns.

Fix

I think the best fix would be to change the underlying Relation object in the duckdb C++ library so that it supports an empty relation without columns. A couple of other fixes are also needed, such as allowing the underlying duckdb.struct_type() to have no fields. That would make the low-level API more robust and require less patching in the Python layer.

Then the DuckDBPyConnection::RunQuery function needs to return an empty relation for non-select statements, instead of nullptr. All these fixes felt a bit overwhelming, so I won't submit a patch.
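
To illustrate the idea, here is a minimal, self-contained sketch in plain Python. It does not use duckdb or pyspark, and every name in it (EmptyRelation, DataFrameSketch, run_query_returning_none, run_query_returning_empty) is hypothetical. It only shows why a wrapper around None fails on first use, while a wrapper around an empty relation object keeps working.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EmptyRelation:
    """Hypothetical stand-in for a relation with no columns and no rows."""

    columns: list[str] = field(default_factory=list)

    def fetchall(self) -> list[tuple[Any, ...]]:
        # An empty relation has nothing to return.
        return []


class DataFrameSketch:
    """Hypothetical wrapper; not the real duckdb or pyspark DataFrame."""

    def __init__(self, relation: Optional[EmptyRelation]) -> None:
        self.relation = relation

    @property
    def columns(self) -> list[str]:
        # Fails with AttributeError when self.relation is None.
        return list(self.relation.columns)

    def collect(self) -> list[tuple[Any, ...]]:
        return self.relation.fetchall()


def run_query_returning_none(sql: str) -> DataFrameSketch:
    # Mirrors the reported behavior: no relation for a non-select statement.
    return DataFrameSketch(None)


def run_query_returning_empty(sql: str) -> DataFrameSketch:
    # Mirrors the proposed behavior: an empty relation instead of nothing.
    return DataFrameSketch(EmptyRelation())
```

With the second variant, the wrapper needs no special-casing of None in each method, which is the "less patching in the Python layer" argument made above.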

To Reproduce

Test case. All of this works with Spark.

import pytest

# spark_session_g is a fixture that provides the Spark session under test.
@pytest.mark.parametrize("mode", ["pandas", "list", "non-select"])
def test_empty_sdf(spark_session_g, mode):
    from pyspark.sql import functions as f
    from pyspark.sql import types as t
    import pandas as pd

    spark = spark_session_g
    if mode == "pandas":
        sdf = spark.createDataFrame(pd.DataFrame(), t.StructType([]))
    elif mode == "list":
        sdf = spark.createDataFrame([], t.StructType([]))
    else:
        curr_db = spark.catalog.currentDatabase()
        sdf = spark.sql(f"USE {curr_db}")  # non-result set query

    assert sdf.schema == t.StructType([])
    assert sdf.columns == []
    assert sdf.collect() == []
    assert sdf.toPandas().empty
    assert sdf.toArrow().shape == (0, 0)
    sdf.createOrReplaceTempView("my_vv1")
    assert spark.sql("SELECT * from my_vv1").toArrow().shape == (0, 0)
    sdf.show()  # no-op, no crash
    assert sdf.withColumn("col1", f.lit(1)).columns == ["col1"]
    assert sdf.withColumns({"col1": f.lit(1)}).columns == ["col1"]
    assert sdf.drop("noop").columns == []

OS:

Any

DuckDB Package Version:

Main branch

Python Version:

3.12

Full Name:

João Eiras

Affiliation:

private

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a source build

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration to reproduce the issue?

  • Yes, I have
