I'm trying to extract a Parquet file from a bucket, but Sling fails when it tries to load the file (using DuckDB, I guess). This only happens when using the Python wrapper together with a local LocalStack setup. Perhaps the endpoint URL isn't passed to DuckDB, but that's just a wild guess. Writing a Parquet file, on the other hand, works fine, and the extraction also works with the non-Python CLI.
I'm running Sling as part of Dagster, which might also be relevant since the Python wrapper seems to evaluate that (although this works with a non-local bucket).
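To make the guess about the endpoint concrete, here is a minimal sketch of the settings DuckDB's httpfs extension would need in order to read from an S3-compatible endpoint such as LocalStack. It only illustrates the hypothesis and is not how Sling configures DuckDB internally. The function name `duckdb_s3_settings` is hypothetical, the input values are the ones from the connection in this report, and path-style addressing is an assumption about the LocalStack setup.

```python
from urllib.parse import urlparse


def duckdb_s3_settings(conn: dict) -> list[str]:
    """Build DuckDB SET statements for an S3-compatible endpoint.

    Hypothetical helper for illustration; Sling may configure DuckDB differently.
    """
    endpoint = urlparse(conn["endpoint"])
    # DuckDB's s3_endpoint takes host[:port] without the scheme;
    # the scheme is expressed through s3_use_ssl instead.
    use_ssl = "true" if endpoint.scheme == "https" else "false"
    return [
        f"SET s3_endpoint = '{endpoint.netloc}';",
        f"SET s3_use_ssl = {use_ssl};",
        # Assumption: LocalStack is addressed path-style (host/bucket/key).
        "SET s3_url_style = 'path';",
        f"SET s3_region = '{conn['region']}';",
        f"SET s3_access_key_id = '{conn['access_key_id']}';",
        f"SET s3_secret_access_key = '{conn['secret_access_key']}';",
    ]


# Values taken from the AWS_S3 connection used in this report.
connection = {
    "endpoint": "http://localhost:4566",
    "region": "eu-central-1",
    "access_key_id": "localstack",
    "secret_access_key": "localstack",
}

for statement in duckdb_s3_settings(connection):
    print(statement)
```

If the endpoint were not forwarded, DuckDB would presumably try to reach the real AWS endpoint for the bucket, which would be consistent with the failure only showing up against LocalStack.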
Run LocalStack
You can run LocalStack using Docker or Podman.
Create a local bucket
aws s3 mb s3://my-test-bucket --endpoint-url=http://localhost:4566
Define a new Sling Connection for that bucket (in your env.yaml):
AWS_S3:
  type: s3
  bucket: my-test-bucket
  region: eu-central-1
  endpoint: http://localhost:4566
  access_key_id: localstack
  secret_access_key: localstack
Create a CSV file
printf 'Hello,World\nHello,World\n' > test.txt
Define a new Local Connection to access that CSV file (in your env.yaml):
LOCAL:
  type: local
  url: file://<root/of/text/file>
Load the file as parquet into your bucket using a replication YAML
This is only to create a Parquet test file:
source: LOCAL
target: AWS_S3
defaults:
  mode: full-refresh
  target_options:
    format: parquet
streams:
  test.txt:
    object: test.parquet
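To confirm that this write step really produced a Parquet file before testing the extraction, the object can be downloaded and checked for the Parquet magic bytes. The sketch below relies on the fact that an unencrypted Parquet file starts and ends with the four bytes `PAR1`. The helper name `looks_like_parquet` and the local file name in the usage comment are assumptions for illustration.

```python
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(data: bytes) -> bool:
    """Return True if the bytes start and end with the Parquet magic number.

    This is a quick sanity check only; it does not validate the footer metadata.
    """
    # Smallest possible layout: leading magic (4) + footer length (4) + trailing magic (4).
    if len(data) < 12:
        return False
    return data[:4] == PARQUET_MAGIC and data[-4:] == PARQUET_MAGIC


# Usage, assuming the object was downloaded to a local file named test.parquet:
#     from pathlib import Path
#     print(looks_like_parquet(Path("test.parquet").read_bytes()))
```

A `True` result would support the observation that writing works and that the problem is limited to the read path.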
Extract the parquet file from the bucket to local storage
This fails:
source: AWS_S3
target: LOCAL
defaults:
  mode: full-refresh
  source_options:
    format: parquet
streams:
  test.parquet:
    object: result.txt