diff --git a/.cursor/rules/mintlify-mdx-gotchas.mdc b/.cursor/rules/mintlify-mdx-gotchas.mdc
new file mode 100644
index 0000000000000..69f051ca47003
--- /dev/null
+++ b/.cursor/rules/mintlify-mdx-gotchas.mdc
@@ -0,0 +1,110 @@
+---
+description: Mintlify MDX parsing gotchas — patterns that look correct but render broken
+globs: docs-mintlify/**
+alwaysApply: true
+---
+
+# Mintlify MDX gotchas
+
+Mintlify uses MDX (Markdown + JSX). A handful of patterns parse cleanly in
+plain Markdown but render broken once wrapped in JSX components like
+`<Step>`. Use these rules when authoring or editing pages in
+`docs-mintlify/`.
+
+## `<Step>` must start with prose, not a fenced code block
+
+When a fenced code block is the **first child** of a `<Step>` component,
+Mintlify's MDX parser collapses the entire block onto one line, exposing
+the literal triple backticks. Every `<Step>` body must begin with at least
+one line of prose (a single sentence is enough).
+
+❌ Renders broken — fence is the first child:
+
+````mdx
+<Step>
+  ```json
+  {
+    "Version": "2012-10-17",
+    ...
+  }
+  ```
+</Step>
+````
+
+✅ Add a one-line lead-in:
+
+````mdx
+<Step>
+  Allow the role to read, write, and list objects in your bucket:
+
+  ```json
+  {
+    "Version": "2012-10-17",
+    ...
+  }
+  ```
+</Step>
+````
+
+The intro line also reads better — it tells the user what the snippet does
+before they hit it. Apply the same rule to any code block (`json`, `bash`,
+`dotenv`, `yaml`, `sql`, …) that would otherwise be the first child of a
+`<Step>`.
+
+This rule applies anywhere a fenced code block sits at the top of a JSX
+component body. `<Step>` is the most common offender; `<Tab>` and `<Accordion>`
+behave the same way.
+
+## Angle-bracket placeholders must be backticked in prose
+
+`<slug>`, `<deployment-id>`, `<project-number>`, etc. are useful
+placeholders inside fenced code blocks (they're plain text there) but in
+prose they have to be wrapped in backticks. Otherwise MDX tries to parse
+them as JSX components and either errors out or silently swallows them.
+
+❌ MDX tries to parse `<slug>` as a component:
+
+```mdx
+The issuer URL is https://<slug>.cubecloud.dev.
+```
+
+✅ Backticked:
+
+```mdx
+The issuer URL is `https://<slug>.cubecloud.dev`.
+```
+
+Inside fenced ```` ``` ```` blocks (and Mermaid diagrams that don't pass
+through the JSX parser) the bare form is fine.
+
+## Mermaid participant labels can't carry angle-bracket placeholders
+
+Mermaid renders `<br/>` inside participant labels as a line break, but it
+doesn't tolerate other `<…>` tags — they break the diagram. Use a plain
+description ("your tenant URL") instead of `<slug>.cubecloud.dev`
+in `participant` labels.
+
+## Bash heredoc / assignment with `<placeholder>` breaks copy-paste
+
+`PROJECT_NUMBER=<project-number>` looks fine in a doc, but bash interprets
+`<` as input redirection — pasting the line errors out. Quote it:
+
+```bash
+PROJECT_NUMBER="<project-number>"
+DEPLOYMENT_ID="<deployment-id>"
+```
+
+Quotes preserve the placeholder visually and let the user `sed`-replace
+without the line failing first.
+
+## Don't use a body H1
+
+Mintlify renders the frontmatter `title` as the page H1. Adding `# Heading`
+in the body produces two H1s and breaks the doc outline. Use `## Heading`
+for the first body section.
+
+## Verifying
+
+`mintlify dev --no-open` boots a local preview and surfaces compile errors
+on startup. `mintlify broken-links` flags dead internal anchors. Run both
+before publishing a non-trivial doc change.
diff --git a/docs-mintlify/admin/deployment/oidc/aws.mdx b/docs-mintlify/admin/deployment/oidc/aws.mdx
new file mode 100644
index 0000000000000..23409ecf1b05d
--- /dev/null
+++ b/docs-mintlify/admin/deployment/oidc/aws.mdx
@@ -0,0 +1,528 @@
+---
+title: AWS
+sidebarTitle: AWS
+description: Configure AWS IAM trust for Cube's OIDC issuer and use it for Athena, S3 export buckets, Cube Store CSPS, and Bedrock.
+noindex: true
+---
+
+This guide walks through configuring AWS to trust Cube's OIDC issuer and
+shows the trust policies for the most common targets — Athena, an S3 export
+bucket, Cube Store CSPS, and Bedrock for bring-your-own LLM.
+
+If you haven't enabled OIDC for your tenant yet, start with the
+[OIDC overview][ref-oidc-overview].
+
+<Info>
+
+Available on the [Enterprise plan](https://cube.dev/pricing).
+
+</Info>
+
+## Prerequisites
+
+- The Cube tenant has OIDC enabled and an `AWS` token config exists under
+  **Admin → OIDC**.
+- IAM access to your AWS account sufficient to register an IAM OIDC provider
+  and create / update IAM roles.
+- Your tenant slug — the leftmost label of your tenant's console URL.
+  Throughout this guide it's referenced as `<slug>` (and the full
+  issuer URL as `https://<slug>.cubecloud.dev`). Substitute your
+  actual slug everywhere it appears.
+
+<Warning>
+
+The trust policies, env vars, and CLI commands in this guide use angle-bracket
+placeholders — `<slug>`, `<account-id>`, `<deployment-id>`,
+etc. **Replace each placeholder with your real value** before
+copying. Otherwise, AWS will accept these strings literally and the federation
+call will fail with a confusing error.
+
+</Warning>
+
+## Step 1: Register Cube as an OIDC provider in AWS
+
+This is a **one-time** setup per AWS account. Once registered, every IAM role
+in this account can be configured to trust deployments in your Cube tenant.
+
+```bash
+aws iam create-open-id-connect-provider \
+  --url https://<slug>.cubecloud.dev \
+  --client-id-list sts.amazonaws.com
+```
+
+<Note>
+
+AWS no longer requires a TLS thumbprint for HTTPS OIDC providers. The
+`--thumbprint-list` parameter is accepted for compatibility but ignored —
+AWS validates the issuer's certificate chain against its own trust store.
+
+</Note>
+
+The command returns the provider ARN, which looks like:
+
+```
+arn:aws:iam::<account-id>:oidc-provider/<slug>.cubecloud.dev
+```
+
+You'll reference this ARN as the `Federated` principal in every trust policy
+below. From AWS's point of view, this provider _is_ your Cube tenant.
+
+## Step 2: Set the deployment identity
+
+Add `AWS_ROLE_ARN` to your deployment's environment variables under
+**Settings → Environment variables**. This is the IAM role Cube assumes by
+default for every AWS SDK call inside the deployment — drivers, export
+bucket I/O, custom code in `cube.py` / `cube.js`. You can either grant this
+role direct access to your data, or use it as the entry point for further
+`AssumeRole` hops.
+
+```dotenv
+AWS_ROLE_ARN=arn:aws:iam::<account-id>:role/cube-deployment-<deployment-id>
+```
+
+<Warning>
+
+`sts:AssumeRoleWithWebIdentity` does **not** accept `sts:ExternalId`. Trust
+policies for OIDC-federated roles can only condition on the standard OIDC
+claims (`aud`, `sub`, `iss`). If you copy a trust policy from a non-federated
+role assumption (cross-account access keys, for example) and it includes
+`sts:ExternalId`, remove it — STS will reject the federation call. Use the
+`sub` claim to pin the role to a specific deployment or component instead.
+
+</Warning>
+
+## Step 3: Build the trust policy
+
+Every IAM role you want Cube to assume needs a trust policy with this shape:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "Federated": "arn:aws:iam::<account-id>:oidc-provider/<slug>.cubecloud.dev"
+      },
+      "Action": "sts:AssumeRoleWithWebIdentity",
+      "Condition": {
+        "StringEquals": {
+          "<slug>.cubecloud.dev:aud": "sts.amazonaws.com"
+        },
+        "StringLike": {
+          "<slug>.cubecloud.dev:sub": "cube:deployment:<deployment-id>:component:cube_api"
+        }
+      }
+    }
+  ]
+}
+```
+
+The `aud` condition pins the audience to AWS STS — exactly what Cube's `aws`
+token config emits. The `sub` condition is what scopes the trust to a
+specific deployment (and optionally a specific Cube component). Patterns:
+
+| Trust scope | `sub` pattern |
+| ----------- | ------------- |
+| One specific deployment, any component | `cube:deployment:<deployment-id>:component:*` |
+| One deployment, only the Cube API and refresh worker | `cube:deployment:<deployment-id>:component:cube_api` |
+| One deployment, only Cube Store | `cube:deployment:<deployment-id>:component:cube_store` |
+| Every deployment in the tenant, only Cube Store (e.g. tenant-wide CSPS) | `cube:deployment:*:component:cube_store` |
+| Every deployment, every component | `cube:deployment:*:component:*` |
+
+Cube's default `sub` claim is `cube:deployment:<deployment-id>`.
To match
+the `:component:<component>` patterns in the table above (or to add
+`:region:<region>`), open your AWS token config in **Admin → OIDC** and
+paste one of these templates into the **Subject Claim Format** field:
+
+- `cube:deployment:{deployment_id}:component:{component}` — for the patterns
+  in the table above.
+- `cube:deployment:{deployment_id}:component:{component}:region:{region}` —
+  to additionally pin a [Cube Cloud region][ref-cube-cloud-region], useful
+  when you have dedicated regions per environment.
+
+See [the subject editor section][ref-sub-editor] for the full syntax.
+
+<Warning>
+
+Update your AWS trust policy first, then change the **Subject Claim Format**
+on the token config — otherwise existing tokens won't match the trust policy
+and `AssumeRoleWithWebIdentity` will start failing.
+
+</Warning>
+
+## Athena
+
+Configure an IAM role with permissions to query Athena and read query
+results from your S3 results bucket.
+
+
+
+Trust policy — substitute your AWS account ID, tenant slug, and
+deployment ID for the angle-bracket placeholders:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "Federated": "arn:aws:iam::<account-id>:oidc-provider/<slug>.cubecloud.dev"
+      },
+      "Action": "sts:AssumeRoleWithWebIdentity",
+      "Condition": {
+        "StringEquals": {
+          "<slug>.cubecloud.dev:aud": "sts.amazonaws.com"
+        },
+        "StringLike": {
+          "<slug>.cubecloud.dev:sub": "cube:deployment:<deployment-id>:component:cube_api"
+        }
+      }
+    }
+  ]
+}
+```
+
+
+Athena needs permission to start queries, read Glue metadata, and
+read / write the S3 results bucket. The example below scopes the S3
+permissions to a single bucket; tighten the resource list as needed.
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "athena:StartQueryExecution",
+        "athena:GetQueryExecution",
+        "athena:GetQueryResults",
+        "athena:StopQueryExecution",
+        "athena:ListQueryExecutions",
+        "athena:GetWorkGroup",
+        "athena:ListWorkGroups"
+      ],
+      "Resource": "*"
+    },
+    {
+      "Effect": "Allow",
+      "Action": [
+        "glue:GetDatabase",
+        "glue:GetDatabases",
+        "glue:GetTable",
+        "glue:GetTables",
+        "glue:GetPartitions"
+      ],
+      "Resource": "*"
+    },
+    {
+      "Effect": "Allow",
+      "Action": [
+        "s3:GetBucketLocation",
+        "s3:GetObject",
+        "s3:ListBucket",
+        "s3:PutObject",
+        "s3:DeleteObject",
+        "s3:AbortMultipartUpload",
+        "s3:ListMultipartUploadParts"
+      ],
+      "Resource": [
+        "arn:aws:s3:::my-athena-results",
+        "arn:aws:s3:::my-athena-results/*",
+        "arn:aws:s3:::my-athena-data",
+        "arn:aws:s3:::my-athena-data/*"
+      ]
+    }
+  ]
+}
+```
+
+
+Set the deployment-level Athena env vars. With `AWS_ROLE_ARN` in place,
+the Athena driver automatically assumes the role via OIDC federation —
+no static credentials needed.
+
+```dotenv
+CUBEJS_DB_TYPE=athena
+AWS_ROLE_ARN=arn:aws:iam::<account-id>:role/cube-deployment-<deployment-id>
+CUBEJS_AWS_REGION=us-east-1
+CUBEJS_AWS_S3_OUTPUT_LOCATION=s3://my-athena-results/queries/
+```
+
+If Athena lives in a different role / account from the deployment's
+default identity, set [`CUBEJS_AWS_ATHENA_ASSUME_ROLE_ARN`][ref-athena-assume-role]
+in addition to `AWS_ROLE_ARN`. The Athena driver uses the deployment
+identity to perform a second `AssumeRole` hop into the Athena role.
+
+
+
+## S3 export bucket
+
+If your data source uses an [export bucket][ref-export-bucket] for
+pre-aggregation unloads (Snowflake, Redshift, Athena, BigQuery, …), Cube
+needs `s3:PutObject` / `s3:GetObject` / `s3:ListBucket` on the bucket. The
+deployment's default identity is the simplest place to put this.
+
+
+
+Add an inline statement to the policy attached to your deployment's
+`AWS_ROLE_ARN`:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "s3:GetObject",
+        "s3:PutObject",
+        "s3:DeleteObject",
+        "s3:ListBucket",
+        "s3:GetBucketLocation",
+        "s3:AbortMultipartUpload",
+        "s3:ListMultipartUploadParts"
+      ],
+      "Resource": [
+        "arn:aws:s3:::my-export-bucket",
+        "arn:aws:s3:::my-export-bucket/*"
+      ]
+    }
+  ]
+}
+```
+
+
+Set the export bucket env vars on the deployment — leave the AWS access
+key vars empty so the SDK falls back to the OIDC-derived credentials:
+
+```dotenv
+CUBEJS_DB_EXPORT_BUCKET_TYPE=s3
+CUBEJS_DB_EXPORT_BUCKET=my-export-bucket
+CUBEJS_DB_EXPORT_BUCKET_AWS_REGION=us-east-1
+```
+
+See the [export bucket reference][ref-export-bucket] for the full set
+of variables.
+
+
+
+## Cube Store CSPS bucket
+
+Cube Store CSPS lets you store pre-aggregations in your own S3 bucket.
+Cube Store gets a separate OIDC token whose `sub` claim ends in
+`component:cube_store`, so the trust policy can be locked down to that
+component — even if the same role were ever shared with the rest of the
+deployment, only Cube Store would be able to assume it.
+
+Because every Cube Store worker emits a `sub` of the form
+`cube:deployment:<deployment-id>:component:cube_store`, the trust policy's
+`StringLike` condition controls how broadly the role is shared:
+
+- `cube:deployment:*:component:cube_store` — **one role + one bucket for the
+  whole tenant.** Every deployment in the tenant writes pre-aggregations to
+  the same bucket, isolated only by Cube Store's own per-deployment path
+  prefix. Easiest to operate and the most common setup.
+- `cube:deployment:<deployment-id>:component:cube_store` — **per-deployment
+  isolation.** Pin the role to one deployment so its pre-aggregations live
+  in a dedicated bucket that no other deployment can read or write.
+The example below shows the tenant-wide pattern; swap `*` for a specific
+deployment ID if you want isolation.
+
+
+
+Trust policy — note the `StringLike` condition pinning the component
+and using `*` so every deployment in the tenant can assume the role:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "Federated": "arn:aws:iam::<account-id>:oidc-provider/<slug>.cubecloud.dev"
+      },
+      "Action": "sts:AssumeRoleWithWebIdentity",
+      "Condition": {
+        "StringEquals": {
+          "<slug>.cubecloud.dev:aud": "sts.amazonaws.com"
+        },
+        "StringLike": {
+          "<slug>.cubecloud.dev:sub": "cube:deployment:*:component:cube_store"
+        }
+      }
+    }
+  ]
+}
+```
+
+
+Allow the role to read, write, and list objects in your CSPS bucket.
+Cube Store needs all of the actions below — including the multipart
+upload primitives — to handle large pre-aggregation partitions:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "s3:GetObject",
+        "s3:PutObject",
+        "s3:DeleteObject",
+        "s3:ListBucket",
+        "s3:GetBucketLocation",
+        "s3:AbortMultipartUpload",
+        "s3:ListMultipartUploadParts"
+      ],
+      "Resource": [
+        "arn:aws:s3:::my-csps-bucket",
+        "arn:aws:s3:::my-csps-bucket/*"
+      ]
+    }
+  ]
+}
+```
+
+
+For each deployment that should use this bucket, go to **Settings →
+Pre-Aggregation Storage** on the deployment and:
+
+- Toggle **Enable CSPS** on.
+- **Storage Provider**: Amazon S3.
+- **S3 Bucket**: `my-csps-bucket`.
+- **S3 Region**: e.g. `us-east-1`.
+- **IAM Role ARN**: `arn:aws:iam::<account-id>:role/cube-cubestore-<slug>`.
+
+Click **Test Connection** to verify Cube Store can assume the role and
+access the bucket, then **Apply**. Cube Store starts writing
+pre-aggregations to your bucket on the next refresh. With the
+tenant-wide trust policy above, every deployment in the tenant points
+at the same role — no need to provision a new IAM role per deployment.
+
+
+Settings → Pre-Aggregation Storage page on a Cube Cloud deployment, with the Enable CSPS toggle on and the Amazon S3 fields (S3 Bucket, S3 Region, IAM Role ARN) ready to fill in.
+
+
+
+## Bedrock for bring-your-own LLM
+
+Bring-your-own LLM lets the AI engineer service call Bedrock through your
+own AWS account. The AI engineer's `sub` claim ends in
+`component:ai_engineer`, so the trust policy uses a `StringLike` match on
+`cube:deployment:*:component:ai_engineer` to grant access tenant-wide
+(every deployment's AI engineer assumes the same role and writes against
+the same Bedrock account).
+
+
+
+Trust policy — note the `StringLike` condition pinning the component:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "Federated": "arn:aws:iam::<account-id>:oidc-provider/<slug>.cubecloud.dev"
+      },
+      "Action": "sts:AssumeRoleWithWebIdentity",
+      "Condition": {
+        "StringEquals": {
+          "<slug>.cubecloud.dev:aud": "sts.amazonaws.com"
+        },
+        "StringLike": {
+          "<slug>.cubecloud.dev:sub": "cube:deployment:*:component:ai_engineer"
+        }
+      }
+    }
+  ]
+}
+```
+
+
+Grant access only to the foundation models and inference profiles you
+intend to use:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "bedrock:InvokeModel",
+        "bedrock:InvokeModelWithResponseStream",
+        "bedrock:Converse",
+        "bedrock:ConverseStream"
+      ],
+      "Resource": [
+        "arn:aws:bedrock:*::foundation-model/anthropic.*",
+        "arn:aws:bedrock:*:<account-id>:inference-profile/*"
+      ]
+    }
+  ]
+}
+```
+
+Cross-region inference profiles need access to the foundation model in
+every region the profile routes through, hence the `*` region in the
+foundation-model ARN.
+
+
+Under **Admin → AI → Models**, click **Add Model** and configure:
+
+- **Name** — a human-readable label for this BYOM entry.
+- **Model Type** — `LLM`.
+- **Provider** — `AWS Bedrock`.
+- **Model** — the Claude (or other) model you want the AI engineer to
+  call.
- **Region** — the Bedrock region (e.g. `us-east-1`).
+- **Inference Profile ID** — optional; leave blank to call the
+  foundation model directly, or set to a cross-region inference
+  profile ID.
+- **Assume Role ARN** — the role you created above.
+- **Use OIDC workload identity** — toggle on. With this on, the AI
+  engineer authenticates via the deployment's OIDC token instead of
+  static AWS credentials.
+
+
+Admin → AI → Models → Add New Model modal in Cube Cloud, showing the AWS Bedrock provider selected with the Assume Role ARN field and the Use OIDC workload identity toggle on.
+
+
+
+## Verifying the setup
+
+The fastest way to confirm the trust policy is wired up correctly is the
+**Test connection** button on the relevant settings page (data source
+wizard, CSPS settings, BYO LLM provider). Behind the scenes, this issues a
+real Cube OIDC token, runs `AssumeRoleWithWebIdentity` against AWS STS, and
+returns a precise error if the trust policy rejects it.
+
+If the test fails:
+
+| Symptom | Likely cause |
+| ------- | ------------ |
+| `Not authorized to perform sts:AssumeRoleWithWebIdentity` | The trust policy's `sub` condition doesn't match the deployment / component. Compare the `sub` in the error with what your `StringLike` allows. |
+| `Incorrect token audience` | `aud` condition is wrong, or you copied a trust policy from a different tenant. The condition key must be `<slug>.cubecloud.dev:aud`. |
+| `OpenIDConnectProvider not found` | The OIDC provider hasn't been registered in this AWS account yet, or its URL doesn't match your tenant's domain. Re-run `aws iam create-open-id-connect-provider` with the correct URL. |
+| `An error occurred ... InvalidIdentityToken` | Trying to use `sts:ExternalId`. Remove it — federated assume-role doesn't accept it. |
+
+`AssumeRoleWithWebIdentity` events show up in CloudTrail with the deployment
+subject in the `userIdentity.webIdFederationData.federatedProvider` and
+`...attributes` fields — useful for auditing which deployments are
+authenticating against which roles.
+
+[ref-oidc-overview]: /admin/deployment/oidc
+[ref-sub-editor]: /admin/deployment/oidc#subject-claim-format
+[ref-cube-cloud-region]: /admin/deployment/infrastructure#what-is-a-cube-cloud-region
+[ref-export-bucket]: /admin/connect-to-data#export-bucket
+[ref-athena-assume-role]: /admin/connect-to-data/data-sources/aws-athena#environment-variables
diff --git a/docs-mintlify/admin/deployment/oidc/azure.mdx b/docs-mintlify/admin/deployment/oidc/azure.mdx
new file mode 100644
index 0000000000000..a199b9cbb38e2
--- /dev/null
+++ b/docs-mintlify/admin/deployment/oidc/azure.mdx
@@ -0,0 +1,279 @@
+---
+title: Azure
+sidebarTitle: Azure
+description: Configure an Azure AD app registration with federated credentials that trust Cube's OIDC issuer, then use it for Azure SQL and Blob Storage.
+noindex: true
+---
+
+This guide walks through configuring Azure to trust Cube's OIDC issuer using
+**federated credentials** on an Azure AD app registration, and shows the
+setup for the most common targets — Azure SQL and an Azure Blob Storage
+export bucket.
+
+If you haven't enabled OIDC for your tenant yet, start with the
+[OIDC overview][ref-oidc-overview].
+
+<Info>
+
+Available on the [Enterprise plan](https://cube.dev/pricing).
+
+</Info>
+
+## Prerequisites
+
+- The Cube tenant has OIDC enabled and an `Azure` token config exists under
+  **Admin → OIDC**.
+- Permissions in your Azure AD tenant sufficient to create an app
+  registration and add federated credentials, plus an Azure subscription
+  where you can grant the app role assignments on resources.
+- Your Cube tenant slug — the leftmost label of your tenant's console URL.
+  Throughout this guide it's referenced as `<slug>` (and the full
+  issuer URL as `https://<slug>.cubecloud.dev`).
Substitute your
+  actual slug everywhere it appears.
+
+<Warning>
+
+Commands and federated-credential snippets in this guide use angle-bracket
+placeholders — `<slug>`, `<app-client-id>`, etc.
+**Replace each placeholder with your real value** before running. Otherwise,
+Azure will accept these strings literally and the federation call will fail
+with a confusing error.
+
+</Warning>
+
+## How Azure federation works
+
+Azure doesn't have a generic OIDC provider registration the way AWS does.
+Instead, you configure each app registration (or user-assigned managed
+identity) with a **federated credential** that pins the issuer URL,
+expected `aud`, and exact `sub` value. When Cube presents a JWT that
+matches all three, Azure AD swaps it for an access token scoped to the
+app registration.
+
+```mermaid
+sequenceDiagram
+  participant Cube as Cube Deployment
+  participant AAD as Azure AD<br/>(login.microsoftonline.com)
+  participant Resource as Azure Resource<br/>(SQL / Blob / OpenAI)
+
+  Cube->>AAD: client_assertion = OIDC JWT<br/>grant_type = client_credentials<br/>audience = api://AzureADTokenExchange
+  AAD->>AAD: Match federated credential by issuer + sub + aud<br/>Validate JWT against Cube's JWKS
+  AAD->>Cube: Access token for the app registration
+  Cube->>Resource: Authenticated request
+```
+
+<Warning>
+
+Azure caps you at **20 federated credentials per app registration**. For
+tenants with many deployments, see [Scaling past 20
+credentials](#scaling-past-20-federated-credentials) below.
+
+</Warning>
+
+## Step 1: Create or pick an app registration
+
+You can use an existing app registration or create a new one — Cube doesn't
+have any opinion as long as it has the federated credential and the role
+assignments.
+
+```bash
+az ad app create --display-name "Cube Cloud Deployment"
+```
+
+Note the **Application (client) ID** — this is your `AZURE_CLIENT_ID`. The
+**Directory (tenant) ID** is your `AZURE_TENANT_ID`. Both are visible in
+the Azure portal under **Microsoft Entra ID → App registrations → Cube
+Cloud Deployment → Overview**.
+
+## Step 2: Add a federated credential
+
+The federated credential is what binds the app registration to a specific
+Cube subject. Each credential matches **exactly one** issuer + subject +
+audience triple — Azure doesn't support wildcards or pattern matching here.
+
+```bash
+az ad app federated-credential create \
+  --id <app-client-id> \
+  --parameters '{
+    "name": "cube-<slug>-deployment-<deployment-id>",
+    "issuer": "https://<slug>.cubecloud.dev",
+    "subject": "cube:deployment:<deployment-id>:component:cube_api",
+    "audiences": ["api://AzureADTokenExchange"]
+  }'
+```
+
+| Field | Value |
+| ----- | ----- |
+| `issuer` | Your Cube tenant's URL: `https://<slug>.cubecloud.dev`. |
+| `subject` | The exact `sub` claim Cube emits — typically `cube:deployment:<deployment-id>:component:<component>`. |
+| `audiences` | Always `["api://AzureADTokenExchange"]` — this is the standard Azure AD token-exchange audience. |
+| `name` | A human-readable label. Pick something that lets you find this credential later. |
+
+Each Cube component you want this app to authenticate as needs its own
+federated credential. So if a deployment runs both Cube API and Cube Store
+against the same Azure resource, you create two credentials — one with
+`subject` ending in `:component:cube_api` and one ending in
+`:component:cube_store`.
+
+Cube's default `sub` claim is `cube:deployment:<deployment-id>`. To match
+the `:component:<component>` examples in this guide (or to add
+`:region:<region>`), open your Azure token config in **Admin → OIDC** and
+paste one of these templates into the **Subject Claim Format** field:
+
+- `cube:deployment:{deployment_id}:component:{component}` — for the
+  `cube:deployment:<deployment-id>:component:<component>` examples below.
+- `cube:deployment:{deployment_id}:component:{component}:region:{region}` —
+  to additionally pin a [Cube Cloud region][ref-cube-cloud-region].
+
+See [the subject editor section][ref-sub-editor] for the full syntax.
+
+<Warning>
+
+Azure pins the federated credential's `subject` field literally — changing
+the format means recreating every federated credential that references it.
+Create the new federated credential first, then change the **Subject Claim
+Format** on the token config.
+
+</Warning>
+
+## Step 3: Set the deployment identity
+
+Add two env vars to your deployment under **Settings → Environment
+variables**:
+
+```dotenv
+AZURE_TENANT_ID=00000000-0000-0000-0000-000000000000
+AZURE_CLIENT_ID=11111111-1111-1111-1111-111111111111
+```
+
+- **`AZURE_TENANT_ID`** — the Microsoft Entra ID (Azure AD) tenant where
+  your app registration lives.
+- **`AZURE_CLIENT_ID`** — the Application (client) ID of the app
+  registration.
+
+## Step 4: Assign roles on Azure resources
+
+Grant the app registration the standard Azure RBAC roles it needs, the
+same way you'd grant any service principal access to a resource. Examples
+follow per target.
+
+## Azure SQL
+
+
+
+See [Step 2](#step-2-add-a-federated-credential) — pin subject to
+`cube:deployment:<deployment-id>:component:cube_api`.
+
+
+Connect to your Azure SQL database as an Entra-authenticated admin and
+create a user backed by the app registration:
+
+```sql
+CREATE USER [Cube Cloud Deployment] FROM EXTERNAL PROVIDER;
+ALTER ROLE db_datareader ADD MEMBER [Cube Cloud Deployment];
+GRANT EXECUTE ON SCHEMA::dbo TO [Cube Cloud Deployment];
+```
+
+The user name must match the app registration's display name.
+
+
+Set the MSSQL driver and identity env vars on the deployment:
+
+```dotenv
+CUBEJS_DB_TYPE=mssql
+CUBEJS_DB_HOST=my-server.database.windows.net
+CUBEJS_DB_NAME=my-db
+CUBEJS_DB_PORT=1433
+AZURE_TENANT_ID=00000000-0000-0000-0000-000000000000
+AZURE_CLIENT_ID=11111111-1111-1111-1111-111111111111
+```
+
+The MSSQL driver uses the `azure-active-directory-access-token` auth
+type with the federated token Cube provides. No username, password, or
+client secret needed.
+
+
+
+## Azure Blob Storage export bucket
+
+If your data source uses an [export bucket][ref-export-bucket] for
+pre-aggregation unloads, grant the app registration **Storage Blob Data
+Contributor** on the storage account.
+
+
+
+Assign **Storage Blob Data Contributor** on the storage account scope:
+
+```bash
+az role assignment create \
+  --assignee <app-client-id> \
+  --role "Storage Blob Data Contributor" \
+  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
+```
+
+For tighter scoping, narrow to a specific container with
+`.../blobServices/default/containers/<container>` instead of the whole
+storage account.
+
+
+Point the export bucket env vars at your container:
+
+```dotenv
+CUBEJS_DB_EXPORT_BUCKET_TYPE=azure
+CUBEJS_DB_EXPORT_BUCKET=https://<storage-account>.blob.core.windows.net/<container>
+AZURE_TENANT_ID=00000000-0000-0000-0000-000000000000
+AZURE_CLIENT_ID=11111111-1111-1111-1111-111111111111
+```
+
+The Azure storage client picks up the same federated identity. See the
+[export bucket reference][ref-export-bucket] for the full set of
+variables.
+ + + +## Scaling past 20 federated credentials + +A single app registration accepts at most **20 federated credentials**. +Three patterns cover most growth scenarios: + +- **One app registration per deployment.** Each deployment gets its own + app + role assignments. Clean isolation, but you have to provision a new + app every time you add a deployment. +- **Multiple app registrations behind one set of role assignments.** Group + deployments by access pattern (e.g. read-only vs read-write); each group + gets its own app, and the role assignments target the same resources. +- **User-assigned managed identities.** Each managed identity has its own + 20-credential limit, and you can attach many of them to your tenant. + Useful when you want the resource permissions managed in Azure + alongside other infrastructure rather than as RBAC on app registrations. + +If you expect more than ~10 deployments in a single tenant, plan for one +of these patterns up front — splitting later is a recreation, not a +migration. + +## Verifying the setup + +The fastest way to confirm the federated credential is wired up correctly +is the **Test connection** button on the relevant settings page (data +source wizard, BYO LLM provider). Behind the scenes, Cube issues a real +OIDC token, exchanges it with Azure AD, and returns a precise error if +anything is misconfigured. + +If the test fails: + +| Symptom | Likely cause | +| ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `AADSTS70021: No matching federated identity record found` | The federated credential's `issuer`, `subject`, and `audiences` triple doesn't match the JWT exactly. Compare with the `iss` / `sub` / `aud` of a token from the **Test connection** error response. 
| +| `AADSTS700213: No matching federated identity record found ... subject` | The subject claim differs by even a single character. Azure doesn't support wildcards — you need one credential per exact subject. | +| `AADSTS50034: The user account does not exist` | You're using the wrong `AZURE_TENANT_ID`. Double-check the directory ID on the app registration's Overview blade. | +| `AADSTS500011: The resource principal ... was not found` | The role assignment hasn't been created on the target resource, or it's still propagating. Wait a minute and retry; if it persists, re-check the `--assignee` and `--scope` values. | + +Federation events show up in **Microsoft Entra ID → Sign-in logs** under +the **Service principal sign-ins** tab — filter by the app registration to +see which deployments are authenticating, when, and which resources they +hit. + +[ref-oidc-overview]: /admin/deployment/oidc +[ref-sub-editor]: /admin/deployment/oidc#subject-claim-format +[ref-cube-cloud-region]: /admin/deployment/infrastructure#what-is-a-cube-cloud-region +[ref-export-bucket]: /admin/connect-to-data#export-bucket diff --git a/docs-mintlify/admin/deployment/oidc/gcp.mdx b/docs-mintlify/admin/deployment/oidc/gcp.mdx new file mode 100644 index 0000000000000..788543b6509d0 --- /dev/null +++ b/docs-mintlify/admin/deployment/oidc/gcp.mdx @@ -0,0 +1,314 @@ +--- +title: GCP +sidebarTitle: GCP +description: Configure GCP Workload Identity Federation to trust Cube's OIDC issuer and use it for BigQuery and GCS export buckets. +noindex: true +--- + +This guide walks through configuring GCP to trust Cube's OIDC issuer using +Workload Identity Federation (WIF) and shows the setup for the most common +targets — BigQuery and a GCS export bucket. + +If you haven't enabled OIDC for your tenant yet, start with the +[OIDC overview][ref-oidc-overview]. + + + +Available on the [Enterprise plan](https://cube.dev/pricing). 
+ + + +## Prerequisites + +- The Cube tenant has OIDC enabled and a `GCP` token config exists under + **Admin → OIDC**. +- IAM access to your GCP project sufficient to create Workload Identity + Pools, providers, and service accounts. +- Your tenant slug — the leftmost label of your tenant's console URL. + Throughout this guide it's referenced as `` (and the full + issuer URL as `https://.cubecloud.dev`). Substitute your + actual slug everywhere it appears. + + + +Commands and config snippets in this guide use angle-bracket placeholders — +``, ``, ``, etc. **Replace each +placeholder with your real value** before running. GCP will accept these +strings literally and the federation call will fail with a confusing error. + + + +## How GCP federation works + +Cube doesn't talk to GCP STS directly. Instead, it writes a small +**credential configuration JSON** to disk that points the GCP SDK at the +OIDC token file. The Google client library handles the two-step exchange +internally: + +```mermaid +sequenceDiagram + participant Cube as Cube Deployment + participant STS as Google STS + participant IAM as IAM Credentials API + participant BQ as BigQuery / GCS + + Cube->>STS: Exchange OIDC JWT for federated access token
Note over Cube,STS: audience = WIF pool provider URI
+    STS->>Cube: Federated access token
+    Cube->>IAM: generateAccessToken on target service account
Note over Cube,IAM: impersonation
+    IAM->>Cube: Service account access token (1h)
+    Cube->>BQ: Authenticated request
+```
+
+You can also skip the impersonation step and grant permissions directly to
+the federated principal — see [Direct federation](#direct-federation) at the
+end of this page.
+
+## Step 1: Create a Workload Identity Pool and provider
+
+Run these once per GCP project. The pool is a container; the provider is
+what actually trusts your Cube issuer.
+
+```bash
+gcloud iam workload-identity-pools create cube-pool \
+  --location=global \
+  --display-name="Cube workload identity"
+
+gcloud iam workload-identity-pools providers create-oidc cube \
+  --location=global \
+  --workload-identity-pool=cube-pool \
+  --issuer-uri="https://<tenant-name>.cubecloud.dev" \
+  --attribute-mapping="google.subject=assertion.sub,attribute.issuer=assertion.iss" \
+  --attribute-condition="assertion.sub.startsWith('cube:deployment:')"
+```
+
+What each option does:
+
+| Option                  | Purpose |
+| ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
+| `--issuer-uri`          | Cube tenant URL. GCP fetches `${issuer-uri}/.well-known/openid-configuration` and JWKS from here.                                     |
+| `--attribute-mapping`   | Maps the JWT `sub` to GCP's `google.subject`. The mapped subject is what IAM bindings reference when granting impersonation rights.   |
+| `--attribute-condition` | An optional CEL expression that GCP evaluates on every token exchange. The condition above accepts only deployment-scoped tokens.     |
+
+Note the provider's full resource URI — you'll need it shortly:
+
+```
+//iam.googleapis.com/projects/PROJECT_NUMBER/locations/global/workloadIdentityPools/cube-pool/providers/cube
+```
+
+`PROJECT_NUMBER` is the numeric project number, not the project ID. You can
+fetch it with `gcloud projects describe PROJECT_ID --format='value(projectNumber)'`.
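+For reference, the credential configuration JSON that Cube writes for the
+GCP SDK is a standard `external_account` file, roughly shaped like the
+sketch below. This is illustrative only: the token file path and the
+impersonated service account are assumptions, and Cube manages the real
+file for you.
+
+```json
+{
+  "type": "external_account",
+  "audience": "//iam.googleapis.com/projects/<project-number>/locations/global/workloadIdentityPools/cube-pool/providers/cube",
+  "subject_token_type": "urn:ietf:params:oauth:token-type:jwt",
+  "token_url": "https://sts.googleapis.com/v1/token",
+  "credential_source": {
+    "file": "/path/to/cube_cloud_token_gcp"
+  },
+  "service_account_impersonation_url": "https://iamcredentials.googleapis.com/v1/projects/-/serviceAccounts/cube-deployment@my-project.iam.gserviceaccount.com:generateAccessToken"
+}
+```
+
+The `audience` field carries the provider URI from Step 1, and
+`credential_source.file` points at the OIDC token file that Cube refreshes
+before expiry.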
+ +## Step 2: Set the deployment identity + +Add two env vars to your deployment under **Settings → Environment variables**: + +```dotenv +GCP_POOL_AUDIENCE=//iam.googleapis.com/projects//locations/global/workloadIdentityPools/cube-pool/providers/cube +GCP_SERVICE_ACCOUNT_EMAIL=cube-deployment@my-project.iam.gserviceaccount.com +``` + +- **`GCP_POOL_AUDIENCE`** — the full resource URI of the WIF provider you + created in Step 1. This becomes the `aud` claim on the GCP token Cube + mints. +- **`GCP_SERVICE_ACCOUNT_EMAIL`** — the service account that Cube + impersonates after federation succeeds. Cube assumes this service account + by default for every GCP SDK call inside the deployment. + +If you want to skip impersonation entirely and have Cube call GCP services +as the federated principal directly, leave `GCP_SERVICE_ACCOUNT_EMAIL` unset. +See [Direct federation](#direct-federation) below. + +## Step 3: Build the IAM bindings + +There are two distinct IAM bindings to set: + +1. **Workload Identity User** on the impersonated service account — lets + the federated principal call `generateAccessToken` on it. +2. **Resource access** (BigQuery, GCS, etc.) on the impersonated service + account itself — what the SA is actually allowed to do once Cube is + running as it. + +The Workload Identity User binding's `--member` controls which Cube +deployments / components can impersonate the SA. 
Patterns mirror the AWS
+`sub` patterns:
+
+| Trust scope                                                | `--member` |
+| ---------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| One specific deployment, any component                     | `principal://iam.googleapis.com/projects/123/locations/global/workloadIdentityPools/cube-pool/subject/cube:deployment:<deployment-id>:component:cube_api`    |
+| Every component of every deployment in the tenant          | `principalSet://iam.googleapis.com/projects/123/locations/global/workloadIdentityPools/cube-pool/*`                                                          |
+| Only Cube Store, across every deployment                   | Use a CEL `--attribute-condition` on the provider that constrains `sub` to end in `:component:cube_store`, then bind the whole pool.                         |
+
+The `principal://` (singular) form pins to one exact subject; `principalSet://`
+matches a set.
+
+Cube's default `sub` claim is `cube:deployment:<deployment-id>`. To match
+the `:component:<component>` patterns in the table above (or to add
+`:region:<region>`), open your GCP token config in **Admin → OIDC** and
+paste one of these templates into the **Subject Claim Format** field:
+
+- `cube:deployment:{deployment_id}:component:{component}` — for the
+  `principal://...subject/cube:deployment:<deployment-id>:component:<component>`
+  patterns above.
+- `cube:deployment:{deployment_id}:component:{component}:region:{region}` —
+  to additionally pin a [Cube Cloud region][ref-cube-cloud-region].
+
+See [the subject editor section][ref-sub-editor] for the full syntax.
+
+
+
+Update the GCP IAM binding (or the WIF provider's CEL `--attribute-condition`)
+first, then change the **Subject Claim Format** on the token config —
+otherwise newly minted tokens won't match the binding and impersonation will
+fail.
+ + + +## BigQuery + + + + Provision a regular GCP service account that will hold the BigQuery + permissions Cube assumes: + + ```bash + gcloud iam service-accounts create cube-deployment \ + --display-name="Cube deployment" + ``` + + Note the email — `cube-deployment@PROJECT_ID.iam.gserviceaccount.com` — + this is what you'll set as `GCP_SERVICE_ACCOUNT_EMAIL`. + + + Bind the federated principal to the service account so the deployment's + OIDC token can impersonate it: + + ```bash + PROJECT_NUMBER="" + DEPLOYMENT_ID="" + + gcloud iam service-accounts add-iam-policy-binding \ + cube-deployment@my-project.iam.gserviceaccount.com \ + --role=roles/iam.workloadIdentityUser \ + --member="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/cube-pool/subject/cube:deployment:${DEPLOYMENT_ID}:component:cube_api" + ``` + + + Standard IAM, as if the service account were any normal workload: + + ```bash + gcloud projects add-iam-policy-binding my-project \ + --member="serviceAccount:cube-deployment@my-project.iam.gserviceaccount.com" \ + --role=roles/bigquery.dataViewer + + gcloud projects add-iam-policy-binding my-project \ + --member="serviceAccount:cube-deployment@my-project.iam.gserviceaccount.com" \ + --role=roles/bigquery.jobUser + ``` + + Tighten these to specific datasets / tables in production. The minimum + set of roles Cube needs to run BigQuery queries is + `roles/bigquery.dataViewer` plus `roles/bigquery.jobUser` on the project + the queries run in. 
+ + + Set the BigQuery driver and identity env vars on the deployment: + + ```dotenv + CUBEJS_DB_TYPE=bigquery + CUBEJS_DB_BQ_PROJECT_ID=my-project + GCP_POOL_AUDIENCE=//iam.googleapis.com/projects//locations/global/workloadIdentityPools/cube-pool/providers/cube + GCP_SERVICE_ACCOUNT_EMAIL=cube-deployment@my-project.iam.gserviceaccount.com + ``` + + The BigQuery driver follows the GCP default credential chain, picks up + the credential config Cube generates, and runs as + `cube-deployment@my-project.iam.gserviceaccount.com`. No service account + JSON key is ever used. + + + +## GCS export bucket + +If your data source uses an [export bucket][ref-export-bucket] for +pre-aggregation unloads (BigQuery, Snowflake on GCP, etc.), grant the +deployment's service account read / write access to the bucket. + + + + Add a bucket-scoped IAM binding for the deployment's service account: + + ```bash + gcloud storage buckets add-iam-policy-binding gs://my-export-bucket \ + --member="serviceAccount:cube-deployment@my-project.iam.gserviceaccount.com" \ + --role=roles/storage.objectAdmin + ``` + + `objectAdmin` covers reads, writes, and deletes within the bucket. If + you only need writes (e.g. you have a separate process cleaning up old + exports), `roles/storage.objectCreator` is enough. + + + Point the export bucket env vars at your bucket: + + ```dotenv + CUBEJS_DB_EXPORT_BUCKET_TYPE=gcs + CUBEJS_DB_EXPORT_BUCKET=my-export-bucket + ``` + + The GCS client inside Cube picks up the same default identity. See the + [export bucket reference][ref-export-bucket] for the full set of + variables. + + + +## Direct federation + +If you'd rather skip the service account impersonation hop, grant +permissions directly to the federated principal and leave +`GCP_SERVICE_ACCOUNT_EMAIL` unset on the deployment. Cube generates a +credential config that performs only the OIDC-to-federated-token exchange, +and the resulting token is what your code authenticates with. 
+ +```bash +PROJECT_NUMBER="" +DEPLOYMENT_ID="" + +gcloud projects add-iam-policy-binding my-project \ + --member="principal://iam.googleapis.com/projects/${PROJECT_NUMBER}/locations/global/workloadIdentityPools/cube-pool/subject/cube:deployment:${DEPLOYMENT_ID}:component:cube_api" \ + --role=roles/bigquery.dataViewer +``` + +Direct federation is simpler — fewer moving parts, and the principal +identity in audit logs is the Cube subject itself rather than an +intermediate service account. The trade-off is that some GCP services +(notably anything that requires `iam.serviceAccountTokenCreator`) only +accept service-account principals, so you may need the impersonation path +for those. + +## Verifying the setup + +The fastest way to confirm WIF is wired up correctly is the **Test +connection** button on the relevant settings page (data source wizard, +CSPS settings). Behind the scenes, Cube issues a real OIDC token, performs +the GCP STS exchange, optionally impersonates the service account, and +returns a precise error if anything is misconfigured. + +If the test fails: + +| Symptom | Likely cause | +| ---------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `Permission iam.serviceAccounts.getAccessToken denied` | The Workload Identity User binding on the service account is missing or its `--member` doesn't match the deployment's `sub`. Double-check the principal URI. | +| `INVALID_ARGUMENT: Invalid value for audience` | `GCP_POOL_AUDIENCE` doesn't match the WIF provider's URI. Re-run `gcloud iam workload-identity-pools providers describe` and copy the value verbatim. | +| `The token issuer ... does not match the configured issuer` | The provider was created with a different `--issuer-uri` than your tenant URL, or your tenant slug has changed. 
Re-create the provider with the correct URL. | +| `attribute condition ... evaluated to false` | The CEL `--attribute-condition` on the provider rejected the token. Inspect the `sub` Cube emits and adjust the condition. | + +Federation events show up in **Cloud Audit Logs** under +`sts.googleapis.com` (the token exchange) and `iamcredentials.googleapis.com` +(the SA impersonation). The Cube subject is included in both, so you can +trace which deployment authenticated against which service account at any +point in time. + +[ref-oidc-overview]: /admin/deployment/oidc +[ref-sub-editor]: /admin/deployment/oidc#subject-claim-format +[ref-cube-cloud-region]: /admin/deployment/infrastructure#what-is-a-cube-cloud-region +[ref-export-bucket]: /admin/connect-to-data#export-bucket diff --git a/docs-mintlify/admin/deployment/oidc/index.mdx b/docs-mintlify/admin/deployment/oidc/index.mdx new file mode 100644 index 0000000000000..effa1add7b694 --- /dev/null +++ b/docs-mintlify/admin/deployment/oidc/index.mdx @@ -0,0 +1,320 @@ +--- +title: OIDC workload identity +sidebarTitle: Overview +description: Cube's keyless authentication mechanism. Federate your cloud accounts and external services to a Cube-issued OIDC token instead of provisioning long-lived secrets. +noindex: true +--- + +OIDC workload identity is Cube's keyless authentication mechanism. Cube acts +as an OpenID Connect (OIDC) identity provider for every deployment in your +account: it mints short-lived, signed JWTs that you federate with the systems +your deployment talks to. The system on the other side — AWS, GCP, Azure, +Snowflake, your own Bedrock-hosted LLM — verifies the token directly against +Cube's published JWKS and grants access, eliminating the need for long-lived +access keys, manual rotation, or shared secrets passed through Cube. + + + +Available on the [Enterprise plan](https://cube.dev/pricing). 
+ + + +You can use OIDC workload identity to authenticate to: + +- **Data sources** — AWS Athena, Redshift, BigQuery, Snowflake, and any other + driver that supports federated credentials. +- **Export buckets** — S3 and GCS buckets used for `EXPORT_BUCKET` pre-aggregation + unloads. +- **Cube Store CSPS** — a per-deployment S3 / GCS bucket that holds your + Cube Store pre-aggregations (Customer-Supplied Pre-aggregation Storage). +- **Bring-your-own LLM providers** — AWS Bedrock, Google Vertex AI, and Azure + OpenAI behind your own cloud account. +- **Other cloud services** — any cloud service that supports OIDC federation + that you'd like to access from the dynamic part of your data model + (`cube.py` / `cube.js`, dynamic schema generators, `checkAuth`, custom + pre-aggregation strategies, etc.) — works with the same token. + +The same mechanism powers all of these — the only thing that changes between +integrations is the trust policy on the receiving side and which IAM role, +service account, or app registration Cube assumes. + +## How it works + +If you're new to OpenID Connect, the [OpenID Connect Core 1.0 +specification][link-oidc-spec] is the canonical reference; the +[OAuth.com OIDC primer][link-oidc-primer] is a friendlier read. Cube's +implementation is a standard OIDC provider — anything that works with +GitHub Actions OIDC, Vercel OIDC, or AWS IAM OIDC providers works the +same way here. + +Each Cube tenant hosts a per-tenant OIDC issuer at its own domain, e.g. +`https://.cubecloud.dev`. The issuer publishes the standard OIDC discovery +endpoints, and Cube signs every token with a KMS-backed RS256 key. + +```mermaid +sequenceDiagram + participant Cube as Cube Deployment + participant Issuer as Cube OIDC Issuer
%% Issuer: your tenant URL
+    participant STS as Cloud STS
%% STS: AWS / GCP / Azure
+    participant Resource as Customer Resource
(Athena / S3 / BigQuery / ...) + + Cube->>Issuer: Request signed JWT for audience X + Issuer->>Cube: JWT (sub, aud, iss, exp = +1h) + Cube->>STS: Exchange JWT for temporary credentials + STS->>Issuer: Fetch JWKS, verify signature, evaluate trust policy + STS->>Cube: Temporary credentials (1h) + Cube->>Resource: Access with temporary credentials +``` + +The deployment never sees an access key or service account JSON. The cloud +SDKs inside Cube — AWS SDK, GCP `google-auth-library`, Azure SDK — pick up +the token automatically from a path supplied by Cube and exchange it with +the cloud's own STS for short-lived credentials. + +## Endpoints served by Cube + +Each tenant's issuer URL is `https://.cubecloud.dev`. Cube serves +the standard OIDC endpoints from that domain: + +| Endpoint | Purpose | +| --------------------------------------- | ------------------------------------------------------------------------------------------------------ | +| `/.well-known/openid-configuration` | OIDC discovery document. Cloud providers fetch this to learn the issuer, supported algorithms, and JWKS location. | +| `/.well-known/jwks.json` | JSON Web Key Set. Contains the public keys cloud providers use to verify token signatures. | + +Both endpoints are unauthenticated, because they're consumed by AWS STS, +Google STS, Azure AD, and any other relying party that validates +Cube-issued tokens. + +## Token configs + +A **token config** is a tenant-level entry that tells Cube which audience to +mint tokens for and how to format the `sub` claim each token carries. Each +token config produces one audience-scoped JWT that is delivered to all +deployments in the tenant. + +| Field | What it controls | +| ------------------------- | ---------------------------------------------------------------------------------------------------------------------- | +| **Audience type** | One of `AWS`, `GCP`, `Azure`, or `Custom`. Well-known types pre-fill the audience value. 
| +| **Audience** | The JWT `aud` claim value. Pre-set for `AWS`, `GCP`, and `Azure`; user-supplied for `Custom`. | +| **Name** | A short, filesystem-safe slug that becomes part of the token file name (`cube_cloud_token_`). | +| **Subject Claim Format** | Template for the JWT `sub` claim. Defaults to `cube:deployment:{deployment_id}` if left empty. | + +You can have one config per well-known audience type (AWS, GCP, Azure) plus +any number of custom configs for tools like Snowflake or Databricks. A +deployment that integrates with several clouds simultaneously gets a separate +token file per audience — the Cube runtime picks the right one based on which +SDK is making the call. + +### Subject claim format + +The `sub` claim is what the receiving cloud provider's trust policy matches +on to decide which deployment or component is allowed to assume a role. +Cube's default template is `cube:deployment:{deployment_id}` — sufficient if +you only need to identify the deployment. + +To scope further — by component (Cube API vs. Cube Store vs. the AI Engineer) +or by Cube Cloud region — override the template with one of the **Common +templates** the dialog offers: + +- `cube:deployment:{deployment_id}` (default) +- `cube:deployment:{deployment_id}:component:{component}` +- `cube:deployment:{deployment_id}:component:{component}:region:{region}` + +Three placeholders are supported anywhere in the template: + +| Placeholder | What it resolves to | +| ----------------- | ------------------------------------------------------------------- | +| `{deployment_id}` | The deployment's numeric ID. Substitutes to the literal `global` for tenant-scoped tokens (e.g., AI Engineer running in the control plane). | +| `{component}` | The component requesting the token: `cube_api`, `cube_store`, or `ai_engineer`. Substitutes to `global` when no component context is set. | +| `{region}` | The deployment's [Cube Cloud region][ref-cube-cloud-region]. 
For tenant-scoped tokens minted in the control plane, substitutes to the literal `control-plane`. | + +The component values Cube emits are `cube_api` (the Cube API, refresh +worker, dev-mode pods, Cold Start Workers, etc.), `cube_store` (Cube Store), +and `ai_engineer` (the AI Engineer service, used for BYO LLM auth). + +Outside of placeholders, the template may contain only the characters +`[A-Za-z0-9:_-]`. Anything else is rejected at save time. + + + Add OIDC Token Config dialog in Cube Cloud, showing the Audience Type, Name, Subject Claim Format field with a Common templates picker, and a live preview of the rendered sub claim. + + + + +Changing the subject claim format on a token config that's already in use +**will break customer trust policies that pin the old `sub` shape**. Update +the trust policy in your cloud account first, then change the format here. + + + + + +Subject claim format changes can take **up to one hour to fully propagate**. +Already-minted tokens are cached at several layers — the credentials broker +sidecar and the cloud SDKs' own STS-credential caches — and remain valid +until their TTL expires (up to 1h). Plan trust-policy rollouts so the new +`sub` shape is accepted before the old one stops being honored, and expect +a window where both formats are in flight. + + + +## Trust mechanism + +The receiving system trusts Cube tokens by referencing two things: + +1. **The issuer URL** — `https://.cubecloud.dev`. The cloud + provider fetches your tenant's JWKS from this URL and uses it to validate + every token signature. As long as Cube holds the matching private key, + tokens issued for your tenant verify; tokens issued by any other Cube + tenant do not. +2. **The subject claim** — `sub`. The trust policy on the receiving side + matches the `sub` exactly (or with wildcards) against the format you + configured above to decide which deployments / components are allowed + in. 
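+On AWS, for example, both checks land in the IAM role's trust policy. The
+sketch below is illustrative (the account ID, tenant slug, deployment ID,
+and the `sts.amazonaws.com` audience are placeholder assumptions); see the
+[AWS guide][ref-oidc-aws] for the exact shape:
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "Federated": "arn:aws:iam::123456789012:oidc-provider/<tenant-name>.cubecloud.dev"
+      },
+      "Action": "sts:AssumeRoleWithWebIdentity",
+      "Condition": {
+        "StringEquals": {
+          "<tenant-name>.cubecloud.dev:aud": "sts.amazonaws.com",
+          "<tenant-name>.cubecloud.dev:sub": "cube:deployment:123"
+        }
+      }
+    }
+  ]
+}
+```
+
+Swapping the `StringEquals` on `sub` for `StringLike` with a wildcard is how
+the "or with wildcards" matching described above is expressed on AWS.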
+ +## Token lifecycle + +| Property | Value | +| --------------------- | ---------------------------------------------------------------------------------------------- | +| **Algorithm** | RS256 | +| **TTL** | 1 hour | +| **Refresh** | Automatic — Cube re-mints each token before it expires and atomically replaces the token file. The cloud SDK on the other side picks up the new token without reconnecting. | +| **Signing keys** | Per-tenant, KMS-backed. Rotated regularly; the JWKS endpoint always serves the current and previous public keys so in-flight tokens remain valid across rotations. | +| **Revocation** | TTL-based. Emergency revocation is performed by rotating the signing key, which invalidates all tokens within the next TTL window. | + +## Enabling OIDC for your tenant + +Workload identity is configured at the **tenant** level under +**Admin → OIDC**. From there, you turn the issuer on and create one token +config per integration target. + + + + In **Admin → OIDC**, flip the **OIDC token issuance is active for all + deployments** switch to on. This activates the discovery and JWKS + endpoints on your tenant URL and unlocks the deployment-level OIDC + settings everywhere else in the UI. Click **OpenID Configuration** or + **JWKS Endpoint** to grab the URLs you'll hand to your cloud provider + when configuring federation. + + + Admin → OIDC page in Cube Cloud showing the enable toggle, issuer URL, discovery and JWKS endpoint links, and the configured token configs table. + + + + Click **Add Config** and pick the audience type. For `AWS`, `GCP`, and + `Azure`, the audience auto-fills with the standard value; for `Custom`, + enter the audience your target system expects in the **Custom Audience** + field. Optionally set the **Subject Claim Format** — see + [Token configs](#token-configs) for the available templates. 
+ + + Follow the cloud-specific guides below to register Cube as an OIDC + provider on the system you want Cube to access, create a role / service + account / app registration, and write a trust policy that pins the + issuer URL and the subject pattern. + + + + Athena, S3 export buckets, Bedrock, and any other AWS-IAM-protected service. + + + BigQuery, GCS export buckets, and any other GCP service via Workload Identity Federation. + + + Azure SQL, Blob Storage export buckets, and any other Azure service via federated credentials. + + + + + For each deployment, set the cloud-specific "who to become" env vars + under **Settings → Environment variables**. These point Cube at the + role or service account it should assume on your side: + + | Cloud | Env vars | + | ----- | ----------------------------------------------------------------- | + | AWS | `AWS_ROLE_ARN` | + | GCP | `GCP_POOL_AUDIENCE`, `GCP_SERVICE_ACCOUNT_EMAIL` | + | Azure | `AZURE_TENANT_ID`, `AZURE_CLIENT_ID` | + + These are the **default identity** for the deployment — see below. + + + +## Default identity for a deployment + +When you set `AWS_ROLE_ARN` (or the GCP / Azure equivalents) on a deployment, +that role becomes the default cloud identity for everything that runs inside +that deployment: + +- **Drivers** that follow the AWS / GCP / Azure default credential chain + (Athena, Redshift, BigQuery, RDS Postgres / MySQL with IAM auth, Azure SQL, + and more) automatically authenticate as this role / service account. +- **Pre-aggregation export bucket I/O** uses the same identity to write to + S3 / GCS / Blob Storage. +- **Code in your data model** — `dataSource` factories, `driverFactory`, + `extendContext`, `checkAuth`, dynamic schema generators — runs with the same + default identity. If you instantiate an AWS / GCP / Azure SDK client with no + explicit credentials, it picks up the deployment's OIDC token automatically. 
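+As an illustration, here is a hypothetical `cube.js` snippet that reads a
+config object from S3 inside `extendContext` (the bucket, key, and region
+are made up). The important part is that the `S3Client` is constructed with
+no credentials at all:
+
+```javascript
+// Illustrative sketch: bucket, key, and region are placeholders.
+const { S3Client, GetObjectCommand } = require("@aws-sdk/client-s3");
+
+// No keys passed: the SDK's default credential chain picks up the OIDC
+// token file Cube provides and exchanges it for temporary credentials.
+const s3 = new S3Client({ region: "us-east-1" });
+
+module.exports = {
+  extendContext: async (req) => {
+    const obj = await s3.send(
+      new GetObjectCommand({ Bucket: "my-config-bucket", Key: "tenants.json" })
+    );
+    return { tenants: JSON.parse(await obj.Body.transformToString()) };
+  },
+};
+```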
+ +If you need a driver to talk to a different role or service account than the +deployment default — for example, the deployment's default role can read S3 +but a specific Athena workgroup lives in a different account — set the +**driver-specific assume-role env var** in addition to the default. The +driver then uses the default identity to perform a second +`AssumeRole` / impersonation hop and connect as the downstream principal. + +| Driver | Driver-specific assume-role env var | +| ---------------------- | -------------------------------------------------------------- | +| AWS Athena | `CUBEJS_AWS_ATHENA_ASSUME_ROLE_ARN` | +| AWS Redshift | `CUBEJS_DB_AWS_ROLE_ARN` | +| Other AWS drivers | Driver-specific — see the driver page in [Connect to data](/admin/connect-to-data/data-sources). | + +The chain is always: + +``` +Cube OIDC token → default deployment identity → (optional) driver-specific assumed role +``` + +Most setups don't need the second hop — granting permissions directly to the +deployment's default role is simpler and easier to reason about. + +## Cube Store CSPS + +Cube Store can store its pre-aggregations in a **per-deployment** S3 or GCS +bucket that you own — Customer-Supplied Pre-aggregation Storage (CSPS). When +you enable CSPS for a deployment, Cube Store authenticates to your bucket +with the same OIDC machinery as the rest of the deployment. The trust policy +on your IAM role pins the `sub` claim to `cube:deployment::component:cube_store`, +giving Cube Store its own identity that's distinct from the Cube API. + +CSPS is configured under **Settings → Pre-Aggregation Storage** on the +deployment. See +[AWS](/admin/deployment/oidc/aws#cube-store-csps-bucket) and +[GCP](/admin/deployment/oidc/gcp) for the trust policy shape on each cloud. + +## Audit trail + +Every token Cube issues is recorded in the tenant audit log with the +deployment ID, component, audience, and TTL. 
On the receiving side, the +cloud's own audit trail shows the federation event: + +- **AWS CloudTrail** — `AssumeRoleWithWebIdentity` events with the deployment + subject. +- **GCP Cloud Audit Logs** — Workload Identity Federation token-exchange and + service-account-impersonation events. +- **Azure Activity Log** — Federated-credential sign-in events on the app + registration. + +Together these give you an end-to-end record of which deployment accessed +which resource and when, without ever issuing a long-lived credential. + +[ref-oidc-aws]: /admin/deployment/oidc/aws +[ref-oidc-gcp]: /admin/deployment/oidc/gcp +[ref-oidc-azure]: /admin/deployment/oidc/azure +[ref-env-vars]: /admin/deployment#environment-variables +[ref-data-sources]: /admin/connect-to-data/data-sources +[ref-sub-editor]: /admin/deployment/oidc#subject-claim-format +[ref-cube-cloud-region]: /admin/deployment/infrastructure#what-is-a-cube-cloud-region +[link-oidc-spec]: https://openid.net/specs/openid-connect-core-1_0.html +[link-oidc-primer]: https://www.oauth.com/oauth2-servers/openid-connect/ diff --git a/docs/content/product/apis-integrations/core-data-apis/rest-api/reference.mdx b/docs/content/product/apis-integrations/core-data-apis/rest-api/reference.mdx index 872ad2012bc8f..561ce41c09324 100644 --- a/docs/content/product/apis-integrations/core-data-apis/rest-api/reference.mdx +++ b/docs/content/product/apis-integrations/core-data-apis/rest-api/reference.mdx @@ -318,6 +318,11 @@ It provides additional information such as data lineage. Get meta-information for cubes and views defined in the data model. Information about cubes and views with `public: false` will not be returned. +| Parameter | Description | Required | +| --- | --- | --- | +| `onlyViews` | When set to `true`, the response will include only views (entries with `type: "view"`) and any associated `viewGroups`. Useful for clients that consume views exclusively to avoid loading duplicated cube metadata. Defaults to `false`. 
| ❌ No | +| `extended` | If present, returns extended schema information including SQL definitions for cubes and views. | ❌ No | + Response - `cubes` - Array of cubes and views @@ -342,6 +347,15 @@ curl \ http://localhost:4000/cubejs-api/v1/meta ``` +Example request to load only views: + +```bash +curl \ + -H "Authorization: TOKEN" \ + -G \ + http://localhost:4000/cubejs-api/v1/meta?onlyViews=true +``` + Example response: ```json diff --git a/packages/cubejs-api-gateway/openspec.yml b/packages/cubejs-api-gateway/openspec.yml index b7623ccfd81f7..0a5c71ddddc04 100644 --- a/packages/cubejs-api-gateway/openspec.yml +++ b/packages/cubejs-api-gateway/openspec.yml @@ -11,10 +11,17 @@ paths: # parameters: # - in: query # name: extended - # required: true + # required: false # schema: # type: boolean - # description: You will receive extended response if this parameter is true + # description: You will receive extended response if this parameter is present (presence-only, value is ignored) + # - in: query + # name: onlyViews + # required: false + # schema: + # type: boolean + # default: false + # description: When set to "true", only views (entries with type="view") and their viewGroups are returned, excluding cubes description: "" operationId: "metaV1" responses: diff --git a/packages/cubejs-api-gateway/src/gateway.ts b/packages/cubejs-api-gateway/src/gateway.ts index 84d017cb4956b..08c462c00d178 100644 --- a/packages/cubejs-api-gateway/src/gateway.ts +++ b/packages/cubejs-api-gateway/src/gateway.ts @@ -129,6 +129,40 @@ function systemAsyncHandler(handler: (req: Request & { context: ExtendedRequestC }; } +function rowsToColumnar(rawData: any): { members: string[]; columns: any[][] } { + let rows: any[]; + + if (Array.isArray(rawData)) { + rows = rawData; + } else if (rawData) { + rows = Array.from(rawData as Iterable); + } else { + rows = []; + } + + const rowCount = rows.length; + if (rowCount === 0) { + return { members: [], columns: [] }; + } + + const members = 
Object.keys(rows[0]); + const memberCount = members.length; + const columns: any[][] = new Array(memberCount); + + for (let j = 0; j < memberCount; j++) { + const member = members[j]; + const col = new Array(rowCount); + + for (let i = 0; i < rowCount; i++) { + col[i] = rows[i][member]; + } + + columns[j] = col; + } + + return { members, columns }; +} + // Prepared CheckAuthFn, default or from config: always async type PreparedCheckAuthFn = (ctx: any, authorization?: string) => Promise<{ securityContext: any; @@ -435,15 +469,18 @@ class ApiGateway { `${this.basePath}/v1/meta`, userMiddlewares, userAsyncHandler(async (req, res) => { + const onlyViews = req.query.onlyViews === 'true'; if ('extended' in req.query) { await this.metaExtended({ context: req.context, res: this.resToResultFn(res), + onlyViews, }); } else { await this.meta({ context: req.context, res: this.resToResultFn(res), + onlyViews, }); } }) @@ -636,11 +673,12 @@ class ApiGateway { })).filter(cube => cube.config.measures?.length || cube.config.dimensions?.length || cube.config.segments?.length); } - public async meta({ context, res, includeCompilerId, onlyCompilerId }: { + public async meta({ context, res, includeCompilerId, onlyCompilerId, onlyViews }: { context: RequestContext, res: MetaResponseResultFn, includeCompilerId?: boolean, - onlyCompilerId?: boolean + onlyCompilerId?: boolean, + onlyViews?: boolean, }) { const requestStarted = new Date(); @@ -660,7 +698,9 @@ class ApiGateway { res(response); return; } - const cubesConfig = metaConfig.cubes; + const cubesConfig = onlyViews + ? 
metaConfig.cubes.filter((c: any) => c.config?.type === 'view') + : metaConfig.cubes; const cubes = this.filterVisibleItemsInMeta(context, cubesConfig).map(cube => cube.config); const visibleCubeNames = new Set(cubes.map(c => c.name)); const viewGroups = (metaConfig.viewGroups || []) @@ -688,7 +728,11 @@ class ApiGateway { } } - public async metaExtended({ context, res }: { context: ExtendedRequestContext, res: ResponseResultFn }) { + public async metaExtended({ context, res, onlyViews }: { + context: ExtendedRequestContext, + res: ResponseResultFn, + onlyViews?: boolean, + }) { const requestStarted = new Date(); try { @@ -699,7 +743,10 @@ class ApiGateway { }); const { metaConfig, cubeDefinitions } = metaConfigExtended; - const cubes = this.filterVisibleItemsInMeta(context, metaConfig) + const filteredMetaConfig = onlyViews + ? metaConfig.filter((c: any) => c.config?.type === 'view') + : metaConfig; + const cubes = this.filterVisibleItemsInMeta(context, filteredMetaConfig) .map((meta) => meta.config) .map((cube) => ({ ...transformCube(cube, cubeDefinitions), @@ -2061,11 +2108,6 @@ class ApiGateway { await this.assertApiScope('data', context.securityContext); query = this.parseQueryParam(request.query); - let resType: ResultType = ResultType.DEFAULT; - - if (!Array.isArray(query) && query.responseFormat) { - resType = query.responseFormat; - } const [queryType, normalizedQueries] = await this.getNormalizedQueries(query, context, request.streaming, request.memberExpressions, cacheMode); @@ -2130,7 +2172,16 @@ class ApiGateway { // TODO Can we just pass through data? Ensure hidden members can't be queried results = [{ - data: response.data, + /** + * I am still working on improving cpu/memory performance of moving results from ResultWrapper + * into RecordBatch on the Rust side. Right now, it's slower than doing transpose and JSON serialize on JS side + + * deserializing on Rust. + * + * For now, it's a temporary solution, which means it will probably outlive us all.
+ * + * TODO(ovr): You must finish it, move to ResultWrapper after optimizing it. + */ + data: rowsToColumnar(response.data), annotation }]; } @@ -2163,7 +2214,7 @@ class ApiGateway { sqlQueries[index], annotation, response, - resType, + ResultType.COLUMNAR, ); }) ); diff --git a/packages/cubejs-api-gateway/test/index.test.ts b/packages/cubejs-api-gateway/test/index.test.ts index 1f150c2621b45..43e6efed80a7c 100644 --- a/packages/cubejs-api-gateway/test/index.test.ts +++ b/packages/cubejs-api-gateway/test/index.test.ts @@ -732,6 +732,52 @@ describe('API Gateway', () => { expect(res.body.cubes[0]?.segments.find(segment => segment.name === 'Foo.quux').description).toBe('segment from compilerApi mock'); }); + test('meta endpoint with onlyViews=true returns only views and their groups', async () => { + const { app } = await createApiGateway(); + + const res = await request(app) + .get('/cubejs-api/v1/meta?onlyViews=true') + .set('Authorization', 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.t-IDcSemACt8x4iTMCda8Yhe3iZaWbvV5XKSTbuAn0M') + .expect(200); + + expect(res.body).toHaveProperty('cubes'); + expect(res.body.cubes).toHaveLength(1); + expect(res.body.cubes[0].name).toBe('FooView'); + expect(res.body.cubes[0].type).toBe('view'); + expect(res.body.viewGroups).toEqual([ + { + name: 'analytics', + title: 'Analytics', + description: 'Analytics related views', + views: ['FooView'], + }, + ]); + }); + + test('meta endpoint with onlyViews=false returns all cubes and views', async () => { + const { app } = await createApiGateway(); + + const res = await request(app) + .get('/cubejs-api/v1/meta?onlyViews=false') + .set('Authorization', 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.t-IDcSemACt8x4iTMCda8Yhe3iZaWbvV5XKSTbuAn0M') + .expect(200); + + expect(res.body.cubes.map(c => c.name).sort()).toEqual(['Foo', 'FooView']); + }); + + test('meta endpoint extended with onlyViews=true returns only views', async () => { + const { app } = await createApiGateway(); + + const res = await 
request(app) + .get('/cubejs-api/v1/meta?extended&onlyViews=true') + .set('Authorization', 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.e30.t-IDcSemACt8x4iTMCda8Yhe3iZaWbvV5XKSTbuAn0M') + .expect(200); + + expect(res.body).toHaveProperty('cubes'); + expect(res.body.cubes).toHaveLength(1); + expect(res.body.cubes[0].name).toBe('FooView'); + }); + describe('multi query support', () => { const searchParams = new URLSearchParams({ query: JSON.stringify({ diff --git a/packages/cubejs-api-gateway/test/mocks.ts b/packages/cubejs-api-gateway/test/mocks.ts index 907d399ab9a4a..65ff08a0c9609 100644 --- a/packages/cubejs-api-gateway/test/mocks.ts +++ b/packages/cubejs-api-gateway/test/mocks.ts @@ -204,6 +204,26 @@ export const compilerApi = jest.fn().mockImplementation(async () => ({ ], }, }, + { + config: { + name: 'FooView', + type: 'view', + description: 'view from compilerApi mock', + measures: [ + { + name: 'FooView.bar', + isVisible: true, + }, + ], + dimensions: [ + { + name: 'FooView.id', + isVisible: true, + }, + ], + segments: [], + }, + }, ]; const cubeDefinitions = { @@ -211,7 +231,11 @@ export const compilerApi = jest.fn().mockImplementation(async () => ({ sql: () => 'SELECT * FROM Foo', measures: {}, dimension: {}, - } + }, + FooView: { + measures: {}, + dimensions: {}, + }, }; return { diff --git a/packages/cubejs-backend-native/src/orchestrator.rs b/packages/cubejs-backend-native/src/orchestrator.rs index baa52f19ba73f..96905c387f7b8 100644 --- a/packages/cubejs-backend-native/src/orchestrator.rs +++ b/packages/cubejs-backend-native/src/orchestrator.rs @@ -6,7 +6,7 @@ use cubeorchestrator::query_result_transform::{ TransformedData, }; use cubeorchestrator::transport::{JsRawData, TransformDataRequest}; -use cubesql::compile::engine::df::scan::{FieldValue, ValueObject}; +use cubesql::compile::engine::df::scan::{ColumnarValueObject, FieldValue, ValueObject}; use cubesql::CubeError; use neon::context::{Context, FunctionContext, ModuleContext}; use neon::handle::Handle; 
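As a side note on the gateway change above: the `rowsToColumnar` transpose added in `gateway.ts` is easy to reason about in isolation. Below is a minimal TypeScript sketch of the same row-to-columnar transpose (illustrative only, not the shipped code; the member names are invented):

```typescript
// Sketch of the rowsToColumnar transpose: rows of { member: value } objects
// become parallel per-member columns. Member order comes from the first row,
// as in the gateway implementation.
type Row = Record<string, unknown>;

function rowsToColumnar(rows: Row[]): { members: string[]; columns: unknown[][] } {
  if (rows.length === 0) {
    return { members: [], columns: [] };
  }
  const members = Object.keys(rows[0]);
  // One column per member, each holding that member's value from every row.
  const columns = members.map((member) => rows.map((row) => row[member]));
  return { members, columns };
}

const { members, columns } = rowsToColumnar([
  { 'orders.status': 'shipped', 'orders.count': 10 },
  { 'orders.status': 'returned', 'orders.count': 2 },
]);
console.log(members); // ['orders.status', 'orders.count']
console.log(columns); // [['shipped', 'returned'], [10, 2]]
```

The payoff is on the Rust side: a columnar payload lets each Arrow builder consume one contiguous column instead of probing every row object per field.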
@@ -141,6 +141,18 @@ impl ResultWrapper { } } +fn db_primitive_to_field_value(value: &DBResponsePrimitive) -> FieldValue<'_> { + match value { + DBResponsePrimitive::String(s) => FieldValue::String(Cow::Borrowed(s)), + DBResponsePrimitive::Number(n) => FieldValue::Number(*n), + DBResponsePrimitive::Boolean(b) => FieldValue::Bool(*b), + DBResponsePrimitive::Uncommon(v) => FieldValue::String(Cow::Owned( + serde_json::to_string(&v).unwrap_or_else(|_| v.to_string()), + )), + DBResponsePrimitive::Null => FieldValue::Null, + } +} + impl ValueObject for ResultWrapper { fn len(&mut self) -> Result<usize, CubeError> { if self.transformed_data.is_none() { @@ -179,20 +191,16 @@ impl ValueObject for ResultWrapper { }; let Some(member_index) = members.iter().position(|m| m == field_name) else { - return Err(CubeError::user(format!( - "Field name '{}' not found in members", - field_name - ))); + // Missing field → NULL, matching `Vanilla` semantics below. + return Ok(FieldValue::Null); }; row.get(member_index).unwrap_or(&DBResponsePrimitive::Null) } TransformedData::Columnar { members, columns } => { let Some(member_index) = members.iter().position(|m| m == field_name) else { - return Err(CubeError::user(format!( - "Field name '{}' not found in members", - field_name - ))); + // Missing field → NULL, matching `Vanilla` semantics below.
+ return Ok(FieldValue::Null); }; let Some(column) = columns.get(member_index) else { @@ -223,15 +231,58 @@ impl ValueObject for ResultWrapper { } }; - Ok(match value { - DBResponsePrimitive::String(s) => FieldValue::String(Cow::Borrowed(s)), - DBResponsePrimitive::Number(n) => FieldValue::Number(*n), - DBResponsePrimitive::Boolean(b) => FieldValue::Bool(*b), - DBResponsePrimitive::Uncommon(v) => FieldValue::String(Cow::Owned( - serde_json::to_string(&v).unwrap_or_else(|_| v.to_string()), - )), - DBResponsePrimitive::Null => FieldValue::Null, - }) + Ok(db_primitive_to_field_value(value)) + } +} + +impl ColumnarValueObject for ResultWrapper { + fn len(&mut self) -> Result<usize, CubeError> { + if self.transformed_data.is_none() { + self.transform_result()?; + } + + let TransformedData::Columnar { columns, .. } = self.transformed_data.as_ref().unwrap() + else { + return Err(CubeError::internal( + "ColumnarValueObject is only supported for columnar TransformedData".to_string(), + )); + }; + + Ok(columns.first().map(|c| c.len()).unwrap_or(0)) + } + + fn column<'a>( + &'a mut self, + field_name: &str, + ) -> Result<Box<dyn Iterator<Item = Result<FieldValue<'a>, CubeError>> + 'a>, CubeError> { + if self.transformed_data.is_none() { + self.transform_result()?; + } + + let TransformedData::Columnar { members, columns } = + self.transformed_data.as_ref().unwrap() + else { + return Err(CubeError::internal( + "ColumnarValueObject is only supported for columnar TransformedData".to_string(), + )); + }; + + let Some(member_index) = members.iter().position(|m| m == field_name) else { + // Missing field → column of NULLs. See JsonColumnarValueObject::column.
+ let len = columns.first().map(|c| c.len()).unwrap_or(0); + return Ok(Box::new((0..len).map(|_| Ok(FieldValue::Null)))); + }; + + let Some(column) = columns.get(member_index) else { + return Err(CubeError::user(format!( + "Unexpected response from Cube, missing column for '{}'", + field_name + ))); + }; + + Ok(Box::new( + column.iter().map(|v| Ok(db_primitive_to_field_value(v))), + )) } } diff --git a/packages/cubejs-backend-native/src/transport.rs b/packages/cubejs-backend-native/src/transport.rs index 718068ca53bfc..275c08542c0b1 100644 --- a/packages/cubejs-backend-native/src/transport.rs +++ b/packages/cubejs-backend-native/src/transport.rs @@ -16,11 +16,12 @@ use async_trait::async_trait; use cubeorchestrator::query_result_transform::RequestResultData; use cubesql::compile::arrow::datatypes::Schema; use cubesql::compile::engine::df::scan::{ - convert_transport_response, transform_response, CacheMode, MemberField, RecordBatch, SchemaRef, + convert_transport_response_columnar, transform_columnar_response, CacheMode, MemberField, + RecordBatch, SchemaRef, }; use cubesql::compile::engine::df::wrapper::SqlQuery; use cubesql::transport::{ - SpanId, SqlGenerator, SqlResponse, TransportLoadRequestQuery, TransportLoadResponse, + SpanId, SqlGenerator, SqlResponse, TransportLoadRequestQuery, TransportLoadResponseColumnar, TransportMetaResponse, }; use cubesql::{ @@ -469,6 +470,11 @@ impl TransportService for NodeBridgeTransport { } } + #[cfg(debug_assertions)] + trace!("[transport] Request <- {:?}", result); + #[cfg(not(debug_assertions))] + trace!("[transport] Request <- "); + match result? 
{ ValueFromJs::String(result) => { let response: serde_json::Value = match serde_json::from_str(&result) { @@ -476,11 +482,6 @@ impl TransportService for NodeBridgeTransport { Err(err) => return Err(CubeError::internal(err.to_string())), }; - #[cfg(debug_assertions)] - trace!("[transport] Request <- {:?}", response); - #[cfg(not(debug_assertions))] - trace!("[transport] Request <- "); - if let Some(error_value) = response.get("error") { match error_value { serde_json::Value::String(error) => { @@ -517,15 +518,20 @@ impl TransportService for NodeBridgeTransport { } }; - let response = match serde_json::from_value::<TransportLoadResponse>(response) { - Ok(v) => v, - Err(err) => { - return Err(CubeError::user(err.to_string())); - } - }; - - break convert_transport_response(response, schema.clone(), member_fields) - .map_err(|err| CubeError::user(err.to_string())); + let response = + match serde_json::from_value::<TransportLoadResponseColumnar>(response) { + Ok(v) => v, + Err(err) => { + return Err(CubeError::user(err.to_string())); + } + }; + + break convert_transport_response_columnar( + response, + schema.clone(), + member_fields, + ) + .map_err(|err| CubeError::user(err.to_string())); } ValueFromJs::ResultWrapper(result_wrappers) => { break result_wrappers @@ -544,7 +550,11 @@ impl TransportService for NodeBridgeTransport { schema.clone() }; - transform_response(&mut wrapper, updated_schema, &member_fields) + transform_columnar_response( + &mut wrapper, + updated_schema, + &member_fields, + ) }) .collect::<Result<Vec<_>, _>>(); } diff --git a/rust/cubesql/cubesql/Cargo.toml b/rust/cubesql/cubesql/Cargo.toml index 6b01e5fad3c0f..a1557fe6cfb44 100644 --- a/rust/cubesql/cubesql/Cargo.toml +++ b/rust/cubesql/cubesql/Cargo.toml @@ -87,6 +87,10 @@ harness = false name = "large_model" harness = false +[[bench]] +name = "transform_response" +harness = false + # Code in cubesql workspace is not ready for full-blown clippy # So we disable some rules to enable per-rule latch in CI, not for a whole clippy run # Feel free to remove any rule
from here and fix all warnings with it diff --git a/rust/cubesql/cubesql/benches/transform_response.rs b/rust/cubesql/cubesql/benches/transform_response.rs new file mode 100644 index 0000000000000..a03e1a01e7311 --- /dev/null +++ b/rust/cubesql/cubesql/benches/transform_response.rs @@ -0,0 +1,268 @@ +use std::sync::Arc; + +use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput}; +use cubesql::compile::engine::df::scan::{ + convert_transport_response, convert_transport_response_columnar, DataType, MemberField, Schema, + SchemaRef, +}; +use cubesql::transport::{TransportLoadResponse, TransportLoadResponseColumnar}; +use datafusion::arrow::datatypes::{Field, TimeUnit}; +use serde_json::json; + +const ROWS: &[usize] = &[1_000, 5_000, 10_000, 50_000, 100_000]; +const COLS: &[usize] = &[8, 16, 32, 64]; +const TIME_DIMS: &[usize] = &[0, 1, 2]; + +#[derive(Clone, Copy)] +enum ColKind { + TimeDim, + Float, + Int, + Str, +} + +fn col_kinds(cols: usize, time_dims: usize) -> Vec<ColKind> { + let mut out = Vec::with_capacity(cols); + for i in 0..cols { + if i < time_dims { + out.push(ColKind::TimeDim); + } else { + // Rotate through Float/Int/Str so the row mix is realistic and + // independent of `cols` modulo 3. + out.push(match (i - time_dims) % 3 { + 0 => ColKind::Float, + 1 => ColKind::Int, + _ => ColKind::Str, + }); + } + } + out +} + +fn member_name(idx: usize, kind: ColKind) -> String { + match kind { + ColKind::TimeDim => format!("Cube.t{}", idx), + ColKind::Float => format!("Cube.f{}", idx), + ColKind::Int => format!("Cube.i{}", idx), + ColKind::Str => format!("Cube.s{}", idx), + } +} + +/// Field name as it appears in the Cube response (and so in `MemberField::field_name`). +/// Matches `MemberField::time_dimension` which appends `.<granularity>` to the member.
+fn field_name(idx: usize, kind: ColKind) -> String { + let base = member_name(idx, kind); + match kind { + ColKind::TimeDim => format!("{}.day", base), + _ => base, + } +} + +fn build_schema(kinds: &[ColKind]) -> SchemaRef { + let fields = kinds + .iter() + .enumerate() + .map(|(i, k)| { + let name = field_name(i, *k); + let dt = match k { + ColKind::TimeDim => DataType::Timestamp(TimeUnit::Nanosecond, None), + ColKind::Float => DataType::Float64, + ColKind::Int => DataType::Int64, + ColKind::Str => DataType::Utf8, + }; + Field::new(&name, dt, true) + }) + .collect(); + Arc::new(Schema::new(fields)) +} + +fn build_member_fields(kinds: &[ColKind]) -> Vec<MemberField> { + kinds + .iter() + .enumerate() + .map(|(i, k)| match k { + ColKind::TimeDim => MemberField::time_dimension(member_name(i, *k), "day".to_string()), + _ => MemberField::regular(member_name(i, *k)), + }) + .collect() +} + +fn cell_value(row: usize, col: usize, kind: ColKind) -> serde_json::Value { + match kind { + ColKind::TimeDim => { + // Spread rows over a 12 x 28 grid of always-valid dates. + let n = row % (12 * 28); + let month = (n / 28) + 1; + let day = (n % 28) + 1; + + // ISO-8601 with no offset, parseable by `parse_date_str`.
json!(format!("2024-{:02}-{:02}T00:00:00.000", month, day)) + } + ColKind::Float => json!(row as f64 + (col as f64) / 100.0), + ColKind::Int => json!(row.wrapping_mul(31).wrapping_add(col).to_string()), + ColKind::Str => json!(format!("row-{}-c{}", row, col)), + } +} + +fn annotation_value() -> serde_json::Value { + json!({ + "measures": {}, + "dimensions": {}, + "segments": {}, + "timeDimensions": {}, + }) +} + +fn build_row_json(rows: usize, kinds: &[ColKind]) -> String { + let names: Vec<String> = kinds + .iter() + .enumerate() + .map(|(i, k)| field_name(i, *k)) + .collect(); + + let data: Vec<serde_json::Value> = (0..rows) + .map(|r| { + let mut row_obj = serde_json::Map::with_capacity(kinds.len()); + for (c, kind) in kinds.iter().enumerate() { + row_obj.insert(names[c].clone(), cell_value(r, c, *kind)); + } + serde_json::Value::Object(row_obj) + }) + .collect(); + + let response = json!({ + "results": [{ + "annotation": annotation_value(), + "data": data, + }] + }); + + serde_json::to_string(&response).expect("serialize row json") +} + +fn build_columnar_json(rows: usize, kinds: &[ColKind]) -> String { + let names: Vec<String> = kinds + .iter() + .enumerate() + .map(|(i, k)| field_name(i, *k)) + .collect(); + + let columns: Vec<Vec<serde_json::Value>> = kinds + .iter() + .enumerate() + .map(|(c, kind)| (0..rows).map(|r| cell_value(r, c, *kind)).collect()) + .collect(); + + let response = json!({ + "results": [{ + "annotation": annotation_value(), + "data": { + "members": names, + "columns": columns, + }, + }] + }); + + serde_json::to_string(&response).expect("serialize columnar json") +} + +struct Inputs { + schema: SchemaRef, + member_fields: Vec<MemberField>, + row_json: String, + columnar_json: String, +} + +fn build_inputs(rows: usize, cols: usize, time_dims: usize) -> Inputs { + let kinds = col_kinds(cols, time_dims); + Inputs { + schema: build_schema(&kinds), + member_fields: build_member_fields(&kinds), + row_json: build_row_json(rows, &kinds), + columnar_json: build_columnar_json(rows, &kinds), + } +} + +fn
sample_size_for(rows: usize) -> usize { + // Default Criterion sample size is 100; trim for the heavy shapes so + // a full matrix run completes in reasonable wall time. + if rows >= 50_000 { + 10 + } else if rows >= 10_000 { + 20 + } else { + 50 + } +} + +fn bench_transform_response(c: &mut Criterion) { + let mut group = c.benchmark_group("transform_response"); + + for &rows in ROWS { + let sample = sample_size_for(rows); + group.sample_size(sample); + group.throughput(Throughput::Elements(rows as u64)); + + for &cols in COLS { + for &td in TIME_DIMS { + if td > cols { + continue; + } + let inputs = build_inputs(rows, cols, td); + let Inputs { + schema, + member_fields, + row_json, + columnar_json, + } = &inputs; + + let row_id = format!("row/rows={}/cols={}/td={}", rows, cols, td); + group.bench_with_input( + BenchmarkId::from_parameter(&row_id), + row_json.as_str(), + |b, json| { + b.iter(|| { + let value: serde_json::Value = + serde_json::from_str(json).expect("row from_str"); + let response: TransportLoadResponse = + serde_json::from_value(value).expect("row from_value"); + convert_transport_response( + response, + schema.clone(), + member_fields.clone(), + ) + .expect("convert_transport_response") + }) + }, + ); + + let col_id = format!("columnar/rows={}/cols={}/td={}", rows, cols, td); + group.bench_with_input( + BenchmarkId::from_parameter(&col_id), + columnar_json.as_str(), + |b, json| { + b.iter(|| { + let value: serde_json::Value = + serde_json::from_str(json).expect("columnar from_str"); + let response: TransportLoadResponseColumnar = + serde_json::from_value(value).expect("columnar from_value"); + convert_transport_response_columnar( + response, + schema.clone(), + member_fields.clone(), + ) + .expect("convert_transport_response_columnar") + }) + }, + ); + + drop(inputs); + } + } + } + + group.finish(); +} + +criterion_group!(benches, bench_transform_response); +criterion_main!(benches); diff --git a/rust/cubesql/cubesql/src/compile/engine/df/scan.rs 
b/rust/cubesql/cubesql/src/compile/engine/df/scan.rs index 34f412f440152..f2406c1070821 100644 --- a/rust/cubesql/cubesql/src/compile/engine/df/scan.rs +++ b/rust/cubesql/cubesql/src/compile/engine/df/scan.rs @@ -11,7 +11,9 @@ use crate::{ }; use async_trait::async_trait; use chrono::{Datelike, NaiveDate}; -use cubeclient::models::{V1LoadRequestQuery, V1LoadResponse, V1LoadResult}; +use cubeclient::models::{ + V1LoadRequestQuery, V1LoadResponse, V1LoadResult, V1LoadResultDataColumnar, +}; pub use datafusion::{ arrow::{ array::{ @@ -318,6 +320,18 @@ pub trait ValueObject { ) -> std::result::Result<FieldValue<'_>, CubeError>; } +pub trait ColumnarValueObject { + fn len(&mut self) -> std::result::Result<usize, CubeError>; + + fn column<'a>( + &'a mut self, + field_name: &str, + ) -> std::result::Result< + Box<dyn Iterator<Item = std::result::Result<FieldValue<'a>, CubeError>> + 'a>, + CubeError, + >; +} + pub struct JsonValueObject { rows: Vec<Value>, } @@ -328,6 +342,23 @@ impl JsonValueObject { } } +fn json_value_to_field_value(value: &Value) -> std::result::Result<FieldValue<'_>, CubeError> { + Ok(match value { + Value::String(s) => FieldValue::String(Cow::Borrowed(s)), + Value::Number(n) => FieldValue::Number(n.as_f64().ok_or_else(|| { + DataFusionError::Execution(format!("Can't convert {:?} to float", n)) + })?), + Value::Bool(b) => FieldValue::Bool(*b), + Value::Null => FieldValue::Null, + x => { + return Err(CubeError::user(format!( + "Expected primitive value but found: {:?}", + x + ))); + } + }) +} + impl ValueObject for JsonValueObject { fn len(&mut self) -> std::result::Result<usize, CubeError> { Ok(self.rows.len()) } @@ -346,40 +377,89 @@ impl ValueObject for JsonValueObject { }; let value = as_object.get(field_name).unwrap_or(&Value::Null); + json_value_to_field_value(value) + } +} - Ok(match value { - Value::String(s) => FieldValue::String(Cow::Borrowed(s)), - Value::Number(n) => FieldValue::Number(n.as_f64().ok_or( - DataFusionError::Execution(format!("Can't convert {:?} to float", n)), - )?), - Value::Bool(b) => FieldValue::Bool(*b), - Value::Null => FieldValue::Null, - x => {
return Err(CubeError::user(format!( - "Expected primitive value but found: {:?}", - x - ))); - } - }) +pub struct JsonColumnarValueObject { + members: Vec<String>, + columns: Vec<Vec<Value>>, +} + +impl JsonColumnarValueObject { + pub fn new(members: Vec<String>, columns: Vec<Vec<Value>>) -> Self { + debug_assert!( + columns.windows(2).all(|w| w[0].len() == w[1].len()), + "columnar response has ragged columns" + ); + + Self { members, columns } + } +} + +impl ColumnarValueObject for JsonColumnarValueObject { + fn len(&mut self) -> std::result::Result<usize, CubeError> { + Ok(self.columns.first().map(|c| c.len()).unwrap_or(0)) + } + + fn column<'a>( + &'a mut self, + field_name: &str, + ) -> std::result::Result< + Box<dyn Iterator<Item = std::result::Result<FieldValue<'a>, CubeError>> + 'a>, + CubeError, + > { + let Some(idx) = self.members.iter().position(|m| m == field_name) else { + // Match the row-mode `JsonValueObject::get` semantics: a missing field + // is treated as a column of NULLs. + let len = self.columns.first().map(|c| c.len()).unwrap_or(0); + return Ok(Box::new((0..len).map(|_| Ok(FieldValue::Null)))); + }; + + let Some(column) = self.columns.get(idx) else { + return Err(CubeError::user(format!( + "Unexpected response from Cube, missing column for '{}'", + field_name + ))); + }; + + Ok(Box::new(column.iter().map(json_value_to_field_value))) } } +// `$mode` is one of `row` or `columnar`. The `row` arm calls `$response.get(i, field_name)?` +// per cell; the `columnar` arm fetches the entire column slice once via +// `$response.column(field_name)?` and iterates it. +macro_rules! build_column_iter_loop { + (row, $response:expr, $len:expr, $field_name:expr, $value:ident, $body:block) => {{ + for i in 0..$len { + let $value = $response.get(i, $field_name)?; + $body + } + }}; + (columnar, $response:expr, $len:expr, $field_name:expr, $value:ident, $body:block) => {{ + for cell in $response.column($field_name)? { + let $value = cell?; + $body + } + }}; +} + macro_rules!
build_column { - ($data_type:expr, $builder_ty:ty, $response:expr, $field_name:expr, { $($builder_block:tt)* }, { $($scalar_block:tt)* }) => {{ + ($data_type:expr, $builder_ty:ty, $mode:tt, $response:expr, $field_name:expr, { $($builder_block:tt)* }, { $($scalar_block:tt)* }) => {{ let len = $response.len()?; let mut builder = <$builder_ty>::new(len); - build_column_custom_builder!($data_type, len, builder, $response, $field_name, { $($builder_block)* }, { $($scalar_block)* }) + build_column_custom_builder!($data_type, $mode, len, builder, $response, $field_name, { $($builder_block)* }, { $($scalar_block)* }) }} } macro_rules! build_column_custom_builder { - ($data_type:expr, $len:expr, $builder:expr, $response:expr, $field_name: expr, { $($builder_block:tt)* }, { $($scalar_block:tt)* }) => {{ + ($data_type:expr, $mode:tt, $len:expr, $builder:expr, $response:expr, $field_name: expr, { $($builder_block:tt)* }, { $($scalar_block:tt)* }) => {{ match $field_name { MemberField::Member(member) => { let field_name = &member.field_name; - for i in 0..$len { - let value = $response.get(i, &field_name)?; + build_column_iter_loop!($mode, $response, $len, &field_name, value, { match (value, &mut $builder) { (FieldValue::Null, builder) => builder.append_null()?, $($builder_block)* @@ -392,7 +472,7 @@ macro_rules! 
build_column_custom_builder { ))); } }; - } + }); } MemberField::Literal(value) => { for _ in 0..$len { @@ -751,8 +831,8 @@ async fn load_data( } ArrowError::ExternalError(Box::new(err)) })?; - let response = result.first(); - if let Some(data) = response.cloned() { + + if let Some(data) = result.into_iter().next() { match (options.max_records, data.num_rows()) { (Some(max_records), len) if len >= max_records => { return Err(ArrowError::ExternalError(Box::new(CubeError::user( @@ -813,377 +893,409 @@ fn load_to_stream_sync(one_shot_stream: &mut CubeScanOneShotStream) -> Result<() Ok(()) } -pub fn transform_response<V: ValueObject>( - response: &mut V, - schema: SchemaRef, - member_fields: &Vec<MemberField>, -) -> std::result::Result<RecordBatch, CubeError> { - let mut columns = vec![]; - - for (i, schema_field) in schema.fields().iter().enumerate() { - let field_name = &member_fields[i]; - let column = match schema_field.data_type() { - DataType::Utf8 => { - build_column!( - DataType::Utf8, - StringBuilder, - response, - field_name, - { - (FieldValue::String(v), builder) => builder.append_value(v)?, - (FieldValue::Bool(v), builder) => builder.append_value(if v { "true" } else { "false" })?, - (FieldValue::Number(v), builder) => builder.append_value(v.to_string())?, - }, - { - (ScalarValue::Utf8(v), builder) => builder.append_option(v.as_ref())?, - } - ) - } - DataType::Int16 => { - build_column!( - DataType::Int16, - Int16Builder, - response, - field_name, - { - (FieldValue::Number(number), builder) => builder.append_value(number.round() as i16)?, - (FieldValue::String(s), builder) => match s.parse::<i16>() { - Ok(v) => builder.append_value(v)?, - Err(error) => { - warn!( - "Unable to parse value as i16: {}", - error.to_string() - ); - - builder.append_null()?
- } - }, - }, - { - (ScalarValue::Int16(v), builder) => builder.append_option(*v)?, - } - ) - } - DataType::Int32 => { - build_column!( - DataType::Int32, - Int32Builder, - response, - field_name, - { - (FieldValue::Number(number), builder) => builder.append_value(number.round() as i32)?, - (FieldValue::String(s), builder) => match s.parse::() { - Ok(v) => builder.append_value(v)?, - Err(error) => { - warn!( - "Unable to parse value as i32: {}", - error.to_string() - ); - - builder.append_null()? - } - }, - }, - { - (ScalarValue::Int32(v), builder) => builder.append_option(*v)?, - } - ) - } - DataType::Int64 => { - build_column!( - DataType::Int64, - Int64Builder, - response, - field_name, - { - (FieldValue::Number(number), builder) => builder.append_value(number.round() as i64)?, - (FieldValue::String(s), builder) => match s.parse::() { - Ok(v) => builder.append_value(v)?, - Err(error) => { - warn!( - "Unable to parse value as i64: {}", - error.to_string() - ); - - builder.append_null()? - } - }, - }, - { - (ScalarValue::Int64(v), builder) => builder.append_option(*v)?, - } - ) - } - DataType::Float32 => { - build_column!( - DataType::Float32, - Float32Builder, - response, - field_name, - { - (FieldValue::Number(number), builder) => builder.append_value(number as f32)?, - (FieldValue::String(s), builder) => match s.parse::() { - Ok(v) => builder.append_value(v)?, - Err(error) => { - warn!( - "Unable to parse value as f32: {}", - error.to_string() - ); - - builder.append_null()? - } +// Body of `transform_response` / `transform_columnar_response`. The two functions differ +// only in how they iterate per-cell values: row-major (`$mode = row`) goes through +// `ValueObject::get` per cell, columnar (`$mode = columnar`) fetches the whole column once +// via `ColumnarValueObject::column`. Per-`DataType` coercion arms are shared. +macro_rules! 
transform_response_body { + ($mode:tt, $response:expr, $schema:expr, $member_fields:expr) => {{ + let mut columns = vec![]; + + for (i, schema_field) in $schema.fields().iter().enumerate() { + let field_name = &$member_fields[i]; + let column = match schema_field.data_type() { + DataType::Utf8 => { + build_column!( + DataType::Utf8, + StringBuilder, + $mode, + $response, + field_name, + { + (FieldValue::String(v), builder) => builder.append_value(v)?, + (FieldValue::Bool(v), builder) => builder.append_value(if v { "true" } else { "false" })?, + (FieldValue::Number(v), builder) => builder.append_value(v.to_string())?, }, - }, - { - (ScalarValue::Float32(v), builder) => builder.append_option(*v)?, - } - ) - } - DataType::Float64 => { - build_column!( - DataType::Float64, - Float64Builder, - response, - field_name, - { - (FieldValue::Number(number), builder) => builder.append_value(number)?, - (FieldValue::String(s), builder) => match s.parse::() { - Ok(v) => builder.append_value(v)?, - Err(error) => { - warn!( - "Unable to parse value as f64: {}", - error.to_string() - ); - - builder.append_null()? - } - }, - }, - { - (ScalarValue::Float64(v), builder) => builder.append_option(*v)?, - } - ) - } - DataType::Boolean => { - build_column!( - DataType::Boolean, - BooleanBuilder, - response, - field_name, - { - (FieldValue::Bool(v), builder) => builder.append_value(v)?, - (FieldValue::String(v), builder) => match v.as_ref() { - "true" | "1" => builder.append_value(true)?, - "false" | "0" => builder.append_value(false)?, - _ => { - log::error!("Unable to map value {:?} to DataType::Boolean (returning null)", v); - - builder.append_null()? 
- } + { + (ScalarValue::Utf8(v), builder) => builder.append_option(v.as_ref())?, + } + ) + } + DataType::Int16 => { + build_column!( + DataType::Int16, + Int16Builder, + $mode, + $response, + field_name, + { + (FieldValue::Number(number), builder) => builder.append_value(number.round() as i16)?, + (FieldValue::String(s), builder) => match s.parse::() { + Ok(v) => builder.append_value(v)?, + Err(error) => { + warn!( + "Unable to parse value as i16: {}", + error.to_string() + ); + + builder.append_null()? + } + }, }, - }, - { - (ScalarValue::Boolean(v), builder) => builder.append_option(*v)?, - } - ) - } - DataType::Timestamp(TimeUnit::Nanosecond, None) => { - build_column!( - DataType::Timestamp(TimeUnit::Nanosecond, None), - TimestampNanosecondBuilder, - response, - field_name, - { - (FieldValue::String(s), builder) => { - let timestamp = parse_date_str(s.as_ref())?; - // TODO switch parsing to microseconds - if timestamp.and_utc().timestamp_millis() > (((1i64) << 62) / 1_000_000) { - builder.append_null()?; - } else if let Some(nanos) = timestamp.and_utc().timestamp_nanos_opt() { - builder.append_value(nanos)?; - } else { - log::error!( - "Unable to cast timestamp value to nanoseconds: {}", - timestamp.to_string() - ); - builder.append_null()?; - } + { + (ScalarValue::Int16(v), builder) => builder.append_option(*v)?, + } + ) + } + DataType::Int32 => { + build_column!( + DataType::Int32, + Int32Builder, + $mode, + $response, + field_name, + { + (FieldValue::Number(number), builder) => builder.append_value(number.round() as i32)?, + (FieldValue::String(s), builder) => match s.parse::() { + Ok(v) => builder.append_value(v)?, + Err(error) => { + warn!( + "Unable to parse value as i32: {}", + error.to_string() + ); + + builder.append_null()? 
+ } + }, }, - }, - { - (ScalarValue::TimestampNanosecond(v, None), builder) => builder.append_option(*v)?, - } - ) - } - DataType::Timestamp(TimeUnit::Millisecond, None) => { - build_column!( - DataType::Timestamp(TimeUnit::Millisecond, None), - TimestampMillisecondBuilder, - response, - field_name, - { - (FieldValue::String(s), builder) => { - let timestamp = parse_date_str(s.as_ref())?; - // TODO switch parsing to microseconds - if timestamp.and_utc().timestamp_millis() > (((1 as i64) << 62) / 1_000_000) { - builder.append_null()?; - } else { - builder.append_value(timestamp.and_utc().timestamp_millis())?; - } + { + (ScalarValue::Int32(v), builder) => builder.append_option(*v)?, + } + ) + } + DataType::Int64 => { + build_column!( + DataType::Int64, + Int64Builder, + $mode, + $response, + field_name, + { + (FieldValue::Number(number), builder) => builder.append_value(number.round() as i64)?, + (FieldValue::String(s), builder) => match s.parse::() { + Ok(v) => builder.append_value(v)?, + Err(error) => { + warn!( + "Unable to parse value as i64: {}", + error.to_string() + ); + + builder.append_null()? 
+                        }
+                    },
                },
-            },
-            {
-                (ScalarValue::TimestampMillisecond(v, None), builder) => builder.append_option(*v)?,
-            }
-        )
-    }
-    DataType::Date32 => {
-        build_column!(
-            DataType::Date32,
-            Date32Builder,
-            response,
-            field_name,
-            {
-                (FieldValue::String(s), builder) => {
-                    let date = NaiveDate::parse_from_str(s.as_ref(), "%Y-%m-%d")
-                        // FIXME: temporary solution for cases when expected type is Date32
-                        // but underlying data is a Timestamp
-                        .or_else(|_| NaiveDate::parse_from_str(s.as_ref(), "%Y-%m-%dT00:00:00.000"))
-                        .map_err(|e| {
-                            DataFusionError::Execution(format!(
-                                "Can't parse date: '{}': {}",
-                                s, e
-                            ))
-                        });
-                    match date {
-                        Ok(date) => {
-                            let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
-                            let days_since_epoch = date.num_days_from_ce() - epoch.num_days_from_ce();
-                            builder.append_value(days_since_epoch)?;
+                {
+                    (ScalarValue::Int64(v), builder) => builder.append_option(*v)?,
+                }
+            )
+        }
+        DataType::Float32 => {
+            build_column!(
+                DataType::Float32,
+                Float32Builder,
+                $mode,
+                $response,
+                field_name,
+                {
+                    (FieldValue::Number(number), builder) => builder.append_value(number as f32)?,
+                    (FieldValue::String(s), builder) => match s.parse::<f32>() {
+                        Ok(v) => builder.append_value(v)?,
+                        Err(error) => {
+                            warn!(
+                                "Unable to parse value as f32: {}",
+                                error.to_string()
+                            );
+
+                            builder.append_null()?
                        }
+                    },
+                },
+                {
+                    (ScalarValue::Float32(v), builder) => builder.append_option(*v)?,
+                }
+            )
+        }
+        DataType::Float64 => {
+            build_column!(
+                DataType::Float64,
+                Float64Builder,
+                $mode,
+                $response,
+                field_name,
+                {
+                    (FieldValue::Number(number), builder) => builder.append_value(number)?,
+                    (FieldValue::String(s), builder) => match s.parse::<f64>() {
+                        Ok(v) => builder.append_value(v)?,
                        Err(error) => {
-                            log::error!(
-                                "Unable to parse value as Date32: {}",
+                            warn!(
+                                "Unable to parse value as f64: {}",
                                error.to_string()
                            );

                            builder.append_null()?
                        }
-                    }
+                    },
+                },
+                {
+                    (ScalarValue::Float64(v), builder) => builder.append_option(*v)?,
+                }
-            },
-            {
-                (ScalarValue::Date32(v), builder) => builder.append_option(*v)?,
-            }
-        )
-    }
-    DataType::Decimal(precision, scale) => {
-        let len = response.len()?;
-        let mut builder = DecimalBuilder::new(len, *precision, *scale);
-
-        build_column_custom_builder!(
-            DataType::Decimal(*precision, *scale),
-            len,
-            builder,
-            response,
-            field_name,
-            {
-                (FieldValue::String(s), builder) => {
-                    let mut parts = s.split(".");
-                    match parts.next() {
-                        None => builder.append_null()?,
-                        Some(int_part) => {
-                            let frac_part = format!("{:0<width$}", parts.next().unwrap_or(""), width = *scale);
-                            if frac_part.len() > *scale {
-                                Err(DataFusionError::Execution(format!("Decimal scale is higher than requested: expected {}, got {}", scale, frac_part.len())))?;
-                            }
-                            if let Some(_) = parts.next() {
-                                Err(DataFusionError::Execution(format!("Unable to parse decimal, value contains two dots: {}", s)))?;
+            )
+        }
+        DataType::Boolean => {
+            build_column!(
+                DataType::Boolean,
+                BooleanBuilder,
+                $mode,
+                $response,
+                field_name,
+                {
+                    (FieldValue::Bool(v), builder) => builder.append_value(v)?,
+                    (FieldValue::String(v), builder) => match v.as_ref() {
+                        "true" | "1" => builder.append_value(true)?,
+                        "false" | "0" => builder.append_value(false)?,
+                        _ => {
+                            log::error!("Unable to map value {:?} to DataType::Boolean (returning null)", v);
+
+                            builder.append_null()?
+                        }
+                    },
+                },
+                {
+                    (ScalarValue::Boolean(v), builder) => builder.append_option(*v)?,
+                }
+            )
+        }
+        DataType::Timestamp(TimeUnit::Nanosecond, None) => {
+            build_column!(
+                DataType::Timestamp(TimeUnit::Nanosecond, None),
+                TimestampNanosecondBuilder,
+                $mode,
+                $response,
+                field_name,
+                {
+                    (FieldValue::String(s), builder) => {
+                        let timestamp = parse_date_str(s.as_ref())?;
+                        // TODO switch parsing to microseconds
+                        if timestamp.and_utc().timestamp_millis() > (((1i64) << 62) / 1_000_000) {
+                            builder.append_null()?;
+                        } else if let Some(nanos) = timestamp.and_utc().timestamp_nanos_opt() {
+                            builder.append_value(nanos)?;
+                        } else {
+                            log::error!(
+                                "Unable to cast timestamp value to nanoseconds: {}",
+                                timestamp.to_string()
+                            );
+                            builder.append_null()?;
+                        }
+                    },
+                },
+                {
+                    (ScalarValue::TimestampNanosecond(v, None), builder) => builder.append_option(*v)?,
+                }
+            )
+        }
+        DataType::Timestamp(TimeUnit::Millisecond, None) => {
+            build_column!(
+                DataType::Timestamp(TimeUnit::Millisecond, None),
+                TimestampMillisecondBuilder,
+                $mode,
+                $response,
+                field_name,
+                {
+                    (FieldValue::String(s), builder) => {
+                        let timestamp = parse_date_str(s.as_ref())?;
+                        // TODO switch parsing to microseconds
+                        if timestamp.and_utc().timestamp_millis() > (((1 as i64) << 62) / 1_000_000) {
+                            builder.append_null()?;
+                        } else {
+                            builder.append_value(timestamp.and_utc().timestamp_millis())?;
+                        }
+                    },
+                },
+                {
+                    (ScalarValue::TimestampMillisecond(v, None), builder) => builder.append_option(*v)?,
+                }
+            )
+        }
+        DataType::Date32 => {
+            build_column!(
+                DataType::Date32,
+                Date32Builder,
+                $mode,
+                $response,
+                field_name,
+                {
+                    (FieldValue::String(s), builder) => {
+                        let date = NaiveDate::parse_from_str(s.as_ref(), "%Y-%m-%d")
+                            // FIXME: temporary solution for cases when expected type is Date32
+                            // but underlying data is a Timestamp
+                            .or_else(|_| NaiveDate::parse_from_str(s.as_ref(), "%Y-%m-%dT00:00:00.000"))
+                            .map_err(|e| {
+                                DataFusionError::Execution(format!(
+                                    "Can't parse date: '{}': {}",
+                                    s, e
+                                ))
+                            });
+                        match date {
+                            Ok(date) => {
+                                let epoch = NaiveDate::from_ymd_opt(1970, 1, 1).unwrap();
+                                let days_since_epoch = date.num_days_from_ce() - epoch.num_days_from_ce();
+                                builder.append_value(days_since_epoch)?;
                            }
-                            let decimal_str = format!("{}{}", int_part, frac_part);
-                            if decimal_str.len() > *precision {
+                            Err(error) => {
+                                log::error!(
+                                    "Unable to parse value as Date32: {}",
+                                    error.to_string()
+                                );
+
+                                builder.append_null()?
                            }
-                            if let Ok(value) = decimal_str.parse::<i128>() {
-                                builder.append_value(value)?;
-                            } else {
-                                Err(DataFusionError::Execution(format!("Unable to parse decimal as an i128: {}", decimal_str)))?;
+                        }
+                    }
+                },
+                {
+                    (ScalarValue::Date32(v), builder) => builder.append_option(*v)?,
+                }
+            )
+        }
+        DataType::Decimal(precision, scale) => {
+            let len = $response.len()?;
+            let mut builder = DecimalBuilder::new(len, *precision, *scale);
+
+            build_column_custom_builder!(
+                DataType::Decimal(*precision, *scale),
+                $mode,
+                len,
+                builder,
+                $response,
+                field_name,
+                {
+                    (FieldValue::String(s), builder) => {
+                        let mut parts = s.split(".");
+                        match parts.next() {
+                            None => builder.append_null()?,
+                            Some(int_part) => {
+                                let frac_part = format!("{:0<width$}", parts.next().unwrap_or(""), width = *scale);
+                                if frac_part.len() > *scale {
+                                    Err(DataFusionError::Execution(format!("Decimal scale is higher than requested: expected {}, got {}", scale, frac_part.len())))?;
+                                }
+                                if let Some(_) = parts.next() {
+                                    Err(DataFusionError::Execution(format!("Unable to parse decimal, value contains two dots: {}", s)))?;
+                                }
+                                let decimal_str = format!("{}{}", int_part, frac_part);
+                                if decimal_str.len() > *precision {
+                                    Err(DataFusionError::Execution(format!("Decimal precision is higher than requested: expected {}, got {}", precision, decimal_str.len())))?;
+                                }
+                                if let Ok(value) = decimal_str.parse::<i128>() {
+                                    builder.append_value(value)?;
+                                } else {
+                                    Err(DataFusionError::Execution(format!("Unable to parse decimal as an i128: {}", decimal_str)))?;
+                                }
                            }
+                        };
+                    },
+                },
+                {
+                    (ScalarValue::Decimal128(v, _, _), builder) => {
+                        // TODO: check precision and scale, adjust accordingly
+                        if let Some(v) = v {
+                            builder.append_value(*v)?;
+                        } else {
+                            builder.append_null()?;
                        }
-                    };
+                    },
+                }
+            )
+        }
+        DataType::Interval(IntervalUnit::YearMonth) => {
+            build_column!(
+                DataType::Interval(IntervalUnit::YearMonth),
+                IntervalYearMonthBuilder,
+                $mode,
+                $response,
+                field_name,
+                {
+                    // TODO
                },
-            },
-            {
-                (ScalarValue::Decimal128(v, _, _), builder) => {
-                    // TODO: check precision and scale, adjust accordingly
-                    if let Some(v) = v {
-                        builder.append_value(*v)?;
-                    } else {
-                        builder.append_null()?;
-                    }
+                {
+                    (ScalarValue::IntervalYearMonth(v), builder) => builder.append_option(*v)?,
+                }
+            )
+        }
+        DataType::Interval(IntervalUnit::DayTime) => {
+            build_column!(
+                DataType::Interval(IntervalUnit::DayTime),
+                IntervalDayTimeBuilder,
+                $mode,
+                $response,
+                field_name,
+                {
+                    // TODO
                },
-            }
-        )
-    }
-    DataType::Interval(IntervalUnit::YearMonth) => {
-        build_column!(
-            DataType::Interval(IntervalUnit::YearMonth),
-            IntervalYearMonthBuilder,
-            response,
-            field_name,
-            {
-                // TODO
-            },
-            {
-                (ScalarValue::IntervalYearMonth(v), builder) => builder.append_option(*v)?,
-            }
-        )
-    }
-    DataType::Interval(IntervalUnit::DayTime) => {
-        build_column!(
-            DataType::Interval(IntervalUnit::DayTime),
-            IntervalDayTimeBuilder,
-            response,
-            field_name,
-            {
-                // TODO
-            },
-            {
-                (ScalarValue::IntervalDayTime(v), builder) => builder.append_option(*v)?,
-            }
-        )
-    }
-    DataType::Interval(IntervalUnit::MonthDayNano) => {
-        build_column!(
-            DataType::Interval(IntervalUnit::MonthDayNano),
-            IntervalMonthDayNanoBuilder,
-            response,
-            field_name,
-            {
-                // TODO
-            },
-            {
-                (ScalarValue::IntervalMonthDayNano(v), builder) => builder.append_option(*v)?,
-            }
-        )
-    }
-    DataType::Null => {
-        let len = response.len()?;
-        let array = NullArray::new(len);
-        Arc::new(array)
-    }
-    t => {
-        return Err(CubeError::user(format!(
-            "Type {} is not supported in response transformation from Cube",
-            t,
-        )))
-    }
-    };
+                {
+                    (ScalarValue::IntervalDayTime(v), builder) => builder.append_option(*v)?,
+                }
+            )
+        }
+        DataType::Interval(IntervalUnit::MonthDayNano) => {
+            build_column!(
+                DataType::Interval(IntervalUnit::MonthDayNano),
+                IntervalMonthDayNanoBuilder,
+                $mode,
+                $response,
+                field_name,
+                {
+                    // TODO
+                },
+                {
+                    (ScalarValue::IntervalMonthDayNano(v), builder) => builder.append_option(*v)?,
+                }
+            )
+        }
+        DataType::Null => {
+            let len = $response.len()?;
+            let array = NullArray::new(len);
+            Arc::new(array)
+        }
+        t => {
+            return Err(CubeError::user(format!(
+                "Type {} is not supported in response transformation from Cube",
+                t,
+            )))
+        }
+        };

-        columns.push(column);
-    }
+            columns.push(column);
+        }

-    Ok(RecordBatch::try_new(schema.clone(), columns)?)
+        Ok(RecordBatch::try_new($schema.clone(), columns)?)
+    }};
+}
+
+pub fn transform_response<V: ValueObject>(
+    response: &mut V,
+    schema: SchemaRef,
+    member_fields: &Vec<MemberField>,
+) -> std::result::Result<RecordBatch, CubeError> {
+    transform_response_body!(row, response, schema, member_fields)
+}
+
+pub fn transform_columnar_response<C: ColumnarValueObject>(
+    response: &mut C,
+    schema: SchemaRef,
+    member_fields: &Vec<MemberField>,
+) -> std::result::Result<RecordBatch, CubeError> {
+    transform_response_body!(columnar, response, schema, member_fields)
 }

 pub fn convert_transport_response(
@@ -1218,6 +1330,39 @@ pub fn convert_transport_response(
         .collect::<std::result::Result<Vec<RecordBatch>, CubeError>>()
 }

+pub fn convert_transport_response_columnar(
+    response: V1LoadResponse,
+    schema: SchemaRef,
+    member_fields: Vec<MemberField>,
+) -> std::result::Result<Vec<RecordBatch>, CubeError> {
+    response
+        .results
+        .into_iter()
+        .map(|result| {
+            let V1LoadResult {
+                data,
+                last_refresh_time,
+                ..
+            } = result;
+            let V1LoadResultDataColumnar { members, columns } = data;
+
+            let mut response = JsonColumnarValueObject::new(members, columns);
+            let updated_schema = if let Some(last_refresh_time) = last_refresh_time {
+                let mut metadata = schema.metadata().clone();
+                metadata.insert("lastRefreshTime".to_string(), last_refresh_time);
+                Arc::new(Schema::new_with_metadata(
+                    schema.fields().to_vec(),
+                    metadata,
+                ))
+            } else {
+                schema.clone()
+            };
+
+            transform_columnar_response(&mut response, updated_schema, &member_fields)
+        })
+        .collect::<std::result::Result<Vec<RecordBatch>, CubeError>>()
+}
+
 #[cfg(test)]
 mod tests {
     use super::*;
diff --git a/rust/cubesql/cubesql/src/transport/mod.rs b/rust/cubesql/cubesql/src/transport/mod.rs
index 8ed401947603e..d95c1957e0707 100644
--- a/rust/cubesql/cubesql/src/transport/mod.rs
+++ b/rust/cubesql/cubesql/src/transport/mod.rs
@@ -26,6 +26,8 @@ pub type CubeMetaDimensionOrder = cubeclient::models::V1CubeMetaDimensionOrder;
 pub type CubeMetaFormat = cubeclient::models::V1CubeMetaFormat;
 // Request/Response
 pub type TransportLoadResponse = cubeclient::models::V1LoadResponse;
+pub type TransportLoadResponseColumnar =
+    cubeclient::models::V1LoadResponse;
 pub type TransportLoadRequestQuery = cubeclient::models::V1LoadRequestQuery;
 pub type TransportLoadRequest = cubeclient::models::V1LoadRequest;
 pub type TransportLoadRequestCacheMode = cubeclient::models::Cache;