UN-3414 [FIX] SharePoint OAuth connector creation and metadata handling #1919
kirtimanmishrazipstack wants to merge 9 commits into main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Summary by CodeRabbit
Walkthrough: Connector metadata handling was changed to preserve non-token fields while merging refreshed OAuth token fields.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    rect rgba(63, 81, 181, 0.5)
        participant Client
    end
    rect rgba(0, 150, 136, 0.5)
        participant View
        participant Serializer
    end
    rect rgba(255, 193, 7, 0.5)
        participant ConnectorAuthCache
        participant ConnectorAuthModel
    end
    rect rgba(233, 30, 99, 0.5)
        participant DB
    end
    Client->>View: POST/PUT with oauth_key and connector_metadata
    View->>ConnectorAuthCache: fetch cached oauth tokens (oauth_key)
    ConnectorAuthCache-->>View: oauth_tokens or MissingParam
    View->>View: merge oauth_tokens with non-secret request metadata
    View->>Serializer: initialize with merged metadata
    Serializer->>ConnectorAuthModel: _refresh_tokens(provider, uid, existing_metadata)
    ConnectorAuthModel->>ConnectorAuthModel: call external token refresh
    ConnectorAuthModel-->>Serializer: refreshed token subset (whitelisted)
    Serializer->>DB: save ConnectorInstance with merged metadata
    DB-->>Serializer: saved instance
    Serializer-->>Client: response
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
| Filename | Overview |
|---|---|
| backend/connector_auth_v2/constants.py | Adds OAUTH_TOKEN_KEYS frozenset as a shared constant to define exactly which keys are safe to propagate across connectors sharing the same (provider, uid). |
| backend/connector_auth_v2/models.py | Sibling-loop in get_and_refresh_tokens now whitelist-merges only OAUTH_TOKEN_KEYS from extra_data instead of assigning the full dict, preventing form-field cross-contamination between connectors sharing the same OAuth identity. |
| backend/connector_v2/fields.py | _refresh_tokens now accepts existing_metadata and uses the OAUTH_TOKEN_KEYS whitelist before merging, so per-instance form fields (site_url, drive_id) are preserved on DB read-triggered auto-refresh. |
| backend/connector_v2/serializers.py | Adds connector_name backfill from JSON schema default in validate() with correct partial-update guard; save() uses whitelist-merge for refreshed tokens; connector_name marked required=False via extra_kwargs. |
| backend/connector_v2/views.py | _get_connector_metadata now merges form fields with OAuth tokens ({**form_metadata, **oauth_tokens}) instead of replacing; adds an isinstance guard for non-dict form values. |
| unstract/connectors/src/unstract/connectors/filesystems/sharepoint/sharepoint.py | Fixes drive-by-ID lookup to use bracket indexing (ctx.drives[drive_id]) and SharePoint site resolution to use ctx.sites.get_by_url(site_url).drive instead of the broken get_by_path approach. |
| unstract/connectors/src/unstract/connectors/filesystems/sharepoint/static/json_schema.json | Updates user_email description to clarify it's only for OneDrive Client Credentials; marks the field as format: "password" (already present in the base, description now clearer). |
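The whitelist merge described in the models.py and fields.py rows above can be sketched in isolation. This is a minimal illustration: the key set mirrors the PR's `OAUTH_TOKEN_KEYS` constant, while `merge_refreshed_tokens` and the sample metadata are hypothetical, not the actual backend code.

```python
# Keys assumed safe to propagate across connectors sharing (provider, uid);
# mirrors the PR's OAUTH_TOKEN_KEYS constant.
OAUTH_TOKEN_KEYS = frozenset(
    {"access_token", "refresh_token", "token_type", "expires",
     "auth_time", "refresh_after", "expires_in", "scope"}
)

def merge_refreshed_tokens(metadata: dict, extra_data: dict) -> dict:
    """Copy only token keys from the shared extra_data into this
    connector's metadata, leaving per-instance form fields untouched."""
    token_updates = {k: v for k, v in extra_data.items() if k in OAUTH_TOKEN_KEYS}
    return {**metadata, **token_updates}

stored = {"site_url": "https://contoso.sharepoint.com/sites/hr",
          "drive_id": "b!abc", "access_token": "old"}
shared = {"access_token": "new", "refresh_token": "r1",
          "site_url": "https://contoso.sharepoint.com/sites/finance"}

merged = merge_refreshed_tokens(stored, shared)
print(merged["access_token"])  # new token flows in
print(merged["site_url"])      # per-connector form field preserved
```

Note that `site_url` in the shared blob is dropped by the filter, which is exactly the cross-contamination the PR fixes.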
Sequence Diagram

```mermaid
sequenceDiagram
    participant FE as Frontend
    participant View as ConnectorInstanceViewSet
    participant Ser as ConnectorInstanceSerializer
    participant Cache as OAuth Cache
    participant PSA as ConnectorAuth (PSA)
    participant DB as ConnectorInstance DB
    FE->>View: POST /connector/ with oauth_key param
    View->>Cache: get_oauth_creds_from_cache(oauth_key)
    Cache-->>View: oauth_tokens with access_token, uid, provider
    Note over View: form_metadata from request body merged with oauth_tokens
    View->>Ser: serializer.save(connector_metadata=merged)
    Note over Ser: validate() backfills connector_name from schema default
    Ser->>PSA: get_or_create_connector_auth(merged)
    PSA-->>Ser: connector_oauth instance
    Ser->>PSA: get_and_refresh_tokens()
    PSA-->>Ser: refreshed_metadata (full extra_data)
    Note over Ser: token_updates filtered by OAUTH_TOKEN_KEYS
    Ser->>DB: super().save(**kwargs)
    DB-->>FE: 201 Created
```
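The view-level merge step shown in the diagram relies on Python's dict-unpacking precedence: later `**` unpacks win on key collisions, so server-side OAuth tokens override any token-looking keys a client might submit, while non-secret form fields survive. A minimal sketch with hypothetical sample values:

```python
# Later ** unpacks win on duplicate keys, so oauth_tokens takes
# precedence over anything the request body claimed for those keys.
form_metadata = {"site_url": "https://contoso.sharepoint.com/sites/hr",
                 "access_token": "client-supplied-should-lose"}
oauth_tokens = {"access_token": "server-side-token", "refresh_token": "r1"}

connector_metadata = {**form_metadata, **oauth_tokens}
print(connector_metadata["access_token"])  # server-side-token
print(connector_metadata["site_url"])      # form field preserved
```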
Reviews (5): last reviewed commit "greptile partial updatye"
```python
# Discard return value: ConnectorAuth.extra_data is shared across
# every connector with the same (provider, uid) and would overwrite
# this connector's form fields (site_url, drive_id).
connector_oauth.get_and_refresh_tokens()
```
Refreshed tokens silently dropped on update
get_and_refresh_tokens() can issue a token refresh and return updated access_token/refresh_token values. By discarding the return value entirely, those new tokens are never merged back into kwargs[CIKey.CONNECTOR_METADATA], so the ConnectorInstance ends up persisting the old (possibly just-invalidated) tokens. The next time the connector is used for a task, SharePointFS reads access_token straight from its stored settings dict and will fail with an expired credential.
A targeted merge — updating only the token-specific keys instead of the entire blob — preserves form fields (site_url, drive_id) and still keeps tokens current:
```python
refreshed_data, refresh_status = connector_oauth.get_and_refresh_tokens()
# Merge only OAuth token fields back; do not overwrite form-specific keys.
_TOKEN_KEYS = {"access_token", "refresh_token", "expires_in", "token_type", "scope"}
token_updates = {k: v for k, v in refreshed_data.items() if k in _TOKEN_KEYS}
kwargs[CIKey.CONNECTOR_METADATA] = {
    **kwargs.get(CIKey.CONNECTOR_METADATA, {}),
    **token_updates,
}
```

Prompt To Fix With AI
This is a comment left during a code review.
Path: backend/connector_v2/serializers.py
Line: 59
How can I resolve this? If you propose a fix, please make it concise.
Check this @kirtimanmishrazipstack, might affect an existing connector
Addressed — applied the whitelist-based merge in ConnectorInstanceSerializer.save(). Introduced _OAUTH_TOKEN_KEYS (access_token, refresh_token, token_type, expires, auth_time, refresh_after, expires_in, scope) so only token keys flow from the shared extra_data back into kwargs[CIKey.CONNECTOR_METADATA]. super().save(**kwargs) now persists the fresh token instead of the stale one from the request body, and per-connector form fields (site_url, drive_id) are preserved because they sit outside the whitelist.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/connector_v2/views.py`:
- Around line 101-103: When handling OAuth update/reauth flows in views.py the
code builds connector_metadata from only request data and oauth_tokens
(CIKey.CONNECTOR_METADATA -> form_metadata; connector_metadata =
{**form_metadata, **oauth_tokens}), which drops existing instance metadata like
site_url/drive_id when form_metadata is omitted; fix by seeding form_metadata
with the existing instance metadata when available (e.g. base_metadata =
(self.instance.connector_metadata or {}) then overlay with form_metadata and
finally oauth_tokens) so the merge preserves prior non-secret fields; apply the
same change to the other OAuth update block referenced around lines 164-175.
- Around line 108-135: The fallback in _fill_default_connector_name is currently
applied to all creates; restrict it to OAuth connector creation by adding an
early guard that returns unless the request is an OAuth create (e.g., check
request_data.get(CIKey.AUTH_TYPE) == AuthType.OAUTH or
request_data.get(CIKey.AUTH_TYPE) == "oauth"). Update the function to check
CIKey.AUTH_TYPE (and use AuthType.OAUTH constant) before fetching the
schema/details and only proceed to call ConnectorProcessor.get_json_schema and
set CIKey.CONNECTOR_NAME when the request is for an OAuth connector.
In
`@unstract/connectors/src/unstract/connectors/filesystems/sharepoint/sharepoint.py`:
- Around line 139-141: The code sets self._drive using ctx.drives[self.drive_id]
and later accesses ctx.sites.get_by_url(self.site_url).drive but does not
execute the required query chain; update both locations to call
.get().execute_query() on the drive resource so the drive object is initialized
(e.g., replace ctx.drives[self.drive_id] and the .drive access from
ctx.sites.get_by_url(self.site_url) with the same .get().execute_query()
pattern) and assign the resulting loaded object to self._drive/drive as done
elsewhere in this module.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 981ec789-a892-48cf-8fdf-ab45b5cd9d44
📒 Files selected for processing (4)
- backend/connector_v2/serializers.py
- backend/connector_v2/views.py
- unstract/connectors/src/unstract/connectors/filesystems/sharepoint/sharepoint.py
- unstract/connectors/src/unstract/connectors/filesystems/sharepoint/static/json_schema.json
```python
# Preserve non-secret form fields (e.g. site_url for the SharePoint connector)
form_metadata = self.request.data.get(CIKey.CONNECTOR_METADATA) or {}
connector_metadata = {**form_metadata, **oauth_tokens}
```
Preserve existing instance metadata during OAuth update flows.
On update/reauth, if the request has oauth_key but omits connector_metadata, form_metadata becomes {} and the save path persists token-only metadata, dropping existing fields like site_url and drive_id.
🐛 Proposed fix

```diff
-    def _get_connector_metadata(self, connector_id: str) -> dict[str, str] | None:
+    def _get_connector_metadata(
+        self,
+        connector_id: str,
+        base_metadata: dict[str, str] | None = None,
+    ) -> dict[str, str] | None:
@@
-        form_metadata = self.request.data.get(CIKey.CONNECTOR_METADATA) or {}
+        form_metadata = (
+            self.request.data.get(CIKey.CONNECTOR_METADATA)
+            if CIKey.CONNECTOR_METADATA in self.request.data
+            else base_metadata
+        ) or {}
         connector_metadata = {**form_metadata, **oauth_tokens}
@@
-        connector_metadata = self._get_connector_metadata(connector_id)
+        connector_metadata = self._get_connector_metadata(
+            connector_id,
+            base_metadata=serializer.instance.connector_metadata,
+        )
```

Also applies to: 164-175
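The seeding the reviewer proposes amounts to a three-layer merge: stored instance metadata first, then whatever the request supplied, then the fresh tokens. A toy version with hypothetical names and values (`existing_metadata`, `form_metadata`, `oauth_tokens` stand in for the real attributes):

```python
# Start from the stored instance metadata, overlay the request body,
# then overlay the server-side tokens; later layers win on collisions.
existing_metadata = {"site_url": "https://contoso.sharepoint.com/sites/hr",
                     "drive_id": "b!abc", "access_token": "stale"}
form_metadata = {}  # a PATCH that sends oauth_key but omits connector_metadata
oauth_tokens = {"access_token": "fresh", "refresh_token": "r2"}

merged = {**existing_metadata, **form_metadata, **oauth_tokens}
print(merged["site_url"])      # preserved from the stored instance
print(merged["access_token"])  # fresh token, not the stale one
```

Without the first layer, the empty `form_metadata` would leave `merged` token-only, which is exactly the data loss described above.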
The crash case (TypeError on non-dict metadata) is addressed via isinstance guard + json.loads recovery on the create path — see reply on Greptile G3. Seeding form_metadata from self.instance.connector_metadata on PATCH-with-metadata-omitted is a legitimate follow-up but outside this PR's scope (which is OAuth create + refresh persistence). Tracking separately.
@kirtimanmishrazipstack, understood — the isinstance guard in the current code covers the non-dict crash path, and the base_metadata seeding concern is a valid-but-separate gap. Would you like me to open a GitHub issue to track the follow-up (seeding form_metadata from serializer.instance.connector_metadata during OAuth update/reauth when connector_metadata is omitted from the request)?
```python
def _fill_default_connector_name(self, request_data: Any) -> None:
    """Fill ``connector_name`` from the schema default when missing.

    Defense-in-depth for an intermittent frontend race where the RJSF
    default does not reach the POST body for OAuth connector creation,
    causing the serializer to 400 on ``connector_name: required``.
    """
    if request_data.get(CIKey.CONNECTOR_NAME):
        return
    connector_id = request_data.get(CIKey.CONNECTOR_ID)
    if not connector_id:
        return
    try:
        schema_details = ConnectorProcessor.get_json_schema(connector_id=connector_id)
    except InvalidConnectorID:
        return
    default_name = (
        schema_details.get(ConnectorKeys.JSON_SCHEMA, {})
        .get("properties", {})
        .get("connectorName", {})
        .get("default")
    )
    if not default_name:
        return
    request_data[CIKey.CONNECTOR_NAME] = default_name
    logger.info(
        "Filled missing connector_name with schema default for %s", connector_id
    )
```
Scope the connector-name fallback to OAuth creates.
The fallback currently applies to every create request with a missing connector_name, which broadens behavior beyond the OAuth race described in the PR and can let non-OAuth/direct API creates bypass the required-name validation.
🛡️ Proposed fix

```diff
         connector_id = request_data.get(CIKey.CONNECTOR_ID)
         if not connector_id:
             return
+        if not (
+            self.request.query_params.get(ConnectorAuthKey.OAUTH_KEY)
+            and ConnectorInstance.supportsOAuth(connector_id=connector_id)
+        ):
+            return
         try:
             schema_details = ConnectorProcessor.get_json_schema(connector_id=connector_id)
```

Also applies to: 203-204
The helper itself has been removed per Chandrashekhar's suggestion — the fallback now lives in ConnectorInstanceSerializer.validate(). Intentionally applied to all connectors rather than gated on auth_type: the schema default is a valid fallback regardless of auth mode, and the model's blank=False still rejects connectors whose schema has no default, so no silent empty-name can slip through.
@kirtimanmishrazipstack, that makes sense — moving it to ConnectorInstanceSerializer.validate() is a cleaner home for this logic, and the blank=False model constraint is a solid safety net: if the schema has no default, the name stays empty and the model rejects it, so no silent empty-name connector can be persisted regardless of auth mode. Happy with this approach, thanks for the context!
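The serializer-level backfill agreed on above can be sketched without Django. The schema literal and the `backfill_connector_name` helper here are illustrative stand-ins for the real `ConnectorInstanceSerializer.validate()` logic, which reads the schema via `ConnectorProcessor.get_json_schema`:

```python
# Hypothetical schema fragment; mirrors the connectorName lookup path
# the serializer walks (properties -> connectorName -> default).
schema = {
    "properties": {
        "connectorName": {"type": "string", "default": "SharePoint"},
        "site_url": {"type": "string"},
    }
}

def backfill_connector_name(attrs: dict, schema: dict) -> dict:
    """Inject the schema default only when the client omitted the name."""
    if not attrs.get("connector_name"):
        default = (
            schema.get("properties", {})
            .get("connectorName", {})
            .get("default")
        )
        if default:
            attrs["connector_name"] = default
    return attrs

attrs = backfill_connector_name({"connector_id": "sharepoint"}, schema)
print(attrs["connector_name"])  # SharePoint
```

If the schema has no default, `attrs` stays name-less and a `blank=False` model constraint would reject it downstream, matching the safety net described above.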
```diff
         if self.drive_id:
-            # Specific drive by ID
-            self._drive = ctx.drives.get_by_id(self.drive_id)
+            # Specific drive by ID — EntityCollection uses bracket indexing.
+            self._drive = ctx.drives[self.drive_id]
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
# Description:
#   Find the pinned Office365-REST-Python-Client declaration and inspect whether
#   the pinned package source exposes the APIs used by this PR.
#
# Expected:
#   - Dependency declaration shows a version that includes:
#     - EntityCollection.__getitem__(str)
#     - SitesWithRoot.get_by_url(...)
echo "Dependency declarations:"
fd -i '^(pyproject\.toml|requirements.*|poetry\.lock|uv\.lock|Pipfile\.lock|setup\.py)$' -t f \
  -x sh -c 'printf "\n--- %s ---\n" "$1"; rg -n "Office365-REST-Python-Client|office365-rest-python-client" "$1" || true' sh {}

python - <<'PY'
from pathlib import Path
import re
patterns = [
    re.compile(r'Office365-REST-Python-Client[^0-9<>=!~]*([<>=!~]*\s*[0-9][^,\s"\']*)?', re.I),
    re.compile(r'office365-rest-python-client[^0-9<>=!~]*([<>=!~]*\s*[0-9][^,\s"\']*)?', re.I),
]
for path in Path(".").rglob("*"):
    if path.name in {"pyproject.toml", "poetry.lock", "uv.lock", "Pipfile.lock", "setup.py"} or path.name.startswith("requirements"):
        try:
            text = path.read_text(errors="ignore")
        except Exception:
            continue
        for line_no, line in enumerate(text.splitlines(), 1):
            if any(p.search(line) for p in patterns):
                print(f"{path}:{line_no}: {line}")
PY
```

Repository: Zipstack/unstract
Length of output: 4492
🌐 Web query:
Office365-REST-Python-Client version 2.6.0 release notes EntityCollection getitem
💡 Result:
Office365-REST-Python-Client version 2.6.0 was released on March 30, 2025. No detailed release notes or changelog are published on the GitHub releases page for this version; the tag simply notes "2.6.0 release" at commit c93a12c. The PyPI page confirms the release date but provides no specific changes. EntityCollection inherits from ClientObjectCollection (located in office365/runtime/client_object_collection.py), which supports `__getitem__` for indexed access to items. This enables lazy loading: accessing `collection[i]` loads the item on demand if not already fetched. Historical issues (e.g., #187 in 2020) with collection access beyond the initial batch (e.g., `items[100]`) were fixed in later versions, including improvements to lazy-loading support starting from around v2.1.8. By v2.6.0, `EntityCollection.__getitem__` works reliably for accessing items by index after executing a query like `ctx.load(collection); ctx.execute_query()`. No specific changes to `EntityCollection.__getitem__` are mentioned in available sources for v2.6.0. Example usage: `items = ctx.web.lists.get_by_title('Documents').items; ctx.load(items); ctx.execute_query(); item = items[0]  # triggers load if needed`
Citations:
- 1: https://github.com/vgrem/Office365-REST-Python-Client/releases
- 2: https://pypi.python.org/project/Office365-REST-Python-Client/2.6.2/
- 3: https://github.com/vgrem/office365-rest-python-client/releases/tag/2.6.2
- 4: https://pypi.org/project/Office365-REST-Python-Client/
- 5: vgrem/office365-rest-python-client@d21fe7c
- 6: https://github.com/vgrem/Office365-REST-Python-Client/blob/master/office365/runtime/client_object.py
- 7: Sharepoint List get_items() errors itterating past 100 items vgrem/office365-rest-python-client#187
- 8: https://github.com/vgrem/office365-rest-python-client/releases
- 9: https://pypi.org/project/Office365-REST-Python-Client/2.6.0/
- 10: https://github.com/vgrem/Office365-REST-Python-Client/tags
🌐 Web query:
Office365-REST-Python-Client 2.6.0 SitesWithRoot get_by_url API
💡 Result:
In Office365-REST-Python-Client version 2.6.0, SitesWithRoot is a class in the Microsoft Graph API client (GraphClient), defined as `class SitesWithRoot(EntityCollection[Site])` ("Sites container"). It supports the get_by_url method to retrieve a specific SharePoint site by its absolute URL. Usage example with GraphClient (Microsoft Graph v2 APIs for OneDrive/SharePoint): `client = GraphClient(tenant="yourtenant.onmicrosoft.com")` (authenticate as shown in the repo docs), then `site = client.sites.get_by_url("https://yourtenant.sharepoint.com/sites/yoursite")` followed by `site.get().execute_query()` to load site properties, and `print(site.web_url)`. This matches the Microsoft Graph Sites API: GET /sites/{site-id} or GET /sites/{hostname}:/{site-path}. For the legacy SharePoint REST API (ClientContext), connect directly to the site URL: `ctx = ClientContext(site_url).with_credentials(...); web = ctx.web.get().execute_query()`. SitesWithRoot is accessible via client.sites, as seen in issue discussions and examples like `client.sites.get_by_url('https://...')` for accessing lists/sites. Version 2.6.0 released 2025-03-30, supports Graph and SharePoint REST APIs per PyPI/GitHub.
Citations:
- 1: https://github.com/vgrem/office365-rest-python-client
- 2: https://pypi.org/project/Office365-REST-Python-Client/
- 3: https://github.com/vgrem/Office365-REST-Python-Client
- 4: Get all sites vgrem/office365-rest-python-client#301
- 5: I am stuggling to add items to a sharepoint list vgrem/office365-rest-python-client#914
- 6: https://github.com/vgrem/Office365-REST-Python-Client/blob/master/examples/onedrive/folders/list_files.py
- 7: https://github.com/vgrem/office365-rest-python-client/tree/master/office365
- 8: How to navigate a site instead of root ? vgrem/office365-rest-python-client#280
- 9: https://stackoverflow.com/questions/72210088/how-to-programmatically-access-sharepoint-sites-other-than-root-from-office365-r
🏁 Script executed:

```bash
#!/bin/bash
# Check if we can find any release notes or version info in the repository
find . -name "CHANGELOG*" -o -name "HISTORY*" -o -name "RELEASE*" 2>/dev/null | head -20

# Also look for any version-specific documentation or comments in the codebase
rg "2\.6\." --type-list | head -5
rg "2\.6\." . --type py -B2 -A2 | head -30
```

Repository: Zipstack/unstract
Length of output: 450
🌐 Web query:
Office365-REST-Python-Client EntityCollection string ID bracket indexing drives
💡 Result:
In the Office365-REST-Python-Client library, EntityCollection (from office365.entity_collection) is a collection of entities returned by queries like `client.drives.get()`. It supports iteration after execute_query, as shown in the official example: `drives = client.drives.get().execute_query()` then `for drive in drives: print(drive.web_url)`. No direct evidence found in searched sources for bracket indexing (e.g., `drives['id']`) on an already-loaded EntityCollection with string IDs. Collections support positional integer indexing for loaded items (`drives[0]`), but string ID access uses path notation like `client.drives['drive-id']` before executing the query, which returns a single Drive object. To get a drive by string ID: `drive = client.drives['your-drive-id'].get().execute_query()`. EntityCollection does not appear to support direct string key lookup via `__getitem__` on loaded items based on available source code snippets and examples. For large collections, load and filter, or use get_by_id if available on the specific collection type.
Citations:
- 1: https://github.com/vgrem/office365-rest-python-client
- 2: https://github.com/vgrem/Office365-REST-Python-Client/blob/master/office365/runtime/client_object.py
- 3: https://github.com/vgrem/Office365-REST-Python-Client/blob/master/office365/graph_client.py
- 4: https://github.com/vgrem/office365-rest-python-client/tree/master/office365
🏁 Script executed:

```bash
#!/bin/bash
# Check the actual usage context in the PR code
cat -n unstract/connectors/src/unstract/connectors/filesystems/sharepoint/sharepoint.py | sed -n '130,170p'
```

Repository: Zipstack/unstract
Length of output: 1955
🏁 Script executed:

```bash
#!/bin/bash
# Search for execute_query calls and context around line 141 and 160
cat -n unstract/connectors/src/unstract/connectors/filesystems/sharepoint/sharepoint.py | sed -n '134,180p'

# Check for execute_query usage patterns in the file
rg "execute_query|\.get\(\)" unstract/connectors/src/unstract/connectors/filesystems/sharepoint/sharepoint.py -A2 -B2
```

Repository: Zipstack/unstract
Length of output: 3496
🌐 Web query:
Office365-REST-Python-Client GraphClient drives bracket notation get execute_query
💡 Result:
To use the Office365-REST-Python-Client library with GraphClient to access drives via bracket notation (e.g., `client.drives['drive-id']`) and execute a query, follow this pattern based on the library's design. GraphClient exposes drives as an EntityCollection: `client.drives` returns `EntityCollection(self, Drive, ResourcePath("drives"))`. Bracket notation on collections supports retrieving a specific drive by ID, similar to how users are accessed (e.g., `client.users["user-id"]`). Example code to get a specific drive by ID: `client = GraphClient(tenant="your-tenant.onmicrosoft.com").with_client_secret("client_id", "client_secret")`, then `drive = client.drives["b!your-drive-id"].get().execute_query()` and `print(drive.web_url)`. This corresponds to the Microsoft Graph API endpoint GET /drives/{drive-id}. For listing drives first (to get IDs): `drives = client.drives.get().execute_query()` then iterate to print `d.id` and `d.web_url`. The `.get()` method prepares a read query on the ClientObject, and `execute_query()` submits it. No explicit "drives bracket notation" example exists in the docs, but this follows the established pattern from source code and examples like `client.users["id"].drive.get().execute_query()`.
Citations:
- 1: https://github.com/vgrem/Office365-REST-Python-Client/blob/master/examples/onedrive/drives/list.py
- 2: https://github.com/vgrem/office365-rest-python-client
- 3: https://learn.microsoft.com/en-us/graph/api/drive-get?view=graph-rest-1.0
- 4: https://github.com/vgrem/office365-rest-python-client/blob/master/examples/onedrive/drives/list.py
- 5: https://github.com/vgrem/Office365-REST-Python-Client
- 6: https://github.com/vgrem/Office365-REST-Python-Client/blob/master/office365/graph_client.py
- 7: https://github.com/vgrem/Office365-REST-Python-Client/blob/master/office365/runtime/client_object.py
🌐 Web query:
Office365-REST-Python-Client Site drive property lazy loading
💡 Result:
The Office365-REST-Python-Client library supports lazy loading for collections and properties, including SharePoint site drive properties accessed via the Microsoft Graph API. Collections like list items or drive items are implemented as lazy-load collections, meaning properties and items are loaded only when explicitly requested via `ctx.load(...)` or `.execute_query()`, rather than all at once. This avoids unnecessary API calls and improves performance. For a SharePoint site drive: use GraphClient to get the site (`site = client.sites.get_by_url(site_url)`); lists on the site have a `.drive` property (`site.lists["Documents"].drive`); the drive object supports lazy loading, so access properties like `drive.root` only after loading, e.g. `site.lists["Documents"].drive.root.get_files(True, 1000).execute_query()`, which demonstrates accessing a site list drive with pagination support (fixed in v2.5.9+). There is no explicit lazy-loading toggle for the site.drive property; it is inherent to the client model's design, where `ctx.load` or `execute_query` triggers the actual API requests. Official docs are in the GitHub repo's examples (examples/onedrive/sites.py) and issues #79 and #844.
Citations:
- 1: Getting more than 100 items from a list vgrem/office365-rest-python-client#79
- 2: Missing sharepoint drive items vgrem/office365-rest-python-client#844
- 3: https://github.com/vgrem/Office365-REST-Python-Client
- 4: https://github.com/vgrem/Office365-REST-Python-Client/tree/master/examples/onedrive
- 5: https://github.com/vgrem/office365-rest-python-client
- 6: Missing sharepoint drive items vgrem/office365-rest-python-client#844
- 7: https://github.com/vgrem/office365-rest-python-client/tree/master/examples
Add .get().execute_query() calls to initialize drive objects—the current code will fail at runtime.
Line 141 uses bracket notation correctly but is missing the required query execution chain. The correct pattern is ctx.drives[self.drive_id].get().execute_query(), not just ctx.drives[self.drive_id]. This applies similarly to line 160 with ctx.sites.get_by_url(self.site_url).drive—the .drive property access needs chaining with .get().execute_query() to actually load the drive. Both patterns require this chain based on Office365-REST-Python-Client v2.6.0 API semantics. The codebase already uses this pattern elsewhere (e.g., lines 207–209), so follow the same convention here.
Not needed — the Office365 library resolves drive proxies lazily. execute_query() fires inside the first child operation (e.g. drive.root.get_by_path(path).children.get().execute_query() inside ls()). The pre-existing ctx.me.drive and ctx.users[email].drive paths in the same file work the same way without an explicit execute_query() on the drive itself. If staging surfaces a lazy-init issue we'll follow up.
@kirtimanmishrazipstack, that's a clear explanation — the lazy proxy resolution model means the drive object doesn't need its own execute_query() call since the chain fires when the first child operation (like drive.root.get_by_path(...).children.get().execute_query()) is invoked. The consistency with ctx.me.drive and ctx.users[email].drive in the same file confirms this is the correct pattern. I'll withdraw the concern.
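The lazy-resolution behavior the thread settles on can be modeled with a toy proxy: constructing the drive object is free, and the (pretend) query fires on the first child operation. This is a simplified stand-in for the Office365 client's ClientObject model, not its actual implementation; `LazyDrive` and its members are hypothetical.

```python
class LazyDrive:
    """Toy stand-in for an office365 drive proxy: building it is cheap,
    the (pretend) network call happens on first child access."""

    def __init__(self, drive_id: str):
        self.drive_id = drive_id
        self.loaded = False

    def _ensure_loaded(self) -> None:
        if not self.loaded:
            # In the real client, .get().execute_query() would fire here.
            self.loaded = True

    def children(self) -> list[str]:
        self._ensure_loaded()  # first child op triggers the load
        return ["report.pdf"]

drive = LazyDrive("b!abc")  # no query yet
print(drive.loaded)         # False
files = drive.children()    # load happens now
print(drive.loaded)         # True
```

This mirrors why `ctx.drives[drive_id]` alone is safe: the chain resolves when `ls()` later calls `.children.get().execute_query()`.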
📝 Learnings were identified and not saved (knowledge base disabled).
Commit: "DRF's request.data is an immutable QueryDict for multipart / form-encoded POSTs; directly mutating it inside _fill_default_connector_name would raise AttributeError for any non-JSON caller. Copy to a mutable dict once in create() and feed that copy to both the helper and the serializer."
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
chandrasekharan-zipstack
left a comment
Two pieces of feedback — see inline comments. The serializer-level fallback is a nice-to-have refactor; the incomplete-fix concern on the OAuth refresh path is the more important one.
    logger.info(
        "Filled missing connector_name with schema default for %s", connector_id
    )
Could this live on the serializer instead? Make connector_name non-required and inject the schema default in ConnectorInstanceSerializer.validate() using attrs['connector_id']. Benefits:
- Keeps `create()` clean (no `request.data.copy()` + helper).
- Fallback applies to any future endpoint using this serializer (bulk import, internal API, PATCH that clears the field).
Tradeoff: the view-level helper is more discoverable as a temporary defense-in-depth workaround. If this is meant to be removed once the frontend race is fixed, keep it here with a TODO; if it'll stick around, push it down to the serializer.
Moved. connector_name is now required=False via Meta.extra_kwargs, and the schema-default backfill lives in ConnectorInstanceSerializer.validate() using attrs['connector_id']. Removed the view-level helper, the data = request.data.copy() override in create(), and the associated imports (ConnectorProcessor, ConnectorKeys, InvalidConnectorID). Applies to PATCH/bulk callers as well, not just the POST path.
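A framework-free sketch of that `validate()`-level backfill follows. The helper name, schema shape, and error type mirror the discussion; the DRF plumbing (field declarations, `serializers.ValidationError`) is elided, and the schema contents are invented for illustration:

```python
# Sketch: if connector_name is absent, pull the default from the
# connector's JSON schema; raise if neither source yields a name.
# SCHEMAS stands in for ConnectorProcessor.get_json_schema.
SCHEMAS = {
    "sharepoint|123": {
        "properties": {"connectorName": {"default": "SharePoint File Server"}}
    }
}


def get_schema_default_connector_name(connector_id):
    schema = SCHEMAS.get(connector_id)
    if schema is None:
        return None
    return schema.get("properties", {}).get("connectorName", {}).get("default")


def validate(attrs):
    if not attrs.get("connector_name"):
        default_name = get_schema_default_connector_name(attrs["connector_id"])
        if not default_name:
            # In DRF this would be serializers.ValidationError -> HTTP 400.
            raise ValueError({"connector_name": "This field is required."})
        attrs["connector_name"] = default_name
    return attrs


attrs = validate({"connector_id": "sharepoint|123"})
assert attrs["connector_name"] == "SharePoint File Server"
```

Because the fallback lives in `validate()`, any caller of the serializer (POST, PATCH, bulk import) gets the same behavior, which is the benefit argued for above.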
    # Discard return value: ConnectorAuth.extra_data is shared across
    # every connector with the same (provider, uid) and would overwrite
    # this connector's form fields (site_url, drive_id).
    connector_oauth.get_and_refresh_tokens()
This fix only covers the create path. The same extra_data overwrite still happens in two other places this PR doesn't touch:
- `connector_auth_v2/models.py:97-100` — inside `get_and_refresh_tokens`, on every refresh it loops `self.connectorinstance_set.all()` and does `connector_instance.connector_metadata = self.extra_data; save()`. This overwrites every sibling connector's stored metadata (`site_url`, `drive_id`) with the shared `extra_data`. Same root cause as this bug, different trigger.
- `connector_v2/fields.py:31-41` — `ConnectorAuthJSONField.from_db_value` returns the refreshed `extra_data` directly when `refresh_after` has passed, dropping form fields on read.
So a SharePoint connector created cleanly with this PR will still lose site_url the first time a token refresh fires during normal usage (file listing, etc.), as long as another connector shares the same (provider, uid).
To actually hold, the same merge pattern ({**connector_instance.connector_metadata, **self.extra_data}) needs to be applied in those two places too. Otherwise existing connectors remain vulnerable, and even newly-created ones will regress on first refresh.
Done — applied the {**connector_instance.connector_metadata, **self.extra_data} pattern at both sites:
- `connector_auth_v2/models.py` — sibling loop in `get_and_refresh_tokens`
- `connector_v2/fields.py` — `from_db_value` / `_refresh_tokens`
Per-connector form fields (site_url, drive_id, tenant_id) now survive on both the save path (sibling-loop on token refresh) and the read path (from_db_value when refresh_after has elapsed).
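The merge pattern applied at both sites is plain dict-spread precedence: later keys win, so shared token data refreshes while per-connector form fields survive. Field names below come from the PR; the values are illustrative:

```python
# Per-connector metadata entered in the connector form.
connector_metadata = {
    "site_url": "https://tenant.sharepoint.com/sites/docs",
    "drive_id": "b!abc123",
    "access_token": "old-token",
}

# Shared ConnectorAuth.extra_data after a token refresh: token fields
# only, with no per-connector form fields.
extra_data = {"access_token": "new-token", "refresh_token": "r-1"}

# Plain assignment (the old behavior) would have dropped site_url and
# drive_id, since extra_data never contained them:
assert "site_url" not in extra_data

# The merge keeps form fields and takes the refreshed token values.
merged = {**connector_metadata, **extra_data}
assert merged["site_url"] == "https://tenant.sharepoint.com/sites/docs"
assert merged["drive_id"] == "b!abc123"
assert merged["access_token"] == "new-token"
assert merged["refresh_token"] == "r-1"
```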
    if oauth_tokens is None:
        raise MissingParamException(param=ConnectorAuthKey.OAUTH_KEY)
    # Preserve non-secret form fields (e.g. site_url for the SharePoint connector)
    form_metadata = self.request.data.get(CIKey.CONNECTOR_METADATA) or {}
Minor: self.request.data.get(CIKey.CONNECTOR_METADATA) or {} — if the form sends metadata as a JSON string (depending on content-type / multipart shape), **form_metadata will TypeError. Worth confirming the request shape, or coercing with isinstance(form_metadata, dict).
Addressed together with Greptile G3 — json.loads recovers the dict when metadata arrives as a string (multipart/x-www-form-urlencoded), with an isinstance fallback for any other non-dict case.
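The coercion described in that reply can be sketched as follows. The helper name is illustrative; the behavior (accept a dict, parse a JSON string, fall back to an empty dict otherwise) follows the comment:

```python
import json


def coerce_metadata(raw):
    """Return a dict from request metadata that may arrive as a dict,
    a JSON string (multipart / form-encoded bodies), or anything else."""
    if isinstance(raw, dict):
        return raw
    if isinstance(raw, str):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            return {}
        # A JSON string can decode to a list/number; only accept dicts.
        return parsed if isinstance(parsed, dict) else {}
    return {}  # None or any other non-dict shape


assert coerce_metadata({"site_url": "https://x"}) == {"site_url": "https://x"}
assert coerce_metadata('{"site_url": "https://x"}') == {"site_url": "https://x"}
assert coerce_metadata("not json") == {}
assert coerce_metadata(None) == {}
```

Guaranteeing a dict here is what makes the later `{**form_metadata, **oauth_tokens}` spread safe against a `TypeError`.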
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/connector_auth_v2/models.py`:
- Around line 99-112: The merge in the refresh path is allowing non-token
enrichment keys from self.extra_data to leak into other ConnectorAuth siblings;
update the merge that sets connector_instance.connector_metadata to first filter
self.extra_data through the same whitelist used by serializers.py (the
_OAUTH_TOKEN_KEYS used to allow only token-related keys) before combining with
existing_metadata, or move _OAUTH_TOKEN_KEYS to a shared constant and reuse it
here; reference the connector_instance.connector_metadata assignment and
self.extra_data (and serializers._OAUTH_TOKEN_KEYS or
GoogleAuthHelper.enrich_connector_metadata as the provenance of enrichment) and
ensure only whitelisted token keys are propagated to sibling connector metadata.
In `@backend/connector_v2/fields.py`:
- Around line 31-45: The _refresh_tokens method currently merges
connector_auth.get_and_refresh_tokens() into existing_metadata unfiltered, which
can reintroduce enrichment keys from ConnectorAuth.extra_data; update
_refresh_tokens to whitelist only OAuth token-related keys (use the same
_OAUTH_TOKEN_KEYS constant used in serializers.py) by filtering
refreshed_metadata to contain only those keys before merging, then return
{**existing_metadata, **filtered_refreshed_metadata} so non-token enrichment
fields are not injected on read.
In `@backend/connector_v2/serializers.py`:
- Around line 109-124: connector_oauth.get_and_refresh_tokens() currently
iterates self.connectorinstance_set.all() and writes an unfiltered merge of
self.extra_data into each sibling connector, which lets non-token keys leak into
sibling connector_metadata; update that method so when it persists token data
back to sibling ConnectorInstance rows it only writes the whitelisted keys (use
the same _OAUTH_TOKEN_KEYS whitelist) and/or build a token_updates dict (like
the save() code does) and merge only those into each sibling's
CIKey.CONNECTOR_METADATA, explicitly excluding other extra_data keys before
saving.
- Around line 54-86: The validate method currently backfills connector_name from
the JSON schema but can return attrs without connector_name (causing a DB
IntegrityError later); update validate (in the serializer containing validate
and _get_schema_default_connector_name) so that after attempting to backfill via
_get_schema_default_connector_name (which calls
ConnectorProcessor.get_json_schema) it raises a serializers.ValidationError
referencing CIKey.CONNECTOR_NAME (or similar field name) when no connector_name
is present, preserving the API contract and producing a 400 instead of letting
ConnectorInstance.connector_name trigger a NOT NULL DB error.
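The whitelist approach recommended for the `models.py` and `fields.py` prompts above amounts to filtering the refreshed data through a token-key set before merging. The `_OAUTH_TOKEN_KEYS` contents below are a guess at typical token fields, not the repo's actual constant:

```python
# Hypothetical whitelist of token-related keys; per the review prompts,
# the real repo keeps this in a shared constant used by serializers.py.
_OAUTH_TOKEN_KEYS = {"access_token", "refresh_token", "token_expiry"}


def merge_refreshed_tokens(existing_metadata, refreshed_metadata):
    """Merge only whitelisted token keys so enrichment fields in the
    shared extra_data cannot leak into per-connector metadata."""
    token_updates = {
        key: value
        for key, value in refreshed_metadata.items()
        if key in _OAUTH_TOKEN_KEYS
    }
    return {**existing_metadata, **token_updates}


existing = {
    "site_url": "https://tenant.sharepoint.com/sites/docs",
    "access_token": "old",
}
# Refreshed extra_data carrying an enrichment key alongside the token.
refreshed = {"access_token": "new", "user_email": "a@b.com"}
merged = merge_refreshed_tokens(existing, refreshed)
assert merged == {
    "site_url": "https://tenant.sharepoint.com/sites/docs",
    "access_token": "new",
}
```

The same helper can serve both the save path (the sibling loop) and the read path (`from_db_value`), which is the deduplication the prompts ask for.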
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ae16b27d-bd5a-48a2-b609-34d30ea24dbe
📒 Files selected for processing (4)
- `backend/connector_auth_v2/models.py`
- `backend/connector_v2/fields.py`
- `backend/connector_v2/serializers.py`
- `backend/connector_v2/views.py`
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/connector_v2/views.py
🧹 Nitpick comments (1)
backend/connector_v2/serializers.py (1)
57-60: Run the schema default through the same connector-name validator. The backfilled value is assigned after field-level validation, so it currently bypasses `validate_connector_name()`. Reuse that validator before storing the default to keep fallback behavior identical to normal submissions.

♻️ Proposed consistency fix

     if not default_name:
         raise ValidationError({CIKey.CONNECTOR_NAME: "This field is required."})
    -attrs[CIKey.CONNECTOR_NAME] = default_name
    +attrs[CIKey.CONNECTOR_NAME] = self.validate_connector_name(default_name)
     logger.info(

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/connector_v2/serializers.py` around lines 57 - 60, The default connector name assigned to attrs[CIKey.CONNECTOR_NAME] bypasses the field-level validator; call the existing validate_connector_name function on default_name (and use/assign its validated/normalized return or let it raise ValidationError) before setting attrs[CIKey.CONNECTOR_NAME] so fallback behavior matches normal submissions; update the block around validate_connector_name, default_name and attrs[CIKey.CONNECTOR_NAME] in serializers.py accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/connector_v2/serializers.py`:
- Around line 57-60: The default connector name assigned to
attrs[CIKey.CONNECTOR_NAME] bypasses the field-level validator; call the
existing validate_connector_name function on default_name (and use/assign its
validated/normalized return or let it raise ValidationError) before setting
attrs[CIKey.CONNECTOR_NAME] so fallback behavior matches normal submissions;
update the block around validate_connector_name, default_name and
attrs[CIKey.CONNECTOR_NAME] in serializers.py accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 81ef659e-1926-44b0-9c40-9d01ba5e9715
📒 Files selected for processing (4)
- `backend/connector_auth_v2/constants.py`
- `backend/connector_auth_v2/models.py`
- `backend/connector_v2/fields.py`
- `backend/connector_v2/serializers.py`
✅ Files skipped from review due to trivial changes (1)
- backend/connector_auth_v2/constants.py
🚧 Files skipped from review as they are similar to previous changes (2)
- backend/connector_auth_v2/models.py
- backend/connector_v2/fields.py
Test Results

Summary
- Runner Tests - Full Report
- SDK1 Tests - Full Report
🧹 Nitpick comments (1)
backend/connector_v2/serializers.py (1)
70-81: Consider broadening the exception catch in `_get_schema_default_connector_name`.

`ConnectorProcessor.get_json_schema` re-raises any exception from `json.loads(...)` on a malformed stored schema (see `backend/connector_processor/connector_processor.py:61-85`). Only `InvalidConnectorID` is caught here, so a corrupt `JSON_SCHEMA` row for a valid `connector_id` will surface as a 500 on every create — strictly worse than the pre-fix 400, and an operator typing a legitimate connector ID wouldn't get a useful response.

Two reasonable paths:

- Log and return `None` on any non-`InvalidConnectorID` exception, letting the downstream `ValidationError` produce a 400 (and keeping the create flow usable for other connectors if one is misconfigured).
- Leave as-is if you explicitly want malformed schemas to page loudly — in that case a `logger.exception` above the re-raise in `get_json_schema` is probably sufficient and this can be skipped.

♻️ Option: swallow and fall through to 400

     try:
         schema_details = ConnectorProcessor.get_json_schema(connector_id=connector_id)
     except InvalidConnectorID:
         return None
    +except Exception:
    +    logger.exception(
    +        "Failed to load JSON schema for connector_id=%s while "
    +        "backfilling connector_name",
    +        connector_id,
    +    )
    +    return None

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/connector_v2/serializers.py` around lines 70 - 81, The _get_schema_default_connector_name currently only catches InvalidConnectorID from ConnectorProcessor.get_json_schema; broaden the exception handling to also catch any other exceptions raised by get_json_schema (e.g., JSONDecodeError or generic Exception), log the exception for operators, and return None so malformed stored schemas don't raise a 500 during create; keep the existing InvalidConnectorID behavior but ensure any non-InvalidConnectorID errors are handled by logging (using the module logger) and falling through returning None from _get_schema_default_connector_name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@backend/connector_v2/serializers.py`:
- Around line 70-81: The _get_schema_default_connector_name currently only
catches InvalidConnectorID from ConnectorProcessor.get_json_schema; broaden the
exception handling to also catch any other exceptions raised by get_json_schema
(e.g., JSONDecodeError or generic Exception), log the exception for operators,
and return None so malformed stored schemas don't raise a 500 during create;
keep the existing InvalidConnectorID behavior but ensure any
non-InvalidConnectorID errors are handled by logging (using the module logger)
and falling through returning None from _get_schema_default_connector_name.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 32939327-9660-48aa-baf1-3b8f12ef2b20
📒 Files selected for processing (1)
backend/connector_v2/serializers.py
What
- `backend/connector_v2/views.py` — preserve form-entered fields (`site_url`, `drive_id`) when OAuth metadata is fetched from cache; backfill missing `connector_name` from the connector's JSON schema default before serializer validation.
- `backend/connector_v2/serializers.py` — stop overwriting per-connector form fields with `ConnectorAuth.extra_data` on OAuth refresh.
- `unstract/connectors/src/unstract/connectors/filesystems/sharepoint/sharepoint.py` — fix drive resolution by `drive_id` (bracket indexing on `EntityCollection`) and by full site URL (`ctx.sites.get_by_url(...)`).
- `unstract/connectors/src/unstract/connectors/filesystems/sharepoint/static/json_schema.json` — mark `user_email` as `format: "password"`; refine its description.

Why
- Token refresh was overwriting `site_url` and `drive_id`, breaking the SharePoint/OneDrive connection.
- Because connectors can share a `(provider, uid)` OAuth identity, `ConnectorAuth.extra_data` was cross-contaminating per-connector metadata.
- The frontend intermittently omits `connector_name` from the POST body, causing `connector_name: This field is required.` 400s on staging for OAuth connectors (SharePoint, Google Drive). Local dev does not reproduce. Backend fallback unblocks staging without waiting for the frontend race to be isolated.
- Drive resolution was broken both for `drive_id` and for site-URL addressing against the Graph API.

How
- `_get_connector_metadata` now merges form fields with OAuth tokens (`{**form_metadata, **oauth_tokens}`) instead of replacing.
- `create()` calls a new `_fill_default_connector_name` before `serializer.is_valid(...)`; the helper reads the connector's JSON schema via `ConnectorProcessor.get_json_schema` and injects `properties.connectorName.default` when the field is missing. Logs at INFO when it fires.
- `ConnectorInstanceSerializer.save` no longer reassigns `get_and_refresh_tokens()` output back into `connector_metadata` — tokens still refresh, but the shared `extra_data` is not persisted per connector.
- `_get_drive` uses `ctx.drives[self.drive_id]`; `_get_sharepoint_site_drive` uses `ctx.sites.get_by_url(self.site_url).drive`.

Can this PR break any existing features. If yes, please list possible items. If no, please explain why. (PS: Admins do not merge the PR without this section filled)
No: the schema-default backfill only fires when `connector_name` is already empty. Non-SharePoint connectors and non-OAuth flows are untouched. The serializer change preserves token-refresh behavior while stopping cross-contamination between connectors that share the same `(provider, uid)`.

Database Migrations
Env Config
Relevant Docs
Related Issues or PRs
Dependencies Versions
Notes on Testing
- Site URL: create a SharePoint connector with a `https://tenant.sharepoint.com/sites/<name>` URL; verify the drive loads and files list correctly.
- `drive_id`: create a connector with a known drive ID; verify listing works.
- `connector_name` fallback: submit a connector create with `connector_name` omitted; verify a 201 response and the server log line `Filled missing connector_name with schema default for <id>`.

Screenshots
Checklist
I have read and understood the Contribution Guidelines.