feat(eval): add ClassifierEvaluator (pure-metadata aggregator) #1674
Closed
7 commits
749da30 feat(eval): add ClassifierEvaluator (pure-metadata aggregator) (ajay-kesavan)
4067b5a test(eval): add classifier_demo fixture for end-to-end SDK validation (ajay-kesavan)
b2c32bd fix(eval): give ClassifierEvaluator a concrete EvaluationCriteria type (ajay-kesavan)
e92e734 feat(eval): collapse standalone Classifier into ExactMatch.aggregators (ajay-kesavan)
e37707c feat(eval): carry ExactMatch aggregators in per-datapoint justification (ajay-kesavan)
6702603 feat(eval): low-code LegacyExactMatchEvaluator carries aggregators (ajay-kesavan)
8b3f72f fix(eval): address review feedback on classifier aggregators (ajay-kesavan)
@@ -0,0 +1,136 @@
# Classifier aggregator end-to-end demo

A minimal intent-classification agent that exercises the new
classification **aggregator** end-to-end. Use this as the test fixture for
both SDK-only validation (Path A below) and Studio Web full-stack validation
(Path B).

## What's here

```
classifier_demo/
├── main.py                     # 3-class keyword classifier
├── uipath.json
├── pyproject.toml
├── bindings.json
└── evaluations/
    ├── eval-sets/
    │   └── main.json           # 9 datapoints, 3 per class, one per class deliberately misclassified by the agent
    └── evaluators/
        └── intent_match.json   # ExactMatch on agent_output.intent + classification aggregator
```

There is **one** evaluator. `intent_match` is an `ExactMatchEvaluator` whose
`evaluatorConfig` carries an `aggregators: [{ name: "classification", classes: [...] }]`
entry. Per datapoint, the evaluator emits a 1.0/0.0 score and an
`ExactMatchJustification` whose `aggregators` field round-trips the config
through to the downstream consumer (the C# layer in Studio Web), which builds
a confusion matrix and precision / recall / F-score across the dataset.

## Path A: SDK only (real run, ~30 seconds)

```bash
cd packages/uipath
uv sync --all-extras

cd samples/classifier_demo
uv run --project ../.. uipath eval main main.json --no-report --output-file /tmp/out.json
```

Expected: a results table with a single `intent_match` column averaging 0.667
(6/9 correct).

To see the metadata payload that lands in the backend's
`CodedEvaluatorScore.Justification`:

```bash
python3 -c "
import json
with open('/tmp/out.json') as f: d = json.load(f)
for r in d['evaluationSetResults'][0]['evaluationRunResults']:
    print(r['evaluatorName'], r['result'].get('details'))
"
```

You should see entries like:

```
intent_match {'expected': 'book', 'actual': 'book', 'aggregators': [{'name': 'classification', 'classes': ['book', 'cancel', 'reschedule']}]}
```

The `aggregators` list is identical on every datapoint by design: it is the
mechanism by which the per-datapoint records carry the class set to the C#
post-pass without requiring a separate evaluator-snapshot lookup.

## Path B: Full Studio Web stack (real UI, click Run, see panel)

The steps below assume a local KinD cluster set up per
`Agents/LOCAL_DEVELOPMENT.md`.

### Prereqs
- Docker installed and running
- `make` available
- Azure CLI authenticated session (`az login`)
- Azure DevOps PAT exported as `AZURE_DEVOPS_PAT`
- GitHub NPM registry token exported as `GH_NPM_REGISTRY_TOKEN`
- Azure access token exported as `AZURE_ACCESS_TOKEN` (for the python worker build)
- `cloud-provider-kind` binary (used for the local KinD cluster)

### Steps

1. **Point python-eval-worker at the local SDK branch.** The published
   `uipath` package on PyPI doesn't yet have the classification aggregator.
   Edit `Agents/python-eval-worker/pyproject.toml`:

   ```toml
   [tool.uv.sources]
   uipath = { path = "../../uipath-python/packages/uipath", editable = true }
   ```

   Then `cd python-eval-worker && uv lock && uv sync`.

2. **Bring up the local KinD cluster** (from `Agents/`):
   ```bash
   make create-kind-cluster
   kubectl get nodes
   sudo ./bin/cloud-provider-kind &   # in a separate shell or background
   make up
   make deploy
   ```

3. **Build the backend with the classifier changes:**
   ```bash
   git checkout feat/eval-classifier-backend   # in Agents repo
   # Re-trigger the helm/skaffold deploy for the backend
   make deploy
   ```

4. **Build the frontend with the UI changes:**
   ```bash
   git checkout feat/eval-dataset-evaluators-ui   # in Agents repo
   # The same deploy command rebuilds the frontend image
   make deploy
   ```

5. **Open Studio Web** (URL surfaced by the deploy output), create an agent
   project, upload the eval-set + evaluator JSONs from this directory (or
   author them in the UI: the evaluator picker exposes an
   "Aggregators" section on ExactMatch where the classification aggregator
   can be attached with its class list), and click Run.

6. **Verify** the Aggregations panel renders between the run header and the
   datapoint table, with the confusion matrix matching what Path A's Python
   payload encodes (macro F1 ≈ 0.667 on this fixture).

### Open questions for the team owning local dev

- Does the existing PAT / token set get refreshed automatically by the dev tooling, or do contributors need to rotate them periodically?
- Is there a simpler "local-only" path that bypasses the KinD cluster (e.g. docker-compose) for changes that don't touch K8s manifests?
- What's the standard pattern for pointing the python worker at a non-PyPI uipath build? The `[tool.uv.sources]` override above is the standard uv approach; please confirm there's no Helm/skaffold complication.

## Companion PRs

| Repo | Branch | PR | What |
|---|---|---|---|
| uipath-python | `feat/eval-classifier-evaluator` | [#1674](https://github.com/UiPath/uipath-python/pull/1674) | SDK `ExactMatch.aggregators` + `LegacyExactMatch.aggregators` |
| Agents | `feat/eval-classifier-backend` | [#5313](https://github.com/UiPath/Agents/pull/5313) | C# math + activity + envelope storage |
| Agents | `feat/eval-dataset-evaluators-ui` | [#5306](https://github.com/UiPath/Agents/pull/5306) | Frontend picker + Aggregations panel |
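
The two 0.667 figures quoted in the README (6/9 accuracy in Path A, macro F1 in Path B step 6) can be checked by hand. The sketch below is an illustration only: the real aggregation happens in the C# layer, which is not part of this PR, and the names `PAIRS`, `confusion_matrix` and `per_class_metrics` are hypothetical. The (expected, actual) pairs were derived by tracing the keyword rules in `main.py` over the nine utterances in `main.json`.

```python
from collections import Counter

CLASSES = ["book", "cancel", "reschedule"]

# (expected, actual) for the nine fixture datapoints, in file order.
PAIRS = [
    ("book", "book"),
    ("book", "book"),
    ("book", "cancel"),            # book-3: the "cancel" keyword wins
    ("cancel", "cancel"),
    ("cancel", "cancel"),
    ("cancel", "reschedule"),      # cancel-3: the "move" keyword wins
    ("reschedule", "reschedule"),
    ("reschedule", "reschedule"),
    ("reschedule", "book"),        # reschedule-3: falls through to the default
]


def confusion_matrix(pairs, classes):
    """Rows are expected classes, columns are predicted classes."""
    counts = Counter(pairs)
    return [[counts[(exp, act)] for act in classes] for exp in classes]


def per_class_metrics(matrix, classes):
    """Return {class: (precision, recall, f1)} from a square confusion matrix."""
    metrics = {}
    for i, name in enumerate(classes):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = (precision, recall, f1)
    return metrics


matrix = confusion_matrix(PAIRS, CLASSES)
metrics = per_class_metrics(matrix, CLASSES)
accuracy = sum(matrix[i][i] for i in range(len(CLASSES))) / len(PAIRS)
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(CLASSES)

print(matrix)                                  # [[2, 1, 0], [0, 2, 1], [1, 0, 2]]
print(round(accuracy, 3), round(macro_f1, 3))  # 0.667 0.667
```

Every class has two true positives, one false positive and one false negative, so precision, recall and F1 are all 2/3 per class. That symmetry is why accuracy and macro F1 coincide on this fixture; they would generally differ on an unbalanced dataset.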
@@ -0,0 +1,4 @@
{
  "version": "2.0",
  "resources": []
}
163 changes: 163 additions & 0 deletions
packages/uipath/samples/classifier_demo/evaluations/eval-sets/main.json
@@ -0,0 +1,163 @@
{
  "version": "1.0",
  "id": "classifier-demo-eval-set",
  "name": "Classifier demo eval set",
  "evaluatorRefs": [
    "intent_match"
  ],
  "evaluations": [
    {
      "id": "book-1",
      "name": "book — straightforward",
      "inputs": {
        "utterance": "I want to book a table for two"
      },
      "expectedOutput": {
        "intent": "book"
      },
      "evaluationCriterias": {
        "intent_match": {
          "expectedOutput": {
            "intent": "book"
          }
        }
      }
    },
    {
      "id": "book-2",
      "name": "book — schedule keyword",
      "inputs": {
        "utterance": "Please schedule an appointment"
      },
      "expectedOutput": {
        "intent": "book"
      },
      "evaluationCriterias": {
        "intent_match": {
          "expectedOutput": {
            "intent": "book"
          }
        }
      }
    },
    {
      "id": "book-3",
      "name": "book — agent misclassifies (utterance triggers cancel keyword)",
      "inputs": {
        "utterance": "I had to cancel my last attempt but I want to reserve a slot now"
      },
      "expectedOutput": {
        "intent": "book"
      },
      "evaluationCriterias": {
        "intent_match": {
          "expectedOutput": {
            "intent": "book"
          }
        }
      }
    },
    {
      "id": "cancel-1",
      "name": "cancel — straightforward",
      "inputs": {
        "utterance": "Please cancel my reservation"
      },
      "expectedOutput": {
        "intent": "cancel"
      },
      "evaluationCriterias": {
        "intent_match": {
          "expectedOutput": {
            "intent": "cancel"
          }
        }
      }
    },
    {
      "id": "cancel-2",
      "name": "cancel — void synonym",
      "inputs": {
        "utterance": "I want to void the order"
      },
      "expectedOutput": {
        "intent": "cancel"
      },
      "evaluationCriterias": {
        "intent_match": {
          "expectedOutput": {
            "intent": "cancel"
          }
        }
      }
    },
    {
      "id": "cancel-3",
      "name": "cancel — agent misclassifies (utterance has 'move' which triggers reschedule)",
      "inputs": {
        "utterance": "I need to move past this and cancel everything"
      },
      "expectedOutput": {
        "intent": "cancel"
      },
      "evaluationCriterias": {
        "intent_match": {
          "expectedOutput": {
            "intent": "cancel"
          }
        }
      }
    },
    {
      "id": "reschedule-1",
      "name": "reschedule — straightforward",
      "inputs": {
        "utterance": "I want to reschedule the meeting"
      },
      "expectedOutput": {
        "intent": "reschedule"
      },
      "evaluationCriterias": {
        "intent_match": {
          "expectedOutput": {
            "intent": "reschedule"
          }
        }
      }
    },
    {
      "id": "reschedule-2",
      "name": "reschedule — move synonym",
      "inputs": {
        "utterance": "Can we move the slot to tomorrow"
      },
      "expectedOutput": {
        "intent": "reschedule"
      },
      "evaluationCriterias": {
        "intent_match": {
          "expectedOutput": {
            "intent": "reschedule"
          }
        }
      }
    },
    {
      "id": "reschedule-3",
      "name": "reschedule — agent misclassifies (falls through to default 'book')",
      "inputs": {
        "utterance": "Different timing please"
      },
      "expectedOutput": {
        "intent": "reschedule"
      },
      "evaluationCriterias": {
        "intent_match": {
          "expectedOutput": {
            "intent": "reschedule"
          }
        }
      }
    }
  ]
}
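
A fixture like this is easy to break by hand-editing, for example by referencing an evaluator under `evaluationCriterias` that is not listed in `evaluatorRefs`, or by unbalancing the classes. The following is a small sanity-check sketch, not part of the SDK; `summarize_eval_set` and `SAMPLE` are hypothetical names, and `SAMPLE` is a two-datapoint stand-in with the same shape as `main.json`.

```python
from collections import Counter


def summarize_eval_set(eval_set: dict) -> Counter:
    """Count datapoints per expected intent and validate evaluator references."""
    refs = set(eval_set["evaluatorRefs"])
    per_class = Counter()
    for evaluation in eval_set["evaluations"]:
        # Every evaluator named in the criteria must be declared up front.
        unknown = set(evaluation["evaluationCriterias"]) - refs
        if unknown:
            raise ValueError(
                f"{evaluation['id']}: unknown evaluators {sorted(unknown)}"
            )
        per_class[evaluation["expectedOutput"]["intent"]] += 1
    return per_class


# Trimmed stand-in for main.json (same keys, fewer datapoints).
SAMPLE = {
    "evaluatorRefs": ["intent_match"],
    "evaluations": [
        {
            "id": "book-1",
            "expectedOutput": {"intent": "book"},
            "evaluationCriterias": {
                "intent_match": {"expectedOutput": {"intent": "book"}}
            },
        },
        {
            "id": "cancel-1",
            "expectedOutput": {"intent": "cancel"},
            "evaluationCriterias": {
                "intent_match": {"expectedOutput": {"intent": "cancel"}}
            },
        },
    ],
}

summary = summarize_eval_set(SAMPLE)
print(dict(summary))
```

Run against the full file (loaded with `json.load`), the same function should report three datapoints for each of `book`, `cancel` and `reschedule`.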
21 changes: 21 additions & 0 deletions
packages/uipath/samples/classifier_demo/evaluations/evaluators/intent_match.json
@@ -0,0 +1,21 @@
{
  "version": "1.0",
  "id": "intent_match",
  "description": "Per-datapoint ExactMatch on the agent's `intent` output. The attached classification aggregator carries the class list to the downstream backend, which builds a confusion matrix and precision/recall/F-score across the dataset.",
  "evaluatorTypeId": "uipath-exact-match",
  "evaluatorConfig": {
    "name": "intent_match",
    "targetOutputKey": "intent",
    "caseSensitive": false,
    "negated": false,
    "defaultEvaluationCriteria": {
      "expectedOutput": "book"
    },
    "aggregators": [
      {
        "name": "classification",
        "classes": ["book", "cancel", "reschedule"]
      }
    ]
  }
}
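
To make the config fields concrete, here is a rough sketch of what a per-datapoint exact match with this configuration might do. It is not the SDK's `ExactMatchEvaluator`: the function `exact_match` is hypothetical, the handling of `negated` (flipping the verdict) is an assumption, and the `details` shape simply mirrors the sample output shown in the README.

```python
from typing import Any


def exact_match(
    agent_output: dict[str, Any],
    expected_output: dict[str, Any],
    config: dict[str, Any],
) -> dict[str, Any]:
    """Hypothetical per-datapoint exact match; not the SDK implementation."""
    key = config["targetOutputKey"]
    expected = str(expected_output[key])
    actual = str(agent_output.get(key, ""))

    if config.get("caseSensitive", False):
        matched = expected == actual
    else:
        matched = expected.casefold() == actual.casefold()

    # Assumption: "negated" inverts the verdict (a match scores 0.0).
    if config.get("negated", False):
        matched = not matched

    return {
        "score": 1.0 if matched else 0.0,
        "details": {
            "expected": expected,
            "actual": actual,
            # Copied unchanged onto every datapoint so a downstream pass
            # can recover the class list from any single record.
            "aggregators": config.get("aggregators", []),
        },
    }


CONFIG = {
    "targetOutputKey": "intent",
    "caseSensitive": False,
    "negated": False,
    "aggregators": [
        {"name": "classification", "classes": ["book", "cancel", "reschedule"]}
    ],
}

hit = exact_match({"intent": "BOOK"}, {"intent": "book"}, CONFIG)
miss = exact_match({"intent": "cancel"}, {"intent": "book"}, CONFIG)
print(hit["score"], miss["score"])  # 1.0 0.0
```

The point of the design is visible in `details`: the score stays a plain 1.0/0.0, and the aggregator config rides along as metadata, so the dataset-level math can live entirely in the consumer.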
@@ -0,0 +1,42 @@
"""Tiny intent-classification agent for the ClassifierEvaluator demo.

Given an utterance, returns the intent label. Three intents:
  - book       (utterance has the word "book" / "reserve" / "schedule")
  - cancel     (utterance has the word "cancel" / "void")
  - reschedule (utterance has the word "reschedule" / "move")

Keywords are matched as whole whitespace-separated tokens and checked in the
order reschedule, cancel, book; an utterance with no keyword falls back to
"book".

A few datapoints are deliberately misclassified so the run-level
classification metrics (precision/recall/F-score) are non-trivial.
"""

from dataclasses import dataclass


@dataclass
class IntentInput:
    utterance: str


@dataclass
class IntentOutput:
    intent: str


BOOK_KEYWORDS = {"book", "reserve", "schedule"}
CANCEL_KEYWORDS = {"cancel", "void"}
RESCHEDULE_KEYWORDS = {"reschedule", "move"}


async def main(input: IntentInput) -> IntentOutput:
    """Classify the utterance into book / cancel / reschedule."""
    text = input.utterance.lower()
    tokens = set(text.split())

    if tokens & RESCHEDULE_KEYWORDS:
        return IntentOutput(intent="reschedule")
    if tokens & CANCEL_KEYWORDS:
        return IntentOutput(intent="cancel")
    if tokens & BOOK_KEYWORDS:
        return IntentOutput(intent="book")
    # Fall back to "book": deliberately wrong-ish so the matrix is interesting.
    return IntentOutput(intent="book")
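
Two properties of this classifier drive the fixture's misclassifications: keywords are compared against whole whitespace-separated tokens, and the rules are checked in a fixed order. The sketch below restates the same rules synchronously and contrasts them with a hypothetical substring-based variant; `RULES`, `classify_by_token` and `classify_by_substring` are illustrative names, not part of the PR.

```python
# Same keyword sets and precedence as main.py: reschedule, cancel, book.
RULES = [
    ("reschedule", {"reschedule", "move"}),
    ("cancel", {"cancel", "void"}),
    ("book", {"book", "reserve", "schedule"}),
]


def classify_by_token(utterance: str) -> str:
    """Whole-token matching, mirroring main.py's behaviour."""
    tokens = set(utterance.lower().split())
    for intent, keywords in RULES:
        if tokens & keywords:
            return intent
    return "book"  # default, as in main.py


def classify_by_substring(utterance: str) -> str:
    """Hypothetical variant that matches keywords anywhere in the text."""
    text = utterance.lower()
    for intent, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return intent
    return "book"


# Precedence: "move" is checked before "cancel", so cancel-3 is misclassified.
print(classify_by_token("I need to move past this and cancel everything"))

# Substring matching would fire on "void" inside "avoid"; token matching does not.
sentence = "I want to avoid delays and book now"
print(classify_by_token(sentence), classify_by_substring(sentence))

# Caveat of token matching: trailing punctuation hides the keyword,
# so this falls through to the default.
print(classify_by_token("Please cancel."))
```

Token matching avoids accidental hits such as "avoid" or "remove", at the cost of missing keywords glued to punctuation. None of the nine fixture utterances contain punctuation, so the fixture's results do not depend on that caveat.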
@@ -0,0 +1,9 @@
[project]
name = "classifier-demo"
version = "0.0.1"
description = "Tiny intent-classification agent that exercises the new ClassifierEvaluator end-to-end via `uipath eval`."

requires-python = ">=3.11"
dependencies = ["uipath"]

[dependency-groups]
dev = ["uipath-dev"]
@@ -0,0 +1,5 @@
{
  "functions": {
    "main": "main.py:main"
  }
}
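
The `"main": "main.py:main"` entry reads as a `file:function` reference, which is presumably how `uipath eval main ...` finds the agent. How the CLI actually resolves it is not shown in this PR; the sketch below is only one plausible way to turn such a string into a callable using the standard library, and `load_entrypoint` is a hypothetical helper.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_entrypoint(spec: str, base_dir: Path):
    """Resolve a 'file.py:function' string to a callable (illustrative only)."""
    file_name, _, func_name = spec.partition(":")
    if not file_name or not func_name:
        raise ValueError(f"expected 'file.py:function', got {spec!r}")

    path = base_dir / file_name
    module_spec = importlib.util.spec_from_file_location(path.stem, path)
    if module_spec is None or module_spec.loader is None:
        raise ImportError(f"cannot load {path}")

    module = importlib.util.module_from_spec(module_spec)
    # Registering the module before executing it follows the importlib recipe;
    # note that it replaces any already-imported module with the same name.
    sys.modules[module_spec.name] = module
    module_spec.loader.exec_module(module)
    return getattr(module, func_name)


# Usage against a throwaway module, so the sketch runs without the fixture.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "demo_agent.py").write_text(
        "def main(text):\n    return text.upper()\n"
    )
    entry = load_entrypoint("demo_agent.py:main", Path(tmp))
    print(entry("book"))  # BOOK
```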
🔵 L1 — version bump + conflict
2.10.70 → 2.10.72; the comment in the original commit notes `.71` was an unused dev cache-bust. Branch is `CONFLICTING` and this line will collide with #1632's `→ 2.10.68`. Rebase before merge.
Ack — handing the rebase + version-bump conflict back to @ajay-kesavan to resolve locally. Leaving this thread open until the rebase lands so the conversation tracks the final version number.