From 56ce76028ac06141ebe8f374ca093e433355f087 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?=EC=86=90=EC=84=B1=EC=A4=80?=
Date: Thu, 2 Jul 2026 20:12:12 +0900
Subject: [PATCH] Document DeepSeek agent loop 50 query eval

---
 .../diagnostics/agent_loop_20260702_201106.md | 91 +++++++++++++++++++
 .../agent_loop_deepseek_v4_flash_50.jsonl     | 50 ++++++++++
 .../diagnostics/public_scale_20260702.md      | 13 +++
 3 files changed, 154 insertions(+)
 create mode 100644 examples/ablation/diagnostics/agent_loop_20260702_201106.md
 create mode 100644 examples/ablation/diagnostics/agent_loop_deepseek_v4_flash_50.jsonl

diff --git a/examples/ablation/diagnostics/agent_loop_20260702_201106.md b/examples/ablation/diagnostics/agent_loop_20260702_201106.md
new file mode 100644
index 0000000..6898411
--- /dev/null
+++ b/examples/ablation/diagnostics/agent_loop_20260702_201106.md
@@ -0,0 +1,91 @@
+# Agent Loop Retrieval Benchmark — Synaptic
+
+- Run at: 2026-07-02 20:11:06 KST
+- Dataset path: tests/benchmark/data/msmarco_passage_full.json
+- SQLite DB path: tests/benchmark/data/msmarco_full.db
+- Subset: 50
+- Corpus limit: 8841823
+- LLM base URL: https://api.deepseek.com/v1
+- Model: deepseek-v4-flash
+- Max turns: 5
+- Sufficiency gate: yes
+- Force first tool: yes
+- Incremental JSONL: examples/ablation/diagnostics/agent_loop_deepseek_v4_flash_50.jsonl
+- SQLite FTS AND-first threshold: 20
+- SQLite FTS lexical rerank pool: 500
+
+This measures LLM-planned exploration. The agent can change follow-up queries and tool choices based on evidence from earlier turns. The main metric is document reach, not ranked MRR, because the agent loop returns a cumulative evidence set.
+
+## Summary
+
+- Reach: 23/50 (0.460)
+- Mean turns: 4.14
+- Mean tool calls: 5.78
+- Mean first relevant turn: 1.70
+- Mean first relevant tool calls: 2.22
+- Mean elapsed: 50.8s
+- P50/P90 elapsed: 48.7s / 67.0s
+- Mean prompt tokens: 19616
+- Mean completion tokens: 960
+- Mean unique tools: 2.38
+- Mean unique search targets: 5.40
+- Mean query rewrites: 4.20
+- Queries with >1 tool type: 48/50
+- Queries with query rewrites: 48/50
+- Duplicate tool calls: 0
+- Empty tool calls: 11
+
+## Per Query
+
+| QID | Reach | Turns | Calls | Tools | Targets | Rewrites | First Rel Turn | First Rel Calls | Found Relevant | Elapsed | Query |
+|-----|:-----:|------:|------:|------:|--------:|---------:|---------------:|----------------:|----------------|--------:|-------|
+| 300674 | yes | 4 | 4 | 2 | 3 | 3 | 1 | 1 | 7067032 | 36.2s | how many years did william bradford serve as governor of plymouth colony? |
+| 125705 | no | 3 | 3 | 2 | 2 | 2 | - | - | - | 38.5s | define preventive |
+| 94798 | yes | 4 | 5 | 2 | 5 | 4 | 1 | 1 | 7067181 | 46.4s | color overlay photoshop |
+| 9083 | yes | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 7067274 | 28.3s | ____________________ is considered the father of modern medicine. |
+| 174249 | no | 4 | 3 | 2 | 3 | 3 | - | - | - | 39.9s | does xpress bet charge to deposit money in your account |
+| 320792 | no | 5 | 7 | 3 | 7 | 7 | - | - | - | 61.3s | how much is a cost to run disneyland |
+| 1090270 | yes | 4 | 3 | 2 | 1 | 0 | 1 | 1 | 7067796 | 40.0s | botulinum definition |
+| 1101279 | no | 5 | 6 | 2 | 6 | 4 | - | - | - | 53.8s | do physicians pay for insurance from their salaries? |
+| 201376 | yes | 5 | 8 | 3 | 8 | 2 | 1 | 1 | 7068066 | 60.4s | here there be dragons comic |
+| 54544 | no | 5 | 10 | 3 | 10 | 10 | - | - | - | 67.0s | blood diseases that are sexually transmitted |
+| 118457 | no | 3 | 3 | 2 | 2 | 1 | - | - | - | 38.9s | define bona fides |
+| 178627 | yes | 5 | 13 | 3 | 13 | 12 | 3 | 6 | 7068519 | 75.8s | effects of detox juice cleanse |
+| 1101278 | no | 3 | 4 | 2 | 4 | 1 | - | - | - | 41.6s | do prince harry and william have last names |
+| 68095 | yes | 4 | 8 | 2 | 8 | 4 | 1 | 1 | 7069266 | 59.2s | can hives be a sign of pregnancy |
+| 87892 | yes | 5 | 10 | 3 | 8 | 8 | 3 | 5 | 7069601 | 64.6s | causes of petechial hemorrhage |
+| 257309 | no | 4 | 5 | 3 | 5 | 5 | - | - | - | 54.1s | how long does it take to get your bsrn if you already have a bachelors degree |
+| 1090242 | yes | 5 | 14 | 3 | 14 | 13 | 3 | 7 | 7070556 | 70.1s | symptoms of ptsd in vietnam veterans |
+| 211691 | no | 5 | 5 | 3 | 4 | 3 | - | - | - | 45.3s | how coffee works quote |
+| 165002 | yes | 4 | 4 | 2 | 4 | 1 | 1 | 1 | 7070877 | 44.6s | does contraction of the ciliary muscles shorten the lens |
+| 1101276 | yes | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 7070950 | 34.6s | do spiders eat other animals |
+| 264827 | yes | 3 | 2 | 2 | 2 | 2 | 1 | 1 | 7071066 | 32.8s | how long is the flight from chicago to cairo |
+| 342285 | no | 4 | 6 | 2 | 6 | 6 | - | - | - | 48.3s | how titanic facts |
+| 372586 | no | 4 | 6 | 2 | 6 | 5 | - | - | - | 49.3s | how to play blu ray discs |
+| 89786 | yes | 3 | 2 | 2 | 2 | 1 | 1 | 1 | 7071501 | 33.0s | central city definition |
+| 118448 | no | 4 | 4 | 2 | 4 | 1 | - | - | - | 42.5s | define body muscular endurance |
+| 92542 | yes | 4 | 5 | 2 | 5 | 4 | 1 | 1 | 7072003 | 45.6s | circulation money definition |
+| 206117 | yes | 5 | 8 | 3 | 8 | 3 | 1 | 1 | 7072155, 7072160 | 56.9s | hotels in thornton co |
+| 141472 | yes | 3 | 3 | 2 | 2 | 1 | 1 | 1 | 7072290 | 41.1s | derriere definition |
+| 293992 | no | 5 | 9 | 3 | 9 | 9 | - | - | - | 79.4s | how many product lines does coca cola have |
+| 196232 | no | 2 | 1 | 1 | 1 | 0 | - | - | - | 32.2s | government does do |
+| 352818 | no | 5 | 7 | 2 | 6 | 5 | - | - | - | 62.2s | how to cook string beans |
+| 45924 | yes | 5 | 7 | 3 | 7 | 6 | 5 | 7 | 5167800 | 62.5s | average temperatures las vegas by month |
+| 208145 | no | 5 | 9 | 3 | 9 | 9 | - | - | - | 71.0s | how bicycle tire tubes are sized |
+| 79891 | no | 5 | 7 | 2 | 7 | 3 | - | - | - | 60.9s | can you substitute chocolate chips for semi-sweet |
+| 208494 | yes | 4 | 5 | 3 | 5 | 5 | 2 | 2 | 7073272 | 51.1s | how big do newfypoo's get |
+| 319564 | no | 3 | 3 | 2 | 3 | 3 | - | - | - | 37.7s | how much fiber is in carrots |
+| 155234 | no | 3 | 3 | 2 | 3 | 2 | - | - | - | 40.0s | do bigger tires affect gas mileage |
+| 14151 | no | 5 | 13 | 3 | 13 | 12 | - | - | - | 83.9s | age requirements for name change |
+| 67802 | yes | 3 | 4 | 2 | 4 | 3 | 1 | 1 | 7074071 | 45.9s | can green tea cause stomach problems |
+| 1090184 | yes | 5 | 7 | 3 | 7 | 1 | 1 | 1 | 7074235 | 52.4s | synonym of subordinate |
+| 323382 | no | 3 | 6 | 2 | 6 | 1 | - | - | - | 45.1s | how much is the stamp to send a card |
+| 323998 | yes | 5 | 5 | 3 | 5 | 5 | 3 | 4 | 7074377 | 47.8s | how much magnesium in kidney beans |
+| 91711 | no | 5 | 8 | 3 | 3 | 2 | - | - | - | 60.6s | child psychiatrist salary 2016 |
+| 125898 | no | 5 | 6 | 2 | 5 | 5 | - | - | - | 57.9s | define prosthetic device |
+| 289812 | no | 3 | 3 | 2 | 3 | 3 | - | - | - | 32.6s | how many mm is a nickel coin |
+| 333486 | yes | 5 | 4 | 3 | 4 | 4 | 4 | 4 | 7075218 | 56.1s | how old do you have to be to get a job in idaho |
+| 1090171 | yes | 4 | 7 | 2 | 4 | 3 | 1 | 1 | 7075317 | 50.9s | synonyms for the word discipline |
+| 73257 | no | 5 | 6 | 2 | 6 | 6 | - | - | - | 48.7s | can seizure meds cause low sodium? |
+| 1090170 | no | 5 | 6 | 4 | 6 | 6 | - | - | - | 48.7s | synonyms for the word, decline |
+| 237373 | no | 5 | 9 | 3 | 9 | 9 | - | - | - | 64.1s | how is soil created from rocks |
diff --git a/examples/ablation/diagnostics/agent_loop_deepseek_v4_flash_50.jsonl b/examples/ablation/diagnostics/agent_loop_deepseek_v4_flash_50.jsonl
new file mode 100644
index 0000000..6446140
--- /dev/null
+++ b/examples/ablation/diagnostics/agent_loop_deepseek_v4_flash_50.jsonl
@@ -0,0 +1,50 @@
+{"completion_tokens": 701, "duplicate_tool_calls": 0, "elapsed_sec": 36.18947866698727, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 10, "found_relevant_docs": ["7067032"], "prompt_tokens": 16545, "qid": "300674", "query": "how many years did william bradford serve as governor of plymouth colony?", "query_rewrites": 3, "reached": true, "relevant_docs": ["7067032"], "search_targets": ["william bradford governor plymouth colony years served", "william bradford governor plymouth colony years", "william bradford served as plymouth governor for 30 years"], "tool_calls": 4, "tool_sequence": ["search", "get_document", "get_document", "get_document"], "turns": 4, "unique_search_targets": 3, "unique_tools": 2}
+{"completion_tokens": 702, "duplicate_tool_calls": 0, "elapsed_sec": 38.52050423901528, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 5, "found_relevant_docs": [], "prompt_tokens": 10944, "qid": "125705", "query": "define preventive", "query_rewrites": 2, "reached": false, "relevant_docs": ["7067056"], "search_targets": ["preventive definition", "preventive maintenance definition"], "tool_calls": 3, "tool_sequence": ["deep_search", "get_document", "get_document"], "turns": 3, "unique_search_targets": 2, "unique_tools": 2}
+{"completion_tokens": 956, "duplicate_tool_calls": 0, "elapsed_sec": 46.40595618914813, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 
1, "found_ids_count": 5, "found_relevant_docs": ["7067181"], "prompt_tokens": 16293, "qid": "94798", "query": "color overlay photoshop", "query_rewrites": 4, "reached": true, "relevant_docs": ["7067181"], "search_targets": ["color overlay photoshop", "photoshop layer styles color overlay settings", "color overlay photoshop layer styles", "photoshop color overlay", "how to use color overlay in photoshop"], "tool_calls": 5, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "get_document"], "turns": 4, "unique_search_targets": 5, "unique_tools": 2} +{"completion_tokens": 208, "duplicate_tool_calls": 0, "elapsed_sec": 28.265028935857117, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 5, "found_relevant_docs": ["7067274"], "prompt_tokens": 6460, "qid": "9083", "query": "____________________ is considered the father of modern medicine.", "query_rewrites": 1, "reached": true, "relevant_docs": ["7067274"], "search_targets": ["father of modern medicine"], "tool_calls": 1, "tool_sequence": ["deep_search"], "turns": 2, "unique_search_targets": 1, "unique_tools": 1} +{"completion_tokens": 773, "duplicate_tool_calls": 0, "elapsed_sec": 39.93833482963964, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 5, "found_relevant_docs": [], "prompt_tokens": 14916, "qid": "174249", "query": "does xpress bet charge to deposit money in your account", "query_rewrites": 3, "reached": false, "relevant_docs": ["7067348"], "search_targets": ["xpressbet deposit fees charges", "xpressbet deposit convenience fee cashier", "xpressbet no per wager fees charged"], "tool_calls": 3, "tool_sequence": ["deep_search", "get_document", "get_document"], "turns": 4, "unique_search_targets": 3, "unique_tools": 2} +{"completion_tokens": 919, "duplicate_tool_calls": 0, "elapsed_sec": 61.31934346398339, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 
0, "found_ids_count": 45, "found_relevant_docs": [], "prompt_tokens": 29795, "qid": "320792", "query": "how much is a cost to run disneyland", "query_rewrites": 7, "reached": false, "relevant_docs": ["7067677"], "search_targets": ["disneyland operating cost operating expense", "disneyland cost to operate per day annual operating cost", "disneyland operating cost", "disneyland resort annual operating expense cost to run", "\"disneyland\" \"operating\" expense million billion", "disneyland operating cost expense billion million", "disneyland park operating costs annual expense revenue"], "tool_calls": 7, "tool_sequence": ["deep_search", "search", "get_document", "search", "search", "get_document", "search"], "turns": 5, "unique_search_targets": 7, "unique_tools": 3} +{"completion_tokens": 726, "duplicate_tool_calls": 0, "elapsed_sec": 39.983878755010664, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 5, "found_relevant_docs": ["7067796"], "prompt_tokens": 14695, "qid": "1090270", "query": "botulinum definition", "query_rewrites": 0, "reached": true, "relevant_docs": ["7067796"], "search_targets": ["botulinum definition"], "tool_calls": 3, "tool_sequence": ["deep_search", "get_document", "get_document"], "turns": 4, "unique_search_targets": 1, "unique_tools": 2} +{"completion_tokens": 1042, "duplicate_tool_calls": 0, "elapsed_sec": 53.772729511838406, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 40, "found_relevant_docs": [], "prompt_tokens": 29692, "qid": "1101279", "query": "do physicians pay for insurance from their salaries?", "query_rewrites": 4, "reached": false, "relevant_docs": ["7067891"], "search_targets": ["physicians pay for insurance from their salaries medical malpractice", "1027637", "do physicians pay for malpractice insurance out of their salary", "2055614", "physicians pay for their own health insurance disability insurance from salary", "who 
pays for physician malpractice insurance employer or doctor self-employed"], "tool_calls": 6, "tool_sequence": ["search", "get_document", "search", "get_document", "search", "search"], "turns": 5, "unique_search_targets": 6, "unique_tools": 2} +{"completion_tokens": 1084, "duplicate_tool_calls": 0, "elapsed_sec": 60.38634355319664, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 25, "found_relevant_docs": ["7068066"], "prompt_tokens": 28315, "qid": "201376", "query": "here there be dragons comic", "query_rewrites": 2, "reached": true, "relevant_docs": ["7068066"], "search_targets": ["here there be dragons comic", "7068066", "7068073", "7068070", "\"here there be dragons\" comic james owen", "\"the deep: here be dragons\" comic", "7068069", "6132021"], "tool_calls": 8, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "search", "search", "get_document", "get_document"], "turns": 5, "unique_search_targets": 8, "unique_tools": 3} +{"completion_tokens": 1279, "duplicate_tool_calls": 0, "elapsed_sec": 67.03253900120035, "empty_tool_calls": 3, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 30, "found_relevant_docs": [], "prompt_tokens": 29882, "qid": "54544", "query": "blood diseases that are sexually transmitted", "query_rewrites": 10, "reached": false, "relevant_docs": ["7068203"], "search_targets": ["blood diseases sexually transmitted", "sexually transmitted blood diseases list hepatitis hiv syphilis", "bloodborne diseases sexually transmitted infections list", "blood diseases sexually transmitted list", "hiv hepatitis b syphilis sexually transmitted blood", "blood tested syphilis hepatitis west nile virus sexually transmitted", "hepatitis b sexually transmitted blood disease", "hiv aids sexually transmitted blood infection", "three types of hepatitis virus sexually transmitted", "sexually transmitted and blood borne infections stbbi"], "tool_calls": 10, 
"tool_sequence": ["deep_search", "deep_search", "search", "get_document", "get_document", "get_document", "search", "search", "get_document", "get_document"], "turns": 5, "unique_search_targets": 10, "unique_tools": 3} +{"completion_tokens": 563, "duplicate_tool_calls": 0, "elapsed_sec": 38.8562790742144, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 5, "found_relevant_docs": [], "prompt_tokens": 11027, "qid": "118457", "query": "define bona fides", "query_rewrites": 1, "reached": false, "relevant_docs": ["7068493"], "search_targets": ["define bona fides", "bona fides definition"], "tool_calls": 3, "tool_sequence": ["deep_search", "get_document", "get_document"], "turns": 3, "unique_search_targets": 2, "unique_tools": 2} +{"completion_tokens": 1770, "duplicate_tool_calls": 0, "elapsed_sec": 75.8238385450095, "empty_tool_calls": 2, "first_relevant_tool_calls": 6, "first_relevant_turn": 3, "found_ids_count": 25, "found_relevant_docs": ["7068519"], "prompt_tokens": 31934, "qid": "178627", "query": "effects of detox juice cleanse", "query_rewrites": 12, "reached": true, "relevant_docs": ["7068519", "7068520"], "search_targets": ["effects of detox juice cleanse", "effects of detox juice cleanse benefits side effects", "detox juice cleanse effects side effects weight loss", "detox juice cleanse effects", "detox juice cleanse health effects benefits", "juice cleanse side effects weight loss detox", "juice cleanse health benefits effects", "common side effects of detox juices", "positive side effects of detox cleanse weight loss", "detox benefits better health glowing skin better digestion increased energy", "juice cleanse best ways to detox losing weight", "juice cleanse negative side effects headaches fatigue detox", "health enhancements juice cleanse breaking cravings"], "tool_calls": 13, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "search", "search", "get_document", 
"get_document", "get_document", "get_document", "get_document", "search", "get_document"], "turns": 5, "unique_search_targets": 13, "unique_tools": 3} +{"completion_tokens": 927, "duplicate_tool_calls": 0, "elapsed_sec": 41.57614717306569, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 10, "found_relevant_docs": [], "prompt_tokens": 12403, "qid": "1101278", "query": "do prince harry and william have last names", "query_rewrites": 1, "reached": false, "relevant_docs": ["7068907"], "search_targets": ["prince harry william last name british royal family surname", "7068913", "67332", "2919121"], "tool_calls": 4, "tool_sequence": ["search", "get_document", "get_document", "get_document"], "turns": 3, "unique_search_targets": 4, "unique_tools": 2} +{"completion_tokens": 1350, "duplicate_tool_calls": 0, "elapsed_sec": 59.2060615378432, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 5, "found_relevant_docs": ["7069266"], "prompt_tokens": 17972, "qid": "68095", "query": "can hives be a sign of pregnancy", "query_rewrites": 4, "reached": true, "relevant_docs": ["7069266"], "search_targets": ["can hives be a sign of pregnancy", "hives and pregnancy", "hives early pregnancy symptom", "hives during pregnancy", "hives during pregnancy causes", "3114980", "7069269", "6159958"], "tool_calls": 8, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "get_document", "get_document", "get_document", "get_document"], "turns": 4, "unique_search_targets": 8, "unique_tools": 2} +{"completion_tokens": 1155, "duplicate_tool_calls": 0, "elapsed_sec": 64.58518175315112, "empty_tool_calls": 0, "first_relevant_tool_calls": 5, "first_relevant_turn": 3, "found_ids_count": 15, "found_relevant_docs": ["7069601"], "prompt_tokens": 27869, "qid": "87892", "query": "causes of petechial hemorrhage", "query_rewrites": 8, "reached": true, "relevant_docs": ["7069601"], 
"search_targets": ["causes of petechial hemorrhage petechiae", "causes of petechial hemorrhage petechiae causes", "petechial hemorrhage causes medical conditions thrombocytopenia infection trauma", "causes of petechial hemorrhages injury trauma brain", "causes of petechiae medical conditions", "petechial rash causes", "petechial hemorrhage causes factors medical conditions infections trauma", "causes of intracranial hemorrhage petechial"], "tool_calls": 10, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "search", "get_document", "get_document", "get_document", "get_document", "get_document"], "turns": 5, "unique_search_targets": 8, "unique_tools": 3} +{"completion_tokens": 1117, "duplicate_tool_calls": 0, "elapsed_sec": 54.08409190410748, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 25, "found_relevant_docs": [], "prompt_tokens": 19436, "qid": "257309", "query": "how long does it take to get your bsrn if you already have a bachelors degree", "query_rewrites": 5, "reached": false, "relevant_docs": ["4959637"], "search_targets": ["bsrn second degree bachelors how long does it take", "accelerated bsn second degree how long", "bsrn program duration second bachelor degree nursing", "accelerated bsn second degree 11 to 18 months", "accelerated bsn program how long second degree"], "tool_calls": 5, "tool_sequence": ["deep_search", "search", "search", "get_document", "get_document"], "turns": 4, "unique_search_targets": 5, "unique_tools": 3} +{"completion_tokens": 1504, "duplicate_tool_calls": 0, "elapsed_sec": 70.11464598635212, "empty_tool_calls": 3, "first_relevant_tool_calls": 7, "first_relevant_turn": 3, "found_ids_count": 25, "found_relevant_docs": ["7070556"], "prompt_tokens": 31491, "qid": "1090242", "query": "symptoms of ptsd in vietnam veterans", "query_rewrites": 13, "reached": true, "relevant_docs": ["7070556"], "search_targets": ["symptoms of ptsd in vietnam veterans", 
"ptsd symptoms vietnam veterans", "ptsd symptoms vietnam veterans stress reactions", "vietnam veterans ptsd symptoms decades later", "vietnam syndrome symptoms physical psychological", "ptsd symptoms vietnam veterans nightmares flashbacks hypervigilance", "vietnam veterans ptsd clinical symptoms diagnostic", "symptoms of ptsd", "ptsd symptoms severe anxiety flashbacks nightmares uncontrollable thoughts", "ptsd symptoms reliving event bad memories nightmares flashbacks fear horror", "ptsd symptoms insomnia anxiety irritability vietnam veterans", "ptsd symptoms flashbacks nightmares severe anxiety uncontrollable thoughts", "ptsd symptoms must persist for at least a month diagnosis", "vietnam war ptsd rates national vietnam veterans readjustment study"], "tool_calls": 14, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "get_document", "search", "search", "get_document", "get_document", "get_document", "get_document", "get_document", "get_document", "get_document"], "turns": 5, "unique_search_targets": 14, "unique_tools": 3} +{"completion_tokens": 820, "duplicate_tool_calls": 0, "elapsed_sec": 45.28732342366129, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 30, "found_relevant_docs": [], "prompt_tokens": 24872, "qid": "211691", "query": "how coffee works quote", "query_rewrites": 3, "reached": false, "relevant_docs": ["7070643"], "search_targets": ["how coffee works quote", "\"how coffee works\" quote", "\"how coffee works\" famous quote saying", "\"coffee\" \"how it works\" quote"], "tool_calls": 5, "tool_sequence": ["search", "deep_search", "get_document", "search", "search"], "turns": 5, "unique_search_targets": 4, "unique_tools": 3} +{"completion_tokens": 1089, "duplicate_tool_calls": 0, "elapsed_sec": 44.62640297180042, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 5, "found_relevant_docs": ["7070877"], "prompt_tokens": 
15925, "qid": "165002", "query": "does contraction of the ciliary muscles shorten the lens", "query_rewrites": 1, "reached": true, "relevant_docs": ["7070877"], "search_targets": ["contraction of ciliary muscles shortens lens", "7070881", "7070877", "328493"], "tool_calls": 4, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document"], "turns": 4, "unique_search_targets": 4, "unique_tools": 2} +{"completion_tokens": 631, "duplicate_tool_calls": 0, "elapsed_sec": 34.6135217868723, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 5, "found_relevant_docs": ["7070950"], "prompt_tokens": 10514, "qid": "1101276", "query": "do spiders eat other animals", "query_rewrites": 1, "reached": true, "relevant_docs": ["7070950"], "search_targets": ["do spiders eat other animals", "what do spiders eat? predatory on other animals"], "tool_calls": 2, "tool_sequence": ["deep_search", "get_document"], "turns": 3, "unique_search_targets": 2, "unique_tools": 2} +{"completion_tokens": 459, "duplicate_tool_calls": 0, "elapsed_sec": 32.81690010102466, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 5, "found_relevant_docs": ["7071066"], "prompt_tokens": 10445, "qid": "264827", "query": "how long is the flight from chicago to cairo", "query_rewrites": 2, "reached": true, "relevant_docs": ["7071066"], "search_targets": ["flight duration chicago to cairo", "chicago to cairo flight duration"], "tool_calls": 2, "tool_sequence": ["deep_search", "get_document"], "turns": 3, "unique_search_targets": 2, "unique_tools": 2} +{"completion_tokens": 910, "duplicate_tool_calls": 0, "elapsed_sec": 48.25249875290319, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 5, "found_relevant_docs": [], "prompt_tokens": 16749, "qid": "342285", "query": "how titanic facts", "query_rewrites": 6, "reached": false, "relevant_docs": ["7071436"], 
"search_targets": ["titanic facts", "titanic facts construction", "titanic facts the ship", "titanic facts for kids", "little-known facts about the titanic", "titanic built in belfast ireland facts"], "tool_calls": 6, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "get_document", "get_document"], "turns": 4, "unique_search_targets": 6, "unique_tools": 2} +{"completion_tokens": 1127, "duplicate_tool_calls": 0, "elapsed_sec": 49.28605866804719, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 5, "found_relevant_docs": [], "prompt_tokens": 17566, "qid": "372586", "query": "how to play blu ray discs", "query_rewrites": 5, "reached": false, "relevant_docs": ["7071494"], "search_targets": ["how to play blu ray discs", "play blu ray discs", "play blu ray discs requirements", "blu ray drive play discs", "laptop blu ray disc player", "ultra hd blu-ray player play discs"], "tool_calls": 6, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "get_document", "get_document"], "turns": 4, "unique_search_targets": 6, "unique_tools": 2} +{"completion_tokens": 432, "duplicate_tool_calls": 0, "elapsed_sec": 33.02692404994741, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 5, "found_relevant_docs": ["7071501"], "prompt_tokens": 10168, "qid": "89786", "query": "central city definition", "query_rewrites": 1, "reached": true, "relevant_docs": ["7071501"], "search_targets": ["central city definition", "definition of central city"], "tool_calls": 2, "tool_sequence": ["deep_search", "get_document"], "turns": 3, "unique_search_targets": 2, "unique_tools": 2} +{"completion_tokens": 640, "duplicate_tool_calls": 0, "elapsed_sec": 42.49098385684192, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 5, "found_relevant_docs": [], "prompt_tokens": 15278, "qid": "118448", "query": 
"define body muscular endurance", "query_rewrites": 1, "reached": false, "relevant_docs": ["7071642"], "search_targets": ["body muscular endurance definition", "1186203", "2000773", "8804731"], "tool_calls": 4, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document"], "turns": 4, "unique_search_targets": 4, "unique_tools": 2} +{"completion_tokens": 926, "duplicate_tool_calls": 0, "elapsed_sec": 45.55889833718538, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 5, "found_relevant_docs": ["7072003"], "prompt_tokens": 15751, "qid": "92542", "query": "circulation money definition", "query_rewrites": 4, "reached": true, "relevant_docs": ["7072003"], "search_targets": ["circulation money definition", "circulation of money definition", "money in circulation definition", "circulation of money definition cobuild", "money stock definition"], "tool_calls": 5, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "get_document"], "turns": 4, "unique_search_targets": 5, "unique_tools": 2} +{"completion_tokens": 885, "duplicate_tool_calls": 0, "elapsed_sec": 56.86026880936697, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 25, "found_relevant_docs": ["7072155", "7072160"], "prompt_tokens": 28339, "qid": "206117", "query": "hotels in thornton co", "query_rewrites": 3, "reached": true, "relevant_docs": ["7072155", "7072156", "7072160"], "search_targets": ["hotels in thornton colorado", "7072160", "7072155", "7072159", "7072158", "1996531", "thornton co hotels list hampton inn candlewood suites econo lodge", "hotels in thornton colorado list"], "tool_calls": 8, "tool_sequence": ["search", "get_document", "get_document", "get_document", "get_document", "get_document", "search", "deep_search"], "turns": 5, "unique_search_targets": 8, "unique_tools": 3} +{"completion_tokens": 874, "duplicate_tool_calls": 0, "elapsed_sec": 
41.11599511932582, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 5, "found_relevant_docs": ["7072290"], "prompt_tokens": 11083, "qid": "141472", "query": "derriere definition", "query_rewrites": 1, "reached": true, "relevant_docs": ["7072290"], "search_targets": ["derriere definition", "derriere origin and definition"], "tool_calls": 3, "tool_sequence": ["deep_search", "get_document", "get_document"], "turns": 3, "unique_search_targets": 2, "unique_tools": 2} +{"completion_tokens": 1815, "duplicate_tool_calls": 0, "elapsed_sec": 79.42422071425244, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 45, "found_relevant_docs": [], "prompt_tokens": 34603, "qid": "293992", "query": "how many product lines does coca cola have", "query_rewrites": 9, "reached": false, "relevant_docs": ["2790193"], "search_targets": ["coca cola product lines count", "coca-cola company product lines number of brands", "coca-cola product portfolio how many product lines", "coca-cola product mix product lines", "coca-cola product lines count", "coca-cola products product lines", "coca-cola product lines categories carbonated water juice tea coffee sports drinks", "coca-cola product line count how many product lines does coca-cola have", "coca-cola product lines include"], "tool_calls": 9, "tool_sequence": ["deep_search", "search", "search", "get_document", "get_document", "get_document", "search", "search", "get_document"], "turns": 5, "unique_search_targets": 9, "unique_tools": 3} +{"completion_tokens": 487, "duplicate_tool_calls": 0, "elapsed_sec": 32.23342724516988, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 5, "found_relevant_docs": [], "prompt_tokens": 6529, "qid": "196232", "query": "government does do", "query_rewrites": 0, "reached": false, "relevant_docs": ["7072326"], "search_targets": ["government does do"], "tool_calls": 1, 
"tool_sequence": ["deep_search"], "turns": 2, "unique_search_targets": 1, "unique_tools": 1} +{"completion_tokens": 1031, "duplicate_tool_calls": 0, "elapsed_sec": 62.24838368361816, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 10, "found_relevant_docs": [], "prompt_tokens": 24488, "qid": "352818", "query": "how to cook string beans", "query_rewrites": 5, "reached": false, "relevant_docs": ["7072358"], "search_targets": ["how to cook string beans", "how to cook snap beans string beans", "cooking string beans green beans boiling steaming methods recipe", "steaming green beans string beans cooking method", "stir fry cooking green beans string beans methods", "how to cook snap beans string beans green beans boiling steaming stir fry"], "tool_calls": 7, "tool_sequence": ["deep_search", "get_document", "get_document", "deep_search", "get_document", "get_document", "get_document"], "turns": 5, "unique_search_targets": 6, "unique_tools": 2} +{"completion_tokens": 1192, "duplicate_tool_calls": 0, "elapsed_sec": 62.53753410279751, "empty_tool_calls": 0, "first_relevant_tool_calls": 7, "first_relevant_turn": 5, "found_ids_count": 25, "found_relevant_docs": ["5167800"], "prompt_tokens": 26192, "qid": "45924", "query": "average temperatures las vegas by month", "query_rewrites": 6, "reached": true, "relevant_docs": ["5167800", "7072691"], "search_targets": ["average temperatures las vegas by month", "average temperatures for las vegas", "las vegas temperatures averages by month", "average high and low temperatures for each month in las vegas", "las vegas average monthly temperatures high low fahrenheit", "average high and low temperatures listed below by month and annually for las vegas fahrenheit celsius", "\"las vegas\" \"january\" \"average high\" \"average low\" temperature"], "tool_calls": 7, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "search", "get_document", "search"], "turns": 
5, "unique_search_targets": 7, "unique_tools": 3} +{"completion_tokens": 1721, "duplicate_tool_calls": 0, "elapsed_sec": 71.04964815685526, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 35, "found_relevant_docs": [], "prompt_tokens": 28295, "qid": "208145", "query": "how bicycle tire tubes are sized", "query_rewrites": 9, "reached": false, "relevant_docs": ["7072838", "7072843"], "search_targets": ["bicycle tire tube sizing how are they measured", "bicycle tire tube sizing fractions decimals french international sizing", "bicycle tire sizing 26 inch older tire sizes", "bicycle inner tube sizing 700c 26 inch 27 inch how to read tire size", "bicycle tires inner tubes sold in variety of diameters sizing", "tire sizing arcane complicated 26 inch 700c mountain bike road bike", "27 inch 700c inner tubes compatible tire sizing", "etrto bicycle tire sizing system iso standard bead seat diameter", "bicycle inner tube size chart 700x23 26x1.95 how to choose tube size"], "tool_calls": 9, "tool_sequence": ["deep_search", "get_document", "get_document", "search", "get_document", "get_document", "get_document", "search", "search"], "turns": 5, "unique_search_targets": 9, "unique_tools": 3} +{"completion_tokens": 834, "duplicate_tool_calls": 0, "elapsed_sec": 60.870857327245176, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 10, "found_relevant_docs": [], "prompt_tokens": 25227, "qid": "79891", "query": "can you substitute chocolate chips for semi-sweet", "query_rewrites": 3, "reached": false, "relevant_docs": ["7073211"], "search_targets": ["substitute chocolate chips for semi-sweet chocolate", "3694357", "1884408", "3491847", "can i use chocolate chips instead of semi-sweet chocolate baking substitution", "3107090", "chocolate chips substitution semi-sweet baking squares"], "tool_calls": 7, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", 
"deep_search", "get_document", "get_document"], "turns": 5, "unique_search_targets": 7, "unique_tools": 2} +{"completion_tokens": 1005, "duplicate_tool_calls": 0, "elapsed_sec": 51.101275702007115, "empty_tool_calls": 0, "first_relevant_tool_calls": 2, "first_relevant_turn": 2, "found_ids_count": 15, "found_relevant_docs": ["7073272"], "prompt_tokens": 17346, "qid": "208494", "query": "how big do newfypoo's get", "query_rewrites": 5, "reached": true, "relevant_docs": ["7073272"], "search_targets": ["newfypoo size how big do they get", "newfypoo dog breed size weight", "newfypoo size weight large dog breed", "newfypoo size adult size poodle breeding large dog", "newfypoo hybrid size medium large dog breed"], "tool_calls": 5, "tool_sequence": ["deep_search", "search", "get_document", "get_document", "get_document"], "turns": 4, "unique_search_targets": 5, "unique_tools": 3} +{"completion_tokens": 671, "duplicate_tool_calls": 0, "elapsed_sec": 37.67149941343814, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 5, "found_relevant_docs": [], "prompt_tokens": 10952, "qid": "319564", "query": "how much fiber is in carrots", "query_rewrites": 3, "reached": false, "relevant_docs": ["7073381"], "search_targets": ["fiber content in carrots", "fiber content carrots grams", "carrot fiber content grams"], "tool_calls": 3, "tool_sequence": ["deep_search", "get_document", "get_document"], "turns": 3, "unique_search_targets": 3, "unique_tools": 2} +{"completion_tokens": 841, "duplicate_tool_calls": 0, "elapsed_sec": 40.00007836567238, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 5, "found_relevant_docs": [], "prompt_tokens": 11269, "qid": "155234", "query": "do bigger tires affect gas mileage", "query_rewrites": 2, "reached": false, "relevant_docs": ["502713"], "search_targets": ["do bigger tires affect gas mileage", "bigger tires affect gas mileage decrease", "tire size affect 
fuel economy gas mileage"], "tool_calls": 3, "tool_sequence": ["deep_search", "get_document", "get_document"], "turns": 3, "unique_search_targets": 3, "unique_tools": 2} +{"completion_tokens": 1913, "duplicate_tool_calls": 0, "elapsed_sec": 83.91100671608001, "empty_tool_calls": 2, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 25, "found_relevant_docs": [], "prompt_tokens": 31550, "qid": "14151", "query": "age requirements for name change", "query_rewrites": 12, "reached": false, "relevant_docs": ["7073772"], "search_targets": ["age requirements for name change", "age requirements for name change minor california", "name change age requirements iowa", "requirements for name address change", "name change age requirement minor adult minimum age", "how old do you have to be to change your name legal age", "name change age requirements adult minor 14 years", "must be at least 18 years old to change your name", "change your name legally must be at least 18 years old resident alberta", "name change without parent's permission no legal age", "age for a name change how old does a child have to be", "minnesota name change age requirement", "texas name change minor guardian parent requirements"], "tool_calls": 13, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "search", "search", "get_document", "get_document", "get_document", "get_document", "get_document", "get_document", "get_document"], "turns": 5, "unique_search_targets": 13, "unique_tools": 3} +{"completion_tokens": 813, "duplicate_tool_calls": 0, "elapsed_sec": 45.9172401772812, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 5, "found_relevant_docs": ["7074071"], "prompt_tokens": 11557, "qid": "67802", "query": "can green tea cause stomach problems", "query_rewrites": 3, "reached": true, "relevant_docs": ["7074071"], "search_targets": ["can green tea cause stomach problems", "green tea stomach 
problems", "green tea tannins stomach upset", "green tea extract stomach adverse effects empty stomach"], "tool_calls": 4, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document"], "turns": 3, "unique_search_targets": 4, "unique_tools": 2} +{"completion_tokens": 851, "duplicate_tool_calls": 0, "elapsed_sec": 52.36761954287067, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 15, "found_relevant_docs": ["7074235"], "prompt_tokens": 23980, "qid": "1090184", "query": "synonym of subordinate", "query_rewrites": 1, "reached": true, "relevant_docs": ["7074235"], "search_targets": ["synonym of subordinate", "7074235", "1192099", "3276120", "subordinate synonyms inferior junior underling", "7074238", "7074241"], "tool_calls": 7, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "search", "get_document", "get_document"], "turns": 5, "unique_search_targets": 7, "unique_tools": 3} +{"completion_tokens": 897, "duplicate_tool_calls": 0, "elapsed_sec": 45.07101959036663, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 10, "found_relevant_docs": [], "prompt_tokens": 12921, "qid": "323382", "query": "how much is the stamp to send a card", "query_rewrites": 1, "reached": false, "relevant_docs": ["4778293"], "search_targets": ["stamp price send a card postage", "361429", "3234937", "294032", "7999359", "2034687"], "tool_calls": 6, "tool_sequence": ["search", "get_document", "get_document", "get_document", "get_document", "get_document"], "turns": 3, "unique_search_targets": 6, "unique_tools": 2} +{"completion_tokens": 601, "duplicate_tool_calls": 0, "elapsed_sec": 47.80267889192328, "empty_tool_calls": 0, "first_relevant_tool_calls": 4, "first_relevant_turn": 3, "found_ids_count": 15, "found_relevant_docs": ["7074377"], "prompt_tokens": 20767, "qid": "323998", "query": "how much magnesium in kidney beans", "query_rewrites": 5, 
"reached": true, "relevant_docs": ["7074377"], "search_targets": ["magnesium content in kidney beans", "magnesium kidney beans amount", "kidney beans magnesium", "magnesium mg in kidney beans per cup", "kidney beans magnesium 70 mg"], "tool_calls": 5, "tool_sequence": ["deep_search", "get_document", "get_document", "search", "get_document"], "turns": 5, "unique_search_targets": 5, "unique_tools": 3} +{"completion_tokens": 1232, "duplicate_tool_calls": 0, "elapsed_sec": 60.60274072503671, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 15, "found_relevant_docs": [], "prompt_tokens": 26892, "qid": "91711", "query": "child psychiatrist salary 2016", "query_rewrites": 2, "reached": false, "relevant_docs": ["1956185"], "search_targets": ["child psychiatrist salary 2016", "\"child psychiatrist\" 2016 salary", "psychiatrist salary in 2016 child psychiatrist"], "tool_calls": 8, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "get_document", "search", "get_document", "get_document"], "turns": 5, "unique_search_targets": 3, "unique_tools": 3} +{"completion_tokens": 862, "duplicate_tool_calls": 0, "elapsed_sec": 57.88452936755493, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 10, "found_relevant_docs": [], "prompt_tokens": 19862, "qid": "125898", "query": "define prosthetic device", "query_rewrites": 5, "reached": false, "relevant_docs": ["7074710"], "search_targets": ["prosthetic device definition", "medical definition of prosthetic", "prosthesis definition artificial body part replacement", "prosthetic definition artificial substitute replacement", "artificial device replaces missing body part"], "tool_calls": 6, "tool_sequence": ["deep_search", "get_document", "get_document", "deep_search", "get_document", "get_document"], "turns": 5, "unique_search_targets": 5, "unique_tools": 2} +{"completion_tokens": 642, "duplicate_tool_calls": 0, 
"elapsed_sec": 32.60420227004215, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 10, "found_relevant_docs": [], "prompt_tokens": 11761, "qid": "289812", "query": "how many mm is a nickel coin", "query_rewrites": 3, "reached": false, "relevant_docs": ["262205"], "search_targets": ["nickel coin thickness mm", "nickel coin dimensions thickness diameter mm", "nickel thickness mm"], "tool_calls": 3, "tool_sequence": ["search", "get_document", "get_document"], "turns": 3, "unique_search_targets": 3, "unique_tools": 2} +{"completion_tokens": 671, "duplicate_tool_calls": 0, "elapsed_sec": 56.121242301072925, "empty_tool_calls": 0, "first_relevant_tool_calls": 4, "first_relevant_turn": 4, "found_ids_count": 25, "found_relevant_docs": ["7075218"], "prompt_tokens": 21073, "qid": "333486", "query": "how old do you have to be to get a job in idaho", "query_rewrites": 4, "reached": true, "relevant_docs": ["7075218"], "search_targets": ["how old do you have to be to get a job in idaho minimum age employment", "idaho minimum age employment law child labor minor work age", "idaho child labor law minimum age employment minor", "idaho child labor agricultural farm work age minor exception"], "tool_calls": 4, "tool_sequence": ["deep_search", "search", "get_document", "search"], "turns": 5, "unique_search_targets": 4, "unique_tools": 3} +{"completion_tokens": 1248, "duplicate_tool_calls": 0, "elapsed_sec": 50.902933293022215, "empty_tool_calls": 0, "first_relevant_tool_calls": 1, "first_relevant_turn": 1, "found_ids_count": 10, "found_relevant_docs": ["7075317"], "prompt_tokens": 18738, "qid": "1090171", "query": "synonyms for the word discipline", "query_rewrites": 3, "reached": true, "relevant_docs": ["7075317"], "search_targets": ["synonyms for the word discipline", "synonyms for discipline", "synonyms discipline", "synonyms discipline temperance"], "tool_calls": 7, "tool_sequence": ["search", "get_document", "get_document", 
"get_document", "get_document", "get_document", "get_document"], "turns": 4, "unique_search_targets": 4, "unique_tools": 2} +{"completion_tokens": 1064, "duplicate_tool_calls": 0, "elapsed_sec": 48.74893113086, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 20, "found_relevant_docs": [], "prompt_tokens": 23363, "qid": "73257", "query": "can seizure meds cause low sodium?", "query_rewrites": 6, "reached": false, "relevant_docs": ["7075398"], "search_targets": ["seizure medications cause low sodium hyponatremia", "antiepileptic drugs anticonvulsants hyponatremia low sodium side effect", "antiepileptic hyponatremia low sodium side effect", "anticonvulsants hyponatremia sodium low valproate topiramate", "drugs causing low sodium hyponatremia antiepileptic", "antiepileptic drugs carbamazepine oxcarbazepine sodium valproate hyponatremia"], "tool_calls": 6, "tool_sequence": ["search", "search", "get_document", "get_document", "get_document", "get_document"], "turns": 5, "unique_search_targets": 6, "unique_tools": 2} +{"completion_tokens": 878, "duplicate_tool_calls": 0, "elapsed_sec": 48.71238382626325, "empty_tool_calls": 1, "first_relevant_tool_calls": 0, "first_relevant_turn": 0, "found_ids_count": 25, "found_relevant_docs": [], "prompt_tokens": 24509, "qid": "1090170", "query": "synonyms for the word, decline", "query_rewrites": 6, "reached": false, "relevant_docs": ["7075411"], "search_targets": ["synonyms for the word decline", "synonyms for decline", "decline thesaurus", "decline synonyms word list", "decline synonyms list", "\"decline\" synonyms \"refuse\" \"reject\" \"diminish\" \"decrease\""], "tool_calls": 6, "tool_sequence": ["deep_search", "get_document", "get_document", "search", "expand", "search"], "turns": 5, "unique_search_targets": 6, "unique_tools": 4} +{"completion_tokens": 1187, "duplicate_tool_calls": 0, "elapsed_sec": 64.14346273010597, "empty_tool_calls": 0, "first_relevant_tool_calls": 0, 
"first_relevant_turn": 0, "found_ids_count": 15, "found_relevant_docs": [], "prompt_tokens": 22559, "qid": "237373", "query": "how is soil created from rocks", "query_rewrites": 9, "reached": false, "relevant_docs": ["7075449"], "search_targets": ["how is soil created from rocks soil formation weathering", "weathering soil formation rocks broken down", "weathering soil rocks minerals physical chemical processes", "pedochemical weathering soil formation saprolites", "rocks broken down into small grains soil process", "physical chemical biological weathering soil formation from rocks", "three types of weathering physical chemical biological", "weathering physical breakdown disintegration chemical alteration decomposition rocks soil", "soil formed physical chemical biological processes rocks broken down smaller particles"], "tool_calls": 9, "tool_sequence": ["deep_search", "get_document", "get_document", "get_document", "get_document", "search", "get_document", "get_document", "get_document"], "turns": 5, "unique_search_targets": 9, "unique_tools": 3} diff --git a/examples/ablation/diagnostics/public_scale_20260702.md b/examples/ablation/diagnostics/public_scale_20260702.md index 4fcee99..ebbe655 100644 --- a/examples/ablation/diagnostics/public_scale_20260702.md +++ b/examples/ablation/diagnostics/public_scale_20260702.md @@ -159,6 +159,7 @@ gitignored `.env` file. 
| historical zero-tool allowed | `qwen3:14b` via Ollama | 8,841,823 | 20 | 6/20 | 2.50 | 1.90 | 1.17 | 41.3s | 1.90 | 1.85 | 1.20 | 12/20 | 14/20 | 2/20 | | force-first-tool default | `qwen3:14b` via Ollama | 8,841,823 | 20 | 9/20 | 2.60 | 2.10 | 1.33 | 42.2s | 1.75 | 1.65 | 1.10 | 11/20 | 16/20 | 0/20 | | DeepSeek Flash quality path | `deepseek-v4-flash` | 8,841,823 | 20 | 11/20 | 4.10 | 5.90 | 1.55 | 50.0s | 2.35 | 5.50 | 4.25 | 19/20 | 19/20 | 0/20 | +| DeepSeek Flash quality path (50-query check) | `deepseek-v4-flash` | 8,841,823 | 50 | 23/50 | 4.14 | 5.78 | 1.70 | 50.8s | 2.38 | 5.40 | 4.20 | 48/50 | 48/50 | 0/50 | Historical per-query report: `examples/ablation/diagnostics/agent_loop_20260702_181702.md`. Historical incremental rows: `examples/ablation/diagnostics/agent_loop_ollama_qwen3_14b_smoke.jsonl`. @@ -166,6 +167,8 @@ Force-first per-query report: `examples/ablation/diagnostics/agent_loop_20260702 Force-first incremental rows: `examples/ablation/diagnostics/agent_loop_ollama_qwen3_14b_force_first.jsonl`. DeepSeek per-query report: `examples/ablation/diagnostics/agent_loop_20260702_194134.md`. DeepSeek incremental rows: `examples/ablation/diagnostics/agent_loop_deepseek_v4_flash_20.jsonl`. +DeepSeek 50-query per-query report: `examples/ablation/diagnostics/agent_loop_20260702_201106.md`. +DeepSeek 50-query incremental rows: `examples/ablation/diagnostics/agent_loop_deepseek_v4_flash_50.jsonl`. Observed failure pattern: the fallback model demonstrates real exploration behavior, but quality is not yet a Qwen3.6-grade reference. It made no tool call @@ -196,6 +199,16 @@ and latency: mean elapsed rose from 42.2s to 50.0s and mean prompt tokens from which confirms that true follow-up exploration can recover evidence not found by the first search. 
+The 50-query DeepSeek extension reached 23/50 (0.460) while preserving the same +operating shape: no zero-tool answers, no duplicate calls, 48/50 queries with +multiple tool types, and 48/50 queries with rewrites. Mean prompt tokens stayed +around 19.6k/query and mean elapsed was 50.8s/query. The delayed-discovery set +expanded to `178627`, `87892`, `1090242`, `45924`, `323998`, and `333486`. +High-call misses (`54544`, `293992`, `208145`, `14151`, `91711`, `237373`) +show the next bottleneck: the agent is willing to explore, but still needs +better target selection or retrieval-side candidate expansion when many +follow-up searches miss the gold document. + The local artifacts are gitignored: - `tests/benchmark/data/msmarco_passage.json` - 511 KB manifest