examples(contract-review-agent): multi-model contract review with the OpenAI Agents SDK #197
Open
svonava wants to merge 1 commit into
… OpenAI Agents SDK

An OpenAI Agents SDK agent whose every model call is served by one SIE cluster. An autonomous 'investigator' agent fans out across the SIE catalog (no structured output_type, so it must use its tools) and a 'synthesizer' agent produces a grounded, structured ContractReview:
- triage Qwen3-0.6B, orchestration Qwen3-4B-Instruct, vision Qwen3.5-4B, risk-analysis sub-agent Qwen3-4B-Instruct (newer Qwen3.5-4B / stronger Qwen3.6-27B where the cluster serves them), text-to-SQL sqlcoder-7b-2, OCR LightOnOCR-2-1B, embeddings bge-m3, rerank Qwen3-Reranker-4B, entities gliner_large; granite-guardian input guardrail.
- Real contracts from CUAD (CC BY 4.0), with a synthetic offline fallback.
- Per-model observability: cold-start warm-up vs warm throughput, per call.
- Resilient: fail-open guardrail, graceful tool degradation, and provisioning retries for cold/evicted models.

Validated end-to-end against a GPU SIE cluster.
Multi-model contract review with the OpenAI Agents SDK
A new runnable example under examples/contract-review-agent/: a contract reviewer built on the OpenAI Agents SDK where every model call is served by one SIE cluster, with no api.openai.com and no per-token bill. It demonstrates the SIE model catalog by fanning a single request across ~9 specialised models, and it doubles as a per-model observability tool.

This is the "one cluster powers every model your agent calls" idea from the landing page, made real, runnable, and grounded on real contracts.
The catalog: the right model for each job

- triage: Qwen/Qwen3-0.6B
- orchestration: Qwen/Qwen3-4B-Instruct-2507
- vision: Qwen/Qwen3.5-4B
- risk-analysis sub-agent: Qwen/Qwen3-4B-Instruct-2507 (↑ Qwen3.5-4B / Qwen3.6-27B where served)
- text-to-SQL: defog/sqlcoder-7b-2
- input guardrail: ibm-granite/granite-guardian-3.0-2b
- OCR: lightonai/LightOnOCR-2-1B
- embeddings: BAAI/bge-m3
- rerank: Qwen/Qwen3-Reranker-4B
- entities: urchade/gliner_large-v2.1

Each role is one line in config.yaml; swap a string to try another catalog model.

Architecture: two agents
The whole wiring is one idea: the Agents SDK speaks the OpenAI wire protocol and SIE serves an OpenAI-compatible /v1, so we point the SDK at SIE (set_default_openai_client + set_default_openai_api("chat_completions") + set_tracing_disabled) and each Agent names a SIE model.

The flow is deliberately two agents:
- Investigator (no output_type) autonomously calls tools to gather grounded facts, guarded by a granite-guardian input guardrail and delegating clause-risk to a reasoning sub-agent.
- Synthesizer (output_type=ContractReview, no tools) formats the findings into structured output via SIE's JSON-schema-constrained generation.

Why split? With a structured output_type, a small open model emits the schema immediately and skips the tools (it will even hallucinate the fields). Separating "gather with tools" from "format the result" keeps the multi-model fan-out real and the output grounded. (tool_choice="required" would be another lever, but SIE returns 400 for forced tool calls on this model; only auto works.)

Observability
A normal uv run review prints a per-model ledger: model, SIE function, cold-start warm-up, warm latency, data sent, and warm throughput (tokens/s). Warm-up is shown separately from throughput so a cold model's numbers aren't blended into a meaningless "1 tok/s".

Data
The default corpus is CUAD (Contract Understanding Atticus Dataset): 510 real SEC-filed commercial contracts, CC BY 4.0. uv run fetch-contracts downloads CUAD's ~18 MB archive once and parses the contract text; uv run make-sample builds a fully synthetic offline alternative. The text-to-SQL obligations DB is seeded from the fetched contracts.

Run it
Validated end-to-end against a GPU SIE cluster
Run against a live GPU cluster, the investigator autonomously fanned out across 6–7 distinct models and produced a grounded review of a real CUAD contract (a Shenzhen LOHAS Supply Contract). It correctly extracted the buyer, flagged the contradictory 30%-vs-5% late-delivery penalty and the force-majeure certificate issue, and, when the OCR/vision tools were unavailable on that run, honestly reported execution as uncertain "due to tool failures" instead of hallucinating.
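The ledger described under Observability keeps cold-start warm-up out of the throughput figure. A minimal sketch of that bookkeeping follows; it is not the example's actual implementation. ModelLedger, its fields, and blended_tokens_per_s are hypothetical names, the numbers are illustrative rather than measured, and treating the first call to a model as the cold one is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelLedger:
    """Timing record for one model. Hypothetical names, not the example's own."""
    model: str
    warmup_s: float = 0.0    # latency of the first (assumed cold) call
    warm_s: float = 0.0      # summed latency of all later calls
    warm_tokens: int = 0     # tokens produced by those later calls
    calls: int = 0

    def record(self, latency_s: float, tokens: int) -> None:
        """Book the first call as warm-up and every later call as warm."""
        if self.calls == 0:
            self.warmup_s = latency_s
        else:
            self.warm_s += latency_s
            self.warm_tokens += tokens
        self.calls += 1

    def warm_tokens_per_s(self) -> Optional[float]:
        """Throughput over warm calls only; None until a warm call exists."""
        if self.warm_s <= 0:
            return None
        return self.warm_tokens / self.warm_s


def blended_tokens_per_s(samples: list[tuple[float, int]]) -> float:
    """Naive throughput over all calls, shown only for contrast."""
    total_s = sum(latency for latency, _ in samples)
    total_tokens = sum(tokens for _, tokens in samples)
    return total_tokens / total_s


# Illustrative numbers, not measurements: one 60 s cold call, then two warm ones.
samples = [(60.0, 64), (0.5, 64), (0.5, 64)]
ledger = ModelLedger("Qwen/Qwen3-0.6B")
for latency_s, tokens in samples:
    ledger.record(latency_s, tokens)

print(ledger.warmup_s, ledger.warm_tokens_per_s(), blended_tokens_per_s(samples))
```

With these made-up numbers the blended figure is dominated by the single cold call, while the warm figure reflects steady-state speed, which is the distinction the ledger is meant to preserve.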
Sample observability ledger from a live run:
- Qwen3-0.6B (triage)
- Qwen3-4B-Instruct (risk sub-agent, this run)
- bge-m3 (embed)
- Qwen3-Reranker-4B (rerank)
- gliner_large (entities)
- Qwen3.5-4B (vision)

The committed default uses Qwen3-4B-Instruct-2507 for reasoning, the model that produced the grounded review above. Qwen3.5-4B (newest) and Qwen3.6-27B (strongest) are one-line swaps for clusters that serve them well.

Engineering decisions surfaced by live testing
- … max_retries=0); Agents-SDK calls, which we can't wrap, use a hard-retrying client (max_retries=12) so they survive a model being evicted mid-run on a busy cluster.

Notes / caveats
- openai-agents is pinned <0.14: newer releases pin websockets>=15 (for realtime/voice, unused here), which conflicts with sie-sdk's websockets<15.
- … data.zip directly (the datasets 5.x release dropped script-based datasets, so load_dataset no longer works for it).
- Qwen3.6-27B, sqlcoder-7b-2, granite-guardian, LightOnOCR, and (intermittently) the vision model Qwen3.5-4B did not reliably provision; those steps degraded gracefully (logged in the ledger) and run on a fully-provisioned cluster. reasoning therefore defaults to the reliable Qwen3-4B-Instruct-2507; Qwen3.5-4B (newest) / Qwen3.6-27B (strongest) are documented one-line swaps.
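The provisioning retries and graceful degradation mentioned above could look roughly like the sketch below for calls the example wraps itself. This is an illustration under stated assumptions, not the code in this PR: ModelNotReady, call_with_provisioning_retry, and degrade_gracefully are hypothetical names, and the exponential backoff schedule is assumed.

```python
import time
from typing import Any, Callable


class ModelNotReady(Exception):
    """Hypothetical error raised while a model is cold or has been evicted."""


def call_with_provisioning_retry(
    fn: Callable[[], Any],
    *,
    attempts: int = 5,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fn(), retrying with exponential backoff while the model provisions."""
    for attempt in range(attempts):
        try:
            return fn()
        except ModelNotReady:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay_s * 2 ** attempt)


def degrade_gracefully(tool_name: str, fn: Callable[[], Any], log: list) -> dict:
    """Run a tool; on any failure, log it and return an explicit 'unavailable' result."""
    try:
        return {"tool": tool_name, "ok": True, "result": fn()}
    except Exception as exc:  # deliberately broad: the run should continue
        log.append(f"{tool_name}: degraded ({exc})")
        return {"tool": tool_name, "ok": False, "result": None}


def always_cold() -> str:
    """Stand-in for a tool whose model never provisions."""
    raise ModelNotReady("evicted")


log: list = []
outcome = degrade_gracefully(
    "ocr",
    lambda: call_with_provisioning_retry(always_cold, attempts=2, sleep=lambda s: None),
    log,
)
print(outcome, log)
```

Returning an explicit ok=False result, rather than raising, gives the agent something concrete to report as uncertain instead of inventing the missing tool output.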