feat(contentunderstanding): Copilot skills for authoring custom analyzers#49672

Draft: chienyuanchang wants to merge 4 commits into main from cu-sdk/custom-analyzer-skills

Summary

Adds two GitHub Copilot skills under sdk/contentunderstanding/azure-ai-contentunderstanding/.github/skills/ that guide users through the iterative cycle of authoring custom Content Understanding analyzers directly inside VS Code:

| Skill | Use case |
| --- | --- |
| cu-sdk-author-analyzer | Single-document-type authoring (e.g. invoices, receipts, contracts) |
| cu-sdk-author-analyzer-classify-route | Classify-and-route pipelines for mixed-document packets (e.g. invoice + bank statement + loan application in one PDF) |

Both skills delegate to a small cu-skill Maven tool under .github/skills/_shared/ that exposes three subcommands:

  • extract-layout — pulls document structure into .layout.{json,md} so Copilot has ground-truth context for schema drafting.
  • create-and-test — validates a schema locally, creates the analyzer, batch-tests inputs, prints a per-field fill-rate + avg-confidence summary, and (with --ephemeral) cleans up.
  • create-and-test-router — same loop for classify-and-route: creates N inner extractors + 1 outer classifier, wires contentCategories[*].analyzerId, runs, and prints a category-aware summary.
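A subcommand dispatcher for the three entry points above could look roughly like the following. This is a sketch only, not the PR's Cli.java: the class name `CliSketch`, the placeholder handlers, and the exit code 2 for usage errors are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch of a subcommand dispatcher; not the PR's Cli.java.
final class CliSketch {
    // Subcommand name -> handler that receives the remaining arguments and returns an exit code.
    static final Map<String, ToIntFunction<String[]>> COMMANDS = new LinkedHashMap<>();

    static {
        // Placeholder handlers: the real commands call the service; these only return success.
        COMMANDS.put("extract-layout", rest -> 0);
        COMMANDS.put("create-and-test", rest -> 0);
        COMMANDS.put("create-and-test-router", rest -> 0);
    }

    private CliSketch() { }

    static int dispatch(String[] argv) {
        if (argv.length == 0 || !COMMANDS.containsKey(argv[0])) {
            // Exit code 2 for usage errors is an assumption of this sketch.
            System.err.println("usage: cu-skill <" + String.join("|", COMMANDS.keySet()) + "> [options]");
            return 2;
        }
        String[] rest = Arrays.copyOfRange(argv, 1, argv.length);
        return COMMANDS.get(argv[0]).applyAsInt(rest);
    }

    public static void main(String[] args) {
        System.exit(dispatch(args));
    }
}
```

Keeping the name-to-handler table in one map means the usage message and the dispatch logic cannot drift apart when a subcommand is added.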

A pure-Java SchemaValidator (Jackson only, no com.azure.* imports) catches structural mistakes (unknown baseAnalyzerId, missing fieldSchema, malformed contentCategories routes) before a service round-trip.
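To illustrate the kind of structural pre-check described above, here is a minimal sketch. It is not the PR's SchemaValidator: it works on plain `Map` values rather than Jackson nodes so that it stays self-contained, it covers only the single-type case, and the class name `SchemaPrecheck`, the error strings, and the caller-supplied allow-list are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the PR's SchemaValidator parses JSON with Jackson; this uses plain Maps.
final class SchemaPrecheck {
    private SchemaPrecheck() { }

    /**
     * Returns the problems found in an already-parsed single-type analyzer schema.
     * knownBaseIds is supplied by the caller; the real allow-list is not assumed here.
     */
    static List<String> validateSingleType(Map<String, Object> schema, Set<String> knownBaseIds) {
        List<String> errors = new ArrayList<>();

        Object baseId = schema.get("baseAnalyzerId");
        if (!(baseId instanceof String) || !knownBaseIds.contains(baseId)) {
            errors.add("unknown baseAnalyzerId: " + baseId);
        }

        Object fieldSchema = schema.get("fieldSchema");
        if (!(fieldSchema instanceof Map)) {
            errors.add("missing fieldSchema");
            return errors; // nothing further to inspect
        }

        Object fields = ((Map<?, ?>) fieldSchema).get("fields");
        if (!(fields instanceof Map) || ((Map<?, ?>) fields).isEmpty()) {
            errors.add("fieldSchema.fields must be a non-empty object");
        }
        return errors;
    }

    public static void main(String[] args) {
        // "placeholder-base" is a made-up ID, not a real Content Understanding analyzer name.
        Map<String, Object> schema = Map.of("baseAnalyzerId", "placeholder-typo");
        System.out.println(validateSingleType(schema, Set.of("placeholder-base")));
    }
}
```

Collecting every error in a list, rather than throwing on the first one, lets a single local run report all structural mistakes before any service call is attempted.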

This mirrors the .NET skills shipped in Azure/azure-sdk-for-net#60394 and the Python skills in Azure/azure-sdk-for-python#47218.

Live test

Tested against a real CU resource with mixed_financial_docs.pdf (an invoice + bank statement + loan application):

$ mvn -f .github/skills/_shared/pom.xml exec:java \
    -Dexec.args="extract-layout --input mixed_financial_docs.pdf --output /tmp/layout/"
[RUN ] mixed_financial_docs.pdf -> /tmp/layout/mixed_financial_docs.layout.{json,md}
[DONE] 1 ok, 0 failed
$ java -cp .github/skills/_shared/target/classes:<deps> \
    com.azure.ai.contentunderstanding.skills.Cli create-and-test \
    --schema bank_statement.schema.json \
    --input mixed_financial_docs.pdf --output /tmp/results/ \
    --reuse --ephemeral
[CREATE] analyzer_id=bank_statement.schema_9a2bf4bc
[CREATE] bank_statement.schema_9a2bf4bc ready
[ANALYZE] mixed_financial_docs.pdf -> /tmp/results/mixed_financial_docs.json
[CLEANUP] delete analyzer bank_statement.schema_9a2bf4bc

[SUMMARY]
category: (single)  (1 document)
  field                                    fill rate  avg conf
  accountHolder                            100.0%    0.960
  accountNumber                            100.0%    0.962
  beginningBalance                         100.0%    0.902
  endingBalance                            100.0%    0.902
  ...
  transactions[].deposit                   15.4%     0.880
lowest-confidence fields:
  0.587  transactions[].date  (mixed_financial_docs)

The router variant created 3 inner + 1 outer analyzer, classified 3 segments, extracted all fields, and cleaned up all 4 analyzers (~59 s end-to-end).

What's in the PR

| Component | LOC | Notes |
| --- | --- | --- |
| _shared/SchemaValidator.java | 335 | Pure Jackson, no com.azure.* deps; purity-guarded by a unit test |
| _shared/ExtractLayoutCommand.java | 271 | Stage 1 |
| _shared/CreateAndTestCommand.java | 712 | Stage 2 (single-type) |
| _shared/CreateAndTestRouterCommand.java | 559 | Stage 2 (classify-route) |
| _shared/Cli.java | 61 | Subcommand dispatcher |
| _shared/pom.xml + _shared/README.md | 164 | Standalone Maven module; intentionally not a child of azure-client-sdk-parent and not referenced from the package POM, so zero effect on the published artifact |
| cu-sdk-author-analyzer/SKILL.md + template | 478 | Single-type SKILL |
| cu-sdk-author-analyzer-classify-route/SKILL.md + template | 507 | Classify-route SKILL |
| SchemaValidatorTest.java | 366 | 18 JUnit cases, all passing |
| README.md | +24 | New "What's New" section + two new author-skill rows in the existing table |
| CHANGELOG.md | +18 | Entry under 1.1.0-beta.3 (Unreleased) |
| Total | +3,495 | |

Test results

$ mvn -B test
[INFO] Running com.azure.ai.contentunderstanding.skills.SchemaValidatorTest
[INFO] Tests run: 18, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.605 s
[INFO] BUILD SUCCESS

Coverage:

  • Valid single-type and classify-route schemas
  • Classify-route allows category without analyzerId (catch-all "other" bucket)
  • Unknown baseAnalyzerId rejected (catches the prebuilt-documentAnalyzer typo class)
  • Missing fieldSchema on non-classifier
  • Empty fields object, unknown field type/method
  • Nested object.properties and array.items recursion
  • Classify-route + top-level fieldSchema → rejected
  • Classify-route without enableSegment: true → rejected
  • Empty category description → rejected
  • validateFile: missing file, invalid JSON, valid round-trip
  • KNOWN_BASE_ANALYZER_IDS allow-list sanity check
  • Purity guard — fails if SchemaValidator.java ever pulls in com.azure.* / java.net.http / HttpURLConnection
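A purity guard like the last item can be as simple as scanning the validator's source text for forbidden tokens. The sketch below assumes that approach. The class name `PurityGuard` and the method `findForbidden` are hypothetical, the token list is taken from the bullet above, and the PR's actual check is a JUnit case whose implementation is not shown here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a source-scanning purity guard; not the PR's test code.
final class PurityGuard {
    // Tokens named in the coverage list; a match means an SDK or HTTP dependency crept in.
    static final List<String> FORBIDDEN = List.of("com.azure.", "java.net.http", "HttpURLConnection");

    private PurityGuard() { }

    /** Returns every forbidden token that occurs anywhere in the given source text. */
    static List<String> findForbidden(String source) {
        List<String> hits = new ArrayList<>();
        for (String token : FORBIDDEN) {
            if (source.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PurityGuard <path-to-source-file>");
            return;
        }
        List<String> hits = findForbidden(Files.readString(Path.of(args[0])));
        if (!hits.isEmpty()) {
            throw new AssertionError("forbidden references: " + hits);
        }
        System.out.println("pure");
    }
}
```

A plain substring scan also flags comments and string literals that mention these tokens. That is deliberately strict and may need loosening if the validator's documentation ever has to name them.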

Bugs fixed during live test

  1. Env var quoting: .env values are typically wrapped in "…", and a raw export keeps the quotes. Added a readEnv() helper that strips one layer of surrounding quotes (mirrors the .NET fix).
  2. DefaultAzureCredentialBuilder does not expose excludeCredentials in the Java SDK (unlike .NET). Built a focused ChainedTokenCredentialBuilder chain (EnvironmentCredential, then AzureCliCredential) so the IMDS probe doesn't stall on dev boxes (~30 s wait on WSL).
  3. Output JSON shape parity — used the BinaryData protocol overload of beginAnalyzeBinary / beginCreateAnalyzer so we can keep the wire format (valueString/valueNumber/...) and unwrap the LRO envelope {id,status,result,usage} to match Python's poller.result() output and the .NET skill output.
  4. Fill-rate denominator — initial implementation used results.size() (doc count) instead of rows.size() (per-field row count), which inflated array-leaf fields to 1300%. Switched to per-field row count to match Python and .NET semantics.
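The quote-stripping fix in item 1 can be sketched as below. Only the name readEnv() comes from the description above; the wrapper class `EnvReader`, the helper `stripOneQuoteLayer`, the whitespace trimming, and the handling of single quotes alongside double quotes are assumptions.

```java
// Hypothetical sketch of the readEnv() idea; the PR's actual helper may differ in detail.
final class EnvReader {
    private EnvReader() { }

    /** Removes exactly one layer of matching surrounding quotes, if present. */
    static String stripOneQuoteLayer(String raw) {
        if (raw == null) {
            return null;
        }
        String value = raw.trim(); // trimming is an assumption of this sketch
        if (value.length() >= 2) {
            char first = value.charAt(0);
            char last = value.charAt(value.length() - 1);
            // Accepting single quotes as well as double quotes is also an assumption.
            if ((first == '"' || first == '\'') && first == last) {
                return value.substring(1, value.length() - 1);
            }
        }
        return value;
    }

    /** Reads an environment variable and strips one layer of surrounding quotes. */
    static String readEnv(String name) {
        return stripOneQuoteLayer(System.getenv(name));
    }

    public static void main(String[] args) {
        System.out.println(stripOneQuoteLayer("\"https://example.invalid/\""));
    }
}
```

Stripping a single layer, rather than looping until no quotes remain, leaves values that legitimately begin and end with a quote character intact.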

Backward compatibility

No public API changes. The skills tree is opt-in tooling that lives under .github/skills/ and is only loaded by Copilot when a user explicitly asks. The cu-skill Maven module is standalone — not a child of azure-client-sdk-parent, not referenced from the package POM — so it doesn't affect normal package builds or the published azure-ai-contentunderstanding artifact.

…nalyzers

Add two GitHub Copilot skills under .github/skills/ that guide users
through the iterative cycle of authoring custom Content Understanding
analyzers in VS Code:

- cu-sdk-author-analyzer            — single-document-type authoring
- cu-sdk-author-analyzer-classify-route — classify-and-route pipelines

Both skills delegate to a small cu-skill Maven tool under
.github/skills/_shared/ that exposes three subcommands (extract-layout,
create-and-test, create-and-test-router) and a pure-Java SchemaValidator
(no com.azure.* deps) so structural mistakes are caught before a service
round-trip.

Mirrors the .NET skills shipped in Azure/azure-sdk-for-net#60394 and
the Python skills in Azure/azure-sdk-for-python#47218.

Live-tested against a real CU resource with mixed_financial_docs.pdf:

- extract-layout produces .layout.{json,md}
- create-and-test reports per-field fill rate + avg confidence and
  cleans up the analyzer when --ephemeral is set
- create-and-test-router creates N inner + 1 outer analyzer, prints a
  category-aware summary, and cleans up all four

New unit tests:

- SchemaValidatorTest (18 cases) mirrors Python's
  test_skills_shared_schema_validator.py and the .NET
  SkillSchemaValidatorTests.cs: valid single-type and classify-route
  schemas, every rejection path, validateFile error handling,
  KNOWN_BASE_ANALYZER_IDS allow-list sanity, and a purity guard that
  fails if the validator source ever pulls in com.azure.* or HTTP
  namespaces.

README:

- New 'What's New' section and two new entries in the existing
  'GitHub Copilot Skills' table.

CHANGELOG:

- Entry under 1.1.0-beta.3 (Unreleased) describing the new skills.

Build:

- mvn -B -DskipTests compile: clean
- mvn -B test: 18/18 tests pass

The Maven module is intentionally NOT a child of azure-client-sdk-parent
and is NOT referenced from the package POM, so it has zero effect on the
published azure-ai-contentunderstanding artifact.

CI cspell check flagged abbreviated locals and a few domain acronyms
in the skill-tooling sources. Replaces them with the full words used
elsewhere in the file so the dictionary does not need a new entry.

- confs -> confidences (CreateAndTestCommand, CreateAndTestRouterCommand)
- fobj -> fieldObj, vobj -> valueObj (CreateAndTestCommand)
- IMDS -> metadata-service (ExtractLayoutCommand comment)
- prebuilts -> prebuilt analyzers (SchemaValidator javadoc)

No behavior change. All 18 unit tests still pass.
…le paths

Verify-Links flagged 15 broken relative links in the two analyzer skill
docs. The Python source skills referenced 'samples/sample_*.py' files
at the package root; the Java port mechanically copied the link
structure but the Java layout is different:

- Java samples live at src/samples/java/com/azure/ai/contentunderstanding/samples/,
  not samples/.
- Java does not have the per-sample .md companion files Python ships
  (Sample06_GetAnalyzer.md, Sample02_AnalyzeUrl.md, etc.) — only the
  .java source files.

Rewrites all 15 links to point at the correct .java source under
src/samples/java/. All 10 distinct sample files referenced were
verified to exist on disk before the rewrite.
…ts from Python

Three changes for parity with Python PR #47218:

1. **summarizeRouted denominator bug fix** (was a port-time regression).
   The denominator for a per-field fill rate must be the per-category
   segment count, NOT the per-field row count. Two invoice segments
   where only one has TotalAmount correctly reports 50% (1/2 segments),
   not 100% (1/1 rows). This matches Python and .NET; the same bug
   existed in JS and is being fixed on its PR. The comment claiming
   'Mirrors Python/.NET semantics' was misleading — fixed in this commit.

2. **New CLI-helper test classes** — CreateAndTestCommandTest and
   CreateAndTestRouterCommandTest. Together they cover the 4
   pure-helper tests Python ships (summarize leaf-row flattening,
   summarizeRouted per-category denominator, summarizeRouted zero-fill,
   wireInnerIds alias substitution). The denom regression above was
   discovered by these tests.

   Python ships 10 CLI-helper tests total; the remaining 6 either test
   argparse-specific behaviour (--help smoke, monkey-patched client)
   or test code structured differently in Java (error-tuple API), so
   they don't translate 1:1. We cover the substantive behaviour.

3. **Cross-skill SKILL.md updates** that Python's PR #47218 also
   shipped: a two-stage-pipeline section + a baseAnalyzerId table in
   cu-sdk-common-knowledge, 'Next step' cross-refs in cu-sdk-sample-run,
   and a 'step numbering is contract' callout in cu-sdk-setup.
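The denominator rule from item 1 can be sketched as follows: each field's fill rate is divided by the number of segments classified into that category, not by the number of rows in which the field appeared. This is an illustration only, not the PR's summarizeRouted. The `Segment` type and the class name `RoutedFillRate` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a per-category fill-rate summary; not the PR's summarizeRouted.
final class RoutedFillRate {
    /** Stand-in for one classified segment: its category and the fields that came back filled. */
    static final class Segment {
        final String category;
        final Set<String> filledFields;

        Segment(String category, Set<String> filledFields) {
            this.category = category;
            this.filledFields = filledFields;
        }
    }

    private RoutedFillRate() { }

    /** Returns category -> field -> fill rate in [0, 1], using the per-category segment count as denominator. */
    static Map<String, Map<String, Double>> fillRates(List<Segment> segments) {
        Map<String, Integer> segmentCount = new LinkedHashMap<>();
        Map<String, Map<String, Integer>> filled = new LinkedHashMap<>();
        for (Segment s : segments) {
            segmentCount.merge(s.category, 1, Integer::sum);
            Map<String, Integer> perField = filled.computeIfAbsent(s.category, k -> new LinkedHashMap<>());
            for (String field : s.filledFields) {
                perField.merge(field, 1, Integer::sum);
            }
        }
        Map<String, Map<String, Double>> rates = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : filled.entrySet()) {
            int denominator = segmentCount.get(e.getKey()); // segments in this category
            Map<String, Double> perField = new LinkedHashMap<>();
            for (Map.Entry<String, Integer> f : e.getValue().entrySet()) {
                perField.put(f.getKey(), f.getValue() / (double) denominator);
            }
            rates.put(e.getKey(), perField);
        }
        return rates;
    }

    public static void main(String[] args) {
        List<Segment> segments = List.of(
            new Segment("invoice", Set.of("InvoiceId", "TotalAmount")),
            new Segment("invoice", Set.of("InvoiceId")));
        // Two invoice segments, one with TotalAmount: prints 0.5
        System.out.println(fillRates(segments).get("invoice").get("TotalAmount"));
    }
}
```

The sketch only reports fields that were filled at least once. Covering the zero-fill case mentioned in item 2 would also require the schema's declared field list, which is omitted here.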