feat(contentunderstanding): Copilot skills for authoring custom analyzers by chienyuanchang · Pull Request #49672 · Azure/azure-sdk-for-java

chienyuanchang · 2026-06-29T22:51:42Z

feat(contentunderstanding): Copilot skills for authoring custom analyzers

Summary

Adds two GitHub Copilot skills under sdk/contentunderstanding/azure-ai-contentunderstanding/.github/skills/ that guide users through the iterative cycle of authoring custom Content Understanding analyzers directly inside VS Code:

| Skill | Use case |
| --- | --- |
| `cu-sdk-author-analyzer` | Single-document-type authoring (e.g. invoices, receipts, contracts) |
| `cu-sdk-author-analyzer-classify-route` | Classify-and-route pipelines for mixed-document packets (e.g. invoice + bank statement + loan application in one PDF) |

Both skills delegate to a small cu-skill Maven tool under .github/skills/_shared/ that exposes three subcommands:

- extract-layout: pulls document structure into .layout.{json,md} so Copilot has ground-truth context for schema drafting.
- create-and-test: validates a schema locally, creates the analyzer, batch-tests inputs, prints a per-field fill-rate + avg-confidence summary, and (with --ephemeral) cleans up.
- create-and-test-router: same loop for classify-and-route: creates N inner extractors + 1 outer classifier, wires contentCategories[*].analyzerId, runs, and prints a category-aware summary.

A pure-Java SchemaValidator (Jackson only, no com.azure.* imports) catches structural mistakes (unknown baseAnalyzerId, missing fieldSchema, malformed contentCategories routes) before a service round-trip.
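As a rough illustration of this kind of pre-flight check, here is a minimal sketch in plain Java. It is not the PR's SchemaValidator: it works on an already-parsed `Map` instead of Jackson nodes so that it has no dependencies, the class name `SchemaCheckSketch` and the allow-list contents are hypothetical, and it assumes that `contentCategories` and `enableSegment` sit under a `config` object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative structural check; names and schema layout are assumptions. */
final class SchemaCheckSketch {
    // Hypothetical allow-list. The real KNOWN_BASE_ANALYZER_IDS is not shown in this PR text.
    static final Set<String> KNOWN_BASE_IDS = Set.of("prebuilt-document", "prebuilt-image");

    /** Returns a list of human-readable problems; an empty list means the schema passed. */
    static List<String> validate(Map<String, Object> schema) {
        List<String> errors = new ArrayList<>();

        Object base = schema.get("baseAnalyzerId");
        if (!(base instanceof String) || !KNOWN_BASE_IDS.contains(base)) {
            errors.add("unknown baseAnalyzerId: " + base);
        }

        // Assumption: routing settings live under "config".
        Object configObj = schema.get("config");
        Map<?, ?> config = configObj instanceof Map ? (Map<?, ?>) configObj : Map.of();
        Object categories = config.get("contentCategories");

        if (categories instanceof Map) {
            // Classify-route schema: no top-level fieldSchema, segmentation must be on.
            if (schema.containsKey("fieldSchema")) {
                errors.add("classify-route schema must not define a top-level fieldSchema");
            }
            if (!Boolean.TRUE.equals(config.get("enableSegment"))) {
                errors.add("classify-route schema requires enableSegment: true");
            }
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) categories).entrySet()) {
                Object category = entry.getValue();
                Object description = category instanceof Map
                    ? ((Map<?, ?>) category).get("description") : null;
                if (!(description instanceof String) || ((String) description).isBlank()) {
                    errors.add("category '" + entry.getKey() + "' needs a non-empty description");
                }
                // analyzerId is optional: a category without one acts as a catch-all bucket.
            }
        } else if (!(schema.get("fieldSchema") instanceof Map)) {
            errors.add("missing fieldSchema");
        }
        return errors;
    }
}
```

Collecting every problem into a list, instead of throwing on the first one, lets a caller report all structural mistakes in a single pass before any service call is made.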

This mirrors the .NET skills shipped in Azure/azure-sdk-for-net#60394 and the Python skills in Azure/azure-sdk-for-python#47218.

Live test

Tested against a real CU resource with mixed_financial_docs.pdf (an invoice + bank statement + loan application):

$ mvn -f .github/skills/_shared/pom.xml exec:java \
    -Dexec.args="extract-layout --input mixed_financial_docs.pdf --output /tmp/layout/"
[RUN ] mixed_financial_docs.pdf -> /tmp/layout/mixed_financial_docs.layout.{json,md}
[DONE] 1 ok, 0 failed

$ java -cp .github/skills/_shared/target/classes:<deps> \
    com.azure.ai.contentunderstanding.skills.Cli create-and-test \
    --schema bank_statement.schema.json \
    --input mixed_financial_docs.pdf --output /tmp/results/ \
    --reuse --ephemeral
[CREATE] analyzer_id=bank_statement.schema_9a2bf4bc
[CREATE] bank_statement.schema_9a2bf4bc ready
[ANALYZE] mixed_financial_docs.pdf -> /tmp/results/mixed_financial_docs.json
[CLEANUP] delete analyzer bank_statement.schema_9a2bf4bc

[SUMMARY]
category: (single)  (1 document)
  field                                    fill rate  avg conf
  accountHolder                            100.0%    0.960
  accountNumber                            100.0%    0.962
  beginningBalance                         100.0%    0.902
  endingBalance                            100.0%    0.902
  ...
  transactions[].deposit                   15.4%     0.880
lowest-confidence fields:
  0.587  transactions[].date  (mixed_financial_docs)

The router variant created 3 inner analyzers and 1 outer analyzer, classified 3 segments, extracted all fields, and cleaned up all 4 analyzers (~59 s end-to-end).

What's in the PR

| Component | LOC | Notes |
| --- | --- | --- |
| `_shared/SchemaValidator.java` | 335 | Pure Jackson, no `com.azure.*` deps; purity-guarded by a unit test |
| `_shared/ExtractLayoutCommand.java` | 271 | Stage 1 |
| `_shared/CreateAndTestCommand.java` | 712 | Stage 2 (single-type) |
| `_shared/CreateAndTestRouterCommand.java` | 559 | Stage 2 (classify-route) |
| `_shared/Cli.java` | 61 | Subcommand dispatcher |
| `_shared/pom.xml` + `_shared/README.md` | 164 | Standalone Maven module; intentionally not a child of azure-client-sdk-parent and not referenced from the package POM, so zero effect on the published artifact |
| `cu-sdk-author-analyzer/SKILL.md` + template | 478 | Single-type SKILL |
| `cu-sdk-author-analyzer-classify-route/SKILL.md` + template | 507 | Classify-route SKILL |
| `SchemaValidatorTest.java` | 366 | 18 JUnit cases, all passing |
| `README.md` | +24 | New "What's New" section + two new author-skill rows in the existing table |
| `CHANGELOG.md` | +18 | Entry under 1.1.0-beta.3 (Unreleased) |
| Total | +3,495 | |

Test results

$ mvn -B test
[INFO] Running com.azure.ai.contentunderstanding.skills.SchemaValidatorTest
[INFO] Tests run: 18, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 1.605 s
[INFO] BUILD SUCCESS

Coverage:

- Valid single-type and classify-route schemas
- Classify-route allows category without analyzerId (catch-all "other" bucket)
- Unknown baseAnalyzerId rejected (catches the prebuilt-documentAnalyzer typo class)
- Missing fieldSchema on non-classifier
- Empty fields object, unknown field type/method
- Nested object.properties and array.items recursion
- Classify-route + top-level fieldSchema → rejected
- Classify-route without enableSegment: true → rejected
- Empty category description → rejected
- validateFile: missing file, invalid JSON, valid round-trip
- KNOWN_BASE_ANALYZER_IDS allow-list sanity check
- Purity guard: fails if SchemaValidator.java ever pulls in com.azure.* / java.net.http / HttpURLConnection
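The purity guard in the last item can be approximated by scanning the validator's source text for forbidden tokens. The sketch below is an assumption about how such a guard might work, not the test that ships in the PR; the class name `PurityGuardSketch` is hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative purity check: reports forbidden tokens found in a source file's text. */
final class PurityGuardSketch {
    // Tokens taken from the coverage list above.
    static final List<String> FORBIDDEN = List.of("com.azure.", "java.net.http", "HttpURLConnection");

    /** Returns the forbidden tokens that occur anywhere in the given source text. */
    static List<String> violations(String source) {
        return FORBIDDEN.stream()
            .filter(source::contains)
            .collect(Collectors.toList());
    }
}
```

In a JUnit test, the source text could be loaded with `Files.readString` and the result asserted to be empty. A plain substring scan also flags matches inside comments, which is stricter than checking imports alone.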

Bugs fixed during live test

- Env var quoting: values in .env files are typically wrapped in "…", and a raw export keeps the quotes. Added a readEnv() helper that strips one layer of surrounding quotes (mirrors the .NET fix).
- Credential chain: DefaultAzureCredentialBuilder does not expose excludeCredentials in the Java SDK (unlike .NET). Built a focused ChainedTokenCredentialBuilder chain (EnvironmentCredential → AzureCliCredential) so the IMDS probe doesn't stall on dev boxes (~30 s wait on WSL).
- Output JSON shape parity: used the BinaryData protocol overload of beginAnalyzeBinary / beginCreateAnalyzer so we can keep the wire format (valueString/valueNumber/...) and unwrap the LRO envelope {id,status,result,usage} to match Python's poller.result() output and the .NET skill output.
- Fill-rate denominator: the initial implementation used results.size() (doc count) instead of rows.size() (per-field row count), which inflated array-leaf fields to 1300%. Switched to per-field row count to match Python and .NET semantics.
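The quote-stripping fix is small enough to sketch. The version below is a guess at the behaviour described, not the PR's readEnv(): the class name `EnvSketch` and the helper `stripOneQuoteLayer` are hypothetical, and trimming whitespace and accepting single quotes are assumptions beyond what the description states.

```java
/** Illustrative env helper; behaviour beyond "strip one quote layer" is assumed. */
final class EnvSketch {
    /** Strips exactly one layer of matching surrounding quotes, if present. */
    static String stripOneQuoteLayer(String raw) {
        if (raw == null) {
            return null;
        }
        String value = raw.trim();
        if (value.length() >= 2) {
            char first = value.charAt(0);
            char last = value.charAt(value.length() - 1);
            if ((first == '"' || first == '\'') && first == last) {
                return value.substring(1, value.length() - 1);
            }
        }
        return value;
    }

    /** Reads an environment variable and removes one layer of surrounding quotes. */
    static String readEnv(String name) {
        return stripOneQuoteLayer(System.getenv(name));
    }
}
```

Stripping only one layer keeps values that legitimately contain quotes intact: a doubly quoted value loses its outer pair and keeps the inner one.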

Backward compatibility

No public API changes. The skills tree is opt-in tooling that lives under .github/skills/ and is only loaded by Copilot when a user explicitly asks. The cu-skill Maven module is standalone — not a child of azure-client-sdk-parent, not referenced from the package POM — so it doesn't affect normal package builds or the published azure-ai-contentunderstanding artifact.

Checklist

- Live-tested against real CU resource
- Unit tests added (18 cases)
- `mvn -B -DskipTests compile`: clean
- `mvn -B test`: 18/18 passing
- CHANGELOG entry
- README updated
- No public API changes
- Mirrors .NET PR Azure/azure-sdk-for-net#60394 (feat(contentunderstanding): Copilot skills for authoring custom analyzers) and Python PR Azure/azure-sdk-for-python#47218 ([Content Understanding] Add Copilot skills for custom-analyzer authoring)

…nalyzers

Add two GitHub Copilot skills under .github/skills/ that guide users through the iterative cycle of authoring custom Content Understanding analyzers in VS Code:

- cu-sdk-author-analyzer: single-document-type authoring
- cu-sdk-author-analyzer-classify-route: classify-and-route pipelines

Both skills delegate to a small cu-skill Maven tool under .github/skills/_shared/ that exposes three subcommands (extract-layout, create-and-test, create-and-test-router) and a pure-Java SchemaValidator (no com.azure.* deps) so structural mistakes are caught before a service round-trip.

Mirrors the .NET skills shipped in Azure/azure-sdk-for-net#60394 and the Python skills in Azure/azure-sdk-for-python#47218.

Live-tested against a real CU resource with mixed_financial_docs.pdf:

- extract-layout produces .layout.{json,md}
- create-and-test reports per-field fill rate + avg confidence and cleans up the analyzer when --ephemeral is set
- create-and-test-router creates N inner + 1 outer analyzer, prints a category-aware summary, and cleans up all four

New unit tests:

- SchemaValidatorTest (18 cases) mirrors Python's test_skills_shared_schema_validator.py and the .NET SkillSchemaValidatorTests.cs: valid single-type and classify-route schemas, every rejection path, validateFile error handling, KNOWN_BASE_ANALYZER_IDS allow-list sanity, and a purity guard that fails if the validator source ever pulls in com.azure.* or HTTP namespaces.

README:

- New 'What's New' section and two new entries in the existing 'GitHub Copilot Skills' table.

CHANGELOG:

- Entry under 1.1.0-beta.3 (Unreleased) describing the new skills.

Build:

- mvn -B -DskipTests compile: clean
- mvn -B test: 18/18 tests pass

The Maven module is intentionally NOT a child of azure-client-sdk-parent and is NOT referenced from the package POM, so it has zero effect on the published azure-ai-contentunderstanding artifact.

CI cspell check flagged abbreviated locals and a few domain acronyms in the skill-tooling sources. Replaces them with the full words used elsewhere in the file so the dictionary does not need a new entry.

- confs -> confidences (CreateAndTestCommand, CreateAndTestRouterCommand)
- fobj -> fieldObj, vobj -> valueObj (CreateAndTestCommand)
- IMDS -> metadata-service (ExtractLayoutCommand comment)
- prebuilts -> prebuilt analyzers (SchemaValidator javadoc)

No behavior change. All 18 unit tests still pass.

…le paths

Verify-Links flagged 15 broken relative links in the two analyzer skill docs. The Python source skills referenced 'samples/sample_*.py' files at the package root; the Java port mechanically copied the link structure but the Java layout is different:

- Java samples live at src/samples/java/com/azure/ai/contentunderstanding/samples/, not samples/.
- Java does not have the per-sample .md companion files Python ships (Sample06_GetAnalyzer.md, Sample02_AnalyzeUrl.md, etc.); it has only the .java source files.

Rewrites all 15 links to point at the correct .java source under src/samples/java/. All 10 distinct sample files referenced were verified to exist on disk before the rewrite.

…ts from Python

Three changes for parity with Python PR #47218:

1. **summarizeRouted denominator bug fix** (was a port-time regression). The denominator for a per-field fill rate must be the per-category segment count, NOT the per-field row count. Two invoice segments where only one has TotalAmount correctly reports 50% (1/2 segments), not 100% (1/1 rows). This matches Python and .NET; the same bug existed in JS and is being fixed on its PR. The comment claiming 'Mirrors Python/.NET semantics' was misleading and is fixed in this commit.

2. **New CLI-helper test classes**: CreateAndTestCommandTest and CreateAndTestRouterCommandTest. Together they cover the 4 pure-helper tests Python ships (summarize leaf-row flattening, summarizeRouted per-category denominator, summarizeRouted zero-fill, wireInnerIds alias substitution). The denominator regression above was discovered by these tests. Python ships 10 CLI-helper tests total; the remaining 6 either test argparse-specific behaviour (--help smoke, monkey-patched client) or test code structured differently in Java (error-tuple API), so they don't translate 1:1. We cover the substantive behaviour.

3. **Cross-skill SKILL.md updates** that Python's PR #47218 also shipped: a two-stage-pipeline section + a baseAnalyzerId table in cu-sdk-common-knowledge, 'Next step' cross-refs in cu-sdk-sample-run, and a 'step numbering is contract' callout in cu-sdk-setup.
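The denominator rule from the first change can be shown with a small sketch. This is not the PR's summarizeRouted: the class `RoutedFillRateSketch`, the method `fillRate`, and the representation of a segment as a flat field-to-value map are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;

/** Illustrative per-category fill rate; data shapes are assumed, not taken from the PR. */
final class RoutedFillRateSketch {
    /**
     * Fill rate of one field within one category.
     *
     * @param segments one map per segment classified into the category (field name to value)
     * @param field    the field whose fill rate is wanted
     * @return filled segments divided by ALL segments in the category, or 0.0 if there are none
     */
    static double fillRate(List<Map<String, Object>> segments, String field) {
        if (segments.isEmpty()) {
            return 0.0;
        }
        long filled = segments.stream()
            .filter(segment -> segment.get(field) != null)
            .count();
        // Denominator is the segment count, not the number of rows that carry the field.
        return (double) filled / segments.size();
    }
}
```

With two invoice segments where only one contains TotalAmount, this yields 0.5, matching the 50% case described in the commit message. Dividing by the number of rows that carry the field would report 1.0 instead.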

github-actions Bot added the Cognitive - Content Understanding label Jun 29, 2026

chienyuanchang mentioned this pull request Jun 29, 2026: feat(contentunderstanding): Copilot skills for authoring custom analyzers Azure/azure-sdk-for-js#39137 (Draft, 8 tasks)

chienyuanchang added 3 commits June 29, 2026 19:29

chienyuanchang wants to merge 4 commits into main from cu-sdk/custom-analyzer-skills
