CodeLLM-DevKit

codeanalyzer-python (canpy)

A Python static-analysis toolkit: the CLDK backend that emits a canonical symbol table and call graph, as analysis.json or a Neo4j property graph.


canpy is a static analyzer for Python built on Jedi, with optional CodeQL-resolved call edges and Tree-sitter parsing. It produces the canonical CodeLLM-DevKit (CLDK) analysis.json (a symbol table plus a call graph) and can project that same analysis into a Neo4j property graph. It is the Python backend behind CLDK, mirroring its TypeScript (cants) and Java siblings.

Every run produces a symbol table and a call graph. Edges come from Jedi's lexical resolution by default; --codeql resolves additional edges (RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges, also backfilling callees Jedi could not resolve.

Features

  • Symbol table: modules, classes, functions, methods, variables, decorators, imports, and docstrings, with precise source spans.
  • Call graph: Jedi's lexical resolver by default, with optional CodeQL-resolved edges (--codeql) for RPC / third-party / dynamically-dispatched targets, merged with the Jedi edges; CodeQL also backfills callees Jedi could not resolve.
  • Neo4j output: project the analysis into a labeled property graph, either as a self-contained graph.cypher snapshot or as an incremental push to a live database over Bolt.
  • Versioned schema: a machine-readable, version-stamped Neo4j schema contract (--emit schema), checked in as schema.neo4j.json and shipped with every release.
  • Incremental cache: per-file results are cached under .codeanalyzer; --lazy (default) reuses them, --eager forces a clean rebuild. --ray distributes the work across cores.
  • Compact output: canonical analysis.json, or binary analysis.msgpack for smaller artifacts.

Installation

Prerequisites

  • Python 3.10 or newer.

  • A C toolchain and the venv / development headers. The analyzer builds an isolated virtual environment per project (via Python's venv) so Jedi can resolve types and imports:

    # Ubuntu / Debian
    sudo apt install python3-venv python3-dev build-essential
    
    # Fedora / RHEL / CentOS
    sudo dnf group install "Development Tools" && sudo dnf install python3-venv python3-devel
    
    # macOS
    xcode-select --install

Install via pip (PyPI)

pip install codeanalyzer-python
canpy --help

For the optional live Neo4j push (--emit neo4j --neo4j-uri …), install the neo4j extra:

pip install 'codeanalyzer-python[neo4j]'

Install via shell script

Install the CLI as an isolated tool with the one-line installer (provisions via uv / pipx / pip):

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/codellm-devkit/codeanalyzer-python/releases/latest/download/canpy-installer.sh | sh

Install via Homebrew

brew install codellm-devkit/tap/codeanalyzer-python

The formula depends on uv and installs canpy as an isolated, version-pinned uv tool (the package and its dependencies are resolved and cached on first run).

Build from source

This project uses uv for dependency management.

git clone https://github.com/codellm-devkit/codeanalyzer-python
cd codeanalyzer-python
uv sync --all-groups
uv run canpy --help

Usage

canpy --input /path/to/python/project

With no --output, the analysis is printed to stdout as compact JSON; with --output <dir> it is written to analysis.json (or graph.cypher for --emit neo4j, or analysis.msgpack with --format msgpack) in that directory.
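The file-naming rule described above can be summarised as a small lookup. This is an illustrative Python sketch, not canpy's own code: the helper name artifact_name is hypothetical, and it encodes only the file names stated in this README. It does not model the stdout case or the live Bolt push, which write no file.

```python
def artifact_name(emit: str = "json", fmt: str = "json") -> str:
    """Return the documented artifact file name for an --emit / --format pair.

    Hypothetical helper: it mirrors the README text only.
    """
    if emit == "neo4j":
        return "graph.cypher"      # snapshot written when --neo4j-uri is omitted
    if emit == "schema":
        return "schema.json"       # the Neo4j schema contract
    if fmt == "msgpack":
        return "analysis.msgpack"  # binary variant of the default target
    return "analysis.json"         # default: --emit json --format json


print(artifact_name())                       # analysis.json
print(artifact_name(fmt="msgpack"))          # analysis.msgpack
print(artifact_name(emit="neo4j"))           # graph.cypher
```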

Options

$ canpy --help

 Usage: canpy [OPTIONS] COMMAND [ARGS]...

 Static Analysis on Python source code using Jedi, PyCG and Tree sitter.

╭─ Options ────────────────────────────────────────────────────────────────────╮
│ --input            -i                     PATH              Path to the      │
│                                                             project root     │
│                                                             directory (not   │
│                                                             required for     │
│                                                             --emit schema).  │
│ --output           -o                     PATH              Output directory │
│                                                             for artifacts.   │
│ --format           -f                     [json|msgpack]    Output format    │
│                                                             for --emit json: │
│                                                             json or msgpack. │
│                                                             [default: json]  │
│ --emit                                    [json|neo4j|sche  Output target:   │
│                                           ma]               json             │
│                                                             (analysis.json,  │
│                                                             default) | neo4j │
│                                                             (graph.cypher or │
│                                                             live Bolt push)  │
│                                                             | schema (the    │
│                                                             Neo4j            │
│                                                             schema.json      │
│                                                             contract).       │
│                                                             [default: json]  │
│ --app-name                                TEXT              Logical          │
│                                                             application name │
│                                                             for the graph    │
│                                                             :PyApplication   │
│                                                             anchor (default: │
│                                                             input dir name). │
│ --neo4j-uri                               TEXT              Push the graph   │
│                                                             to a live Neo4j  │
│                                                             over Bolt        │
│                                                             (incremental);   │
│                                                             omit to write    │
│                                                             graph.cypher.    │
│                                                             [env var:        │
│                                                             NEO4J_URI]       │
│ --neo4j-user                              TEXT              Neo4j username.  │
│                                                             [env var:        │
│                                                             NEO4J_USERNAME]  │
│                                                             [default: neo4j] │
│ --neo4j-password                          TEXT              Neo4j password.  │
│                                                             Prefer the env   │
│                                                             var over the     │
│                                                             flag (the flag   │
│                                                             is visible in    │
│                                                             shell history /  │
│                                                             process list).   │
│                                                             [env var:        │
│                                                             NEO4J_PASSWORD]  │
│                                                             [default: neo4j] │
│ --neo4j-database                          TEXT              Neo4j database   │
│                                                             name (default:   │
│                                                             server default). │
│                                                             [env var:        │
│                                                             NEO4J_DATABASE]  │
│ --analysis-level   -a                     INTEGER RANGE     Analysis depth:  │
│                                           [1<=x<=2]         1=symbol         │
│                                                             table+Jedi call  │
│                                                             graph, 2=+PyCG   │
│                                                             call graph.      │
│                                                             [default: 1]     │
│ --ray                  --no-ray                             Enable Ray for   │
│                                                             distributed      │
│                                                             analysis.        │
│                                                             [default:        │
│                                                             no-ray]          │
│ --eager                --lazy                               Enable eager or  │
│                                                             lazy analysis.   │
│                                                             Defaults to      │
│                                                             lazy.            │
│                                                             [default: lazy]  │
│ --skip-tests           --include-tests                      Skip test files  │
│                                                             in analysis.     │
│                                                             [default:        │
│                                                             skip-tests]      │
│ --no-venv              --venv                               Skip virtualenv  │
│                                                             creation and     │
│                                                             dependency       │
│                                                             installation;    │
│                                                             resolve imports  │
│                                                             against the      │
│                                                             ambient Python   │
│                                                             environment      │
│                                                             instead.         │
│                                                             [default: venv]  │
│ --file-name                               PATH              Analyze only the │
│                                                             specified file   │
│                                                             (relative to     │
│                                                             input            │
│                                                             directory).      │
│ --cache-dir        -c                     PATH              Directory to     │
│                                                             store analysis   │
│                                                             cache. Defaults  │
│                                                             to               │
│                                                             '.codeanalyzer'  │
│                                                             in the input     │
│                                                             directory.       │
│ --clear-cache          --keep-cache                         Clear cache      │
│                                                             after analysis.  │
│                                                             By default,      │
│                                                             cache is         │
│                                                             retained.        │
│                                                             [default:        │
│                                                             keep-cache]      │
│                    -v                     INTEGER           Increase         │
│                                                             verbosity: -v,   │
│                                                             -vv, -vvv        │
│                                                             [default: 0]     │
│ --pycg-shard           --no-pycg-shard                      Shard PyCG       │
│                                                             call-graph       │
│                                                             analysis by      │
│                                                             Python package   │
│                                                             (level 2 only).  │
│                                                             When the project │
│                                                             exceeds the      │
│                                                             500-file         │
│                                                             ceiling, PyCG is │
│                                                             run              │
│                                                             independently    │
│                                                             per top-level    │
│                                                             package with     │
│                                                             cross-package    │
│                                                             imports treated  │
│                                                             as ghost nodes.  │
│                                                             Without this     │
│                                                             flag, projects   │
│                                                             over the ceiling │
│                                                             fall back to     │
│                                                             Jedi-only edges. │
│                                                             [default:        │
│                                                             no-pycg-shard]   │
│ --pycg-shard-cei…                         INTEGER RANGE     Maximum files    │
│                                           [x>=1]            per shard when   │
│                                                             --pycg-shard is  │
│                                                             active (default  │
│                                                             100). Shards     │
│                                                             exceeding this   │
│                                                             limit are        │
│                                                             skipped; their   │
│                                                             call edges are   │
│                                                             omitted from the │
│                                                             call graph (Jedi │
│                                                             edges for those  │
│                                                             packages are     │
│                                                             still included). │
│                                                             Lower values are │
│                                                             safer for        │
│                                                             packages with    │
│                                                             deep class       │
│                                                             hierarchies or   │
│                                                             heavy import     │
│                                                             graphs.          │
│                                                             [default: 100]   │
│ --pycg-shard-tim…                         INTEGER RANGE     Per-shard        │
│                                           [x>=0]            wall-clock       │
│                                                             timeout in       │
│                                                             seconds when     │
│                                                             --pycg-shard is  │
│                                                             active (default  │
│                                                             120). A shard    │
│                                                             that exceeds     │
│                                                             this limit is    │
│                                                             skipped          │
│                                                             gracefully.      │
│                                                             PyCG's fixpoint  │
│                                                             is bimodal: it   │
│                                                             either converges │
│                                                             quickly or       │
│                                                             diverges         │
│                                                             indefinitely, so │
│                                                             the timeout acts │
│                                                             as a final       │
│                                                             safety net after │
│                                                             the file-count   │
│                                                             ceiling. Set to  │
│                                                             0 to disable.    │
│                                                             POSIX only       │
│                                                             (macOS / Linux); │
│                                                             ignored on       │
│                                                             Windows.         │
│                                                             [default: 120]   │
│ --pycg-shard-str…                         [jedi|package]    How --pycg-shard │
│                                                             groups files     │
│                                                             (level 2 only).  │
│                                                             'jedi' (default) │
│                                                             partitions the   │
│                                                             Jedi             │
│                                                             module-dependen… │
│                                                             graph (SCC +     │
│                                                             Louvain) so      │
│                                                             tightly-coupled  │
│                                                             modules          │
│                                                             co-compute and   │
│                                                             few call edges   │
│                                                             are severed      │
│                                                             between shards;  │
│                                                             import cycles    │
│                                                             are never split. │
│                                                             'package' uses   │
│                                                             the legacy       │
│                                                             one-shard-per-p… │
│                                                             grouping.        │
│                                                             [default: jedi]  │
│ --pycg-max-iter                           INTEGER RANGE     Cap on PyCG's    │
│                                           [x>=-1]           fixpoint passes  │
│                                                             per              │
│                                                             shard/project    │
│                                                             (level 2;        │
│                                                             default 50).     │
│                                                             PyCG iterates    │
│                                                             until its        │
│                                                             points-to state  │
│                                                             stops changing,  │
│                                                             but its          │
│                                                             access-path      │
│                                                             domain has no    │
│                                                             convergence      │
│                                                             bound, so heavy  │
│                                                             metaclass/mixin  │
│                                                             code (e.g. an    │
│                                                             ORM) can loop    │
│                                                             with each pass   │
│                                                             costing seconds. │
│                                                             The cap returns  │
│                                                             a                │
│                                                             sound-but-incom… │
│                                                             call graph       │
│                                                             instead of       │
│                                                             looping until    │
│                                                             the timeout      │
│                                                             kills it. Set to │
│                                                             -1 for PyCG's    │
│                                                             unbounded        │
│                                                             run-to-converge… │
│                                                             behaviour.       │
│                                                             [default: 50]    │
│ --help                                                      Show this        │
│                                                             message and      │
│                                                             exit.            │
╰──────────────────────────────────────────────────────────────────────────────╯

Examples

  1. Basic analysis to stdout, or to a file:

    canpy --input ./my-python-project                        # compact JSON on stdout
    canpy --input ./my-python-project --output ./out         # → ./out/analysis.json
  2. Binary output (msgpack):

    canpy --input ./my-python-project --output ./out --format msgpack   # → ./out/analysis.msgpack
  3. Resolve extra call edges with CodeQL:

    canpy --input ./my-python-project --codeql

    By default, edges come from Jedi's lexical analysis. Adding --codeql resolves additional edges (including RPC / third-party / dynamically-dispatched targets) and merges them with the Jedi-derived edges; CodeQL also backfills callees that Jedi could not resolve. CodeQL integration is experimental; the CodeQL CLI is downloaded into <cache_dir>/codeql/ on first use.

  4. Emit a Neo4j snapshot, or push to a live database:

    canpy --input ./my-python-project --emit neo4j --output ./out   # → ./out/graph.cypher
    canpy --input ./my-python-project --emit neo4j \
      --neo4j-uri bolt://localhost:7687 --neo4j-user neo4j --neo4j-password secret
  5. Emit the Neo4j schema contract:

    canpy --emit schema                   # print schema.json to stdout (no project needed)
    canpy --emit schema --output ./out    # → ./out/schema.json
  6. Force a clean rebuild with a custom cache directory:

    canpy --input ./my-python-project --eager --cache-dir /path/to/custom-cache

Output targets

canpy builds one analysis in memory and can emit it three ways (--emit):

analysis.json (default)

A PyApplication document, the canonical CLDK contract:

{
  "symbol_table": { /* file path → module (classes, functions, variables, imports, …) */ },
  "call_graph":   [ /* CALL_DEP edges: { source, target, weight, provenance } keyed by callable signature */ ]
}

By default this is printed to stdout in JSON; with --output it is written to analysis.json (or analysis.msgpack with --format msgpack, a more compact binary format).
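As a rough illustration of consuming this document, the Python sketch below counts outgoing call edges per caller. It relies only on the top-level call_graph list and the source / target edge fields named above. The sample signatures (app.main and so on) and the string form of provenance are invented for the example and are not real canpy output.

```python
import json
from collections import Counter


def fan_out(analysis: dict) -> Counter:
    """Count outgoing call edges per caller signature (edge weights are ignored)."""
    counts = Counter()
    for edge in analysis.get("call_graph", []):
        counts[edge["source"]] += 1
    return counts


# Tiny hand-written document in the documented shape. Signature strings and
# provenance values are assumptions made for this example.
sample = json.loads("""
{
  "symbol_table": {},
  "call_graph": [
    {"source": "app.main", "target": "app.load", "weight": 1, "provenance": "jedi"},
    {"source": "app.main", "target": "requests.get", "weight": 2, "provenance": "codeql"},
    {"source": "app.load", "target": "json.loads", "weight": 1, "provenance": "jedi"}
  ]
}
""")

print(fan_out(sample).most_common(1))  # [('app.main', 2)]
```

For a real run, replace the inline sample with json.load over the analysis.json written by --output.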

Neo4j graph

--emit neo4j projects the same analysis into a labeled property graph. Every node label is Py-prefixed and every relationship type is PY_-prefixed (e.g. :PyClass, PY_CALLS) so multiple language analyzers can share one database without label or relationship-type collisions. Declarations are keyed by their signature under a shared :PySymbol label; calls, imports, inheritance, decorators, and call sites are relationships. The graph can be delivered in two ways:

  • Without --neo4j-uri: writes a self-contained graph.cypher (constraints + indexes, a scoped wipe, then batched MERGEs). Load it with cypher-shell < graph.cypher. Needs no extra dependencies.
  • With --neo4j-uri: pushes to a live Neo4j over Bolt incrementally. Only modules whose content hash changed are rewritten, and on a full run modules whose source file vanished are pruned. Requires the neo4j extra. Every graph carries a schema_version on its :PyApplication node.
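The incremental behaviour of the live push (rewrite changed modules, prune vanished ones) can be pictured with a small content-hash diff. This is a minimal sketch under stated assumptions, not canpy's implementation: the function names are hypothetical, and SHA-256 is chosen here for illustration because the README does not say which hash is used.

```python
import hashlib


def content_hash(source: bytes) -> str:
    # Assumption: SHA-256 over the raw file bytes.
    return hashlib.sha256(source).hexdigest()


def plan_push(previous: dict[str, str],
              current_sources: dict[str, bytes]) -> tuple[list[str], list[str]]:
    """Return (modules to rewrite, modules to prune).

    previous maps module path to the hash stored by the last push;
    current_sources maps module path to the file content seen in this run.
    """
    current = {path: content_hash(src) for path, src in current_sources.items()}
    # New modules have no previous hash, so they count as changed.
    rewrite = sorted(p for p, h in current.items() if previous.get(p) != h)
    # Modules known from the last push but absent now are pruned.
    prune = sorted(p for p in previous if p not in current)
    return rewrite, prune


previous = {
    "a.py": content_hash(b"x = 1\n"),
    "b.py": content_hash(b"y = 2\n"),
    "old.py": content_hash(b""),
}
current = {"a.py": b"x = 1\n", "b.py": b"y = 3\n", "new.py": b"z = 0\n"}
print(plan_push(previous, current))  # (['b.py', 'new.py'], ['old.py'])
```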

Call-graph endpoints that aren't present in the symbol table (third-party / framework / RPC targets) are materialized as :PyExternal ghost nodes, mirroring the analyzer's own ghost-node behaviour.
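One way to picture which endpoints end up as ghost nodes is a set difference between edge endpoints and declared signatures. The sketch below is illustrative Python: the helper name is hypothetical, and it assumes the caller has already collected the set of signatures declared in the symbol table.

```python
def external_endpoints(call_graph: list[dict], known_signatures: set[str]) -> set[str]:
    """Collect edge endpoints that have no declaration in the symbol table."""
    endpoints: set[str] = set()
    for edge in call_graph:
        endpoints.update((edge["source"], edge["target"]))
    return endpoints - known_signatures


# Invented example edges; the signature strings are placeholders.
edges = [
    {"source": "app.main", "target": "app.load"},
    {"source": "app.load", "target": "requests.get"},
]
print(external_endpoints(edges, {"app.main", "app.load"}))  # {'requests.get'}
```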

The connection options also read from the standard Neo4j environment variables (NEO4J_URI, NEO4J_USERNAME, NEO4J_PASSWORD, NEO4J_DATABASE) when the corresponding flag is omitted; an explicit flag wins. Prefer the env var for the password so it doesn't land in shell history or the process list:

export NEO4J_URI=bolt://localhost:7687
export NEO4J_PASSWORD=secret
canpy -i ./my-project --emit neo4j     # credentials picked up from the environment
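The precedence described above (explicit flag, then environment variable, then default) can be sketched in a few lines of Python. The function resolve_option is a hypothetical illustration, not part of canpy; the default user name neo4j comes from the help text.

```python
import os
from typing import Mapping, Optional


def resolve_option(flag_value: Optional[str], env_name: str,
                   default: Optional[str] = None,
                   env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Explicit flag first, then the environment variable, then the default."""
    if flag_value is not None:
        return flag_value
    return env.get(env_name, default)


# A stand-in environment so the example does not depend on the real one.
env = {"NEO4J_URI": "bolt://localhost:7687", "NEO4J_PASSWORD": "secret"}
print(resolve_option(None, "NEO4J_URI", env=env))                # bolt://localhost:7687
print(resolve_option("bolt://db:7687", "NEO4J_URI", env=env))    # bolt://db:7687
print(resolve_option(None, "NEO4J_USERNAME", "neo4j", env=env))  # neo4j
```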

Schema contract

--emit schema writes the machine-readable, version-stamped Neo4j schema (schema.json: node labels, relationships, properties, constraints, and indexes). It needs no project and is checked into the repo as schema.neo4j.json and bundled in every release as a GitHub Release asset, so a consumer can validate producer/consumer compatibility without invoking the tool. The shape of the contract matches the codeanalyzer-typescript backend.
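A consumer-side compatibility check might compare the schema version stamped on the graph with the version the consumer was built against. The sketch below is only one possible policy. It assumes dotted numeric versions (MAJOR.MINOR.PATCH) and a "same major, not newer than supported" rule, neither of which is stated by this README, and the function names are hypothetical.

```python
def parse_version(text: str) -> tuple[int, ...]:
    # Assumption: versions are dotted integers such as "1.4.0".
    return tuple(int(part) for part in text.split("."))


def is_compatible(graph_version: str, supported_version: str) -> bool:
    """True when the major versions match and the graph is not newer than supported."""
    graph = parse_version(graph_version)
    supported = parse_version(supported_version)
    return graph[0] == supported[0] and graph <= supported


print(is_compatible("1.2.0", "1.4.0"))  # True
print(is_compatible("2.0.0", "1.4.0"))  # False
```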

A UML of the analysis.json schema (the PyApplication containment tree) is checked in as schema-uml.drawio, and the property-graph schema as neo4j-schema.drawio.

Development

This project uses uv.

uv sync --all-groups
uv run canpy --input /path/to/project           # run from source
uv run canpy --emit schema > schema.neo4j.json  # regenerate the checked-in schema contract
uv run python scripts/update_readme.py          # regenerate the canpy --help block above
uv run pytest                                   # run the test suite

The Neo4j schema-conformance test always runs. The Neo4j Bolt integration test spins up a real Neo4j via Testcontainers and is opt-in: it needs a container runtime (Docker or Podman) and is enabled with an environment variable:

RUN_CONTAINER_TESTS=1 uv run pytest test/test_neo4j_bolt.py -s

License

Apache 2.0. See LICENSE.