InternScience/GraphAnything

GraphAnything

Turn anything into a navigable knowledge graph. Markdown vaults, OpenAPI specs, contracts, meeting notes, chat transcripts — all converge on the same {nodes, edges} schema with full provenance, versioning, federation, and quality reporting.

LLM-driven extraction goes through GraphAnything's built-in OpenAI-compatible client, so you can point it at any chat.completions-shaped endpoint:

  • Local vLLM serve, llama.cpp, Ollama, LM Studio
  • OpenAI itself, or any commercial OpenAI-compatible host

Chinese version → README.zh.md

At a glance

Schema presets    10 (chat-log / codebase / contracts / db-schema / fstree / meeting / obsidian-vault / openapi / papers / pr-review)
Extractors        8 (markdown / json-yaml / openapi / fstree / chatlog / llm-entity / vlm (stub) / noop)
Render formats    9 (mermaid / html / svg / cypher / graphml / ascii / json / canvas / timeline)
MCP tools         17
CLI sub-commands  19
LLM backend       any OpenAI-compatible endpoint (HTTP)
External API      optional; read-only / rule-based paths need no LLM

Three drivers, one core

              ┌──────────────────────────────────────────┐
              │   GraphAnything (core)                   │
              │   Session state machine + 8 extractors   │
              │   + 10 schema presets + 9 viz formats    │
              │   + temporal + federate + ask + quality  │
              └──────────────────────────────────────────┘
                  ▲              ▲              ▲
                  │              │              │
        ┌─────────┴────┐  ┌──────┴────┐  ┌──────┴──────────┐
        │  CLI / REPL  │  │  Skill    │  │  llm_client.py  │
        │ graphanything│  │ /graphany.│  │ (OpenAI-compat) │
        └──────────────┘  └───────────┘  └─────────────────┘

Install

cd GraphAnything
pip install -e .

This registers the graphanything console-script. As a fallback, python -m GraphAnything.cli ... always works.

Optional extras:

pip install -e ".[mcp]"     # MCP stdio server (Claude Code / Cursor / Gemini CLI)
pip install -e ".[neo4j]"   # direct push to a running Neo4j instance
pip install -e ".[svg]"     # SVG renderer (matplotlib)
pip install -e ".[repl]"    # nicer REPL (history + completion)
pip install -e ".[all]"     # everything above

For LLM-gated commands (refine --llm, sample --extractor llm-entity, ask --llm, eval --llm), point the client at any OpenAI-compatible chat-completions endpoint:

# vLLM serve / llama.cpp / Ollama / LM Studio / OpenAI / …
export GA_API_BASE=http://localhost:8000/v1     # default
export GA_MODEL=Qwen3-32B-Instruct              # required
export GA_API_KEY=local                         # optional; many local servers
                                                # accept any string

# Legacy upstream env vars are also honoured:
#   OPENAI_API_BASE / OPENAI_API_KEY / OPENAI_MODEL
#   API_BASE        / API_KEY        / SUMMARY_MODEL_NAME

Rule-based extractors (markdown, json-yaml, openapi, fstree, chatlog) and all read-only commands (render, explain, ask without --llm, versions, diff, federate, eval without --llm) need no LLM at all.

CLI quickstart

# One-shot end-to-end
graphanything new ./vault/ --preset obsidian-vault --auto

# Step-by-step (recommended for non-trivial corpora)
graphanything new ./contracts --preset contracts
graphanything sample --n 5
graphanything review --merge ABC_Corp,abc_corp,ABC公司
graphanything refine "add GoverningLaw entity"
graphanything run

# Query
graphanything ask "all clauses with amount > 100k"
graphanything explain ep_api_get_user

# Versioning (incremental re-extract on changed files only)
graphanything update                # re-hash all inputs; redo only changed
graphanything versions              # list snapshots written so far
graphanything diff 1 2              # what changed between v1 and v2

# Federation
graphanything federate g1.json g2.json --out universe.json --fuzzy

# Quality
graphanything eval --out-dir graphanything-out/quality --llm --judge-n 20

# Render (9 formats)
graphanything render --fmt mermaid                       # for Claude / chat
graphanything render --fmt cypher  --out g.cypher        # → Neo4j
graphanything render --fmt graphml --out g.graphml       # → Gephi
graphanything render --fmt html    --out g.html          # standalone, force-directed
graphanything render --fmt json    --out g.json          # NetworkX JSON
graphanything render --fmt timeline --out timeline.html  # X = year, Y = community
graphanything render --fmt canvas  --out g.canvas        # Obsidian Canvas
graphanything render --fmt ascii                         # piped into terminal
graphanything render --fmt svg     --out g.svg
graphanything render --fmt mermaid --budget-tokens 4000  # PageRank-prune to fit
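
The --budget-tokens option above is described only as "PageRank-prune to fit". The following is a minimal sketch of what such pruning could look like, not GraphAnything's actual code: `pagerank` and `prune_to_budget` are hypothetical helpers, and the per-node and per-edge token costs are made-up estimates chosen for illustration.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def pagerank(nodes: List[str], edges: List[Edge],
             damping: float = 0.85, iterations: int = 50) -> Dict[str, float]:
    """Plain power-iteration PageRank over a directed edge list."""
    if not nodes:
        return {}
    n = len(nodes)
    out_links: Dict[str, List[str]] = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


def prune_to_budget(nodes: List[str], edges: List[Edge], budget_tokens: int,
                    tokens_per_node: int = 12,   # assumed cost, illustration only
                    tokens_per_edge: int = 8) -> Tuple[List[str], List[Edge]]:
    """Keep the highest-ranked nodes (and edges among them) that fit the budget."""
    rank = pagerank(nodes, edges)
    kept: Set[str] = set()
    kept_edges: List[Edge] = []
    used = 0
    for node in sorted(nodes, key=lambda name: (-rank[name], name)):
        candidate = kept | {node}
        # Edges that become renderable once this node is included.
        new_edges = [(s, d) for s, d in edges
                     if node in (s, d) and s in candidate and d in candidate]
        cost = tokens_per_node + tokens_per_edge * len(new_edges)
        if used + cost > budget_tokens:
            break
        kept, used = candidate, used + cost
        kept_edges.extend(new_edges)
    return sorted(kept), kept_edges


nodes = ["hub", "a", "b", "c", "d"]
edges = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("d", "hub"), ("hub", "a")]
kept_nodes, kept_edges = prune_to_budget(nodes, edges, budget_tokens=40)
print(kept_nodes, kept_edges)
```

With a budget of 40 estimated tokens, only the two best-connected nodes and the edges between them survive; the low-rank leaves are dropped first.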

All 19 CLI sub-commands

Sub-command                                                        Purpose
new <inputs>                                                       Open a session; --preset NAME, --extractor NAME, --auto, budget caps
propose [--n N] [--llm]                                            Suggest an initial schema (rule-derived if possible, else generic / LLM)
refine "<instruction>" [--llm]                                     Edit the schema (regex first, LLM fallback)
sample [--n N] [--extractor NAME]                                  Extract from N inputs into pending, propose merges
review                                                             Apply review actions: --accept-all / --accept ID... / --reject ID... [--reason ...] / --merge a,b[,c]
run [--out DIR] [--extractor NAME]                                 Lock schema → run all inputs → write graph.json + snapshot
update [--out DIR] [--extractor NAME]                              Re-extract only inputs whose source_hash changed; new snapshot
versions [--out-root R]                                            List snapshots written by run / update
diff <v_old> <v_new>                                               Diff two snapshots (added / removed / modified nodes & edges)
ask "<question>" [--llm]                                           NL query → graph traversal (regex first, LLM fallback)
explain <id|label|"src → rel → tgt">                               Provenance for a node or edge
render --fmt FMT [--out PATH] [--graph G] [--budget-tokens N]      9 formats (see the Render examples in CLI quickstart)
federate g1 g2... --out U [--fuzzy] [--fuzzy-threshold T] [--llm]  Merge multiple graphs into one universe
eval [--out-dir D] [--llm] [--judge-n N] [--graph G]               Coverage / dedup / per-extractor / sampled LLM-judge
presets                                                            List the 10 built-in schema presets
extractors                                                         List the 8 registered extractors
sessions                                                           List sessions in graphanything-out/sessions/
use <session_id>                                                   Switch the active session pointer
repl [<inputs>]                                                    Interactive shell (history + completion via prompt_toolkit if installed)
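
The diff sub-command in the table reports added / removed / modified nodes and edges between two snapshots. One way such a diff could be computed is sketched below. This is an illustration, not the project's implementation: `diff_items` is a hypothetical helper, and the edge field names (source, relation, target) are assumptions about the snapshot layout.

```python
from typing import Any, Callable, Dict, Hashable, List


def diff_items(old: List[Dict[str, Any]], new: List[Dict[str, Any]],
               key: Callable[[Dict[str, Any]], Hashable]) -> Dict[str, list]:
    """Classify items as added, removed, or modified, matching them by `key`."""
    old_index = {key(item): item for item in old}
    new_index = {key(item): item for item in new}
    shared = old_index.keys() & new_index.keys()
    return {
        "added": sorted(new_index.keys() - old_index.keys()),
        "removed": sorted(old_index.keys() - new_index.keys()),
        "modified": sorted(k for k in shared if old_index[k] != new_index[k]),
    }


nodes_v1 = [{"id": "a", "label": "A"}, {"id": "b", "label": "B"}]
nodes_v2 = [{"id": "b", "label": "B (renamed)"}, {"id": "c", "label": "C"}]
node_diff = diff_items(nodes_v1, nodes_v2, key=lambda node: node["id"])

# Assumed edge identity: the (source, relation, target) triple.
edges_v1 = [{"source": "a", "relation": "links_to", "target": "b"}]
edges_v2 = [{"source": "c", "relation": "links_to", "target": "b"}]
edge_diff = diff_items(
    edges_v1, edges_v2,
    key=lambda edge: (edge["source"], edge["relation"], edge["target"]),
)

print(node_diff)
print(edge_diff)
```

Matching by a stable key rather than by list position means a reordered snapshot does not show up as a change.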

Top-level flags (apply to every sub-command):

  • --sessions-dir PATH — where session JSONs live (default graphanything-out/sessions/).
  • --session ID — override the "current session" pointer for one call.

new and repl accept budget soft caps:

  • --max-tokens N — total token ceiling
  • --max-dollars D — total $ ceiling
  • --max-api-calls N — total LLM-call ceiling

run / sample stop early when any cap is exceeded; remaining inputs are listed in the result notes.
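
The early-stop behaviour can be pictured with a short sketch. It assumes the caps are checked between inputs, which is why they are soft: the input that crosses a cap still completes. `Budget`, `run_with_budget`, and `fake_extract` are hypothetical names for illustration, not part of GraphAnything's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Budget:
    max_tokens: Optional[int] = None
    max_dollars: Optional[float] = None
    max_api_calls: Optional[int] = None
    tokens: int = 0
    dollars: float = 0.0
    api_calls: int = 0

    def charge(self, tokens: int, dollars: float, api_calls: int = 1) -> None:
        """Record the cost of one LLM call."""
        self.tokens += tokens
        self.dollars += dollars
        self.api_calls += api_calls

    def exceeded(self) -> bool:
        """True once any configured cap has been passed."""
        checks = (
            (self.max_tokens, self.tokens),
            (self.max_dollars, self.dollars),
            (self.max_api_calls, self.api_calls),
        )
        return any(cap is not None and spent > cap for cap, spent in checks)


def run_with_budget(inputs: Sequence[str],
                    process: Callable[[str, Budget], None],
                    budget: Budget) -> Tuple[List[str], List[str]]:
    """Process inputs in order; return (processed, remaining) when a cap trips."""
    done: List[str] = []
    for index, item in enumerate(inputs):
        if budget.exceeded():
            return done, list(inputs[index:])
        process(item, budget)
        done.append(item)
    return done, []


def fake_extract(path: str, budget: Budget) -> None:
    # Stand-in for an LLM-backed extraction that costs 1000 tokens per input.
    budget.charge(tokens=1000, dollars=0.002)


done, remaining = run_with_budget(
    ["a.md", "b.md", "c.md", "d.md", "e.md"], fake_extract, Budget(max_tokens=2500)
)
print(done, remaining)
```

Here the third input pushes usage to 3000 tokens, past the 2500 cap, so the last two inputs are returned as remaining rather than processed.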

Skill / MCP server (Claude Code, Cursor, Gemini, …)

The same Session core, accessed through 17 MCP tools:

Tool                           Purpose
graphanything_open_session     Start a session over inputs
graphanything_list_presets     10 built-in schema templates
graphanything_list_extractors  8 extractors (rule + LLM + VLM stub)
graphanything_propose_schema   Suggest a starting schema
graphanything_refine_schema    Edit schema (regex + LLM fallback)
graphanything_sample           Extract from N inputs into pending
graphanything_review           Apply accept_all / accept / reject / merge / rule actions
graphanything_run              Full extraction → graph.json + snapshot
graphanything_status           Counts + cost + schema
graphanything_ask              Natural-language query
graphanything_explain          Provenance for one node / edge
graphanything_update           Incremental re-extract on changed files
graphanything_versions         List graph snapshots
graphanything_diff             Diff two snapshots
graphanything_federate         Combine multiple graphs
graphanything_eval             Coverage / dedup / quality report
graphanything_render           Mermaid / HTML / SVG / Cypher / GraphML / ASCII / JSON / Canvas / Timeline

Start the server:

python -m GraphAnything.serve

Wire into Claude Code / Cursor / Gemini CLI by registering the same process in your MCP config (~/.claude.json / .mcp.json / equivalent):

{
  "mcpServers": {
    "graphanything": {
      "command": "python",
      "args": ["-m", "GraphAnything.serve"],
      "env": {
        "GA_API_BASE": "http://localhost:8000/v1",
        "GA_MODEL": "Qwen3-32B-Instruct",
        "GA_API_KEY": "local"
      }
    }
  }
}

10 schema presets

graphanything presets lists them; graphanything new --preset NAME applies one. Drop your own YAML in GraphAnything/schemas/<name>.yaml to register a new one.

Preset          Domain
chat-log        Slack / Claude Code .jsonl / Discord → user / message / tool
codebase        Source repo → module / file / class / function / import / call
contracts       Legal contracts → party / clause / date / amount / governing law
db-schema       DDL / migrations / ORM → table / column / FK / index
fstree          Plain filesystem → directory / file / symlink
meeting         Meeting notes → person / topic / decision / action item
obsidian-vault  Obsidian / Notion vault → note / tag / wikilink / backlink
openapi         OpenAPI 2.x/3.x spec → endpoint / schema / ref / security
papers          Generic LLM-driven paper extraction
pr-review       GitHub PR trail → file / function / reviewer / concern

8 built-in extractors

graphanything extractors lists them. A suffix-based dispatcher picks one for each input unless --extractor NAME overrides it.

Extractor   LLM?    Handles                             Notes
markdown    no      .md, .markdown                      Note / Heading / Tag / WikiLink
json-yaml   no      .json, .ndjson, .yaml, .yml, .toml  Generic config tree + $ref
openapi     no      .yaml, .yml, .json                  API / Endpoint / Schema / Parameter (force via --extractor openapi)
fstree      no      directories                         Directory / File / Symlink
chatlog     no      .jsonl, .txt, .log                  Channel / User / Message / Tool / ToolCall
llm-entity  yes     * (any text)                        Generic entity / relation, with evidence_span + rationale
vlm         (stub)  .pdf, .png, .jpg, .jpeg             Stub; install a plugin to enable
noop        no      *                                   Empty graph; for tests
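
As a rough picture of suffix-based dispatch with an explicit override, consider the sketch below. It is not the project's dispatcher: `pick_extractor` is a hypothetical helper, the suffix lists are copied from the table, and two choices are assumptions: that json-yaml wins for .yaml / .yml / .json unless openapi is forced, and that unknown suffixes fall back to llm-entity.

```python
from pathlib import Path
from typing import Optional, Union

# Suffixes copied from the table above. Assumption: json-yaml is the default
# for .yaml / .yml / .json, and openapi is only used when forced.
SUFFIX_TO_EXTRACTOR = {
    ".md": "markdown", ".markdown": "markdown",
    ".json": "json-yaml", ".ndjson": "json-yaml",
    ".yaml": "json-yaml", ".yml": "json-yaml", ".toml": "json-yaml",
    ".jsonl": "chatlog", ".txt": "chatlog", ".log": "chatlog",
    ".pdf": "vlm", ".png": "vlm", ".jpg": "vlm", ".jpeg": "vlm",
}


def pick_extractor(path: Union[str, Path],
                   override: Optional[str] = None,
                   fallback: str = "llm-entity") -> str:
    """Return the extractor name for one input path."""
    if override:                      # plays the role of --extractor NAME
        return override
    path = Path(path)
    if path.is_dir():
        return "fstree"
    # Assumed fallback for suffixes no rule-based extractor claims.
    return SUFFIX_TO_EXTRACTOR.get(path.suffix.lower(), fallback)


print(pick_extractor("notes/README.MD"))
print(pick_extractor("api.yaml"))
print(pick_extractor("api.yaml", override="openapi"))
print(pick_extractor("."))
```

Because .yaml and .json are claimed by more than one extractor, an explicit override is the only unambiguous way to select openapi in this sketch, which matches the "force via --extractor openapi" note in the table.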

Adding a new extractor (Python plugin):

from GraphAnything import register_extractor

def extract_my_format(path, **_):
    return {
        "nodes": [{"id": "x", "label": "X", "file_type": "document",
                   "source_file": str(path)}],
        "edges": [],
    }

register_extractor(
    "my-format", extract_my_format,
    version="0.1.0", handles=(".myext",),
    description="My custom format extractor",
)

run_extractor() automatically stamps provenance (extractor_id, extractor_version, extraction_time, source_hash).
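
To make the four provenance fields concrete, here is a sketch of what stamping might amount to. It is an assumption-laden illustration, not run_extractor() itself: `stamp_provenance` is a hypothetical helper, SHA-256 is an assumed hash, and writing the fields directly onto each node and edge is an assumed layout.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, Union


def stamp_provenance(graph: Dict[str, Any], path: Union[str, Path],
                     extractor_id: str, extractor_version: str) -> Dict[str, Any]:
    """Attach the four provenance fields to every node and edge in place."""
    stamp = {
        "extractor_id": extractor_id,
        "extractor_version": extractor_version,
        "extraction_time": datetime.now(timezone.utc).isoformat(),
        # Assumption: the hash covers the raw bytes of the source file.
        "source_hash": hashlib.sha256(Path(path).read_bytes()).hexdigest(),
    }
    for item in graph.get("nodes", []) + graph.get("edges", []):
        for field_name, value in stamp.items():
            item.setdefault(field_name, value)  # never overwrite extractor output
    return graph


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "note.myext")
    source.write_bytes(b"hello")
    graph = {"nodes": [{"id": "x", "label": "X"}], "edges": []}
    stamped = stamp_provenance(graph, source, "my-format", "0.1.0")

print(sorted(stamped["nodes"][0]))
```

Stamping in one central place keeps individual extractors free of bookkeeping, and the source hash is what an incremental update can later compare against.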

To replace the VLM stub with a real model:

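from GraphAnything import register_extractor

# my_real_impl is your own callable, with the same signature as
# extract_my_format above.
register_extractor(
    "vlm", my_real_impl, version="1.0.0",
    handles=(".pdf", ".png", ".jpg"),
    needs_llm=True, overwrite=True,   # overwrite=True replaces the built-in stub
)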

Versioning, federation, quality

Incremental updates. graphanything update rehashes every input. Unchanged files keep their previous nodes / edges verbatim, changed files are re-extracted, and the result is normalised and snapshotted as the next versions/v<N>.json. graphanything diff then compares any two versions.
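
The change-detection step can be sketched in a few lines. This is an illustration only: `file_hash` and `split_changed` are hypothetical helpers, and SHA-256 over the file bytes is an assumption about how source_hash is computed.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def file_hash(path: Path) -> str:
    # Assumption: the source hash is a SHA-256 digest of the file contents.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def split_changed(paths: Iterable[Path], previous: Dict[str, str]
                  ) -> Tuple[List[Path], List[Path], Dict[str, str]]:
    """Return (changed, unchanged, current_hashes) relative to the last run."""
    changed: List[Path] = []
    unchanged: List[Path] = []
    current: Dict[str, str] = {}
    for path in paths:
        digest = file_hash(path)
        current[str(path)] = digest
        if previous.get(str(path)) == digest:
            unchanged.append(path)    # reuse earlier nodes / edges verbatim
        else:
            changed.append(path)      # new or edited: re-extract
    return changed, unchanged, current


with tempfile.TemporaryDirectory() as tmp:
    a, b = Path(tmp, "a.md"), Path(tmp, "b.md")
    a.write_text("alpha")
    b.write_text("beta")
    _, _, hashes_v1 = split_changed([a, b], previous={})

    b.write_text("beta, edited")
    changed, unchanged, hashes_v2 = split_changed([a, b], previous=hashes_v1)
    changed_names = [p.name for p in changed]
    unchanged_names = [p.name for p in unchanged]

print(changed_names, unchanged_names)
```

Only b.md is flagged for re-extraction on the second pass; a.md keeps whatever was extracted from it before.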

Federation. graphanything federate g1 g2 ... --out universe.json merges several graphs into one universe. Same-label entities of the same type collapse exactly; with --fuzzy --fuzzy-threshold 0.7 it also proposes same_as edges via Jaccard token overlap, optionally with --llm tie-breaking on borderline pairs.
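
Jaccard token overlap is simple enough to show directly. The sketch below illustrates the idea only: `tokens`, `jaccard`, and `propose_same_as` are hypothetical helpers, the tokenisation (lowercased word characters) is an assumption, and the node key "type" is assumed for the entity type.

```python
import re
from itertools import combinations
from typing import Dict, List, Set, Tuple


def tokens(label: str) -> Set[str]:
    # Assumed tokenisation: lowercase runs of word characters.
    return set(re.findall(r"\w+", label.lower()))


def jaccard(a: str, b: str) -> float:
    """|A ∩ B| / |A ∪ B| over the two labels' token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def propose_same_as(nodes: List[Dict[str, str]],
                    threshold: float = 0.7) -> List[Tuple[str, str, float]]:
    """Propose (id, id, score) pairs for same-type nodes with similar labels."""
    proposals = []
    for left, right in combinations(nodes, 2):
        if left["type"] != right["type"]:
            continue
        if left["label"] == right["label"]:
            continue  # identical labels are exact matches, not fuzzy proposals
        score = jaccard(left["label"], right["label"])
        if score >= threshold:
            proposals.append((left["id"], right["id"], round(score, 2)))
    return proposals


nodes = [
    {"id": "g1:team", "type": "org", "label": "Acme Data Platform Team"},
    {"id": "g2:team", "type": "org", "label": "Acme Data Platform Team EU"},
    {"id": "g2:tool", "type": "tool", "label": "Acme Data Platform Team"},
    {"id": "g2:other", "type": "org", "label": "Globex Legal"},
]
proposals = propose_same_as(nodes, threshold=0.7)
print(proposals)
```

The two org labels share four of five distinct tokens, giving a score of 0.8, which clears the 0.7 threshold. The tool node is skipped because its type differs, even though its label overlaps heavily.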

Quality eval. graphanything eval --out-dir graphanything-out/quality writes QUALITY_REPORT.md: coverage by node type, dedup density, per-extractor stats, and (with --llm --judge-n 20) an LLM verdict on 20 sampled edges against their evidence_span.

REPL mode

graphanything repl ./contracts --preset contracts

Inside, the REPL accepts the following commands, which mirror the CLI sub-commands and add a few REPL-only helpers: schema, propose [N], refine "...", sample [N], review accept-all|accept ID...|reject ID...|merge a,b[,c], run [DIR], render FMT [PATH], explain TARGET, status, cost, llm on|off, presets, extractors, help, quit.

If prompt_toolkit is installed (pip install -e ".[repl]"), you get history + completion; otherwise the REPL falls back to bare input().

Session state on disk

graphanything-out/
├── sessions/
│   ├── .current               ← active session pointer
│   ├── sess_a1b2c3d4.json     ← one file per session
│   └── sess_...json
├── graph.json                 ← latest run/update output
├── versions/
│   ├── v1.json
│   ├── v2.json
│   └── manifest.json          ← schema_version + source_hashes per snapshot
└── quality/
    └── QUALITY_REPORT.md

The Session JSON is the source of truth: accepted / pending / rejected graph fragments, schema (with version), feedback log, running cost log, normalize rules, last source hashes for incremental update. Delete the file → the session is gone; copy it elsewhere → it relocates intact.

Configuration reference (env vars)

GraphAnything reads env vars only on demand, and every setting except the model name has a default. Names are listed in priority order: the first one that is set wins.

Setting               Env vars                                                    Default
Chat-completions URL  GA_API_BASE / OPENAI_API_BASE / OPENAI_BASE_URL / API_BASE  http://localhost:8000/v1
Model name            GA_MODEL / OPENAI_MODEL / SUMMARY_MODEL_NAME                (required for LLM ops)
Bearer token          GA_API_KEY / OPENAI_API_KEY / API_KEY                       empty (many local servers accept any string)
HTTP timeout (s)      GA_HTTP_TIMEOUT                                             600
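
The "first one set wins" rule is easy to mirror in your own scripts. The sketch below is not GraphAnything's code: `resolve_setting` is a hypothetical helper, the variable names and default are copied from the table, and treating an empty string as unset is an assumption.

```python
import os
from typing import Mapping, Optional, Sequence

# Names in priority order, copied from the table above.
API_BASE_VARS = ("GA_API_BASE", "OPENAI_API_BASE", "OPENAI_BASE_URL", "API_BASE")
MODEL_VARS = ("GA_MODEL", "OPENAI_MODEL", "SUMMARY_MODEL_NAME")


def resolve_setting(names: Sequence[str],
                    default: Optional[str] = None,
                    env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the value of the first variable in `names` that is set."""
    env = os.environ if env is None else env
    for name in names:
        value = env.get(name)
        if value:  # assumption: an empty string counts as "not set"
            return value
    return default


example_env = {
    "OPENAI_API_BASE": "https://api.example.com/v1",
    "API_BASE": "http://other-host:8000/v1",
}
api_base = resolve_setting(API_BASE_VARS, "http://localhost:8000/v1", example_env)
model = resolve_setting(MODEL_VARS, None, example_env)
print(api_base, model)
```

Here OPENAI_API_BASE wins over API_BASE because it appears earlier in the list, and the model resolves to None, the case in which LLM operations would have nothing to call.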

Each LLM call sends a standard chat.completions POST to {API_BASE}/chat/completions with model, messages, temperature, max_tokens, and (for chat_json) response_format: {type: "json_object"}. If the server rejects response_format, the call retries without it and GraphAnything regex-extracts JSON from the answer (so reasoning models emitting <think>...</think> blocks also work).
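
The fallback path, recovering JSON from a free-form answer, might look roughly like the sketch below. It is an approximation for illustration, not the client's actual code: `extract_json_object` is a hypothetical helper that strips <think> blocks with a regex and then lets the standard-library decoder try each opening brace.

```python
import json
import re
from typing import Any, Optional

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)


def extract_json_object(answer: str) -> Optional[Any]:
    """Best-effort recovery of the first JSON object in model output."""
    text = THINK_BLOCK.sub("", answer)   # drop reasoning-model scratch text
    decoder = json.JSONDecoder()
    # Try every "{" as a candidate start; raw_decode ignores trailing prose.
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        return obj
    return None


reply = (
    "<think>The user wants {nodes} and {edges}...</think>"
    'Sure, here it is:\n{"nodes": [{"id": "x"}], "edges": []}\nDone.'
)
parsed = extract_json_object(reply)
print(parsed)
```

Removing the <think> block first matters: reasoning text often contains stray braces that would otherwise be tried, and occasionally parsed, before the real payload.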

Programmatic use

from GraphAnything import open_session
from GraphAnything.llm_client import make_client

llm = make_client()                          # reads env vars

sess = open_session(["./vault/"], preset="obsidian-vault")
sess.propose(auto_accept=True, llm=llm)
sess.run(llm=llm, out_dir="graphanything-out")

print(sess.accepted["nodes"][:3])

make_client() accepts overrides:

llm = make_client(
    api_base="http://h-100:8000/v1",
    model="Qwen3-32B-Instruct",
    api_key="local",
    timeout=900,
)

License

MIT.
