Skip to content

gsc_pgo: online and offline PGO #2587

Open
jeff-hykin wants to merge 59 commits into
mainfrom
jeff/feat/jnav_pgo
Open

gsc_pgo: online and offline PGO #2587
jeff-hykin wants to merge 59 commits into
mainfrom
jeff/feat/jnav_pgo

Conversation

@jeff-hykin

@jeff-hykin jeff-hykin commented Jun 24, 2026

Copy link
Copy Markdown
Member

gsc_pgo, Ports the PGO / loop-closure stack into the new jnav layout, plus a tf-tree feature for memory2 stores.

process_observable gains an optional on_drop callback fired once per
message dropped by the dispatcher's single-slot LATEST mailbox. The
Recorder uses it to count dropped frames per stream and log a throttled
warning, so a slow sink no longer loses data silently.
@greptile-apps

greptile-apps Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR ports the PGO / loop-closure stack into the jnav layout and adds a tf-tree feature for memory2 stores: a new DbTfSql that serves loop-closure-corrected transforms from recorded SQLite data, a companion live-path DeformationBuffer in PubSubTF, and a new gsc_pgo native module wrapping the C++ iSAM2 + PCL ICP backend.

  • DbTfSql replaces the old in-RAM replay with an indexed-query approach: topology is stored in a tf_graph change-log stream; per-lookup SQL UNION subqueries fetch only the bracketing rows for each frame in the chain; a tf_deformation_nodes stream carries per-keyframe corrections that are blended at query time for loop-closure accuracy.
  • post_process.py adds offline two-stage PGO (GTSAM tag factors + ICP loop closures) that writes corrected odometry/lidar streams and the deformation-node stream back into the recording so DbTfSql can replay the correction identically.
  • recording_db.py now registers an atexit cleanup, and DbTfLive uses double-checked locking \u2014 both addressing findings from previous review rounds.

Confidence Score: 3/5

The default run of post_process.py fails immediately with ModuleNotFoundError due to the wrong import path for make_rrd; two prior-round bugs in DbTfSql (OperationalError on a tf-less store, IndexError on empty graph list) remain unresolved.

post_process.py default invocation hits from gsc_pgo import make_rrd which resolves to a path that does not exist. Two bugs identified in earlier rounds of review in DbTfSql are still present: _ensure_built() queries the tf table without guarding against it being absent, and _graph_at() dereferences index 0 of an empty list when the graph stream has zero entries.

dimos/navigation/jnav/components/loop_closure/gsc_pgo/scripts/post_process.py (broken make_rrd import), dimos/memory2/db_tf_sql.py (two open bugs from prior review rounds)

Important Files Changed

Filename Overview
dimos/navigation/jnav/components/loop_closure/gsc_pgo/scripts/post_process.py New offline PGO post-processing script; contains a broken import for make_rrd (wrong package path) that fails on the default run path, and no guard against an empty odometry stream.
dimos/memory2/db_tf_sql.py New SQLite-backed transform lookup with loop-closure deformation. Prior reviews flagged empty-table OperationalError in _ensure_built and IndexError in _graph_at; both remain open. Also accesses private _source attribute on Stream objects in multiple places.
dimos/memory2/db_tf_live.py New in-RAM TF lookup for any Store; correct double-checked locking, addresses the previous thread's non-atomic init concern.
dimos/memory2/db_tf.py Adds DbTf protocol and TfGraph payload; TfGraph copies the structure dict on construction, correctly isolating snapshots from later mutations.
dimos/navigation/jnav/components/loop_closure/gsc_pgo/module.py New PGO NativeModule wrapper; ports config and port declarations cleanly, publishes initial zero transform on start.
dimos/protocol/tf/deformation.py New DeformationBuffer for live loop-closure correction; SE(3) blend math is consistent with the recorded path in DbTfSql.
dimos/protocol/tf/tf.py Adds deformation-aware get_transform override to PubSubTF; PubSubTFConfig gains optional deformation_topic wired to LCM by default.
dimos/navigation/jnav/utils/recording_db.py New recording DB helper with module-level store cache; now includes close_all() and atexit.register(), addressing the previous review's resource-leak concern.
dimos/memory2/module.py SemanticSearch gains tf-aware pose resolution at query time (loop-closure-corrected), falling back to baked pose when no chain exists.
dimos/navigation/jnav/components/loop_closure/gsc_pgo/scripts/make_rrd.py New rerun visualization script; defines build() function, but is placed in scripts/ subdirectory while post_process.py imports it via the wrong parent package path.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    subgraph Online["Online (live robot)"]
        PGO["PGO NativeModule\n(C++ iSAM2 + PCL ICP)"]
        DN["tf_deformation_nodes\nOut[DeformationNode]"]
        TF["PubSubTF / LCMTF\nDeformationBuffer"]
        PGO -->|"corrected keyframe poses"| DN
        DN -->|"receive_deformation()"| TF
        TF -->|"loop-closure-corrected\nget() / lookup()"| Consumers["Consumers\n(SemanticSearch, nav, etc.)"]
    end
    subgraph Offline["Offline (post_process.py)"]
        PP["post_process.py\nStage 1: GTSAM tag PGO\nStage 2: ICP refinement"]
        DeformStream["gt_tf_deformation_nodes\nstream in mem2.db"]
        PP -->|"write raw + optimized\nkeyframe poses"| DeformStream
    end
    subgraph Memory2["memory2 / SqliteStore"]
        DbTfSql["DbTfSql.get()\n(indexed SQL + deformation)"]
        DeformStream -->|"_load_deformation_ids()\n_edge_delta()"| DbTfSql
        DbTfSql -->|"loop-closure-corrected\nTransform"| Consumers2["Consumers\n(SemanticSearch, eval)"]
    end
    Recorder["Recorder\n(memory2.Recorder)"] -->|"tf stream\ntf_deformation_nodes stream"| SqliteDB["mem2.db\nSQLite"]
    SqliteDB --> DbTfSql
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    subgraph Online["Online (live robot)"]
        PGO["PGO NativeModule\n(C++ iSAM2 + PCL ICP)"]
        DN["tf_deformation_nodes\nOut[DeformationNode]"]
        TF["PubSubTF / LCMTF\nDeformationBuffer"]
        PGO -->|"corrected keyframe poses"| DN
        DN -->|"receive_deformation()"| TF
        TF -->|"loop-closure-corrected\nget() / lookup()"| Consumers["Consumers\n(SemanticSearch, nav, etc.)"]
    end
    subgraph Offline["Offline (post_process.py)"]
        PP["post_process.py\nStage 1: GTSAM tag PGO\nStage 2: ICP refinement"]
        DeformStream["gt_tf_deformation_nodes\nstream in mem2.db"]
        PP -->|"write raw + optimized\nkeyframe poses"| DeformStream
    end
    subgraph Memory2["memory2 / SqliteStore"]
        DbTfSql["DbTfSql.get()\n(indexed SQL + deformation)"]
        DeformStream -->|"_load_deformation_ids()\n_edge_delta()"| DbTfSql
        DbTfSql -->|"loop-closure-corrected\nTransform"| Consumers2["Consumers\n(SemanticSearch, eval)"]
    end
    Recorder["Recorder\n(memory2.Recorder)"] -->|"tf stream\ntf_deformation_nodes stream"| SqliteDB["mem2.db\nSQLite"]
    SqliteDB --> DbTfSql
Loading

Reviews (9): Last reviewed commit: "test(memory2): skip DbTfSql tests when s..." | Re-trigger Greptile

Comment thread dimos/memory2/db_tf.py Outdated
Comment thread dimos/memory2/db_tf.py Outdated
Comment thread dimos/navigation/jnav/utils/recording_db.py
Comment thread dimos/memory2/db_tf.py Outdated
Comment thread dimos/navigation/jnav/components/loop_closure/unrefined_pgo/module.py Outdated
@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 69.44444% with 440 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
dimos/navigation/jnav/utils/trajectory_metrics.py 15.67% 113 Missing ⚠️
dimos/navigation/jnav/msgs/Graph3D.py 35.92% 66 Missing ⚠️
dimos/memory2/db_tf_sql.py 79.85% 38 Missing and 19 partials ⚠️
dimos/memory2/module.py 15.15% 56 Missing ⚠️
dimos/navigation/jnav/msgs/GraphDelta3D.py 34.17% 52 Missing ⚠️
dimos/navigation/jnav/utils/apriltag_agreement.py 57.35% 28 Missing and 1 partial ⚠️
dimos/protocol/tf/tf.py 37.50% 18 Missing and 2 partials ⚠️
dimos/navigation/jnav/utils/voxel_map.py 60.00% 12 Missing ⚠️
dimos/protocol/tf/deformation.py 82.85% 8 Missing and 4 partials ⚠️
...ion/jnav/components/loop_closure/gsc_pgo/module.py 91.56% 7 Missing ⚠️
... and 8 more
@@            Coverage Diff             @@
##             main    #2587      +/-   ##
==========================================
+ Coverage   71.85%   71.89%   +0.04%     
==========================================
  Files         891      909      +18     
  Lines       80512    81939    +1427     
  Branches     7331     7487     +156     
==========================================
+ Hits        57849    58909    +1060     
- Misses      20726    21093     +367     
  Partials     1937     1937              
Flag Coverage Δ
OS-ubuntu-24.04-arm 63.56% <46.87%> (-0.30%) ⬇️
OS-ubuntu-latest 66.62% <69.44%> (+0.06%) ⬆️
Py-3.10 66.60% <69.44%> (+0.06%) ⬆️
Py-3.11 66.60% <69.44%> (+0.06%) ⬆️
Py-3.12 66.61% <69.44%> (+0.07%) ⬆️
Py-3.13 66.61% <69.44%> (+0.06%) ⬆️
Py-3.14 66.62% <69.44%> (+0.06%) ⬆️
Py-3.14t 66.61% <69.44%> (+0.05%) ⬆️
SelfHosted-Linux 37.70% <24.50%> (-0.21%) ⬇️
SelfHosted-macOS 36.04% <24.50%> (-0.15%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
dimos/hardware/sensors/lidar/pointlio/recorder.py 81.81% <100.00%> (+0.86%) ⬆️
dimos/navigation/jnav/msgs/LocationConstraint.py 100.00% <100.00%> (ø)
...os/navigation/jnav/msgs/test_LocationConstraint.py 100.00% <100.00%> (ø)
...s/navigation/jnav/utils/test_apriltag_agreement.py 100.00% <100.00%> (ø)
dimos/navigation/jnav/utils/test_kitti.py 100.00% <100.00%> (ø)
dimos/protocol/tf/test_deformation.py 100.00% <100.00%> (ø)
dimos/robot/all_blueprints.py 100.00% <ø> (ø)
.../robot/unitree/go2/go2_mid360_static_transforms.py 90.00% <ø> (ø)
dimos/memory2/db_tf.py 90.00% <90.00%> (ø)
dimos/memory2/test_db_tf.py 99.37% <99.37%> (ø)
... and 16 more

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@jeff-hykin jeff-hykin changed the title jnav: port PGO/loop-closure + tf-tree for memory2 stores gsc_pgo: online and offline PGO Jun 24, 2026
@jeff-hykin jeff-hykin enabled auto-merge (squash) June 24, 2026 08:35
Replace per-frame static/bracket queries with a single UNION of per-frame
LIMIT-1 subqueries joined to the blob store. Each subquery is a direct
(child_frame, ts) range seek via a new composite index, so the warm lookup
is one query (graph served from RAM) at the same wall-time as the per-row
DbTf (~126us) instead of ~4 queries.
Replace the per-row tf-tree DbTf with the graph-stream implementation (formerly
DbTf2): topology change-log + child_frame-tagged rows, in-RAM graph cache with a
query-per-lookup fallback, latched statics, and disjoint multi-robot graphs.

- merge db_tf2.py into db_tf.py; DbTf2 -> DbTf; keep transform_matrix +
  write_tf_tree (now one-transform-per-row so the reader can tag/range-query it)
  and the non-sqlite RAM fallback.
- recorder writes only tf_graph now (no per-row tree).
- trim tests to four integration cases (interp vs full-load, latched static,
  mid-run re-parent, disjoint multi-robot); remove db_tf2 tests + bench scaffold.
The topology change-log was a raw sqlite side-table written directly by
TfGraphWriter. Promote it to a real memory2 stream ("<tf>_graph") so it's
discoverable via list_streams(), replayable, and uses the standard codec/blob
machinery like every other recorded stream.

- new TfGraph payload (recording-internal, pickle-codec); TfGraphWriter appends
  snapshots via store.stream() instead of raw INSERTs.
- DbTf reads the graph through the stream API (count + order_by + time_range)
  instead of raw SQL; migration build_graph_stream appends to the stream.
- recorder records the graph stream for all store types, not just sqlite.
- drop the raw-table helpers (ensure_graph_table/_graph_table).
The loop-closure eval's raw baseline (tag agreement, voxel re-anchoring, drift
recovery) now composes the robot pose from the recording's tf (world->base_link,
sampled at each odom time) when the recording has tf; recordings without tf fall
back to the odom stream's stored poses. Adds --world-frame/--body-frame, records
pose_source in the summary, and bumps EVAL_VERSION to invalidate cached cells.
# Conflicts:
#	dimos/hardware/sensors/lidar/fastlio2/recorder.py
#	dimos/memory2/module.py
#	dimos/robot/all_blueprints.py
#	dimos/robot/unitree/go2/blueprints/smart/unitree_go2.py
…igned mid360_link

Keep the physical sensor pose (position + 44deg tilt) as mid360_mount, and add a
counter-rotated mid360_link at the same position whose orientation is un-tilted
(gravity-aligned) — the frame the gravity-anchored LIO scans live in. Preserves
the physical mount geometry while giving a level data frame.
- remove unused write_tf_tree; move transform_matrix into post_process (its only use)
- drop GRAPH_STREAM_SUFFIX indirection (one '<tf>_graph' stream, name derived inline)
- inline TfGraphWriter into the recorder and build_graph_stream into DbTf._ensure_built
- name the pose-equality rounding constant; namespace the child-frame index name
- fix: order graph snapshots by (ts, id) so same-timestamp topology changes resolve
  to the complete snapshot; raise the in-RAM cache threshold for fixed multi-frame rigs
Static mount frames are re-published onto the regular tf stream by the rig's
StaticTfPublisher (5Hz), so the recorder already captures them via the tf
subscription; the separate hardcoded Topic('/tf_static') had no publisher. Record
tf through the one tf stream; static-vs-dynamic is recovered offline by DbTf's
graph build (pose-constancy).
…ed LCM topic

Add an optional tf_static: In[TFMessage] port (a future system publishes latched
mount/extrinsic transforms there). Folded into the 'tf' stream + flagged static in
tf_graph; excluded from the generic per-port recording. Subscribed only when wired
(guards the unconnected transport), so it's a no-op until something publishes.
…rames)

Add DeformationNode msg (id, tf_id=FNV-1a-64(frame_from|frame_to), pose)
and a Recorder tf_deformation_nodes In-port recorded by default (no-op when
unconnected, like tf_static). Basis for loop-closure-corrected tf at query time.
Autoconnects to the Recorder's tf_deformation_nodes In. Binary publishes to it
once the pinned rev is bumped (gated on pushing the gsc_pgo C++ change).
For each edge in the resolved chain, hash (parent|child) -> tf_id; if the
deformation stream has matching keyframes, take the ones bracketing the query
time, compute each node's current ∘ inv(original) delta, linear-blend-skin
between them, and apply on the parent side of that edge. Edges with no match
(and recordings with no deformation stream) take the normal path at zero cost.
…rrection

Split db_tf into DbTfLive (in-RAM buffer, all stores) and DbTfSql (indexed
per-lookup queries + deformation correction, sqlite only). SemanticSearch now
resolves a match's world pose at query time via store.tf so it reflects loop
closure. PubSubTF gains a deformation_topic that corrects live tf.get from a
tf_deformation_nodes stream. Recorder only clears streams whose source is
connected (pcap_to_db no longer wipes an existing tf tree). Drop unused Marker msg.
Detect AprilTags over the camera stream when the raw tag stream is absent, so a
fresh recording only needs post_process (derive distance/view-angle from the tag
pose, defaulting unknown speeds to pass). Persist the PGO's keyframe deformation
(gt_tf_deformation_nodes) and optimized pose_graph as real typed streams so a
deformation-aware tf.get can replay the loop-closure correction offline.
eval.py's odom-pose baseline was replaced by tf_pose_samples, but
eval_ground_truth_tag.py still imported the deleted odometry_pose7_lookup.
Build the raw pose7 lookup from tf_pose_samples + pose7_lookup, mirroring
evaluate(). Fixes the broken import (full-repo mypy).
Comment on lines +336 to +343
tf_tag = "json_extract(tags, '$.tf_id')"
conn = self._connection()
lo = conn.execute(
f"SELECT {id_tag}, ts FROM {stream} WHERE {tf_tag} = ? AND ts <= ? "
"ORDER BY ts DESC, id DESC LIMIT 1",
(str(tf_id), query_time),
).fetchone()
hi = conn.execute(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 _graph_at throws IndexError on an empty _graph_in_ram list

When _ensure_built() finds zero tf rows (empty tf table), it returns without building the graph stream. _load_graph_if_small() then sets _graph_in_ram = [] (because _graph_count() returns 0 and 0 < 64). The next _graph_at(t) call runs bisect.bisect_right([], t) - 1 = -1, falls into the index < 0 branch, and executes return self._graph_in_ram[0][1] — an IndexError on an empty list. The DbTf.get() protocol contract is to return None when no transform exists, not raise. The fix is to guard the "earliest" path: if not self._graph_in_ram: return None before the index access.

Comment on lines +252 to +271
(frame, kind) -> blob bytes."""
cf = "json_extract(tags, '$.child_frame')"
# One UNION of per-frame, index-served LIMIT-1 subqueries: each is a direct
# (child_frame, ts) range seek — far cheaper than a window-function scan, and
# still a single round-trip.
parts: list[str] = []
params: list[Any] = []

def pick(frame: str, kind: str, where_ts: str, order: str) -> None:
parts.append(
f"SELECT ? AS cf, ? AS kind, "
f'(SELECT id FROM "tf" WHERE {cf} = ?{where_ts} ORDER BY ts {order} LIMIT 1) AS id'
)
params.extend([frame, kind, frame])

for frame in dynamic:
pick(frame, "lo", " AND ts <= ?", "DESC")
params.append(query_time)
pick(frame, "hi", " AND ts >= ?", "ASC")
params.append(query_time)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 _ensure_built throws OperationalError when the tf table is absent

_ensure_built() immediately executes SELECT count(*) FROM "tf" without first checking whether the "tf" stream (and thus its table) actually exists in the recording. If a SqliteStore has no tf stream at all, this throws sqlite3.OperationalError: no such table: tf. By contrast, has_transforms() correctly guards with if "tf" not in set(self._store.list_streams()): return False. Any caller that invokes store.tf.get(...) on a tf-less recording — without first calling has_transforms() — gets an uncaught exception rather than the None the DbTf protocol promises.

test_db_tf.py instantiates SqliteStore directly (not via the sqlite_store
fixture), so it bypassed conftest's _SKIP_SQLITE_VEC guard and failed on the
ubuntu-24.04-arm runner (vec0.so wrong ELF class). Add a module-level skipif
mirroring memory2/conftest.py so these 6 tests skip instead of fail there.
if OPEN_RRD and WHAT in ("lidar", "both"):
import subprocess

from dimos.navigation.jnav.components.loop_closure.gsc_pgo import make_rrd

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 make_rrd lives at gsc_pgo/scripts/make_rrd.py, but this import resolves to gsc_pgo/make_rrd.py — a path that does not exist. Since OPEN_RRD is True and WHAT defaults to "both", the default invocation (python post_process.py --rec=...) will fail here with ModuleNotFoundError.

Suggested change
from dimos.navigation.jnav.components.loop_closure.gsc_pgo import make_rrd
from dimos.navigation.jnav.components.loop_closure.gsc_pgo.scripts import make_rrd

@jeff-hykin jeff-hykin added the backport:skip Skip creating a backport to any release branches label Jul 2, 2026
@github-actions github-actions Bot added the ready-to-merge Required CI checks have passed on this PR label Jul 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backport:skip Skip creating a backport to any release branches ready-to-merge Required CI checks have passed on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants