feat: hybrid embedding pipeline with native AI fallback and NaN self-healing#24
feat: hybrid embedding pipeline with native AI fallback and NaN self-healing#24Harsh16gupta wants to merge 3 commits into
Conversation
|
Some more work still needs to be done on this. Also, Laurent recently made changes to the API endpoints. I've tested them locally from the GitHub repository, but they haven't been released yet. I'll continue working on this once those changes are available in a release. |
65b4209 to
06b1f89
Compare
06b1f89 to
d5419db
Compare
|
The PR is ready for review |
There was a problem hiding this comment.
Pull request overview
This PR introduces a hybrid embedding pipeline that prefers Joplin’s native AI embeddings when available and falls back to a local ONNX worker otherwise, while also improving robustness against WebGPU numeric instability and invalid cached vectors.
Changes:
- Added native AI embedding integration (
getIndexStatus,getEmbeddings) with pagination safeguards and chunk aggregation. - Improved WebGPU stability by detecting NaNs during warmup/inference and dynamically falling back to WASM/q8.
- Centralized embedding dimension/config and added cache-vector validation to self-heal corrupted cache entries.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| src/worker/embedWorker.ts | Adds NaN detection and dynamic WebGPU→WASM fallback during warmup/inference. |
| src/pipeline/UmapProjector.ts | Adds optional distance-matrix projection support and validation for index-singleton vectors. |
| src/pipeline/runPipeline.ts | Adds native-embeddings fast path and cache-vector validation before reuse. |
| src/pipeline/pipelineConfig.ts | Introduces shared embedding dimension, vector validation, and tuned default clustering config. |
| src/pipeline/nativeEmbeddingPipeline.ts | Implements native AI readiness check and paged embedding fetch with cursor loop protection. |
| src/pipeline/clustering/benchmark.ts | Extends benchmark API to optionally accept a distance matrix and project it for clustering. |
| src/commands/testEmbed.ts | Aligns test command with shared config and cache-vector validation. |
| package.json | Renames the package to match Joplin naming convention. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| callbacks.onStatus('Clustering...'); | ||
| const results = benchmark(vectors, DEFAULT_CONFIG); | ||
|
|
||
| const panelNotes: PanelNote[] = validNotes.map((n) => ({ | ||
| noteId: n.id, | ||
| title: n.title, | ||
| })); | ||
|
|
||
| callbacks.onComplete(results, panelNotes); | ||
| return; |
There was a problem hiding this comment.
added enrichResultsWithTags to the native embeddings path with the same document-building pattern used in the local worker path. Both paths now produce TF-IDF tags for the dashboard.
| if (!page || !Array.isArray(page.chunks)) { | ||
| throw new Error('Invalid response from Joplin native getEmbeddings API'); | ||
| } | ||
|
|
||
| if (modelId && page.modelId !== modelId) { | ||
| throw new Error('Embedding model changed mid-fetch. Please restart.'); | ||
| } | ||
| modelId = page.modelId; | ||
| chunks.push(...page.chunks); | ||
| cursor = page.nextCursor; |
There was a problem hiding this comment.
added a guard that checks for noteId and vector before pushing chunks. Malformed ones get logged and skipped.
| export function isValidEmbeddingVector(vector: number[] | undefined | null): boolean { | ||
| if (!vector) return false; | ||
| if (vector.length !== EMBEDDING_DIM) return false; | ||
| return vector.every((v) => v !== null && !Number.isNaN(v)); | ||
| } |
There was a problem hiding this comment.
switched to Number.isFinite(v) which covers all the cases (null, NaN, Infinity, non-numbers). Didn't change the signature to unknown since callers already pass typed values - the any casts would make it worse.
| const t0 = performance.now(); | ||
| const output = await embedder(text, { pooling: POOLING, normalize: true }); | ||
| const inferenceTime = performance.now() - t0; | ||
| const dimensions = output.data.length; | ||
| const embedding = Array.from(output.data as Float32Array); | ||
|
|
||
| if (embedding.some((v) => isNaN(v))) { | ||
| throw new Error('Inference returned NaN values'); | ||
| } |
There was a problem hiding this comment.
changed to check the typed array (Float32Array) for NaNs first.
This PR adds support for using Joplin's native AI search embeddings if they are enabled. If native embeddings are not available or fail, the plugin seamlessly falls back to running the local model in a web worker. It also includes fixes for WebGPU stability issues and cache validation.
What changed
getIndexStatusandgetEmbeddingsAI APIs.joplin.aicalls directly to avoid Joplin IPC sandbox proxy errors.NaNvalues caused by WebGPU numeric instability.384dimension limit, and vector validation logic into a new sharedpipelineConfig.tsfile to eliminate duplicated code.Testing
Tested locally with 50 notes on Joplin desktop with native AI enabled:
multilingual-e5-smallmodel and retrieved all vectors.0.9352.