feat: hybrid embedding pipeline with native AI fallback and NaN self-healing by Harsh16gupta · Pull Request #24 · joplin/plugin-note-categorization

Harsh16gupta · 2026-06-27T14:05:58Z

This PR adds support for using Joplin's native AI search embeddings if they are enabled. If native embeddings are not available or fail, the plugin seamlessly falls back to running the local model in a web worker. It also includes fixes for WebGPU stability issues and cache validation.

What changed

Native AI Integration:
- Added support for Joplin's new getIndexStatus and getEmbeddings AI APIs.
- Fetches embeddings in batches, handles page cursors safely to prevent infinite loops, and averages chunks together for longer notes.
- Chained all joplin.ai calls directly to avoid Joplin IPC sandbox proxy errors.

WebGPU Stability & WASM Fallback:
- Added checks to catch NaN values caused by WebGPU numeric instability.
- If WebGPU fails during warmup or inference, the worker dynamically reloads the model using the WASM/q8 runner and retries.

Cache Verification:
- When loading from the cache, we now check that the cached vectors are valid (correct length, not null/NaN). Invalid entries are logged and re-embedded.

Code Cleanup:
- Moved shared configuration, the 384-dimension embedding constant, and vector validation logic into a new shared pipelineConfig.ts file to eliminate duplicated code.
- Tuned UMAP/clustering parameters and updated the package name to match Joplin's naming convention.
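
For readers who want a concrete picture of the batched fetch described under Native AI Integration, here is a minimal sketch of cursor-safe paging followed by per-note chunk averaging. It is not the code in nativeEmbeddingPipeline.ts: the EmbeddingChunk and EmbeddingPage shapes, the fetchPage callback, and the maxPages budget are all assumptions made for illustration, and the real joplin.ai response types may differ.

```typescript
// Hypothetical shapes: the real joplin.ai response types may differ.
interface EmbeddingChunk {
	noteId: string;
	vector: number[];
}

interface EmbeddingPage {
	chunks: EmbeddingChunk[];
	nextCursor?: string | null;
}

// Hypothetical callback standing in for the native getEmbeddings call.
type FetchPage = (cursor: string | null) => Promise<EmbeddingPage>;

// Walks every page and fails fast if a cursor repeats or the page budget
// runs out, so a misbehaving API cannot keep the loop alive forever.
async function fetchAllChunks(fetchPage: FetchPage, maxPages = 1000): Promise<EmbeddingChunk[]> {
	const chunks: EmbeddingChunk[] = [];
	const seenCursors = new Set<string>();
	let cursor: string | null = null;

	for (let pageCount = 0; pageCount < maxPages; pageCount++) {
		const page: EmbeddingPage = await fetchPage(cursor);
		chunks.push(...page.chunks);

		const next: string | null = page.nextCursor ?? null;
		if (next === null) return chunks; // last page reached
		if (seenCursors.has(next)) {
			throw new Error(`Cursor repeated: ${next}`);
		}
		seenCursors.add(next);
		cursor = next;
	}
	throw new Error(`Exceeded ${maxPages} pages`);
}

// Averages all chunk vectors that belong to the same note, element-wise.
// Assumes every vector of a given note has the same length.
function averageByNote(chunks: EmbeddingChunk[]): Map<string, number[]> {
	const sums = new Map<string, { total: number[]; count: number }>();
	for (const { noteId, vector } of chunks) {
		const entry = sums.get(noteId);
		if (!entry) {
			sums.set(noteId, { total: [...vector], count: 1 });
			continue;
		}
		for (let i = 0; i < vector.length; i++) entry.total[i] += vector[i];
		entry.count++;
	}

	const result = new Map<string, number[]>();
	sums.forEach(({ total, count }, noteId) => {
		result.set(noteId, total.map((v) => v / count));
	});
	return result;
}
```

Tracking seen cursors catches an API that hands back the same cursor twice, while the page budget covers the case where cursors keep changing but never terminate.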

Testing

Tested locally with 50 notes on Joplin desktop with native AI enabled:

- Successfully detected the native multilingual-e5-small model and retrieved all vectors.
- The HDBSCAN clustering run completed successfully, producing a silhouette score of 0.9352.

Harsh16gupta · 2026-06-30T16:43:56Z

Some more work still needs to be done on this. Also, Laurent recently made changes to the API endpoints. I've tested them locally from the GitHub repository, but they haven't been released yet. I'll continue working on this once those changes are available in a release.

Harsh16gupta · 2026-07-02T12:39:01Z

The PR is ready for review.
Note that the new native AI APIs (getIndexStatus and getEmbeddings) aren't available in the public pre-release version of Joplin yet. I have tested this against a local Joplin dev build and it works fine.

Copilot

Pull request overview

This PR introduces a hybrid embedding pipeline that prefers Joplin’s native AI embeddings when available and falls back to a local ONNX worker otherwise, while also improving robustness against WebGPU numeric instability and invalid cached vectors.

Changes:

Added native AI embedding integration (getIndexStatus, getEmbeddings) with pagination safeguards and chunk aggregation.
Improved WebGPU stability by detecting NaNs during warmup/inference and dynamically falling back to WASM/q8.
Centralized embedding dimension/config and added cache-vector validation to self-heal corrupted cache entries.
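
The WebGPU to WASM fallback summarized above can be sketched roughly as follows. This is an illustration of the pattern only, not the code in embedWorker.ts: the Backend type, the LoadEmbedder callback, and createFallbackEmbedder are hypothetical names, and the loader is injected so the sketch does not depend on any particular model runtime.

```typescript
type Backend = 'webgpu' | 'wasm';
type Embedder = (text: string) => Promise<Float32Array>;
// Hypothetical loader: in a real worker this would build the model pipeline
// for the requested backend.
type LoadEmbedder = (backend: Backend) => Promise<Embedder>;

// Returns true if any value is NaN or +/-Infinity.
function containsNonFinite(values: Float32Array): boolean {
	for (let i = 0; i < values.length; i++) {
		if (!Number.isFinite(values[i])) return true;
	}
	return false;
}

function createFallbackEmbedder(load: LoadEmbedder) {
	let backend: Backend = 'webgpu';
	let embedder: Embedder | null = null;

	// Loads the model lazily for the current backend and rejects bad output.
	async function run(text: string): Promise<Float32Array> {
		const active: Embedder = embedder ?? (await load(backend));
		embedder = active;
		const output = await active(text);
		if (containsNonFinite(output)) {
			throw new Error(`${backend} backend returned non-finite values`);
		}
		return output;
	}

	return {
		activeBackend: (): Backend => backend,
		async embed(text: string): Promise<Float32Array> {
			try {
				return await run(text);
			} catch (error) {
				if (backend === 'wasm') throw error; // nothing left to fall back to
				backend = 'wasm';
				embedder = null; // force a reload on the new backend
				return run(text);
			}
		},
	};
}
```

Once the switch to WASM has happened it is sticky, so later calls reuse the already loaded fallback model instead of retrying WebGPU on every note.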

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

File	Description
src/worker/embedWorker.ts	Adds NaN detection and dynamic WebGPU→WASM fallback during warmup/inference.
src/pipeline/UmapProjector.ts	Adds optional distance-matrix projection support and validation for index-singleton vectors.
src/pipeline/runPipeline.ts	Adds native-embeddings fast path and cache-vector validation before reuse.
src/pipeline/pipelineConfig.ts	Introduces shared embedding dimension, vector validation, and tuned default clustering config.
src/pipeline/nativeEmbeddingPipeline.ts	Implements native AI readiness check and paged embedding fetch with cursor loop protection.
src/pipeline/clustering/benchmark.ts	Extends benchmark API to optionally accept a distance matrix and project it for clustering.
src/commands/testEmbed.ts	Aligns test command with shared config and cache-vector validation.
package.json	Renames the package to match Joplin naming convention.

Harsh16gupta · 2026-07-04T08:32:52Z

+					callbacks.onStatus('Clustering...');
+					const results = benchmark(vectors, DEFAULT_CONFIG);
+
+					const panelNotes: PanelNote[] = validNotes.map((n) => ({
+						noteId: n.id,
+						title: n.title,
+					}));
+
+					callbacks.onComplete(results, panelNotes);
+					return;


Added enrichResultsWithTags to the native embeddings path, using the same document-building pattern as the local worker path. Both paths now produce TF-IDF tags for the dashboard.

Harsh16gupta · 2026-07-04T08:34:51Z

+			if (!page || !Array.isArray(page.chunks)) {
+				throw new Error('Invalid response from Joplin native getEmbeddings API');
+			}
+
+			if (modelId && page.modelId !== modelId) {
+				throw new Error('Embedding model changed mid-fetch. Please restart.');
+			}
+			modelId = page.modelId;
+			chunks.push(...page.chunks);
+			cursor = page.nextCursor;


Added a guard that checks each chunk for noteId and vector before pushing it. Malformed chunks are logged and skipped.
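
A guard of this kind might look roughly like the sketch below. The EmbeddingChunk shape, the function names, and the log callback are assumptions for illustration, not the code that was pushed.

```typescript
// Hypothetical chunk shape: only noteId and vector are taken from the PR.
interface EmbeddingChunk {
	noteId: string;
	vector: number[];
}

// Type guard: a chunk needs a non-empty string noteId and an array vector.
function isWellFormedChunk(chunk: unknown): chunk is EmbeddingChunk {
	if (typeof chunk !== 'object' || chunk === null) return false;
	const candidate = chunk as { noteId?: unknown; vector?: unknown };
	return (
		typeof candidate.noteId === 'string' &&
		candidate.noteId.length > 0 &&
		Array.isArray(candidate.vector)
	);
}

// Keeps well-formed chunks and reports the rest through the log callback
// instead of letting one bad entry abort the whole fetch.
function collectValidChunks(rawChunks: unknown[], log: (message: string) => void): EmbeddingChunk[] {
	const valid: EmbeddingChunk[] = [];
	rawChunks.forEach((chunk, index) => {
		if (isWellFormedChunk(chunk)) {
			valid.push(chunk);
		} else {
			log(`Skipping malformed chunk at index ${index}`);
		}
	});
	return valid;
}
```

The guard only checks structure. Whether the vector's contents are usable (length, finite values) is left to the shared vector validation.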

Harsh16gupta · 2026-07-04T08:34:50Z

+export function isValidEmbeddingVector(vector: number[] | undefined | null): boolean {
+	if (!vector) return false;
+	if (vector.length !== EMBEDDING_DIM) return false;
+	return vector.every((v) => v !== null && !Number.isNaN(v));
+}


Switched to Number.isFinite(v), which covers all the cases (null, NaN, Infinity, non-numbers). I didn't change the signature to unknown, since callers already pass typed values and the any casts would make it worse.
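
For reference, the validator after this change would look roughly like the sketch below. It is reconstructed from the hunk above and this comment rather than copied from the pushed commit, and EMBEDDING_DIM is set to 384 on the basis of the PR description.

```typescript
// 384 matches the embedding dimension named in the PR description.
const EMBEDDING_DIM = 384;

function isValidEmbeddingVector(vector: number[] | undefined | null): boolean {
	if (!vector) return false;
	if (vector.length !== EMBEDDING_DIM) return false;
	// Number.isFinite does not coerce its argument, so null, NaN, +/-Infinity,
	// and non-number values all fail the check.
	return vector.every((v) => Number.isFinite(v));
}
```

Unlike the global isFinite, Number.isFinite(null) is false, which is why the separate null check inside every is no longer needed.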

Harsh16gupta · 2026-07-04T08:41:58Z

+		const t0 = performance.now();
+		const output = await embedder(text, { pooling: POOLING, normalize: true });
+		const inferenceTime = performance.now() - t0;
+		const dimensions = output.data.length;
+		const embedding = Array.from(output.data as Float32Array);
+
+		if (embedding.some((v) => isNaN(v))) {
+			throw new Error('Inference returned NaN values');
+		}


Changed to check the typed array (Float32Array) for NaNs first.
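
A minimal sketch of that ordering, assuming the model output is already available as a Float32Array; toCheckedEmbedding is a hypothetical helper name and not necessarily how the worker structures it.

```typescript
// Scans the typed array in place and only allocates the number[] copy
// once the output is known to be free of NaNs.
function toCheckedEmbedding(data: Float32Array): number[] {
	for (let i = 0; i < data.length; i++) {
		if (Number.isNaN(data[i])) {
			throw new Error(`Inference returned NaN at index ${i}`);
		}
	}
	return Array.from(data);
}
```

Checking before Array.from avoids copying the output when it is going to be discarded anyway.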

Harsh16gupta self-assigned this Jun 27, 2026

Harsh16gupta changed the title ~~feat: implementing the new joplin ai features in the plugin existing workflow~~ feat: hybrid embedding pipeline with native AI fallback and NaN self-healing Jul 2, 2026

Harsh16gupta force-pushed the feat/native-embeddings-hybrid branch from 65b4209 to 06b1f89 Compare July 2, 2026 12:28

feat: Support native Joplin embeddings with WebGPU fallback and cache…

d5419db

… fixes

Harsh16gupta force-pushed the feat/native-embeddings-hybrid branch from 06b1f89 to d5419db Compare July 2, 2026 12:32

Harsh16gupta marked this pull request as ready for review July 2, 2026 12:34

Harsh16gupta requested a review from HahaBill July 2, 2026 12:35

Merge branch 'master' into feat/native-embeddings-hybrid

11e5290

HahaBill requested a review from Copilot July 3, 2026 14:52

Copilot started reviewing on behalf of HahaBill July 3, 2026 14:52

Copilot AI reviewed Jul 3, 2026

addresed review comment

ab35710

Harsh16gupta wants to merge 3 commits into master from feat/native-embeddings-hybrid


3 participants
