feat: hybrid embedding pipeline with native AI fallback and NaN self-healing#24

Open: Harsh16gupta wants to merge 3 commits into master from feat/native-embeddings-hybrid

@Harsh16gupta Harsh16gupta commented Jun 27, 2026

Collaborator

This PR adds support for using Joplin's native AI search embeddings if they are enabled. If native embeddings are not available or fail, the plugin seamlessly falls back to running the local model in a web worker. It also includes fixes for WebGPU stability issues and cache validation.

What changed

  • Native AI Integration:
    • Added support for Joplin's new getIndexStatus and getEmbeddings AI APIs.
    • Fetches embeddings in batches, handles page cursors safely to prevent infinite loops, and averages chunks together for longer notes.
    • Chained all joplin.ai calls directly to avoid Joplin IPC sandbox proxy errors.
  • WebGPU Stability & WASM Fallback:
    • Added checks to catch NaN values caused by WebGPU numeric instability.
    • If WebGPU fails during warmup or inference, the worker dynamically reloads the model using the WASM/q8 runner and retries.
  • Cache Verification:
    • When loading from the cache, we now check that the cached vectors are valid (correct length, not null/NaN). Invalid entries are logged and re-embedded.
  • Code Cleanup:
    • Moved shared configurations, the 384 dimension limit, and vector validation logic into a new shared pipelineConfig.ts file to eliminate duplicated code.
    • Tuned UMAP/clustering parameters and updated the package name to match Joplin's naming convention.
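
The paged fetch and chunk averaging described under "Native AI Integration" could be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `fetchPage`, `Chunk`, and `EmbeddingPage` are hypothetical stand-ins for the `getEmbeddings` call and its response, whose real signature is not shown here. Only the field names `chunks`, `nextCursor`, `noteId`, and `vector` come from the review snippets and replies.

```typescript
// Hypothetical shapes; the real native getEmbeddings response may differ.
interface Chunk { noteId: string; vector: number[]; }
interface EmbeddingPage { chunks: Chunk[]; nextCursor?: string; }
type FetchPage = (cursor?: string) => Promise<EmbeddingPage>;

// Fetch every page. A repeated cursor or an excessive page count raises an
// error instead of looping forever.
async function fetchAllChunks(fetchPage: FetchPage, maxPages = 1000): Promise<Chunk[]> {
  const seenCursors = new Set<string>();
  const chunks: Chunk[] = [];
  let cursor: string | undefined;
  for (let i = 0; i < maxPages; i++) {
    const page = await fetchPage(cursor);
    chunks.push(...page.chunks);
    if (!page.nextCursor) return chunks; // last page reached
    if (seenCursors.has(page.nextCursor)) throw new Error('Repeated cursor');
    seenCursors.add(page.nextCursor);
    cursor = page.nextCursor;
  }
  throw new Error('Too many pages');
}

// Average the chunk vectors of each note into a single vector per note.
function averageByNote(chunks: Chunk[]): Map<string, number[]> {
  const sums = new Map<string, { sum: number[]; count: number }>();
  for (const { noteId, vector } of chunks) {
    const entry = sums.get(noteId);
    if (!entry) {
      sums.set(noteId, { sum: vector.slice(), count: 1 });
      continue;
    }
    vector.forEach((v, i) => { entry.sum[i] += v; });
    entry.count += 1;
  }
  const result = new Map<string, number[]>();
  sums.forEach(({ sum, count }, noteId) => {
    result.set(noteId, sum.map((s) => s / count));
  });
  return result;
}
```

Note that the mean of unit-length vectors is generally not unit length. Whether the PR renormalizes after averaging is not stated, so this sketch leaves the averaged vectors as they are.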

Testing

Tested locally with 50 notes on Joplin desktop with native AI enabled:

  • Successfully detected the native multilingual-e5-small model and retrieved all vectors.
  • The HDBSCAN clustering run completed successfully, producing a silhouette score of 0.9352.

@Harsh16gupta Harsh16gupta self-assigned this Jun 27, 2026

Collaborator Author

Some more work still needs to be done on this. Also, Laurent recently made changes to the API endpoints. I've tested them locally from the GitHub repository, but they haven't been released yet. I'll continue working on this once those changes are available in a release.

@Harsh16gupta Harsh16gupta changed the title from "feat: implementing the new joplin ai features in the plugin existing workflow" to "feat: hybrid embedding pipeline with native AI fallback and NaN self-healing" Jul 2, 2026
@Harsh16gupta Harsh16gupta force-pushed the feat/native-embeddings-hybrid branch from 65b4209 to 06b1f89 Compare July 2, 2026 12:28
@Harsh16gupta Harsh16gupta force-pushed the feat/native-embeddings-hybrid branch from 06b1f89 to d5419db Compare July 2, 2026 12:32
@Harsh16gupta Harsh16gupta marked this pull request as ready for review July 2, 2026 12:34
@Harsh16gupta Harsh16gupta requested a review from HahaBill July 2, 2026 12:35

Collaborator Author

The PR is ready for review.
Note that the new native AI APIs (getIndexStatus and getEmbeddings) are not yet released in the public pre-release version of Joplin. I have tested this against a local Joplin dev build, and it works fine.

Copilot AI left a comment

Contributor

Pull request overview

This PR introduces a hybrid embedding pipeline that prefers Joplin’s native AI embeddings when available and falls back to a local ONNX worker otherwise, while also improving robustness against WebGPU numeric instability and invalid cached vectors.

Changes:

  • Added native AI embedding integration (getIndexStatus, getEmbeddings) with pagination safeguards and chunk aggregation.
  • Improved WebGPU stability by detecting NaNs during warmup/inference and dynamically falling back to WASM/q8.
  • Centralized embedding dimension/config and added cache-vector validation to self-heal corrupted cache entries.
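
The WebGPU to WASM fallback summarized above might look something like the sketch below. It is a simplified illustration, not the worker's actual code: `LoadEmbedder` and `createFallbackEmbedder` are hypothetical names, and the loader is assumed to build an embedder for whichever backend it is given.

```typescript
type Device = 'webgpu' | 'wasm';
type Embedder = (text: string) => Promise<Float32Array>;
// Hypothetical loader: builds an embedder for the requested backend.
type LoadEmbedder = (device: Device) => Promise<Embedder>;

function createFallbackEmbedder(load: LoadEmbedder) {
  let device: Device = 'webgpu';
  let embedder: Embedder | null = null;

  // Load the model lazily, run it, and reject output that contains NaN.
  async function run(text: string): Promise<Float32Array> {
    if (!embedder) embedder = await load(device);
    const out = await embedder(text);
    if (out.some((v) => Number.isNaN(v))) {
      throw new Error('Inference returned NaN values');
    }
    return out;
  }

  // Try the current backend; on any failure, switch to WASM once and retry.
  async function embed(text: string): Promise<Float32Array> {
    try {
      return await run(text);
    } catch (err) {
      if (device === 'wasm') throw err; // already on the fallback backend
      device = 'wasm';
      embedder = null; // drop the WebGPU model so run() reloads on WASM
      return run(text);
    }
  }

  return { embed, getDevice: (): Device => device };
}
```

Because the switch is one-way, later calls stay on WASM rather than retrying WebGPU for every note. That is a design choice of this sketch and may not match the PR.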

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| src/worker/embedWorker.ts | Adds NaN detection and dynamic WebGPU→WASM fallback during warmup/inference. |
| src/pipeline/UmapProjector.ts | Adds optional distance-matrix projection support and validation for index-singleton vectors. |
| src/pipeline/runPipeline.ts | Adds native-embeddings fast path and cache-vector validation before reuse. |
| src/pipeline/pipelineConfig.ts | Introduces shared embedding dimension, vector validation, and tuned default clustering config. |
| src/pipeline/nativeEmbeddingPipeline.ts | Implements native AI readiness check and paged embedding fetch with cursor loop protection. |
| src/pipeline/clustering/benchmark.ts | Extends benchmark API to optionally accept a distance matrix and project it for clustering. |
| src/commands/testEmbed.ts | Aligns test command with shared config and cache-vector validation. |
| package.json | Renames the package to match Joplin naming convention. |

Comment on lines +87 to +96
callbacks.onStatus('Clustering...');
const results = benchmark(vectors, DEFAULT_CONFIG);

const panelNotes: PanelNote[] = validNotes.map((n) => ({
  noteId: n.id,
  title: n.title,
}));

callbacks.onComplete(results, panelNotes);
return;

Collaborator Author

Added enrichResultsWithTags to the native embeddings path, using the same document-building pattern as the local worker path. Both paths now produce TF-IDF tags for the dashboard.

Comment on lines +49 to +58
if (!page || !Array.isArray(page.chunks)) {
  throw new Error('Invalid response from Joplin native getEmbeddings API');
}

if (modelId && page.modelId !== modelId) {
  throw new Error('Embedding model changed mid-fetch. Please restart.');
}
modelId = page.modelId;
chunks.push(...page.chunks);
cursor = page.nextCursor;

Collaborator Author

Added a guard that checks each chunk for noteId and vector before pushing it. Malformed chunks are logged and skipped.
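
A guard of this kind might look like the following sketch. The names `NativeChunk`, `isWellFormedChunk`, and `keepWellFormed` are hypothetical, and the exact checks in the PR may differ. The `onSkip` callback stands in for whatever logging the plugin uses.

```typescript
// Hypothetical chunk shape, based on the noteId and vector fields named above.
interface NativeChunk { noteId: string; vector: number[]; }

// Type guard: true only when the chunk has a non-empty noteId and an array vector.
function isWellFormedChunk(chunk: unknown): chunk is NativeChunk {
  if (typeof chunk !== 'object' || chunk === null) return false;
  const c = chunk as { noteId?: unknown; vector?: unknown };
  return typeof c.noteId === 'string' && c.noteId.length > 0 && Array.isArray(c.vector);
}

// Keep well-formed chunks and report the rest to onSkip (for logging, say).
function keepWellFormed(
  raw: unknown[],
  onSkip: (chunk: unknown) => void = () => {},
): NativeChunk[] {
  const kept: NativeChunk[] = [];
  for (const chunk of raw) {
    if (isWellFormedChunk(chunk)) kept.push(chunk);
    else onSkip(chunk);
  }
  return kept;
}
```

This only checks the chunk's shape. Checking the vector's length and values is a separate step, handled in the PR by the shared vector validation.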

Comment on lines +6 to +10
export function isValidEmbeddingVector(vector: number[] | undefined | null): boolean {
  if (!vector) return false;
  if (vector.length !== EMBEDDING_DIM) return false;
  return vector.every((v) => v !== null && !Number.isNaN(v));
}

@Harsh16gupta Harsh16gupta Jul 4, 2026

Collaborator Author

Switched to Number.isFinite(v), which rejects null, NaN, Infinity, and non-numeric values. I did not change the signature to unknown, since callers already pass typed values and the any casts that would require would make things worse.
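
For reference, the validator after this change would presumably read something like the sketch below. It is reconstructed from the snippet under review and this reply, not copied from the final commit, and `EMBEDDING_DIM` is set to the 384 dimensions mentioned in the PR description.

```typescript
// 384 is the embedding dimension stated in the PR description.
const EMBEDDING_DIM = 384;

function isValidEmbeddingVector(vector: number[] | undefined | null): boolean {
  if (!vector) return false;
  if (vector.length !== EMBEDDING_DIM) return false;
  // Number.isFinite does no type coercion, so null, NaN, Infinity, and
  // non-number values (for example from a corrupted cache) all fail.
  return vector.every((v) => Number.isFinite(v));
}
```

One caveat: `every` skips the empty slots of a sparse array, so a check like this relies on the vectors being dense, which arrays parsed from JSON always are.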

Comment thread src/pipeline/UmapProjector.ts
Comment thread src/worker/embedWorker.ts
Comment on lines +86 to +94
const t0 = performance.now();
const output = await embedder(text, { pooling: POOLING, normalize: true });
const inferenceTime = performance.now() - t0;
const dimensions = output.data.length;
const embedding = Array.from(output.data as Float32Array);

if (embedding.some((v) => isNaN(v))) {
  throw new Error('Inference returned NaN values');
}

Collaborator Author

Changed it to check the typed array (Float32Array) for NaNs first.
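
One way to do that ordering is sketched below. The helper name `toCheckedEmbedding` is hypothetical, and the actual change in the worker may be structured differently.

```typescript
// Scan the Float32Array directly, and only copy it into a plain array once
// it is known to be free of NaN values.
function toCheckedEmbedding(data: Float32Array): number[] {
  for (let i = 0; i < data.length; i++) {
    if (Number.isNaN(data[i])) {
      throw new Error('Inference returned NaN values');
    }
  }
  return Array.from(data);
}
```

Checking before the copy avoids allocating a plain array for output that is about to be discarded.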

3 participants