lloyal.node

Native backend for the lloyal inference platform.

Prebuilt llama.cpp binaries for 13 platform/GPU combinations, exposing a SessionContext that powers the @lloyal-labs/sdk inference primitives (Branch, BranchStore, Session, Rerank) and the @lloyal-labs/lloyal-agents multi-agent framework. Built on liblloyal, a header-only C++20 inference kernel for llama.cpp.

All SDK and agent exports are re-exported from this package for convenience — import { Branch, useAgent, agentPool } from "@lloyal-labs/lloyal.node" works out of the box.

Install

npm install @lloyal-labs/lloyal.node

Prebuilt binaries cover 13 platform/GPU combinations. The GPU variant is selected at runtime, not at install time.

Platform   Arch    Acceleration
macOS      arm64   Metal
macOS      x64     CPU
Linux      x64     CPU / CUDA / Vulkan
Linux      arm64   CPU / CUDA / Vulkan
Windows    x64     CPU / CUDA / Vulkan
Windows    arm64   CPU / Vulkan
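
The support matrix above can be written down as data, which makes the count of 13 easy to check. This is only an illustrative sketch: `SUPPORT_MATRIX` and `accelerationsFor` are hypothetical names, not exports of this package, and the lowercase platform and variant strings are assumptions modeled on the values Node reports in `process.platform` and `process.arch`.

```typescript
// Hypothetical encoding of the support table; not an export of lloyal.node.
type Acceleration = "cpu" | "cuda" | "vulkan" | "metal";

const SUPPORT_MATRIX: Record<string, Acceleration[]> = {
  "darwin-arm64": ["metal"],
  "darwin-x64": ["cpu"],
  "linux-x64": ["cpu", "cuda", "vulkan"],
  "linux-arm64": ["cpu", "cuda", "vulkan"],
  "win32-x64": ["cpu", "cuda", "vulkan"],
  "win32-arm64": ["cpu", "vulkan"],
};

// Accelerations listed for a platform/arch pair (empty array if unlisted).
function accelerationsFor(platform: string, arch: string): Acceleration[] {
  return SUPPORT_MATRIX[`${platform}-${arch}`] ?? [];
}

// Total number of platform/GPU combinations in the table.
const totalCombinations = Object.keys(SUPPORT_MATRIX).reduce(
  (sum, key) => sum + SUPPORT_MATRIX[key].length,
  0,
);

console.log(totalCombinations); // 13
console.log(accelerationsFor("win32", "arm64")); // [ 'cpu', 'vulkan' ]
```

A lookup like this only says which variants are listed for a platform. Whether a given variant actually loads on a machine still depends on the runtime check described under GPU Variant Selection.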

Quick Start

import { createContext } from "@lloyal-labs/lloyal.node";
import { Branch, BranchStore } from "@lloyal-labs/sdk";

const ctx = await createContext({ modelPath: "./model.gguf", nSeqMax: 4 });
const store = new BranchStore(ctx);

const root = Branch.create(ctx, 0, { temperature: 0.8 });
await root.prefill(await ctx.tokenize("Explain quantum entanglement"));

// Fork and generate — all branches in lockstep, 1 GPU call per step
const branches = await Promise.all([root.fork(), root.fork(), root.fork()]);
for (;;) {
  const live = branches.filter((b) => !b.disposed);
  if (!live.length) break;
  const produced = live.map((b) => ({ b, ...b.produce() }));
  for (const p of produced.filter((p) => p.isStop)) await p.b.prune();
  const items = produced
    .filter((p) => !p.isStop)
    .map((p) => {
      p.b.accept(p.token);
      return [p.b, p.token];
    });
  await store.commit(items);
}

Or for single-branch generation, Branch is an async iterable:

for await (const { token, text } of branch) {
  process.stdout.write(text);
}

See @lloyal-labs/sdk for the full Branch API, continuous tree batching, KV tenancy, and topology documentation.
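
The control flow of the lockstep loop in the Quick Start can be traced without a model. The sketch below is a toy simulation: `MockBranch`, `MockStore`, and `runLockstep` are hypothetical stand-ins written for this illustration, and they only mimic the method names used above (`produce`, `accept`, `prune`, `commit`). They say nothing about how the real Branch and BranchStore are implemented.

```typescript
// Toy stand-ins for Branch and BranchStore; no model or GPU is involved.
class MockBranch {
  disposed = false;
  tokens: number[] = [];
  script: number[];
  constructor(script: number[]) {
    this.script = script;
  }
  // Next scripted token; running out of script plays the role of a stop token.
  produce(): { token: number; isStop: boolean } {
    const next = this.script[this.tokens.length];
    return next === undefined
      ? { token: -1, isStop: true }
      : { token: next, isStop: false };
  }
  accept(token: number): void {
    this.tokens.push(token);
  }
  async prune(): Promise<void> {
    this.disposed = true;
  }
}

class MockStore {
  commits = 0;
  // Counts one batched call per step, however many branches are in the batch.
  async commit(items: Array<[MockBranch, number]>): Promise<void> {
    if (items.length > 0) this.commits += 1;
  }
}

// Same shape as the Quick Start loop: produce, prune stopped, accept, commit.
async function runLockstep(branches: MockBranch[], store: MockStore): Promise<void> {
  for (;;) {
    const live = branches.filter((b) => !b.disposed);
    if (live.length === 0) break;
    const produced = live.map((b) => ({ b, ...b.produce() }));
    for (const p of produced.filter((p) => p.isStop)) await p.b.prune();
    const items = produced
      .filter((p) => !p.isStop)
      .map((p): [MockBranch, number] => {
        p.b.accept(p.token);
        return [p.b, p.token];
      });
    await store.commit(items);
  }
}

const demoStore = new MockStore();
const demoBranches = [[1, 2, 3], [4], [5, 6]].map((s) => new MockBranch(s));
const demoDone = runLockstep(demoBranches, demoStore).then(() => {
  console.log(demoStore.commits); // 3 batched calls for 6 tokens
});
```

In this toy run the number of commits equals the length of the longest branch, not the total number of tokens, which is the point of batching all live branches into one call per step.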

Without the SDK

createContext returns a SessionContext — the native interface to llama.cpp. You can use it directly without the SDK's Branch/BranchStore layer:

import { createContext } from "@lloyal-labs/lloyal.node";

const ctx = await createContext({ modelPath: "./model.gguf", nSeqMax: 4 });

// Chat templates — model-agnostic formatting + tool calling
const { prompt, grammar, format } = await ctx.formatChat(messages, {
  addGenerationPrompt: true,
  tools: [{ type: "function", function: { name: "search", parameters: schema } }],
});
const { content, toolCalls } = await ctx.parseChatOutput(output, format);

// Branch primitives — what the SDK's Branch class wraps
const handle = ctx._branchCreate(0, samplerParams);
await ctx._branchPrefill(handle, tokens);
const token = ctx._branchSample(handle);
const text = ctx.tokenToText(token);
const isStop = ctx.isStopToken(token);
ctx._branchAccept(handle, token);
const logits = ctx._branchGetLogits(handle);     // Float32Array(vocabSize)
const entropy = ctx._branchModelEntropy(handle);
const child = ctx._branchFork(handle);

// Store primitives — what the SDK's BranchStore wraps
await ctx._storeCommit([handle1, handle2], [tok1, tok2]);  // N branches, 1 GPU call
await ctx._storePrefill([handle], [tokens]);
await ctx._storeRetainOnly(winner);
const available = ctx._storeAvailable();

// KV cache — snapshot, copy, persist
await ctx.kvSeqCopy(0, 1);                      // share prefix across sequences
await ctx.kvCacheSave();                         // snapshot for rollback
await ctx.kvCacheLoad();                         // restore checkpoint
await ctx.kvCacheWriteFile("cache.bin");         // persist to disk

// Embeddings
const embeddings = await ctx.encode("query text");
const dim = ctx.getEmbeddingDimension();

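// Grammar + tokenizer
// (distinct names: `grammar` is already bound by formatChat above, and `tokens` is used earlier)
const schemaGrammar = await ctx.jsonSchemaToGrammar(schema);
const helloTokens = await ctx.tokenize("Hello world");
const sep = await ctx.getTurnSeparator();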
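
To give a feel for what an entropy value over a logits vector means, here is a minimal sketch that computes Shannon entropy from a `Float32Array` of logits via a numerically stable softmax. `entropyFromLogits` is a hypothetical helper, not part of this package, and the choice of natural-log units (nats) is an assumption. The value returned by `_branchModelEntropy` may be computed differently or in different units.

```typescript
// Shannon entropy (in nats) of the softmax distribution over a logits vector.
// Assumption: natural-log units; the native metric may use another base.
function entropyFromLogits(logits: Float32Array): number {
  let max = -Infinity;
  for (let i = 0; i < logits.length; i++) {
    if (logits[i] > max) max = logits[i];
  }
  // log-sum-exp with the max subtracted for numerical stability
  let sumExp = 0;
  for (let i = 0; i < logits.length; i++) {
    sumExp += Math.exp(logits[i] - max);
  }
  const logZ = max + Math.log(sumExp);

  let entropy = 0;
  for (let i = 0; i < logits.length; i++) {
    const logP = logits[i] - logZ; // log-probability of token i
    const p = Math.exp(logP);
    if (p > 0) entropy -= p * logP; // skip terms that underflow to zero
  }
  return entropy;
}

// Uniform logits give the maximum entropy, ln(vocabSize).
const uniformEntropy = entropyFromLogits(new Float32Array([0, 0, 0, 0]));
// A single dominant logit gives entropy close to zero.
const peakedEntropy = entropyFromLogits(new Float32Array([100, 0, 0, 0]));
console.log(uniformEntropy, peakedEntropy);
```

High entropy means the model is spreading probability over many tokens, and low entropy means it is confident. That is what makes the value usable as a mid-generation control signal.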

What This Package Provides

Native-only (not in SDK):

  • createContext(options) — load a GGUF model, return a SessionContext
  • loadBinary(options?) — explicit GPU variant selection with automatic fallback
  • Prebuilt binaries for 13 platform/GPU combinations

Re-exported from @lloyal-labs/sdk:

  • Branch, BranchStore, Session, Rerank
  • Per-token metrics: modelEntropy(), modelSurprisal(), samplingPerplexity
  • Chat formatting: formatChat(), parseChatOutput()
  • Grammar: jsonSchemaToGrammar(), setGrammar()

Re-exported from @lloyal-labs/lloyal-agents:

  • useAgent, agentPool, useAgentPool, withSpine, diverge, reduce, createToolkit
  • Structured concurrency DAG via Effection generators
  • In-loop orchestration: agents as branches of a single running process
  • App protocol surfaces (AppRegistryCtx, AppConfigStoreCtx, App, AppManifest) when paired with @lloyal-labs/rig's defineApp / createAppRegistry
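
As background for the per-token metrics listed above, the textbook definitions of surprisal and perplexity can be sketched in a few lines. `surprisal` and `perplexity` here are hypothetical helpers written for illustration and use natural-log units. How the SDK's `modelSurprisal()` and `samplingPerplexity` are actually computed, and over which distribution, is not specified in this README.

```typescript
// Surprisal of a token that was assigned probability p, in nats: -ln p.
function surprisal(p: number): number {
  return -Math.log(p);
}

// Perplexity of a sequence: exp of the mean per-token surprisal.
function perplexity(tokenProbs: number[]): number {
  if (tokenProbs.length === 0) return NaN;
  const total = tokenProbs.reduce((sum, p) => sum + surprisal(p), 0);
  return Math.exp(total / tokenProbs.length);
}

// Every token had probability 1/4, so perplexity is approximately 4.
const demoPerplexity = perplexity([0.25, 0.25, 0.25]);
console.log(demoPerplexity);
```

Perplexity can be read as the effective number of equally likely choices per token: a value near 1 means each token was almost certain, and larger values mean more uncertainty.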

GPU Variant Selection

import { loadBinary, createContext } from "@lloyal-labs/lloyal.node";

// Automatic — uses Metal on macOS, CPU elsewhere
const ctx = await createContext({ modelPath: "./model.gguf" });

// Explicit CUDA
const binding = loadBinary({ gpuVariant: "cuda" });
const cudaCtx = await binding.createContext({ modelPath: "./model.gguf" });
// Falls back to CPU with a warning if the CUDA runtime is not available
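
The fallback behaviour described above follows a common pattern: try the requested variant, and if it cannot be loaded, warn and use CPU. The sketch below shows that pattern in isolation. `loadWithFallback` and the loader map are hypothetical and are not how `loadBinary` is implemented; the real function's internals are not documented here.

```typescript
// Hypothetical loader: returns a binding, or throws if the runtime is missing.
type Loader<T> = () => T;

function loadWithFallback<T>(
  preferred: string,
  loaders: Record<string, Loader<T>>,
  warn: (msg: string) => void = console.warn,
): { variant: string; binding: T } {
  try {
    const tryLoad = loaders[preferred];
    if (!tryLoad) throw new Error("no such variant");
    return { variant: preferred, binding: tryLoad() };
  } catch (err) {
    // Report why the preferred variant failed, then fall through to CPU.
    warn(`${preferred} unavailable (${(err as Error).message}); falling back to cpu`);
  }
  return { variant: "cpu", binding: loaders["cpu"]() };
}

// Simulate a machine without the CUDA runtime installed.
const fallbackWarnings: string[] = [];
const fallbackResult = loadWithFallback<string>(
  "cuda",
  {
    cuda: () => {
      throw new Error("CUDA runtime not found");
    },
    cpu: () => "cpu-binding",
  },
  (msg) => fallbackWarnings.push(msg),
);
console.log(fallbackResult.variant); // "cpu"
```

Passing the warning sink as a parameter keeps the sketch testable. In an application, the default `console.warn` would surface the message.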

Examples

Example    Pattern
entropy/   modelEntropy() mid-generation as control signal
chat/      Interactive streaming chat
embed/     Text embeddings extraction

npx tsx examples/best-of-n/best-of-n.ts
npx tsx examples/chat/chat.ts ./model.gguf

CI Testing

Integration tests run real inference across architectures:

Architecture   Test Model     Template
Llama          Llama 3.2 1B   llama3
Phi            Phi 3.5 Mini   phi3
Qwen           Qwen 3 1.7B    chatml
Gemma          Gemma 3 1B     gemma
SmolLM         SmolLM2 1.7B   chatml
Ministral      Ministral 3B   mistral

See distribution.md for details.

Ecosystem

Package                      Description
@lloyal-labs/sdk             Backend-agnostic inference primitives (Branch, BranchStore, Session, Rerank)
@lloyal-labs/lloyal-agents   Multi-agent runtime + App protocol primitives
@lloyal-labs/rig             App protocol helpers, retrieval providers, framework tools (Plan/Delegate/Report)
harness.dev                  CLI: scaffold harnesses + Apps; publish/install signed Apps via the channel
liblloyal                    Header-only C++20 inference kernel for llama.cpp
lloyal.node                  This package: native backend + prebuilt binaries
nitro-llama                  React Native backend via Nitro Modules
tsampler                     Reference sampler implementation

Contributing

See CONTRIBUTING.md for development setup and release process.

License

You can build and sell commercial products using lloyal.node.

lloyal.node 3.0 is source-available under FSL-1.1-Apache-2.0 and converts to Apache 2.0 two years after each release. The restriction is narrow: you cannot offer a competing HDK runtime, managed HDK service, or alternative HDK App distribution channel.

See LICENSE-FAQ.md for concrete examples of what's permitted and what's restricted. See LICENSE for the legal text and NOTICE for attribution including the bundled llama.cpp MIT dependency.

About

Llama.cpp prefix-sharing and Git-like inference trees for Node.js
