# The Inference Mesh

The inference mesh is the heart of XYNQ. It is a fabric of heterogeneous GPU nodes that, together, behave like one large model-serving endpoint.

### Heterogeneous by design

Nodes are ordinary consumer GPUs of varying capability — for example RTX 3080s, an RTX 4070 Ti, and RTX 4090s in the current reference fleet. The mesh treats VRAM and compute as fungible resources and assigns work according to each node's capacity.

| Class | Example GPU | VRAM  | Role in the mesh                                     |
| ----- | ----------- | ----- | ---------------------------------------------------- |
| Light | RTX 3080    | 10 GB | Holds a few layers; good for pipeline middle stages. |
| Mid   | RTX 4070 Ti | 12 GB | Balanced shard host.                                 |
| Heavy | RTX 4090    | 24 GB | Holds large shards or hosts smaller models whole.    |
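
To illustrate what "assigns work according to each node's capacity" could mean in practice, here is a minimal sketch that splits a model's layers across nodes in proportion to VRAM. It assumes VRAM is the only factor and uses largest-remainder rounding; the function name `split_layers` and the 32-layer model are hypothetical, and the real coordinator may also weigh compute, latency, and utilization.

```python
def split_layers(vram_gb: dict[str, int], n_layers: int) -> dict[str, int]:
    """Assign n_layers to nodes in proportion to their VRAM."""
    total = sum(vram_gb.values())
    # Ideal (fractional) share of layers for each node.
    exact = {name: n_layers * gb / total for name, gb in vram_gb.items()}
    # Start from the rounded-down share.
    alloc = {name: int(share) for name, share in exact.items()}
    leftover = n_layers - sum(alloc.values())
    # Hand the remaining layers to the nodes with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


# VRAM figures from the table above; the layer count is illustrative.
fleet = {"rtx3080": 10, "rtx4070ti": 12, "rtx4090": 24}
plan = split_layers(fleet, 32)
print(plan)
```

Under these assumptions the 24 GB card ends up with roughly half the layers, matching its share of the fleet's total VRAM.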

### What a "node" provides

Each node advertises a capability descriptor to the coordinator:

* total and free VRAM,
* compute throughput,
* the model shards it currently holds (its "shard set"),
* network latency characteristics,
* current utilization and thermal headroom.

A node's advertisement to the coordinator looks roughly like this:

```python
from dataclasses import dataclass, field

@dataclass
class NodeCapability:
    node_id: str                  # "node-04"
    gpu: str                      # "RTX 4090"
    vram_total_mb: int            # 24576
    vram_free_mb: int             # live value, refreshed on every heartbeat
    tflops_fp16: float            # measured on the node, not the vendor's advertised figure
    shard_set: list[str] = field(default_factory=list)  # ["jaguar:stage-3", "qwen35-27b:stage-1"]
    region: str = "unknown"       # "eu-central"
    rtt_ms_p50: float = 0.0       # to coordinator
    utilization: float = 0.0      # 0.0 - 1.0
    temp_c: int = 0
    reputation: float = 0.5       # 0.0 - 1.0, earned over time

    def can_host(self, stage_bytes: int) -> bool:
        # require 15% headroom beyond the stage itself for the KV cache and CUDA context
        free_bytes = self.vram_free_mb * 1024 * 1024
        return free_bytes > stage_bytes * 1.15
```
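
As a usage illustration, the descriptor can drive a simple placement decision. The sketch below is not the coordinator's actual policy: `Node` is a trimmed-down stand-in for `NodeCapability`, and `pick_host` with its ranking rule (least utilized first, reputation as tie-breaker) is a hypothetical example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Trimmed-down stand-in for NodeCapability (illustration only)."""
    node_id: str
    vram_free_mb: int
    utilization: float = 0.0
    reputation: float = 0.5

    def can_host(self, stage_bytes: int) -> bool:
        # Same 15% headroom rule as the descriptor above.
        return self.vram_free_mb * 1024 * 1024 > stage_bytes * 1.15


def pick_host(nodes: list[Node], stage_bytes: int) -> Optional[Node]:
    """Return the preferred node for a stage, or None if nothing fits."""
    candidates = [n for n in nodes if n.can_host(stage_bytes)]
    if not candidates:
        return None
    # Hypothetical ranking: lowest utilization wins, higher reputation breaks ties.
    return min(candidates, key=lambda n: (n.utilization, -n.reputation))


nodes = [
    Node("node-01", vram_free_mb=8_000, utilization=0.20),
    Node("node-02", vram_free_mb=20_000, utilization=0.60),
    Node("node-03", vram_free_mb=11_000, utilization=0.35, reputation=0.9),
]
stage_bytes = 9 * 1024**3  # a 9 GiB stage
chosen = pick_host(nodes, stage_bytes)
```

Here `node-01` is the least busy but cannot fit the stage once headroom is included, so the choice falls to the least-utilized node that can.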

The coordinator keeps a live registry of these and updates them from each node's heartbeat:

```python
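import time

async def on_heartbeat(self, hb: Heartbeat) -> None:
    node = self.registry.get(hb.node_id)
    if node is None:
        # heartbeat from a node that is not in the registry; nothing to update
        return
    # refresh the live fields of the node's capability descriptor
    node.vram_free_mb = hb.vram_free_mb
    node.utilization  = hb.utilization
    node.temp_c       = hb.temp_c
    node.rtt_ms_p50   = hb.rtt_ms_p50
    node.last_seen    = time.monotonic()  # coordinator-local clock, unaffected by wall-clock jumps
    # a node that misses N heartbeats is presumed gone and its shards re-covered
    self.health.mark_alive(hb.node_id)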
```
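
The "misses N heartbeats" rule mentioned in the comment above can be sketched as a small tracker. This is an illustration under stated assumptions: the `HealthTracker` class, the 2-second interval, and N = 3 are hypothetical, and only the `mark_alive` name comes from the snippet above.

```python
import time
from typing import Optional


class HealthTracker:
    """Presumes a node gone once it has been silent for max_missed intervals."""

    def __init__(self, interval_s: float = 2.0, max_missed: int = 3) -> None:
        self.interval_s = interval_s    # assumed heartbeat period
        self.max_missed = max_missed    # the "N" in "misses N heartbeats"
        self.last_seen: dict[str, float] = {}

    def mark_alive(self, node_id: str, now: Optional[float] = None) -> None:
        # Record the time of the latest heartbeat for this node.
        self.last_seen[node_id] = time.monotonic() if now is None else now

    def presumed_gone(self, now: Optional[float] = None) -> list[str]:
        # A node is presumed gone when its silence exceeds N heartbeat periods.
        now = time.monotonic() if now is None else now
        deadline = self.interval_s * self.max_missed
        return sorted(n for n, seen in self.last_seen.items() if now - seen > deadline)


health = HealthTracker(interval_s=2.0, max_missed=3)
health.mark_alive("node-04", now=0.0)
health.mark_alive("node-07", now=5.0)
gone = health.presumed_gone(now=7.0)  # node-04 has been silent for 7 s (> 6 s)
```

The optional `now` argument exists only to make the example deterministic; a coordinator would normally rely on the clock.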

### The mesh as a living system

Because contributors come and go, the mesh is never static. The coordinator continuously:

* **admits** new nodes and integrates them into shard groups,
* **rebalances** shards when capacity changes,
* **drains** nodes that are leaving (or whose owner reclaimed the GPU),
* **heals** around failures by promoting replicas.
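
The healing step in the list above can be sketched as follows. This assumes each shard group is an ordered list of node IDs whose first entry serves as the primary; that layout and the `heal` function are hypothetical, chosen only to show what "promoting replicas" might look like.

```python
def heal(shard_groups: dict[str, list[str]], dead: str) -> list[str]:
    """Drop a dead node from every shard group.

    Returns the stages left with no holder, which need to be re-covered.
    """
    uncovered = []
    for stage, holders in shard_groups.items():
        if dead in holders:
            # If the dead node was the primary (index 0), the next replica
            # in the list is implicitly promoted by this removal.
            holders.remove(dead)
        if not holders:
            uncovered.append(stage)
    return uncovered


groups = {
    "jaguar:stage-3": ["node-04", "node-07"],  # primary first, then replicas
    "qwen35-27b:stage-1": ["node-04"],         # no replica
}
needs_recovery = heal(groups, "node-04")
```

After the call, `node-07` is the primary for `jaguar:stage-3`, while `qwen35-27b:stage-1` has no holder left and would have to be loaded onto another node.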

The **Network** tab shows this activity directly: the figures that drift as you watch are real-time utilization across the fleet.

### Why a mesh instead of one big server

* **Cost:** idle consumer GPUs are effectively free capacity.
* **Resilience:** no single point of failure; the network routes around dead nodes.
* **Scalability:** capacity scales with community size, not capital expenditure.
* **Sovereignty:** no dependence on any single cloud provider.

