# Fault Tolerance & Verification

A network of volunteer GPUs must assume nodes are unreliable and untrusted. XYNQ is engineered around that assumption.

### Churn is normal

Nodes disappear without warning — a contributor closes their laptop, a connection drops, a GPU gets reclaimed for a game. XYNQ treats this as the expected case:

* **Replicas everywhere:** every shard stage has multiple holders.
* **Mid-stream failover:** if a node dies during decode, the router re-pins remaining stages to healthy replicas and continues.
* **Draining:** nodes that leave *gracefully* signal first, finish in-flight work, and stop accepting new requests.
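
The graceful-leave path in the last bullet can be sketched as below. This is a minimal illustration, not XYNQ's implementation: `DrainingNode`, `submit`, and `drain` are hypothetical names, and in-flight work is modeled as plain `asyncio` tasks.

```python
import asyncio


class DrainingNode:
    """Hypothetical node wrapper; names are illustrative, not XYNQ's API."""

    def __init__(self):
        self.accepting = True
        self._in_flight = set()

    def submit(self, coro):
        # Refuse new work once draining has started.
        if not self.accepting:
            coro.close()  # release the coroutine we will never run
            raise RuntimeError("node is draining")
        task = asyncio.create_task(coro)  # requires a running event loop
        self._in_flight.add(task)
        task.add_done_callback(self._in_flight.discard)
        return task

    async def drain(self):
        # Step 1: signal by refusing new requests from now on.
        self.accepting = False
        # Step 2: wait for work that was already accepted.
        if self._in_flight:
            await asyncio.gather(*self._in_flight, return_exceptions=True)
```

`await node.drain()` returns once the accepted tasks have settled. Any `submit` call made after draining starts raises, so a caller such as a router would have to pick another replica.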

Failover during decode keeps the stream alive by re-pinning the failed stage to a replica and replaying from the last committed token:

```python
async def decode_with_failover(plan, ctx, params):
    while not ctx.done:
        try:
            tok = await plan.step(ctx)           # one token through the pipeline
            yield tok
        except NodeUnavailable as e:
            # the dead node only held a slice; swap in a replica and resume
            replacement = registry.pick_replica(e.stage, exclude=e.node_id)
            if replacement is None:
                raise                            # no replica -> surface a soft error
            plan.repin(e.stage, replacement)
            ctx.rewind_to(e.last_committed_token)  # KV state replayed on new node
```

### Verification

Because contributors are untrusted, the network guards against faulty or dishonest results:

* **Redundant execution & spot-checking** — a fraction of work is recomputed on independent nodes and compared.
* **Reputation scoring** — nodes that consistently produce correct, timely results gain trust and receive more work and rewards; nodes that fail checks are downranked or ejected.
* **Deterministic checkpoints** — intermediate activations can be validated against expected shapes and bounds to catch corruption early.
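
The shape-and-bounds check in the last bullet might look like the sketch below. It assumes an activation arrives as a flat sequence of floats plus a declared shape; `CheckpointSpec`, `validate_checkpoint`, and the `max_abs` bound are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CheckpointSpec:
    """Hypothetical per-stage expectations; field names are illustrative."""
    shape: tuple      # expected tensor shape, e.g. (seq_len, hidden)
    max_abs: float    # loose bound on activation magnitude


def validate_checkpoint(values: Sequence[float], shape: tuple,
                        spec: CheckpointSpec) -> list:
    """Return a list of problems; an empty list means the checkpoint passed."""
    problems = []
    if shape != spec.shape:
        problems.append(f"shape {shape} != expected {spec.shape}")
    if len(values) != math.prod(shape):
        problems.append("element count does not match declared shape")
    if any(not math.isfinite(v) for v in values):
        problems.append("non-finite value (NaN or inf)")
    elif any(abs(v) > spec.max_abs for v in values):
        problems.append(f"value exceeds bound {spec.max_abs}")
    return problems
```

A check like this is cheap enough to run on every hop, but it only catches corruption. An in-range wrong answer passes it, which is why it complements recomputation and does not replace it.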

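Spot-checking recomputes a sampled stage output on a second node and compares the two results:

```python
import random

from numpy import allclose   # assumption: the original does not name the library


async def verify_sample(stage_output: Activation, p: float = 0.03) -> None:
    """Recompute a small fraction of stage outputs on an independent replica."""
    if random.random() >= p:
        return                                   # most work is trusted, cheaply
    shadow = registry.pick_replica(stage_output.stage,
                                   exclude=stage_output.node_id)
    if shadow is None:
        return                                   # no independent replica to compare against
    recomputed = await shadow.run(stage_output.inputs)
    # allow small numeric drift across different GPUs/kernels
    # note: a single mismatch cannot tell which of the two nodes is wrong
    if not allclose(recomputed, stage_output.value, rtol=1e-2):
        reputation.penalize(stage_output.node_id)
        registry.quarantine(stage_output.node_id)
    else:
        reputation.reward(stage_output.node_id)
```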
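
One way the `reputation` object used above could behave is sketched here. Only the `reward` and `penalize` method names come from the snippet above. The scoring rule, the constants, and the `is_ejected` and `weight` helpers are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Reputation:
    """Hypothetical score book; constants are illustrative, not XYNQ's values."""
    gain: float = 0.05          # fraction of the gap to 1.0 closed per passed check
    loss: float = 0.5           # multiplicative cut applied on a failed check
    eject_below: float = 0.1    # nodes scoring under this get no new work
    scores: dict = field(default_factory=dict)

    def _get(self, node_id: str) -> float:
        return self.scores.get(node_id, 0.5)   # unknown nodes start neutral

    def reward(self, node_id: str) -> None:
        s = self._get(node_id)
        self.scores[node_id] = s + self.gain * (1.0 - s)

    def penalize(self, node_id: str) -> None:
        self.scores[node_id] = self._get(node_id) * (1.0 - self.loss)

    def is_ejected(self, node_id: str) -> bool:
        return self._get(node_id) < self.eject_below

    def weight(self, node_id: str) -> float:
        """Relative share of new work; ejected nodes get none."""
        return 0.0 if self.is_ejected(node_id) else self._get(node_id)
```

The asymmetry is deliberate in this sketch: trust builds slowly and drops quickly, so a node cannot cheaply offset failed checks with a few passed ones.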

### Integrity & safety

* All transport is encrypted; nodes see only the shard-level activations they need to compute, not necessarily the full plaintext context.
* Misbehaving nodes are quarantined automatically.
* Because each node holds only a slice of the pipeline, no single node can observe or persist a complete conversation.

### Result

From your perspective, the network feels like a single reliable endpoint — even though under the hood it is a shifting swarm of consumer hardware.

