
# Sharding & Routing

Large models don't fit on one consumer GPU. XYNQ solves this with sharding: a model is split across several nodes so that each node holds only part of it.

### Pipeline parallelism

A model is a stack of transformer layers. XYNQ splits that stack into contiguous **stages**, and assigns each stage to one or more nodes. A request flows through the stages in order:

```
[ node A: layers 0–11 ] -> [ node B: layers 12–23 ] -> [ node C: layers 24–35 ] -> output
```

Each node only needs enough VRAM for its stage, not the whole model. This is what lets a 27B-parameter model run across several 10–24 GB cards.
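As a rough illustration of that arithmetic, the sketch below estimates weight memory only. The bytes-per-parameter values (2 for 16-bit weights, 0.5 for 4-bit quantization) and both function names are assumptions made for this example. This page does not state which precision XYNQ uses, and KV cache and activation memory are ignored.

```python
import math


def weight_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight memory in GB (10^9 bytes); ignores KV cache and activations."""
    return params_billions * bytes_per_param


def min_cards(params_billions: float, bytes_per_param: float, card_gb: float) -> int:
    """Lower bound on cards needed if the weights were split perfectly evenly."""
    return math.ceil(weight_gb(params_billions, bytes_per_param) / card_gb)


print(weight_gb(27, 2.0))       # 54.0 (GB of 16-bit weights)
print(min_cards(27, 2.0, 24))   # 3
print(min_cards(27, 2.0, 10))   # 6
print(min_cards(27, 0.5, 10))   # 2
```

These are lower bounds: real stages also need headroom for runtime state, so a practical split would likely use more cards than the figures printed here.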

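Partitioning a model into stages that fit in node VRAM is a greedy packing pass run at placement time. Because stages must stay contiguous, layers are packed in order rather than rearranged as in general bin packing: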

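```python
MB = 1024 * 1024      # bytes per MiB, since vram_free_mb is reported in MiB
VRAM_SAFETY = 0.9     # assumed headroom fraction; the real value is not given here


def partition_layers(model: ModelSpec, nodes: list[NodeCapability]) -> list[Stage]:
    """Greedily pack contiguous layers into stages bounded by node VRAM."""
    stages, current, budget = [], [], 0
    # Simplification: every stage is sized against the largest free-VRAM node.
    # Matching each stage to a specific host is left to shard placement.
    capacity = max(n.vram_free_mb for n in nodes) * MB
    for layer in model.layers:
        # Close the current stage when the next layer would overflow it.
        # The `current` check avoids emitting an empty stage when a single
        # layer is larger than the budget.
        if current and budget + layer.bytes > capacity * VRAM_SAFETY:
            stages.append(Stage(layers=current))
            current, budget = [], 0
        current.append(layer)
        budget += layer.bytes
    if current:
        stages.append(Stage(layers=current))
    return stages
```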

### Tensor parallelism (within a stage)

For heavy stages, a single layer's matrices can be split *across* multiple nodes that compute partial results in parallel and combine them. This trades extra network communication for the ability to host very large layers on modest GPUs.
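To make the split-and-combine idea concrete, here is a minimal pure-Python sketch of one common scheme, a column-wise split of a weight matrix. The function names are hypothetical and the shards run in sequence rather than on separate nodes; this page does not say which splitting scheme XYNQ actually uses.

```python
def matvec(x: list[float], w: list[list[float]]) -> list[float]:
    """Compute x @ w for a row vector x and a matrix w given as a list of rows."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w: list[list[float]], parts: int) -> list[list[list[float]]]:
    """Split w into at most `parts` column blocks, one per hypothetical node."""
    cols = len(w[0])
    step = -(-cols // parts)  # ceiling division
    return [[row[s:s + step] for row in w] for s in range(0, cols, step)]


def tensor_parallel_matvec(x: list[float], w: list[list[float]], parts: int) -> list[float]:
    # Each shard would be computed on a different node; here they run in sequence.
    partials = [matvec(x, shard) for shard in split_columns(w, parts)]
    # Combine step: concatenate the partial outputs in shard order.
    return [v for part in partials for v in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
assert tensor_parallel_matvec(x, w, parts=2) == matvec(x, w)
```

With a column split, each node needs the full input but only its slice of the weights, and the combine step is a concatenation. A row-wise split would instead require summing partial results, which is where the extra network communication mentioned above comes from.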

### Replication

Each stage is held by **multiple** nodes (replicas). The router can send a request to any replica of a stage, which provides:

* **load balancing** — spread traffic across replicas,
* **fault tolerance** — if one replica drops, others cover,
* **locality** — pick the replica closest to the previous stage.
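The fault-tolerance benefit can be estimated with a short calculation. The sketch below assumes that nodes fail independently and share the same uptime probability, which is a simplification; the 0.9 uptime figure and the function names are illustrative and not taken from XYNQ.

```python
def stage_availability(p_up: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent nodes is up."""
    return 1.0 - (1.0 - p_up) ** replicas


def pipeline_availability(p_up: float, replicas: int, stages: int) -> float:
    """Probability that every stage has at least one live replica."""
    return stage_availability(p_up, replicas) ** stages


for r in (1, 2, 3):
    print(r, round(pipeline_availability(0.9, r, stages=3), 4))
```

Under these assumptions, a three-stage pipeline is fully covered roughly 73% of the time with one replica per stage, about 97% with two, and about 99.7% with three.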

### The router

The router is the component that, per request, picks one replica of each stage to form a concrete pipeline. It scores candidate pipelines on latency, load, and reliability, then commits the best one. If a node fails mid-stream, the router can re-pin the remaining stages to healthy replicas and resume.

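```python
MAX_UTILIZATION = 0.97   # replicas at or above this load are skipped


def build_pipeline(stages: list[Stage], registry: Registry, near: str) -> list[NodeCapability] | None:
    """Pick one replica per stage, hop by hop; return None if any stage has no usable replica."""
    chosen, prev = [], None
    for stage in stages:
        replicas = registry.replicas_of(stage)            # all nodes holding this stage
        replicas = [n for n in replicas if n.is_alive and n.utilization < MAX_UTILIZATION]
        if not replicas:
            return None                                   # missing coverage -> trigger rebalance
        # lowest score wins: low load, high reputation, close to the previous hop
        best = min(replicas, key=lambda n: score_hop(n, prev, near))
        chosen.append(best)
        prev = best
    return chosen


def score_hop(node, prev, near) -> float:
    """Cost of routing the next hop to `node`; lower is better."""
    # rtt() and rtt_region() are latency lookups defined elsewhere.
    # The first hop has no predecessor, so it is scored against the caller's region.
    locality = rtt(prev, node) if prev else rtt_region(near, node.region)
    return (0.55 * node.utilization
            + 0.30 * (locality / 100.0)
            + 0.15 * (1.0 - node.reputation))
```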
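The re-pinning behavior described for the router can be sketched in a similar way. This is a simplified illustration under stated assumptions: nodes are plain string IDs, `repin` and its parameters are hypothetical names, and the first healthy replica is taken without scoring. How in-flight state on the failed node is recovered is outside the scope of this sketch.

```python
from __future__ import annotations


def repin(pipeline: list[str], replicas: dict[int, list[str]],
          healthy: set[str], resume_from: int) -> list[str] | None:
    """Replace unhealthy nodes in stages >= resume_from; earlier stages are left as-is."""
    repaired = list(pipeline)
    for stage in range(resume_from, len(pipeline)):
        if repaired[stage] in healthy:
            continue
        candidates = [n for n in replicas.get(stage, []) if n in healthy]
        if not candidates:
            return None                      # no healthy replica left for this stage
        repaired[stage] = candidates[0]      # a real router would score the candidates
    return repaired


pipeline = ["a1", "b1", "c1"]
replicas = {0: ["a1", "a2"], 1: ["b1", "b2"], 2: ["c1", "c2"]}
healthy = {"a1", "a2", "b2", "c1", "c2"}     # b1 has dropped out
print(repin(pipeline, replicas, healthy, resume_from=1))   # ['a1', 'b2', 'c1']
```

Returning `None` mirrors the missing-coverage case in `build_pipeline`: if no healthy replica remains for a stage, the request cannot resume until placement restores coverage.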

### Shard placement

Placement decides *which* nodes hold *which* stages. The coordinator aims to:

* keep popular models broadly replicated,
* place consecutive stages on low-latency-adjacent nodes,
* match heavy stages to high-VRAM nodes,
* maintain enough replicas of every stage to tolerate churn.

Placement is recomputed continuously as nodes join and leave.
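As a minimal sketch of two of these goals (matching heavy stages to high-VRAM nodes and keeping a minimum replica count), the greedy pass below places the heaviest stage first on the nodes with the most free VRAM. The function name, its parameters, and the sample sizes are hypothetical; popularity and latency adjacency are deliberately left out, so this is not the coordinator's actual algorithm.

```python
def place(stage_mb: dict[str, int], node_free_mb: dict[str, int],
          min_replicas: int) -> dict[str, list[str]]:
    """Assign each stage to up to `min_replicas` nodes, heaviest stage first."""
    free = dict(node_free_mb)                 # work on a copy of the caller's view
    placement: dict[str, list[str]] = {}
    for stage, need in sorted(stage_mb.items(), key=lambda kv: kv[1], reverse=True):
        # Nodes that can still fit this stage, most free VRAM first.
        fits = sorted((n for n in free if free[n] >= need),
                      key=lambda n: free[n], reverse=True)
        chosen = fits[:min_replicas]
        for n in chosen:
            free[n] -= need                   # reserve the VRAM on each chosen node
        placement[stage] = chosen             # may be short if capacity is lacking
    return placement


stages = {"s0": 8000, "s1": 12000}
nodes = {"big1": 24000, "big2": 16000, "small1": 10000, "small2": 9000}
print(place(stages, nodes, min_replicas=2))
# {'s1': ['big1', 'big2'], 's0': ['big1', 'small1']}
```

A stage whose list comes back shorter than `min_replicas` is under-replicated. Since placement is recomputed as nodes join and leave, a pass like this would be rerun whenever the node set changes.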

***

# Scheduling & Load Balancing

Scheduling is how XYNQ stays fast under unpredictable, churning capacity.

### Goals

1. **Low latency** for interactive chat.
2. **High utilization** of donated GPUs (don't waste idle cycles).
3. **Fairness** across concurrent users.
4. **Stability** as nodes constantly join and leave.

### Signals the scheduler uses

* Per-node utilization, queue depth, and thermal headroom.
* Per-node throughput (tokens/sec) measured continuously.
* Network RTT between candidate nodes and between node and user.
* Which shard sets each node holds.
* Current demand per model.
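One way to hold the per-node signals is a small record that is updated as measurements arrive. The sketch below is an assumption for illustration: the field names, the units, and the use of an exponential moving average for throughput are not specified by this page. Demand per model is a network-wide signal, so it is not part of the per-node record.

```python
from dataclasses import dataclass, field


@dataclass
class NodeSignals:
    """Hypothetical per-node record of the signals listed above."""
    utilization: float = 0.0           # 0.0 (idle) to 1.0 (saturated)
    queue_depth: int = 0               # requests waiting in the local queue
    thermal_headroom_c: float = 0.0    # degrees below the throttle point
    tokens_per_sec: float = 0.0        # smoothed throughput
    rtt_ms: dict[str, float] = field(default_factory=dict)   # peer id -> RTT
    shards: set[str] = field(default_factory=set)            # stage ids held

    def observe_throughput(self, sample: float, alpha: float = 0.2) -> None:
        """Fold a new tokens/sec sample into an exponential moving average."""
        if self.tokens_per_sec == 0.0:
            self.tokens_per_sec = sample      # first sample seeds the average
        else:
            self.tokens_per_sec = alpha * sample + (1 - alpha) * self.tokens_per_sec


signals = NodeSignals()
signals.observe_throughput(100.0)
signals.observe_throughput(50.0)
print(signals.tokens_per_sec)
```

Smoothing keeps a single slow or fast request from swinging the routing decision; a larger `alpha` reacts faster to change at the cost of more noise.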

### Queueing model

Each node maintains a short local queue. The scheduler uses **least-loaded routing**: among valid pipelines, it prefers the one whose bottleneck node has the most headroom. Keeping queues very short keeps latency predictable. If the whole network is saturated, new requests wait briefly rather than degrading quality for everyone.

```python
def bottleneck_headroom(pipeline: list[NodeCapability]) -> float:
    # a pipeline is only as fast as its busiest node
    return min(1.0 - n.utilization for n in pipeline)

def choose_pipeline(candidates: list[list[NodeCapability]]) -> list[NodeCapability]:
    # maximize the headroom of the worst node across all candidate pipelines
    return max(candidates, key=bottleneck_headroom)

# Method of the scheduler class, shown unindented for brevity.
# MIN_HEADROOM is a tunable floor between 0.0 and 1.0, defined elsewhere.
async def admit(self, req: ChatRequest) -> Lease | Wait:
    candidates = self.enumerate_pipelines(req.model, k=8)
    if not candidates:
        return Wait(reason="no_capacity", retry_after_ms=750)
    pipe = choose_pipeline(candidates)
    if bottleneck_headroom(pipe) < MIN_HEADROOM:   # even the best pipeline is hot
        return Wait(reason="saturated", retry_after_ms=400)
    return self.lease(pipe, ttl_ms=req.budget_ms)
```

### Backpressure

When demand exceeds capacity, the network applies backpressure gracefully:

1. First, requests queue briefly.
2. Then, the free-tier rate limits throttle new submissions.
3. Quality of responses is **never** silently reduced.

This is the "latency rises before quality does" principle in action.
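The escalation order above can be expressed as a small decision function. This is a sketch under assumptions: the names, the `max_queue` threshold, and the handling of non-free tiers (they keep waiting) are hypothetical, since this page only states that free-tier rate limits throttle new submissions.

```python
from enum import Enum


class Action(Enum):
    ADMIT = "admit"        # run now
    QUEUE = "queue"        # wait briefly in the queue
    THROTTLE = "throttle"  # reject with a rate-limit response


def backpressure(has_capacity: bool, queue_depth: int, max_queue: int,
                 is_free_tier: bool) -> Action:
    """Escalate admit -> queue -> throttle; there is no 'reduce quality' branch."""
    if has_capacity:
        return Action.ADMIT
    if queue_depth < max_queue:
        return Action.QUEUE
    # Queues are full. Assumption: only free-tier traffic is throttled here;
    # other tiers keep waiting. The source does not say how they are handled.
    return Action.THROTTLE if is_free_tier else Action.QUEUE


print(backpressure(False, queue_depth=2, max_queue=4, is_free_tier=True).value)  # queue
print(backpressure(False, queue_depth=4, max_queue=4, is_free_tier=True).value)  # throttle
```

Because no branch switches to a smaller model or truncates output, overload shows up only as waiting or rate limiting.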

### Autoscaling with the community

There is no fixed cluster size. As contributors' machines become idle (e.g., overnight in a given region), they auto-enroll and capacity rises. When they're reclaimed, capacity falls. The scheduler is built to operate smoothly across this constant ebb and flow.

