# Request Lifecycle

This page traces a single prompt from keypress to response.

### 1. Submission

Your browser sends the conversation (the running list of messages) plus the selected model to the nearest coordination endpoint over an encrypted channel.
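
As a rough illustration of what such a submission might contain, the sketch below serializes a conversation and a model name to JSON. The `model` and `messages` field names mirror the attributes used in the pseudocode on this page; the `role`/`content` message shape and the `build_payload` helper are assumptions made for the example, not a documented wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Message:
    role: str      # assumed values, e.g. "user" or "assistant"
    content: str


def build_payload(model: str, history: list[Message]) -> str:
    """Serialize the running conversation plus the selected model."""
    return json.dumps({
        "model": model,
        "messages": [asdict(m) for m in history],
    })


payload = build_payload("jaguar", [Message("user", "hi")])
```

The whole message list is sent on every turn, which is consistent with the coordinator keeping no durable copy of the exchange.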

### 2. Admission & rate checks

The coordinator verifies the request is well-formed and applies fair-use limits (see **Limits & Fair Use**). No identity is required or recorded; limits are enforced on coarse signals only.
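
One common way to enforce a limit like this without identities is a token bucket per coarse key. The sketch below is a minimal version under that assumption; the page does not say which algorithm the coordinator uses, and the `CoarseRateLimiter` class, its default capacity, and its refill rate are all hypothetical.

```python
import time


class CoarseRateLimiter:
    """Token-bucket limiter keyed on a coarse signal rather than a user identity."""

    def __init__(self, capacity: float = 5.0, refill_per_sec: float = 1.0,
                 clock=time.monotonic):
        self.capacity = capacity              # maximum burst size
        self.refill_per_sec = refill_per_sec  # sustained request rate
        self.clock = clock                    # injectable for testing
        self.buckets = {}                     # coarse_key -> (tokens, last_seen)

    def allow(self, coarse_key: str) -> bool:
        now = self.clock()
        tokens, last_seen = self.buckets.get(coarse_key, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.capacity, tokens + (now - last_seen) * self.refill_per_sec)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self.buckets[coarse_key] = (tokens, now)
        return allowed
```

Only the bucket state per key lives in memory, so nothing in this sketch needs to know who is making the request.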

### 3. Plan construction

The scheduler selects a **pipeline** of nodes that collectively hold every layer of the chosen model, optimizing for:

* lowest end-to-end latency to you,
* highest available throughput,
* thermal and utilization headroom,
* shard locality (minimize hops between consecutive layers).
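
To make the trade-off between these criteria concrete, here is a small sketch that ranks candidate pipelines with a weighted score. The `Candidate` fields, the weights, and the linear scoring formula are illustrative assumptions; the actual scheduler may combine the signals quite differently.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    latency_ms: float      # estimated end-to-end latency to the client
    throughput_tps: float  # available tokens per second
    headroom: float        # thermal/utilization headroom in [0, 1]
    hops: int              # network hops between consecutive stages


def score(c: Candidate) -> float:
    """Lower is better. The weights are made up for illustration."""
    return c.latency_ms - c.throughput_tps - 50.0 * c.headroom + 10.0 * c.hops


def pick(candidates: Sequence[Candidate]) -> Optional[Candidate]:
    """Return the best-scoring pipeline, or None when there is no capacity."""
    return min(candidates, key=score, default=None)


best = pick([
    Candidate("a", latency_ms=80, throughput_tps=40, headroom=0.5, hops=3),
    Candidate("b", latency_ms=120, throughput_tps=60, headroom=0.8, hops=1),
])
```

Here candidate `b` wins despite its higher latency, because its throughput, headroom, and shorter hop count outweigh the difference under these particular weights.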

In pseudocode, the coordinator's request entrypoint ties the whole lifecycle together:

```python
from collections.abc import AsyncIterator


async def handle_chat(self, req: ChatRequest) -> AsyncIterator[Token]:
    # 2. Admission: anonymous fair-use check (no identity required)
    if not self.ratelimit.allow(req.coarse_key):
        raise RateLimited()

    model = self.catalog.resolve(req.model)        # e.g. "jaguar"

    # 3. Plan construction: one replica per pipeline stage
    plan = self.scheduler.plan(model, near=req.client_region)
    if plan is None:
        raise NoCapacity(model.name)

    # 4. Prefill: run the prompt through the pipeline (builds the KV cache)
    ctx = await self.pipeline.prefill(plan, req.messages)

    try:
        # 5. Decode: stream the response token by token
        async for tok in self.pipeline.decode(plan, ctx, req.params):
            yield tok
    finally:
        # 6. Teardown: release the KV cache even if the client disconnects
        # or decode raises; nothing is written to disk
        await self.pipeline.release(ctx)
```

### 4. Prefill

The conversation is tokenized and run through the pipeline once to build the attention key/value cache. This is the "reading" phase — the model ingests your context.
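
The toy sketch below shows what "building the key/value cache" means for a single attention head: prefill stores one key and one value per prompt token, and each later step appends one more pair and attends over everything cached. It uses identity projections and plain Python lists purely for illustration; real models use learned projection matrices, many heads, and many layers, and none of these names come from the actual system.

```python
import math


def attend(query, keys, values):
    """Softmax attention of one query vector over all cached keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def prefill(embeddings):
    """Read the prompt once, caching a key/value pair per token."""
    cache = KVCache()
    for vec in embeddings:
        cache.append(vec, vec)  # toy: key = value = embedding
    return cache


def decode_step(cache, new_vec):
    """Add one token to the cache and attend over the whole context."""
    cache.append(new_vec, new_vec)
    return attend(new_vec, cache.keys, cache.values)
```

Because the prompt's keys and values are already cached, each decode step only has to compute one new pair instead of reprocessing the whole conversation.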

### 5. Decode (token streaming)

The model then generates output one token at a time. Each token passes through the full pipeline; as soon as the final stage emits a token, it is streamed to your terminal. You see the answer appear progressively.
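
The streaming behaviour can be sketched with an async generator that yields each token as soon as it is produced, stopping at end-of-sequence or a length limit. The `step` callable stands in for one full pass through the pipeline; it, the `scripted` helper, and the `"<eos>"` marker are hypothetical names chosen for this example.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


async def decode(step: Callable[[], Awaitable[str]], max_tokens: int,
                 eos: str = "<eos>") -> AsyncIterator[str]:
    """Yield tokens one at a time until end-of-sequence or the length limit."""
    for _ in range(max_tokens):
        tok = await step()  # one full pass through the pipeline
        if tok == eos:
            return
        yield tok           # handed to the consumer immediately


def scripted(tokens):
    """Hypothetical stand-in for the pipeline: replays a fixed token list."""
    it = iter(tokens)

    async def step() -> str:
        await asyncio.sleep(0)  # simulate waiting on the network
        return next(it, "<eos>")

    return step


async def render(stream: AsyncIterator[str]) -> str:
    """Print tokens as they arrive and return the assembled text."""
    shown = []
    async for tok in stream:
        shown.append(tok)
        print(tok, end="", flush=True)
    print()
    return "".join(shown)


text = asyncio.run(render(decode(scripted(["Hel", "lo"]), max_tokens=8)))
```

The consumer never waits for the full response: each `yield` makes a token visible before the next pipeline pass begins.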

### 6. Completion & teardown

When generation stops (end-of-sequence or length limit), the pipeline's transient state — including the KV cache for your request — is released. Nothing is written to durable storage.

### 7. Delivery

Your terminal renders the final message. The only persistent copy of the exchange is the one in your browser tab, which you control.

#### Latency budget (typical)

| Phase        | Contribution                                     |
| ------------ | ------------------------------------------------ |
| Scheduling   | small, single-digit milliseconds in steady state |
| Network hops | depends on pipeline length and node proximity    |
| Prefill      | scales with prompt length                        |
| Decode       | scales with response length                      |
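
The table can be read as a simple additive model. The sketch below is a back-of-the-envelope estimate under stated assumptions: every default constant is a placeholder rather than a measured value, and it assumes the pipeline is traversed once for prefill and once per generated token.

```python
def estimate_latency_ms(prompt_tokens: int, response_tokens: int, hops: int, *,
                        scheduling_ms: float = 5.0,
                        hop_ms: float = 10.0,
                        prefill_ms_per_token: float = 0.5,
                        decode_ms_per_token: float = 20.0) -> dict:
    """Break total latency down by the phases in the table above."""
    breakdown = {
        "scheduling": scheduling_ms,
        # One traversal for prefill, plus one per decoded token (assumption).
        "network": hops * hop_ms * (1 + response_tokens),
        "prefill": prompt_tokens * prefill_ms_per_token,
        "decode": response_tokens * decode_ms_per_token,
    }
    breakdown["total"] = sum(breakdown.values())
    return breakdown


estimate = estimate_latency_ms(prompt_tokens=100, response_tokens=10, hops=3)
```

With these placeholder numbers the network term dominates, which illustrates why shard locality appears among the scheduler's optimization goals.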

