vllm.v1.kv_cache_interface ¶
ChunkedLocalAttentionSpec dataclass ¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
max_admission_blocks_per_request ¶
Per-request admission cap, in blocks.
Single source of truth for both startup pool sizing (max_memory_usage_bytes) and the runtime admission gate, so requests admitted by startup can also be admitted at runtime.
Source code in vllm/v1/kv_cache_interface.py
CrossAttentionSpec dataclass ¶
Bases: AttentionSpec
KV cache spec for cross-attention layers in encoder-decoder models.
Source code in vllm/v1/kv_cache_interface.py
FullAttentionSpec dataclass ¶
Bases: AttentionSpec
When the hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, the sliding window attention layers are treated as full attention in the KV cache manager (blocks are allocated for all tokens) but are still computed as sliding window attention in the model runner. In this case, FullAttentionSpec is used and the sliding window size is recorded on it.
Source code in vllm/v1/kv_cache_interface.py
sliding_window class-attribute instance-attribute ¶
sliding_window: int | None = None
Defaults to None, meaning sliding window attention is not used.
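The accounting described for FullAttentionSpec can be sketched as follows. This is a minimal illustration using a hypothetical stand-in dataclass (FullSpecSketch), not the vLLM class; only the sliding_window field and its None default come from the text above, and the block-count helper is an assumption.

```python
from dataclasses import dataclass
from math import ceil
from typing import Optional


@dataclass(frozen=True)
class FullSpecSketch:
    """Hypothetical stand-in for FullAttentionSpec (not the vLLM class)."""
    block_size: int
    sliding_window: Optional[int] = None  # None: no sliding window recorded


def blocks_allocated(spec: FullSpecSketch, num_tokens: int) -> int:
    # The KV cache manager treats the layer as full attention, so blocks
    # cover every token regardless of the recorded window size.
    return ceil(num_tokens / spec.block_size)


plain = FullSpecSketch(block_size=16)
windowed = FullSpecSketch(block_size=16, sliding_window=128)
full_blocks = blocks_allocated(plain, 1000)
windowed_blocks = blocks_allocated(windowed, 1000)
```

In this sketch both specs allocate the same number of blocks; the recorded window only matters to whatever computes attention.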
merge classmethod ¶
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
Source code in vllm/v1/kv_cache_interface.py
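One plausible shape for such a merge, shown as a sketch only. The stand-in class, its fields other than sliding_window, and the rule that at most one distinct window size may survive are assumptions for illustration, not the documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FullSpecSketch:
    """Hypothetical stand-in for FullAttentionSpec (not the vLLM class)."""
    block_size: int
    num_kv_heads: int
    head_size: int
    sliding_window: Optional[int] = None


def merge_specs(specs: list) -> FullSpecSketch:
    if not specs:
        raise ValueError("need at least one spec to merge")
    first = specs[0]
    shape = (first.block_size, first.num_kv_heads, first.head_size)
    for spec in specs[1:]:
        # Assumed rule: layers must agree on everything that affects page size.
        if (spec.block_size, spec.num_kv_heads, spec.head_size) != shape:
            raise ValueError("specs disagree on block or head shape")
    # Assumed rule: at most one distinct window size may be recorded.
    windows = {s.sliding_window for s in specs if s.sliding_window is not None}
    if len(windows) > 1:
        raise ValueError("specs record different sliding window sizes")
    return FullSpecSketch(*shape, sliding_window=next(iter(windows), None))


merged = merge_specs([
    FullSpecSketch(16, 8, 128),
    FullSpecSketch(16, 8, 128, sliding_window=4096),
])
```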
HiddenStateCacheSpec dataclass ¶
Bases: MLAAttentionSpec
Marker for hidden-state cache layers used by extract_hidden_states.
Source code in vllm/v1/kv_cache_interface.py
KVCacheConfig dataclass ¶
The KV cache configuration of a model.
Source code in vllm/v1/kv_cache_interface.py
kv_cache_groups instance-attribute ¶
kv_cache_groups: list[KVCacheGroupSpec]
The KV cache groups of the model. For models with only one type of attention, there is a single group that contains all layers. For models with multiple types of attention, there are multiple groups; see _get_kv_cache_config_uniform_page_size for more details.
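The grouping idea can be sketched as below. This is a simplified illustration that only buckets layers by an attention-type key; it does not reproduce _get_kv_cache_config_uniform_page_size, and the layer names and tuple keys are hypothetical.

```python
from collections import defaultdict


def group_layers_by_spec(layer_specs: dict) -> list:
    """Bucket layer names by a hashable spec key, keeping first-seen order."""
    groups = defaultdict(list)
    for layer_name, spec_key in layer_specs.items():
        groups[spec_key].append(layer_name)
    return list(groups.values())


# One attention type: a single group containing every layer.
single = group_layers_by_spec({
    "layers.0.attn": ("full", 16),
    "layers.1.attn": ("full", 16),
})

# Two attention types: one group per type.
mixed = group_layers_by_spec({
    "layers.0.attn": ("full", 16),
    "layers.1.attn": ("sliding_window", 16),
    "layers.2.attn": ("full", 16),
})
```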
kv_cache_tensors instance-attribute ¶
kv_cache_tensors: list[KVCacheTensor]
How the model runner should initialize the KV cache tensors for each layer.
KVCacheGroupSpec dataclass ¶
Represents a group of model layers that share the same KV cache block table. These layers are regarded as one layer in the KV cache manager.
Source code in vllm/v1/kv_cache_interface.py
KVCacheLayout ¶
Bases: Enum
Physical layout descriptor for a KV cache group.
The logical shape is always [L, B, H, N,
Source code in vllm/v1/kv_cache_interface.py
KVCacheSpec dataclass ¶
A base class for specifying the KV cache format of one layer.
RFC #42082 standard vocabulary (properties, overridden by subclasses):
- num_heads: int. Number of heads (1 if headless, e.g. MLA).
- tokens_per_state: int. -1 for infinite (recurrent), 1 for standard, N for compressed.
- state_content_size_bytes: int. Bytes per state per head.
Source code in vllm/v1/kv_cache_interface.py
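The vocabulary above might be expressed as properties that subclasses override, roughly as sketched here. The class names, constructor fields, and the size formula (key plus value per head) are assumptions for illustration; only the three property names and their documented meanings come from the text.

```python
class SpecSketch:
    """Hypothetical base class carrying the three vocabulary properties."""

    @property
    def num_heads(self) -> int:
        return 1  # headless by default

    @property
    def tokens_per_state(self) -> int:
        return 1  # one state per token (standard attention)

    @property
    def state_content_size_bytes(self) -> int:
        raise NotImplementedError


class AttentionSketch(SpecSketch):
    def __init__(self, num_kv_heads: int, head_size: int, dtype_size: int):
        self._num_kv_heads = num_kv_heads
        self._head_size = head_size
        self._dtype_size = dtype_size

    @property
    def num_heads(self) -> int:
        return self._num_kv_heads

    @property
    def state_content_size_bytes(self) -> int:
        # Assumed formula: one key and one value vector per head.
        return 2 * self._head_size * self._dtype_size


class RecurrentSketch(SpecSketch):
    def __init__(self, state_bytes: int):
        self._state_bytes = state_bytes

    @property
    def tokens_per_state(self) -> int:
        return -1  # a single state covers an unbounded number of tokens

    @property
    def state_content_size_bytes(self) -> int:
        return self._state_bytes


attn = AttentionSketch(num_kv_heads=8, head_size=128, dtype_size=2)
recurrent = RecurrentSketch(state_bytes=4096)
```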
copy_with_new_block_size ¶
Create a new KVCacheSpec from self but replacing the block size.
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
The maximum possible memory usage of this KV cache in bytes.
Returns:
| Type | Description |
|---|---|
| int | The KV cache size in bytes. |
merge classmethod ¶
Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
Source code in vllm/v1/kv_cache_interface.py
KVCacheTensor dataclass ¶
One contiguous GPU allocation backing one or more layer slots.
shared_by[slot_idx] lists the layer names aliasing slot slot_idx. Layers in the same inner list belong to different groups (independent block tables) so their block-id namespaces never collide.
Source code in vllm/v1/kv_cache_interface.py
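To make the shared_by indexing concrete, here is a small sketch. KVCacheTensorSketch, its size field, the layer names, and the layer-to-group mapping are all hypothetical; only the nested shared_by[slot_idx] structure and the one-layer-per-group rule within a slot follow the description above.

```python
from dataclasses import dataclass


@dataclass
class KVCacheTensorSketch:
    """Hypothetical stand-in: one allocation, several layer slots."""
    size: int        # total bytes of the allocation (assumed field)
    shared_by: list  # shared_by[slot_idx] -> layer names aliasing that slot


def slot_of(tensor: KVCacheTensorSketch, layer_name: str) -> int:
    for slot_idx, names in enumerate(tensor.shared_by):
        if layer_name in names:
            return slot_idx
    raise KeyError(layer_name)


def slots_are_collision_free(tensor: KVCacheTensorSketch, group_of: dict) -> bool:
    # Layers aliasing one slot must come from different groups, because each
    # group has its own block table and therefore its own block-id namespace.
    for names in tensor.shared_by:
        groups = [group_of[name] for name in names]
        if len(groups) != len(set(groups)):
            return False
    return True


tensor = KVCacheTensorSketch(
    size=1 << 20,
    shared_by=[
        ["layers.0.attn", "layers.1.attn"],  # slot 0
        ["layers.2.attn", "layers.3.attn"],  # slot 1
    ],
)
group_of = {
    "layers.0.attn": 0, "layers.1.attn": 1,
    "layers.2.attn": 0, "layers.3.attn": 1,
}
ok = slots_are_collision_free(tensor, group_of)
```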
KVQuantMode ¶
Bases: IntEnum
KV cache quantization mode.
Used by attention backends and kernels to dispatch quantization logic without string matching on kv_cache_dtype.
Source code in vllm/v1/kv_cache_interface.py
SinkFullAttentionSpec dataclass ¶
Bases: FullAttentionSpec
Source code in vllm/v1/kv_cache_interface.py
merge classmethod ¶
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
Source code in vllm/v1/kv_cache_interface.py
SlidingWindowMLASpec dataclass ¶
Bases: SlidingWindowSpec
Sliding window attention with MLA cache format.
Source code in vllm/v1/kv_cache_interface.py
SlidingWindowSpec dataclass ¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
max_admission_blocks_per_request ¶
Per-request admission cap, in blocks.
Single source of truth for both startup pool sizing (max_memory_usage_bytes) and the runtime admission gate. Per-request real-held blocks plateau at this bound because SlidingWindowManager.remove_skipped_blocks runs from allocate_slots before each chunk's get_num_blocks_to_allocate.
Source code in vllm/v1/kv_cache_interface.py
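The "single source of truth" idea can be sketched as one function that both the startup sizing and the runtime gate call. The formula, parameter names, and helper functions below are assumptions for illustration, not the documented implementation.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division for non-negative integers."""
    return -(-a // b)


def max_admission_blocks(sliding_window: int, block_size: int,
                         max_num_batched_tokens: int, max_model_len: int) -> int:
    # Assumed bound: the window plus one scheduled chunk, capped by the model
    # length, plus one extra block for a window that straddles a boundary.
    num_tokens = min(sliding_window - 1 + max_num_batched_tokens, max_model_len)
    return cdiv(num_tokens, block_size) + 1


def pool_bytes_per_request(cap_blocks: int, page_size_bytes: int) -> int:
    # Startup sizing derives from the same cap as the runtime gate.
    return cap_blocks * page_size_bytes


def can_admit(blocks_requested: int, free_blocks: int, cap_blocks: int) -> bool:
    # Runtime gate: a request never needs to hold more than the cap.
    return min(blocks_requested, cap_blocks) <= free_blocks


cap = max_admission_blocks(sliding_window=128, block_size=16,
                           max_num_batched_tokens=256, max_model_len=4096)
startup_bytes = pool_bytes_per_request(cap, page_size_bytes=65536)
admitted = can_admit(blocks_requested=300, free_blocks=cap, cap_blocks=cap)
```

Because both paths use the same cap, a pool sized for one request at startup is always large enough to admit that request at runtime in this sketch.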
TQFullAttentionSpec dataclass ¶
Bases: FullAttentionSpec
FullAttentionSpec with TQ-aware page size.
Python equivalent of the C++ TQ4FullAttentionSpec. Overrides real_page_size_bytes to use TQ slot bytes instead of the raw head_size * dtype formula.
Source code in vllm/v1/kv_cache_interface.py
UniformTypeKVCacheSpecs dataclass ¶
Bases: KVCacheSpec
A KV cache spec for multiple layers with the same type of attention. Here, "same type" means the layers always need the same number of token slots. For example, sliding window attention layers with different window sizes are not the same type and should not be merged into one UniformTypeKVCacheSpecs.
Source code in vllm/v1/kv_cache_interface.py
from_specs classmethod ¶
from_specs(
kv_cache_specs: dict[str, KVCacheSpec],
) -> Self | None
Return a UniformTypeKVCacheSpecs object if all layers have the same type of KV cache spec. Return None otherwise.
Source code in vllm/v1/kv_cache_interface.py
is_uniform_type classmethod ¶
is_uniform_type(
kv_cache_specs: dict[str, KVCacheSpec],
) -> bool
Whether all layers have the same type of KV cache spec.
Source code in vllm/v1/kv_cache_interface.py
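The relationship between is_uniform_type and from_specs can be sketched as follows. The stand-in spec classes and the signature used to compare them are assumptions; the sketch only mirrors the documented contract that from_specs yields None when the layers are not of one type, and that sliding windows of different sizes count as different types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FullSketch:
    block_size: int


@dataclass(frozen=True)
class SlidingSketch:
    block_size: int
    sliding_window: int


def slot_signature(spec) -> tuple:
    # Assumed notion of "same type": same class and, if present, same window.
    return (type(spec).__name__, getattr(spec, "sliding_window", None))


def is_uniform_type_sketch(kv_cache_specs: dict) -> bool:
    return len({slot_signature(s) for s in kv_cache_specs.values()}) == 1


def from_specs_sketch(kv_cache_specs: dict) -> Optional[dict]:
    # Returns the per-layer specs when uniform, otherwise None.
    return dict(kv_cache_specs) if is_uniform_type_sketch(kv_cache_specs) else None


uniform = {"layers.0": FullSketch(16), "layers.1": FullSketch(16)}
mixed_windows = {
    "layers.0": SlidingSketch(16, 1024),
    "layers.1": SlidingSketch(16, 4096),
}
```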
compute_layer_kv_cache_shape_bytes ¶
compute_layer_kv_cache_shape_bytes(
spec: KVCacheSpec,
num_blocks: int,
block_size: int | None = None,
) -> tuple[int, ...]
Return the 4D logical shape (B, H, N, C) where C is in bytes.
Source code in vllm/v1/kv_cache_interface.py
get_kv_quant_mode ¶
get_kv_quant_mode(kv_cache_dtype: str) -> KVQuantMode
Map a kv_cache_dtype string to a KVQuantMode.
Source code in vllm/v1/kv_cache_interface.py
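A sketch of why an enum lookup is preferable to string matching at each call site. The enum members, the accepted dtype strings, and the error on unknown input are hypothetical; the real KVQuantMode members are not listed on this page.

```python
from enum import IntEnum


class QuantModeSketch(IntEnum):
    """Hypothetical members; not the real KVQuantMode values."""
    NONE = 0
    FP8 = 1
    INT8 = 2


# Assumed mapping from dtype strings to modes, resolved in one place.
_DTYPE_TO_MODE = {
    "auto": QuantModeSketch.NONE,
    "fp8": QuantModeSketch.FP8,
    "fp8_e4m3": QuantModeSketch.FP8,
    "int8": QuantModeSketch.INT8,
}


def quant_mode_for(kv_cache_dtype: str) -> QuantModeSketch:
    try:
        return _DTYPE_TO_MODE[kv_cache_dtype]
    except KeyError:
        raise ValueError(f"unsupported kv_cache_dtype: {kv_cache_dtype!r}") from None


def needs_scales(mode: QuantModeSketch) -> bool:
    # Call sites branch on the enum instead of parsing the dtype string again.
    return mode is not QuantModeSketch.NONE


mode = quant_mode_for("fp8_e4m3")
```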
kv_cache_uses_per_token_head_scales ¶
Return True if kv_cache_dtype needs per-token-head scales.
num_states_for ¶
Derive num_states at allocation time (not part of the spec).
Source code in vllm/v1/kv_cache_interface.py
reshape_kv_cache ¶
reshape_kv_cache(
raw: Tensor,
spec: KVCacheSpec,
num_blocks: int,
num_layer_slots: int,
layout: KVCacheLayout,
block_size: int | None = None,
) -> list[Tensor]
View a flat int8 buffer as 4D [B, H, N, C] per-slot views.
Works for all KVCacheSpec subclasses. Shapes the buffer as int8 via compute_layer_kv_cache_shape_bytes, then reinterprets it as spec.dtype.
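The two-step view (shape in bytes first, then reinterpret the element type) can be illustrated with the standard library's memoryview in place of a tensor. This is a sketch under stated assumptions: a single slot, no layout handling, and a struct format code standing in for spec.dtype. The function name and parameters are hypothetical.

```python
import struct


def reshape_sketch(raw: bytearray, shape_bytes: tuple, fmt: str) -> memoryview:
    """View a flat byte buffer as [B, H, N, C] with C converted to elements."""
    b, h, n, c = shape_bytes          # C is given in bytes
    itemsize = struct.calcsize(fmt)   # bytes per element of the target type
    if len(raw) != b * h * n * c:
        raise ValueError("buffer size does not match the byte shape")
    if c % itemsize != 0:
        raise ValueError("C in bytes must be a multiple of the element size")
    as_int8 = memoryview(raw).cast("b")  # step 1: flat int8 view
    # Step 2: reinterpret as the element type; the last dim shrinks accordingly.
    return as_int8.cast(fmt, shape=[b, h, n, c // itemsize])


raw = bytearray(2 * 1 * 4 * 6)                     # B=2, H=1, N=4, C=6 bytes
view = reshape_sketch(raw, (2, 1, 4, 6), fmt="h")  # "h": 16-bit signed int
raw[0] = 1
raw[1] = 1  # both bytes set, so the value is 0x0101 in either byte order
first = view[0, 0, 0, 0]
```

The view shares memory with the buffer, so writes to raw are visible through view without copying.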