Description
Name and Version
llama-server --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 5080, compute capability 12.0, VMM: yes
version: 5572 (7675c55)
built with MSVC 19.44.35207.1 for x64
Operating systems
Windows
GGML backends
CUDA
Hardware
AMD Ryzen 9 9950X3D CPU and two GPUs: NVIDIA RTX 5090 and RTX 5080.
Models
Unsloth DeepSeek-V3-0324 GGUF quantizations (IQ2_XXS and Q2_K_XL) from Hugging Face: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD
Problem description & steps to reproduce
I am encountering the same issue reported in #12651. That issue was closed, but is there a fix for it? I hit the exact same failure running Unsloth's DeepSeek-V3-0324-UD-Q2_K_XL model with 2 GPUs (NVIDIA RTX 5090 and RTX 5080). I have tried setting LLAMA_CUDA_UNIFIED_MEMORY=1, but I still get an out-of-memory error (see the rough VRAM numbers after the command below). This is the command I ran:
.\llama.cpp\build\bin\Release\llama-server ^
--model F:/local_llm/models/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf ^
--port 10000 ^
--ctx-size 8192 ^
--n-gpu-layers 10 ^
--tensor-split 0.6667,0.3333 ^
--cache-type-k q8_0 ^
--temp 0.3 ^
--min-p 0.01
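From the log below, this looks like plain VRAM exhaustion on device 0 rather than anything exotic. A rough back-of-the-envelope, assuming the RTX 5090's 32 GB (about 32768 MiB), some of which is already taken by the driver, display, and CUDA context:

CUDA0 model buffer        28222.66 MiB  (from load_tensors)
CUDA0 KV cache request     4480.00 MiB  (the cudaMalloc that fails)
total                     32702.66 MiB  vs. ~32768 MiB physical on the 5090

So even before any compute buffers are allocated, the current --n-gpu-layers / --tensor-split combination leaves essentially no headroom on device 0. Lowering --ctx-size or shifting the tensor split toward CUDA1 would presumably shrink that 4480 MiB request, but I would still like to know whether the underlying issue from #12651 has a proper fix.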
First Bad Commit
No response
Relevant log output
print_info: n_head = 128
print_info: n_head_kv = 128
print_info: n_rot = 64
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 24576
print_info: n_embd_v_gqa = 16384
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18432
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = yarn
print_info: freq_base_train = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn = 4096
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 671B
print_info: model params = 671.03 B
print_info: general.name = DeepSeek V3 0324 BF16
print_info: n_layer_dense_lead = 3
print_info: n_lora_q = 1536
print_info: n_lora_kv = 512
print_info: n_embd_head_k_mla = 0
print_info: n_embd_head_v_mla = 0
print_info: n_ff_exp = 2048
print_info: n_expert_shared = 1
print_info: expert_weights_scale = 2.5
print_info: expert_weights_norm = 1
print_info: expert_gating_func = sigmoid
print_info: rope_yarn_log_mul = 0.1000
print_info: vocab type = BPE
print_info: n_vocab = 129280
print_info: n_merges = 127741
print_info: BOS token = 0 '<|begin▁of▁sentence|>'
print_info: EOS token = 1 '<|end▁of▁sentence|>'
print_info: EOT token = 1 '<|end▁of▁sentence|>'
print_info: PAD token = 1 '<|end▁of▁sentence|>'
print_info: LF token = 201 'Ċ'
print_info: FIM PRE token = 128801 '<|fim▁begin|>'
print_info: FIM SUF token = 128800 '<|fim▁hole|>'
print_info: FIM MID token = 128802 '<|fim▁end|>'
print_info: EOG token = 1 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 10 repeating layers to GPU
load_tensors: offloaded 10/62 layers to GPU
load_tensors: CUDA0 model buffer size = 28222.66 MiB
load_tensors: CUDA1 model buffer size = 12095.43 MiB
load_tensors: CPU_Mapped model buffer size = 46729.58 MiB
load_tensors: CPU_Mapped model buffer size = 47092.29 MiB
load_tensors: CPU_Mapped model buffer size = 47190.81 MiB
load_tensors: CPU_Mapped model buffer size = 46830.22 MiB
load_tensors: CPU_Mapped model buffer size = 7958.07 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 10000.0
llama_context: freq_scale = 0.025
llama_context: n_ctx_per_seq (8192) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.49 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4480.00 MiB on device 0: cudaMalloc failed: out of memory
alloc_tensor_range: failed to allocate CUDA0 buffer of size 4697620480
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
common_init_from_params: failed to create context with model 'F:/local_llm/models/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf'
srv load_model: failed to load model, 'F:/local_llm/models/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf'
srv operator (): operator (): cleaning up before exit...
main: exiting due to model loading error