[18:41:12] llama_context: flash_attn = auto
[18:41:12] llama_context: kv_unified = true
[18:41:12] llama_context: freq_base = 500000.0
[18:41:12] llama_context: freq_scale = 1
[18:41:12] llama_context: n_ctx_seq (50688) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
[18:41:12] llama_context: CUDA_Host output buffer size = 2.31 MiB
[18:41:12] llama_kv_cache: CUDA0 KV buffer size = 1980.00 MiB
[18:41:12] llama_kv_cache: size = 1980.00 MiB ( 50688 cells, 40 layers, 4/1 seqs), K (f16): 990.00 MiB, V (f16): 990.00 MiB
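The KV buffer size logged above can be sanity-checked. With 50688 cells, 40 layers, and f16 storage (2 bytes per element), a per-layer K/V width of 256 elements (an assumption — the log does not print this value; it is inferred so the numbers balance, e.g. from grouped-query attention) yields exactly 990 MiB for K and 990 MiB for V:

```python
# Sanity check of the logged KV cache size. The per-layer K/V width of 256
# elements is an inferred assumption chosen to match the logged 990 MiB,
# not a value printed by the server.
n_ctx = 50688      # KV cells (from the log line above)
n_layer = 40       # layers (from the log line above)
kv_width = 256     # assumed K elements per layer (same for V)
bytes_f16 = 2      # f16 element size in bytes

k_bytes = n_ctx * n_layer * kv_width * bytes_f16
k_mib = k_bytes / (1024 * 1024)
print(f"K (f16): {k_mib:.2f} MiB, K+V total: {2 * k_mib:.2f} MiB")
# prints: K (f16): 990.00 MiB, K+V total: 1980.00 MiB
```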
[18:41:12] sched_reserve: reserving ...
[18:41:12] sched_reserve: Flash Attention was auto, set to enabled
[18:41:12] sched_reserve: CUDA0 compute buffer size = 392.25 MiB
[18:41:12] sched_reserve: CUDA_Host compute buffer size = 107.02 MiB
[18:41:12] sched_reserve: graph nodes = 1487
[18:41:12] sched_reserve: graph splits = 2
[18:41:12] sched_reserve: reserve took 21.77 ms, sched copies = 1
[18:41:12] common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
[18:41:12] clip_model_loader: model name: Glm-4.6V
[18:41:12] clip_model_loader: description:
[18:41:12] clip_model_loader: GGUF version: 3
[18:41:12] clip_model_loader: alignment: 32
[18:41:12] clip_model_loader: n_tensors: 182
[18:41:12] clip_model_loader: n_kv: 33
[18:41:12] clip_model_loader: has vision encoder
[18:41:12] clip_ctx: CLIP using CUDA0 backend
[18:41:12] load_hparams: projector: glm4v
[18:41:12] load_hparams: n_embd: 1536
[18:41:12] load_hparams: n_head: 12
[18:41:12] load_hparams: n_ff: 10944
[18:41:12] load_hparams: n_layer: 24
[18:41:12] load_hparams: ffn_op: silu
[18:41:12] load_hparams: projection_dim: 4096
[18:41:12] --- vision hparams ---
[18:41:12] load_hparams: image_size: 336
[18:41:12] load_hparams: patch_size: 14
[18:41:12] load_hparams: has_llava_proj: 0
[18:41:12] load_hparams: minicpmv_version: 0
[18:41:12] load_hparams: n_merge: 2
[18:41:12] load_hparams: n_wa_pattern: 0
[18:41:12] load_hparams: image_min_pixels: 6272
[18:41:12] load_hparams: image_max_pixels: 3211264
[18:41:12] load_hparams: model size: 1639.67 MiB
[18:41:12] load_hparams: metadata size: 0.06 MiB
[18:41:17] warmup: warmup with image size = 1288 x 1288
[18:41:17] alloc_compute_meta: CUDA0 compute buffer size = 515.05 MiB
[18:41:17] alloc_compute_meta: CPU compute buffer size = 19.11 MiB
[18:41:17] alloc_compute_meta: graph splits = 1, nodes = 632
[18:41:17] warmup: flash attention is enabled
[18:41:17] srv load_model: loaded multimodal model, 'C:\Users\marcv\.lmstudio\models\unsloth\GLM-4.6V-Flash-GGUF\mmproj-F16.gguf'
[18:41:17] srv load_model: initializing slots, n_slots = 4
[18:41:17] slot load_model: id 0 | task -1 | new slot, n_ctx = 50688
[18:41:17] slot load_model: id 1 | task -1 | new slot, n_ctx = 50688
[18:41:17] slot load_model: id 2 | task -1 | new slot, n_ctx = 50688
[18:41:17] slot load_model: id 3 | task -1 | new slot, n_ctx = 50688
[18:41:17] srv load_model: prompt cache is enabled, size limit: 8192 MiB
[18:41:17] srv load_model: use `--cache-ram 0` to disable the prompt cache
[18:41:17] srv load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
[18:41:17] init: chat template, example_format: '[gMASK]<|system|>
[18:41:17] You are a helpful assistant<|user|>
[18:41:17] Hello<|assistant|>
[18:41:17]
[18:41:17] Hi there<|user|>
[18:41:17] How are you?<|assistant|>
[18:41:17] '
[18:41:17] srv init: init: chat template, thinking = 0
[18:41:17] main: model loaded
[18:41:17] main: server is listening on http://0.0.0.0:8080
[18:41:17] main: starting the main loop...
[18:41:17] srv update_slots: all slots are idle
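At this point the server accepts OpenAI-compatible requests on the address from the "server is listening" line; the later `POST /v1/chat/completions ... 200` entries confirm the endpoint. A minimal sketch of such a request body — the model name here is a placeholder (llama-server serves its single loaded model regardless), and only the host/port and route come from the log:

```python
import json

# Sketch of an OpenAI-compatible chat request for this server.
# "glm-4.6v-flash" is a placeholder model name, not taken from the log.
url = "http://127.0.0.1:8080/v1/chat/completions"
payload = {
    "model": "glm-4.6v-flash",  # placeholder; the server uses its loaded model
    "messages": [
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": "Hello"},
    ],
    "max_tokens": 256,
}
body = json.dumps(payload)
print(body)
# Send with e.g.:
#   curl -X POST http://127.0.0.1:8080/v1/chat/completions \
#        -H 'Content-Type: application/json' -d "$body"
```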
[18:41:28] srv params_from_: Chat format: GLM 4.5
[18:41:28] slot get_availabl: id 3 | task -1 | selected slot by LRU, t_last = -1
[18:41:28] slot launch_slot_: id 3 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
[18:41:28] slot launch_slot_: id 3 | task 0 | processing task, is_child = 0
[18:41:28] slot update_slots: id 3 | task 0 | new prompt, n_ctx_slot = 50688, n_keep = 0, task.n_tokens = 275
[18:41:28] slot update_slots: id 3 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
[18:41:28] slot update_slots: id 3 | task 0 | prompt processing progress, n_tokens = 275, batch.n_tokens = 275, progress = 1.000000
[18:41:28] slot update_slots: id 3 | task 0 | prompt done, n_tokens = 275, batch.n_tokens = 275
[18:41:28] slot init_sampler: id 3 | task 0 | init sampler, took 0.06 ms, tokens: text = 275, total = 275
[18:41:32] slot print_timing: id 3 | task 0 |
[18:41:32] prompt eval time = 189.57 ms / 275 tokens ( 0.69 ms per token, 1450.64 tokens per second)
[18:41:32] eval time = 3576.15 ms / 312 tokens ( 11.46 ms per token, 87.24 tokens per second)
[18:41:32] total time = 3765.72 ms / 587 tokens
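The throughput figures in the timing block above can be recomputed directly from the raw numbers (expect last-digit drift, since the printed millisecond values are themselves rounded — e.g. 1450.65 here versus the logged 1450.64):

```python
# Recompute tokens/second from the logged prompt-eval and eval timings.
prompt_ms, prompt_tokens = 189.57, 275
eval_ms, eval_tokens = 3576.15, 312

prompt_tps = prompt_tokens / (prompt_ms / 1000)  # ~1450.65 tok/s (log: 1450.64)
eval_tps = eval_tokens / (eval_ms / 1000)        # ~87.24 tok/s
print(f"prompt: {prompt_tps:.2f} tok/s, eval: {eval_tps:.2f} tok/s")
```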
[18:41:32] slot release: id 3 | task 0 | stop processing: n_tokens = 586, truncated = 0
[18:41:32] srv update_slots: all slots are idle
[18:41:32] srv log_server_r: request: POST /v1/chat/completions 100.x.x.x 200
[18:41:51] srv params_from_: Chat format: GLM 4.5
[18:41:51] slot get_availabl: id 2 | task -1 | selected slot by LRU, t_last = -1
[18:41:51] slot launch_slot_: id 2 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
[18:41:51] slot launch_slot_: id 2 | task 313 | processing task, is_child = 0
[18:41:51] slot update_slots: id 2 | task 313 | new prompt, n_ctx_slot = 50688, n_keep = 0, task.n_tokens = 53
[18:41:51] slot update_slots: id 2 | task 313 | n_tokens = 0, memory_seq_rm [0, end)
[18:41:51] slot update_slots: id 2 | task 313 | prompt processing progress, n_tokens = 53, batch.n_tokens = 53, progress = 1.000000
[18:41:51] slot update_slots: id 2 | task 313 | prompt done, n_tokens = 53, batch.n_tokens = 53
[18:41:51] slot init_sampler: id 2 | task 313 | init sampler, took 0.01 ms, tokens: text = 53, total = 53
[18:41:59] slot print_timing: id 2 | task 313 |
[18:41:59] prompt eval time = 160.95 ms / 53 tokens ( 3.04 ms per token, 329.29 tokens per second)
[18:41:59] eval time = 7080.81 ms / 612 tokens ( 11.57 ms per token, 86.43 tokens per second)
[18:41:59] total time = 7241.76 ms / 665 tokens
[18:41:59] slot release: id 2 | task 313 | stop processing: n_tokens = 664, truncated = 0
[18:41:59] srv update_slots: all slots are idle
[18:41:59] srv log_server_r: request: POST /v1/chat/completions 100.x.x.x 200