Releases · Hawthorne001/llama.cpp
b6408
CANN: Stream sync between devices for acl_graph (#15809)

* CANN: Switch to stream synchronization

  Switch to stream synchronization because events are not effective.

* CANN: add Comments

Co-authored-by: hipudding <huafengchun@gmail.com>
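For context, a minimal sketch of the two synchronization strategies, assuming the standard ACL runtime calls (`aclrtRecordEvent`, `aclrtStreamWaitEvent`, `aclrtSynchronizeStream`). This illustrates the ordering difference only; it is not the actual ggml-cann change, and error handling is omitted.

```cpp
// Illustrative sketch only -- not the actual ggml-cann code.
// Assumes the ACL runtime stream/event API; error codes are ignored for brevity.
#include <acl/acl.h>

// Event-based ordering: device B's stream waits on an event recorded on device A's
// stream. The commit message notes this was not effective across devices for acl_graph.
void sync_with_event(aclrtStream stream_a, aclrtStream stream_b, aclrtEvent event) {
    aclrtRecordEvent(event, stream_a);      // mark a completion point on device A's stream
    aclrtStreamWaitEvent(stream_b, event);  // device B's stream waits before continuing
}

// Stream synchronization: block the host until device A's stream has drained,
// then enqueue the dependent work. A simpler guarantee at the cost of a host sync.
void sync_with_stream(aclrtStream stream_a) {
    aclrtSynchronizeStream(stream_a);       // host blocks until all work on stream A finishes
    // ... enqueue dependent work on device B's stream here ...
}
```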
b6407
vulkan: support im2col_3d (#15795)
b6403
ggml WebGPU: remove userdata from request adapter callback (#15527)

* ggml WebGPU: remove userdata from request adapter callback

  This commit removes the `userdata` parameter from the WebGPU request adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function captures the `webgpu_context` directly. The motivation for this change is to simplify the code and improve readability.

* inline the callback lambda into the RequestAdapter call

  This commit removes the callback lambda variable and inlines it directly into the RequestAdapter call.
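A schematic sketch of the pattern described above; the types and the `request` parameter below are hypothetical stand-ins rather than the real WebGPU/Dawn API, since the point is only the switch from a `void * userdata` callback to a capturing lambda.

```cpp
// Schematic sketch only -- hypothetical stand-ins for the real WebGPU/Dawn types.
#include <memory>

struct webgpu_context_struct { int adapter = 0; /* adapter, device, ... */ };
using webgpu_context = std::shared_ptr<webgpu_context_struct>;

// Before: a C-style callback recovers its state from a void * userdata.
static void on_adapter(int adapter, void * userdata) {
    auto ctx = *static_cast<webgpu_context *>(userdata);
    ctx->adapter = adapter;
}

// After: a capturing lambda holds the context itself, so no userdata is needed.
template <typename RequestAdapterFn>
static void request_adapter(RequestAdapterFn && request, webgpu_context ctx) {
    request([ctx](int adapter) {   // lambda captures ctx by value (shared_ptr copy)
        ctx->adapter = adapter;
    });
}

int main() {
    webgpu_context ctx = std::make_shared<webgpu_context_struct>();

    // "before": the callback gets the context back through a void * userdata
    on_adapter(7, &ctx);

    // "after": a fake request function that immediately invokes the capturing lambda
    request_adapter([](auto && cb) { cb(42); }, ctx);

    return ctx->adapter == 42 ? 0 : 1;
}
```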
b6402
CUDA: faster tile FA (Pascal/AMD), headsize 256 (#15769)
b6401
kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16…
b6397
ggml-cpu: drop support for nnpa intrinsics (#15821)
b6396
aLoRA Support (#15327)

* feat: Add python-side constants and conversion for adapter.lora.invocation_string
* feat: Add c++ side constants for adapter.lora.invocation_string
* feat: Parse invocation string for adapters from GGUF
* fix(python): Update conversion to alora_invocation_tokens

  This is the preferred method in PEFT, which is the source of ground truth: https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e

* fix(cpp): Update to alora_invocation_tokens on c++ side
* feat: Add C APIs to get alora invocation token array from lora
* feat: Initial implementation of alora cache logic in server

  This does not yet identify the invocation tokens and apply the lora adapter only afterwards, but it does produce correct results if the invocation tokens are the beginning of the uncached input.

* feat: Identify alora invocation sequences

  This currently limits to a single enabled alora per slot. Multiple aloras with different invocation sequences would be possible, but it would require a more complex integration of the adapter toggling, and it is not a well studied case for alora since it is unclear whether one alora can reuse cache from a previous prefill computed with a different alora.

* feat: Only reuse cache for tokens before the alora invocation start

  This is a bit of an edge case, but a user could try the same query with the alora disabled (just using the base model), then retry with the alora. The cached tokens from the first pass should be invalid.

* feat: Handle un-cached tokens that come before the alora activation

  The solution is to fill the batch only up to the token before the invocation start if there are tokens to be prefilled between those pulled from cache and the invocation start. When this is detected, the alora is temporarily disabled with a scale of 0.0, then immediately re-enabled after it has been initialized for the internal graph. Since the batch does not complete the prompt tokens, the remaining prompt tokens are handled in the next task, pulling all of the non-alora tokens from cache and proceeding with prefill for the alora tokens.

* fix: Use || instead of 'or'

  Too much python :facepalm:

* fix: Fix off-by-one for limiting cached tokens to before alora start

  This was the cause of the inconsistent results from the dummy test script with and without the turn that runs the prompt without the adapter before running it with the adapter.

* fix: Support backwards compatibility for "invocation_string" in adapter_config.json

  While this has been replaced in the PEFT PR in favor of alora_invocation_tokens, the existing adapters in the ibm-granite org on HF use "invocation_string", so this enables backwards compatibility and testing now (before the PEFT PR changes have percolated everywhere).

* fix: Remove duplicate logging
* feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters

Branch: gabe-l-hart/alora-support
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
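To make the cache-reuse rule above concrete, here is a rough sketch with hypothetical helper names (not the server's actual code): cached prompt tokens are reused only up to the token before the aLoRA invocation sequence, so a prefix computed without the adapter is never carried into an adapter-activated run.

```cpp
// Rough sketch with hypothetical names -- not the actual server implementation.
#include <algorithm>
#include <cstdint>
#include <vector>

using llama_token = int32_t;

// Find where the alora invocation sequence begins inside the prompt, or -1 if absent.
static int find_invocation_start(const std::vector<llama_token> & prompt,
                                 const std::vector<llama_token> & invocation_tokens) {
    if (invocation_tokens.empty() || prompt.size() < invocation_tokens.size()) {
        return -1;
    }
    auto it = std::search(prompt.begin(), prompt.end(),
                          invocation_tokens.begin(), invocation_tokens.end());
    return it == prompt.end() ? -1 : int(it - prompt.begin());
}

// Limit how many cached tokens may be reused: everything from the invocation
// start onward must be recomputed with the adapter enabled.
static int limit_cache_reuse(int n_cached, int invocation_start) {
    if (invocation_start < 0) {
        return n_cached;                          // no alora in this prompt
    }
    return std::min(n_cached, invocation_start); // only tokens strictly before the start
}
```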
b6378
ggml: add ops for WAN video model (cuda && cpu) (#15669)

* add conv3d support
* add ggml_pad_ext for cpu & cuda backend
* cuda/cpu: add im2col_3d support
* cuda: make im2col a little faster
* fix cuda pad/scale/im2col3d
* make im2col_3d faster
* gguf: support loading tensors which n_dims > GGML_MAX_DIMS
* fix cuda get_rows
* avoid ggml_conv_3d conflict
* correct GGML_OP_COUNT assertion
* avoid build failure
* avoid build failure on MacOS
* cuda: remove unnecessary MIN define
* fix cpu im2col_3d
* adjust the code style
* cuda: use simpler loop in get_rows
* add test_im2col_3d to test-backend-ops
* test-backend-ops.cpp: remove trailing whitespace
* cpu: im2col_3d support non continuous src (Co-authored-by: Jeff Bolz <jbolz@nvidia.com>)
* fix test_im2col_3d
* remove unused variables
* cuda: get_rows: dfloat2 -> float2
* add test_pad_ext to test-backend-ops.cpp
* add gguf_init_from_file_ext impl
* Revert "gguf: support loading tensors which n_dims > GGML_MAX_DIMS" (reverts commit d8377a0a37f314bd3713fe043b4333ad661610c1)
* Revert "add gguf_init_from_file_ext impl" (reverts commit d9f1d13208c68ef83b3538201ac7f31614fb1994)
* update ggml_backend_vk_device_supports_op
* fix ggml_backend_vk_device_supports_op
* update other backend supports_op for ggml_pad_ext
* metal/opencl/sycl/vulkan: fix GGML_OP_PAD check in supports_op

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
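To illustrate the im2col_3d idea behind the conv3d support, a minimal, unoptimized CPU reference follows (single channel, unit stride, no padding or dilation, hypothetical signature). The ggml kernels additionally handle strides, padding, dilation, and non-contiguous sources, but the unrolling is the same: each output column holds one KD×KH×KW patch, so the 3D convolution becomes a matrix multiplication.

```cpp
// Minimal reference sketch of the im2col_3d unrolling -- not the ggml kernel itself.
#include <vector>

// src: D x H x W volume, row-major (w fastest)
// dst: (OD*OH*OW) columns, each of length KD*KH*KW
static std::vector<float> im2col_3d(const std::vector<float> & src,
                                    int D, int H, int W,
                                    int KD, int KH, int KW) {
    const int OD = D - KD + 1, OH = H - KH + 1, OW = W - KW + 1;
    std::vector<float> dst((size_t) OD * OH * OW * KD * KH * KW);

    size_t col = 0;
    for (int od = 0; od < OD; ++od)
    for (int oh = 0; oh < OH; ++oh)
    for (int ow = 0; ow < OW; ++ow) {
        // copy the KD x KH x KW patch that this output element sees
        for (int kd = 0; kd < KD; ++kd)
        for (int kh = 0; kh < KH; ++kh)
        for (int kw = 0; kw < KW; ++kw) {
            const int d = od + kd, h = oh + kh, w = ow + kw;
            dst[col++] = src[((size_t) d * H + h) * W + w];
        }
    }
    // conv3d(src, kernel) is now a matmul: [OD*OH*OW, KD*KH*KW] x [KD*KH*KW, 1]
    return dst;
}
```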
b6357
vulkan: fix shaders gen when no integer dot is available (#15740)
b6356
CANN: Resolve soft_max precision issue (#15730)

Previously, the slope tensor was set to fp16 to improve efficiency. While this worked correctly in FA, it caused precision issues in soft_max. This change applies different data types for different operators to balance both accuracy and performance.
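As a rough, generic illustration of why the slope tensor's data type matters (toy numbers, not the CANN kernel): the slope multiplies large position offsets before soft_max, so rounding it to fp16 shifts the logits, and exponentiation turns that shift into a relative error in the attention weights.

```cpp
// Toy illustration with generic numbers -- not the CANN kernel.
// Requires a compiler with _Float16 support (recent GCC/Clang).
#include <cmath>
#include <cstdio>

int main() {
    const float    slope_f32 = -0.0078125f * 0.937f;  // a slope not exactly representable in fp16
    const _Float16 slope_f16 = (_Float16) slope_f32;   // rounded to fp16

    const int positions[] = {64, 512, 2048};
    for (int pos : positions) {
        const float logit_f32 = slope_f32 * pos;           // bias contribution in fp32
        const float logit_f16 = (float) slope_f16 * pos;    // same contribution via fp16 slope
        const float rel_err   = std::fabs(std::exp(logit_f16) - std::exp(logit_f32))
                              / std::exp(logit_f32);
        std::printf("pos=%5d  f32=%.6f  f16=%.6f  rel err after exp: %.3e\n",
                    pos, logit_f32, logit_f16, rel_err);
    }
    return 0;
}
```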