Description
Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.
There appears to be a small memory leak in `ggml_metal_graph_compute`. After running continuous inference a few hundred times, I notice the amount of memory used on my M1 constantly growing.
I've been tracking this for a while, and it appears to come from the decode function, specifically from `ggml_metal_graph_compute`. I've removed the entire contents of the `dispatch_apply` block and the memory still seems to leak. There appear to be a few "known issues" around `MTLCommandBuffer` leaking memory [1, 2].

[1] https://developer.apple.com/forums/thread/662721
[2] https://forums.developer.apple.com/forums/thread/120931
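For context, a minimal sketch of the pattern those threads describe (illustrative only, not the actual llama.cpp code; the function and variable names are hypothetical): the command buffer returned by `-[MTLCommandQueue commandBuffer]` is autoreleased, so creating one on every call without a surrounding pool being drained lets those objects accumulate across calls.

```objc
#import <Metal/Metal.h>

// Illustrative only: repeatedly creating command buffers without draining an
// autorelease pool. The autoreleased MTLCommandBuffer objects (and whatever
// they retain) pile up until some outer pool is eventually drained.
static void run_many_leaky(id<MTLCommandQueue> queue, int n) {
    for (int i = 0; i < n; ++i) {
        id<MTLCommandBuffer> cb = [queue commandBuffer]; // autoreleased
        // ... encode compute work here ...
        [cb commit];
        [cb waitUntilCompleted];
    }
}
```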
There is a suggestion to use an `@autoreleasepool` when working with the `MTLCommandBuffer`. After adding this, I can confirm that the memory usage of llama.cpp stays stable even after 1,000 inference requests.
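A minimal sketch of the kind of change, assuming the per-call Metal work inside `ggml_metal_graph_compute` is wrapped in an `@autoreleasepool` (the function body below is illustrative, not the actual llama.cpp implementation):

```objc
#import <Metal/Metal.h>

// Sketch of the fix: drain an autorelease pool on every graph-compute call so
// the autoreleased Metal objects (command buffer, encoders) are released
// immediately instead of accumulating across hundreds of inference requests.
static void graph_compute_sketch(id<MTLCommandQueue> queue) {
    @autoreleasepool {
        id<MTLCommandBuffer> command_buffer = [queue commandBuffer];
        // ... dispatch_apply over the graph nodes and encode kernels here ...
        [command_buffer commit];
        [command_buffer waitUntilCompleted];
    } // pool drained here, on every call
}
```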