diff --git a/docs/code_documentation/documentation.adoc b/docs/code_documentation/documentation.adoc
new file mode 100644
index 0000000000000..b52d944edbac5
--- /dev/null
+++ b/docs/code_documentation/documentation.adoc
@@ -0,0 +1,93 @@
+= Code Documentation
+
+WARNING: This documentation is neither complete (i.e. it does not cover everything) nor exhaustive (i.e. it does not completely cover everything it touches on). The version reviewed is from February 18th, 2025, specifically commit 63ac128. Subsequent modifications have not yet been reviewed.
+
+[[docs:overview]]
+== Overview
+
+[[docs:overview:main.cpp]]
+=== main.cpp
+
+"`main.cpp`" is the primary source file from which the documentation process started. It compiles into the llama-cli executable, which provides chatbot functionality inside the terminal, and has the following high-level structure (note that this analysis is not exhaustive):
+
+* (lines) 1-86: include headers, global variables, helper functions
+* 88-133: parameter parsing (call to [.codebit]#`common_params_parse(...)`# on line 91, edge case handling afterwards), [.codebit]#`common_init()`#, console initialization
+* 135: [.codebit]#`llama_backend_init()`#
+* 136: [.codebit]#`llama_numa_init(...)`#
+* 150: call to [.codebit]#`common_init_from_params(...)`# generates [.codebit]#`struct llama_model`# and [.codebit]#`struct llama_context`#
+* 165-194: set up [.codebit]#`struct ggml_threadpool`#
+* 203-226: conversation mode setup
+* 235-432: session setup
+* 434: [.codebit]#`common_sampler_init(...)`#
+* 460-483: session setup
+* 485-532: inference preparation
+* 534-906: run loop
+ ** 535-630: input and context management
+ ** 632-652: token evaluation by [.codebit]#`llama_decode(...)`# call (line 640)
+ ** 704-728: display logic
+ ** 731-906: antiprompt/reverse prompt detection, console logic
+* 908-923: cleanup (print final logs, deallocate memory)
+
+
+[[docs:overview:call_paths]]
+=== Call Paths
+
+The following is a description of the call paths traced during the documentation process. These are centered on the inference process and the setup necessary for it, and will provide a good picture of the program's general control flow.
+
+==== Model and context init
+
+* [.codebit]#`common_init_from_params(...)`# -> [.codebit]#`llama_model_load_from_file(...)`#, [.codebit]#`llama_init_from_model(...)`#
+ ** [.codebit]#`llama_model_load_from_file(...)`# -> [.codebit]#`llama_model_load_from_file_impl(...)`# -> [.codebit]#`ggml_backend_dev_get(...)`#, [.codebit]#`llama_model_load(...)`#
+ *** [.codebit]#`ggml_backend_dev_get(...)`# -> [.codebit]#`get_reg()`# -> [.codebit]#`struct ggml_backend_registry()`# -> [.codebit]#`struct ggml_backend_registry.register_backend(...)`#, [.codebit]#`ggml_backend_cuda_reg()`# and [.codebit]#`ggml_backend_cpu_reg()`# (among others, depending on the build)
+ ** [.codebit]#`llama_init_from_model(...)`# -> [.codebit]#`struct llama_context(...)`#, [.codebit]#`ggml_backend_dev_init(...)`#, [.codebit]#`ggml_backend_sched_new(...)`#
+ *** [.codebit]#`ggml_backend_dev_init(...)`# -> [.codebit]#`struct ggml_backend_device.iface.init_backend(...)`#
+
+Note that the calls to [.codebit]#`ggml_backend_cuda_reg()`# and [.codebit]#`ggml_backend_cpu_reg()`# go much deeper and are responsible for the proper setup of usable devices (among other functions for different backends that are not documented here). They are overall very similar and will be detailed in their own sections.
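+
+As a purely illustrative aside (not part of main.cpp itself), the lifecycle that this call path implements can be condensed into a handful of public API calls. This is a hedged sketch: error handling, tokenization and sampling are omitted, and the model path is a placeholder.
+
+[source,C++]
+----
+#include "llama.h"
+
+int main() {
+    llama_backend_init();                            // cf. line 135 of main.cpp
+
+    // roughly what common_init_from_params(...) boils down to:
+    llama_model * model = llama_model_load_from_file(
+        "model.gguf", llama_model_default_params()); // placeholder path
+    llama_context * ctx = llama_init_from_model(
+        model, llama_context_default_params());
+
+    // ... tokenize input, build llama_batch objects, call llama_decode(...) ...
+
+    llama_free(ctx);
+    llama_model_free(model);
+    llama_backend_free();
+}
+----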
+
+==== Inference
+
+* [.codebit]#`llama_decode(...)`# -> [.codebit]#`llama_decode_impl(...)`# -> [.codebit]#`ggml_backend_sched_set_eval_callback(...)`#, [.codebit]#`llama_build_graph(...)`#, [.codebit]#`llama_set_inputs(...)`#, [.codebit]#`llama_graph_compute(...)`#
+ ** [.codebit]#`llama_build_graph(...)`# -> [.codebit]#`struct llm_build_context(...)`#, [.codebit]#`struct llm_build_context.init()`#, [.codebit]#`struct llm_build_context.build_llama()`# (one of many branches)
+ *** [.codebit]#`struct llm_build_context.init()`# -> [.codebit]#`ggml_init(...)`#
+ *** [.codebit]#`struct llm_build_context.build_llama()`# -> [.codebit]#`ggml_new_graph_custom(...)`#, [.codebit]#`llm_build_input_embd(...)`#
+ **** [.codebit]#`ggml_new_graph_custom(...)`# -> [.codebit]#`ggml_graph_nbytes(...)`#, [.codebit]#`ggml_new_object(...)`#, [.codebit]#`ggml_hash_set_reset(...)`#
+ *** [.codebit]#`llm_build_input_embd(...)`# -> [.codebit]#`ggml_new_tensor_1d(...)`#, [.codebit]#`ggml_new_tensor_2d(...)`# -> [.codebit]#`ggml_new_tensor_impl(...)`#
+ ** [.codebit]#`llama_graph_compute(...)`# -> [.codebit]#`ggml_backend_sched_graph_compute_async(...)`# -> [.codebit]#`ggml_backend_sched_alloc_graph(...)`#, [.codebit]#`ggml_backend_sched_compute_splits(...)`#
+ *** [.codebit]#`ggml_backend_sched_alloc_graph(...)`# -> [.codebit]#`ggml_backend_sched_split_graph(...)`#, [.codebit]#`ggml_backend_sched_alloc_splits(...)`#
+ *** [.codebit]#`ggml_backend_sched_compute_splits(...)`# -> [.codebit]#`struct ggml_backend_sched.callback_eval`#, [.codebit]#`ggml_backend_graph_compute_async(...)`#
+ **** [.codebit]#`ggml_backend_graph_compute_async(...)`# -> [.codebit]#`struct ggml_backend.iface.graph_compute`#
+
+Note that this call path ends in [.codebit]#`struct ggml_backend.iface.graph_compute`#, a pointer to a backend-specific function. It is set in the initialization phase by a call to [.codebit]#`struct ggml_backend_device.iface.init_backend(...)`#, itself a function pointer populated during backend registration, specifically in the calls to [.codebit]#`ggml_backend_cuda_reg()`# and [.codebit]#`ggml_backend_cpu_reg()`# (and their counterparts for the other supported backends). Again, these will be detailed in their own sections.
+
+[[docs:funcstructs]]
+== Functions and structures
+
+This section will elaborate on the functions and structures mentioned above, as well as other relevant ones, grouped by the files which contain them and ordered by their position in said files.
+
+NOTE: There are many types with the formats [.codebit]#`typename_t`# and [.codebit]#`typename_ptr`#. In most, if not all, cases, [.codebit]#`typename_t`# is a [.codebit]#`typedef`# that stands for [.codebit]#`typename*`#, while [.codebit]#`typename_ptr`# stands for [.codebit]#`std::unique_ptr`#.
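+
+For instance, a hedged sketch of the convention, modeled on declarations in ggml.h and llama-cpp.h (the exact deleter name may differ; requires <memory> and llama.h):
+
+[source,C++]
+----
+// typename_t: a plain pointer alias
+typedef struct ggml_backend * ggml_backend_t;
+
+// typename_ptr: a std::unique_ptr whose deleter calls the matching *_free function
+struct llama_model_deleter {
+    void operator()(llama_model * model) { llama_model_free(model); }
+};
+typedef std::unique_ptr<llama_model, llama_model_deleter> llama_model_ptr;
+----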
+
+include::documentation/common.h.adoc[]
+
+include::documentation/common.cpp.adoc[]
+
+include::documentation/llama-context.h.adoc[]
+
+include::documentation/llama.cpp.adoc[]
+
+include::documentation/ggml-impl.h.adoc[]
+
+include::documentation/ggml-backend-reg.cpp.adoc[]
+
+include::documentation/ggml-cuda.cu.adoc[]
+
+include::documentation/ggml-cpu.cpp.adoc[]
+
+include::documentation/ggml-cpu.c.adoc[]
+
+include::documentation/ggml-backend.cpp.adoc[]
+
+include::documentation/ggml-backend-impl.h.adoc[]
+
+include::documentation/ggml.h.adoc[]
+
+include::documentation/ggml.c.adoc[]
diff --git a/docs/code_documentation/documentation/common.cpp.adoc b/docs/code_documentation/documentation/common.cpp.adoc
new file mode 100644
index 0000000000000..55aec68f32fcf
--- /dev/null
+++ b/docs/code_documentation/documentation/common.cpp.adoc
@@ -0,0 +1,17 @@
+[[docs:funcstructs:common.cpp]]
+== common.cpp
+
+
+[[docs:funcstructs:common.cpp:common_init_from_params]]
+=== common_init_from_params
+
+Signature:
+[.codebit]#`struct common_init_result common_init_from_params(common_params & params)`#
+
+Firstly, the function loads the model ([.codebit]#`struct llama_model`#). Depending on the parameters and the build, this can go through one of three branches, calling [.codebit]#`common_load_model_from_hf(...)`# to load from a HuggingFace repository, [.codebit]#`common_load_model_from_url(...)`# to load from a URL or [.codebit]#`llama_model_load_from_file(...)`# to load from a local file. The first two branches also end up indirectly calling [.codebit]#`llama_model_load_from_file(...)`#.
+
+Secondly, it passes the loaded model to [.codebit]#`llama_init_from_model(...)`# to generate the corresponding [.codebit]#`llama_context`#.
+
+Thirdly, it loads the control vectors, then the lora adapters ([.codebit]#`struct llama_adapter_lora`#) indicated by the parameters through calls to [.codebit]#`llama_adapter_lora_init(...)`#. It also performs a warmup run of the model if so indicated by [.codebit]#`params.warmup`#.
+
+Lastly, it bundles and returns the [.codebit]#`llama_model`#, [.codebit]#`llama_context`# and lora adapters in a [.codebit]#`struct common_init_result`#.
diff --git a/docs/code_documentation/documentation/common.h.adoc b/docs/code_documentation/documentation/common.h.adoc
new file mode 100644
index 0000000000000..a77d61adbfecc
--- /dev/null
+++ b/docs/code_documentation/documentation/common.h.adoc
@@ -0,0 +1,19 @@
+[[docs:funcstructs:common.h]]
+== common.h
+
+
+[[docs:funcstructs:common.h:struct-common_init_result]]
+=== struct common_init_result
+
+This structure is just a wrapper containing [.codebit]##`std::unique_ptr`##s to a [.codebit]#`llama_model`#, a [.codebit]#`llama_context`# and lora adapters:
+
+[source,C++]
+----
+// note: defines object's lifetime
+struct common_init_result {
+    llama_model_ptr model;
+    llama_context_ptr context;
+
+    std::vector<llama_adapter_lora_ptr> lora;
+};
+----
diff --git a/docs/code_documentation/documentation/ggml-backend-impl.h.adoc b/docs/code_documentation/documentation/ggml-backend-impl.h.adoc
new file mode 100644
index 0000000000000..8a04e1e5b6248
--- /dev/null
+++ b/docs/code_documentation/documentation/ggml-backend-impl.h.adoc
@@ -0,0 +1,67 @@
+[[docs:funcstructs:ggml-backend-impl.h]]
+== ggml-backend-impl.h
+
+
+[[docs:funcstructs:ggml-backend-impl.h:struct-ggml_backend_i]]
+=== struct ggml_backend_i
+
+The interface for a [.codebit]#`ggml_backend`#. Has the following mandatory members:
+
+* [.codebit]#`const char * (*get_name)(ggml_backend_t backend)`#
+* [.codebit]#`void (*free)(ggml_backend_t backend)`#
+* [.codebit]#`enum ggml_status (*graph_compute) (ggml_backend_t backend, struct ggml_cgraph * cgraph)`#: from comments: "compute graph (always async if supported by the backend)"
+
+
+[[docs:funcstructs:ggml-backend-impl.h:struct-ggml_backend]]
+=== struct ggml_backend
+
+Describes a high-level backend that contains an interface for tensor operations (optional), graph computation and event synchronization (optional). Has the following members:
+
+* [.codebit]#`ggml_guid_t guid`#
+* [.codebit]#`struct ggml_backend_i iface`#
+* [.codebit]#`ggml_backend_dev_t device`#
+* [.codebit]#`void * context`#
+
+
+[[docs:funcstructs:ggml-backend-impl.h:struct-ggml_backend_device_i]]
+=== struct ggml_backend_device_i
+
+The interface of a [.codebit]#`ggml_backend_device`#. Here are some of its members:
+
+* [.codebit]#`const char * (*get_name)(ggml_backend_dev_t dev)`#
+* [.codebit]#`ggml_backend_t (*init_backend)(ggml_backend_dev_t dev, const char * params)`#: initializes the [.codebit]#`ggml_backend`# corresponding to this device
+* [.codebit]#`bool (*supports_op)(ggml_backend_dev_t dev, const struct ggml_tensor * op)`#
+
+
+[[docs:funcstructs:ggml-backend-impl.h:struct-ggml_backend_device]]
+=== struct ggml_backend_device
+
+Describes a usable device. Has the following members:
+
+* [.codebit]#`struct ggml_backend_device_i iface`#
+* [.codebit]#`ggml_backend_reg_t reg`#
+* [.codebit]#`void * context`#
+
+
+[[docs:funcstructs:ggml-backend-impl.h:struct-ggml_backend_reg_i]]
+=== struct ggml_backend_reg_i
+
+The interface for a [.codebit]#`ggml_backend_reg`#. Has the following members:
+
+* [.codebit]#`const char * (*get_name)(ggml_backend_reg_t reg)`#
+* [.codebit]#`size_t (*get_device_count)(ggml_backend_reg_t reg)`#
+* [.codebit]#`ggml_backend_dev_t (*get_device)(ggml_backend_reg_t reg, size_t index)`#
+* [.codebit]#`void * (*get_proc_address)(ggml_backend_reg_t reg, const char * name)`#: from comments: "(optional) get a pointer to a function in the backend; backends can add custom functions that are not part of the standard ggml-backend interface"
+
+
+[[docs:funcstructs:ggml-backend-impl.h:struct-ggml_backend_reg]]
+=== struct ggml_backend_reg
+
+A registry managing the devices for a specific backend. Has the following members:
+
+* [.codebit]#`int api_version`#: must be initialized to [.codebit]#`GGML_BACKEND_API_VERSION`#
+* [.codebit]#`struct ggml_backend_reg_i iface`#
+* [.codebit]#`void * context`#
diff --git a/docs/code_documentation/documentation/ggml-backend-reg.cpp.adoc b/docs/code_documentation/documentation/ggml-backend-reg.cpp.adoc
new file mode 100644
index 0000000000000..91dedb5ec430f
--- /dev/null
+++ b/docs/code_documentation/documentation/ggml-backend-reg.cpp.adoc
@@ -0,0 +1,70 @@
+[[docs:funcstructs:ggml-backend-reg.cpp]]
+== ggml-backend-reg.cpp
+
+
+[[docs:funcstructs:ggml-backend-reg.cpp:struct-ggml_backend_reg_entry]]
+=== struct ggml_backend_reg_entry
+
+[source,C++]
+----
+struct ggml_backend_reg_entry {
+    ggml_backend_reg_t reg;
+    dl_handle_ptr handle;
+};
+----
+
+Note that [.codebit]#`ggml_backend_reg_t`# is an alias for [.codebit]#`ggml_backend_reg*`#.
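+
+In line with the note on [.codebit]#`typename_ptr`# aliases above, [.codebit]#`dl_handle_ptr`# is a [.codebit]#`std::unique_ptr`# whose deleter unloads the dynamically loaded backend library. A hedged sketch of what it looks like (details may differ from the reviewed commit):
+
+[source,C++]
+----
+struct dl_handle_deleter {
+    void operator()(dl_handle * handle) {
+        dl_unload(handle); // closes the dynamically loaded library
+    }
+};
+using dl_handle_ptr = std::unique_ptr<dl_handle, dl_handle_deleter>;
+----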
+
+
+[[docs:funcstructs:ggml-backend-reg.cpp:struct-ggml_backend_registry]]
+=== struct ggml_backend_registry
+
+It has two members:
+
+* [.codebit]#`std::vector<ggml_backend_reg_entry> backends`#
+* [.codebit]#`std::vector<ggml_backend_dev_t> devices`#
+
+Its default constructor calls its [.codebit]#`register_backend(...)`# method with the [.codebit]##`ggml_backend_reg`##s specific to each backend with which llama.cpp is compiled (see [.codebit]#`ggml_backend_cuda_reg()`# and [.codebit]#`ggml_backend_cpu_reg()`#). This constructor *_should not_* be called manually, as this structure is meant to be a singleton. See [.codebit]#`get_reg()`#.
+
+
+[[docs:funcstructs:ggml-backend-reg.cpp:struct-ggml_backend_registry.register_backend]]
+=== struct ggml_backend_registry.register_backend
+
+Signature:
+[.codebit]#`void register_backend(ggml_backend_reg_t reg, dl_handle_ptr handle = nullptr)`#
+
+Pushes the given pair into the structure's [.codebit]#`backends`# member and calls its [.codebit]#`register_device(...)`# method for every device associated with the [.codebit]#`ggml_backend_reg`# (uses [.codebit]#`ggml_backend_reg_dev_count(...)`# and [.codebit]#`ggml_backend_reg_dev_get(...)`# to iterate through and retrieve them).
+
+
+[[docs:funcstructs:ggml-backend-reg.cpp:struct-ggml_backend_registry.register_device]]
+=== struct ggml_backend_registry.register_device
+
+Signature:
+[.codebit]#`void register_device(ggml_backend_dev_t device)`#
+
+Simply pushes to the structure's [.codebit]#`devices`# member.
+
+
+[[docs:funcstructs:ggml-backend-reg.cpp:get_reg]]
+=== get_reg
+
+Signature: [.codebit]#`static ggml_backend_registry & get_reg()`#
+
+Helps implement a singleton-like design pattern for [.codebit]#`struct ggml_backend_registry`#:
+
+[source,C++]
+----
+static ggml_backend_registry & get_reg() {
+    static ggml_backend_registry reg;
+    return reg;
+}
+----
+
+
+[[docs:funcstructs:ggml-backend-reg.cpp:ggml_backend_dev_get]]
+=== ggml_backend_dev_get
+
+Signature:
+[.codebit]#`ggml_backend_dev_t ggml_backend_dev_get(size_t index)`#
+
+Returns [.codebit]#`get_reg().devices[index]`#.
diff --git a/docs/code_documentation/documentation/ggml-backend.cpp.adoc b/docs/code_documentation/documentation/ggml-backend.cpp.adoc
new file mode 100644
index 0000000000000..68863ddbe11f5
--- /dev/null
+++ b/docs/code_documentation/documentation/ggml-backend.cpp.adoc
@@ -0,0 +1,205 @@
+[[docs:funcstructs:ggml-backend.cpp]]
+== ggml-backend.cpp
+
+
+[[docs:funcstructs:ggml-backend.cpp:ggml_backend_graph_compute_async]]
+=== ggml_backend_graph_compute_async
+
+Signature:
+[.codebit]#`enum ggml_status ggml_backend_graph_compute_async(ggml_backend_t backend, struct ggml_cgraph * cgraph)`#
+
+[source,C++]
+----
+return backend->iface.graph_compute(backend, cgraph);
+----
+
+
+[[docs:funcstructs:ggml-backend.cpp:ggml_backend_dev_init]]
+=== ggml_backend_dev_init
+
+Signature:
+[.codebit]#`ggml_backend_t ggml_backend_dev_init(ggml_backend_dev_t device, const char * params)`#
+
+[source,C++]
+----
+return device->iface.init_backend(device, params);
+----
+
+
+[[docs:funcstructs:ggml-backend.cpp:struct-ggml_backend_sched_split]]
+=== struct ggml_backend_sched_split
+
+Holds the information necessary to describe and use (compute) a split. A split is a sequence of tensors that are to be computed on the same backend. It is composed of the following members:
+
+* [.codebit]#`int backend_id`#
+* [.codebit]#`int i_start`#: index of the first tensor of the split in the full computation graph
+* [.codebit]#`int i_end`#: index of the last tensor of the split in the full computation graph
+* [.codebit]#`struct ggml_tensor * inputs[GGML_SCHED_MAX_SPLIT_INPUTS]`#
+* [.codebit]#`int n_inputs`#
+* [.codebit]#`struct ggml_cgraph graph`#: this split as a [.codebit]#`ggml_cgraph`# (for computation)
+
+
+[[docs:funcstructs:ggml-backend.cpp:struct-ggml_backend_sched]]
+=== struct ggml_backend_sched
+
+This structure is used to schedule the graph splits on backends. It has not been fully analyzed, but it holds the following members:
+
+* [.codebit]#`bool is_reset`#
+* [.codebit]#`bool is_alloc`#
+* [.codebit]#`int n_backends`#
+* [.codebit]#`ggml_backend_t backends[GGML_SCHED_MAX_BACKENDS]`#
+* [.codebit]#`ggml_backend_buffer_type_t bufts[GGML_SCHED_MAX_BACKENDS]`#
+* [.codebit]#`ggml_gallocr_t galloc`#
+* [.codebit]#`struct ggml_hash_set hash_set`#
+* [.codebit]#`int * hv_tensor_backend_ids`#: dimension [.codebit]#`[hash_set.size]`#
+* [.codebit]#`struct ggml_tensor ** hv_tensor_copies`#: dimension [.codebit]#`[hash_set.size][n_backends][n_copies]`#
+* [.codebit]#`int * node_backend_ids`#: dimension [.codebit]#`[graph.size]`#
+* [.codebit]#`int * leaf_backend_ids`#: dimension [.codebit]#`[graph.size]`#
+* [.codebit]#`int * prev_node_backend_ids`#: the id of the assigned backend of each node tensor in [.codebit]#`graph`# at the previous splitting, used to determine if reallocation is necessary (dimension [.codebit]#`[graph.size]`#)
+* [.codebit]#`int * prev_leaf_backend_ids`#: same as above, but for leaves (dimension [.codebit]#`[graph.size]`#)
+* [.codebit]#`struct ggml_cgraph graph`#: a local copy of the computation graph with additional tensors that are used to pass data between consecutive splits on different backends ("consecutive" as in one uses as input parts of the output of the other)
+* [.codebit]#`struct ggml_backend_sched_split * splits`#: splits array
+* [.codebit]#`int n_splits`#
+* [.codebit]#`int splits_capacity`#
+* [.codebit]#`int n_copies`#: for "pipeline parallelism support"
+* [.codebit]#`int cur_copy`#: for "pipeline parallelism support"
+* [.codebit]#`ggml_backend_event_t events[GGML_SCHED_MAX_BACKENDS][GGML_SCHED_MAX_COPIES]`#: for "pipeline parallelism support"
+* [.codebit]#`struct ggml_tensor * graph_inputs[GGML_SCHED_MAX_SPLIT_INPUTS]`#: for "pipeline parallelism support"
+* [.codebit]#`int n_graph_inputs`#: for "pipeline parallelism support"
+* [.codebit]#`struct ggml_context * ctx`#
+* [.codebit]#`ggml_backend_sched_eval_callback callback_eval`#
+* [.codebit]#`void * callback_eval_user_data`#
+* [.codebit]#`char * context_buffer`#: buffer used by [.codebit]#`ctx`#
+* [.codebit]#`size_t context_buffer_size`#
+* [.codebit]#`int debug`#
+
+
+[[docs:funcstructs:ggml-backend.cpp:ggml_backend_sched_split_graph]]
+=== ggml_backend_sched_split_graph
+
+Signature:
+[.codebit]#`static void ggml_backend_sched_split_graph(ggml_backend_sched_t sched, struct ggml_cgraph * graph)`#
+
+Firstly, the scheduler's [.codebit]#`ggml_context`# is regenerated so that it uses the scheduler's [.codebit]#`context_buffer`# member as a buffer, then each tensor in the computation graph is assigned a backend so that data transfers between backends are minimized and the higher priority backends (gpus) are preferentially used. This assignment is done in five passes over the tensors:
+
+The first pass assigns some tensors a backend based on the device on which their weights are currently stored. See [.codebit]#`ggml_backend_sched_backend_id_from_cur(...)`#.
+
+The second pass "`expands`" the initial assignments, i.e. it assigns each unassigned tensor to the backend of its closest assigned neighbour in either direction, giving priority to the gpu backends, provided that backend supports the tensor's operation. For example:
+
+[cols=15*]
+|===
+| After pass 1
+| cpu
+| unassigned
+| unassigned
+| gpu0
+| unassigned
+| cpu
+| gpu1
+| unassigned
+| unassigned
+| gpu0
+| unassigned
+| cpu
+| unassigned
+| cpu
+
+| After pass 2
+| cpu
+| gpu0
+| gpu0
+| gpu0
+| gpu0
+| cpu
+| gpu1
+| gpu1
+| gpu1
+| gpu0
+| gpu0
+| cpu
+| cpu
+| cpu
+|===
+
+The remaining passes were not analyzed in detail, but the source leaves helpful comments about them:
+
+[source,C++]
+----
+// pass 3: upgrade nodes to higher prio backends with compatible buffer types
+// if the tensor is already in the same buffer type (*) as another higher priority backend, we should move it there
+// however, we also need to verify that the sources are in compatible buffer types
+// (*) the actual requirement is more relaxed, the buffer type of the backend should be supported by all the users of this tensor further down the graph
+// however, this is slow to verify, so we have a more strict requirement that the buffer type is the same
+// this is not uncommon since multiple backends can use host memory, with the same buffer type (eg. BLAS and CPU)
+// additionally, set remaining unassigned nodes to the backend with the most supported inputs
+// only nodes that could not be assigned during expansion due to the backend not supporting the op should be unassigned at this point
+
+// pass 4: assign backends to remaining src from dst and view_src
+
+// pass 5: split graph, find tensors that need to be copied
+----
+
+After these passes, the final section sets up the scheduler's [.codebit]#`graph`# field.
+
+
+[[docs:funcstructs:ggml-backend.cpp:ggml_backend_sched_alloc_splits]]
+=== ggml_backend_sched_alloc_splits
+
+Signature:
+[.codebit]#`static bool ggml_backend_sched_alloc_splits(ggml_backend_sched_t sched)`#
+
+Not analyzed in detail here; it defers to [.codebit]#`ggml_gallocr_alloc_graph(...)`# for the actual allocation.
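+
+To tie the scheduler API together, here is a hedged usage sketch (the backend handles are assumed to exist already, e.g. created via [.codebit]#`ggml_backend_dev_init(...)`#; [.codebit]#`GGML_DEFAULT_GRAPH_SIZE`# comes from ggml.h):
+
+[source,C++]
+----
+// gpu_backend, cpu_backend (ggml_backend_t) and graph (struct ggml_cgraph *)
+// are assumed to have been created elsewhere.
+ggml_backend_t backends[2] = { gpu_backend, cpu_backend }; // priority order
+ggml_backend_sched_t sched = ggml_backend_sched_new(
+    backends, /*bufts=*/NULL, /*n_backends=*/2,
+    /*graph_size=*/GGML_DEFAULT_GRAPH_SIZE, /*parallel=*/false);
+
+// optional: observe or abort per-tensor evaluation (see the next section)
+// ggml_backend_sched_set_eval_callback(sched, my_eval_cb, my_user_data);
+
+ggml_backend_sched_graph_compute_async(sched, graph); // split -> alloc -> compute
+ggml_backend_sched_free(sched);
+----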
+
+[[docs:funcstructs:ggml-backend.cpp:ggml_backend_sched_compute_splits]]
+=== ggml_backend_sched_compute_splits
+
+Signature:
+[.codebit]#`static enum ggml_status ggml_backend_sched_compute_splits(ggml_backend_sched_t sched)`#
+
+For each split:
+
+* copies the split's input tensors to the split's backend, if there are any
+* if no [.codebit]#`callback_eval`# is set in the scheduler:
+ ** computes the split by calling [.codebit]#`ggml_backend_graph_compute_async`#
+* otherwise:
+ ** successively calls the scheduler's [.codebit]#`callback_eval`# for each tensor in the split with the [.codebit]#`ask`# argument [.codebit]#`true`# until a [.codebit]#`true`# is returned (this is the first tensor whose data is needed)
+ ** computes the subgraph composed of the unneeded tensors and the needed tensor
+ ** calls the [.codebit]#`callback_eval`# on the needed tensor with [.codebit]#`ask=false`#
+ ** repeats this process until the whole split has been computed or halts the computation entirely if the [.codebit]#`callback_eval`# signals so
+
+
+[[docs:funcstructs:ggml-backend.cpp:ggml_backend_sched_new]]
+=== ggml_backend_sched_new
+
+Signature:
+[.codebit]#`ggml_backend_sched_t ggml_backend_sched_new(ggml_backend_t * backends, ggml_backend_buffer_type_t * bufts, int n_backends, size_t graph_size, bool parallel)`#
+
+Creates a new [.codebit]#`ggml_backend_sched`#.
+
+
+[[docs:funcstructs:ggml-backend.cpp:ggml_backend_sched_alloc_graph]]
+=== ggml_backend_sched_alloc_graph
+
+Signature:
+[.codebit]#`bool ggml_backend_sched_alloc_graph(ggml_backend_sched_t sched, struct ggml_cgraph * graph)`#
+
+First splits the graph by calling [.codebit]#`ggml_backend_sched_split_graph(...)`#, then allocates the resulting splits with [.codebit]#`ggml_backend_sched_alloc_splits(...)`# and marks the scheduler as allocated (by setting its [.codebit]#`is_alloc`# member to [.codebit]#`true`#).
+
+
+[[docs:funcstructs:ggml-backend.cpp:ggml_backend_sched_graph_compute_async]]
+=== ggml_backend_sched_graph_compute_async
+
+Signature:
+[.codebit]#`enum ggml_status ggml_backend_sched_graph_compute_async(ggml_backend_sched_t sched, struct ggml_cgraph * graph)`#
+
+Resets and allocates the scheduler if needed by calls to [.codebit]#`ggml_backend_sched_reset(...)`# and [.codebit]#`ggml_backend_sched_alloc_graph(...)`#, and finally defers to [.codebit]#`ggml_backend_sched_compute_splits(...)`# for the computation.
+
+
+[[docs:funcstructs:ggml-backend.cpp:ggml_backend_sched_set_eval_callback]]
+=== ggml_backend_sched_set_eval_callback
+
+Signature:
+[.codebit]#`void ggml_backend_sched_set_eval_callback(ggml_backend_sched_t sched, ggml_backend_sched_eval_callback callback, void * user_data)`#
+
+Sets the scheduler's [.codebit]#`callback_eval`# and [.codebit]#`callback_eval_user_data`# members.
diff --git a/docs/code_documentation/documentation/ggml-cpu.c.adoc b/docs/code_documentation/documentation/ggml-cpu.c.adoc
new file mode 100644
index 0000000000000..5e712327976d1
--- /dev/null
+++ b/docs/code_documentation/documentation/ggml-cpu.c.adoc
@@ -0,0 +1,29 @@
+[[docs:funcstructs:ggml-cpu.c]]
+== ggml-cpu.c
+
+
+[[docs:funcstructs:ggml-cpu.c:ggml_compute_forward]]
+=== ggml_compute_forward
+
+Signature:
+[.codebit]#`static void ggml_compute_forward(struct ggml_compute_params * params, struct ggml_tensor * tensor)`#
+
+Calls a specific computation function depending on the tensor's operation (e.g. [.codebit]#`ggml_compute_forward_add(...)`#, [.codebit]#`ggml_compute_forward_mul(...)`#).
+
+
+[[docs:funcstructs:ggml-cpu.c:ggml_graph_compute_thread]]
+=== ggml_graph_compute_thread
+
+Signature:
+[.codebit]#`static thread_ret_t ggml_graph_compute_thread(void * data)`#
+
+Calls [.codebit]#`ggml_compute_forward(...)`# on each node in the [.codebit]#`+((struct ggml_compute_state *)data)->threadpool->cgraph+`#. It also handles aborting the computation when requested.
+
+
+[[docs:funcstructs:ggml-cpu.c:ggml_graph_compute]]
+=== ggml_graph_compute
+
+Signature:
+[.codebit]#`enum ggml_status ggml_graph_compute(struct ggml_cgraph * cgraph, struct ggml_cplan * cplan)`#
+
+Calls [.codebit]#`ggml_cpu_init()`#, then adjusts [.codebit]#`+cplan->threadpool+`# (of type [.codebit]#`struct ggml_threadpool`#) if needed and uses it to call [.codebit]#`ggml_graph_compute_thread(...)`# for each worker thread.
diff --git a/docs/code_documentation/documentation/ggml-cpu.cpp.adoc b/docs/code_documentation/documentation/ggml-cpu.cpp.adoc
new file mode 100644
index 0000000000000..dc27e3fabc96b
--- /dev/null
+++ b/docs/code_documentation/documentation/ggml-cpu.cpp.adoc
@@ -0,0 +1,74 @@
+[[docs:funcstructs:ggml-cpu.cpp]]
+== ggml-cpu.cpp
+
+
+[[docs:funcstructs:ggml-cpu.cpp:ggml_backend_cpu_graph_compute]]
+=== ggml_backend_cpu_graph_compute
+
+Signature:
+[.codebit]#`static enum ggml_status ggml_backend_cpu_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph)`#
+
+Creates a [.codebit]#`struct ggml_cplan`# through a call to [.codebit]#`ggml_graph_plan(...)`#, updates the [.codebit]#`ggml_backend`#'s context if needed based on the result, sets the [.codebit]#`ggml_cplan`#'s [.codebit]#`work_data`#, [.codebit]#`abort_callback`# and [.codebit]#`abort_callback_data`# members and calls [.codebit]#`ggml_graph_compute(...)`#.
+
+
+[[docs:funcstructs:ggml-cpu.cpp:variable-ggml_backend_cpu_i]]
+=== variable ggml_backend_cpu_i
+
+Full declaration:
+[.codebit]#`static const struct ggml_backend_i ggml_backend_cpu_i`#
+
+The interface of a cpu backend. Its [.codebit]#`graph_compute`# member is set to [.codebit]#`ggml_backend_cpu_graph_compute(...)`#.
+
+
+[[docs:funcstructs:ggml-cpu.cpp:ggml_backend_cpu_init]]
+=== ggml_backend_cpu_init
+
+Signature:
+[.codebit]#`ggml_backend_t ggml_backend_cpu_init(void)`#
+
+Calls [.codebit]#`ggml_cpu_init()`# and creates a [.codebit]#`ggml_backend`# object with its interface set to [.codebit]#`ggml_backend_cpu_i`#.
+
+
+[[docs:funcstructs:ggml-cpu.cpp:ggml_backend_cpu_device_init_backend]]
+=== ggml_backend_cpu_device_init_backend
+
+Signature:
+[.codebit]#`static ggml_backend_t ggml_backend_cpu_device_init_backend(ggml_backend_dev_t dev, const char * params)`#
+
+Simply calls and returns the output of [.codebit]#`ggml_backend_cpu_init()`#.
+
+
+[[docs:funcstructs:ggml-cpu.cpp:variable-ggml_backend_cpu_device_i]]
+=== variable ggml_backend_cpu_device_i
+
+Full declaration:
+[.codebit]#`static const struct ggml_backend_device_i ggml_backend_cpu_device_i`#
+
+The interface of a cpu device. The [.codebit]#`init_backend`# member points to [.codebit]#`ggml_backend_cpu_device_init_backend(...)`#.
+
+
+[[docs:funcstructs:ggml-cpu.cpp:ggml_backend_cpu_reg_get_device]]
+=== ggml_backend_cpu_reg_get_device
+
+Signature:
+[.codebit]#`static ggml_backend_dev_t ggml_backend_cpu_reg_get_device(ggml_backend_reg_t reg, size_t index)`#
+
+Instantiates and returns a pointer to a static [.codebit]#`ggml_backend_device`# for the cpu. Its [.codebit]#`iface`# member is set to [.codebit]#`ggml_backend_cpu_device_i`#.
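+
+A hedged sketch of the static-instance pattern this describes (the real function also wires up a cpu device context; simplified here):
+
+[source,C++]
+----
+static ggml_backend_dev_t ggml_backend_cpu_reg_get_device(ggml_backend_reg_t reg, size_t index) {
+    GGML_ASSERT(index == 0); // the cpu backend exposes a single device
+
+    static struct ggml_backend_device device = {
+        /* .iface   = */ ggml_backend_cpu_device_i,
+        /* .reg     = */ reg,
+        /* .context = */ NULL, // the real code points this at a device context
+    };
+    return &device;
+}
+----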
+
+
+[[docs:funcstructs:ggml-cpu.cpp:variable-ggml_backend_cpu_reg_i]]
+=== variable ggml_backend_cpu_reg_i
+
+Full declaration:
+[.codebit]#`static const struct ggml_backend_reg_i ggml_backend_cpu_reg_i`#
+
+The [.codebit]#`ggml_backend_reg_i`# for cpu. Its [.codebit]#`get_device`# member is set to [.codebit]#`ggml_backend_cpu_reg_get_device(...)`#.
+
+
+[[docs:funcstructs:ggml-cpu.cpp:ggml_backend_cpu_reg]]
+=== ggml_backend_cpu_reg
+
+Signature:
+[.codebit]#`ggml_backend_reg_t ggml_backend_cpu_reg(void)`#
+
+Calls [.codebit]#`ggml_cpu_init()`# and instantiates a static [.codebit]#`ggml_backend_reg`# for the cpu. Its [.codebit]#`context`# member is set to [.codebit]#`NULL`# and its [.codebit]#`iface`# member is set to [.codebit]#`ggml_backend_cpu_reg_i`#.
diff --git a/docs/code_documentation/documentation/ggml-cuda.cu.adoc b/docs/code_documentation/documentation/ggml-cuda.cu.adoc
new file mode 100644
index 0000000000000..ba6ea15dc645c
--- /dev/null
+++ b/docs/code_documentation/documentation/ggml-cuda.cu.adoc
@@ -0,0 +1,47 @@
+[[docs:funcstructs:ggml-cuda.cu]]
+== ggml-cuda.cu
+
+
+[[docs:funcstructs:ggml-cuda.cu:variable-ggml_backend_cuda_interface]]
+=== variable ggml_backend_cuda_interface
+
+Full declaration:
+[.codebit]#`static const ggml_backend_i ggml_backend_cuda_interface`#
+
+Sets the members of [.codebit]#`ggml_backend_i`# to cuda-specific functions, with the exception of [.codebit]#`graph_plan_create`#, [.codebit]#`graph_plan_free`#, [.codebit]#`graph_plan_update`# and [.codebit]#`graph_plan_compute`#. Notably, [.codebit]#`graph_compute`# is set to point to [.codebit]#`ggml_backend_cuda_graph_compute(...)`#.
+
+
+[[docs:funcstructs:ggml-cuda.cu:ggml_backend_cuda_device_init_backend]]
+=== ggml_backend_cuda_device_init_backend
+
+Signature:
+[.codebit]#`static ggml_backend_t ggml_backend_cuda_device_init_backend(ggml_backend_dev_t dev, const char * params)`#
+
+Defers to [.codebit]#`ggml_backend_cuda_init(...)`#, calling it with the device's index (taken from [.codebit]#`+dev->context+`#).
+
+
+[[docs:funcstructs:ggml-cuda.cu:variable-ggml_backend_cuda_device_interface]]
+=== variable ggml_backend_cuda_device_interface
+
+Full declaration:
+[.codebit]#`static const ggml_backend_device_i ggml_backend_cuda_device_interface`#
+
+This sets all the members of [.codebit]#`ggml_backend_device_i`#, with the exception of [.codebit]#`buffer_from_host_ptr`#, to cuda-specific functions. The [.codebit]#`init_backend`# member points to [.codebit]#`ggml_backend_cuda_device_init_backend(...)`#.
+
+
+[[docs:funcstructs:ggml-cuda.cu:ggml_backend_cuda_reg]]
+=== ggml_backend_cuda_reg
+
+Signature:
+[.codebit]#`ggml_backend_reg_t ggml_backend_cuda_reg()`#
+
+Thread-safely initializes and returns the cuda backend registry in a singleton-like manner. This is where the [.codebit]##`ggml_backend_device`##s for cuda are created. Their interface is set to [.codebit]#`ggml_backend_cuda_device_interface`#.
+
+
+[[docs:funcstructs:ggml-cuda.cu:ggml_backend_cuda_init]]
+=== ggml_backend_cuda_init
+
+Signature:
+[.codebit]#`ggml_backend_t ggml_backend_cuda_init(int device)`#
+
+Initializes a cuda-specific [.codebit]#`ggml_backend`#. Its [.codebit]#`interface`# member is set to [.codebit]#`ggml_backend_cuda_interface`#.
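+
+A hedged sketch of the thread-safe, singleton-like pattern described for [.codebit]#`ggml_backend_cuda_reg()`# (device enumeration elided; requires <mutex>):
+
+[source,C++]
+----
+ggml_backend_reg_t ggml_backend_cuda_reg() {
+    static ggml_backend_reg reg;
+    static bool initialized = false;
+
+    {
+        static std::mutex mutex;
+        std::lock_guard<std::mutex> lock(mutex);
+        if (!initialized) {
+            // enumerate the visible cuda devices here and create a
+            // ggml_backend_device for each one, with
+            // iface = ggml_backend_cuda_device_interface
+            initialized = true;
+        }
+    }
+
+    return &reg;
+}
+----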
diff --git a/docs/code_documentation/documentation/ggml-impl.h.adoc b/docs/code_documentation/documentation/ggml-impl.h.adoc
new file mode 100644
index 0000000000000..9643c5b8b61f1
--- /dev/null
+++ b/docs/code_documentation/documentation/ggml-impl.h.adoc
@@ -0,0 +1,75 @@
+[[docs:funcstructs:ggml-impl.h]]
+== ggml-impl.h
+
+
+[[docs:funcstructs:ggml-impl.h:struct-ggml_hash_set]]
+=== struct ggml_hash_set
+
+[source,C++]
+----
+struct ggml_hash_set {
+    size_t size;
+    ggml_bitset_t * used;       // whether or not the keys are in use i.e. set
+    struct ggml_tensor ** keys; // actual tensors in the set, keys[i] is only defined if ggml_bitset_get(used, i)
+};
+----
+
+Hash table with linear probing. Used with the following functions (note that there are no functions for deleting individual keys):
+
+* [.codebit]#`struct ggml_hash_set ggml_hash_set_new(size_t size)`#: (declared in ggml-impl.h, defined in ggml.c)
+* [.codebit]#`void ggml_hash_set_free(struct ggml_hash_set * hash_set)`#: frees allocated memory (declared in ggml-impl.h, defined in ggml.c)
+* [.codebit]#`size_t ggml_hash_size(size_t min_sz)`#: "returns the minimum size for a hash set that can hold min_sz elements", i.e. the smallest prime number greater than min_sz (declared in ggml-impl.h, defined in ggml.c)
+* [.codebit]#`void ggml_hash_set_reset(struct ggml_hash_set * hash_set)`#: marks all keys as unused (declared in ggml-impl.h, defined in ggml.c)
+* [.codebit]#`static bool ggml_hash_contains(const struct ggml_hash_set * hash_set, struct ggml_tensor * key)`#: (declared and defined in ggml-impl.h)
+* [.codebit]#`static size_t ggml_hash_find(const struct ggml_hash_set * hash_set, const struct ggml_tensor * key)`#: "returns GGML_HASHSET_FULL if table is full, otherwise the current index of the key or where it should be inserted" (declared and defined in ggml-impl.h)
+* [.codebit]#`static size_t ggml_hash_insert(struct ggml_hash_set * hash_set, struct ggml_tensor * key)`#: "returns GGML_HASHSET_ALREADY_EXISTS if key already exists, index otherwise, asserts if table is full" (declared and defined in ggml-impl.h)
+* [.codebit]#`static size_t ggml_hash_find_or_insert(struct ggml_hash_set * hash_set, struct ggml_tensor * key)`#: (declared and defined in ggml-impl.h)
+
+[[docs:funcstructs:ggml-impl.h:ggml_hash]]
+=== ggml_hash
+
+Signature:
+[.codebit]#`static inline size_t ggml_hash(const struct ggml_tensor * p)`#
+
+[source,C++]
+----
+// the last 4 bits are always zero due to alignment
+return (size_t)(uintptr_t)p >> 4;
+----
+
+
+[[docs:funcstructs:ggml-impl.h:enum-ggml_cgraph_eval_order]]
+=== enum ggml_cgraph_eval_order
+
+[source,C++]
+----
+enum ggml_cgraph_eval_order {
+    GGML_CGRAPH_EVAL_ORDER_LEFT_TO_RIGHT = 0,
+    GGML_CGRAPH_EVAL_ORDER_RIGHT_TO_LEFT,
+    GGML_CGRAPH_EVAL_ORDER_COUNT
+};
+----
+
+Computation graph evaluation order. Default is [.codebit]#`GGML_CGRAPH_EVAL_ORDER_LEFT_TO_RIGHT`# (see [.codebit]#`ggml_new_graph_custom(...)`#).
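+
+Before moving on to [.codebit]#`struct ggml_cgraph`#, a hedged usage sketch of the [.codebit]#`ggml_hash_set`# API listed above (all signatures as declared in ggml-impl.h; [.codebit]#`t`# is an assumed [.codebit]#`struct ggml_tensor *`#):
+
+[source,C++]
+----
+struct ggml_hash_set set = ggml_hash_set_new(64); // size is rounded up to a prime
+
+if (ggml_hash_insert(&set, t) != GGML_HASHSET_ALREADY_EXISTS) {
+    // first time this tensor is seen
+}
+bool seen = ggml_hash_contains(&set, t); // true now
+
+ggml_hash_set_reset(&set); // mark every slot unused, keep the storage
+ggml_hash_set_free(&set);  // release the used/keys arrays
+----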
+
+[[docs:funcstructs:ggml-impl.h:struct-ggml_cgraph]]
+=== struct ggml_cgraph
+
+[source,C++]
+----
+struct ggml_cgraph {
+    int size;    // maximum number of nodes/leafs/grads/grad_accs
+    int n_nodes; // number of nodes currently in use
+    int n_leafs; // number of leafs currently in use
+
+    struct ggml_tensor ** nodes;     // tensors with data that can change if the graph is evaluated
+    struct ggml_tensor ** grads;     // the outputs of these tensors are the gradients of the nodes
+    struct ggml_tensor ** grad_accs; // accumulators for node gradients
+    struct ggml_tensor ** leafs;     // tensors with constant data
+
+    struct ggml_hash_set visited_hash_set;
+
+    enum ggml_cgraph_eval_order order;
+};
+----
diff --git a/docs/code_documentation/documentation/ggml.c.adoc b/docs/code_documentation/documentation/ggml.c.adoc
new file mode 100644
index 0000000000000..26c0dbcdd89a9
--- /dev/null
+++ b/docs/code_documentation/documentation/ggml.c.adoc
@@ -0,0 +1,147 @@
+[[docs:funcstructs:ggml.c]]
+== ggml.c
+
+
+[[docs:funcstructs:ggml.c:variable-type_traits]]
+=== variable type_traits
+
+Full declaration: [.codebit]#`static const struct ggml_type_traits type_traits[GGML_TYPE_COUNT]`#
+
+Holds the [.codebit]#`ggml_type_traits`# for every supported tensor data type.
+
+
+[[docs:funcstructs:ggml.c:struct-ggml_object]]
+=== struct ggml_object
+
+Acts as a handler for objects rather than "`being`" one. Holds the following information about them:
+
+* [.codebit]#`size_t offs`#: offset to handled object/data (relative to [.codebit]#`ggml_context`#'s memory buffer)
+* [.codebit]#`size_t size`#: size of memory chunk handled (not necessarily object size, see [.codebit]#`ggml_context`#)
+* [.codebit]#`struct ggml_object * next`#: pointer to the next [.codebit]#`ggml_object`# in the buffer
+* [.codebit]#`enum ggml_object_type type`#
+* [.codebit]#`char padding[4]`#: padding to 32 bytes (the size must be a multiple of [.codebit]#`GGML_MEM_ALIGN`#, which is 4 or 16, see [.codebit]#`ggml_context`# for details)
+
+
+[[docs:funcstructs:ggml.c:struct-ggml_context]]
+=== struct ggml_context
+
+Contains the following members:
+
+* [.codebit]#`size_t mem_size`#
+* [.codebit]#`void * mem_buffer`#
+* [.codebit]#`bool mem_buffer_owned`#
+* [.codebit]#`bool no_alloc`#
+* [.codebit]#`int n_objects`#
+* [.codebit]#`struct ggml_object * objects_begin`#
+* [.codebit]#`struct ggml_object * objects_end`#
+
+The memory buffer is structured like this:
+
+[.codebit]#`GGML_OBJECT_1`#, [.codebit]#`DATA_1`#, (optional empty space for alignment), [.codebit]#`GGML_OBJECT_2`#, [.codebit]#`DATA_2`#, (optional empty space for alignment),...
+
+[.codebit]##`GGML_OBJECT`##s and [.codebit]##`DATA`##s are always [.codebit]#`GGML_MEM_ALIGN`# aligned ([.codebit]#`GGML_MEM_ALIGN`# is either 16 or 4). Note that currently the alignment of [.codebit]#`GGML_OBJECT`# is based solely on its size being a multiple of [.codebit]#`GGML_MEM_ALIGN`# and the correct alignment of the preceding [.codebit]#`GGML_OBJECT`# and [.codebit]#`DATA`#. Moreover, when space is allocated, the optional space for alignment is calculated based solely on the size of [.codebit]#`DATA`# (this is done by [.codebit]#`ggml_new_object(...)`#).
+
+[.codebit]#`GGML_OBJECT.size`# = [.codebit]#`DATA_size`# + optional_padding_size
+
+[.codebit]#`GGML_OBJECT.offs`# = [.codebit]#`&DATA`# - [.codebit]#`ggml_context.mem_buffer`# (i.e. the offset of [.codebit]#`DATA`# in the buffer)
+
+
+[[docs:funcstructs:ggml.c:ggml_type_size]]
+=== ggml_type_size
+
+Signature: [.codebit]#`size_t ggml_type_size(enum ggml_type type)`#
+
+Looks up the [.codebit]#`type_size`# in the [.codebit]#`type_traits`# array.
+
+
+[[docs:funcstructs:ggml.c:ggml_init]]
+=== ggml_init
+
+Signature:
+[.codebit]#`struct ggml_context * ggml_init(struct ggml_init_params params)`#
+
+Generates a [.codebit]#`ggml_context`# based on the [.codebit]#`params`# argument. On the first call it also thread-safely initializes the time system through [.codebit]#`ggml_time_init()`# (required only for Windows, this function is empty when compiled for Linux or macOS) and the [.codebit]#`ggml_table_f32_f16`# array.
+
+
+[[docs:funcstructs:ggml.c:ggml_free]]
+=== ggml_free
+
+Signature: [.codebit]#`void ggml_free(struct ggml_context * ctx)`#
+
+Frees the [.codebit]#`ggml_context`#'s memory buffer.
+
+
+[[docs:funcstructs:ggml.c:ggml_new_object]]
+=== ggml_new_object
+
+Signature:
+[.codebit]#`static struct ggml_object * ggml_new_object(struct ggml_context * ctx, enum ggml_object_type type, size_t size)`#
+
+Generates a [.codebit]#`ggml_object`# in [.codebit]#`ctx`#'s memory buffer while reserving memory for its associated data and taking care of alignment. See [.codebit]#`ggml_context`# for how that works.
+
+
+[[docs:funcstructs:ggml.c:ggml_new_tensor_impl]]
+=== ggml_new_tensor_impl
+
+Signature:
+[.codebit]#`static struct ggml_tensor * ggml_new_tensor_impl(struct ggml_context * ctx, enum ggml_type type, int n_dims, const int64_t * ne, struct ggml_tensor * view_src, size_t view_offs)`#
+
+Generates a [.codebit]#`ggml_tensor`#, along with its handler [.codebit]#`ggml_object`#, inside [.codebit]#`ctx`#'s buffer. The [.codebit]#`ggml_object`#'s [.codebit]#`DATA`# (see [.codebit]#`ggml_context`#) is composed of the [.codebit]#`ggml_tensor`# object and [.codebit]#`+ggml_tensor->data+`#.
+
+
+[[docs:funcstructs:ggml.c:ggml_new_tensor]]
+=== ggml_new_tensor
+
+Signature:
+[.codebit]#`struct ggml_tensor * ggml_new_tensor(struct ggml_context * ctx, enum ggml_type type, int n_dims, const int64_t * ne)`#
+
+Wrapper for [.codebit]#`ggml_new_tensor_impl(...)`#.
+
+
+[[docs:funcstructs:ggml.c:ggml_set_name]]
+=== ggml_set_name
+
+Signature:
+[.codebit]#`struct ggml_tensor * ggml_set_name(struct ggml_tensor * tensor, const char * name)`#
+
+Sets a tensor's name to [.codebit]#`name`#. If [.codebit]#`name`# is longer than the [.codebit]#`ggml_tensor.name`# array, it is truncated; the result is always null-terminated.
+
+
+[[docs:funcstructs:ggml.c:ggml_format_name]]
+=== ggml_format_name
+
+Signature:
+[.codebit]#`struct ggml_tensor * ggml_format_name(struct ggml_tensor * tensor, const char * fmt, ...)`#
+
+Uses [.codebit]#`vsnprintf(...)`# to set a tensor's name according to the given format and arguments.
+
+
+[[docs:funcstructs:ggml.c:incr_ptr_aligned]]
+=== incr_ptr_aligned
+
+Signature:
+[.codebit]#`static void * incr_ptr_aligned(void ** p, size_t size, size_t align)`#
+
+Returns the next [.codebit]#`align`#-aligned value of [.codebit]#`*p`# and sets [.codebit]#`*p`# to [.codebit]#`return_value + size`#, i.e. just past the memory region in which an [.codebit]#`align`#-aligned object of size [.codebit]#`size`# would be allocated. Note that for it to work [.codebit]#`align`# *_must_* be a power of 2.
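+
+A hedged reconstruction of the behaviour just described (the real body may differ slightly; [.codebit]#`GGML_PAD`# from ggml.h rounds an integer up to a multiple of a power of two):
+
+[source,C++]
+----
+static void * incr_ptr_aligned(void ** p, size_t size, size_t align) {
+    void * ptr = *p;
+    ptr = (void *) GGML_PAD((uintptr_t) ptr, align); // round up; align must be a power of 2
+    *p  = (void *) ((char *) ptr + size);            // advance past the reserved region
+    return ptr;                                      // start of the aligned region
+}
+----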
+
+[[docs:funcstructs:ggml.c:ggml_graph_nbytes]]
+=== ggml_graph_nbytes
+
+Signature:
+[.codebit]#`static size_t ggml_graph_nbytes(size_t size, bool grads)`#
+
+Returns the number of bytes needed to store a [.codebit]#`ggml_cgraph`# with [.codebit]#`size`# nodes and [.codebit]#`size`# leaves, followed by its corresponding pointer-size-aligned arrays ([.codebit]#`nodes`#, [.codebit]#`leaves`#, [.codebit]#`visited_hash_set.keys`#, [.codebit]#`grads`# (optional), [.codebit]#`grad_accs`# (optional) and [.codebit]#`visited_hash_set.used`#, in that order).
+
+
+[[docs:funcstructs:ggml.c:ggml_new_graph_custom]]
+=== ggml_new_graph_custom
+
+Signature:
+[.codebit]#`struct ggml_cgraph * ggml_new_graph_custom(struct ggml_context * ctx, size_t size, bool grads)`#
+
+Generates a [.codebit]#`ggml_cgraph`#, along with its handler [.codebit]#`ggml_object`#, inside [.codebit]#`ctx`#'s buffer. The [.codebit]#`ggml_object`#'s [.codebit]#`DATA`# (see [.codebit]#`ggml_context`#) is arranged as such:
+
+[.codebit]#`nodes`# array, [.codebit]#`leaves`# array, [.codebit]#`visited_hash_set.keys`# array, [.codebit]#`grads`# array (optional), [.codebit]#`grad_accs`# array (optional) and [.codebit]#`visited_hash_set.used`# array
+
+Everything is pointer-size-aligned, as described in the section on [.codebit]#`ggml_graph_nbytes(...)`#.
diff --git a/docs/code_documentation/documentation/ggml.h.adoc b/docs/code_documentation/documentation/ggml.h.adoc
new file mode 100644
index 0000000000000..3989281089415
--- /dev/null
+++ b/docs/code_documentation/documentation/ggml.h.adoc
@@ -0,0 +1,73 @@
+[[docs:funcstructs:ggml.h]]
+== ggml.h
+
+[[docs:funcstructs:ggml.h:enum-ggml_object_type]]
+=== enum ggml_object_type
+
+Enumerates all possible types of [.codebit]#`struct ggml_object`#. These are [.codebit]#`GGML_OBJECT_TYPE_TENSOR`#, [.codebit]#`GGML_OBJECT_TYPE_GRAPH`# and [.codebit]#`GGML_OBJECT_TYPE_WORK_BUFFER`#.
+
+
+[[docs:funcstructs:ggml.h:struct-ggml_init_params]]
+=== struct ggml_init_params
+
+Ties together the parameters needed by [.codebit]#`ggml_init(...)`#. These are:
+
+[source,C++]
+----
+// memory pool
+size_t mem_size;   // bytes
+void * mem_buffer; // if NULL, memory will be allocated internally
+bool   no_alloc;   // don't allocate memory for the tensor data
+----
+
+
+[[docs:funcstructs:ggml.h:struct-ggml_tensor]]
+=== struct ggml_tensor
+
+Represents a unit of computation. Has the following members:
+
+* [.codebit]#`ggml_type type`#: enum that indicates the data type the tensor works with
+* [.codebit]#`ggml_backend_buffer * buffer`#
+* [.codebit]#`int64_t ne[GGML_MAX_DIMS]`#: dimensions in terms of logical elements (i.e. for quantized data types that store values in blocks, ne[0] holds the total number of values the blocks on a single "row" hold)
+* [.codebit]#`size_t nb[GGML_MAX_DIMS]`#: from comments:
+
+[source,C++]
+----
+// stride in bytes:
+// nb[0] = ggml_type_size(type)
+// nb[1] = nb[0]   * (ne[0] / ggml_blck_size(type)) + padding
+// nb[i] = nb[i-1] * ne[i-1]
+----
+
+* [.codebit]#`ggml_op op`#: enum indicating the operation the tensor does
+* [.codebit]#`int32_t op_params[GGML_MAX_OP_PARAMS / sizeof(int32_t)]`#
+* [.codebit]#`int32_t flags`#
+* [.codebit]#`ggml_tensor* src[GGML_MAX_SRC]`#: source tensors, a.k.a. tensors used as inputs for the operation
+* [.codebit]#`ggml_tensor * view_src`#
+* [.codebit]#`size_t view_offs`#
+* [.codebit]#`void* data`#: the result of the operation, with structure defined by ne and nb
+* [.codebit]#`char name[GGML_MAX_NAME]`#: name
+* [.codebit]#`void* extra`#: "extra things e.g. for ggml-cuda.cu"
for ggml-cuda.cu" +* [.codebit]#`char padding[8]`#: padding to 336 bytes, which is divisible by 16 ([.codebit]#`GGML_MEM_ALIGN`# is either 16 or 4), see struct ggml_context for more details + + +[[docs:funcstructs:ggml.h:struct-ggml_type_traits]] +=== struct ggml_type_traits + +This structure describes a data type and thus dictates how [.codebit]#`ggml_tensor.data`# is interpreted. It has the following members: + +* [.codebit]#`const char * type_name`# +* [.codebit]#`int64_t blck_size`#: the number of logical values held in a single block (this is 0 for deprecated and removed types, 1 for non-quantized types, and something else for quantized ones) +* [.codebit]#`int64_t blck_size_interleave`#: currently not set for any type +* [.codebit]#`size_t type_size`#: the size of the data type for unquantized types, and the size of the structure representing a block of said type for quantized ones +* [.codebit]#`bool is_quantized`# +* [.codebit]#`ggml_to_float_t to_float`#: pointer to a function for conversion to float (see exact definition below) +* [.codebit]#`ggml_from_float_t from_float_ref`#: pointer to a function for conversion from float (see exact definition below) + +The conversion function pointer types mentioned are defined as such: + +[source,C++] +---- +typedef void (*ggml_to_float_t) (const void * GGML_RESTRICT x, float * GGML_RESTRICT y, int64_t k); +typedef void (*ggml_from_float_t)(const float * GGML_RESTRICT x, void * GGML_RESTRICT y, int64_t k); +---- diff --git a/docs/code_documentation/documentation/llama-context.h.adoc b/docs/code_documentation/documentation/llama-context.h.adoc new file mode 100644 index 0000000000000..be792d12dee6c --- /dev/null +++ b/docs/code_documentation/documentation/llama-context.h.adoc @@ -0,0 +1,53 @@ +[[docs:funcstructs:llama-context.h]] +== llama-context.h + + +[[docs:funcstructs:llama-context.h:struct-llama_context]] +=== struct llama_context + +This structure contains most, if not all of the information crucial for a run. 
+
+* [.codebit]#`const struct llama_model & model`#: a reference to the model to be used
+* [.codebit]#`struct llama_cparams cparams`#: this contains the eval callback and its user data (see the [.codebit]#`ggml_backend_sched_compute_splits(...)`# section for more details)
+* [.codebit]#`std::vector<ggml_backend_ptr> backends`#: these contain interfaces with functions specialized for each available backend, see [.codebit]#`struct ggml_backend`# for more details
+* [.codebit]#`ggml_backend_t backend_cpu`#: same as above, but for the cpu backend
+* [.codebit]#`std::vector<uint8_t> buf_compute_meta`#: serves as the buffer for the [.codebit]#`ggml_context`# used to build the [.codebit]#`ggml_cgraph`# in [.codebit]#`struct llm_build_context`#
+* [.codebit]#`ggml_backend_sched_ptr sched`#: helps with splitting the computation graph between multiple backends when needed, see [.codebit]#`struct ggml_backend_sched`#
+* input tensors of type [.codebit]#`struct ggml_tensor*`#, see below
+* [.codebit]#`struct llama_sbatch sbatch`#: helps with input handling
+* [.codebit]#`size_t logits_size`#: size of [.codebit]#`logits`# buffer
+* [.codebit]#`float * logits`#: 2-dimensional array of size [.codebit]#`[n_outputs][n_vocab]`# holding decode output
+* [.codebit]#`size_t embd_size`#: size of [.codebit]#`embd`# buffer
+* [.codebit]#`float * embd`#: 2-dimensional array of size [.codebit]#`[n_outputs][n_embd]`# holding embeddings output
+* [.codebit]#`int32_t n_outputs`#: from comments, "number of actually-used outputs in the current ubatch or last logical batch"
+
+Input tensors:
+
+[source,C++]
+----
+struct ggml_tensor * inp_tokens;        // I32 [n_batch]
+struct ggml_tensor * inp_embd;          // F32 [n_embd, n_batch]
+struct ggml_tensor * inp_pos;           // I32 [n_batch]
+struct ggml_tensor * inp_out_ids;       // I32 [n_outputs]
+struct ggml_tensor * inp_KQ_mask;       // F32 [kv_size, n_batch]
+struct ggml_tensor * inp_KQ_mask_swa;   // F32 [kv_size, n_batch]
+struct ggml_tensor * inp_K_shift;       // I32 [kv_size]
+struct ggml_tensor * inp_mean;          // F32 [n_batch, n_batch]
+struct ggml_tensor * inp_cls;           // I32 [n_batch]
+struct ggml_tensor * inp_s_copy;        // I32 [kv_size]
+struct ggml_tensor * inp_s_mask;        // F32 [1, n_kv]
+struct ggml_tensor * inp_s_seq;         // I32 [n_kv, n_batch]
+struct ggml_tensor * inp_pos_bucket;    // I32 [n_batch|n_kv, n_batch]
+struct ggml_tensor * inp_embd_enc;      // F32 [n_embd, n_outputs_enc]
+struct ggml_tensor * inp_KQ_mask_cross; // F32 [n_outputs_enc, n_batch]
+----
+
+It has a single constructor that does minimal setup:
+
+[source,C++]
+----
+llama_context(const llama_model & model)
+    : model(model)
+    , t_start_us(model.t_start_us)
+    , t_load_us(model.t_load_us) {}
+----
diff --git a/docs/code_documentation/documentation/llama.cpp.adoc b/docs/code_documentation/documentation/llama.cpp.adoc
new file mode 100644
index 0000000000000..8c820b29935d3
--- /dev/null
+++ b/docs/code_documentation/documentation/llama.cpp.adoc
@@ -0,0 +1,160 @@
+[[docs:funcstructs:llama.cpp]]
+== llama.cpp
+
+
+[[docs:funcstructs:llama.cpp:llama_model_load]]
+=== llama_model_load
+
+Signature:
+[.codebit]#`static int llama_model_load(const std::string & fname, std::vector<std::string> & splits, llama_model & model, llama_model_params & params)`#
+
+Loads the model data from the given file using a [.codebit]#`llama_model_loader`#. Called by [.codebit]#`llama_model_load_from_file_impl(...)`#.
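+
+As an aside, the [.codebit]#`logits`# buffer described in the llama-context.h section is what the public API exposes after decoding. A hedged sketch of reading it (assumes the vocab accessors present at the reviewed commit):
+
+[source,C++]
+----
+// after a successful llama_decode(...); row -1 refers to the last output token
+const float * row = llama_get_logits_ith(ctx, -1);
+const llama_vocab * vocab = llama_model_get_vocab(model);
+const int n_vocab = llama_vocab_n_tokens(vocab);
+
+for (llama_token tok = 0; tok < n_vocab; tok++) {
+    // row[tok] is the unnormalized score of token tok
+}
+----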
+
+
+[[docs:funcstructs:llama.cpp:struct-llm_build_context]]
+=== struct llm_build_context
+
+This structure's purpose is to help build the computation graphs ([.codebit]#`struct ggml_cgraph`#) for various model architectures through its special builder methods: [.codebit]#`build_llama()`#, [.codebit]#`build_deci()`#, [.codebit]#`build_baichuan()`#, [.codebit]#`build_bert()`#, etc. Its constructor has the following signature:
+
+[.codebit]#`llm_build_context(llama_context & lctx, const llama_ubatch & ubatch, const llm_build_cb & cb, bool worst_case)`#.
+
+Note that its [.codebit]#`init()`# method must be called before using any of the builder methods.
+
+
+[[docs:funcstructs:llama.cpp:struct-llm_build_context.init]]
+=== struct llm_build_context.init
+
+Signature:
+[.codebit]#`void init()`#
+
+Through a call to [.codebit]#`ggml_init(...)`#, it generates a [.codebit]#`ggml_context`# that uses the [.codebit]#`buf_compute_meta`# member of the [.codebit]#`llama_context`# the object was constructed with as a buffer.
+
+
+[[docs:funcstructs:llama.cpp:struct-llm_build_context.build_llama]]
+=== struct llm_build_context.build_llama
+
+Signature:
+[.codebit]#`struct ggml_cgraph * build_llama()`#
+
+One of [.codebit]#`llm_build_context`#'s graph builder methods. Like all the others, it begins with a call to [.codebit]#`ggml_new_graph_custom(...)`#, continues with a section that creates and ties the tensor operations and finishes with a call to [.codebit]#`ggml_build_forward_expand(...)`#, which links the tensors to the graph.
+
+NOTE: Builder methods [.codebit]#`build_bert(...)`#, [.codebit]#`build_t5_dec(...)`#, [.codebit]#`build_rwkv6(...)`# and [.codebit]#`build_rwkv6qwen2(...)`# have additional calls to [.codebit]#`ggml_build_forward_expand(...)`# in the tensor building section.
+
+
+[[docs:funcstructs:llama.cpp:llama_build_graph]]
+=== llama_build_graph
+
+Signature:
+[.codebit]#`static struct ggml_cgraph * llama_build_graph(llama_context & lctx, const llama_ubatch & ubatch, bool worst_case)`#
+
+Builds the computation graph ([.codebit]#`struct ggml_cgraph`#).
+
+First, it creates a lambda function with the following signature: [.codebit]#`(struct ggml_tensor * cur, const char * name, int il)->void`#, where [.codebit]#`il`# is the index of the tensor's layer. This function will be passed to [.codebit]#`llm_build_context`#'s constructor and used as a callback in the builder functions. It first sets the tensor's name to \{name}-\{il} if [.codebit]#`il>=0`# and to \{name} otherwise (names longer than [.codebit]#`GGML_MAX_NAME`#, currently 64, are truncated; see [.codebit]#`ggml_format_name(...)`#), then it attempts to offload as many normalization tensors as possible from the cpu backend to the backends of the devices indicated by [.codebit]#`struct llama_model.dev_layer(il)`#, if certain parameters require this:
+
+[source,C++]
+----
+const bool full_offload = lctx.model.params.n_gpu_layers > (int) lctx.model.hparams.n_layer;
+if (ubatch.n_tokens < 32 || full_offload) {
+    if (il != -1 && strcmp(name, "norm") == 0) {
+        const auto & dev_layer = lctx.model.dev_layer(il);
+        for (auto & backend : lctx.backends) {
+            if (ggml_backend_get_device(backend.get()) == dev_layer) {
+                if (ggml_backend_supports_op(backend.get(), cur)) {
+                    ggml_backend_sched_set_tensor_backend(lctx.sched.get(), cur, backend.get());
+                }
+            }
+        }
+    }
+}
+----
+
+NOTE: Normalization tensors are created by calls to [.codebit]#`llm_build_norm(...)`# from [.codebit]#`llm_build_context`#'s builder functions. Through a call to the callback described earlier, [.codebit]#`llm_build_norm(...)`# sets the tensor's name to "`norm`" (or "`norm_w`", which won't have the same effect), then this tensor is potentially moved to the desired backend in the callback. However, most of the time, after [.codebit]#`llm_build_norm(...)`# creates a normalization tensor, the caller builder function invokes the callback again to change its name to something more specific, like "`attn_norm`" or "`ffn_norm`". This results in most normalization tensors remaining on the specified backends while having names other than "`norm`".
+
+Secondly, [.codebit]#`llm_build_context`# is instantiated and initialized:
+
+[source,C++]
+----
+struct llm_build_context llm(lctx, ubatch, cb, worst_case);
+
+llm.init();
+----
+
+Lastly, the proper builder function is called based on [.codebit]#`llama_model`#'s [.codebit]#`arch`# member and the result is returned.
+
+
+[[docs:funcstructs:llama.cpp:llama_graph_compute]]
+=== llama_graph_compute
+
+Signature:
+[.codebit]#`static enum ggml_status llama_graph_compute(llama_context & lctx, ggml_cgraph * gf, int n_threads, ggml_threadpool * threadpool)`#
+
+As its name implies, this function computes a [.codebit]#`ggml_cgraph`# in a given [.codebit]#`llama_context`#. First it performs some threadpool management which was not analyzed in depth, then it calls [.codebit]#`ggml_backend_sched_graph_compute_async(...)`# for the actual graph computation, after which it logs any failures and returns a [.codebit]#`ggml_status`#.
+
+
+[[docs:funcstructs:llama.cpp:llama_decode_impl]]
+=== llama_decode_impl
+
+Signature:
+[.codebit]#`static int llama_decode_impl(llama_context & lctx, llama_batch inp_batch)`#
+
+This function handles the inference process. It has the following structure:
+
+* input batch processing
+* inference loop (until the input batch is emptied):
+ ** batch preparation
+ ** [.codebit]#`ggml_backend_sched_reset(...)`#
+ ** [.codebit]#`ggml_backend_sched_set_eval_callback(...)`#: this sets the scheduler's callback function to the user-provided one (if any)
+ ** [.codebit]#`llama_build_graph(...)`#
+ ** setting pointers to the output tensors. There are two types of outputs: logits, which are always extracted from the last tensor in the computation graph, and embeddings, which are extracted from the first tensor named "`result_embd_pooled`" (if such a tensor exists)
+ ** [.codebit]#`ggml_backend_sched_alloc_graph`#: this will also be called indirectly by [.codebit]#`llama_graph_compute(...)`#, so I believe this call is redundant
+ ** [.codebit]#`llama_set_inputs(...)`#
+ ** [.codebit]#`llama_graph_compute(...)`#
+ ** output extraction
+* output processing
+* kv cache defragmentation (if needed)
+* [.codebit]#`ggml_backend_sched_reset(...)`#
+
+
+[[docs:funcstructs:llama.cpp:llama_backend_init]]
+=== llama_backend_init
+
+Signature:
+[.codebit]#`void llama_backend_init(void)`#
+
+Calls [.codebit]#`ggml_time_init()`#, then [.codebit]#`ggml_init(...)`# and [.codebit]#`ggml_free(...)`# to initialize the f16 tables.
+
+
+[[docs:funcstructs:llama.cpp:llama_model_load_from_file_impl]]
+=== llama_model_load_from_file_impl
+
+Signature:
+[.codebit]#`static struct llama_model * llama_model_load_from_file_impl(const std::string & path_model, std::vector<std::string> & splits, struct llama_model_params params)`#
+
+Constructs a [.codebit]#`struct llama_model`# and sets its devices (using calls to [.codebit]#`ggml_backend_dev_count()`# and [.codebit]#`ggml_backend_dev_get(...)`#), logs information on their memory, calls [.codebit]#`llama_model_load(...)`# and logs possible errors before returning.
+
+
+[[docs:funcstructs:llama.cpp:llama_model_load_from_file]]
+=== llama_model_load_from_file
+
+Signature:
+[.codebit]#`struct llama_model * llama_model_load_from_file(const char * path_model, struct llama_model_params params)`#
+
+Wrapper for [.codebit]#`llama_model_load_from_file_impl`# (calls it with an empty [.codebit]#`splits`# parameter).
+
+
+[[docs:funcstructs:llama.cpp:llama_init_from_model]]
+=== llama_init_from_model
+
+Signature:
+[.codebit]#`struct llama_context * llama_init_from_model(struct llama_model * model, struct llama_context_params params)`#
+
+Constructs a [.codebit]#`llama_context`# object, sets up its members according to the [.codebit]#`params`# argument, then initializes (by calls to [.codebit]#`ggml_backend_dev_init(...)`#) the backends of the devices set in [.codebit]#`model`# and adds them to [.codebit]#`llama_context.backends`#. The rest has not been documented.
+
+
+[[docs:funcstructs:llama.cpp:llama_decode]]
+=== llama_decode
+
+Signature:
+[.codebit]#`int32_t llama_decode(struct llama_context * ctx, struct llama_batch batch)`#
+
+Wrapper for [.codebit]#`llama_decode_impl(...)`#.
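+
+To close the loop, a hedged sketch of driving [.codebit]#`llama_decode(...)`# directly ([.codebit]#`llama_batch_get_one(...)`# is the simplest batch helper in llama.h; the token values are placeholders):
+
+[source,C++]
+----
+llama_token prompt[] = { 1, 15043, 1781 };          // placeholder token ids
+llama_batch batch = llama_batch_get_one(prompt, 3); // one sequence of 3 tokens
+
+if (llama_decode(ctx, batch) != 0) {
+    // per llama.h: 1 means no KV slot was available, negative values are errors
+}
+----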