GGUF file format specification #302
General question: is this format only for LLMs? What about vision models and multiple models in one file? E.g. https://github.com/monatis/clip.cpp does that.
Nope! LLMs are just the use-case I'm familiar with. We should describe |
I'm afraid defining a closed metadata vocabulary might be a restrictive design that slows innovation in the GGML community. My suggestion would be to define a format for encoding freeform key-value pairs: One possible way might be
Almost anything can be reduced to this type of key-value pair. If needed, we can extend it to a nested structure as well, but I believe that the metadata keys should be open and no model-specific metadata should be defined. The GGML manifesto states that "The AI models are improving at a very high rate and it is important to stay on top of it." and I think we must not define such keys in order to stay on top of improvements in AI. |
Yes, that's what's described in the spec. It's not a closed set; the keys that are specified are standardized and guaranteed to always share the same meaning, but users can extend it with their own as required to serve their needs. Ideally, the more popular KVs would end up being standardized as well. |
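To make the "standardized keys plus free extensions" idea concrete, here is a minimal Python sketch. It is not taken from the spec: the key names, the `STANDARDIZED_KEYS` set and the `split_metadata` helper are all assumptions chosen for illustration.

```python
from typing import Dict, Tuple, Union

# Value types used by this sketch only; the spec defines its own type set.
MetadataValue = Union[int, float, bool, str, list]

# Hypothetical set of standardized keys, chosen for illustration.
STANDARDIZED_KEYS = {"general.architecture", "general.name"}


def split_metadata(
    metadata: Dict[str, MetadataValue],
) -> Tuple[Dict[str, MetadataValue], Dict[str, MetadataValue]]:
    """Separate keys with an agreed meaning from user-defined extensions."""
    standard = {k: v for k, v in metadata.items() if k in STANDARDIZED_KEYS}
    custom = {k: v for k, v in metadata.items() if k not in STANDARDIZED_KEYS}
    return standard, custom


metadata = {
    "general.architecture": "llama",
    "general.name": "example-model",
    "myproject.favourite_colour": "blue",  # user extension, not standardized
}
standard, custom = split_metadata(metadata)
```

An executor written this way can rely on the standardized keys and simply ignore, or pass through, any extension keys it does not recognise.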
check out the README :) https://github.com/ggerganov/ggml#updates |
I've addressed the review comments 👍 Just asking the people behind each implementation: can you suggest metadata values that should be standardized, if any?
|
How is this spec relating to LoRa ( |
Good spot, I actually noticed that this morning and hadn't updated it. What should it look like? I imagine that you want it to
Maybe a subset of the fields of the original LLM, with a |
The LoRA files are very simple currently, it's just a tiny header with a few parameters and a bunch of tensors. I think it should work fine with the way this is designed currently. The only parameters stored in the header currently are the rank and alpha values of the LoRA. This is not enough to support every type of LoRA, so I wouldn't bother with defining this in a very detailed way for now, we can look into it later. |
What is the difference between |
I suggest use of special key-values to identify special tokens:
|
There is no difference. I suppose it just came into existence because the Falcon implementation was derived from MPT/Replit, which also has this naming. |
Updated with latest round of feedback. |
Note that @saharNooby and I are the maintainer and a contributor (respectively) of RWKV.cpp, a popular RWKV inference library, so the parameters we proposed are indeed the ones needed to run inference properly with the model. You could add them without much trouble |
Oh, no, I know this; I was just giving you two an opportunity to agree on what the names of those fields should be before I wrote anything up. |
Some models have special tokens for separating two (or more) sides of a chat conversation—OpenAI is one example of a company that trains models like this, in an attempt to disallow the "user" from performing prompt injections, by giving the "system" higher authority. How would this be represented? |
Big-endian support is proposed in ggerganov/llama.cpp#3552. I think it needs the community's attention because the spec is explicit about little-endianness. We need to update it when merging that PR. |
So Huggingface just introduced prompt templates embedded into the model files. https://huggingface.co/docs/transformers/main/chat_templating This means that when a model creator has set the prompt format in it, inference programs and UIs can detect the right prompt template based on the model files and set it automatically. IMO, this is absolutely huge and a big step forward into making LLMs more accessible to everyone. @ggerganov Does GGUF support this? |
You can do this already: just look in the vocab for common tokens and you know which "template" was used. What Huggingface added is Jinja-specific. I don't think ggml/llama.cpp is going to re-implement a Jinja templating engine. EDIT: I don't think it's even possible, since those templates depend on Python expressions. Also, it's only for chat; you can use models for other things, and then the chat template is useless, whereas the vocab introspection would still help. |
I believe this should be the responsibility of downstream executors if they want to support this. It's technically possible to add any template info to GGUF files: you can easily add the system prompt, role names or a full templated string and then read it from the model file to act accordingly. |
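A rough Python sketch of the vocab-introspection idea described above. The marker table and the `guess_template` helper are hypothetical; a real executor would need its own list of tokens that reliably identify each prompt format.

```python
from typing import Iterable, Optional

# Hypothetical mapping from a template name to tokens that identify it.
TEMPLATE_MARKERS = {
    "chatml": ["<|im_start|>", "<|im_end|>"],
}


def guess_template(vocab: Iterable[str]) -> Optional[str]:
    """Return the first template whose marker tokens all appear in the vocab."""
    vocab_set = set(vocab)
    for name, markers in TEMPLATE_MARKERS.items():
        if all(marker in vocab_set for marker in markers):
            return name
    return None  # no known marker tokens found
```

This only works for formats that rely on dedicated special tokens; a format built from ordinary text pieces would not be detectable this way.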
So what you are saying is, GGUF files already contain the "chat_template" information in the tokenizer_config.json and it's up to the inference program to use it? If that's the case, then I agree with you. |
It's up to the converter to add those pieces of information to the GGUF file and it's up to the inference program to make use of them. GGML / llama.cpp does not distribute preconverted GGUF files officially; community members do. Anyone can add arbitrary key-value pairs to GGUF files that they want to use after the conversion. The GGUF spec defines only the structure, not the content. |
Hi all! Apologies for the late follow-up on this, I've been tremendously busy and I'm catching back up on everything now. I've updated the PR to address the comments. @ggerganov I think we should merge this in now, and then let the community make follow-up PRs to fill in any specifics or correct things.
Fixed, thanks!
Well-spotted. I've reworded the relevant section to make it more obvious what the intended goal is there; I was going to go with your fix, but figured that a different approach might be clearer.
I'm sorry, but I didn't understand what you were suggesting; would you be able to produce a diff to the spec? I can edit it as required, I'm just not sure what the exact changes you'd like me to make are 😞
Done. That PR's not merged yet, so I've left the old key in there in the meantime so that implementers aren't confused.
Thanks for mentioning this. As far as I can tell, the only functional difference is the version number has changed, which is... not ideal? How do you tell apart a little-endian and big-endian file? I've updated to v3 nonetheless and written a little about it, but this seems like something we should rectify ASAP.

Regarding the discussion of prompt templates: I think this is a great idea (it improves the single-file usability of GGUF even further), which is why there's a section reserved for it in the spec. Nobody's defined what this should look like for GGUF yet as the design space was too wide when we were initially looking at it. It looks like Hugging Face are using Jinja templates for their prompt templates, which is a simple and straightforward solution to the problem, but may not be appropriate for us: Jinja supports a lot of things we don't care about, and its canonical implementation is Python. I'd love to hear suggestions for how we could embed prompt templates that cover the majority of what people would want to do with prompts while still remaining relatively simple to implement. |
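One possible heuristic for the "how do you tell them apart" question, sketched in Python. It assumes the header starts with the four bytes `GGUF` followed by a 32-bit unsigned version, and that version numbers stay small; the `guess_byte_order` helper is hypothetical and this is not a substitute for an explicit endianness marker.

```python
import struct

GGUF_MAGIC = b"GGUF"  # assumed to be stored as the same four bytes in every file


def guess_byte_order(header: bytes) -> str:
    """Guess the byte order of a GGUF header from its version field."""
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF header")
    (version_le,) = struct.unpack("<I", header[4:8])
    (version_be,) = struct.unpack(">I", header[4:8])
    # Versions are small integers, so the smaller reading is the plausible one.
    return "little" if version_le <= version_be else "big"
```

For a v3 file, the little-endian reading of a big-endian version field is 0x03000000, which is implausible as a version number, so the heuristic picks "big".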
Speaking from experience with The models are quite adaptable and the use of I appreciate @ggerganov's original recommendation on handling special tokens, and the parameterized approach has its merits as it automates a previously tedious process. In Python, I've experimented with a flexible design that aligns well with @ggerganov's original idea and translates smoothly to JSON.

# Default chat formatting templates for reusability.
# These templates can be reused or modified on a model-by-model basis.
# Note: `llama_types` is a types module whose import is not shown in this comment.

# Template for HuggingFace-based models.
huggingface_template = {
    "model": "meta-llama/Llama-2-7b-chat-hf",
    "jinja": None,
    "tokenize": False,
}

# Common formatting settings applicable to all roles in chat models.
common_template: llama_types.CommonTemplate = {
    "separators": {
        "after_system": "\n",
        "between_messages": "\n",
        "end_of_response": "",
    },
    "default_termination": {
        "role": "assistant",  # Default role for termination
        "message": None,  # Default termination message (None for assistant)
    },
    "include_prompt": False,  # Whether to include user prefix/postfix in prompts
}

# Template for Llama-2 model.
llama2_template: llama_types.ChatMLTemplate = {
    "roles": {
        "system": {
            "prefix": "<<SYS>>",  # System message prefix
            "postfix": "<</SYS>>",  # System message postfix
            "format": None,  # Optionally specify a custom format
        },
        "user": {
            "prefix": "[INST] ",
            "postfix": " [/INST]",  # Model generates from here
            "format": None,
        },
        "assistant": {
            "prefix": "",  # No prefix for assistant role by default
            "postfix": "",  # No postfix for assistant role by default
            "format": None,  # Custom format for assistant (if needed)
        },
    }
}

# Merge common settings into the llama2_template to reduce code duplication.
# (The dict `|=` operator requires Python 3.9 or later.)
llama2_template |= common_template

Regarding JSON, it might not be the most convenient in C++, but it's manageable. You could consider using What sets this library apart is its commitment to user control and flexibility, a feature I highly respect. Imposing rigid templates could compromise this strength. Translating this to C++ should be relatively straightforward, as shown in the provided example.

#include <iostream>
#include <map>
#include <string>

// Separators shared by all roles.
struct CommonTemplate {
    std::string after_system;
    std::string between_messages;
    std::string end_of_response;
    // Add other common fields
};

// Text wrapped around a single role's message.
struct RoleTemplate {
    std::string prefix;
    std::string postfix;
    // Add other role-specific fields
};

int main() {
    CommonTemplate common_template = { "\n", "\n", "" };

    std::map<std::string, RoleTemplate> roles;
    roles["system"] = { "<<SYS>>", "<</SYS>>" };
    roles["user"] = { "[INST] ", " [/INST]" };
    roles["assistant"] = { "", "" };

    std::map<std::string, std::map<std::string, RoleTemplate>> llama2_template;
    llama2_template["roles"] = roles;

    // Adding common settings to llama2_template can be done by another structure
    // or by manual assignment if the key names differ between the common and role-specific settings.
    std::cout << "System Prefix: " << llama2_template["roles"]["system"].prefix << std::endl;
    std::cout << "User Prefix: " << llama2_template["roles"]["user"].prefix << std::endl;
    return 0;
}

However, I'd strongly caution against imposing fixed templates. User-centric design has been the strong suit of this library, and I would appreciate keeping it that way. |
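As a rough illustration of how an executor might consume the role prefix/postfix templates proposed in the comment above, here is a small Python sketch. The `render_prompt` helper is hypothetical, it uses only the `between_messages` separator, and it is not claimed to reproduce Llama-2's exact prompt format.

```python
from typing import Dict, List

# A cut-down template in the same shape as the one proposed above.
template = {
    "roles": {
        "system": {"prefix": "<<SYS>>", "postfix": "<</SYS>>"},
        "user": {"prefix": "[INST] ", "postfix": " [/INST]"},
        "assistant": {"prefix": "", "postfix": ""},
    },
    "separators": {"between_messages": "\n"},
}


def render_prompt(template: dict, messages: List[Dict[str, str]]) -> str:
    """Wrap each message in its role's prefix/postfix and join the results."""
    roles = template["roles"]
    separator = template.get("separators", {}).get("between_messages", "")
    parts = []
    for message in messages:
        role = roles[message["role"]]  # raises KeyError for unknown roles
        parts.append(role["prefix"] + message["content"] + role["postfix"])
    return separator.join(parts)


prompt = render_prompt(
    template,
    [
        {"role": "system", "content": "Be brief."},
        {"role": "user", "content": "Hi"},
    ],
)
```

Because the template is plain data, the same structure could be serialized to JSON or stored as metadata without tying the executor to a templating engine.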
Thank you once again for initiating GGUF, writing and maintaining the spec and coordinating the community around it! Will proceed with merging the PR and let follow-up PRs update the spec as needed. My idea about including chat templates in GGUF only extends to the point where it is a container of opaque templates associated with the given model. A project using GGUF files can decide what to do with these templates, but specifically for Without knowing the specifics of these templates and assuming they are just a parsable string, I would suggest that we store them in GGUF just as strings. Probably an array of strings, since there could be more than one template associated with the model. |
Woohoo! Glad to see it in; it's been great to see widespread adoption 🎉 I also agree on the subject of prompt templates - we wouldn't require anyone to use them, but they're there to ease use with supported executors. I think we might include hints as to how they're used (i.e. indicate that it's a Jinja template or whatever), but it should otherwise be pretty freeform. A discussion for another issue, perhaps? |
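A minimal Python sketch of the "array of strings plus a format hint" idea from the last two comments. Nothing here is defined by the spec: the metadata keys, the `SUPPORTED_FORMATS` set and the `usable_templates` helper are assumptions made up for this example.

```python
from typing import Dict, List

# Hypothetical keys; the spec had not defined any at the time of this thread.
TEMPLATES_KEY = "example.prompt_templates"
FORMAT_KEY = "example.prompt_template_format"

# Formats this imaginary executor knows how to apply.
SUPPORTED_FORMATS = {"plain"}


def usable_templates(metadata: Dict[str, object]) -> List[str]:
    """Return the stored templates only if their declared format is supported."""
    if metadata.get(FORMAT_KEY) not in SUPPORTED_FORMATS:
        return []  # unknown or missing hint: treat the templates as opaque
    return list(metadata.get(TEMPLATES_KEY, []))


metadata = {
    FORMAT_KEY: "plain",
    TEMPLATES_KEY: ["User: {prompt}\nAssistant:"],
}
templates = usable_templates(metadata)
```

An executor that does not understand the declared format can skip the templates entirely, which keeps their use optional.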
The Python package has ~35k monthly downloads: https://pypistats.org/packages/gguf |
There are more than 1300 GGUF models hosted on HuggingFace: https://huggingface.co/models?sort=trending&search=gguf 🎉 |
We should be able to embed the jinja2 templates from Huggingface in the GGUF metadata; that would be really helpful. jinja2 is easy to parse and lightweight: https://github.com/jinja2cpp/Jinja2Cpp !! |
I'd suggest opening another issue or a PR to the spec for that. I don't think llama.cpp will support actually using the templates, but it would be good to standardise the metadata keys for prompt templates regardless. |
There is already one here ggerganov/llama.cpp#3810 (comment) Support for it directly in the llama.cpp UI would be cool, but it's not a big deal if it isn't implemented. What would be a big deal, however, is for converter.py to write the chat template metadata into the GGUF file so programs can read it and adjust the prompt template automatically. |
There should be an option in the GGUF API in |
Depending on which conversion script you're running, you may be running into ggerganov/llama.cpp#3433. |
Closes #220.
Rendered: https://github.com/philpax/ggml/blob/gguf-spec/docs/gguf.md
Defines a complete specification for the proposed GGUF file format, which should generically describe models to be loaded by any compatible executor.
This is a first draft, so there's still some work that needs to be done - I need to fill in the TODOs and clarify a few things. If you have any suggestions for what should go in the TODOs, please let me know!
Changes from the version in the issue include: