Colossal-LLaMA-2: Low Cost and High-quality Domain-specific LLM Solution Using LLaMA and Colossal-AI
The most prominent distinction between LLaMA-1 and LLaMA-2 lies in the incorporation of higher-quality corpora, a pivotal factor contributing to significant performance enhancements in LLaMA-2. This, coupled with its commercial availability, extends the potential for creative applications of large models within the open-source community.
Nevertheless, it’s widely recognized that the cost of pre-training large models from scratch is exorbitant, often humorously referred to as a domain accessible only to those with “50 million dollars” to spare. This deters many companies and developers, so how can we build our own large models at a lower cost?
Being at the forefront of cost reduction and efficiency enhancement for large models, the Colossal-AI team maximizes the core capabilities of LLaMA-2. Through innovative training techniques, Colossal-AI achieved remarkable results using only about 8.5 billion tokens of data, 15 hours of training, and a few hundred dollars in compute. The result is a high-performance Chinese LLaMA-2 model that consistently outperforms competitors across multiple evaluation benchmarks.
Building upon this initial work, the Colossal-AI team began the next iteration of the model. They constructed a more refined and comprehensive data architecture using 25 billion tokens of data, and ultimately produced a refined 13B model for just $5,000.
In contrast to the original LLaMA-2, Colossal-AI’s model not only enhances Chinese language capabilities but also further refines its proficiency in English. Remarkably, it exhibits performance levels that rival state-of-the-art (SOTA) models of similar scale within the open-source community.
Underpinning Colossal-AI’s approach are steadfast open-source principles. As a result, this model is made available without any commercial restrictions, with complete transparency across the entire training process, code, and model weights. In addition, Colossal-AI offers a comprehensive evaluation framework, ColossalEval, enabling cost-effective reproducibility.
Moreover, the methodologies developed by Colossal-AI can be readily applied across various domains, facilitating the economical construction of large models that are pre-trained from scratch.
Open-source code and weights are available at:
https://github.com/hpcaitech/ColossalAI
Performance
Colossal-LLaMA-2 7B
Note: Scores are based on ColossalEval; scores in parentheses come from the official leaderboards of the corresponding models, and C-Eval scores are taken from the official leaderboard.
On common English benchmarks, the MMLU results show that Colossal-LLaMA-2-7B-base, backed by low-cost continual pre-training, has overcome catastrophic forgetting: its score improves steadily (44.47 -> 53.06), an outstanding result among all 7B-scale models.
In the Chinese rankings, the primary comparisons were made against CMMLU, AGIEVAL, GAOKAO, and C-Eval. Colossal-LLaMA-2 significantly outperforms other Chinese localizations based on LLaMA-2. Even compared with renowned models trained from scratch on Chinese corpora at costs potentially in the millions of dollars, Colossal-LLaMA-2 still stands out at the same scale. Notably, compared with the original LLaMA-2, it achieves a remarkable leap in Chinese proficiency (CMMLU: 32.97 -> 49.89).
In addition, fine-tuning methods such as SFT and LoRA can inject only limited knowledge and capabilities into the base model, which falls short of the requirements for building high-quality domain-specific knowledge bases or specialized model applications.
To better evaluate the performance of the models, the Colossal-AI team relies not only on quantitative indicators but also conducts manual evaluations on different model aspects. Below are some examples:
The full training loss curve shows that, even while exploiting the cost-saving capabilities of the Colossal-AI system, the model converges well. The model achieves these results with a training dataset of only about 8.5 billion tokens and compute costs in the range of hundreds of dollars. In contrast, many large models on the market require training on several trillion tokens to ensure effectiveness, at significantly higher cost.
Colossal-LLaMA-2 13B
Note: Scores are based on ColossalEval; scores in parentheses come from the official leaderboards of the corresponding models, and C-Eval scores are taken from the official leaderboard.
In the English MMLU rankings, Colossal-LLaMA-2-13B-base shows a steady improvement in English performance thanks to low-cost continual pre-training. Notably, on the GSM8K benchmark its English mathematical reasoning improves markedly (31.31 -> 58.83), outperforming all other 13B models.
In the Chinese rankings, we primarily compared CMMLU, AGIEVAL, GAOKAO, and C-Eval. Colossal-LLaMA-2 surpasses other Chinese models based on LLaMA-2 by a wide margin. Even compared with well-known Chinese models pre-trained from scratch at costs potentially in the millions of dollars, Colossal-LLaMA-2’s performance is outstanding. Particularly noteworthy is the leap in Chinese proficiency over the original LLaMA-2 (CMMLU: 38.14 -> 61.8).
The loss records across the entire training run show that, by leveraging the cost-reduction and efficiency features of the Colossal-AI system, the model’s convergence is well-preserved. Remarkably, these results were achieved with approximately 25 billion tokens and about $5,000 in compute. In contrast, prevalent large-scale models on the market demand training on several trillion tokens, incurring substantially higher computational expense to achieve comparable effectiveness.
So, how did the Colossal-AI team manage to reduce training costs and achieve such impressive results?
Vocabulary Expansion and Model Initialization
LLaMA-2’s original vocabulary was not specifically optimized for Chinese and had a limited set of Chinese words, resulting in insufficient comprehension of Chinese language data. Consequently, the first step involved expanding the vocabulary of LLaMA-2.
The Colossal-AI team discovered that:
- Vocabulary expansion not only significantly improved the efficiency of encoding string sequences but also enriched the encoded sequences with more meaningful information. This, in turn, proved highly beneficial for document-level encoding and understanding.
- However, due to the limited volume of continual pre-training data, an extensive expansion of the vocabulary could result in certain words or combinations lacking practical meaning, making it challenging to learn them effectively from the continual pre-training dataset and impacting the final performance.
- An excessively large vocabulary would increase the number of embedding-related parameters, affecting training efficiency.
Therefore, after conducting numerous experiments while weighing both training quality and efficiency, the Colossal-AI team decided to expand the vocabulary from LLaMA-2’s original 32,000 tokens to 69,104.
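To make the first point above concrete, the short sketch below compares how many tokens the original LLaMA-2 tokenizer and an expanded tokenizer need for the same Chinese sentence. The sample sentence is illustrative, and it is assumed that the expanded tokenizer ships with the released Colossal-LLaMA-2 model on Hugging Face (linked at the end of this article).

```python
# Sketch: comparing encoding efficiency of the original LLaMA-2 tokenizer with
# an expanded, Chinese-aware tokenizer (assumed to be included in the released
# Colossal-LLaMA-2 repository).
from transformers import AutoTokenizer

text = "大语言模型的持续预训练需要高质量的中文语料。"  # "Continual pre-training of LLMs needs high-quality Chinese corpora."

base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
expanded_tok = AutoTokenizer.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base")

base_len = len(base_tok.encode(text, add_special_tokens=False))
expanded_len = len(expanded_tok.encode(text, add_special_tokens=False))

# A Chinese-aware vocabulary compresses the same sentence into far fewer tokens,
# so each token carries more meaningful information.
print(f"original vocabulary:  {base_len} tokens")
print(f"expanded vocabulary:  {expanded_len} tokens")
```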
With the expanded vocabulary in place, the next step was to initialize the embeddings of the new tokens based on the original LLaMA-2. To transfer capabilities smoothly from the original LLaMA-2 to the Chinese LLaMA-2 while keeping English proficiency unaffected in the initial state, the Colossal-AI team mean-initialized the new embeddings using the weights of the original LLaMA-2. This preserved the model’s English capabilities and allowed them to carry over smoothly to the Chinese model.
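The article does not spell out the exact initialization recipe, but one common realization of mean initialization is sketched below: each newly added token is decomposed with the original LLaMA-2 tokenizer, and its input and output embeddings are set to the mean of the corresponding original embeddings. The tokenizer path is a hypothetical placeholder, and the sketch assumes the expanded vocabulary keeps the original 32,000 token ids in place and appends new tokens after them.

```python
# Sketch: mean-initialize the embeddings of newly added tokens. Each new token
# is decomposed with the original LLaMA-2 tokenizer, and its input/output
# embeddings are set to the mean of the corresponding original embeddings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
new_tok = AutoTokenizer.from_pretrained("path/to/expanded-tokenizer")  # hypothetical path
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

old_vocab_size = len(old_tok)
model.resize_token_embeddings(len(new_tok))  # new rows are appended at the end

in_emb = model.get_input_embeddings().weight.data
out_emb = model.get_output_embeddings().weight.data

with torch.no_grad():
    for token_id in range(old_vocab_size, len(new_tok)):
        token = new_tok.convert_ids_to_tokens(token_id)
        # "▁" is the SentencePiece word-boundary marker; map it back to a space
        # so the piece can be re-encoded with the original tokenizer.
        old_ids = old_tok.encode(token.replace("▁", " "), add_special_tokens=False)
        if old_ids:  # skip anything the original tokenizer cannot represent
            in_emb[token_id] = in_emb[old_ids].mean(dim=0)
            out_emb[token_id] = out_emb[old_ids].mean(dim=0)
```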
Data Construction
High-quality data plays a key role in further reducing training costs, especially for continual pre-training, which places strict requirements on data quality and distribution. To better filter for high-quality data, the Colossal-AI team built a complete data-cleaning system and toolkit for selecting higher-quality data for continual pre-training.
The following image shows the complete data governance process of the Colossal-AI team:
In addition to heuristic selection and deduplication of the data, scoring and classification-based filtering were applied to key data. Suitable data plays a crucial role in activating LLaMA-2’s Chinese capabilities while simultaneously overcoming catastrophic forgetting in English.
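The actual toolkit is not reproduced here, but a minimal illustration of such a cleaning pass, combining a few heuristic quality checks with exact deduplication over JSONL documents, might look like the following. The thresholds and the hash-based deduplication are illustrative choices, not the team’s actual pipeline.

```python
# Sketch: a minimal heuristic filter plus exact deduplication pass over JSONL
# documents of the form {"text": ...}.
import hashlib
import json

def keep(doc: str) -> bool:
    """Simple heuristic quality checks (illustrative thresholds)."""
    if len(doc) < 50:                       # too short to be informative
        return False
    if doc.count("http") > 20:              # likely link spam
        return False
    alnum_ratio = sum(c.isalnum() for c in doc) / max(len(doc), 1)
    return alnum_ratio > 0.3                # mostly symbols / boilerplate otherwise

def clean(in_path: str, out_path: str) -> None:
    seen = set()
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            doc = json.loads(line)["text"]
            digest = hashlib.md5(doc.strip().encode("utf-8")).hexdigest()
            if digest in seen or not keep(doc):
                continue                    # drop duplicates and low-quality docs
            seen.add(digest)
            fout.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")
```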
Finally, to improve training efficiency for data on the same topic, the Colossal-AI team sorted the data by length and concatenated samples up to the maximum sequence length of 4,096.
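A minimal sketch of this length-sorted concatenation is shown below; appending an end-of-sequence token between samples is an assumption about how documents are separated.

```python
# Sketch: sort tokenized samples by length, then greedily concatenate them into
# sequences of at most 4,096 tokens.
from typing import List

MAX_LEN = 4096

def pack(samples: List[List[int]], eos_id: int) -> List[List[int]]:
    packed, current = [], []
    for ids in sorted(samples, key=len):
        ids = ids[:MAX_LEN - 1] + [eos_id]          # truncate overly long samples
        if len(current) + len(ids) > MAX_LEN:       # start a new packed sequence
            packed.append(current)
            current = []
        current += ids
    if current:
        packed.append(current)
    return packed
```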
In contrast to the 7B version, the 13B model’s training uses a more refined data architecture, categorizing data into knowledge-based, functional, and memory replay data.
Knowledge-based data is subdivided into over a dozen major categories, including finance, law, education, and so on, with each major category further divided into subcategories to enable precise control over different data. Additionally, the amount of data from various vertical domains was increased to ensure the model has a strong grasp of data from diverse fields.
To address the community’s demand for functional capabilities in large models, targeted enhancements were made for different natural language processing tasks. This ensures that, already during pre-training, the model reaches a certain level of understanding and proficiency in common NLP tasks such as text summarization, information extraction, and following complex problem-solving chains.
Furthermore, memory replay data is a crucial component for consolidating the model’s mastery of previously acquired knowledge, effectively enhancing its overall performance and generalization ability.
To address the growing security concerns, the Colossal-AI team implemented multidimensional enhancements (political sensitivity, religious sensitivity, abusive language, hatred, bias, illegal activities, physical harm, mental health, property privacy, moral and ethical considerations, etc.) to ensure the foundational model possesses robust security and adheres to correct values.
Training Strategy
Multi-stage Training
In terms of training, given the characteristics of continual pre-training, the Colossal-AI team designed a multi-stage, hierarchical continual pre-training scheme, dividing the training process into three stages:
- Large-Scale Pre-training Stage: The goal at this stage is to enable the model to produce relatively fluent text through training on a large corpus. This stage is already covered by the original LLaMA-2; after it, the model has mastered a large amount of English knowledge and can produce fluent output based on next-token prediction.
- Chinese Knowledge Injection Stage: This stage relies on high-quality Chinese knowledge. On one hand, it enhances the model’s grasp of Chinese knowledge, and on the other hand, it improves the model’s understanding of the newly added words in the Chinese vocabulary.
- Relevant Knowledge Replay Stage: This stage is dedicated to enhancing the model’s understanding and generalization ability of knowledge and alleviating the catastrophic forgetting problem.
The stages complement one another, ultimately ensuring that the model advances in both Chinese and English abilities.
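As a rough illustration only, the two continual stages could be expressed as a simple schedule like the one below. The dataset names and token budgets are hypothetical, and `train_stage` stands in for the actual training loop; the large-scale pre-training stage is already covered by LLaMA-2 itself.

```python
# Sketch: the continual pre-training stages expressed as a simple schedule.
STAGES = [
    {"name": "chinese_knowledge_injection",
     "datasets": ["zh_web", "zh_books", "zh_encyclopedia"],
     "token_budget": 20_000_000_000},
    {"name": "knowledge_replay",
     "datasets": ["zh_high_quality_replay", "en_replay"],
     "token_budget": 5_000_000_000},
]

for stage in STAGES:
    print(f"running stage: {stage['name']}")
    # train_stage(model, stage["datasets"], stage["token_budget"])  # your training loop
```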
Bucket Training
Continual pre-training is extremely sensitive to data distribution, so balance is particularly important. To ensure a balanced distribution of data, the Colossal-AI team designed a data bucketing strategy, dividing the same type of data into 10 different bins. During the training process, each data bucket contains one bin of each type of data, thereby ensuring that each type of data can be utilized evenly by the model.
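A minimal sketch of such a bucketing strategy, assuming the data is already grouped by category, might look like this:

```python
# Sketch: split every data category into 10 bins, then form training buckets
# that each contain one bin from every category, so all categories are seen
# evenly throughout training. The category structure is illustrative.
import random
from typing import Dict, List

NUM_BINS = 10

def make_buckets(data_by_category: Dict[str, List[dict]]) -> List[List[dict]]:
    binned = {}
    for category, samples in data_by_category.items():
        random.shuffle(samples)
        # Split the category into NUM_BINS roughly equal bins.
        binned[category] = [samples[i::NUM_BINS] for i in range(NUM_BINS)]

    buckets = []
    for b in range(NUM_BINS):
        bucket = []
        for category in binned:
            bucket.extend(binned[category][b])   # one bin of each category
        random.shuffle(bucket)
        buckets.append(bucket)
    return buckets
```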
Evaluation System
To better assess the performance of the model, the Colossal-AI team has built a complete evaluation system — ColossalEval, which evaluates large language models from multiple dimensions. The framework and code of the process are fully open-source, supporting the reproduction of results and also allowing users to customize datasets and evaluation methods according to the application scenario. The characteristics of the evaluation framework are summarized as follows:
- It includes common datasets for evaluating the knowledge reserves of large language models, such as MMLU, CMMLU, etc. For formats such as multiple-choice questions that compare the probabilities of options A/B/C/D, more comprehensive scoring methods are added, such as absolute matching and single-choice perplexity (see the sketch after this list), to measure the model’s grasp of knowledge more thoroughly.
- It supports evaluation of multiple-choice questions and long-text assessments.
- It supports evaluation methods for different application scenarios, including multi-turn dialogue, role-playing, information extraction, content generation, etc. Users can selectively assess different aspects of the model’s abilities based on their specific needs, and the system supports custom prompts and evaluation methods to cater to individual preferences and requirements.
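As a rough illustration of the single-choice perplexity idea (not ColossalEval’s actual implementation), the sketch below appends each candidate answer to the question and picks the choice whose answer tokens receive the lowest average loss:

```python
# Sketch: "single-choice perplexity" scoring for a multiple-choice question.
# The choice whose continuation gets the lowest average token loss is selected.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def pick_choice(model, tok, question: str, choices: List[str]) -> int:
    losses = []
    for choice in choices:
        prompt_ids = tok(question, return_tensors="pt").input_ids
        full_ids = tok(question + choice, return_tensors="pt").input_ids
        labels = full_ids.clone()
        labels[:, : prompt_ids.shape[1]] = -100      # score only the answer tokens
        with torch.no_grad():
            loss = model(full_ids, labels=labels).loss  # mean loss over answer tokens
        losses.append(loss.item())
    return losses.index(min(losses))                  # lowest loss = most likely answer

from typing import List  # noqa: E402 (kept near usage for readability)

model = AutoModelForCausalLM.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base")
tok = AutoTokenizer.from_pretrained("hpcai-tech/Colossal-LLaMA-2-7b-base")
# "Which city is the capital of China? Answer:" with four city options.
answer = pick_choice(model, tok, "中国的首都是哪座城市？\n答案：", ["北京", "上海", "广州", "深圳"])
print(answer)
```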
Bridging from General Large Models to Domain-specific Large Models
From the experience of the Colossal-AI team, constructing a Chinese version of LLaMA-2 can be summarized into the following process:
So, can this process be reused?
The answer is affirmative, and it holds great significance in real-world implementation scenarios.
As the wave of artificial intelligence driven by ChatGPT surges, major internet giants, AI companies, startups, universities, research institutions, and others are all racing to build large general-purpose models. Behind the generality of these models, however, often lies a lack of domain-specific knowledge, so the issue of practical applicability becomes particularly serious. While fine-tuning for specific applications can yield some benefits, the absence of domain-specific large models creates performance bottlenecks in application deployment.
If a domain-specific large model can be rapidly and cost-effectively constructed, followed by fine-tuning for specific business needs, it would undoubtedly advance the deployment of applications, providing a competitive advantage.
Applying the above process to perform knowledge transfer in any field allows for the cost-effective construction of lightweight domain-specific foundational large models:
For constructing foundational large models from scratch, one can also draw inspiration from the aforementioned experiences and Colossal-AI’s cost-reduction and efficiency-enhancing capabilities to efficiently achieve this goal at minimal cost.
System Optimization
The impressive performance and cost advantages of Colossal-LLaMA-2 are built upon the foundation of the low-cost AI large model development system, Colossal-AI.
Colossal-AI, built on PyTorch, uses efficient multi-dimensional parallelism, heterogeneous memory management, and other techniques to reduce the cost of developing and deploying large AI models for training, fine-tuning, and inference, while improving model task performance and reducing GPU requirements. In just over a year, it has gained more than 30,000 GitHub stars in the open-source community and ranks first worldwide among large-model development tools and communities. It has collaborated with numerous Fortune 500 companies and other well-known enterprises to develop and optimize models with tens to hundreds of billions of parameters and to build domain-specific models.
Low Cost and High-quality Model Construction
After constructing multidimensional data and enhancing the foundational model’s natural language capabilities, the Colossal-AI team developed a 7B model and a more powerful 13B model. Building on these models, community users need less high-quality fine-tuning data, lowering the cost of creating personalized models.
Colossal-LLaMA-2 Download: https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-7b-base
https://huggingface.co/hpcai-tech/Colossal-LLaMA-2-13b-base
Colossal-AI Open Source Address: https://github.com/hpcaitech/ColossalAI