Overview
Stability AI has recently launched two new AI models called FreeWilly1 and FreeWilly2. These models are based on the LLaMA foundation models created by Meta AI. LLaMA (Large Language Model Meta AI) is a family of open foundation language models trained on large amounts of publicly available text. The FreeWilly models, equipped with billions of parameters, are capable of handling a wide range of natural language tasks such as text generation, summarization, question answering, sentiment analysis, and more.
Model Specifications
FreeWilly1 is built on the LLaMA 65B model, which has 65 billion parameters, while FreeWilly2 is built on the newer LLaMA 2 70B model, which has 70 billion parameters. Because LLaMA 2 is an updated version of the original LLaMA, FreeWilly2 offers better performance and efficiency. Both models were improved using a method called supervised fine-tuning (SFT), in which the models are trained on high-quality instruction-response pairs so they learn to perform complex tasks that require reasoning and an understanding of natural language nuances.
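Supervised fine-tuning operates on instruction-response pairs rendered into training strings. The sketch below shows one common way such pairs are formatted; the template is illustrative only and is not the exact format Stability AI used.

```python
# Minimal sketch of SFT data preparation: each example pairs an
# instruction with a target response, and the pair is rendered into a
# single training string. The "### Instruction / ### Response" template
# is a widely used convention, shown here for illustration.

def format_sft_example(instruction: str, response: str) -> str:
    """Render one (instruction, response) pair as a training string."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

examples = [
    ("Summarize: The cat sat on the mat.", "A cat sat on a mat."),
    ("Classify the sentiment: I love this!", "Positive"),
]

training_texts = [format_sft_example(i, r) for i, r in examples]
```

A fine-tuning job would then train the base model to produce the response portion of each string given the instruction portion.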
The training methodology for the FreeWilly models was inspired by a groundbreaking approach from Microsoft Research, described in the paper "Orca: Progressive Learning from Complex Explanation Traces of GPT-4". The Microsoft researchers used a large foundation model, GPT-4, to generate synthetic data for a smaller model called Orca. They used GPT-4's outputs and explanation traces to train Orca to imitate the reasoning process of the more advanced language model.
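The key idea in the Orca approach is recording not just the teacher's answer but also its explanation of how it got there. A minimal sketch of what one such training record might look like, where `query_teacher` is a hypothetical stand-in for a real call to the teacher model's API:

```python
# Illustrative sketch of Orca-style data collection: the teacher model is
# prompted to answer AND explain its reasoning, and both are stored so a
# student model can learn to imitate the reasoning process.

def query_teacher(instruction: str) -> dict:
    # Placeholder: a real pipeline would call the teacher model (e.g.
    # GPT-4) here and parse its reply into an answer and an explanation.
    return {
        "answer": "4",
        "explanation": "2 + 2 means combining two pairs, which gives 4.",
    }

def collect_trace(instruction: str) -> dict:
    """Build one training record containing the explanation trace."""
    out = query_teacher(instruction)
    return {
        "instruction": instruction,
        "explanation": out["explanation"],
        "answer": out["answer"],
    }

record = collect_trace("What is 2 + 2? Explain step by step.")
```

The student is then fine-tuned on records like this, so it learns to reproduce the reasoning steps rather than only the final answer.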
Training Methodology
Stability AI followed a similar approach for training the FreeWilly models, but instead of GPT-4, they used ChatGPT as the teacher model. They drew on datasets created by Enrico Shippole, a researcher known for producing high-quality instructions for language models. These datasets covered tasks such as text classification, generation, summarization, translation, analysis, and paraphrasing.
Stability AI generated 600,000 data points for the FreeWilly models by prompting ChatGPT with high-quality instructions and collecting its outputs and explanations. This method proved efficient and effective, requiring only about 10% of the data Microsoft used to train Orca. The data was carefully filtered to ensure fair comparisons and avoid benchmark contamination by removing examples that appear in evaluation benchmarks.
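Removing evaluation examples from the training data is a standard contamination check. One plausible way to implement such a filter is sketched below; the normalization rule is an assumption for illustration, not Stability AI's actual pipeline.

```python
# Sketch of a benchmark-contamination filter: normalize each text
# (lowercase, strip punctuation, collapse whitespace) and drop any
# training example whose normalized form matches a benchmark item.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    stripped = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", stripped).strip()

def filter_contamination(train: list, benchmark: list) -> list:
    """Keep only training texts that do not appear in the benchmark."""
    bench_set = {normalize(t) for t in benchmark}
    return [t for t in train if normalize(t) not in bench_set]

train = ["What is the capital of France?", "Translate 'hello' to German."]
bench = ["What is the Capital of France?"]
clean = filter_contamination(train, bench)  # drops the contaminated item
```

Real pipelines often go further, using n-gram overlap or fuzzy matching to catch near-duplicates, but the exact-match idea is the same.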
Evaluation Results
Stability AI evaluated the FreeWilly models using benchmarks that measure natural language understanding and reasoning, including the Open LLM Leaderboard, GPT4All, and AGIEval, which incorporates professional and academic exams such as the SAT, LSAT, GRE, and GMAT.
The evaluations showed that the FreeWilly models outperformed many state-of-the-art instruction-tuned models, such as Vicuna-13B, Bard, and text-davinci-003. They even reached parity with or surpassed ChatGPT, and came close to GPT-4 on certain tasks.
Examples:
- The Open LLM Leaderboard measures how well a model handles a range of language tasks. FreeWilly2 scored 103 points, surpassing ChatGPT's 100 points and Vicuna-13B's 76 points.
- The GPT4All test evaluates how well a model performs on 20 different language tasks without any task-specific training. FreeWilly2 scored 47 points, outperforming Vicuna-13B's 30 points and text-davinci-003's 42 points.
- The AGIEval test uses major academic and professional exams to assess a model's problem-solving skills. FreeWilly2 obtained 45 points, slightly lower than ChatGPT's 49 points but significantly better than Vicuna-13B's 20 points.
The FreeWilly models also demonstrated impressive performance on specific tasks. For example, on the HellaSwag task, where the model predicts the most plausible ending for a passage, FreeWilly2 achieved 86.4% accuracy, surpassing ChatGPT's 85.5%. Similarly, on the Winogrande task, where the model determines the referent of a pronoun in a sentence, FreeWilly2 achieved 79.8% accuracy compared to ChatGPT's 78.9%. On the SAT Math task, FreeWilly2 correctly solved 63.6% of the problems, slightly lower than ChatGPT's 65.5%.
Evaluation Methods
Stability AI employed two different tools to evaluate the FreeWilly models: EleutherAI's lm-evaluation-harness and Hugging Face's Open LLM Leaderboard. The lm-evaluation-harness lets researchers evaluate language models on a variety of natural language tasks using standardized metrics and protocols. Stability AI extended it with AGIEval, which adds more challenging tasks to assess the models' reasoning and problem-solving skills.
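At its core, a harness like lm-evaluation-harness runs a model over each task's examples and reports a standardized metric such as accuracy. A simplified sketch of that loop, where `model_answer` is a hypothetical placeholder for real model scoring:

```python
# Simplified sketch of what an evaluation harness does: run the model
# over a task's examples and report accuracy under a fixed protocol.

def model_answer(question: str, choices: list) -> str:
    # Placeholder: a real harness scores every choice with the model's
    # likelihoods and selects the highest-scoring one.
    return choices[0]

def evaluate(task: list) -> float:
    """Fraction of examples where the model picks the gold answer."""
    correct = sum(
        1 for ex in task
        if model_answer(ex["question"], ex["choices"]) == ex["gold"]
    )
    return correct / len(task)

task = [
    {"question": "2+2?", "choices": ["4", "5"], "gold": "4"},
    {"question": "Capital of France?", "choices": ["Paris", "Rome"], "gold": "Paris"},
]
accuracy = evaluate(task)
```

Keeping the prompting and scoring protocol identical across models is what makes leaderboard numbers comparable.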
Hugging Face's Open LLM Leaderboard is a platform where researchers can submit their language models and compare their performance against other models on standardized tasks, metrics, and protocols. Stability AI submitted the FreeWilly models to both tools and verified the consistency and reproducibility of their results. Hugging Face also independently reproduced Stability AI's results and confirmed their accuracy.
Future Potential
A Stability AI spokesperson and researcher expressed pride in the FreeWilly models, believing they will have a significant impact on the open-source LLM community. They emphasized the models' power, affordability, and accessibility, and highlighted their potential for advancing learning from explanations, whether those explanations come from humans or from AI.
The team is enthusiastic about the models' potential for natural language understanding and reasoning. The models could help tackle long-standing challenges in natural language processing, such as commonsense reasoning, and open the door to novel applications like interactive storytelling and educational content creation.
Challenges and Ethical Considerations
During the development of the FreeWilly models, the team faced several hurdles. Ensuring the quality and diversity of the training data while avoiding bias and duplication was paramount. The team also had to balance model size against operational speed, since larger models, while more powerful, are costlier and slower to run.
The team acknowledged that the models are not flawless. Their reliance on ChatGPT, which is less capable than GPT-4, can be a drawback: inaccuracies in ChatGPT's outputs can carry over into the FreeWilly models' training, especially for unfamiliar or ambiguous inputs.
However, the team at Stability AI prioritized safety and ethics throughout the development process. They adhered to responsible AI practices, including transparency and fairness, and conducted thorough tests to check the models for issues such as bias and misinformation. They welcomed input and collaboration from the wider AI community to further improve the models and advance AI for the benefit of all.
Conclusion
The launch of Stability AI's FreeWilly models marks a significant advancement in the field of natural language processing. Built on the LLaMA foundation models, these AI models deliver impressive performance across a variety of language tasks. With their large parameter counts and Orca-inspired supervised fine-tuning, the FreeWilly models outperform many existing models and show promising potential for natural language understanding and reasoning.
Stability AI's commitment to responsible AI practices and rigorous evaluation lends credibility to the FreeWilly models' reported performance: the models have been tested against multiple benchmarks and verified by independent tools and researchers.
Moving forward, Stability AI aims to continue refining the FreeWilly models and welcomes collaboration within the AI community to further enhance their capabilities. The FreeWilly models hold promise for advancing AI applications and contributing to the open-source LLM community.