MaLA-LM focuses on adapting large language models to support hundreds of languages, including many underrepresented ones. Our models are multilingual, scalable, and optimized for diverse linguistic tasks. Explore our models on Hugging Face.

Check out our multilingual LLM collections, featuring models trained to handle 500+ languages, ideal for global, multilingual applications.

EMMA-500

EMMA-500 is a state-of-the-art multilingual language model designed to improve language representation, especially in low-resource languages, through continual pre-training on the Llama 2 7B architecture. Leveraging the MaLA Corpus, which spans over 500 languages and 74 billion tokens, EMMA-500 excels in multilingual tasks like commonsense reasoning, machine translation, open-ended generation, and text classification.

EMMA-500 outperforms other Llama 2-based models in diverse multilingual settings while maintaining robustness in specialized tasks.
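For a quick start, the sketch below shows how EMMA-500 could be loaded with the Hugging Face transformers library. The repository id is an assumption; check the MaLA-LM organization page for the exact model name.

```python
# Minimal sketch: load EMMA-500 and run plain causal-LM generation.
# The repo id below is an assumption; see the MaLA-LM Hugging Face page for the exact name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MaLA-LM/emma-500-llama2-7b"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # requires the accelerate package; omit to load on CPU
)

# EMMA-500 is a base (non-chat) model, so prompt it as a plain language model.
prompt = "Translate to Swahili: Good morning, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```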

MaLA Corpus

The MaLA Corpus (Massive Language Adaptation) is a comprehensive, multilingual dataset designed to support the continual pre-training of large language models. It covers 939 languages and consists of over 74 billion tokens, making it one of the largest datasets of its kind. With a focus on improving the representation of low-resource languages, the MaLA Corpus is a critical resource for advancing multilingual models, particularly those aimed at serving underrepresented languages.

  • Language Coverage: Includes data for 939 languages, with 546 languages having over 100,000 tokens.
  • Pre-processing: The corpus is cleaned and deduplicated to ensure high-quality training data.
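Given its size, the corpus is easiest to explore in streaming mode. The following is a minimal sketch using the datasets library; the dataset repository id is an assumption, so check the MaLA-LM organization on Hugging Face for the published dataset names.

```python
# Minimal sketch: stream the MaLA Corpus without downloading it in full.
# The dataset repo id below is an assumption; check the MaLA-LM page for the exact id.
from datasets import load_dataset

corpus = load_dataset(
    "MaLA-LM/mala-monolingual-split",  # assumed dataset repository id
    split="train",
    streaming=True,                    # iterate lazily over the 74B-token corpus
)

# Peek at a few documents without materializing the whole dataset.
for i, example in enumerate(corpus):
    print(example)
    if i == 2:
        break
```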

MaLA-500

MaLA-500 is a novel large language model designed to cover an extensive range of 534 languages. It builds on Llama 2 7B and combines continued pretraining, vocabulary extension to 260,164 tokens, and LoRA low-rank adaptation.

  • Continued Pretraining: Enhances the model’s ability to adapt to a wide range of languages.
  • LoRA Low-Rank Adaptation: Lightweight low-rank adapter modules keep the continued pretraining parameter-efficient.
  • Vocabulary Extension: MaLA-500 boasts an extended vocabulary size of 260,164.
  • Multilingual Proficiency: Trained on Glot500-c, covering 534 languages.

With vocabulary extension and LoRA modules, MaLA-500 introduces an additional 2.1B trainable parameters, bringing the total parameter count to 10.7B.
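Because MaLA-500 is built as a LoRA adapter with an extended vocabulary on top of Llama 2 7B, loading it typically means resizing the base model's embeddings and attaching the adapter with PEFT. The sketch below assumes the adapter is published as a PEFT checkpoint and that the repository ids are as shown; verify both on the MaLA-LM page.

```python
# Minimal sketch: load MaLA-500 as a LoRA adapter on top of Llama 2 7B.
# Both repo ids are assumptions; the base Llama 2 checkpoint is gated and
# requires accepting Meta's license on Hugging Face.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Llama-2-7b-hf"     # assumed base model id
adapter_id = "MaLA-LM/mala-500-10b-v2"   # assumed MaLA-500 adapter id

# The tokenizer carries the extended 260,164-entry vocabulary.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
# Grow the embedding matrix and LM head to match the extended vocabulary.
base.resize_token_embeddings(len(tokenizer))

# Attach the LoRA adapter weights on top of the resized base model.
model = PeftModel.from_pretrained(base, adapter_id)

inputs = tokenizer("Terve, maailma!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```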

Connect with Us on Discord

For those interested in joining the conversation or learning more about the project, you can connect with the community on Discord: MaLA-LM Discord.