Introduction Large language models (LLMs) are increasingly becoming powerful tools for understanding and generating human language. These models have achieved state-of-the-art results on different natural language processing tasks, including text summarization, machine translation, question answering, and dialogue generation. LLMs have even shown promise in more specialized domains, like healthcare, finance, and law.Google has been at the forefront of LLM research and development, releasing a series of open models that have pushed the boundaries of what is possible with this technology. These models include BERT, T5, and T5X, which have been widely adopted by researchers and practitioners alike. In this Guide, we introduce Gemma, a new family of open LLMs developed by Google. Learning Objectives Understand Gemma’s architecture and key features.Explore Gemma’s training process and techniques.Evaluate Gemma’s performance across NLP benchmarks.Learn to use Gemma for inference tasks.Recognize the importance of responsible deployment for Gemma.This article was published as a part of the Data Science Blogathon.What is Gemma? Gemma is a family of open language models based on Google’s Gemini models, trained on up to 6T tokens of text. These are considered to be the lighter versions of Gemini models. The Gemma family consists of two sizes: a 7 billion parameter model for efficient deployment on GPU and TPU, and a 2 billion parameter model for CPU and on-device applications. Gemma exhibits strong generalist capabilities in text domains and state-of-the-art understanding and reasoning skills at scale. It achieves better performance compared to other open models of similar or larger scales across different domains, including question answering, commonsense reasoning, mathematics and science, and coding. For both the models, the pre-trained, finetune checkpoints and open-source codebase for inference and serving are released by the Google Team.Gemma builds upon recent advancements in sequence models, transformers, deep learning, and large-scale training in a distributed manner. It continues Google’s history of releasing open models and ecosystems, following Word2Vec, Transformer, BERT, T5, and T5X. The responsible release of Gemma aims to improve the safety of frontier models, provide equitable access to this technology, give the path to rigorous evaluation and analysis of current techniques, and foster the development of future innovations. However, thorough safety testing specific to each Use Case is crucial before deploying or using Gemma. Gemma – Model Architecture Gemma follows the architecture of a decoder-only transformer that was introduced way back in 2017. Both the Gamma 2B and the 7B models have a vocabulary size of 256k. Both models even have a context length of 8192 tokens. The Gemma even includes the recent advancements made in the transformers’ architecture including: Multi-Query Attention: The 7B model uses multi-head attention, while the 2B model implements multi-query attention (with num_kv_heads=1). This choice is based on performance improvements that were shown at each scale through ablation studies. RoPE Embeddings: Instead of absolute positional embeddings, both models employ rotary positional embeddings in each layer. Additionally, embedding sharing across inputs and outputs minimizes model size. GeGLU Activations: The regular ReLU activation function is replaced by the GeGLU activation function, giving good performance. Normalizer Location: Gemma deviates from the goto practice by normalizing both the input and output of each transformer sub-layer, using RMSNorm for the normalization method. How was Gemma Trained?Gemma 2B and 7B models were trained on 2T and 6T tokens, respectively, of primarily-English data sourced from Web Docs, mathematics, and code. Unlike Gemini models, which include multimodal elements and are optimized for multilingual tasks, Gemma models focus is on processing English text. The training data underwent a careful filtering process to remove Unwanted or Unsafe Content, including personal information and sensitive data. This filtering involved both heuristic methods and model-based classifiers to ensure the quality and safety of the dataset.Gemma 2B and 7B models underwent supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to further refine their performance. The supervised fine-tuning involved a mix of text-only, English-only synthetic, and human-generated prompt-response pairs. Data mixtures for fine-tuning were carefully selected based on LM-based side-by-side evaluations, with different Prompt sets designed to highlight specific capabilities like the instruction following, factuality, creativity, and safety. Even, synthetic data underwent several stages of filtering to remove examples containing personal information or toxic outputs, following the approach established by Gemini for improving model performance without compromising safety. Finally, reinforcement learning from human feedback involved collecting pairs of preferences from human raters and training a reward function under the Bradley-Terry model. This function was then optimized using a type of REINFORCE to further refine the models’ performance and mitigate potential issues like reward hacking.Benchmarks and Performance Metrics Looking at the results, Gemma outperforms Mistral on five out of six benchmarks, with the sole exception being HellaSwag, where they get similar accuracy. This dominance is clearly evident in tasks like ARC-c and TruthfulQA, where Gemma surpasses Mistral by nearly 2% and 2.5% in accuracy and F1 score, respectively. Even on MMLU, where Perplexity scores are lower is better, Gemma achieves a prominently lower Perplexity, indicating a better grip of language patterns. These results solidify Gemma’s position in being a powerful language model, capable of handling complex NLP tasks with good accuracy and efficiency.Getting Started with Gemma In this section, we will get started with Gemma. We will be working with Google Colab because it comes with a free GPU. Before we get started, we need to accept Google’s Terms and Conditions to download the model.Step 1: Opening GemmaClick on this link to go to Gemma on HuggingFace. You will be presented with something like the below:Step 2: Click on Acknowledge LicenseIf you click on Acknowledge License , then you will see a page as below.Click on Authorize. Done we are now ready to download the model. Before, let’s generate a new HuggingFace Token. For this, you can go to the HuggingFace Settings and Generate a new Token, this token will be useful because we need it to authorize inside Google Colab to download the Google Gemma Large Language Model.Step 3: Installing LibrariesTo get started, we first need to install the following libraries.!pip install -U accelerate bitsandbytes transformers huggingface_hub accelerate: Allows distributed training and mixed-precision training for faster and more efficient model training. The accelerate library even helps for faster inference of the Large Language Models.bitsandbytes: Allows quantization of model weights to 4-bit or 8-bit precision, reducing memory footprint and computation requirements. Because we are dealing with a 7Billion Parameter model, which requires around 30-40GB of GPU VRAM, we need to quantize it to fit in the Colab GPU.transformers: Provide pre-trained language models, tokenizers, and training tools for natural language processing tasks. We work with this library to download the Gemma model and start inferring it.huggingface_hub: Facilitates access to the Hugging Face Hub, a platform for sharing and seeing language models and datasets. We need this library to login to huggingface so that we can verify that we are authorized to download the Google Gemma Large Language ModelThe -U option after the install indicates that we are fetching the latest updated versions of all the libraries.Step 4: Typing Important CommandNow, type the below command!huggingface-cli loginThe above command will ask you to provide the HuggingFace Token, which we can get from the HuggingFace website. Give this token and press the enter button and you will receive a Login Successful message. Now let’s move on to coding # Import necessary classes for model loading and quantization from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig # Configure model quantization to 4-bit for memory and computation efficiency quantization_config = BitsAndBytesConfig(load_in_4bit=True) # Load the tokenizer for the Gemma 7B Italian model tokenizer = AutoTokenizer.from_pretrained(“google/gemma-7b-it”) # Load the Gemma 7B Italian model itself, with 4-bit quantization model = AutoModelForCausalLM.from_pretrained(“google/gemma-7b-it”, quantization_config=quantization_config) AutoTokenizer: This class dynamically loads the pre-trained tokenizer associated with the given model, ensuring compatibility and avoiding manual config. AutoModelForCausalLM: Similar to the tokenizer, this class automatically loads the pre-trained Causal Language Model architecture based on the provided model identifier. quantization_config = BitsAndBytesConfig(load_in_4bit=True): This line creates a config object for quantization, telling that the model’s weights should be pushed in 4-bit precision instead of the original 32-bit. This to a great extent reduces memory consumption and potentially speeds up computations, making the model more efficient for resource-constrained environments.tokenizer = AutoTokenizer.from_pretrained(“google/gemma-7b-it”): This line loads the pre-trained tokenizer specifically designed for the “google/gemma-7b-it” model. This tokenizer knows how to break down text into separate Tokens that the model can understand and process.model = AutoModelForCausalLM.from_pretrained(“google/gemma-7b-it”, quantization_config=quantization_config): This line loads the actual “google/gemma-7b-it” model, but with the crucial addition of the quantization_config object. This ensures that the model weights are created in the 4-bit…
Sign Up for Our Newsletters
Get notified of the best deals on our WordPress themes.