LARGE LANGUAGE MODELs - Frequently Asked Questions
Preface
1
Basics of Large Language Models
1.1
Generative AI and Large Language Models (LLMs)
1.1.1
Generative AI
1.1.2
Early Beginnings
1.1.3
Advancements in Neural Networks
1.1.4
The Rise of Generative Models
1.1.5
Modern Era
1.1.6
Large Language Models (LLMs)
1.1.7
Key Concepts
1.1.8
Practical Use Cases
1.2
Advantages of Transformer Models over LSTM
1.3
Predictive/Discriminative AI vs. Generative AI
1.3.1
Predictive/Discriminative AI
1.4
Generative AI
1.4.1
Key Differences
1.5
Large Language Models (LLMs)
1.5.1
What is an LLM?
1.5.2
How are LLMs Trained?
1.6
What is a Token in the Language Model?
1.6.1
Key Points:
1.6.2
Example:
1.7
How to Estimate the Cost of Running SaaS-based and Open Source LLM Models
1.7.1
SaaS-based LLM Models
1.7.2
Open Source LLM Models
1.7.3
Factors Influencing Costs
1.8
Explain the Temperature Parameter and How to Set It
1.8.1
What is the Temperature Parameter?
1.8.2
How Does It Work?
1.8.3
How to Set the Temperature?
1.8.4
Example Settings:
1.9
Different Decoding Strategies for Picking Output Tokens
1.10
Different Ways to Define Stopping Criteria in Large Language Models
1.11
How to Use Stop Sequences in LLMs
1.11.1
What are Stop Sequences?
1.11.2
How to Implement Stop Sequences
1.11.3
Example Implementation
1.11.4
Benefits of Using Stop Sequences
1.12
Explain the Basic Structure of Prompt Engineering
1.12.1
What is Prompt Engineering?
1.12.2
Basic Structure of Prompt Engineering
1.12.3
Example Prompt
1.13
Explain In-Context Learning
1.13.1
What is In-Context Learning?
1.13.2
How Does In-Context Learning Work?
1.13.3
Example of In-Context Learning
1.13.4
Benefits of In-Context Learning
1.13.5
Applications of In-Context Learning
1.14
Types of Prompt Engineering
1.14.1
Zero-Shot Prompting
1.14.2
One-Shot Prompting
2
Retrieval Augmented Generation (RAG)
2.1
How to Increase Accuracy, Reliability, and Make Answers Verifiable in LLMs
2.2
How Does RAG Work?
2.2.1
What is Retrieval-Augmented Generation (RAG)?
2.2.2
How RAG Works
2.2.3
Benefits of RAG
2.2.4
Example Use Case
2.3
Benefits of Using the RAG System
2.4
When Should I Use Fine-Tuning Instead of RAG?
2.4.1
Fine-Tuning
2.4.2
Retrieval-Augmented Generation (RAG)
2.4.3
Choosing Between Fine-Tuning and RAG
2.5
Architecture Patterns for Customizing LLM with Proprietary Data
3
Chunking
3.1
What is Chunking, and Why Do We Chunk Our Data?
3.1.1
What is Chunking?
3.1.2
Why Do We Chunk Our Data?
3.1.3
Example of Chunking in Practice
3.2
What Factors Influence Chunk Size?
3.3
How to Find the Ideal Chunk Size
3.4
Different Types of Chunking Methods
3.5
What is Semantic Chunking?
3.5.1
Definition
3.5.2
How Semantic Chunking Works
3.5.3
Benefits of Semantic Chunking
3.5.4
Best Use Cases for Semantic Chunking
3.5.5
Tools for Semantic Chunking
3.5.6
Example of Semantic Chunking
3.6
Pros and Cons of Fixed-Size Chunking
3.6.1
Pros
3.6.2
Cons
3.7
Alternatives to Fixed-Size Chunking
3.7.1
Semantic Chunking
3.7.2
Sentence-Based Chunking
3.7.3
Paragraph-Based Chunking
3.7.4
Token-Based Chunking
3.7.5
Overlapping Chunking
3.7.6
Dynamic Chunking
4
Embedding Models
4.1
Vector Embeddings and Embedding Models
4.2
Embedding Models in LLM Applications
4.2.1
What is an Embedding?
4.2.2
How Embeddings Work
4.2.3
Types of Embedding Models
4.2.4
Importance in LLM Applications
4.2.5
Conclusion
4.3
Difference Between Embedding Short and Long Content
4.3.1
Embedding Short Content
4.3.2
Embedding Long Content
4.3.3
Summary
4.4
Best Practices for Embedding
4.4.1
Preprocessing and Cleaning
4.4.2
Choosing the Right Model and Parameters
4.4.3
Utilizing Pre-Trained Embeddings
4.4.4
Handling Biases and Ethical Considerations
4.4.5
Continuous Monitoring and Periodic Updating
4.4.6
Integration with Downstream Models
4.5
Common Pitfalls in Using Embeddings
4.5.1
Insufficient Data Understanding
4.5.2
Overfitting
4.5.3
Poor Preprocessing
4.5.4
Misapplication of Embeddings
4.5.5
Neglecting Updates
4.5.6
Ignoring Context
4.6
Evaluating the Effectiveness of Embeddings
4.6.1
Intrinsic Evaluation
4.6.2
Extrinsic Evaluation
4.6.3
Visualization
4.6.4
Performance Metrics
4.7
Improving the Accuracy of Embedding-Based Search Models
4.7.1
Data Quality and Preprocessing
4.7.2
Model Selection and Fine-Tuning
4.7.3
Hyperparameter Optimization
4.7.4
Regular Updates and Retraining
4.7.5
Advanced Techniques
4.7.6
Evaluation and Feedback
4.7.7
Conclusion
4.8
Hyperparameter Optimization Methods
4.8.1
Grid Search
4.8.2
Random Search
4.8.3
Bayesian Optimization
4.8.4
Hyperband
4.8.5
Genetic Algorithms
4.9
Bayesian Optimization
4.9.1
Key Components
4.9.2
How It Works
4.9.3
Advantages
4.9.4
Applications
4.9.5
Implementing Bayesian Optimization in Python
4.10
Steps to Improve a Sentence Transformer Model
5
Internal Working of Vector Databases
5.1
What is a Vector Database?
5.1.1
Key Features
5.1.2
Use Cases
5.1.3
Popular Vector Databases
5.2
Differences Between Vector Databases and Traditional Databases
5.3
How a Vector Database Works
5.3.1
Key Components
5.3.2
Example Workflow
5.4
Differences Between Vector Index, Vector Database, and Vector Plugins
5.4.1
Vector Index
5.4.2
Vector Database
5.4.3
Vector Plugins
5.4.4
Summary
5.5
Choosing the Best Search Strategy for Finding Similar Reviews
5.5.1
Why Choose Exhaustive Search?
5.5.2
How to Implement Exhaustive Search
5.6
Steps to Find Similar Customer Reviews
5.6.1
Example Code in Python
5.7
Vector Search Strategies: Clustering and Locality-Sensitive Hashing
5.7.1
Clustering
5.7.2
Locality-Sensitive Hashing (LSH)
5.8
How Clustering Reduces Search Space
5.8.1
Reducing Search Space with Clustering
5.8.2
When Clustering Fails
5.8.3
Mitigating Clustering Failures
5.9
How DBSCAN Works
5.9.1
Key Concepts
5.9.2
Parameters
5.9.3
Algorithm Steps
5.9.4
Example
5.10
Choosing ε and MinPts Values for DBSCAN
5.10.1
Choosing MinPts
5.10.2
Example
5.10.3
Effects of ε in DBSCAN
5.11
Random Projection Index
5.11.1
Introduction
5.11.2
Key Concepts
5.11.3
How It Works
5.11.4
Advantages
5.11.5
Applications
5.11.6
Example
5.12
Johnson-Lindenstrauss Lemma
5.12.1
Statement of the Lemma
5.12.2
Key Concepts
5.12.3
Applications
5.12.4
Example
5.13
Product Quantization (PQ) Indexing Method
5.13.1
Introduction
5.13.2
How Product Quantization Works
5.13.3
Advantages
5.13.4
Example
5.14
Comparison of Different Vector Indexing Methods
5.14.1
1. Brute-Force Search
5.14.2
2. KD-Tree
5.14.3
3. Locality-Sensitive Hashing (LSH)
5.14.4
4. Product Quantization (PQ)
5.14.5
5. HNSW (Hierarchical Navigable Small World)
5.14.6
Scenario-Based Recommendation
5.15
Deciding Ideal Search Similarity Metrics for the Use Case
6
Filtering in Vector Databases
6.1
Types of Filtering
6.2
Challenges of Filtering
6.3
Types and Challenges Associated with Filtering in Vector Databases
6.3.1
Types of Filtering
6.3.2
Challenges of Filtering
6.4
How to Decide the Best Vector Database for Your Needs
7
Advanced Search Algorithms
7.1
Architecture Patterns for Information Retrieval & Semantic Search
7.1.1
Traditional Information Retrieval (IR) Systems
7.1.2
Semantic Search Systems
7.1.3
Hybrid Search Systems
7.1.4
Retrieval-Augmented Generation (RAG)
7.1.5
Neural IR Systems
7.1.6
Challenges
7.2
Why It’s Important to Have Very Good Search
7.3
Examples of Good Search Systems
7.3.1
4. Elasticsearch
7.4
Comparison of Google and Bing’s Algorithms
7.4.1
Google Search Algorithm
7.4.2
Bing Search Algorithm
7.4.3
Key Differences
7.5
Improving the Accuracy of a RAG-Based System
7.6
Keyword-Based Retrieval Method
7.6.1
Indexing
7.6.2
Query Processing
7.6.3
Retrieval
7.6.4
Ranking
7.6.5
Advantages
7.6.6
Disadvantages
7.7
How to Fine-Tune Re-Ranking Models
7.7.1
Data Preparation
7.7.2
Model Selection
7.7.3
Fine-Tuning Process
7.7.4
Evaluation and Validation
7.7.5
Deployment and Monitoring
7.8
Common Loss Functions Used in Machine Learning
7.8.1
Mean Squared Error (MSE)
7.8.2
Cross-Entropy Loss
7.8.3
Hinge Loss
7.8.4
Kullback-Leibler Divergence (KL Divergence)
7.8.5
Mean Absolute Error (MAE)
7.8.6
Huber Loss
7.8.7
Triplet Loss
7.9
Most Common Metric Used in Information Retrieval and When It Fails
7.9.1
Precision@k
7.9.2
Definition
7.9.3
When It Fails
7.9.4
Alternatives and Complements
7.10
Evaluation Metric for a Quora-like Question-Answering System
7.10.1
Why F1 Score?
7.10.2
When F1 Score Fails
7.10.3
Complementary Metrics
7.11
Evaluation Metrics for a Recommendation System
7.12
Mean Average Precision (MAP) in Detail
7.12.1
Key Concepts
7.12.2
Steps to Calculate MAP
7.12.3
Example
7.12.4
When MAP Fails
7.13
How Hybrid Search Works
7.13.1
Key Components
7.13.2
How Hybrid Search Operates
7.13.3
Benefits of Hybrid Search
7.14
Merging and Homogenizing Search Results from Multiple Methods
7.15
Handling Multi-Hop/Multifaceted Queries
7.16
Techniques to Improve Retrieval
8
Language Models Internal Working
8.1
Self-attention
8.1.1
Key Concepts
8.1.2
Mathematical Formulation
8.1.3
Benefits of Self-Attention
8.1.4
Applications
8.2
Disadvantages of the Self-Attention Mechanism and How to Overcome Them
8.2.1
Disadvantages
8.2.2
Overcoming the Disadvantages
8.3
Positional Encoding
8.3.1
Key Concepts
8.3.2
Mathematical Formulation
8.3.3
Benefits
8.3.4
Applications
8.4
Transformer Architecture
8.4.1
Key Components
8.4.2
Detailed Workflow
8.4.3
Advantages
8.4.4
Applications
8.5
Advantages of Using Transformers Instead of LSTMs
8.6
Difference Between Local Attention and Global Attention
8.6.1
Local Attention
8.6.2
Global Attention
8.6.3
Comparative Analysis
8.7
What Makes Transformers Heavy on Computation and Memory, and How Can We Address This?
8.7.1
Reasons for High Computation and Memory Usage
8.7.2
Solutions to Address High Computation and Memory Usage
8.8
Increasing the Context Length of an LLM
8.9
Optimizing Transformer Architecture for a Large Vocabulary
8.9.1
1. Embedding Layer Optimization
8.9.2
2. Efficient Attention Mechanisms
8.9.3
3. Model Pruning and Quantization
8.9.4
4. Parameter Sharing
8.9.5
5. Memory-Efficient Training Techniques
8.9.6
6. Efficient Architectures
8.9.7
7. Regularization Techniques
8.10
Balancing Vocabulary Size in NLP
8.11
Different Types of LLM Architectures and Their Best Use Cases
8.11.1
Choosing the Right Architecture
8.12
Limitations of Large Language Models (LLMs)
8.13
Real-World Applications of LLMs
9
Supervised Fine-Tuning of LLM
9.1
What is Fine-Tuning, and Why is it Needed?
9.1.1
Why is Fine-Tuning Needed?
9.2
Fine-tuning a Large Language Model (LLM) is particularly useful in several scenarios:
9.3
Deciding whether to fine-tune a Large Language Model (LLM) involves evaluating several key factors:
9.4
Improving a model to answer only when there is sufficient context involves several strategies:
9.4.1
Steps to Create Fine-Tuning Datasets for Q&A
9.4.2
Tools and Resources
9.4.3
Key Hyperparameters
9.4.4
Best Practices for Setting Hyperparameters
9.5
Estimating infrastructure requirements for fine-tuning a Large Language Model (LLM)
9.5.1
Key Factors to Consider
9.5.2
Practical Steps to Estimate Requirements
9.6
Fine-tuning a Large Language Model (LLM) on consumer hardware
9.6.1
Steps to Fine-Tune LLM on Consumer Hardware
9.6.2
Example Code Snippet
9.6.3
Tools and Resources
9.7
Parameter-Efficient Fine-Tuning
9.8
Catastrophic forgetting in LLMs
9.8.1
Key Points about Catastrophic Forgetting in LLMs
9.9
Different re-parameterized methods for fine-tuning
10
Preference Alignment
10.1
Reinforcement Learning from Human Feedback
10.1.1
What is RLHF?
10.1.2
How is RLHF Used?
10.1.3
Implementing Reinforcement Learning from Human Feedback in Natural Language Processing Models
10.2
The choice of Preference Alignment methods rather than Supervised Fine-Tuning (SFT)
10.2.1
When to Choose Preference Alignment
10.2.2
When to Choose Supervised Fine-Tuning (SFT)
11
Evaluation of LLM System
11.1
Evaluating the best Large Language Model (LLM) for your use case
11.1.1
Key Evaluation Criteria
11.1.2
Practical Steps for Evaluation
11.2
Evaluating Retrieval-Augmented Generation (RAG) systems
11.2.1
Key Evaluation Criteria
11.2.2
Evaluation Methods
11.2.3
Practical Steps for Evaluation
11.2.4
Common challenges in RAG evaluation
11.2.5
What are Context-Aware Models?
11.2.6
Key Components of Context-Aware Models
11.2.7
Applications of Context-Aware Models
11.3
Chain of Verification
11.3.1
Key Steps in the Chain of Verification
11.3.2
Benefits of Chain of Verification
11.3.3
Example of Chain of Verification
11.4
LLMs vs. Traditional NLP Models
11.4.1
Model Architecture and Training
11.4.2
Data Requirements
11.4.3
Flexibility and Adaptability
11.4.4
Performance and Use Cases
11.4.5
Generative Capabilities
11.5
Evaluating language models in Natural Language Processing (NLP)
12
Hallucination Control Techniques
12.1
Hallucination in LLMs
12.2
What are different forms of hallucinations?
12.3
Mitigating hallucinations in large language models (LLMs)
13
Deployment of LLM
13.1
Why does quantization not decrease the accuracy of an LLM?
13.2
What techniques can optimize LLM inference for higher throughput?
13.3
Techniques to Accelerate Model Response Time Without Attention Approximation
14
Agent-Based System
14.1
Explain the basic concepts of an agent and the types of strategies available to implement agents
14.1.1
Basic Concepts of an Agent
14.1.2
Why We Need Agents
14.1.3
Types of Strategies to Implement Agents
14.2
ReAct Prompting
14.2.1
How ReAct Prompting Works
14.2.2
Code Example
14.2.3
Advantages of ReAct Prompting
14.3
Plan and Execute Prompting Strategy
14.3.1
How It Works
14.3.2
Advantages
14.3.3
Example
14.4
OpenAI Functions Strategy
14.4.1
How It Works
14.4.2
Code Example
14.4.3
Advantages
14.4.4
Real-World Applications
14.5
OpenAI Functions vs. LangChain Agents
14.5.1
OpenAI Functions
14.5.2
LangChain Agents
14.5.3
Comparison
15
Miscellaneous
15.1
Prompt hacking
15.1.1
Why Should We Care About Prompt Hacking?
15.1.2
Different Types of Prompt Hacking
15.1.3
Different Defense Tactics Against Prompt Hacking
15.2
How to optimize the cost of the overall LLM system?
15.3
Caching in Large Language Model (LLM) systems
15.3.1
Key Concepts of Caching in LLM Systems
15.3.2
Benefits of Caching in LLM Systems
15.3.3
Practical Applications
15.4
Mixture of Experts (MoE) models
15.4.1
Key Components of MoE Models
15.4.2
Benefits of MoE Models
15.4.3
Applications
15.5
How to build a production-grade RAG system
15.5.1
Retriever Component
15.5.2
Generator Component
15.5.3
API Endpoint
15.5.4
Caching Layer
15.5.5
Monitoring and Logging
15.5.6
Security and Compliance
15.5.7
Scalability
15.6
FP8 variable
15.6.1
Key Aspects of FP8
15.6.2
Advantages of FP8
15.7
FP8 and FP16, two popular floating-point formats used in AI and high-performance computing
15.7.1
FP8 (8-bit Floating Point)
15.7.2
FP16 (16-bit Floating Point)
15.7.3
Practical Comparison
15.7.4
Example Use Case
15.8
How to train an LLM with low-precision training without compromising accuracy?
15.9
Calculating the size of the Key-Value (KV) cache
15.9.1
Key Factors
15.9.2
Calculation Formula
15.9.3
Example Calculation
15.10
Dimensions of each layer in a multi-headed transformer attention block
15.11
Ensuring that the attention layer focuses on the right part of the input
References
License: CC BY-SA