Chunking strategies — My notes
We use chunking daily in our minds. Chunking is a cognitive process where the brain groups information into meaningful units or “chunks.” This helps us manage and remember large amounts of information more efficiently.
Examples in Daily Life
Phone Numbers: Instead of remembering a long string of numbers like 1234567890, we break it into chunks like 123–456–7890.
Words and Sentences: We group letters into words and words into sentences to understand and remember them better.
Shopping Lists: We categorize items (fruits, vegetables, dairy) instead of remembering each item separately.
Benefits of Chunking
Grouping related information helps with easier recall, better understanding by simplifying complex information, and more efficient learning by connecting new information with existing knowledge. The brain uses pattern recognition to identify themes, association to link new information with existing knowledge, and hierarchical organization to structure information from broad concepts to specific details.
Machine Chunking
Just like the human mind, machines can process information in similar ways. They group related data to make it easier to recall, understand, and learn. Machines identify patterns within the data, associate new information with existing data, and organize it hierarchically from broad categories to specific details, ensuring efficient processing and retrieval.
How to do this with machines?
Chunking or Text Splitting: In LLM-related applications such as RAG (retrieval-augmented generation), chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance of the content we get back from a vector index/database once the content has been embedded with an embedding model.
Example: In semantic search, we index a collection of documents, each containing valuable information on a specific topic. By using an effective chunking strategy, we can ensure our search results accurately reflect the user’s query. If our chunks are too small or too large, it may result in imprecise search results or missed opportunities to find relevant content.
Chunking strategies
1. Fixed-size chunking:
Fixed-size chunking divides text into chunks of a fixed number of tokens (or characters), with optional overlap between neighbouring chunks to maintain context. It’s simple, efficient, and doesn’t require NLP libraries, making it suitable for most cases.
Human analogy: Imagine you’re organizing a stack of index cards for studying. You decide each stack should have exactly 50 cards. To make sure you don’t lose context, you ensure the last 5 cards of one stack are also the first 5 cards of the next stack.
Code: This code uses LangChain’s CharacterTextSplitter to split a text on "." with a target chunk size of 50 characters and an overlap of 20, and prints the chunks. Because every sentence here is longer than 50 characters, each sentence ends up as its own chunk and the overlap has no visible effect.
from langchain.text_splitter import CharacterTextSplitter

def fixed_size_chunking():
    text = "LangChain is a powerful tool for processing and understanding large amounts of text. It uses various chunking strategies to break down documents into manageable pieces, enabling efficient analysis and retrieval of information. By employing different chunking methods, LangChain can handle diverse types of content and tasks."
    # Split on "." and merge the pieces into chunks of up to 50 characters
    text_splitter = CharacterTextSplitter(
        separator=".",
        chunk_size=50,
        chunk_overlap=20,
    )
    docs = text_splitter.create_documents([text])
    # Number the chunks so the printout matches the output below
    for i, doc in enumerate(docs, start=1):
        print(f"Chunk-{i}: {doc.page_content}")
Output:
Chunk-1: LangChain is a powerful tool for processing and understanding large amounts of text
Chunk-2: It uses various chunking strategies to break down documents into manageable pieces, enabling efficient analysis and retrieval of information
Chunk-3: By employing different chunking methods, LangChain can handle diverse types of content and tasks
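In the output above every chunk is a whole sentence, so the overlap is hard to see. As a rough illustration of the index-card idea, here is a minimal, library-free sketch. The function name fixed_size_chunks and its parameters are my own (they are not part of LangChain), and it counts characters rather than tokens.

```python
def fixed_size_chunks(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in fixed_size_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3):
        print(chunk)
```

With a chunk size of 10 and an overlap of 3, each chunk starts with the last 3 characters of the previous one, just like the shared cards between two stacks.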
2. Sentence Splitting:
Many models are optimized for embedding sentence-level content, making sentence chunking a natural choice.
spaCy: spaCy is a Python library for NLP tasks. It offers a sophisticated sentence segmentation feature that efficiently divides text into separate sentences, preserving context in the resulting chunks.
Human analogy: Imagine you have a long article and you want to break it down into smaller, understandable parts. You read through the article and, at the end of each sentence, you make a small cut. This way, each sentence stands alone as a complete thought, making it easier to process and analyze the information. Similarly, spaCy splits text into individual sentences, preserving the context and meaning of each one.
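Code: This code uses SpacyTextSplitter to split a text into sentences and prints the result. Note that SpacyTextSplitter merges consecutive sentences up to its chunk size, so for a short text like this one the sentences may come back in a single chunk, separated by blank lines.
from langchain.text_splitter import SpacyTextSplitter

def sentence_splitting():
    text = "LangChain is a powerful tool for processing and understanding large amounts of text. It uses various chunking strategies to break down documents into manageable pieces, enabling efficient analysis and retrieval of information. By employing different chunking methods, LangChain can handle diverse types of content and tasks."
    # Requires spaCy and its en_core_web_sm model, the splitter's default pipeline
    text_splitter = SpacyTextSplitter()
    # split_text returns plain strings rather than Document objects
    chunks = text_splitter.split_text(text)
    for chunk in chunks:
        print(chunk)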
Output:
LangChain is a powerful tool for processing and understanding large amounts of text.
It uses various chunking strategies to break down documents into manageable pieces, enabling efficient analysis and retrieval of information.
By employing different chunking methods, LangChain can handle diverse types of content and tasks.
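To get a feel for what sentence splitting does without installing spaCy, here is a deliberately naive sketch that uses only a regular expression. This is not how spaCy works internally (spaCy uses a trained pipeline); the function name split_sentences and the regex rule are my own assumptions for illustration.

```python
import re

# Split after ".", "!" or "?" when it is followed by whitespace.
# Naive rule: abbreviations such as "e.g." or "Dr." will be split incorrectly.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Return the sentences of `text`, keeping their closing punctuation."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


if __name__ == "__main__":
    sample = "LangChain is a powerful tool. It uses chunking strategies! Does it work? Yes."
    for sentence in split_sentences(sample):
        print(sentence)
```

The weak spot is exactly the cases a proper NLP library handles well (abbreviations, decimals, quotes), which is the main reason to reach for spaCy on real text.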
3. Recursive Chunking:
Recursive chunking divides input text into smaller chunks hierarchically and iteratively using separators. If the initial split doesn’t yield the desired chunk size or structure, the method recursively splits the resulting chunks with different separators or criteria until the target size or structure is achieved. Although the chunks won’t be exactly the same size, they will generally be similar.
Human analogy: Imagine you have a long story and you want to divide it into smaller, manageable sections for easier reading. First, you read through the story and make cuts at major points, like the end of each chapter. This is your initial attempt to create chunks. If some chapters are still too long, you go back and divide those chapters into smaller sections, like scenes or paragraphs, until each section is a manageable size.
Code: In the code, the RecursiveCharacterTextSplitter works similarly. It splits the text on a list of separators (by default paragraph breaks, then line breaks, then spaces, then individual characters) and merges the pieces into chunks of at most 256 characters. A piece that is still too large is split again with the next separator, and consecutive chunks share up to 20 characters of overlap so context isn’t lost.
from langchain.text_splitter import RecursiveCharacterTextSplitter

def recursive_chunking():
    text = "LangChain is a powerful tool for processing and understanding large amounts of text. It uses various chunking strategies to break down documents into manageable pieces, enabling efficient analysis and retrieval of information. By employing different chunking methods, LangChain can handle diverse types of content and tasks."
    # Chunks of at most 256 characters, sharing up to 20 characters with the previous chunk
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=256,
        chunk_overlap=20,
    )
    docs = text_splitter.create_documents([text])
    for doc in docs:
        print("chunk:" + doc.page_content)
Output:
chunk:LangChain is a powerful tool for processing and understanding large amounts of text. It uses various chunking strategies to break down documents into manageable pieces, enabling efficient analysis and retrieval of information. By employing different
chunk:employing different chunking methods, LangChain can handle diverse types of content and tasks.
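The chapter-then-scene idea can be sketched in plain Python. This is a simplified illustration, not LangChain’s implementation: the function name recursive_split and the separator list are my own assumptions, there is no overlap, and the separator is dropped wherever a chunk boundary falls.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first; use finer ones for oversized pieces."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits: keep growing this chunk
            continue
        if current:
            chunks.append(current)  # flush what has been collected so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: split it with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in recursive_split("aaa bbb ccc. ddd eee fff. ggg", chunk_size=7):
        print(repr(chunk))
```

With a chunk size of 12 the sample text splits cleanly at sentence boundaries; with a chunk size of 7 the sentences are too long, so the function falls back to splitting on spaces inside each one.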
4. Specialized chunking (e.g., Markdown):
Markdown is a lightweight markup language commonly used for formatting text. By recognizing the Markdown syntax (e.g., headings, lists, and code blocks), you can intelligently divide the content based on its structure and hierarchy, resulting in more semantically meaningful chunks.
Code: This code splits a Markdown text into chunks of at most 100 characters, without overlap, preferring to break at Markdown structure such as headings, and prints each chunk.
from langchain.text_splitter import MarkdownTextSplitter

def markdown_special_chunking():
    # The Markdown lines stay flush left so no indentation ends up inside the string
    markdown_text = """
# Introduction
Markdown is a lightweight markup language for creating formatted text using a plain-text editor.
## Features
- Easy to read and write
- Supports headings, lists, and code blocks
### Headings
Headings are created using the `#` symbol.
### Lists
Lists can be ordered or unordered.
## Usage
Markdown is commonly used in readme files, for writing messages in online discussion forums, and to create rich text using a plain text editor.
"""
    # Chunks of at most 100 characters with no overlap
    markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
    docs = markdown_splitter.create_documents([markdown_text])
    for doc in docs:
        print("chunk:\n" + doc.page_content + "\n")
Output:
chunk:
# Introduction
chunk:
Markdown is a lightweight markup language for creating formatted text using a plain-text
chunk:
editor.
chunk:
## Features
- Easy to read and write
chunk:
- Supports headings, lists, and code blocks
chunk:
### Headings
Headings are created using the `#` symbol.
chunk:
### Lists
Lists can be ordered or unordered.
chunk:
## Usage
chunk:
Markdown is commonly used in readme files, for writing messages in online discussion
chunk:
forums, and to create rich text using a plain text editor.
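To make the "split by structure" idea concrete, here is a small sketch that groups a Markdown document into one section per heading, with no size limit. It is not what MarkdownTextSplitter does internally; the function name split_by_headings and the returned dictionary layout are my own assumptions, and it does not recognise `#` lines inside fenced code blocks.

```python
import re

# An ATX heading: one to six "#" characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Return one section per heading with its level, title, and body text."""
    sections = []
    current = {"level": 0, "title": "", "body": []}  # text before the first heading
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(current)  # close the previous section
            current = {
                "level": len(match.group(1)),
                "title": match.group(2).strip(),
                "body": [],
            }
        else:
            current["body"].append(line)
    sections.append(current)

    result = []
    for section in sections:
        body = "\n".join(section["body"]).strip()
        if section["title"] or body:  # drop completely empty sections
            result.append({"level": section["level"], "title": section["title"], "body": body})
    return result


if __name__ == "__main__":
    doc = "# Introduction\nMarkdown is lightweight.\n## Features\n- Easy to read\n- Supports lists\n"
    for section in split_by_headings(doc):
        print(section["level"], section["title"], "->", repr(section["body"]))
```

Keeping the heading level with each section makes it possible to attach the parent headings to a chunk later, which often helps retrieval.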
5. Semantic Chunking:
Fixed chunk sizes often overlook the meaning of the text. Semantic chunking groups text by meaning rather than by size or structure, so each chunk keeps the context needed to understand it. By using embeddings, sentences are grouped by themes. The process involves breaking the document into sentences, creating sentence groups anchored by a central sentence, generating embeddings for each group, and comparing embedding distances to identify topic changes and define coherent chunks.
Human analogy: Imagine you’re organizing a collection of news articles. Instead of sorting them by length or date, you group them by topic. You read each article, identify its main theme, and then place it with others discussing the same subject. This ensures each group of articles covers a specific topic comprehensively. Similarly, semantic chunking groups text by meaning, ensuring related sentences form meaningful chunks.
Code: This code sets an API key, then uses SemanticChunker with OpenAIEmbeddings to semantically split Markdown text into chunks and prints them.
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
import os

def semantic_chunking():
    # Placeholder: replace with your own key, or set OPENAI_API_KEY outside the code
    os.environ["OPENAI_API_KEY"] = "REPLACE_ME"
    text = """
# Introduction
Markdown is a lightweight markup language for creating formatted text using a plain-text editor.
## Features
- Easy to read and write
- Supports headings, lists, and code blocks
### Headings
Headings are created using the `#` symbol.
### Lists
Lists can be ordered or unordered.
## Usage
Markdown is commonly used in readme files, for writing messages in online discussion forums, and to create rich text using a plain text editor.
"""
    # Create a SemanticChunker instance
    text_splitter = SemanticChunker(OpenAIEmbeddings())
    # Split the text into semantic chunks
    docs = text_splitter.create_documents([text])
    # Print the chunks
    for doc in docs:
        print("chunk:\n" + doc.page_content + "\n")
Output:
chunk:
# Introduction
Markdown is a lightweight markup language for creating formatted text using a plain-text editor. ## Features
- Easy to read and write
- Supports headings, lists, and code blocks
### Headings
Headings are created using the `#` symbol. ### Lists
Lists can be ordered or unordered.
chunk:
## Usage
Markdown is commonly used in readme files, for writing messages in online discussion forums, and to create rich text using a plain text editor.
The first chunk introduces Markdown and describes its features, while the second chunk focuses on how Markdown is used. This approach ensures that each chunk is meaningful on its own, making it easier for subsequent processing and analysis.
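The "compare embedding distances to find topic changes" step can be tried without an API key. The sketch below swaps the real embedding model for a toy bag-of-words vector and uses a fixed distance threshold; both are assumptions made for illustration, as are the names embed, cosine_distance, and semantic_chunks. It compares neighbouring sentences directly instead of sentence groups.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; 0 means same direction, 1 means no shared words."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(sentences: list[str], threshold: float = 0.8) -> list[str]:
    """Start a new chunk whenever neighbouring sentences are further apart than threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine_distance(embed(prev), embed(cur)) > threshold:
            chunks.append([cur])  # topic shift: open a new chunk
        else:
            chunks[-1].append(cur)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


if __name__ == "__main__":
    sentences = [
        "Cats purr when happy.",
        "Happy cats purr loudly.",
        "Stock markets fell today.",
        "Markets fell on rate fears.",
    ]
    for chunk in semantic_chunks(sentences):
        print("chunk:", chunk)
```

Word overlap is a crude proxy for meaning, so a real embedding model will group paraphrases that this toy version would separate, but the breakpoint logic is the same idea.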
Effective chunking strategies are important for optimizing text processing in LLM-related applications. Semantic chunking, in particular, groups contextually similar information into independent and meaningful segments. This improves the efficiency and effectiveness of large language models by giving them focused inputs, enhancing their ability to understand and process natural language data.
PS — Scratchpad:
def main():
    #fixed_size_chunking()
    #sentence_splitting()
    #recursive_chunking()
    #markdown_special_chunking()
    semantic_chunking()

main()
Posted on: Mon Jun 24 2024