Configure the Chunk Settings

Learn how to split documents into appropriate chunks to optimize information retrieval.

Table of Contents

· [What is Chunking?](#what-is-chunking)

· [Choose a Chunk Mode](#choose-a-chunk-mode)

· [Pre-process Text Before Chunking](#pre-process-text-before-chunking)

· [Enable Summary Auto-Gen](#enable-summary-auto-gen)

· [Preview Chunks](#preview-chunks)

What is Chunking?

Chunking is the process of splitting long documents into shorter text segments (called "chunks"). This is a critical step in building a Knowledge Base because:

· Chunks that are too long may contain irrelevant information, causing noise during retrieval

· Chunks that are too short may lack context, leading to incomplete answers

· Properly sized chunks result in more accurate retrieval

Two key concepts:

· Delimiter: The character or sequence where text is split. For example, `\n\n` splits at paragraph breaks and `\n` splits at line breaks.

📝 NOTE: Delimiters are removed during chunking. For example, using `A` as the delimiter splits `CBACD` into `CB` and `CD`. To avoid information loss, use non-content characters that don't naturally appear in your documents.

· Maximum Chunk Length: The maximum size of each chunk in characters. Text exceeding this limit is force-split regardless of delimiter settings.
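The interaction between the two concepts above can be sketched in a few lines of Python. This is a minimal illustration, not ClickAI's actual implementation: the function name `split_into_chunks` and its parameters are hypothetical, and a real chunker may also merge small pieces or overlap neighbouring chunks.

```python
def split_into_chunks(text: str, delimiter: str = "\n\n", max_length: int = 500) -> list[str]:
    """Split on the delimiter, then force-split any piece longer than max_length.

    Hypothetical helper for illustration only.
    """
    chunks = []
    for piece in text.split(delimiter):  # the delimiter itself is dropped here
        if not piece:
            continue  # skip empty pieces produced by adjacent delimiters
        # Force-split oversized pieces at fixed character offsets.
        for start in range(0, len(piece), max_length):
            chunks.append(piece[start:start + max_length])
    return chunks


print(split_into_chunks("CBACD", delimiter="A"))        # ['CB', 'CD']
print(split_into_chunks("abcdefgh", max_length=3))      # ['abc', 'def', 'gh']
```

The first call reproduces the `CBACD` example from the note: the delimiter `A` disappears from the output. The second shows a force-split: with no delimiter present, the text is cut every 3 characters regardless of content.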

Choose a Chunk Mode

ClickAI provides four chunking modes:

Mode Overview

| Mode | Description | When to use |
| --- | --- | --- |
| General | Splits by delimiter and max size. Flexible and fits most cases. | General documents, FAQs, guides |
| Parent-Child | Creates large chunks (parent) containing smaller chunks (child). Retrieval targets child but returns fuller parent context. | Technical docs, docs needing broader context |
| Paragraph | Splits by natural paragraphs. | Documents with clear paragraph structure |
| Full Doc | Keeps the entire document as a single chunk. | Short documents, policy documents |
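To make the Parent-Child idea concrete, here is a small Python sketch of "match on the child, return the parent". Everything in it is an assumption made for illustration: the names `ChildChunk`, `build_parent_child`, and `retrieve` are hypothetical, the delimiters are arbitrary, and the substring match stands in for whatever vector or keyword search the product actually uses.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index of the parent chunk this child was cut from


def build_parent_child(text: str, parent_delim: str = "\n\n", child_delim: str = "\n"):
    """Split text into parents, then split each parent into children."""
    parents = [p for p in text.split(parent_delim) if p.strip()]
    children = [
        ChildChunk(text=c, parent_id=i)
        for i, p in enumerate(parents)
        for c in p.split(child_delim)
        if c.strip()
    ]
    return parents, children


def retrieve(query: str, parents: list[str], children: list[ChildChunk]) -> list[str]:
    """Match the query against children, but return the de-duplicated parents."""
    hit_ids: list[int] = []
    for child in children:
        # Toy matching: case-insensitive substring instead of a real search index.
        if query.lower() in child.text.lower() and child.parent_id not in hit_ids:
            hit_ids.append(child.parent_id)
    return [parents[i] for i in hit_ids]


doc = "Install the CLI.\nRun the setup command.\n\nConfigure the API key.\nRestart the service."
parents, children = build_parent_child(doc)
print(retrieve("api key", parents, children))
# ['Configure the API key.\nRestart the service.']
```

The query matches only the short child "Configure the API key.", yet the result also carries the follow-up instruction from the same parent, which is the extra context this mode is meant to provide.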

Quick Comparison

| Criteria | General | Parent-Child | Paragraph | Full Doc |
| --- | --- | --- | --- | --- |
| Flexibility | High | Medium | Low | Low |
| Broad context | Medium | High | Medium | Very High |
| Retrieval accuracy | High | Very High | High | Low |
| Suits long docs | ✅ | ✅ | ✅ | ❌ |
| Suits short docs | ✅ | ❌ | ✅ | ✅ |

Notes on Parent-Child Mode

· Only the first 10,000 tokens are processed. Content beyond this limit is truncated.

· The parent chunk cannot be edited once created. To modify it, you must upload a new document.

⚠️ IMPORTANT: Choosing the right chunk mode is a critical step that directly affects retrieval quality. Experiment with different modes and use the Test Retrieval feature to evaluate results.

Pre-process Text Before Chunking

ClickAI provides pre-processing options to clean text before chunking:

Replace consecutive spaces, newlines, and tabs

Automatically normalizes whitespace:

· Three or more consecutive newlines → two newlines

· Multiple spaces → single space

· Tabs, form feeds, and special Unicode spaces → regular space
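The three normalization rules above can be approximated with Python's standard `re` module. This is a sketch under stated assumptions, not ClickAI's actual code: the function name `normalize_whitespace` is hypothetical, the order in which the rules are applied is a guess, and the set of "special Unicode spaces" covered here is only one plausible selection.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Hypothetical helper: apply the three whitespace rules in a fixed order."""
    # Tabs, form feeds, and some common Unicode spaces become a regular space.
    # (The exact character set is an assumption.)
    text = re.sub(r"[\t\f\u00a0\u2000-\u200a\u202f\u205f\u3000]", " ", text)
    # Runs of two or more spaces collapse to a single space.
    text = re.sub(r" {2,}", " ", text)
    # Three or more consecutive newlines collapse to exactly two.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text


print(repr(normalize_whitespace("a  \t b\n\n\n\nc\u00a0d")))
# 'a b\n\nc d'
```

Converting tabs and Unicode spaces first means the later space-collapsing rule also catches mixed runs such as a space followed by a tab.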

Remove all URLs and email addresses

Strips all URLs and email addresses from text content.

📝 NOTE: This setting is ignored in Full Doc mode.
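As a rough picture of what URL and email stripping does to a piece of text, consider the Python sketch below. The patterns are deliberately simple and are an assumption for illustration: they are not the patterns ClickAI uses, and the name `strip_urls_and_emails` is hypothetical.

```python
import re

# Simplified patterns (assumptions): real-world URL and email matching is more involved.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def strip_urls_and_emails(text: str) -> str:
    """Hypothetical helper: delete email addresses, then URLs."""
    text = EMAIL_RE.sub("", text)
    text = URL_RE.sub("", text)
    return text


cleaned = strip_urls_and_emails("Contact support@example.com or visit https://example.com/docs today")
print(repr(cleaned))
# 'Contact  or visit  today'
```

Removal leaves gaps where the address and link used to be, so sentences that depend on them ("visit ... today") can lose meaning. Preview the chunks after enabling this option if your documents reference links in running text.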

Enable Summary Auto-Gen

When Summary Auto-Gen is enabled, ClickAI automatically generates summaries for each chunk using an LLM. Summaries help:

· Improve retrieval when user queries differ from document language

· Add high-level information for chunks containing technical content (code, tables, logs)

· Create "semantic glue": applying identical summaries to related chunks lets them be retrieved as a group

💡 TIP: Summary Auto-Gen is especially useful when source documents use specialized jargon but users ask questions in everyday natural language.

Preview Chunks

After configuring chunk settings, click Preview to review results:

· See how documents are split into chunks

· Inspect content of each chunk

· Adjust configuration if results are unsatisfactory

Check chunk quality:

· Chunks too short: may lack sufficient context, leading to semantic loss and inaccurate answers

· Chunks too long: may include irrelevant information, introducing semantic noise and lowering retrieval precision

· Semantically incomplete chunks: caused by forced chunking that cuts through sentences or paragraphs, resulting in missing or misleading content
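If you export or copy chunks out of the preview, the three checks above can be roughed out as a script. The sketch below is only a heuristic under assumed thresholds: `check_chunk`, `min_chars`, and `max_chars` are hypothetical names and values, and "does not end with closing punctuation" is a crude proxy for a chunk that was cut mid-sentence.

```python
def check_chunk(chunk: str, min_chars: int = 50, max_chars: int = 1000) -> list[str]:
    """Hypothetical helper: return a list of quality warnings for one chunk."""
    issues = []
    stripped = chunk.strip()
    if len(stripped) < min_chars:
        issues.append("too short")
    if len(stripped) > max_chars:
        issues.append("too long")
    # Crude completeness check: a chunk that does not end with closing
    # punctuation may have been force-split in the middle of a sentence.
    if stripped and not stripped.endswith((".", "!", "?", ":", '"', "'", ")")):
        issues.append("possibly cut mid-sentence")
    return issues


print(check_chunk("The quick brown", min_chars=20))
# ['too short', 'possibly cut mid-sentence']
print(check_chunk("This chunk ends cleanly with a full stop.", min_chars=20))
# []
```

Tune the thresholds to your own content; headings, list items, and code legitimately end without punctuation, so treat the last warning as a prompt to look, not as a verdict.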

⚠️ IMPORTANT: Always preview and check chunk quality before proceeding with indexing. Re-indexing later costs additional time and resources.

📖 Previous: [Quick Create Overview] · Next: [Index Method & Retrieval Settings]
