Feature suggestion: add positional metadata to each Chunk that also connects the chunk to the original text:
- the start offset (character index) where the extracted text can be found in the input text
- perhaps: the length, or the end offset (character index) at which the chunk text ends in the input text (could be added afterwards)
- perhaps: the index of the chunk in the sequence of chunks (could be added afterwards): just 0, 1, 2, ...
Example
Input text: This is an AI demo. Waffles. My dog runs fast.
Output chunks, if a minimum chunk length was defined (so "Waffles." is skipped as too short):
[
{ "Text": "This is an AI demo.", "StartOffset": 0, "Length": 19, "Index": 0, "Id": ..., "Vector": [...] },
{ "Text": "My dog runs fast.", "StartOffset": 29, "Length": 17, "Index": 1, "Id": ..., "Vector": [...] }
]
Can't I just calculate StartOffset afterwards?
While splitting the original text into sentences and chunks, several destructive operations are applied to intermediate results:
- .Trim(), which might remove whitespace of unknown length (spaces, tabs, line breaks)
- Skipping chunks which are too short
- Cutting off chunk texts which are too long
Therefore the original position cannot be reliably calculated afterwards, once the chunks have already been built.
You might try, but I expect it would take messy, iterated IndexOf calls to handle all the edge cases.
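To illustrate why the offset is cheap to record during splitting, here is a minimal sketch in Python. It is not this library's implementation: the `Chunk` class, the `chunk_with_offsets` function, the naive sentence regex and the `min_length` default are all assumptions made up for this example. The point is only that the splitter still knows each match position at the moment it trims and filters.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical chunk type, mirroring the fields suggested in this issue.
    text: str
    start_offset: int
    length: int
    index: int


def chunk_with_offsets(text: str, min_length: int = 10) -> list[Chunk]:
    """Split into naive sentences while remembering where each one started."""
    chunks: list[Chunk] = []
    # Assumed sentence rule: a run of non-terminators plus trailing . ! or ?
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        stripped = raw.strip()
        # Skipping short chunks is destructive, but the offsets of the
        # remaining chunks stay valid because they come from the match.
        if not stripped or len(stripped) < min_length:
            continue
        # Account for the whitespace that trimming removed on the left.
        leading = len(raw) - len(raw.lstrip())
        chunks.append(
            Chunk(
                text=stripped,
                start_offset=match.start() + leading,
                length=len(stripped),
                index=len(chunks),
            )
        )
    return chunks


source = "This is an AI demo. Waffles. My dog runs fast."
for chunk in chunk_with_offsets(source):
    # Each chunk can be located again in the source by slicing.
    assert source[chunk.start_offset:chunk.start_offset + chunk.length] == chunk.text
```

Note that Python string indices count code points, while .NET string indices count UTF-16 code units, so the numbers can differ for text outside the Basic Multilingual Plane.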
Why would I need StartOffset?
It allows you to:
- Highlight a found chunk in a snippet of the original text
- Identify unchanged chunks when re-indexing the document later (might even allow skipping re-embedding)
- Add some overlap text around the extracted chunk at indexing time (as you can easily determine what to add)
- Remove overlapping text of adjacent chunks after retrieval
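Two of the uses listed above, highlighting and adding overlap, reduce to simple slicing once the start offset and length are known. The following Python sketch assumes the chunk text is an unmodified slice of the source; the function names `highlight` and `with_overlap` and the `[[ ]]` markers are made up for this example.

```python
def highlight(source: str, start: int, length: int, context: int = 10) -> str:
    """Return a snippet of the source with the chunk wrapped in [[ ]] markers."""
    end = start + length
    left = max(0, start - context)
    right = min(len(source), end + context)
    return f"{source[left:start]}[[{source[start:end]}]]{source[end:right]}"


def with_overlap(source: str, start: int, length: int, overlap: int) -> str:
    """Return the chunk text extended by up to `overlap` characters on each side."""
    left = max(0, start - overlap)
    right = min(len(source), start + length + overlap)
    return source[left:right]


source = "This is an AI demo. Waffles. My dog runs fast."
print(highlight(source, 29, 17, context=9))
# prints: Waffles. [[My dog runs fast.]]
print(with_overlap(source, 0, 19, overlap=9))
# prints: This is an AI demo. Waffles.
```

Without a stored offset, both functions would first have to search the source for the chunk text, which is ambiguous when the same sentence occurs more than once.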
Why would I need Index, Length / EndOffset?
Just for convenience. They can be added after calling your chunking library by anyone who needs them.
As far as I know it's common for some indexing pipelines to add them.
- Index: to retrieve some adjacent chunks, independent of their vectors
- Length / EndOffset: if your retrieval fetches the full original document text anyway, you could skip transferring / storing the chunk text and just use the metadata; in this case Text.Length is not available and you need Length or EndOffset
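As a rough sketch of how a caller could add these values afterwards, the Python below enriches chunks that already carry Text and StartOffset. The dictionary keys follow the example output in this issue; the helper names, and the choice of an exclusive EndOffset, are assumptions for illustration only.

```python
def add_index_and_length(chunks: list[dict]) -> list[dict]:
    """Add Index, Length and an exclusive EndOffset to chunks that have Text and StartOffset."""
    return [
        {
            **chunk,
            "Index": i,
            "Length": len(chunk["Text"]),
            "EndOffset": chunk["StartOffset"] + len(chunk["Text"]),
        }
        for i, chunk in enumerate(chunks)
    ]


def text_from_metadata(source: str, chunk: dict) -> str:
    """Recover the chunk text from the source when only metadata was stored."""
    return source[chunk["StartOffset"]:chunk["StartOffset"] + chunk["Length"]]


def neighbours(chunks: list[dict], index: int, radius: int = 1) -> list[dict]:
    """Return the chunks whose Index lies within `radius` of the given one."""
    return [c for c in chunks if c["Index"] != index and abs(c["Index"] - index) <= radius]


source = "This is an AI demo. Waffles. My dog runs fast."
chunks = add_index_and_length([
    {"Text": "This is an AI demo.", "StartOffset": 0},
    {"Text": "My dog runs fast.", "StartOffset": 29},
])

# Drop the text and keep only the positional metadata.
slim = {key: value for key, value in chunks[1].items() if key != "Text"}
assert text_from_metadata(source, slim) == "My dog runs fast."
```

This only works because StartOffset was already present, which is the one value that cannot be reliably reconstructed later.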
It depends on your strategy as a library author whether you would like to provide (some of) those values out of the box.