Add positional metadata to Chunk #13

@hwanders

Feature suggestion: add positional metadata to each Chunk that connects the chunk back to the original text:

  • the start offset (character index) at which the chunk text begins in the input text
  • perhaps: the length or end offset (character index) of the chunk text in the input text (could be added afterwards)
  • perhaps: the index of the chunk in the sequence of chunks: just 0, 1, 2, ... (could be added afterwards)

Example

Input text: This is an AI demo. Waffles. My dog runs fast.
Output chunks, if a minimum chunk length was defined (so the short "Waffles." is skipped):

[
  { "Text": "This is an AI demo.", "StartOffset": 0, "Length": 19, "Index": 0, "Id": ..., "Vector": [...] },
  { "Text": "My dog runs fast.", "StartOffset": 29, "Length": 17, "Index": 1, "Id": ..., "Vector": [...] }
]
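To make the example concrete, here is a minimal sketch of a splitter that records offsets while it splits. It is written in Python for brevity and is not the library's implementation: the `Chunk` fields, the `chunk_with_offsets` function, the regex-based sentence rule, and the `min_length` default are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start_offset: int  # character index of the chunk in the input text
    length: int
    index: int  # position in the sequence of emitted chunks


def chunk_with_offsets(text: str, min_length: int = 10) -> list[Chunk]:
    """Split into sentence-like chunks and keep each chunk's original position."""
    chunks: list[Chunk] = []
    # Hypothetical splitting rule: a run of text up to and including . ! or ?
    for match in re.finditer(r"[^.!?]+[.!?]*", text):
        raw = match.group()
        stripped = raw.strip()
        if not stripped or len(stripped) < min_length:
            continue  # skipped chunks do not disturb later offsets
        # Add back the leading whitespace that strip() removed.
        leading = len(raw) - len(raw.lstrip())
        chunks.append(
            Chunk(stripped, match.start() + leading, len(stripped), len(chunks))
        )
    return chunks


text = "This is an AI demo. Waffles. My dog runs fast."
for chunk in chunk_with_offsets(text):
    print(chunk)
```

Because the offset is taken from the match position before trimming and skipping, `text[start_offset:start_offset + length]` always equals the chunk text.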

Can't I just calculate StartOffset afterwards?

While splitting the original text into sentences and chunks, several destructive operations are applied to intermediate results:

  • .Trim(), which may remove whitespace of unknown length (spaces, tabs, line breaks)
  • Skipping chunks that are too short
  • Cutting off chunk texts that are too long

Therefore we cannot reliably calculate the original position afterwards, once the chunks have been built.
You might try, but I guess some messy iterated IndexOf work will emerge to handle all the edge cases.
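A small sketch of what that after-the-fact search looks like, in Python for brevity (`str.find` plays the role of IndexOf here; both function names are made up for this illustration). A single search per chunk breaks as soon as a chunk text repeats, so the search has to be iterated with a moving cursor.

```python
def offsets_by_plain_search(text: str, chunk_texts: list[str]) -> list[int]:
    # One independent search per chunk: repeated texts all map to the first hit.
    return [text.find(chunk_text) for chunk_text in chunk_texts]


def offsets_by_iterated_search(text: str, chunk_texts: list[str]) -> list[int]:
    # Continue searching after the previous hit so repeats resolve in order.
    offsets: list[int] = []
    cursor = 0
    for chunk_text in chunk_texts:
        found = text.find(chunk_text, cursor)
        offsets.append(found)  # -1 if the chunk is not a verbatim substring
        if found >= 0:
            cursor = found + len(chunk_text)
    return offsets


text = "Same line here. Other text.   Same line here."
chunk_texts = ["Same line here.", "Other text.", "Same line here."]
print(offsets_by_plain_search(text, chunk_texts))
print(offsets_by_iterated_search(text, chunk_texts))
```

The plain search maps the third chunk back to offset 0, while the iterated version finds it at 30. Even the iterated version only works under the assumption that every chunk text is still a verbatim substring of the input and that chunks are returned in document order, which is exactly what the caller cannot verify from outside the library.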

Why would I need StartOffset?

It allows you to:

  • highlight a found chunk in a snippet of the original text
  • identify unchanged chunks when re-indexing the document later (might even allow skipping re-embedding)
  • add some overlap text around the extracted chunk at indexing time (as you can easily determine what to add)
  • remove overlapping text of adjacent chunks after retrieval
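Two of these uses can be sketched in a few lines once StartOffset and Length are known. The helper names `highlight` and `with_overlap`, the `[[ ]]` markers, and the default widths are assumptions for illustration only (Python used for brevity).

```python
def highlight(text: str, start_offset: int, length: int, context: int = 10) -> str:
    """Return a snippet with the chunk wrapped in [[ ]] plus surrounding context."""
    end = start_offset + length
    left = max(0, start_offset - context)
    right = min(len(text), end + context)
    return (
        text[left:start_offset] + "[[" + text[start_offset:end] + "]]" + text[end:right]
    )


def with_overlap(text: str, start_offset: int, length: int, overlap: int = 10) -> str:
    """Return the chunk text extended by up to `overlap` characters on each side."""
    left = max(0, start_offset - overlap)
    right = min(len(text), start_offset + length + overlap)
    return text[left:right]


text = "This is an AI demo. Waffles. My dog runs fast."
print(highlight(text, 29, 17))
print(with_overlap(text, 0, 19))
```

Both helpers only slice the original text, so no searching or re-splitting is involved.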

Why would I need Index, Length / EndOffset?

Just for convenience. They can be added after calling your chunking library by anyone who needs them.

As far as I know, it's common for indexing pipelines to add them:

  • Index: to retrieve some adjacent chunks, independent of their vectors
  • Length / EndOffset: if your retrieval fetches the full original document text anyway, you can skip transferring / storing the chunk text and use only the metadata; Text.Length is then unavailable, so you need Length or EndOffset
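As a rough sketch of how a caller could add these values afterwards and then drop the chunk text, assuming chunks arrive as dictionaries that already carry Text and StartOffset (the function names are hypothetical, Python used for brevity):

```python
def add_index_and_length(chunks: list[dict]) -> list[dict]:
    # Index is the position in the sequence; Length comes from the chunk text.
    return [
        {**chunk, "Index": i, "Length": len(chunk["Text"])}
        for i, chunk in enumerate(chunks)
    ]


def to_reference(chunk: dict) -> dict:
    # Store only the metadata; the text is recovered from the document later.
    return {key: value for key, value in chunk.items() if key != "Text"}


def resolve_text(document: str, reference: dict) -> str:
    start = reference["StartOffset"]
    return document[start:start + reference["Length"]]


document = "This is an AI demo. Waffles. My dog runs fast."
chunks = [
    {"Text": "This is an AI demo.", "StartOffset": 0},
    {"Text": "My dog runs fast.", "StartOffset": 29},
]
references = [to_reference(c) for c in add_index_and_length(chunks)]
print(references)
print([resolve_text(document, r) for r in references])
```

Note that Length has to be computed before the text is dropped, which is why it needs to be stored explicitly in this setup.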

It depends on your strategy as a library author whether you would like to provide (some of) those values out of the box.
