Add positional metadata to Chunk

Feature suggestion: add positional metadata to each `Chunk` that connects the chunk to the original text:
- the start offset (character index) where the extracted text can be found in the input text
- perhaps: the length, or the end offset (character index), of the chunk text in the input text (could be added afterwards)
- perhaps: the index of the chunk in the sequence of chunks (could be added afterwards): just 0, 1, 2, ...

## Example
Input text: `This is an AI demo.   Waffles. My dog runs fast.`
Output chunks, if a minimum chunk length is defined (so `Waffles.` is skipped as too short):
```json
[
  { "Text": "This is an AI demo.", "StartOffset": 0, "Length": 19, "Index": 0, "Id": ..., "Vector": [...] },
  { "Text": "My dog runs fast.", "StartOffset": 31, "Length": 17, "Index": 1, "Id": ..., "Vector": [...] }
]
```
 
## Can't I just calculate `StartOffset` afterwards?

While splitting the original text into sentences and chunks, several destructive operations are applied to intermediate results:
- `.Trim()`, which might remove whitespace of unknown length (spaces, tabs, line breaks)
- Skipping chunks which are too short
- Truncating chunk texts which are too long

Therefore *we cannot* reliably calculate the original position afterwards, once the chunks have already been built.
You might try, but I guess it ends in some messy iterated `IndexOf` work to handle all the edge cases (repeated sentences, truncated texts, trimmed whitespace).
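
By contrast, recording the offset while splitting is cheap. Below is a minimal, language-neutral sketch in Python. It is not the library's code: the `Chunk` shape, the `chunk_with_offsets` name, the regex-based sentence rule and the `min_length` parameter are all assumptions made for illustration. Offsets here are Python string indices (code points); a .NET implementation would naturally count UTF-16 code units instead.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    # Hypothetical shape, mirroring the fields proposed in this issue.
    text: str
    start_offset: int
    length: int
    index: int


def chunk_with_offsets(source: str, min_length: int = 10) -> List[Chunk]:
    """Split into sentence-like chunks and remember where each one starts."""
    chunks: List[Chunk] = []
    # Assumed rule: a sentence is a run of non-terminator characters
    # followed by any '.', '!' or '?' characters.
    for match in re.finditer(r"[^.!?]+[.!?]*", source):
        raw = match.group()
        trimmed = raw.strip()
        if len(trimmed) < min_length:
            continue  # skipped chunks do not shift the offsets of later ones
        # Count the leading whitespace removed by the trim, so the offset
        # points at the first kept character in the original text.
        leading = len(raw) - len(raw.lstrip())
        chunks.append(
            Chunk(
                text=trimmed,
                start_offset=match.start() + leading,
                length=len(trimmed),
                index=len(chunks),
            )
        )
    return chunks


source = "This is an AI demo.   Waffles. My dog runs fast."
for chunk in chunk_with_offsets(source):
    # Each chunk can be found again in the source by slicing.
    assert source[chunk.start_offset:chunk.start_offset + chunk.length] == chunk.text
    print(chunk)
```

The point of the sketch is only that the splitter already knows each position at the moment it trims or skips; nothing has to be searched for afterwards.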

## Why would I need `StartOffset`?

It allows you to:
- Highlight a found chunk in a snippet of the original text
- Identify unchanged chunks when re-indexing the document later (might even allow skipping re-embedding)
- Add some overlap text around the extracted chunk at indexing time (as you can easily determine what to add)
- Remove overlapping text of adjacent chunks after retrieval
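
To illustrate the highlighting and overlap cases, here is a small Python sketch. The helper names `expand_with_overlap` and `highlight`, their parameters and the `**` markers are hypothetical and not part of any existing API; they only assume that a start offset and a length are known for the chunk.

```python
def expand_with_overlap(source: str, start_offset: int, length: int, overlap: int) -> str:
    """Return the chunk plus up to `overlap` characters of context on each side."""
    begin = max(0, start_offset - overlap)
    end = min(len(source), start_offset + length + overlap)
    return source[begin:end]


def highlight(source: str, start_offset: int, length: int, context: int = 10) -> str:
    """Return a snippet of the source with the chunk wrapped in ** markers."""
    begin = max(0, start_offset - context)
    chunk_end = start_offset + length
    end = min(len(source), chunk_end + context)
    return (
        source[begin:start_offset]
        + "**" + source[start_offset:chunk_end] + "**"
        + source[chunk_end:end]
    )


source = "This is an AI demo.   Waffles. My dog runs fast."
# "My dog runs fast." starts at character index 31 and is 17 characters long.
print(highlight(source, 31, 17, context=9))
# prints: Waffles. **My dog runs fast.**
```

Without the start offset, both helpers would first have to search for the chunk text, which is exactly the unreliable step described above.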

## Why would I need Index, Length / EndOffset?
Just for convenience. They can be added after calling your chunking library by anyone who needs them.

As far as I know, it's common for some indexing pipelines to add them:
- Index: to retrieve some adjacent chunks, independent of their vectors
- Length / EndOffset: if your retrieval fetches the full original document text anyway, you could skip transferring / storing the chunk text and just use the metadata; in this case `Text.Length` is not available and you need Length or EndOffset
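
A short Python sketch of both uses, assuming the chunk records are stored without their text. The record layout (plain dictionaries with `StartOffset`, `Length` and `Index` keys) and the helper names `chunk_text` and `neighbours` are assumptions for illustration only.

```python
from typing import Dict, List


def chunk_text(source: str, start_offset: int, length: int) -> str:
    """Rebuild the chunk text from the full document and the stored metadata."""
    return source[start_offset:start_offset + length]


def neighbours(records: List[Dict[str, int]], index: int, radius: int = 1) -> List[Dict[str, int]]:
    """Return the records whose Index is within `radius` of the given one."""
    return [
        r for r in records
        if r["Index"] != index and abs(r["Index"] - index) <= radius
    ]


source = "This is an AI demo.   Waffles. My dog runs fast."
# Metadata only: the text itself is neither stored nor transferred.
records = [
    {"StartOffset": 0, "Length": 19, "Index": 0},
    {"StartOffset": 31, "Length": 17, "Index": 1},
]

hit = records[0]
print(chunk_text(source, hit["StartOffset"], hit["Length"]))
for r in neighbours(records, hit["Index"]):
    print(chunk_text(source, r["StartOffset"], r["Length"]))
```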

It depends on your strategy as a library author whether you would like to provide (some of) those values out of the box.
