
Better support for token count and tokenization depending on the model in use#189

Merged
dluc merged 6 commits into microsoft:main from dluc:llmInterfaces
Dec 8, 2023

Conversation

@dluc
Collaborator

@dluc dluc commented Dec 7, 2023

Motivation and Context (Why the change? What's the scenario?)

Allow the use of different models with different tokenization characteristics.

High level description (Approach, Design)

  • New interfaces required for text and embedding generation.
  • Handlers, TextChunker, and SearchClient use the current model to check token counts, partition sizes, summarization settings, etc.
  • Text generation backends must provide a configurable MaxToken value so that prompt sizes can be validated.
  • Embedding generation backends must provide a configurable MaxToken value so that partitioning/chunking can be validated.
  • All LLMs must provide a tokenizer in order to count the tokens flowing through the system; if none is provided, the system falls back to the default GPT3 tokenizer.
  • Embedding and text model details can be configured in appsettings.json or passed with the usual "With*" builder methods.
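The contract described by the bullets above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's exact API: the interface and member names (ITextTokenizer, ITextGenerator, ITextEmbeddingGenerator, MaxTokenTotal) are assumptions chosen to mirror the description.

```csharp
// Hypothetical sketch of the interfaces the PR describes.
// All names here are illustrative assumptions, not the merged code.

// Every LLM backend exposes a way to count tokens for its own model.
public interface ITextTokenizer
{
    int CountTokens(string text);
}

// Text generation backends expose a configurable token limit so that
// handlers can validate prompt sizes before calling the model.
public interface ITextGenerator
{
    int MaxTokenTotal { get; }

    // May be null; callers fall back to a default GPT3 tokenizer.
    ITextTokenizer? Tokenizer { get; }
}

// Embedding backends expose a configurable token limit so that
// TextChunker can validate partition/chunk sizes.
public interface ITextEmbeddingGenerator
{
    int MaxTokenTotal { get; }

    ITextTokenizer? Tokenizer { get; }
}
```

Under this shape, a component such as TextChunker would ask the active model's tokenizer for a token count and compare it against MaxTokenTotal, rather than assuming one global tokenizer for every model.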

Comment thread on service/Core/DataFormats/Text/TextChunker.cs (outdated)
@dluc dluc merged commit 5c7012c into microsoft:main Dec 8, 2023
@dluc dluc deleted the llmInterfaces branch December 8, 2023 01:02
