
Better support for token count and tokenization depending on the model in use#189

Merged
dluc merged 6 commits into microsoft:main from dluc:llmInterfaces
Dec 8, 2023

Conversation

@dluc
Collaborator

@dluc dluc commented Dec 7, 2023

Motivation and Context (Why the change? What's the scenario?)

Allow the use of different models with different tokenization characteristics.

High level description (Approach, Design)

  • New interfaces required for text and embedding generation.
  • Handlers, TextChunker, and SearchClient use the current model to check token counts, partition sizes, summarization settings, etc.
  • Text generation backends must provide a configurable MaxToken value so that prompt sizes can be validated.
  • Embedding generation backends must provide a configurable MaxToken value so that partitioning/chunking can be validated.
  • All LLMs must provide a tokenizer in order to count the tokens flowing through the system; if none is provided, the system falls back to the default GPT3 tokenizer.
  • Embedding and text model details can be configured in appsettings.json or passed with the usual "With*" builder methods.
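The contract described by the bullets above can be sketched roughly as follows. This is a hypothetical illustration, not the PR's exact API: the interface and member names (ITextTokenizer, ITextGenerator, ITextEmbeddingGenerator, MaxTokenTotal) are assumptions chosen to mirror the description.

```csharp
// Hypothetical sketch of the interfaces the PR describes.
// All names here are illustrative assumptions, not the merged code.

// Every LLM backend exposes a way to count tokens for its own model.
public interface ITextTokenizer
{
    int CountTokens(string text);
}

// Text generation backends expose a configurable token limit so that
// handlers can validate prompt sizes before calling the model.
public interface ITextGenerator
{
    int MaxTokenTotal { get; }

    // May be null; callers fall back to a default GPT3 tokenizer.
    ITextTokenizer? Tokenizer { get; }
}

// Embedding backends expose a configurable token limit so that
// TextChunker can validate partition/chunk sizes.
public interface ITextEmbeddingGenerator
{
    int MaxTokenTotal { get; }

    ITextTokenizer? Tokenizer { get; }
}
```

Under this shape, a component such as TextChunker would ask the active model's tokenizer for a token count and compare it against MaxTokenTotal, rather than assuming one global tokenizer for every model.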

Comment thread on service/Core/DataFormats/Text/TextChunker.cs (outdated)
@dluc dluc merged commit 5c7012c into microsoft:main Dec 8, 2023
@dluc dluc deleted the llmInterfaces branch December 8, 2023 01:02
