Fix LlamaIndexEmbeddingOperator returning vector=None for every chunk #68491
Merged
VectorStoreIndex attaches embeddings to model_copy() copies of the nodes it is given, never the originals, so reading node.embedding after index construction always returned None. Embed the original nodes explicitly with the same content VectorStoreIndex embeds (MetadataMode.EMBED) and only build the index when persist_dir is set; embed_nodes() inside the index skips pre-embedded nodes, so persisting does not re-call the embedding API.
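The copy behavior described above can be illustrated without llama-index. The sketch below is a standard-library analogy only: `Node` and `build_index` are hypothetical stand-ins, and `dataclasses.replace` plays the role that `model_copy()` plays in the real code.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a llama-index node."""

    text: str
    embedding: Optional[list[float]] = None


def build_index(nodes: list[Node]) -> list[Node]:
    """Mimic an index that embeds copies of its input, never the originals."""
    # replace() returns a new object, analogous to model_copy(update=...).
    # The "embedding" is just the text length, so no real model is needed.
    return [replace(n, embedding=[float(len(n.text))]) for n in nodes]


nodes = [Node("alpha"), Node("be")]
stored = build_index(nodes)

# The copies held by the "index" carry vectors; the caller's nodes do not.
originals_after = [n.embedding for n in nodes]
stored_vectors = [n.embedding for n in stored]
```

Reading `embedding` from the caller's nodes after `build_index` yields `None` for each, which mirrors the symptom reported in the bug.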
@vojay-dev can you try with the fix in this PR, please?
gopidesupavan approved these changes on Jun 14, 2026
Tested against the deps where I hit the bug (provider 0.4.0, llama-index-core 0.14.22, Python 3.13, Runtime 3.2-5). LGTM, thanks for the fast turnaround 👍!
imrichardwu pushed a commit to imrichardwu/airflow that referenced this pull request on Jun 16, 2026
dingo4dev pushed a commit to dingo4dev/airflow that referenced this pull request on Jun 16, 2026
RulerChen pushed a commit to RulerChen/airflow that referenced this pull request on Jun 16, 2026
Fixes #68416.

LlamaIndexEmbeddingOperator.execute() documents chunks[*].vector as the embedding for each chunk, but every chunk came back with vector: None. The operator relied on VectorStoreIndex(nodes, ...) setting .embedding on the nodes it was given as a side effect -- but _get_node_with_embedding() attaches embeddings to model_copy() copies, never the originals (same behavior from v0.10 through v0.14.22, so no version pin avoids it). The vectors only ever existed inside the index's vector store.

The fix embeds the original nodes explicitly before any index work:

- embed_model.get_text_embedding_batch() over node.get_content(metadata_mode=MetadataMode.EMBED) -- the exact content embed_nodes() embeds (includes node metadata, respects excluded_embed_metadata_keys), so the vectors are identical to what VectorStoreIndex produced internally before.
- VectorStoreIndex is now built only when persist_dir is set. embed_nodes() skips nodes whose .embedding is already set, so persisting reuses the vectors instead of re-calling the embedding API (pinned by a test that counts embed calls through the real llama-index persist path). Without persist_dir, the index was built and immediately discarded.

Design rationale

Pre-embedding beats the alternative of reading vectors back from index.vector_store.data.embedding_dict: that couples the operator to SimpleVectorStore internals (.data doesn't exist on other stores) and still builds a throwaway index when not persisting. Pre-embedding only uses the public BaseEmbedding API.

The duck-type check in _resolve_embed_model now requires get_text_embedding_batch instead of get_text_embedding, since that's what execute() calls. It's a concrete method on BaseEmbedding (subclasses override _get_text_embeddings, not the public method), so every real embedding class passes the check across the supported llama-index range.

closes #68488
closes #68434
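
The flow described in this PR can be sketched with a fake embedding model that counts batch calls. This is a rough illustration under stated assumptions, not the provider's or llama-index's code: `Node`, `FakeEmbedModel`, `resolve_embed_model`, `embed_missing` (standing in for the index-side `embed_nodes()` step) and `run` are all hypothetical names, and the "embedding" is just the text length.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a llama-index node."""

    content: str
    embedding: Optional[list[float]] = None


class FakeEmbedModel:
    """Hypothetical model that records how often the batch API is hit."""

    def __init__(self) -> None:
        self.batch_calls = 0

    def get_text_embedding_batch(self, texts: list[str]) -> list[list[float]]:
        self.batch_calls += 1
        return [[float(len(t))] for t in texts]


def resolve_embed_model(candidate):
    """Duck-type check: require the batch method that the caller actually uses."""
    if not callable(getattr(candidate, "get_text_embedding_batch", None)):
        raise TypeError("embed model must provide get_text_embedding_batch()")
    return candidate


def embed_missing(nodes: list[Node], embed_model) -> None:
    """Stand-in for the index-side step: embed only nodes without a vector."""
    missing = [n for n in nodes if n.embedding is None]
    if missing:
        vectors = embed_model.get_text_embedding_batch([n.content for n in missing])
        for node, vector in zip(missing, vectors):
            node.embedding = vector


def run(nodes: list[Node], embed_model, persist_dir: Optional[str] = None) -> list[dict]:
    model = resolve_embed_model(embed_model)
    # Embed the ORIGINAL nodes first, so the vectors are readable afterwards.
    vectors = model.get_text_embedding_batch([n.content for n in nodes])
    for node, vector in zip(nodes, vectors):
        node.embedding = vector
    # Only do index work when persisting; pre-embedded nodes are skipped there.
    if persist_dir is not None:
        embed_missing(nodes, model)
    return [{"text": n.content, "vector": n.embedding} for n in nodes]


model = FakeEmbedModel()
# "index-dir" is a placeholder flag here; nothing is written to disk.
chunks = run([Node("alpha"), Node("be")], model, persist_dir="index-dir")
```

With `persist_dir` set, the fake model's batch method is still called only once, because the second pass finds every node already embedded. That is the property the PR's call-counting test is described as pinning for the real persist path.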