Fix LlamaIndexEmbeddingOperator returning vector=None for every chunk #68491

Merged
kaxil merged 1 commit into apache:main from astronomer:fix-llamaindex-embedding-vectors on Jun 15, 2026

@kaxil (Member) commented Jun 13, 2026

Fixes #68416.

LlamaIndexEmbeddingOperator.execute() documents chunks[*].vector as the embedding for each chunk, but every chunk came back with vector: None. The operator relied on VectorStoreIndex(nodes, ...) setting .embedding on the nodes it was given as a side effect -- but _get_node_with_embedding() attaches embeddings to model_copy() copies, never the originals (same behavior from v0.10 through v0.14.22, so no version pin avoids it). The vectors only ever existed inside the index's vector store.
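
The failure mode can be reproduced without llama-index. The sketch below uses hypothetical stand-ins (`Node`, `fake_embed`, `build_index`) and `dataclasses.replace` in place of pydantic's `model_copy()`. It is not the library's code, only an illustration of why a side effect applied to copies never reaches the caller's objects.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a llama-index node."""
    text: str
    embedding: Optional[list] = None


def fake_embed(text: str) -> list:
    """Toy embedding (assumption): deterministic numbers, not a real model."""
    return [float(len(text))]


def build_index(nodes: list) -> dict:
    """Mimics the described behavior: embeddings are attached to copies only."""
    store = {}
    for node in nodes:
        # replace() returns a new object, like model_copy(); `node` is untouched.
        embedded = replace(node, embedding=fake_embed(node.text))
        store[embedded.text] = embedded.embedding
    return store


nodes = [Node("alpha"), Node("be")]
store = build_index(nodes)

# The vectors exist inside the stand-in "index" ...
vectors_in_store = list(store.values())
# ... but the caller's nodes were never modified.
vectors_on_originals = [n.embedding for n in nodes]
```

Reading `vectors_on_originals` here corresponds to what the operator did when it built `chunks[*].vector` from the nodes it had passed in.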

The fix embeds the original nodes explicitly before any index work:

  • embed_model.get_text_embedding_batch() over node.get_content(metadata_mode=MetadataMode.EMBED) -- the exact content embed_nodes() embeds (includes node metadata, respects excluded_embed_metadata_keys), so the vectors are identical to what VectorStoreIndex produced internally before.
  • VectorStoreIndex is now built only when persist_dir is set. embed_nodes() skips nodes whose .embedding is already set, so persisting reuses the vectors instead of re-calling the embedding API (pinned by a test that counts embed calls through the real llama-index persist path). Without persist_dir, the index was built and immediately discarded.
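
The two points above can be sketched in plain Python. Everything here is a hypothetical stand-in (`Node`, `CountingEmbedModel`, `build_index`, `execute`), and `node.text` replaces the real `node.get_content(metadata_mode=MetadataMode.EMBED)` call for brevity. Only the control flow is meant to match the description: embed the originals once, build the index only when `persist_dir` is set, and skip nodes that already carry a vector.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a llama-index node."""
    text: str
    embedding: Optional[list] = None


class CountingEmbedModel:
    """Hypothetical embed model that records how many texts it embeds."""

    def __init__(self) -> None:
        self.texts_embedded = 0

    def get_text_embedding_batch(self, texts: list) -> list:
        self.texts_embedded += len(texts)
        return [[float(len(t))] for t in texts]


def build_index(nodes: list, embed_model: CountingEmbedModel) -> list:
    """Stand-in index build: embeds only nodes without a vector, onto copies."""
    todo = [n for n in nodes if n.embedding is None]
    fresh = {}
    if todo:
        vectors = embed_model.get_text_embedding_batch([n.text for n in todo])
        fresh = {id(n): v for n, v in zip(todo, vectors)}
    return [replace(n, embedding=fresh.get(id(n), n.embedding)) for n in nodes]


def execute(nodes: list, embed_model: CountingEmbedModel, persist_dir: Optional[str] = None) -> list:
    # Embed the caller's nodes directly so the vectors are visible on them.
    vectors = embed_model.get_text_embedding_batch([n.text for n in nodes])
    for node, vector in zip(nodes, vectors):
        node.embedding = vector
    # Build the index only when it will be persisted; nothing is written in this sketch.
    if persist_dir is not None:
        build_index(nodes, embed_model)
    return [{"text": n.text, "vector": n.embedding} for n in nodes]


model = CountingEmbedModel()
chunks = execute([Node("a"), Node("bb"), Node("ccc")], model, persist_dir="/tmp/idx")
```

With three nodes and a `persist_dir`, the counter ends at three embedded texts rather than six, because the stand-in index finds every node already embedded.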

Design rationale

Pre-embedding beats the alternative of reading vectors back from index.vector_store.data.embedding_dict: that couples the operator to SimpleVectorStore internals (.data doesn't exist on other stores) and still builds a throwaway index when not persisting. Pre-embedding only uses the public BaseEmbedding API.

The duck-type check in _resolve_embed_model now requires get_text_embedding_batch instead of get_text_embedding, since that's what execute() calls. It's a concrete method on BaseEmbedding (subclasses override _get_text_embeddings, not the public method), so every real embedding class passes the check across the supported llama-index range.
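
A duck-type check of this kind might look like the sketch below. The function name `resolve_embed_model`, the `TypeError`, and the two toy classes are assumptions for illustration; the actual `_resolve_embed_model` may raise a different exception or accept other inputs.

```python
def resolve_embed_model(candidate):
    """Hypothetical check: accept any object with a callable get_text_embedding_batch."""
    if not callable(getattr(candidate, "get_text_embedding_batch", None)):
        raise TypeError(
            f"{type(candidate).__name__} does not provide get_text_embedding_batch()"
        )
    return candidate


class BatchCapable:
    """Toy class exposing the method that execute() is described as calling."""

    def get_text_embedding_batch(self, texts):
        return [[0.0] for _ in texts]


class SingleOnly:
    """Toy class exposing only the single-text method, which is no longer enough."""

    def get_text_embedding(self, text):
        return [0.0]


accepted = resolve_embed_model(BatchCapable())

try:
    resolve_embed_model(SingleOnly())
    rejected = False
except TypeError:
    rejected = True
```

Checking for the method that is actually called keeps the validation and the call site in agreement, so an object that passes the check cannot fail later with a missing-attribute error.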


closes #68488
closes #68434

VectorStoreIndex attaches embeddings to model_copy() copies of the
nodes it is given, never the originals, so reading node.embedding
after index construction always returned None. Embed the original
nodes explicitly with the same content VectorStoreIndex embeds
(MetadataMode.EMBED) and only build the index when persist_dir is
set; embed_nodes() inside the index skips pre-embedded nodes, so
persisting does not re-call the embedding API.

@kaxil (Member, Author) commented Jun 13, 2026

@vojay-dev can you try with the fix in this PR, please?

@kaxil kaxil changed the title Fix LlamaIndexEmbeddingOperator returning vector=None for every chunk Fix LlamaIndexEmbeddingOperator returning vector=None for every chunk Jun 13, 2026
@vojay-dev commented

> @vojay-dev can you try with the fix in this PR, please?

@kaxil

Tested against the dependencies where I hit the bug (provider 0.4.0, llama-index-core 0.14.22, Python 3.13, Runtime 3.2-5) with MockEmbedding. On 0.4.0, chunks[*].vector comes back as None (with and without persist_dir); with this PR the vectors are populated directly. Also confirmed there is no double-embedding: 3 docs with persist_dir embed 3 texts total, not 6.

LGTM, thanks for the fast turnaround 👍!

@kaxil kaxil merged commit eee0b99 into apache:main Jun 15, 2026
81 checks passed
@kaxil kaxil deleted the fix-llamaindex-embedding-vectors branch June 15, 2026 22:15
imrichardwu pushed a commit to imrichardwu/airflow that referenced this pull request Jun 16, 2026
…hunk (apache#68491)

dingo4dev pushed a commit to dingo4dev/airflow that referenced this pull request Jun 16, 2026
…hunk (apache#68491)

RulerChen pushed a commit to RulerChen/airflow that referenced this pull request Jun 16, 2026
…hunk (apache#68491)

Development

Successfully merging this pull request may close these issues.

LlamaIndexEmbeddingOperator returns vector=None for every chunk
