Fix `LlamaIndexEmbeddingOperator` returning `vector=None` for every chunk by kaxil · Pull Request #68491 · apache/airflow

kaxil · 2026-06-13T00:46:46Z

LlamaIndexEmbeddingOperator.execute() documents chunks[*].vector as the embedding for each chunk, but every chunk came back with vector: None. The operator relied on VectorStoreIndex(nodes, ...) setting .embedding on the nodes it was given as a side effect -- but _get_node_with_embedding() attaches embeddings to model_copy() copies, never the originals (same behavior from v0.10 through v0.14.22, so no version pin avoids it). The vectors only ever existed inside the index's vector store.
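The copy-versus-original behavior described above can be reproduced without llama-index. The following is a minimal sketch under stated assumptions: `Node` and `build_index` are hypothetical stand-ins (not the real llama-index classes), and `copy.copy` plays the role of `model_copy()`.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a llama-index node (not the real class)."""

    text: str
    embedding: Optional[list] = None


def build_index(nodes):
    """Mimic an index that attaches embeddings to copies of its input nodes."""
    stored = []
    for node in nodes:
        clone = copy.copy(node)  # plays the role of model_copy()
        clone.embedding = [float(len(node.text))]  # toy "embedding"
        stored.append(clone)
    return stored


nodes = [Node("alpha"), Node("be")]
stored = build_index(nodes)

# The copies held by the index carry vectors; the caller's nodes do not.
originals = [n.embedding for n in nodes]
copies = [n.embedding for n in stored]
```

Reading `embedding` from the caller's nodes after index construction therefore yields `None`, which is the symptom reported for `chunks[*].vector`.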

The fix embeds the original nodes explicitly before any index work:

- embed_model.get_text_embedding_batch() over node.get_content(metadata_mode=MetadataMode.EMBED) -- the exact content embed_nodes() embeds (includes node metadata, respects excluded_embed_metadata_keys), so the vectors are identical to what VectorStoreIndex produced internally before.
- VectorStoreIndex is now built only when persist_dir is set. embed_nodes() skips nodes whose .embedding is already set, so persisting reuses the vectors instead of re-calling the embedding API (pinned by a test that counts embed calls through the real llama-index persist path). Without persist_dir, the index was built and immediately discarded.
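The two steps above can be sketched with stand-in classes. This is an illustration, not the operator's actual code: `CountingEmbedModel`, `Node`, `build_index`, and `embed_chunks` are hypothetical names, and nothing is written to `persist_dir` here. The point is only that pre-embedding the originals lets the index step skip them, so each text is embedded once.

```python
import copy
from typing import Optional


class CountingEmbedModel:
    """Hypothetical embed model that records how many texts it embeds."""

    def __init__(self) -> None:
        self.texts_embedded = 0

    def get_text_embedding_batch(self, texts):
        self.texts_embedded += len(texts)
        return [[float(len(t))] for t in texts]  # toy vectors


class Node:
    """Hypothetical stand-in for a llama-index node."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.embedding: Optional[list] = None


def build_index(nodes, embed_model):
    """Stand-in for index construction: works on copies, embeds only nodes
    that do not already carry a vector."""
    clones = [copy.copy(n) for n in nodes]
    missing = [c for c in clones if c.embedding is None]
    if missing:
        vectors = embed_model.get_text_embedding_batch([c.text for c in missing])
        for clone, vector in zip(missing, vectors):
            clone.embedding = vector
    return clones


def embed_chunks(nodes, embed_model, persist_dir=None):
    # Step 1: embed the original nodes explicitly, before any index work.
    vectors = embed_model.get_text_embedding_batch([n.text for n in nodes])
    for node, vector in zip(nodes, vectors):
        node.embedding = vector
    # Step 2: build the index only when it is going to be persisted.
    index = build_index(nodes, embed_model) if persist_dir else None
    chunks = [{"text": n.text, "vector": n.embedding} for n in nodes]
    return chunks, index


model = CountingEmbedModel()
chunks, index = embed_chunks(
    [Node("a"), Node("bb"), Node("ccc")], model, persist_dir="unused-in-this-sketch"
)
```

With three nodes and a `persist_dir`, the counter ends at 3 rather than 6, because the index step finds every copy already embedded.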

Design rationale

Pre-embedding beats the alternative of reading vectors back from index.vector_store.data.embedding_dict: that couples the operator to SimpleVectorStore internals (.data doesn't exist on other stores) and still builds a throwaway index when not persisting. Pre-embedding only uses the public BaseEmbedding API.

The duck-type check in _resolve_embed_model now requires get_text_embedding_batch instead of get_text_embedding, since that's what execute() calls. It's a concrete method on BaseEmbedding (subclasses override _get_text_embeddings, not the public method), so every real embedding class passes the check across the supported llama-index range.
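A duck-type check of this kind might look like the sketch below. The body of `_resolve_embed_model` is not shown in this description, so `looks_like_embed_model` and the two example classes are hypothetical; only the method names `get_text_embedding_batch` and `get_text_embedding` come from the text above.

```python
def looks_like_embed_model(candidate: object) -> bool:
    """Return True if the object exposes a callable get_text_embedding_batch."""
    return callable(getattr(candidate, "get_text_embedding_batch", None))


class BatchCapable:
    """Hypothetical object offering the batch method that execute() calls."""

    def get_text_embedding_batch(self, texts):
        return [[0.0] for _ in texts]


class SingleOnly:
    """Hypothetical object offering only the single-text method."""

    def get_text_embedding(self, text):
        return [0.0]


accepted = looks_like_embed_model(BatchCapable())
rejected = looks_like_embed_model(SingleOnly())
```

Checking for the method that is actually called means a mismatched object is rejected at resolution time rather than failing later inside `execute()`.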

closes #68488
closes #68434

VectorStoreIndex attaches embeddings to model_copy() copies of the nodes it is given, never the originals, so reading node.embedding after index construction always returned None. Embed the original nodes explicitly with the same content VectorStoreIndex embeds (MetadataMode.EMBED) and only build the index when persist_dir is set; embed_nodes() inside the index skips pre-embedded nodes, so persisting does not re-call the embedding API.

kaxil · 2026-06-13T00:47:29Z

@vojay-dev can you try with the fix in this PR, please?

vojay-dev · 2026-06-15T19:34:54Z

> @vojay-dev can you try with the fix in this PR, please?

@kaxil

Tested against the deps where I hit the bug (provider 0.4.0, llama-index-core 0.14.22, Python 3.13, Runtime 3.2-5) with MockEmbedding. On 0.4.0 chunks[].vector comes back None (with and without persist_dir); with this PR the vectors are populated directly. Also confirmed no double-embedding: 3 docs with persist_dir embed 3 texts total, not 6.

LGTM, thanks for the fast turnaround 👍!


kaxil requested a review from gopidesupavan as a code owner June 13, 2026 00:46

boring-cyborg Bot added area:providers kind:documentation provider:common-ai labels Jun 13, 2026

kaxil changed the title ~~Fix LlamaIndexEmbeddingOperator returning vector=None for every chunk~~ Fix LlamaIndexEmbeddingOperator returning vector=None for every chunk Jun 13, 2026

gopidesupavan approved these changes Jun 14, 2026


kaxil merged commit eee0b99 into apache:main Jun 15, 2026
81 checks passed

kaxil deleted the fix-llamaindex-embedding-vectors branch June 15, 2026 22:15

shahar1 mentioned this pull request Jun 19, 2026

Status of testing Providers that were prepared on June 19, 2026 #68751

Open

75 tasks

kaxil merged 1 commit into apache:main from astronomer:fix-llamaindex-embedding-vectors
