feat: multimodal prompt for generateImage/generateVideo (image-to-image, image-to-video) #624
Open: tombeckenham wants to merge 11 commits into main from 618-image-to-image-and-image-to-video-support
Commits
- 8a89c16 feat(ai): add imageInputs / videoInputs / audioInputs for image-condi… (tombeckenham)
- 48d0f62 ci: apply automated fixes (autofix-ci[bot])
- dfbc5e1 feat(ai-fal): resolve image-input fields per endpoint from generated … (tombeckenham)
- 389a1a0 feat(ai-grok,ai-openrouter): support imageInputs for image-conditione… (tombeckenham)
- 34347f7 chore: adapt #618 branch to the packages/ restructure and post-rebase… (tombeckenham)
- 26f10a9 feat(ai): make prompt multimodal for generateImage/generateVideo, pas… (tombeckenham)
- ff3bb47 fix: address PR review findings for image/video input support (tombeckenham)
- 32f2175 fix(ai-openai): throw on empty generateImages responses too (tombeckenham)
- 4b60a05 feat: client-side multimodal prompts, e2e coverage, media example, fa… (tombeckenham)
- acd7319 fix: address CodeRabbit review findings (tombeckenham)
- 1b52533 feat(ai,ai-gemini): add Google Veo video adapter on the typed-duratio… (tombeckenham)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,42 @@ | ||
| --- | ||
| '@tanstack/ai': minor | ||
| '@tanstack/ai-gemini': minor | ||
| --- | ||
|
|
||
| Add a Google Veo video adapter (`geminiVideo` / `createGeminiVideo`) and the | ||
| per-model typed-duration video contract it is built on (#534, #634). | ||
|
|
||
| **`@tanstack/ai`** (additive, non-breaking): `VideoAdapter` / | ||
| `BaseVideoAdapter` gain a `TModelDurationByName` generic (defaulting to | ||
| `Record<string, number>`, preserving today's `duration?: number` typing for | ||
| adapters without a map) plus two introspection methods with safe defaults: | ||
|
|
||
| - `availableDurations()` — a `DurationOptions` tagged union | ||
| (`discrete | range | mixed | none`) describing the durations the current | ||
| model accepts. Default: `{ kind: 'none' }`. | ||
| - `snapDuration(seconds)` — coerce raw seconds to the closest valid duration | ||
| (`snapToDurationOption` is exported for adapter authors). Default: | ||
| `undefined`. | ||
|
|
||
| `generateVideo({ duration })` is now typed per model via | ||
| `VideoDurationForAdapter<TAdapter>`. | ||
|
|
||
| **`@tanstack/ai-gemini`**: new Veo adapter over the long-running | ||
| `:predictLongRunning` operation, supporting `veo-3.1-generate-preview`, | ||
| `veo-3.1-fast-generate-preview`, `veo-3.0-generate-001`, | ||
| `veo-3.0-fast-generate-001`, and `veo-2.0-generate-001`: | ||
|
|
||
| - `geminiVideo('veo-3.0-generate-001')` → `duration?: 4 | 6 | 8` | ||
| (Veo 2: `5 | 6 | 8`); `adapter.snapDuration(7)` → `6`. | ||
| - Multimodal prompts: the first un-roled / `'start_frame'` image part | ||
| becomes the input image, `'end_frame'` → `lastFrame`, `'reference'` / | ||
| `'character'` → `referenceImages`. | ||
| - `size` takes Veo aspect ratios (`'16:9' | '9:16'`); everything else from | ||
| the SDK's `GenerateVideosConfig` (e.g. `resolution`, `generateAudio`, | ||
| `negativePrompt`) is available through `modelOptions`. | ||
| - Responsible-AI filtering is surfaced as a failed job with the filter | ||
| reasons. | ||
|
|
||
| Note: Veo result URLs are served by the Gemini Files API and require the | ||
| Google API key to download (`x-goog-api-key` header or `key` query | ||
| parameter). |
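
The duration snapping described in the changeset above can be illustrated with a small standalone sketch. This is not the library's `snapToDurationOption`: the `DurationOptionsSketch` field names (`values`, `min`, `max`) and the `snapSketch` helper are hypothetical, the `mixed` kind is left out because its shape is not shown in this changeset, clamping for `range` is an assumption, and the lower-value tie-break is an assumption chosen to agree with the `snapDuration(7)` → `6` example.

```typescript
// Hypothetical shapes: the changeset names the union kinds
// (discrete | range | mixed | none) but not their fields.
type DurationOptionsSketch =
  | { kind: 'discrete'; values: ReadonlyArray<number> }
  | { kind: 'range'; min: number; max: number }
  | { kind: 'none' }

// Coerce raw seconds to the closest duration the options allow.
// Returns undefined when the model exposes no duration options.
function snapSketch(
  options: DurationOptionsSketch,
  seconds: number,
): number | undefined {
  switch (options.kind) {
    case 'none':
      return undefined
    case 'range':
      // Assumption: out-of-range values are clamped to the nearest bound.
      return Math.min(options.max, Math.max(options.min, seconds))
    case 'discrete': {
      let best: number | undefined
      for (const value of options.values) {
        if (best === undefined) {
          best = value
          continue
        }
        const distance = Math.abs(value - seconds)
        const bestDistance = Math.abs(best - seconds)
        // Assumption: on a tie, prefer the lower duration.
        if (distance < bestDistance || (distance === bestDistance && value < best)) {
          best = value
        }
      }
      return best
    }
  }
}

// Usage: the `4 | 6 | 8` set listed for veo-3.0-generate-001.
const veo3Sketch: DurationOptionsSketch = { kind: 'discrete', values: [4, 6, 8] }
console.log(snapSketch(veo3Sketch, 7)) // 6
```

The tagged union lets a caller branch on `kind` before rendering a duration picker, and the `undefined` return mirrors the default described above for adapters that expose no duration map.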
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| --- | ||
| '@tanstack/ai': minor | ||
| '@tanstack/ai-openai': minor | ||
| '@tanstack/ai-gemini': minor | ||
| '@tanstack/ai-fal': minor | ||
| '@tanstack/ai-grok': minor | ||
| '@tanstack/ai-openrouter': minor | ||
| '@tanstack/ai-client': minor | ||
| '@tanstack/ai-event-client': patch | ||
| --- | ||
|
|
||
| `generateImage()` and `generateVideo()` now accept a multimodal `prompt`: a plain string, or an ordered array of content parts (`TextPart` / `ImagePart` / `VideoPart` / `AudioPart`) for image-conditioned generation, image-to-image, multi-reference, image-to-video, and edit / inpaint flows. Part order is meaningful — "not like this _(image)_, more like this _(image)_" — and each media part may carry a `metadata.role` hint (`'reference' | 'mask' | 'control' | 'start_frame' | 'end_frame' | 'character'`) that adapters use to route to the provider-specific field, plus an informational `metadata.tag` label for your own bookkeeping. The accepted part types are narrowed per model at compile time via each adapter's input-modality map, so passing an image part to a text-only model is a type error (with a clear runtime throw as backstop). | ||
|
|
||
| Prompt text is always sent **verbatim** — the SDK never injects or rewrites in-prompt referencing markers. To reference inputs from your prompt, write the provider's own convention (fal Kling / Seedance `@Image1`, OpenAI / FLUX.2 `"image 1"` prose, Gemini content descriptions); see the image-generation docs for the per-provider table. | ||
|
|
||
| Provider behavior in this release: | ||
|
|
||
| - **OpenAI image** — Prompts with image parts route `gpt-image-2` / `gpt-image-1` / `gpt-image-1-mini` to `images.edit()` (up to 16 source images plus optional mask); `dall-e-2` routes to `images.edit()` with one source image; `dall-e-3` rejects image parts at compile time and at runtime. | ||
| - **OpenAI video** — Sora-2 / Sora-2-Pro accept a single image part as `input_reference`; passing more than one throws. | ||
| - **Gemini image** — Native models (`gemini-*-flash-image`, "nano-banana") map prompt parts 1:1 onto multimodal `contents`, preserving interleaved order. Imagen is text-only (compile-time + runtime rejection). | ||
| - **fal.ai** — Field names resolve per endpoint from a map generated from the fal SDK's endpoint types (362 endpoints with nonstandard fields, e.g. nano-banana edit → `image_urls`, Kling i2v start frame → `image_url`, Veo first-last-frame → `first_frame_url` / `last_frame_url`). Defaults for endpoints not in the map: single → `image_url`, multiple → `image_urls`; `role: 'mask'` → `mask_url`; `role: 'control'` → `control_image_url`; `role: 'reference'` / `'character'` → `reference_image_urls`; video `role: 'start_frame'` / `'end_frame'` → `start_image_url` / `end_image_url`. Per-model prompt modalities are derived at the type level from the SDK's endpoint input types. Regenerate the map after a fal SDK bump with `pnpm generate:fal-image-fields` (a unit test fails when it goes stale). In `FalImageProviderOptions` / `FalVideoProviderOptions`, media-conditioning fields the mappers can populate (`image_url`, `start_image_url`, `video_url`, `audio_url`, …) are demoted from required to optional — supply them as prompt parts, or keep passing them explicitly via `modelOptions`. | ||
| - **Grok** — New `grok-imagine-image` / `grok-imagine-image-quality` models. Prompts with image parts route to xAI's JSON `/v1/images/edits` endpoint (up to 3 source images, addressed by xAI in request order; the prompt is sent verbatim). `role: 'mask'` / `'control'` throw. Their `size` uses an `aspectRatio_resolution` template (`'16:9_2k'`, suffix optional) mirroring Gemini's native image models. `grok-2-image-1212` remains text-to-image only. | ||
| - **OpenRouter** — Prompt parts map 1:1 onto multimodal `text` / `image_url` chat content parts, preserving interleaved order, and are forwarded to the underlying image model. URL sources pass through verbatim (no fetching or re-encoding in your process); `data` sources become data URIs. | ||
| - **Anthropic** — Unchanged (no image generation API). | ||
|
|
||
| A new `resolveMediaPrompt()` utility (exported from `@tanstack/ai`) gives adapter authors a single conversion point from the canonical interleaved prompt shape to flattened text plus per-modality part buckets. | ||
|
|
||
| On the client side, `ImageGenerateInput.prompt` and `VideoGenerateInput.prompt` (`@tanstack/ai-client`, and the `useGenerateImage` / `useGenerateVideo` hooks built on them) are widened from `string` to the same `MediaPrompt` shape, so prompt parts can be sent from the browser through your server route to `generateImage()` / `generateVideo()`. | ||
|
|
||
| Closes #618. |
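
The flattening step and the fal.ai default field routing described in the changeset above can be sketched as follows. Everything here is illustrative and self-contained: the part shapes (`TextPartSketch`, `ImagePartSketch` with a plain `url` field), the helper names, and joining text parts with a single space are assumptions, not the actual `resolveMediaPrompt()` signature or the `@tanstack/ai` types. Only the role names and the default field names (`image_url`, `image_urls`, `mask_url`, `control_image_url`, `reference_image_urls`) come from the changeset. The sketch covers image defaults only and treats any other role as un-roled, which is also an assumption.

```typescript
type RoleSketch =
  | 'reference' | 'mask' | 'control' | 'start_frame' | 'end_frame' | 'character'

// Hypothetical part shapes for illustration only.
type TextPartSketch = { type: 'text'; text: string }
type ImagePartSketch = {
  type: 'image'
  url: string
  metadata?: { role?: RoleSketch; tag?: string }
}
type MediaPromptSketch = string | ReadonlyArray<TextPartSketch | ImagePartSketch>

// Split an ordered prompt into flattened text plus an image bucket.
function flattenPromptSketch(
  prompt: MediaPromptSketch,
): { text: string; images: Array<ImagePartSketch> } {
  if (typeof prompt === 'string') return { text: prompt, images: [] }
  const texts: Array<string> = []
  const images: Array<ImagePartSketch> = []
  for (const part of prompt) {
    if (part.type === 'text') texts.push(part.text)
    else images.push(part)
  }
  // Assumption: text parts are joined with a space; the text itself is untouched.
  return { text: texts.join(' '), images }
}

// Apply the default image field names listed for endpoints not in the generated map.
function defaultImageFieldsSketch(
  images: ReadonlyArray<ImagePartSketch>,
): Record<string, string | Array<string>> {
  const fields: Record<string, string | Array<string>> = {}
  const plain: Array<string> = []
  const references: Array<string> = []
  for (const image of images) {
    const role = image.metadata?.role
    if (role === 'mask') fields['mask_url'] = image.url
    else if (role === 'control') fields['control_image_url'] = image.url
    else if (role === 'reference' || role === 'character') references.push(image.url)
    else plain.push(image.url)
  }
  const first = plain[0]
  if (plain.length === 1 && first !== undefined) fields['image_url'] = first
  else if (plain.length > 1) fields['image_urls'] = plain
  if (references.length > 0) fields['reference_image_urls'] = references
  return fields
}

// Usage: one source image plus a mask, with text on either side.
const flattened = flattenPromptSketch([
  { type: 'text', text: 'Replace the sky' },
  { type: 'image', url: 'https://example.com/photo.png' },
  { type: 'text', text: 'inside the masked area' },
  { type: 'image', url: 'https://example.com/mask.png', metadata: { role: 'mask' } },
])
console.log(flattened.text, defaultImageFieldsSketch(flattened.images))
```

Keeping the flattening in one helper means each adapter can decide separately whether it needs the interleaved order (as the Gemini and OpenRouter entries describe) or only the per-role buckets.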
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -5,6 +5,7 @@ | |
| **/coverage | ||
| **/dist | ||
| **/docs | ||
| packages/ai-fal/src/image/generated/ | ||
| pnpm-lock.yaml | ||
|
|
||
| .angular | ||
|
|
||