feat: add client side transformers to parallel loader #710

Open
ad-claw000 wants to merge 23 commits into develop from fix/issue-326-client-side-transformers

@ad-claw000
Contributor

Summary

Adds a transformers parameter to ParallelLoader.ingest and ParallelQuery.query so that client-side transformers can be applied directly through these APIs. This resolves the feature request to support usage like:

qr.ingest(csv, transformers=[ImageProps, Resizer])

Verification

  • Added the parameter to ParallelLoader.ingest and ParallelQuery.query signatures.
  • Looped over the provided transformers, wrapping the dataset generator in each one, before proceeding with normal query execution.
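
A minimal sketch of the wrapping loop described in the list above. `Doubler`, `AddOne`, and `apply_transformers` are hypothetical stand-ins made up for illustration, not aperturedb classes; the only thing taken from the PR is the idea that each transformer class is called with the current generator and the result replaces it.

```python
class Doubler:
    """Hypothetical transformer: wraps a dataset and doubles each item."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx] * 2


class AddOne:
    """Hypothetical transformer: wraps a dataset and adds one to each item."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx] + 1


def apply_transformers(generator, transformers=None):
    """Wrap `generator` in each transformer class, in list order."""
    for transformer in transformers or []:
        generator = transformer(generator)
    return generator


# The first class in the list ends up innermost, the last one outermost.
wrapped = apply_transformers([1, 2, 3], [Doubler, AddOne])
result = [wrapped[i] for i in range(len(wrapped))]
```

With this ordering each item is doubled first and incremented second, so list order is application order.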

Fixes #326

This allows applying transformers like ImageProps, Resizer etc. by passing
a list of transformer classes via the `transformers` argument to
`ParallelLoader.ingest` and `ParallelQuery.query`.

Copilot AI review requested due to automatic review settings May 20, 2026 08:42
@ad-claw000 ad-claw000 self-assigned this May 20, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR adds first-class support for applying client-side dataset transformers directly via the parallel ingestion/query APIs (ParallelLoader.ingest and ParallelQuery.query), targeting the feature request in #326.

Changes:

  • Added a transformers parameter to ParallelLoader.ingest(...).
  • Added a transformers parameter to ParallelQuery.query(...).
  • Implemented transformer application by wrapping the provided generator in each transformer, in sequence.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| aperturedb/ParallelQuery.py | Extends query() with a transformers argument and wraps the generator before query execution. |
| aperturedb/ParallelLoader.py | Extends ingest() with a transformers argument and wraps the generator before delegating to query(). |

Comments suppressed due to low confidence (2)

aperturedb/ParallelQuery.py:288

  • In use_dask mode this wraps the generator with Transformer instances, but DaskManager.run() assumes generator has .df, .filename, and .blobs_relative_to_csv (and later this method does len(generator.df)). Since Transformer doesn’t expose those attributes, query(..., transformers=...) will break for dask-backed generators. Consider applying transformers inside DaskManager.run() (wrap the per-partition data before calling loader.query) or explicitly rejecting transformers when use_dask is True with a clear error.

This issue also appears on line 279 of the same file.

        use_dask = hasattr(generator, "use_dask") and generator.use_dask
        if use_dask:
            self._reset(batchsize=batchsize, numthreads=numthreads)
            self.daskmanager = DaskManager(num_workers=numthreads)

        if transformers:
            for transformer in transformers:
                generator = transformer(generator)

        if hasattr(self, "query_setup"):
            self.query_setup(generator)

        if use_dask:
            results, self.total_actions_time = self.daskmanager.run(
                self.__class__, self.client, generator, batchsize, stats=stats)

aperturedb/ParallelQuery.py:285

  • query_setup() is invoked after wrapping the generator in transformers. This hides generator-specific methods like get_indices() used by ParallelLoader.query_setup() to create indexes, so index creation can silently stop working when transformers are provided. Consider calling query_setup(generator) before applying transformers, or make Transformer forward get_indices/other loader hooks to the wrapped dataset.
        if transformers:
            for transformer in transformers:
                generator = transformer(generator)

        if hasattr(self, "query_setup"):
            self.query_setup(generator)


Comment thread aperturedb/ParallelQuery.py Outdated
Comment thread aperturedb/ParallelLoader.py Outdated
Comment thread aperturedb/ParallelLoader.py Outdated
@ad-claw000
Contributor Author

Addressed all review comments (including the low-confidence ones) in commit 163c99b:

  1. Passed client=self.client when instantiating transformers.
  2. Removed the wrapping logic from ingest() and passed the transformers argument directly to query().
  3. Added test_transformers and test_transformers_rejects_dask tests.
  4. Added an explicit check that rejects transformers when use_dask is True, raising a ValueError as suggested.
  5. Invoked query_setup() before wrapping the generator in transformers so that get_indices() is not hidden.
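
The ordering described in the list above could look roughly like the sketch below. This is not the code from commit 163c99b: `FakeDataset`, `Upper`, and `prepare` are hypothetical names, and the string `"db"` stands in for a real client. It only illustrates the three decisions: reject dask first, run the setup hook on the unwrapped generator, then wrap while passing the client through.

```python
class FakeDataset:
    """Hypothetical stand-in for a loader dataset with a get_indices() hook."""

    def __init__(self, items, use_dask=False):
        self.items = items
        self.use_dask = use_dask

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        return self.items[idx]

    def get_indices(self):
        return {"entity": {"_Image": ["id"]}}


class Upper:
    """Hypothetical transformer that does NOT forward get_indices()."""

    def __init__(self, data, client=None):
        self.data = data
        self.client = client

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx].upper()


def prepare(generator, client, transformers=None, query_setup=None):
    """Validate, run setup on the raw generator, then apply transformers."""
    use_dask = getattr(generator, "use_dask", False)
    if transformers and use_dask:
        # Dask-backed generators expose attributes the wrappers would hide.
        raise ValueError("transformers are not supported with use_dask")
    if query_setup is not None:
        # Setup sees the unwrapped generator, so get_indices() is still visible.
        query_setup(generator)
    for transformer in transformers or []:
        generator = transformer(generator, client=client)
    return generator


seen = []
wrapped = prepare(
    FakeDataset(["a", "b"]),
    client="db",
    transformers=[Upper],
    query_setup=lambda g: seen.append(g.get_indices()),
)
```

Running the setup hook before wrapping avoids having to teach every transformer about loader hooks, at the cost of the hook never seeing transformed data.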

Copilot AI review requested due to automatic review settings May 20, 2026 10:18
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread aperturedb/ParallelQuery.py Outdated
Copilot AI review requested due to automatic review settings May 20, 2026 12:48
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 23, 2026 00:54
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

test/test_Parallel.py:138

  • This equivalence test also relies on GeneratorWithErrors(..., error_pct=0) being fully deterministic, but the generator’s current random error condition can still emit a BadCommand at error_pct=0 (randint==0), which can cause intermittent failures and mismatched succeeded counts. Making GeneratorWithErrors deterministic for error_pct=0 (or seeding random here) will stabilize the comparison.
        elements = 10
        
        # Manual wrapping
        generator1 = GeneratorWithErrors(elements=elements, error_pct=0)
        transformer1 = DummyTransformer(generator1, client=db)
        loader1 = ParallelLoader(db)
        loader1.ingest(transformer1, batchsize=2, numthreads=2, stats=False)
        
        # transformers parameter
        generator2 = GeneratorWithErrors(elements=elements, error_pct=0)
        loader2 = ParallelLoader(db)
        loader2.ingest(generator2, batchsize=2, numthreads=2, stats=False, transformers=[DummyTransformer])
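
The off-by-one described in this comment can be illustrated with a small sketch. The actual condition inside GeneratorWithErrors is not shown in this thread, so `should_fail_inclusive` is only an assumed shape of the flaky check, and `should_fail` and `LowestDraw` are hypothetical names introduced here.

```python
import random


def should_fail_inclusive(error_pct, rng=random):
    # Assumed shape of the flaky check: randint() includes both endpoints,
    # so a draw of 0 satisfies `0 <= 0` even when error_pct is 0.
    return rng.randint(0, 99) <= error_pct


def should_fail(error_pct, rng=random):
    # random() returns a float in [0.0, 1.0), so the scaled draw is never
    # below 0 and error_pct=0 can never trigger an error.
    return rng.random() * 100 < error_pct


class LowestDraw:
    """Worst-case RNG stub that always returns the lowest possible draw."""

    def randint(self, a, b):
        return a

    def random(self):
        return 0.0


flaky_at_zero = should_fail_inclusive(0, LowestDraw())
fixed_at_zero = should_fail(0, LowestDraw())
```

Seeding `random` in the test would also make the comparison repeatable, but it would leave the zero-percent case dependent on the chosen seed.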

Comment thread test/test_Parallel.py
Comment thread aperturedb/ParallelQuery.py Outdated
Copilot AI review requested due to automatic review settings May 23, 2026 05:21
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread aperturedb/ParallelQuery.py Outdated
Copilot AI review requested due to automatic review settings May 24, 2026 01:11
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings May 24, 2026 13:04
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread aperturedb/ParallelQuery.py Outdated
Comment thread test/test_Parallel.py Outdated
Copilot AI review requested due to automatic review settings May 24, 2026 22:57
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread aperturedb/ParallelQuery.py
Copilot AI review requested due to automatic review settings May 25, 2026 02:13
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread aperturedb/ParallelQuery.py
Copilot AI review requested due to automatic review settings May 25, 2026 20:17
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

Comment thread test/test_Parallel.py Outdated
Comment thread aperturedb/ParallelQuery.py
Comment thread test/test_Parallel.py Outdated
Copilot AI review requested due to automatic review settings May 26, 2026 20:08
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment thread aperturedb/ParallelQuery.py
Comment thread aperturedb/ParallelQuery.py
Comment thread aperturedb/transformers/transformer.py
Comment thread test/test_Parallel.py
- Add thread-local Connector clone in Transformer to avoid sharing across threads
- Restrict attribute delegation in Transformer to an allowlist
- Add validation for transformers parameter in ParallelQuery
- Add test coverage for single transformer and invalid transformer inputs
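
One possible shape for the first three items in the list above, as a rough sketch only. It is not the code in aperturedb/transformers/transformer.py: `SketchTransformer`, `FakeConnector`, `validate_transformers`, the contents of the allowlist, and the assumption that a connector offers `clone()` are all hypothetical.

```python
import threading


class FakeConnector:
    """Hypothetical connector that should not be shared across threads."""

    def clone(self):
        return FakeConnector()


class SketchTransformer:
    # Assumed allowlist: only these names are forwarded to the wrapped dataset.
    _FORWARDED = frozenset({"get_indices"})

    def __init__(self, data, client=None):
        self.data = data
        self._base_client = client
        self._local = threading.local()

    @property
    def client(self):
        # Clone the connector lazily, once per thread, and cache it per thread.
        if self._base_client is None:
            return None
        if not hasattr(self._local, "client"):
            self._local.client = self._base_client.clone()
        return self._local.client

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name in SketchTransformer._FORWARDED:
            return getattr(self.data, name)
        raise AttributeError(name)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


def validate_transformers(transformers):
    """Return a list of transformer classes, or raise TypeError on bad input."""
    if transformers is None:
        return []
    if not isinstance(transformers, (list, tuple)):
        raise TypeError("transformers must be a list or tuple of classes")
    for t in transformers:
        if not (isinstance(t, type) and issubclass(t, SketchTransformer)):
            raise TypeError(f"not a transformer class: {t!r}")
    return list(transformers)


class Dataset(list):
    """Hypothetical dataset with one forwarded hook and one private attribute."""

    secret = "not forwarded"

    def get_indices(self):
        return {"entity": {"_Image": ["id"]}}


wrapped = SketchTransformer(Dataset([1, 2]), client=FakeConnector())
```

An allowlist keeps the wrapper from accidentally exposing dataset internals, while the per-thread clone means worker threads never share one connection object.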

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Gautam's client side transformers to parallel loader

3 participants