Align extract_partitioning_index logic with upstream shuffling#60
Conversation
|
cc @VibhuJawa |
Signed-off-by: rjzamora <rzamora217@gmail.com>
This PR adds a new tutorial to demonstrate data curation for PEFT use-cases. Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com> Signed-off-by: rjzamora <rzamora217@gmail.com>
Signed-off-by: rjzamora <rzamora217@gmail.com>
d453b6e to
1f28a35
Compare
|
@ayushdg - thanks for the review. Still don't have an evironment to test this myself. I will try to do that later today, but if it's easy for you to test it is very welcome on my end :) |
|
Can confirm I'm seeing expected results on the larger scale dataset with this fix. I'll run a few more tests on my end but it's generally looking good. |
* Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Add comment around import, move constant import to global scope Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> --------- Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com>
@VibhuJawa - Still trying to fix my environment so I can confirm locally, but won't the |
In theory the shuffle gives incorrect results, but the dataset/num_partitions here is small enough that it doesn't impact the correctness of final results (duplicate documents detected) |
VibhuJawa
left a comment
There was a problem hiding this comment.
Approving, lets merge it in when we add the requested unit test.
Signed-off-by: rjzamora <rzamora217@gmail.com>
| npartitions_right, | ||
| npartitions_right, |
There was a problem hiding this comment.
I'm having trouble wrapping my head around the "correct" way to test the batched case. However, my sense is that this test already covers the critical requirement that we are extracting a partitioning index that is consistent with ddf.shuffle.
There was a problem hiding this comment.
Yeah. I don't think this specific bug impacts the batched case any differently than the reproducer here so I think it should be good to go.
Signed-off-by: rjzamora <rzamora217@gmail.com>
Signed-off-by: rjzamora <rzamora217@gmail.com>
b0e41ab to
647406f
Compare
Signed-off-by: rjzamora <rzamora217@gmail.com>
| from nemo_curator.utils.fuzzy_dedup_utils.shuffle_utils import ( | ||
| rearange_by_column_direct, | ||
| ) |
There was a problem hiding this comment.
Note that I moved this import into merge_left_to_shuffled_right, because shuffle_utils currently requires cudf/dask_cuda, while other logic in this module does not.
| npartitions_right, | ||
| npartitions_right, |
There was a problem hiding this comment.
Yeah. I don't think this specific bug impacts the batched case any differently than the reproducer here so I think it should be good to go.
…DIA-NeMo#60) * update extract_partitioning_index with compat code Signed-off-by: rjzamora <rzamora217@gmail.com> * [Tutorials] Add a tutorial for PEFT data curation (NVIDIA-NeMo#45) This PR adds a new tutorial to demonstrate data curation for PEFT use-cases. Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com> Signed-off-by: rjzamora <rzamora217@gmail.com> * move compat code to _compat file Signed-off-by: rjzamora <rzamora217@gmail.com> * Only import PII constants during Curator import (NVIDIA-NeMo#61) * Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Add comment around import, move constant import to global scope Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> --------- Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * add unit test Signed-off-by: rjzamora <rzamora217@gmail.com> * add pytest.mark.gpu Signed-off-by: rjzamora <rzamora217@gmail.com> * move extract_partitioning_index import for now Signed-off-by: rjzamora <rzamora217@gmail.com> * test both cudf and pandas Signed-off-by: rjzamora <rzamora217@gmail.com> * spelling Signed-off-by: rjzamora <rzamora217@gmail.com> --------- Signed-off-by: rjzamora <rzamora217@gmail.com> Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com> Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> Co-authored-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com> Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com> Signed-off-by: Vibhu Jawa <vibhujawa@gmail.com>
…DIA-NeMo#60) * update extract_partitioning_index with compat code Signed-off-by: rjzamora <rzamora217@gmail.com> * [Tutorials] Add a tutorial for PEFT data curation (NVIDIA-NeMo#45) This PR adds a new tutorial to demonstrate data curation for PEFT use-cases. Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com> Signed-off-by: rjzamora <rzamora217@gmail.com> * move compat code to _compat file Signed-off-by: rjzamora <rzamora217@gmail.com> * Only import PII constants during Curator import (NVIDIA-NeMo#61) * Move PII constants to a seperate file that does not import presidio/spacy and other GPU dependencies Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * Add comment around import, move constant import to global scope Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> --------- Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> * add unit test Signed-off-by: rjzamora <rzamora217@gmail.com> * add pytest.mark.gpu Signed-off-by: rjzamora <rzamora217@gmail.com> * move extract_partitioning_index import for now Signed-off-by: rjzamora <rzamora217@gmail.com> * test both cudf and pandas Signed-off-by: rjzamora <rzamora217@gmail.com> * spelling Signed-off-by: rjzamora <rzamora217@gmail.com> --------- Signed-off-by: rjzamora <rzamora217@gmail.com> Signed-off-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com> Signed-off-by: Ayush Dattagupta <ayushdg95@gmail.com> Co-authored-by: Mehran Maghoumi <Maghoumi@users.noreply.github.com> Co-authored-by: Ayush Dattagupta <ayushdg95@gmail.com>
Dask modified how
partitioning_indexis used for shuffling in dask/dask#10705 (included indask>=2023.12.1). This PR modifiesextract_partitioning_indexto use the same logic.TODO: