Skip to content

Join task#53

Merged
klwhaley merged 7 commits into
mainfrom
join_task
Oct 5, 2022
Merged

Join task#53
klwhaley merged 7 commits into
mainfrom
join_task

Conversation

@hammadk373
Copy link
Copy Markdown
Contributor

uses the new parse_uri_scheme function from PR #51. so that should be merged first

fixes: #49
fixes: #52

hammadk373 and others added 5 commits October 3, 2022 15:58
undefined

Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
Signed-off-by: Dixon Whitmire <dixonwh@gmail.com>
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
Signed-off-by: Dixon Whitmire <dixonwh@gmail.com>
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
Copy link
Copy Markdown
Member

@dixonwhitmire dixonwhitmire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hammadk373 thank you very much for these updates! The documentation, including docstrings, is really helpful!

My comments are limited to documentation and test case suggestions. I will go ahead and approve since the core implementation looks sound.

Thanks!

Comment thread docs/datacontract.md Outdated
<td>
<pre>
{
"name": "add_constant",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be helpful to change the task name here to "validate_email" or something similar

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh my bad.. that task is called validate_value because it not just limited to emails.

Comment thread docs/datacontract.md Outdated
</td>
<td>
<b>secondary_data_source:</b> path to the secondary data file. Can be relative to the data-contract dictionary or absolute.<br>
<b>join_type:</b> {'left', 'right', 'outer', 'inner', 'cross'} which correspond roughly to the RDB join types of the same name.<br>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the RDB acronym isn't defined it may be helpful to instead use the term "relational database".

Comment thread src/linuxforhealth/csvtofhir/pipeline/tasks.py Outdated

:param data_frame: The input DataFrame
:param secondary_data_source: path to the secondary data file. Can be relative to the data-contract dictionary or absolute.
Also supports http, ftp, s3 and gs paths
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirming that pandas is supporting http, ftp, s3, etc paths and that the new "optimized-streaming" extra is not required.

:param data_frame: The input DataFrame
:param secondary_data_source: path to the secondary data file. Can be relative to the data-contract dictionary or absolute.
Also supports http, ftp, s3 and gs paths
:param join_type: {'left', 'right', 'outer', 'inner', 'cross'} which correspond roughly to the RDB join types of the same name.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relational database join types may be more descriptive than RDB

Comment thread src/linuxforhealth/csvtofhir/pipeline/tasks.py
Comment thread tests/pipeline/test_tasks.py Outdated
assert actual["input1"][1] == "FailedToMatch"


def test_join_data():
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be helpful to include additional test "cases" for the inner and right join conditions. For the outer join conditions we should also have a separate record or two which do not have a matched value, just so we can verify it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a simple test each for inner, left, right and outer. I will trust that more thorough testing of these "joins" is being done in pandas :).

Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
@hammadk373
Copy link
Copy Markdown
Contributor Author

I have address all the review comments. @klwhaley can we cut a new beta release when you get a chance (after this PR is merged.)

@klwhaley klwhaley merged commit 9279c13 into main Oct 5, 2022
@klwhaley klwhaley deleted the join_task branch October 5, 2022 16:23
@klwhaley klwhaley added this to the 1.2b1 milestone Oct 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add a new task for validating a column value using regex Add a new task for joining original dataframe with auxiliary data

3 participants