Skip to content

Distributed datacontract and regex support#48

Merged
klwhaley merged 7 commits into
mainfrom
distributed_datacontract
Oct 3, 2022
Merged

Distributed datacontract and regex support#48
klwhaley merged 7 commits into
mainfrom
distributed_datacontract

Conversation

@hammadk373
Copy link
Copy Markdown
Contributor

Changes:

  1. Datacontracts can be reference external json files for fileDefinitions
  2. Support regex based matching with input file names. This is controlled by a configuration in the general section of the data contractregexFilenames. It defaults to false to ensure that legacy behavior is the default.
  3. Moved from using python's internal open(...) function for reading data contract to using the smart_open library. This has the exact same behavior (in fact it internally uses python's open function) when dealing with files on the local file system. However, it also add support for cloud storage solutions such as S3, and Azure Object Storage. See doc updates in the PR for more information

Fixes: #47

Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
@hammadk373 hammadk373 self-assigned this Sep 29, 2022
@hammadk373 hammadk373 removed their assignment Sep 29, 2022
Comment thread tests/model/test_contract.py Fixed
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
Comment thread docs/datacontract.md Outdated

### FileDefinition
The top-level key within a FileDefinition serves as the FileDefinition name. This name is matched against the input CSV file (case-insensitive)
The top-level key within a FileDefinition serves as the FileDefinition name. This name is matched against the input CSV file using either string match (case-insensitive) or regex (case-sensitive) [see general.regexFilenames setting]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can you add periods to complete the sentences in the next two lines

Comment thread docs/datacontract.md


## Alternative Datacontract locations
CsvToFHIR uses the [smart_open](https://github.com/RaRe-Technologies/smart_open) library to read the Datacontract and any referenced file definitions within it.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this now used for all reading of the data contract? even when not in external cloud storage...is that what we want?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the library provides an "open" function which effectively replaces the "open" function from the base library. We can update our setup here to make cloud storage an "opt-in" feature and then check for it's existence at runtime.

I know that this complicates things a bit, but in the event that we have issues with the library we will still be able to function.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the other issues, if any, are addressed, I will be happy to submit some changes for this PR to support the "optional" open.

assert d is not None


def test_datacontract_with_invalid_external_config(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we missing a test for data contract with external config that is successful?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test for a successful external config is here: https://github.com/LinuxForHealth/CsvToFHIR/blob/distributed_datacontract/tests/test_converter.py#L388. Its a more thorough test that checks both the successful loading and successful conversion of the csv all-in-one.

Having said that I think a more directed test at the functionality is a good idea. so I added one.

Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
@hammadk373 hammadk373 requested a review from klwhaley September 30, 2022 14:24
Comment thread tests/model/test_contract.py Fixed
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
@hammadk373
Copy link
Copy Markdown
Contributor Author

@klwhaley are there any further changes you want me to look at?

@dixonwhitmire will you be able to tackle the optional use of the library or do you want me to try and implement that?

@dixonwhitmire
Copy link
Copy Markdown
Member

@klwhaley are there any further changes you want me to look at?

@dixonwhitmire will you be able to tackle the optional use of the library or do you want me to try and implement that?

Hi @hammadk373 - if @klwhaley feels the issues are addressed I can take a look at the optional use of the library later today. At a high level I was going to update the "setup" to make it an optional component like we've done with jupyter and then use try/except to ensure our "open" call works a-ok with or without the dependency. If you would prefer to get started on that, that's fine with me. Just let me know.
Thanks!

@klwhaley
Copy link
Copy Markdown
Contributor

klwhaley commented Oct 3, 2022

I'm going to approve and merge. And Dixon can do his 'optional' change for the open in another PR. Then we can cut a release. Thanks!

Copy link
Copy Markdown
Contributor

@klwhaley klwhaley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving

@klwhaley klwhaley merged commit 6a8c2ce into main Oct 3, 2022
@klwhaley klwhaley deleted the distributed_datacontract branch October 3, 2022 16:08
@dixonwhitmire
Copy link
Copy Markdown
Member

Tracking smart_open as an "optional" component in #50

hammadk373 added a commit that referenced this pull request Oct 3, 2022
undefined

Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
klwhaley pushed a commit that referenced this pull request Oct 5, 2022
* Distributed datacontract and regex support (#48)

undefined

Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>

* Refactoring smart_open usage as an optional depdendency

Signed-off-by: Dixon Whitmire <dixonwh@gmail.com>
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>

* removing unused imports

Signed-off-by: Dixon Whitmire <dixonwh@gmail.com>
Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>

* new tasks

Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>

* Update tasks.py

Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>

* address review comments

Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>

Signed-off-by: Hammad Khan <480812+hammadk373@users.noreply.github.com>
Signed-off-by: Dixon Whitmire <dixonwh@gmail.com>
Co-authored-by: Dixon Whitmire <dixonwh@gmail.com>
@klwhaley klwhaley added this to the 1.2b1 milestone Oct 5, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Enhance data contract model to simplify large data contracts

4 participants