update pattern for dataflow job id extraction #41794

Merged
potiuk merged 3 commits into apache:main from lukas-mi:dataflow-job-id-pattern on Sep 1, 2024

@lukas-mi lukas-mi commented Aug 27, 2024

Contributor

The Dataflow job ID is extracted from the logged output of the Java process that starts the Dataflow job, for example when using BeamRunJavaPipelineOperator.

Currently, the job ID pattern matches characters until the first " or \n is encountered, which works for the following case:

  • logged line: [2024-08-27 11:20:22,094] INFO Submitted job: 2024-08-27_04_20_21-7947372725816706151
  • extracted job id: 2024-08-27_04_20_21-7947372725816706151

However, if the logger is configured differently, for example so that each line ends with whitespace followed by a suffix carrying additional information, the pattern extracts the ID together with the suffix:

  • logged line: [2024-08-27 11:20:22,094] INFO Submitted job: 2024-08-27_04_20_21-7947372725816706151 (org.apache.beam.runners.dataflow.DataflowRunner) (main)
  • extracted job id: 2024-08-27_04_20_21-7947372725816706151 (org.apache.beam.runners.dataflow.DataflowRunner) (main)

In this example, the suffix (org.apache.beam.runners.dataflow.DataflowRunner) (main) should not be extracted as part of the job ID.

I updated the pattern by adding the whitespace character class \s (alongside the existing " and \n) to the characters that mark the end of the job ID.
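
The difference between the two behaviours can be reproduced with a small standalone sketch. The patterns below are simplified stand-ins written for illustration, not copies of the one in airflow/providers/google/cloud/hooks/dataflow.py; the names OLD_PATTERN, NEW_PATTERN, and extract_job_id are hypothetical.

```python
import re

# Simplified, hypothetical patterns for illustration only; the actual
# pattern in the Airflow Google provider may differ in detail.
# Old behaviour: the ID ends at the first double quote or newline.
OLD_PATTERN = re.compile(r'Submitted job: (?P<job_id>[^"\n]*)')
# New behaviour: the ID also ends at the first whitespace character.
NEW_PATTERN = re.compile(r'Submitted job: (?P<job_id>[^"\n\s]*)')

# The two logged lines quoted in the description above.
plain_line = (
    "[2024-08-27 11:20:22,094] INFO Submitted job: "
    "2024-08-27_04_20_21-7947372725816706151"
)
suffixed_line = (
    plain_line + " (org.apache.beam.runners.dataflow.DataflowRunner) (main)"
)


def extract_job_id(pattern, line):
    """Return the captured job ID, or None if the line does not match."""
    match = pattern.search(line)
    return match.group("job_id") if match else None


if __name__ == "__main__":
    for name, pattern in (("old", OLD_PATTERN), ("new", NEW_PATTERN)):
        for line in (plain_line, suffixed_line):
            print(name, repr(extract_job_id(pattern, line)))
```

With these stand-in patterns, both return the bare ID for the plain line, while only the one that excludes \s drops the logger suffix from the second line.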

@boring-cyborg boring-cyborg Bot added the labels area:providers and provider:google (Google (including GCP) related issues) on Aug 27, 2024
Comment thread airflow/providers/google/cloud/hooks/dataflow.py Outdated
@lukas-mi lukas-mi force-pushed the dataflow-job-id-pattern branch from 676c264 to 8019d86 Compare August 28, 2024 08:25
@lukas-mi

Contributor Author

@VladaZakharova when will this be merged? :)

@VladaZakharova

Contributor

Hi @potiuk ! Can you please merge it?

potiuk commented Aug 30, 2024

Member

> @VladaZakharova when will this be merged? :)

When the tests pass and someone merges it.

Since you are a first-time contributor, we have to manually approve workflows to see if the tests pass, and you then have to fix them if they don't. When you submit a new version, you will have to wait for someone to see it and approve it. To signal that you think you have fixed all the tests, you can ask in general, without mentioning anyone, for your workflows to be approved.

Also see the contribution docs that explain the process https://github.com/apache/airflow/tree/main/contributing-docs

@potiuk potiuk merged commit 9a66882 into apache:main Sep 1, 2024
boring-cyborg Bot commented Sep 1, 2024


Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.
