Task 8 + Bonus #11

Merged
lfoppiano merged 35 commits into main from luca/feature/part4
Feb 11, 2026

@lfoppiano (Collaborator) commented Jan 5, 2026

Description

This PR adds Task 8 and the Bonus. I wrote a corresponding Java class, Duck.java, which mirrors the Python duck.py 1:1. I did not include the unsupported algorithms / tasks.

Some parts of Task 8 work only on the development server; we might want to add more information about this. What's your opinion?

@lfoppiano lfoppiano marked this pull request as ready for review January 9, 2026 17:25
@lfoppiano lfoppiano requested review from sebastian-nagel and removed request for sebastian-nagel January 9, 2026 17:31
@sebastian-nagel sebastian-nagel left a comment

See inline comments.

Same as for other PRs: TEST-000000.* and whirlwind.parquet are written into the repository home folder. Could keep them elsewhere.
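
One possible way to keep generated files out of the repository root is to route every output file through a dedicated directory. The following is only a minimal sketch: the class name `OutputPaths`, the helper `outputFile`, and the directory `target/whirlwind-output` are hypothetical and do not come from this PR.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class OutputPaths {
    /**
     * Returns the path of fileName inside outputDir, creating the
     * directory (and any missing parents) if it does not exist yet.
     */
    public static Path outputFile(Path outputDir, String fileName) throws IOException {
        Files.createDirectories(outputDir);
        return outputDir.resolve(fileName);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output location; any directory outside the repository root would do.
        Path outDir = Path.of("target", "whirlwind-output");
        System.out.println(outputFile(outDir, "whirlwind.parquet"));
    }
}
```

With a helper like this, the code that writes whirlwind.parquet would ask for its target path instead of using a bare file name.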

public static List<String> getFiles(Algorithm algo, String crawl) throws IOException {
    switch (algo) {
        case CCF_LOCAL_FILES: {
            Path indexPath = Path.of("/home/cc-pds/commoncrawl/cc-index/table/cc-main/warc", "crawl=" + crawl,

This is not portable.

Why not use data/downloads/cc-index/table/... and add a target to download the Parquet files showing a warning about the amount of data going to be downloaded?
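
A rough sketch of what a portable lookup could look like: the table directory defaults to a path relative to the repository and can be overridden from outside. The class `IndexPaths`, both method names, and the environment variable `CC_INDEX_TABLE_DIR` are hypothetical. The `cc-main/warc` segments below the table directory are assumed to match the ones visible in the hard-coded path, and the crawl name is a placeholder.

```java
import java.nio.file.Path;

public class IndexPaths {
    // Assumed default, relative to the repository root (taken from the suggestion above).
    static final String DEFAULT_TABLE_DIR = "data/downloads/cc-index/table";

    /** Uses the override when one is given, otherwise falls back to the relative default. */
    public static Path tableDir(String override) {
        boolean hasOverride = override != null && !override.isBlank();
        return Path.of(hasOverride ? override : DEFAULT_TABLE_DIR);
    }

    /** Builds the partition directory of one crawl below the table directory. */
    public static Path crawlDir(Path tableDir, String crawl) {
        // Assumption: same layout as in the hard-coded path (cc-main/warc/crawl=<id>).
        return tableDir.resolve("cc-main").resolve("warc").resolve("crawl=" + crawl);
    }

    public static void main(String[] args) {
        // CC_INDEX_TABLE_DIR is a hypothetical variable name, not an existing setting.
        Path table = tableDir(System.getenv("CC_INDEX_TABLE_DIR"));
        System.out.println(crawlDir(table, "EXAMPLE-CRAWL"));
    }
}
```

The development server could then keep its existing location by setting the override, while other machines would use the downloaded copy under the repository.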

@lfoppiano (Collaborator, Author)

@sebastian-nagel I've moved your suggestion to #14 for validation, since it might also require syncing the Python whirlwind tour.

@sebastian-nagel sebastian-nagel left a comment

+1 looks good!

@lfoppiano (Collaborator, Author)

👍🙏 thanks

@lfoppiano lfoppiano merged commit 352b140 into main Feb 11, 2026
1 check passed
@lfoppiano lfoppiano deleted the luca/feature/part4 branch February 11, 2026 14:52