In case you want to run many of these queries, and you have a lot of disk space, …
> [!IMPORTANT]
> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run `make duck_ccf_local_files`.

To download the crawl index, use [cc-downloader](https://github.com/commoncrawl/cc-downloader), a polite downloader for Common Crawl data:

```shell
cargo install cc-downloader
```

`cc-downloader` will not be on your `PATH` by default, but you can run it by prefixing the full install path.
If `cargo` is not available or the install fails, please check [the official cc-downloader repository](https://github.com/commoncrawl/cc-downloader).
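If you would rather invoke `cc-downloader` by name instead of typing the full path, you can put cargo's install directory (by default `~/.cargo/bin`) on your `PATH` for the current shell session:

```shell
# Put cargo's default install directory at the front of PATH for this
# shell session, so cc-downloader can be invoked by name.
export PATH="$HOME/.cargo/bin:$PATH"
```

To make this permanent, append the same line to your shell's startup file (e.g. `~/.bashrc`).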

```shell
mkdir crawl
~/.cargo/bin/cc-downloader download-paths CC-MAIN-2024-22 cc-index-table crawl
~/.cargo/bin/cc-downloader download crawl/cc-index-table.paths.gz --progress crawl
```
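Once the download finishes, a quick sanity check (a sketch only — the exact directory layout depends on cc-downloader's output) is to count the parquet files that arrived:

```shell
# Count the index parquet files under the crawl/ directory; a completed
# download should yield a non-zero count, one file per index shard.
find crawl -name '*.parquet' | wc -l
```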

The resulting file structure should look something like this:
```shell
tree crawl/
crawl/
├── cc-index
│ └── table
│ └── cc-main
│ └── warc
│ └── crawl=CC-MAIN-2024-22
│ └── subset=warc
│ ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c000.gz.parquet
│ ├── part-00000-4dd72944-e9c0-41a1-9026-dfd2d0615bf2.c001.gz.parquet
```

Then run `make duck_local_files LOCAL_DIR=crawl` to execute the same query as above, this time against your local copy of the index files.

Both `make duck_ccf_local_files` and `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` run the same SQL query and should return the same record (written as a parquet file).
