add the web scraping process for nlp paper and current version#100
Open
5e9e1bf to
3c44256
Compare
i-be-snek
requested changes
Sep 8, 2024
| @@ -0,0 +1,20 @@ | |||
| *** This is the web scraping process *** | |||
|
|
|||
| [] Wikipedia_articles/Web_scraping_wiki.py is the script for web scraping, and process the whole text as a header-content pair, which is the web scraping process of EN Wiki articles used for prompt V_3 | |||
Collaborator
Use `-` instead of `[]` for bullet points so the list renders properly in Markdown.
| type=str, | ||
| ) | ||
| parser.add_argument( | ||
| "-h", |
Collaborator
Do not name any flag `-h`: argparse already reserves it as the shorthand for `--help`. If you keep `-h` here, this is what a user sees when trying to get the docs for this script:
$ poetry run python3 Database/Wikipedia_articles/Web_scraping_wiki.py --help
Traceback (most recent call last):
File "/Users/shorouqza/Code/Wikimpacts/Database/Wikipedia_articles/Web_scraping_wiki.py", line 38, in <module>
parser.add_argument(
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1468, in add_argument
return self._add_action(action)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1850, in _add_action
self._optionals._add_action(action)
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1670, in _add_action
action = super(_ArgumentGroup, self)._add_action(action)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1482, in _add_action
self._check_conflict(action)
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1619, in _check_conflict
conflict_handler(action, confl_optionals)
File "/opt/homebrew/Cellar/python@3.11/3.11.8/Frameworks/Python.framework/Versions/3.11/lib/python3.11/argparse.py", line 1628, in _handle_conflict_error
raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument -h/--header: conflicting option string: -h
| "-h", | ||
| "--header", | ||
| dest="header", | ||
| help="The header for web scraping", |
Collaborator
Can you give an example of what the header could look like? Maybe in the README.md.
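For the `-h` conflict flagged above, one possible fix is to drop the short flag and keep only the long option. This is a minimal sketch, not the script's actual code: `build_parser` and `short_h_conflicts` are hypothetical helper names, and only the `--header` option, its `dest`, and its help text are taken from the diff.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that keeps --header but leaves -h free for --help."""
    parser = argparse.ArgumentParser(description="Web scraping (sketch)")
    # Long option only: argparse registers -h/--help by default,
    # so reusing -h for --header raises ArgumentError at startup.
    parser.add_argument(
        "--header",
        dest="header",
        help="The header for web scraping",
        type=str,
    )
    return parser


def short_h_conflicts() -> bool:
    """Return True if adding -h to a default parser raises ArgumentError."""
    try:
        argparse.ArgumentParser().add_argument("-h", "--header")
    except argparse.ArgumentError:
        return True
    return False


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.header)
```

With this shape, `--help` prints the usage text as expected and `--header "123"` is parsed normally. If a short flag is still wanted, any letter other than `-h` would avoid the conflict.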
| @@ -0,0 +1,143 @@ | |||
| import argparse | |||
Collaborator
I tried running both this file and Web_scraping_wiki.py with this command and in both cases I get a similar error:
$ poetry run python3 Database/Wikipedia_articles/Web_scraping_wiki.py --raw_dir Database/Wiki_dev_test_articles --filename wiki_test_whole_infobox_20240729_159single_events.json --output_dir Database --header "123"
web_scraping: 2024-09-08 21:17:40 INFO Passed args: Namespace(filename='wiki_test_whole_infobox_20240729_159single_events.json', raw_dir='Database/Wiki_dev_test_articles', output_dir='Database', header='123')
web_scraping: 2024-09-08 21:17:40 INFO Creating Database if it does not exist!
Traceback (most recent call last):
File "/Users/shorouqza/Code/Wikimpacts/Database/Wikipedia_articles/Web_scraping_wiki.py", line 64, in <module>
raw = pd.read_csv(f"{args.raw_dir}/{args.filename}", encoding="ISO-8859-1")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 583, in _read
return parser.read(nrows)
^^^^^^^^^^^^^^^^^^
File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1704, in read
) = self._engine.read( # type: ignore[attr-defined]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/shorouqza/Library/Caches/pypoetry/virtualenvs/wikimpacts-1uvlbl-K-py3.11/lib/python3.11/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
chunks = self._reader.read_low_memory(nrows)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pandas/_libs/parsers.pyx", line 814, in pandas._libs.parsers.TextReader.read_low_memory
File "pandas/_libs/parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 850, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._check_tokenize_status
File "pandas/_libs/parsers.pyx", line 2029, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
| @@ -0,0 +1,148 @@ | |||
| import argparse | |||
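The `ParserError` in the traceback above is consistent with `pd.read_csv` being called on a file whose name ends in `.json`, although that reading is an assumption and the intended input format is not stated in this thread. A minimal sketch of one way to guard against this is to pick the pandas reader from the file extension and fail early with a clear message. `READERS`, `reader_name_for`, and `load_raw` are hypothetical names, and the set of accepted formats is assumed.

```python
from pathlib import Path

# Hypothetical mapping from file extension to the pandas reader function name.
# Which formats the script should accept is an assumption.
READERS = {".csv": "read_csv", ".json": "read_json"}


def reader_name_for(filename: str) -> str:
    """Return the pandas reader name for a filename, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    try:
        return READERS[suffix]
    except KeyError:
        raise ValueError(
            f"Unsupported input format {suffix!r} for {filename}"
        ) from None


def load_raw(raw_dir: str, filename: str):
    """Load the raw file with the reader that matches its extension."""
    # Imported lazily so the extension check itself works without pandas.
    import pandas as pd

    reader = getattr(pd, reader_name_for(filename))
    return reader(Path(raw_dir) / filename)
```

The original call also passes `encoding="ISO-8859-1"`; whether that is still needed depends on the real input files, so it is left out of this sketch.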
@i-be-snek, this is also low priority. I just need to check that the files do not break the main branch. Thanks!