```
wget http://www.da.inf.ethz.ch/teaching/2018/CIL/material/exercise/twitter-datasets.zip
```
`python3 infrastructure/convert.py` prepares the train and test datasets, merging positive and negative tweets with the appropriate labels (1: positive tweets, 0: negative tweets).
`python3 preprocessing/preprocess_punctuation.py 'path-to-csv-data' 0/1` (0: training data preprocessing, 1: test data preprocessing)
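For reference, a minimal sketch of what the conversion step does, assuming the unzipped archive provides one tweet per line in `train_pos.txt` and `train_neg.txt` (the exact filenames and CSV layout are those used by `convert.py`):

```python
import csv

# Hypothetical filenames; the actual layout is defined in infrastructure/convert.py.
with open("data/train_pos.txt", encoding="utf-8") as pos, \
     open("data/train_neg.txt", encoding="utf-8") as neg, \
     open("data/train.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["tweet", "label"])
    for line in pos:
        writer.writerow([line.strip(), 1])  # 1: positive tweets
    for line in neg:
        writer.writerow([line.strip(), 0])  # 0: negative tweets
```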
The baseline models can be found in the `implementations/baselines` folder.
To download the essential data for this project:
```
mkdir data
cd data
wget http://www.da.inf.ethz.ch/teaching/2018/CIL/material/exercise/twitter-datasets.zip
```
Download the Twitter pretrained embeddings from https://nlp.stanford.edu/projects/glove/, then create the working directories:
```
mkdir interm_data
mkdir final_data
```
To build a co-occurrence matrix, run the following commands:
Note that the `cooc.py` script takes a few minutes to run and displays the number of tweets processed.
```
./infrastructure/build_vocab.sh
./infrastructure/cut_vocab.sh
python3 ./infrastructure/pickle_vocab.py
python3 ./infrastructure/cooc.py
```
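As a rough illustration of what `cooc.py` computes (a sketch, not the exact script; the pickled vocabulary name `vocab.pkl` and the tweet file path are assumptions):

```python
import pickle
from scipy.sparse import lil_matrix

with open("vocab.pkl", "rb") as f:          # word -> index map from pickle_vocab.py
    vocab = pickle.load(f)

cooc = lil_matrix((len(vocab), len(vocab)))
with open("data/train_pos.txt", encoding="utf-8") as f:
    for i, tweet in enumerate(f):
        tokens = [vocab[w] for w in tweet.split() if w in vocab]
        for t1 in tokens:                   # count every co-occurring pair in the tweet
            for t2 in tokens:
                cooc[t1, t2] += 1
        if i % 10000 == 0:
            print(f"{i} tweets processed")  # the real script reports progress similarly
```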
(Optional) For preprocessing:
```
python3 ./preprocessing/preprocess_baselines.py 'path-to-txt-data' 0/1
```
(run for each data file separately; 0: training data preprocessing, 1: test data preprocessing)
(Optional) For removing duplicates:
```
python3 ./infrastructure/deduplication.py
```
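Conceptually, deduplication just drops repeated tweets while keeping the first occurrence of each; a minimal sketch (file paths are placeholders):

```python
def deduplicate(in_path: str, out_path: str) -> None:
    """Write each distinct line of in_path to out_path, preserving order."""
    seen = set()
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line not in seen:
                seen.add(line)
                dst.write(line)

deduplicate("data/train_pos.txt", "interm_data/train_pos_dedup.txt")
```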
For manually computing GloVe embeddings (adjust the number of dimensions in the script):
```
python3 ./infrastructure/glove_compute.py
```
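The script factorizes the co-occurrence matrix with SGD; a condensed sketch of one GloVe-style training loop (the dimension, learning rate, file names, and output path are assumptions, and the weighting follows the standard GloVe formulation):

```python
import pickle
import numpy as np

embedding_dim = 20                      # change number of dimensions here
eta, alpha, nmax, epochs = 0.001, 0.75, 100, 10

with open("cooc.pkl", "rb") as f:       # assumption: cooc.py pickles the sparse matrix
    cooc = pickle.load(f).tocoo()

xs = np.random.normal(size=(cooc.shape[0], embedding_dim)) / embedding_dim
ys = np.random.normal(size=(cooc.shape[1], embedding_dim)) / embedding_dim

for _ in range(epochs):
    for i, j, n in zip(cooc.row, cooc.col, cooc.data):
        f_n = min(1.0, (n / nmax) ** alpha)         # standard GloVe weighting
        err = np.log(n) - xs[i].dot(ys[j])          # fit log co-occurrence count
        grad_x = f_n * err * ys[j]
        grad_y = f_n * err * xs[i]
        xs[i] += eta * grad_x
        ys[j] += eta * grad_y

np.save("interm_data/embeddings.npy", xs)           # placeholder output path
```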
(Optional) For using pretrained GloVe embeddings (requires the manual embedding computation above):
```
python3 ./infrastructure/glove_pretrained.py
```
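Loading the Stanford embeddings amounts to parsing the downloaded text file into a vector per vocabulary word; a sketch (the file name corresponds to the 25-dimensional Twitter GloVe release, and the vocabulary path is an assumption):

```python
import pickle
import numpy as np

with open("vocab.pkl", "rb") as f:
    vocab = pickle.load(f)

embeddings = np.zeros((len(vocab), 25))             # 25-d Twitter GloVe vectors
with open("glove.twitter.27B.25d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        if word in vocab:                           # keep only in-vocabulary words
            embeddings[vocab[word]] = np.array(values, dtype=float)
```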
For computing tweet embeddings:
```
python3 ./infrastructure/infrastructure.py              # for manual embeddings
python3 ./infrastructure/infrastructure_pretrained.py   # for pretrained embeddings
```
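Both scripts turn each tweet into a fixed-length feature vector; the usual construction, sketched below as an illustration of the idea (not necessarily the exact scripts), averages the embeddings of the tweet's in-vocabulary words:

```python
import numpy as np

def tweet_embedding(tweet: str, vocab: dict, embeddings: np.ndarray) -> np.ndarray:
    """Average the embedding vectors of all in-vocabulary tokens of a tweet."""
    idx = [vocab[w] for w in tweet.split() if w in vocab]
    if not idx:                                   # no known word: fall back to zeros
        return np.zeros(embeddings.shape[1])
    return embeddings[idx].mean(axis=0)
```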
For the classification task:
```
python3 ./implementations/baselines/svm.py
python3 ./implementations/baselines/xgboost_impl.py
python3 ./implementations/baselines/logistic.py
```
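Each baseline trains a standard classifier on the tweet embeddings; for example, a logistic-regression sketch with scikit-learn (the `.npy` paths are placeholders, not the project's actual file names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = np.load("final_data/tweet_embeddings.npy")    # placeholder paths
y = np.load("final_data/labels.npy")

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```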
Our LSTM-based model can be found in the `implementations/lstm` folder.
For generating word indexes and word embeddings:
```
python3 implementations/lstm/process_data.py
```
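In essence, this step maps each tweet to a fixed-length sequence of vocabulary indices that the LSTM consumes; a sketch of the idea (the padding length and the convention of reserving index 0 for padding are assumptions):

```python
import numpy as np

def tweets_to_sequences(tweets, vocab, max_len=40, pad_idx=0):
    """Encode tweets as padded/truncated index sequences (index 0 reserved for padding)."""
    seqs = np.full((len(tweets), max_len), pad_idx, dtype=np.int64)
    for row, tweet in enumerate(tweets):
        idx = [vocab[w] for w in tweet.split() if w in vocab][:max_len]
        seqs[row, :len(idx)] = idx
    return seqs
```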
For training/fine-tuning/testing the model:
```
python3 implementations/lstm/sentiment.py
```
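The model itself is the usual embedding → LSTM → classifier stack; a minimal PyTorch sketch of such an architecture (layer sizes are illustrative, and the actual `sentiment.py` configuration may differ):

```python
import torch.nn as nn

class LSTMSentiment(nn.Module):
    def __init__(self, vocab_size, embed_dim=200, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 2)        # 2 classes: negative/positive

    def forward(self, x):                         # x: (batch, seq_len) of word indices
        _, (h_n, _) = self.lstm(self.embed(x))
        return self.fc(h_n[-1])                   # logits from the last hidden state
```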
The implementations related to BERT are contained in the `implementations/bert` folder, which holds scripts for:
- fine-tuning pretrained BERT models in `bert.py`
- fine-tuning pretrained BERT models using the huggingface library in `bert_huggging_face.py`
- further pretraining the BERT model in `retrain/retrain.sh` and fine-tuning the generated model with `retrain/fine_tune.sh`
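As an illustration of the huggingface variant, a minimal fine-tuning sketch with the transformers library (hyperparameters are placeholders, and the toy dataset stands in for the converted tweet data; this is not the project's exact configuration):

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Toy stand-in for the converted tweet CSVs (1: positive, 0: negative).
data = Dataset.from_dict({"text": ["great day", "awful service"], "label": [1, 0]})
data = data.map(lambda b: tokenizer(b["text"], truncation=True,
                                    padding="max_length", max_length=64))

args = TrainingArguments(output_dir="bert_out", num_train_epochs=1,
                         per_device_train_batch_size=8, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=data).train()
```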
The implementations related to RoBERTa are contained in the `implementations/robert` folder, which holds scripts for:
- fine-tuning RoBERTa base in `roberta_base.py`
- fine-tuning RoBERTa large in `roberta_large.py`
- fine-tuning RoBERTa large with an additional BiLSTM layer in `roberta_large_lstm.py`
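For the BiLSTM variant, the idea is to run a bidirectional LSTM over RoBERTa's token representations before classifying; a hedged PyTorch sketch of such an architecture (not the exact implementation, and the hidden size is illustrative):

```python
import torch.nn as nn
from transformers import RobertaModel

class RobertaBiLSTM(nn.Module):
    def __init__(self, hidden_dim=256, num_labels=2):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained("roberta-large")
        self.bilstm = nn.LSTM(self.roberta.config.hidden_size, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.roberta(input_ids, attention_mask=attention_mask).last_hidden_state
        out, _ = self.bilstm(hidden)              # (batch, seq_len, 2 * hidden_dim)
        return self.fc(out[:, 0])                 # classify from the <s> position
```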
We used two ensembling methods to optimize our final model:
- Linear regression
- XGBoost
The respective implementations can be found in the `implementations/ensembling` folder.
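Both ensembles fit a meta-model on the per-model predictions; a linear-regression sketch of the idea, with random arrays standing in for the positive-class probabilities each base model produces on a held-out validation set and on the test set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Placeholder stand-ins for per-model positive-class probabilities
# (e.g. one column each for the LSTM, BERT, and RoBERTa models).
val_preds = rng.random((1000, 3))
val_labels = rng.integers(0, 2, 1000)
test_preds = rng.random((200, 3))

meta = LinearRegression().fit(val_preds, val_labels)     # learn blending weights
final = (meta.predict(test_preds) >= 0.5).astype(int)    # 1: positive, 0: negative
```

The XGBoost variant replaces the linear meta-model with a gradient-boosted one over the same stacked predictions.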