
Marcofo21/Sentiment_Analysis


Sentiment Analysis of Climate Change Tweets

In this project, I use a dataset of tweets about climate change to train four classification models and compare their performance. The objective is to find the model that performs best out of sample, and more generally to explore how the models behave on this data.

Each tweet in the dataset is hand-labelled by humans with one of four categories:

  • 2 (News): the tweet links to factual news about climate change
  • 1 (Pro): the tweet supports the belief in man-made climate change
  • 0 (Neutral): the tweet neither supports nor refutes the belief in man-made climate change
  • -1 (Anti): the tweet rejects the belief in man-made climate change

I scrape and preprocess the tweets so that the features and the categories to predict are stored efficiently. The code allows any n-gram range to be used for the analysis.

I also resample the data so that the categories are balanced overall. This step is not strictly necessary, but I added it for potential future use.
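The resampling is implemented in resampling.py; as a minimal sketch of the idea (with illustrative column names and toy data), minority categories can be upsampled to the size of the largest one using scikit-learn's resample:

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: category -1 is under-represented.
df = pd.DataFrame({
    "message": [f"tweet {i}" for i in range(10)],
    "sentiment": [1, 1, 1, 1, 1, 1, 0, 0, 0, -1],
})

# Upsample every category to the size of the largest one.
target = df["sentiment"].value_counts().max()
balanced = pd.concat(
    resample(group, replace=True, n_samples=target, random_state=0)
    for _, group in df.groupby("sentiment")
)

print(balanced["sentiment"].value_counts().to_dict())  # all counts equal
```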

Then, I train four classification models using Bayesian Optimization for hyperparameter tuning. The four models are logistic regression, K-Nearest-Neighbors (KNN), Random Forests (RF), and Support Vector Machine (SVM).

Finally, I compute diagnostics in the form of confusion matrices, accuracy, precision, recall, and F1-scores for each model and each category. I also plot ROC curves for each model and category.
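These diagnostics can be computed with scikit-learn's metrics module; a self-contained sketch on toy labels (the real pipeline evaluates on held-out data):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy true and predicted labels over the four categories.
labels = [-1, 0, 1, 2]
y_true = np.array([2, 1, 1, 0, -1, 1, 0, 2, -1, 1])
y_pred = np.array([2, 1, 0, 0, -1, 1, 1, 2, 1, 1])

# Rows are true categories, columns are predicted ones.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)

# Precision, recall, and F1 per category, plus overall accuracy.
print(classification_report(y_true, y_pred, labels=labels))
```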

Structure

The project is organised as follows:

  • src/sentiment_analysis_climate_change_tweets
    • data: contains the tweets dataframe.
    • generate_random_number: contains the files and tasks used to generate the seed.
      • generate_random_number.py: contains the functions for this module.
      • task_generate_random_number.py: contains the task performed in this module.
    • model_evaluation: contains the files and tasks used for producing ROC curves and diagnostics.
      • model_evaluation.py: contains the functions for this module.
      • task_model_evaluation.py: contains the tasks performed in this module.
    • models_training: contains the files and tasks used to train and hypertune the models.
      • training_functions: contains the functions to train and hypertune the models.
        • knn.py: contains functions for k-nearest-neighbors.
        • logreg.py: contains functions for logistic regression.
        • rf.py: contains functions for random forests.
        • svm.py: contains functions for support vector machines.
      • task_models_training.py: contains the task to train and hypertune models.
    • preprocessing: contains the files and tasks used to perform the preprocessing of the tweets.
      • preprocessing.py: contains the functions for this module.
      • task_preprocessing.py: contains the tasks performed for preprocessing.
    • resampling: contains the files and tasks used to resample the data.
      • resampling.py: contains the functions for this module.
      • task_resampling.py: contains the tasks performed for resampling.
    • utilities: contains a Python file with functions used across the project.
      • utilities.py: the Python file.
  • tests: contains the tests for the project.
  • bld: contains the outputs of the project.
    • utilities: contains useful outputs used across the project.
      • random_number.txt: contains the random number used as seed.
    • python: contains the outputs of the python code.
      • data: contains the preprocessed and resampled data, and summary statistics for the features.
        • x.mtx: contains the features of the tweets in a sparse matrix format.
        • y.csv: contains the categories of the tweets.
        • summary_categories.csv: contains the summary statistics of the categories.
        • twitter_sentiment_data_resampled.csv: contains the resampled data.
      • model_evaluation: contains the outputs of the model evaluation.
        • confusion_matrices: contains the confusion matrices.
          • cm_knn.csv: contains the confusion matrix for the k-nearest-neighbors model.
          • cm_logreg.csv: contains the confusion matrix for the logistic regression model.
          • cm_rf.csv: contains the confusion matrix for the random forests model.
          • cm_svm.csv: contains the confusion matrix for the support vector machines model.
        • diagnostics: contains the diagnostics.
          • classification_report_knn.csv: contains the diagnostics for the k-nearest-neighbors model.
          • classification_report_logreg.csv: contains the diagnostics for the logistic regression model.
          • classification_report_rf.csv: contains the diagnostics for the random forests model.
          • classification_report_svm.csv: contains the diagnostics for the support vector machines model.
        • roc_curves: contains the ROC curves.
          • roc_knn.png: contains the plot of the ROC curve for the k-nearest-neighbors model.
          • roc_logreg.png: contains the plot of the ROC curve for the logistic regression model.
          • roc_rf.png: contains the plot of the ROC curve for the random forests model.
          • roc_svm.png: contains the plot of the ROC curve for the support vector machines model.
      • trained_models: contains the outputs of the trained models.
        • knn: contains the outputs of the k-nearest-neighbors model.
          • knn_model.pkl: contains the trained model.
          • knn_score_parameters.txt: contains preliminary score parameters and time for hypertuning and training.
        • logreg: contains the outputs of the logistic regression model.
          • logreg_model.pkl: contains the trained model.
          • logreg_score_parameters.txt: contains preliminary score parameters and time for hypertuning and training.
        • rf: contains the outputs of the random forests model.
          • rf_model.pkl: contains the trained model.
          • rf_score_parameters.txt: contains preliminary score parameters and time for hypertuning and training.
        • svm: contains the outputs of the support vector machines model.
          • svm_model.pkl: contains the trained model.
          • svm_score_parameters.txt: contains preliminary score parameters and time for hypertuning and training.
  • paper: contains the LaTeX file to produce a paper reporting the graphs.
    • graphs_paper.tex: contains the LaTeX code to produce the paper.
    • task_final.py: contains the task to produce the paper.
  • pyproject.toml: contains the project configuration.
  • setup.cfg: contains the project metadata.
  • environment.yml: contains the environment configuration.
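The features in x.mtx are stored in a sparse format; assuming that is the Matrix Market format (the usual meaning of the .mtx extension), they can be read back with scipy.io.mmread. A minimal round-trip sketch with a tiny matrix:

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp
from scipy.io import mmread, mmwrite

# Round-trip demo: write a small sparse matrix and read it back,
# the same way the project stores and loads the tweet features.
features = sp.csr_matrix(np.eye(3))
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x.mtx")
    mmwrite(path, features)          # write the sparse matrix to disk
    reloaded = mmread(path).tocsr()  # mmread returns a coo_matrix

# After building the project, the real features would load the same way:
#   x = mmread("bld/python/data/x.mtx").tocsr()
```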

Usage

To get started, create the environment from environment.yml and activate it with

$ conda env create -f environment.yml
$ conda activate sentiment_analysis_climate_change_tweets

To build the project, type

$ pytask

I recommend running the project in parallel. pytask supports parallel execution through the pytask-parallel plugin, which you can install by typing in the terminal

$ pip install pytask-parallel

You can then run the project in parallel by typing

$ pytask -n 4

where 4 is the number of cores you want to use (you can use as many as you like, but the project is unlikely to need more than 4).

Credits

This project was created with cookiecutter and the econ-project-templates.
