This repository contains a collection of Python scripts designed for a comprehensive pipeline to analyze political tweets, specifically focusing on detecting stances related to the Awami League in Bangladesh. The project leverages Natural Language Processing (NLP), Large Language Models (LLMs), and Knowledge Graphs to preprocess data, classify stances, and build a rich, queryable knowledge base.
The core of this project is the GRASP-ChoQ (Graph-based Relational and Semantic Prompting with Chain-of-Question) method, a novel prompting technique for enhanced political stance detection.
This project provides a complete workflow for political text analysis:
- Preprocessing: Cleaning and filtering raw tweet datasets to retain relevant, high-quality data.
- Translation: Translating non-English tweets into English for consistent processing.
- Entity Analysis: Identifying key entities (people, organizations) from the text, analyzing their frequency, and visualizing them as a word cloud.
- Knowledge Graph: Building a Neo4j knowledge graph from the extracted entities using Wikipedia data to create structured relational information.
- Stance Detection: Classifying the political stance of tweets (in favor of or against a target entity) using various LLM-based prompting strategies, including Zero-Shot, Few-Shot, and the advanced GRASP-ChoQ method.
- Data Cleaning: Robust preprocessing script to filter tweets by content length, user bio, and media attachments.
- LLM-Powered Translation: Translates text while preserving proper nouns.
- Entity Recognition (NER): Identifies and tags named entities in the text corpus.
- Knowledge Graph Builder: Automatically constructs a Neo4j graph from entities using Wikipedia as a data source.
- Vector-based Retrieval: Implements hybrid search (semantic and keyword-based) on the knowledge graph.
- Multiple Stance Detection Models:
- Zero-Shot: Classify stance without examples.
- Few-Shot: Improve accuracy with in-prompt examples.
- Few-Shot with RAG: Enhance classification by providing external context.
- GRASP-ChoQ: A sophisticated model using chain-of-question prompts and multiple information sources for nuanced stance detection.
The project is structured as a modular pipeline. Here is the typical workflow:
preprocess.py: Start with a raw Twitter dataset (.xlsxor.csv) and generate a cleanedpreprocessed_dataset.xlsx.translation.py: Take the preprocessed data and translate the tweets into English, saving the result totranslated_dataset.xlsx.entities.py: Analyze the translated text to generate awordcloud.pngand anentities.csvfile containing top words and their NER tags.knowledge_graph_builder.py: Use theentities.csvto populate a Neo4j database with a knowledge graph.- Stance Detection Scripts: Use one of the classification scripts (
zero_shot.py,few_shot.py,grasp_choq.py) on the translated data. These scripts can leverage the knowledge graph viaretrieve_from_graph.pyto get contextual information.
-
Clone the repository:
git clone https://github.com/Programming-Dude/GRASP-ChoQ.git cd GRASP-ChoQ -
Create a virtual environment and install dependencies:
python -m venv venv source venv/bin/activate # On Windows, use `venv\Scripts\activate` pip install -r requirements.txt
Note: You will need to create a
requirements.txtfile.pandas wordcloud matplotlib nltk spacy langchain-openai langchain-community langchain-experimental langchain python-dotenv openai neo4j langchain-huggingface torch sentence-transformers openpyxl -
Download NLP models: Run the following in a Python interpreter to download necessary models for
nltkandspacy:import nltk import spacy nltk.download('stopwords') nltk.download('punkt') spacy.cli.download("en_core_web_sm")
-
Set up environment variables: Create a
.envfile in the root directory and add your API keys and database credentials. This project uses OpenRouter.ai for LLM access and Neo4j for the knowledge graph.# Get your key from https://openrouter.ai/keys OPENROUTER_API_KEY="your-openrouter-api-key" # Neo4j database credentials NEO4J_URI="bolt://localhost:7687" NEO4J_USERNAME="neo4j" NEO4J_PASSWORD="your-neo4j-password"
-
Set up Neo4j: Ensure you have a running Neo4j instance. You can use Neo4j Desktop or a Docker container.
Each script can be run individually. Follow the pipeline steps for a full workflow.
Place your raw dataset (e.g., BPDisC_with_stance.xlsx) in the root folder and run:
python preprocess.pyThis script filters tweets based on criteria like word count and user bio, saving the output to preprocessed_BPDisC_dataset.xlsx.
Translate the preprocessed tweets into English:
python translation.pyThis will create a BPDisC_translated.xlsx file with a new translation column.
Analyze the translated text to find key entities:
python entities.pyThis script reads the translated Excel file, generates a wordcloud.png image, and saves the top 100 words and their NER tags to stop_words.csv.
Build the knowledge graph from the extracted entities. Make sure your Neo4j database is running.
python knowledge_graph_builder.pyThis will populate the graph with nodes and relationships based on Wikipedia articles for the top entities.
You can classify tweet stances using different methods. Each script includes an example in its if __name__ == "__main__": block.
For simple, direct classification:
python zero_shot.pyFor improved accuracy using examples:
python few_shot.pyFor the most advanced classification using retrieval-augmented generation and chain-of-question prompting:
- Use
retrieve_from_graph.pyto fetch context for a given tweet. - Feed the tweet and the retrieved context into the
detect_stance_grasp_choqfunction ingrasp_choq.py.
Example:
python grasp_choq.pyThis project is licensed under the MIT License. See the LICENSE file for details.