IdeaMiner

A two-stage LLM pipeline for generating and evaluating novel scientific research questions.

Official Site  GitHub 

IdeaMiner is a two-stage pipeline for generating and evaluating novel scientific research questions using LLM agents. It covers a broad taxonomy of academic disciplines and produces ranked, deduplicated research questions scored on novelty, feasibility, and significance.


🌐 Try It Online

Visit our official platform to explore AI-generated research ideas across disciplines — no setup required.

IdeaMiner – My Library

Browse and save ideas from your personal library. Each card shows the research question along with its key topic tags.


Idea Detail View — Each ranked question comes with Background, Significance, Methodology, and Rationale, alongside novelty, feasibility, and significance scores.

Personalized Profile — Set your research domain and experience level so the platform surfaces the most relevant ideas for you.

Interaction Buttons

Quick-action buttons let you skip, dislike, like, copy, or navigate between ideas with a single click.


⚙️ How It Works

```mermaid
flowchart TD
    A["📄 Config File<br>field · keywords · research_type · granularity"]
    A --> B["🤖 Step 1 · Generator<br>agents/step_1_generator.py"]
    B --> C["📝 30 Raw Research Questions<br>data/raw_questions/*.json"]
    C --> D["🔍 Step 2 · Evaluator<br>agents/step_2_evaluator.py"]
    D --> E["🧹 Deduplication<br>Embedding-based Cosine Similarity"]
    E --> F["⭐ Group-Based Scoring<br>novelty · feasibility · significance"]
    F --> G["🏆 Ranked Questions<br>data/evaluated_questions/"]
```

Step 1 – Generation (agents/step_1_generator.py): Each config file specifies a scientific field, a set of keywords, a research type, and a granularity level. The generator prompts an LLM to produce 30 diverse and novel research questions.

Step 2 – Evaluation (agents/step_2_evaluator.py): The evaluator first deduplicates questions using embedding-based cosine similarity, then scores the remaining questions across multiple rounds using a group-based approach. Each group is assessed by one or more LLM models that can invoke a web_search tool to ground their evaluations in current literature.
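The greedy cosine-similarity filter described above can be sketched in a few lines of NumPy. This is an illustration only, not the evaluator's actual code: `embed`-style model calls are replaced by precomputed stand-in vectors, and the 0.85 threshold mirrors the `--similarity_threshold` default.

```python
import numpy as np

def deduplicate(questions, embeddings, threshold=0.85):
    """Keep a question only if its cosine similarity to every
    previously kept question stays below `threshold`."""
    kept, kept_vecs = [], []
    for question, vec in zip(questions, embeddings):
        vec = vec / np.linalg.norm(vec)  # unit-normalize so dot product = cosine
        if all(float(vec @ k) < threshold for k in kept_vecs):
            kept.append(question)
            kept_vecs.append(vec)
    return kept
```

The first occurrence of a near-duplicate cluster survives; later questions whose embedding is too close to any kept one are dropped.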


📂 Project Structure

```
IdeaMiner/
├── agents/
│   ├── step_1_generator.py   # Question generation agent
│   └── step_2_evaluator.py   # Question evaluation and ranking agent
├── utils/
│   ├── langchain_agent.py    # Async LangChain agent with tool support
│   ├── langchain_tools.py    # web_search and paper_search tools
│   ├── langchain_utils.py    # Custom embeddings with HuggingFace tokenizer support
│   └── tools.py              # Standalone Semantic Scholar search function
├── configs/
│   └── subject.py            # Academic discipline taxonomy and config generator
├── sh/
│   ├── 1_gen.sh              # Batch generation script
│   └── 2_eval.sh             # Batch evaluation script
├── assets/                   # Images for README and documentation
├── data/
│   ├── raw_questions/        # Output of Step 1 (git-ignored)
│   └── evaluated_questions/  # Output of Step 2 (git-ignored)
├── logs/                     # Runtime logs (git-ignored)
├── .env.example              # Environment variable template
├── requirements.txt          # Python dependencies
└── LICENSE                   # MIT License
```

📦 Dependencies

This project uses StructAI as its core utility library, which provides `LLMAgent`, `load_file`, `save_file`, and other helpers used throughout the codebase.


🚀 Setup

1. Install dependencies

```shell
pip install -r requirements.txt
```

2. Configure environment variables

```shell
cp .env.example .env
# Edit .env and fill in your API keys
```

Required variables:

| Variable | Description |
| --- | --- |
| `LLM_API_KEY` | API key for your OpenAI-compatible LLM provider |
| `LLM_BASE_URL` | Base URL of the API (default: `https://api.openai.com/v1`) |
| `TAVILY_API_KEYS` | Comma-separated Tavily search API keys (or use `TAVILY_API_KEY`) |

Optional variables:

| Variable | Description |
| --- | --- |
| `SEMANTIC_SCHOLAR_API_KEY` | Increases the Semantic Scholar API rate limit |

3. Generate config files

The configs/subject.py script generates random experiment configs and writes them to configs/:

```shell
python configs/subject.py
```

Or write your own JSON config:

```json
{
    "field": "Life Sciences",
    "keywords": ["Genomics", "CRISPR", "Epigenetics"],
    "research_type": "Experiment",
    "granularity_level": "Microscopic"
}
```
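Configs with this schema can also be produced programmatically. A minimal sketch (the field, keywords, and output filename here are illustrative, not prescribed by the project):

```python
import json

# Illustrative config matching the JSON schema above; values are examples only.
config = {
    "field": "Life Sciences",
    "keywords": ["Genomics", "CRISPR", "Epigenetics"],
    "research_type": "Experiment",
    "granularity_level": "Microscopic",
}

with open("my_config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=4, ensure_ascii=False)
```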

💻 Usage

Run the full pipeline

```shell
# Step 1: Generate questions for all configs
./sh/1_gen.sh

# Step 2: Evaluate and rank the generated questions
./sh/2_eval.sh
```

Run individual steps

```shell
# Generate questions for a single config
python agents/step_1_generator.py --config_path configs/my_config.json

# Evaluate a single raw question file
python agents/step_2_evaluator.py \
    --input_file data/raw_questions/my_config.json \
    --output_dir data/evaluated_questions/my_config/ \
    --field "Life Sciences" \
    --models gpt-4o-mini \
    --comparison_rounds 3 \
    --group_size 5
```

Key parameters for evaluation

| Parameter | Default | Description |
| --- | --- | --- |
| `--similarity_threshold` | 0.85 | Cosine similarity threshold for duplicate removal |
| `--filter_batch_size` | 50 | Questions per filtering batch |
| `--comparison_rounds` | 3 | Number of scoring rounds per question |
| `--group_size` | 5 | Questions per scoring group |
| `--models` | gpt-4o-mini | Space-separated list of scorer models |
| `--max_concurrent_tasks` | 32 | Maximum number of parallel async scoring tasks |
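Capping parallel scoring tasks, as `--max_concurrent_tasks` does, follows the standard asyncio pattern of gating coroutines behind a semaphore. A minimal sketch; the scoring coroutine is a placeholder, not the evaluator's actual code:

```python
import asyncio

async def score_group(group, semaphore):
    # Placeholder for a real LLM scoring call; the semaphore ensures
    # at most `max_concurrent_tasks` groups are in flight at once.
    async with semaphore:
        await asyncio.sleep(0)  # stand-in for I/O-bound API work
        return {"group": group, "score": len(group)}

async def score_all(groups, max_concurrent_tasks=32):
    semaphore = asyncio.Semaphore(max_concurrent_tasks)
    tasks = [score_group(g, semaphore) for g in groups]
    return await asyncio.gather(*tasks)

results = asyncio.run(score_all([["q1", "q2"], ["q3"]], max_concurrent_tasks=2))
```

`asyncio.gather` preserves input order, so results line up with the original groups regardless of completion order.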

📊 Output Format

After evaluation, each output directory contains:

| File | Description |
| --- | --- |
| `filtered_questions.json` | Questions after deduplication |
| `evaluation_results.json` | Full results, including per-model scores |
| `ranked_questions.json` | Questions sorted by consensus score (best first) |
| `summary.json` | Statistics and the top-10 questions |

Each ranked question includes:

```json
{
    "question": "...",
    "background": "...",
    "average_scores": {
        "novelty": 8.2,
        "feasibility": 7.5,
        "significance": 8.8,
        "total": 8.17
    },
    "rank": 1
}
```
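The `total` in the example is consistent with a plain mean of the three criteria rounded to two decimals ((8.2 + 7.5 + 8.8) / 3 ≈ 8.17). A minimal sketch of recomputing it, assuming that aggregation rule (the actual evaluator may aggregate differently):

```python
def consensus_total(scores):
    # Mean of the three criteria, rounded to two decimals.
    vals = [scores["novelty"], scores["feasibility"], scores["significance"]]
    return round(sum(vals) / len(vals), 2)

print(consensus_total({"novelty": 8.2, "feasibility": 7.5, "significance": 8.8}))
# → 8.17, matching the `total` in the example above
```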

📬 Contact

  • GitHub Issues: Please open an issue for bug reports or feature requests
  • WeChat Mini Program:

WeChat Mini Program


🌟 Star History

If you find this work helpful, please consider giving this repo a star ⭐. Thanks for your support! 🤩

Star History Chart


📜 License

MIT License. See LICENSE for details.
