Refactor: Modularize utils.py into core package by denis-samatov · Pull Request #180 · VectifyAI/PageIndex

denis-samatov · 2026-03-26T12:08:18Z

This PR addresses the feedback from previous reviews by splitting the monolithic utils.py into smaller, focused modules to improve maintainability.

Changes:

Extracted logic from utils.py into a new pageindex/core/ package.
Created dedicated modules: llm.py, pdf.py, tree.py, and logging.py.
Kept the original utils.py as a backward-compatible facade so excellent integrations won't break.
Added a pytest suite for the new core modules.

This change focuses strictly on the restructuring without altering the underlying logic, making it safe and easy to review.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Copilot

Pull request overview

This PR refactors the former monolithic pageindex/utils.py into a new pageindex/core/ package, and adds initial unit tests for the extracted modules.

Changes:

Introduces pageindex/core/{llm,pdf,tree,logging}.py with the previously centralized utilities split by domain.
Replaces pageindex/utils.py with a facade that re-exports symbols from pageindex/core/*.
Adds basic pytest coverage for tree and LLM JSON/token helpers, plus a test conftest.py.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`pageindex/utils.py`	Replaced the old monolith with a re-export facade to the new `core` modules.
`pageindex/core/llm.py`	New OpenAI/tiktoken-based LLM utilities and JSON extraction.
`pageindex/core/pdf.py`	New PDF text extraction + per-page token counting helpers.
`pageindex/core/tree.py`	New tree manipulation, traversal, and summarization helpers.
`pageindex/core/logging.py`	New JSON logger extracted into its own module.
`tests/test_llm.py`	Adds tests for `extract_json()` and `count_tokens()`.
`tests/test_tree.py`	Adds tests for `list_to_tree()`, `structure_to_list()`, and `write_node_id()`.
`tests/conftest.py`	Adds a test-time path tweak for imports.
`pageindex.egg-info/*`	Adds generated distribution metadata files.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-26T12:14:18Z

pageindex/core/logging.py

+        os.makedirs("./logs", exist_ok=True)
+        # Initialize empty list to store all messages
+        self.log_data: List[Dict[str, Any]] = []
+


List is used in the type annotation for self.log_data but is not imported from typing, which will raise NameError at class initialization time (since annotations are evaluated at runtime without from __future__ import annotations). Import List (or switch to built-in list[...] generics) to fix.

Copilot · 2026-03-26T12:14:19Z

pageindex/core/pdf.py

+    enc = tiktoken.encoding_for_model(model)
+    if pdf_parser == "PyPDF2":
+        pdf_reader = PyPDF2.PdfReader(pdf_path)
+        page_list = []
+        for page_num in range(len(pdf_reader.pages)):
+            page = pdf_reader.pages[page_num]
+            page_text = page.extract_text()
+            token_length = len(enc.encode(page_text))
+            page_list.append((page_text, token_length))


get_page_tokens() calls tiktoken.encoding_for_model(model) without handling KeyError, unlike pageindex.core.llm.count_tokens() which falls back to cl100k_base. This can crash token counting for unknown/new model names (including the current default). Consider reusing count_tokens() or adding the same fallback logic here.

Copilot · 2026-03-26T12:14:19Z

tests/conftest.py

+
+# Add src to python path for testing
+sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../src")))


This conftest.py inserts ../src into sys.path, but this repository does not have a src/ directory. This can mask import/path issues and makes the tests dependent on a non-existent layout. Either remove this path manipulation or point it at the actual project root/package location.

Suggested change

# Add src to python path for testing

sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../src")))

Copilot · 2026-03-26T12:14:19Z

pageindex/core/llm.py

+import tiktoken
+import openai
+import logging
+import os
+import time
+import json
+import asyncio
+from typing import Optional, List, Dict, Any, Union, Tuple
+from dotenv import load_dotenv
+
+load_dotenv()
+
+OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or os.getenv("CHATGPT_API_KEY")
+
+def count_tokens(text: Optional[str], model: str = "gpt-4o") -> int:
+    """
+    Count the number of tokens in a text string using the specified model's encoding.
+
+    Args:
+        text (Optional[str]): The text to encode. If None, returns 0.
+        model (str): The model name to use for encoding. Defaults to "gpt-4o".
+
+    Returns:
+        int: The number of tokens.
+    """
+    if not text:
+        return 0
+    try:
+        enc = tiktoken.encoding_for_model(model)
+    except KeyError:
+         # Fallback for newer or unknown models
+        enc = tiktoken.get_encoding("cl100k_base")
+    tokens = enc.encode(text)
+    return len(tokens)
+


The PR description says this is a refactor without logic changes, but pageindex.core.llm changes both the LLM provider layer (from litellm to the openai SDK) and the token counting implementation (from litellm.token_counter to tiktoken). This is a functional behavior change (model support, tokenization differences, retry/error semantics). Either update the PR description/scope or add compatibility wrappers to keep behavior consistent with the previous utils.py implementation.

Copilot · 2026-03-26T12:14:20Z

pageindex.egg-info/top_level.txt

+data
+docs
+notebooks
+pageindex
+scripts
+tests


These *.egg-info files look like generated packaging artifacts from a local build/install (and top_level.txt even lists non-importable directories like data/, docs/, tests/). Committing them is brittle and often incorrect across environments. Prefer removing pageindex.egg-info/ from version control and adding it to .gitignore, and manage dependencies/metadata via a real packaging config (e.g., pyproject.toml / setup.cfg).

Suggested change

data

docs

notebooks

pageindex

scripts

tests

pageindex

Copilot · 2026-03-26T12:14:20Z

pageindex/utils.py

+from .core.llm import *
+from .core.pdf import *
+from .core.tree import *
+from .core.logging import *


pageindex/utils.py is no longer a backward-compatible facade: it only star-imports from pageindex.core.*, but legacy symbols like llm_completion / llm_acompletion are not defined anywhere in pageindex.core.llm. This breaks existing internal callers (e.g. pageindex/page_index.py imports from .utils import * and calls those functions) and will raise NameError at runtime. Reintroduce these legacy functions (as wrappers/aliases) or explicitly re-export them from the new core modules to preserve the old API surface.

Suggested change

from .core.logging import *

from .core.logging import *

# Explicit import of the core llm module so we can provide

# backward-compatible wrappers for legacy utility functions.

from .core import llm as _core_llm

import asyncio

from typing import Any

def llm_completion(*args: Any, **kwargs: Any) -> Any:

"""

Backward-compatible wrapper for the legacy `llm_completion` symbol.

Delegates to the underlying implementation in `pageindex.core.llm`,

preferring a same-named function if it exists, and otherwise falling

back to a possible new-style name (`completion`).

"""

if hasattr(_core_llm, "llm_completion"):

return _core_llm.llm_completion(*args, **kwargs)

if hasattr(_core_llm, "completion"):

return _core_llm.completion(*args, **kwargs)

raise RuntimeError(

"llm_completion is not available in pageindex.core.llm. "

"Please update your code to use the currently supported API."

)

async def llm_acompletion(*args: Any, **kwargs: Any) -> Any:

"""

Backward-compatible async wrapper for the legacy `llm_acompletion` symbol.

Delegates to an async implementation in `pageindex.core.llm` if present,

or falls back to running the synchronous `llm_completion` in a thread

executor to preserve behavior for async callers.

"""

if hasattr(_core_llm, "llm_acompletion"):

return await _core_llm.llm_acompletion(*args, **kwargs)

if hasattr(_core_llm, "acompletion"):

return await _core_llm.acompletion(*args, **kwargs)

# Fallback: run the sync implementation in a thread, if available.

loop = asyncio.get_running_loop()

return await loop.run_in_executor(None, llm_completion, *args, **kwargs)

denis-samatov · 2026-03-26T12:59:14Z

@copilot open a new pull request to apply changes based on the comments in this thread

refactor: modularize utils.py into core package

a10ad51

Copilot AI review requested due to automatic review settings March 26, 2026 12:08

claude bot reviewed Mar 26, 2026

View reviewed changes

Copilot started reviewing on behalf of denis-samatov March 26, 2026 12:08 View session

Copilot AI reviewed Mar 26, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor: Modularize utils.py into core package#180

Refactor: Modularize utils.py into core package#180
denis-samatov wants to merge 1 commit intoVectifyAI:mainfrom
denis-samatov:feature/utils-split

denis-samatov commented Mar 26, 2026

Uh oh!

claude bot left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

Copilot AI Mar 26, 2026

Uh oh!

denis-samatov commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		# Add src to python path for testing
		sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../src")))

-from .core.logging import *
+from .core.logging import *
+# Explicit import of the core llm module so we can provide
+# backward-compatible wrappers for legacy utility functions.
+from .core import llm as _core_llm
+import asyncio
+from typing import Any
+def llm_completion(*args: Any, **kwargs: Any) -> Any:
+    """
+    Backward-compatible wrapper for the legacy `llm_completion` symbol.
+    Delegates to the underlying implementation in `pageindex.core.llm`,
+    preferring a same-named function if it exists, and otherwise falling
+    back to a possible new-style name (`completion`).
+    """
+    if hasattr(_core_llm, "llm_completion"):
+        return _core_llm.llm_completion(*args, **kwargs)
+    if hasattr(_core_llm, "completion"):
+        return _core_llm.completion(*args, **kwargs)
+    raise RuntimeError(
+        "llm_completion is not available in pageindex.core.llm. "
+        "Please update your code to use the currently supported API."
+    )
+async def llm_acompletion(*args: Any, **kwargs: Any) -> Any:
+    """
+    Backward-compatible async wrapper for the legacy `llm_acompletion` symbol.
+    Delegates to an async implementation in `pageindex.core.llm` if present,
+    or falls back to running the synchronous `llm_completion` in a thread
+    executor to preserve behavior for async callers.
+    """
+    if hasattr(_core_llm, "llm_acompletion"):
+        return await _core_llm.llm_acompletion(*args, **kwargs)
+    if hasattr(_core_llm, "acompletion"):
+        return await _core_llm.acompletion(*args, **kwargs)
+    # Fallback: run the sync implementation in a thread, if available.
+    loop = asyncio.get_running_loop()
+    return await loop.run_in_executor(None, llm_completion, *args, **kwargs)

Conversation

denis-samatov commented Mar 26, 2026

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

Uh oh!

denis-samatov commented Mar 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants