
Refactor: Modularize utils.py into core package #180

Open

denis-samatov wants to merge 1 commit into VectifyAI:main from denis-samatov:feature/utils-split

Conversation

@denis-samatov

This PR addresses the feedback from previous reviews by splitting the monolithic utils.py into smaller, focused modules to improve maintainability.

Changes:

  • Extracted logic from utils.py into a new pageindex/core/ package.
  • Created dedicated modules: llm.py, pdf.py, tree.py, and logging.py.
  • Kept the original utils.py as a backward-compatible facade so existing integrations won't break.
  • Added a pytest suite for the new core modules.

This change focuses strictly on the restructuring without altering the underlying logic, making it safe and easy to review.

Copilot AI review requested due to automatic review settings March 26, 2026 12:08

@claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.


Copilot AI left a comment


Pull request overview

This PR refactors the former monolithic pageindex/utils.py into a new pageindex/core/ package, and adds initial unit tests for the extracted modules.

Changes:

  • Introduces pageindex/core/{llm,pdf,tree,logging}.py with the previously centralized utilities split by domain.
  • Replaces pageindex/utils.py with a facade that re-exports symbols from pageindex/core/*.
  • Adds basic pytest coverage for tree and LLM JSON/token helpers, plus a test conftest.py.

Reviewed changes

Copilot reviewed 14 out of 15 changed files in this pull request and generated 6 comments.

Summary per file:

  • pageindex/utils.py: Replaced the old monolith with a re-export facade to the new core modules.
  • pageindex/core/llm.py: New OpenAI/tiktoken-based LLM utilities and JSON extraction.
  • pageindex/core/pdf.py: New PDF text extraction + per-page token counting helpers.
  • pageindex/core/tree.py: New tree manipulation, traversal, and summarization helpers.
  • pageindex/core/logging.py: New JSON logger extracted into its own module.
  • tests/test_llm.py: Adds tests for extract_json() and count_tokens().
  • tests/test_tree.py: Adds tests for list_to_tree(), structure_to_list(), and write_node_id().
  • tests/conftest.py: Adds a test-time path tweak for imports.
  • pageindex.egg-info/*: Adds generated distribution metadata files.


Comment on lines +23 to +26

```python
        os.makedirs("./logs", exist_ok=True)
        # Initialize empty list to store all messages
        self.log_data: List[Dict[str, Any]] = []
```

Copilot AI Mar 26, 2026

List is used in the type annotation for self.log_data but is not imported from typing, which will raise NameError at class initialization time (since annotations are evaluated at runtime without from __future__ import annotations). Import List (or switch to built-in list[...] generics) to fix.

Copilot uses AI. Check for mistakes.
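To illustrate the failure mode, a minimal sketch (the class name `JSONLogger` is hypothetical; the diff does not show the enclosing class):

```python
from typing import Any, Dict, List  # `List` is the import the diff is missing


class JSONLogger:
    """Minimal stand-in for the logger class in pageindex/core/logging.py."""

    def __init__(self) -> None:
        # Annotations on attribute targets like `self.log_data` are
        # evaluated at runtime, so without the `typing.List` import this
        # line raises NameError as soon as the class is instantiated.
        self.log_data: List[Dict[str, Any]] = []


logger = JSONLogger()
print(logger.log_data)  # []
```

On Python 3.9+, switching to the built-in generics (`self.log_data: list[dict[str, Any]] = []`) avoids the `List`/`Dict` imports entirely.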
Comment on lines +143 to +151

```python
    enc = tiktoken.encoding_for_model(model)
    if pdf_parser == "PyPDF2":
        pdf_reader = PyPDF2.PdfReader(pdf_path)
        page_list = []
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.pages[page_num]
            page_text = page.extract_text()
            token_length = len(enc.encode(page_text))
            page_list.append((page_text, token_length))
```
Copilot AI Mar 26, 2026

get_page_tokens() calls tiktoken.encoding_for_model(model) without handling KeyError, unlike pageindex.core.llm.count_tokens() which falls back to cl100k_base. This can crash token counting for unknown/new model names (including the current default). Consider reusing count_tokens() or adding the same fallback logic here.

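The fallback pattern the reviewer describes, sketched with the lookup functions injected so it can be read independently of tiktoken (in the real module these would be `tiktoken.encoding_for_model` and `tiktoken.get_encoding`):

```python
from typing import Any, Callable


def resolve_encoding(
    model: str,
    encoding_for_model: Callable[[str], Any],
    get_encoding: Callable[[str], Any],
) -> Any:
    """Resolve an encoding for `model`, falling back to cl100k_base when
    the model name is unknown (KeyError), mirroring count_tokens()."""
    try:
        return encoding_for_model(model)
    except KeyError:
        # Unknown or newly released model names raise KeyError in
        # tiktoken; fall back instead of crashing token counting.
        return get_encoding("cl100k_base")


# Toy stand-ins for tiktoken's lookup functions:
KNOWN = {"gpt-4o": "o200k_base"}


def fake_encoding_for_model(model: str) -> str:
    return KNOWN[model]  # raises KeyError for unknown models


def fake_get_encoding(name: str) -> str:
    return name


print(resolve_encoding("gpt-4o", fake_encoding_for_model, fake_get_encoding))        # o200k_base
print(resolve_encoding("future-model", fake_encoding_for_model, fake_get_encoding))  # cl100k_base
```

In get_page_tokens() itself, the simplest fix is probably to call the existing count_tokens() helper per page rather than duplicating the try/except.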
Comment on lines +4 to +6

```python
# Add src to python path for testing
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../src")))
```
Copilot AI Mar 26, 2026

This conftest.py inserts ../src into sys.path, but this repository does not have a src/ directory. This can mask import/path issues and makes the tests dependent on a non-existent layout. Either remove this path manipulation or point it at the actual project root/package location.

Suggested change

```diff
-# Add src to python path for testing
-sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../src")))
```

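Alternatively, if the tests do need a path tweak, a conftest.py pointed at the actual repository root (one level above tests/) might look like this; a sketch assuming the pageindex package lives at the repo root:

```python
import os
import sys

# Make the repository root (the parent of tests/) importable so that
# `import pageindex` resolves without installing the package.
REPO_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
if REPO_ROOT not in sys.path:
    sys.path.insert(0, REPO_ROOT)
```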
Comment on lines +1 to +35

```python
import tiktoken
import openai
import logging
import os
import time
import json
import asyncio
from typing import Optional, List, Dict, Any, Union, Tuple
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or os.getenv("CHATGPT_API_KEY")


def count_tokens(text: Optional[str], model: str = "gpt-4o") -> int:
    """
    Count the number of tokens in a text string using the specified model's encoding.

    Args:
        text (Optional[str]): The text to encode. If None, returns 0.
        model (str): The model name to use for encoding. Defaults to "gpt-4o".

    Returns:
        int: The number of tokens.
    """
    if not text:
        return 0
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback for newer or unknown models
        enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return len(tokens)
```

Copilot AI Mar 26, 2026

The PR description says this is a refactor without logic changes, but pageindex.core.llm changes both the LLM provider layer (from litellm to the openai SDK) and the token counting implementation (from litellm.token_counter to tiktoken). This is a functional behavior change (model support, tokenization differences, retry/error semantics). Either update the PR description/scope or add compatibility wrappers to keep behavior consistent with the previous utils.py implementation.

Comment on lines +1 to +6

```
data
docs
notebooks
pageindex
scripts
tests
```
Copilot AI Mar 26, 2026

These *.egg-info files look like generated packaging artifacts from a local build/install (and top_level.txt even lists non-importable directories like data/, docs/, tests/). Committing them is brittle and often incorrect across environments. Prefer removing pageindex.egg-info/ from version control and adding it to .gitignore, and manage dependencies/metadata via a real packaging config (e.g., pyproject.toml / setup.cfg).

Suggested change

```diff
-data
-docs
-notebooks
-pageindex
-scripts
-tests
+pageindex
```

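For example, the corresponding .gitignore entries might look like this (a sketch; exact patterns depend on the build setup):

```
# Python packaging / build artifacts
*.egg-info/
build/
dist/
```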
```python
from .core.llm import *
from .core.pdf import *
from .core.tree import *
from .core.logging import *
```
Copilot AI Mar 26, 2026

pageindex/utils.py is no longer a backward-compatible facade: it only star-imports from pageindex.core.*, but legacy symbols like llm_completion / llm_acompletion are not defined anywhere in pageindex.core.llm. This breaks existing internal callers (e.g. pageindex/page_index.py imports from .utils import * and calls those functions) and will raise NameError at runtime. Reintroduce these legacy functions (as wrappers/aliases) or explicitly re-export them from the new core modules to preserve the old API surface.

Suggested change

```python
from .core.logging import *

# Explicit import of the core llm module so we can provide
# backward-compatible wrappers for legacy utility functions.
from .core import llm as _core_llm
import asyncio
import functools
from typing import Any


def llm_completion(*args: Any, **kwargs: Any) -> Any:
    """
    Backward-compatible wrapper for the legacy `llm_completion` symbol.
    Delegates to the underlying implementation in `pageindex.core.llm`,
    preferring a same-named function if it exists, and otherwise falling
    back to a possible new-style name (`completion`).
    """
    if hasattr(_core_llm, "llm_completion"):
        return _core_llm.llm_completion(*args, **kwargs)
    if hasattr(_core_llm, "completion"):
        return _core_llm.completion(*args, **kwargs)
    raise RuntimeError(
        "llm_completion is not available in pageindex.core.llm. "
        "Please update your code to use the currently supported API."
    )


async def llm_acompletion(*args: Any, **kwargs: Any) -> Any:
    """
    Backward-compatible async wrapper for the legacy `llm_acompletion` symbol.
    Delegates to an async implementation in `pageindex.core.llm` if present,
    or falls back to running the synchronous `llm_completion` in a thread
    executor to preserve behavior for async callers.
    """
    if hasattr(_core_llm, "llm_acompletion"):
        return await _core_llm.llm_acompletion(*args, **kwargs)
    if hasattr(_core_llm, "acompletion"):
        return await _core_llm.acompletion(*args, **kwargs)
    # Fallback: run the sync implementation in a thread, passing kwargs
    # via functools.partial since run_in_executor does not accept them.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, functools.partial(llm_completion, *args, **kwargs)
    )
```

@denis-samatov
Author

@copilot open a new pull request to apply changes based on the comments in this thread
