Refactor: Modularize utils.py into core package#180
Refactor: Modularize utils.py into core package#180denis-samatov wants to merge 1 commit intoVectifyAI:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors the former monolithic pageindex/utils.py into a new pageindex/core/ package, and adds initial unit tests for the extracted modules.
Changes:
- Introduces
pageindex/core/{llm,pdf,tree,logging}.pywith the previously centralized utilities split by domain. - Replaces
pageindex/utils.pywith a facade that re-exports symbols frompageindex/core/*. - Adds basic pytest coverage for tree and LLM JSON/token helpers, plus a test
conftest.py.
Reviewed changes
Copilot reviewed 14 out of 15 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
pageindex/utils.py |
Replaced the old monolith with a re-export facade to the new core modules. |
pageindex/core/llm.py |
New OpenAI/tiktoken-based LLM utilities and JSON extraction. |
pageindex/core/pdf.py |
New PDF text extraction + per-page token counting helpers. |
pageindex/core/tree.py |
New tree manipulation, traversal, and summarization helpers. |
pageindex/core/logging.py |
New JSON logger extracted into its own module. |
tests/test_llm.py |
Adds tests for extract_json() and count_tokens(). |
tests/test_tree.py |
Adds tests for list_to_tree(), structure_to_list(), and write_node_id(). |
tests/conftest.py |
Adds a test-time path tweak for imports. |
pageindex.egg-info/* |
Adds generated distribution metadata files. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| os.makedirs("./logs", exist_ok=True) | ||
| # Initialize empty list to store all messages | ||
| self.log_data: List[Dict[str, Any]] = [] | ||
|
|
There was a problem hiding this comment.
List is used in the type annotation for self.log_data but is not imported from typing, which will raise NameError at class initialization time (since annotations are evaluated at runtime without from __future__ import annotations). Import List (or switch to built-in list[...] generics) to fix.
| enc = tiktoken.encoding_for_model(model) | ||
| if pdf_parser == "PyPDF2": | ||
| pdf_reader = PyPDF2.PdfReader(pdf_path) | ||
| page_list = [] | ||
| for page_num in range(len(pdf_reader.pages)): | ||
| page = pdf_reader.pages[page_num] | ||
| page_text = page.extract_text() | ||
| token_length = len(enc.encode(page_text)) | ||
| page_list.append((page_text, token_length)) |
There was a problem hiding this comment.
get_page_tokens() calls tiktoken.encoding_for_model(model) without handling KeyError, unlike pageindex.core.llm.count_tokens() which falls back to cl100k_base. This can crash token counting for unknown/new model names (including the current default). Consider reusing count_tokens() or adding the same fallback logic here.
|
|
||
| # Add src to python path for testing | ||
| sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../src"))) |
There was a problem hiding this comment.
This conftest.py inserts ../src into sys.path, but this repository does not have a src/ directory. This can mask import/path issues and makes the tests dependent on a non-existent layout. Either remove this path manipulation or point it at the actual project root/package location.
| # Add src to python path for testing | |
| sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "../src"))) |
| import tiktoken | ||
| import openai | ||
| import logging | ||
| import os | ||
| import time | ||
| import json | ||
| import asyncio | ||
| from typing import Optional, List, Dict, Any, Union, Tuple | ||
| from dotenv import load_dotenv | ||
|
|
||
| load_dotenv() | ||
|
|
||
| OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or os.getenv("CHATGPT_API_KEY") | ||
|
|
||
| def count_tokens(text: Optional[str], model: str = "gpt-4o") -> int: | ||
| """ | ||
| Count the number of tokens in a text string using the specified model's encoding. | ||
|
|
||
| Args: | ||
| text (Optional[str]): The text to encode. If None, returns 0. | ||
| model (str): The model name to use for encoding. Defaults to "gpt-4o". | ||
|
|
||
| Returns: | ||
| int: The number of tokens. | ||
| """ | ||
| if not text: | ||
| return 0 | ||
| try: | ||
| enc = tiktoken.encoding_for_model(model) | ||
| except KeyError: | ||
| # Fallback for newer or unknown models | ||
| enc = tiktoken.get_encoding("cl100k_base") | ||
| tokens = enc.encode(text) | ||
| return len(tokens) | ||
|
|
There was a problem hiding this comment.
The PR description says this is a refactor without logic changes, but pageindex.core.llm changes both the LLM provider layer (from litellm to the openai SDK) and the token counting implementation (from litellm.token_counter to tiktoken). This is a functional behavior change (model support, tokenization differences, retry/error semantics). Either update the PR description/scope or add compatibility wrappers to keep behavior consistent with the previous utils.py implementation.
| data | ||
| docs | ||
| notebooks | ||
| pageindex | ||
| scripts | ||
| tests |
There was a problem hiding this comment.
These *.egg-info files look like generated packaging artifacts from a local build/install (and top_level.txt even lists non-importable directories like data/, docs/, tests/). Committing them is brittle and often incorrect across environments. Prefer removing pageindex.egg-info/ from version control and adding it to .gitignore, and manage dependencies/metadata via a real packaging config (e.g., pyproject.toml / setup.cfg).
| data | |
| docs | |
| notebooks | |
| pageindex | |
| scripts | |
| tests | |
| pageindex |
| from .core.llm import * | ||
| from .core.pdf import * | ||
| from .core.tree import * | ||
| from .core.logging import * |
There was a problem hiding this comment.
pageindex/utils.py is no longer a backward-compatible facade: it only star-imports from pageindex.core.*, but legacy symbols like llm_completion / llm_acompletion are not defined anywhere in pageindex.core.llm. This breaks existing internal callers (e.g. pageindex/page_index.py imports from .utils import * and calls those functions) and will raise NameError at runtime. Reintroduce these legacy functions (as wrappers/aliases) or explicitly re-export them from the new core modules to preserve the old API surface.
| from .core.logging import * | |
| from .core.logging import * | |
| # Explicit import of the core llm module so we can provide | |
| # backward-compatible wrappers for legacy utility functions. | |
| from .core import llm as _core_llm | |
| import asyncio | |
| from typing import Any | |
| def llm_completion(*args: Any, **kwargs: Any) -> Any: | |
| """ | |
| Backward-compatible wrapper for the legacy `llm_completion` symbol. | |
| Delegates to the underlying implementation in `pageindex.core.llm`, | |
| preferring a same-named function if it exists, and otherwise falling | |
| back to a possible new-style name (`completion`). | |
| """ | |
| if hasattr(_core_llm, "llm_completion"): | |
| return _core_llm.llm_completion(*args, **kwargs) | |
| if hasattr(_core_llm, "completion"): | |
| return _core_llm.completion(*args, **kwargs) | |
| raise RuntimeError( | |
| "llm_completion is not available in pageindex.core.llm. " | |
| "Please update your code to use the currently supported API." | |
| ) | |
| async def llm_acompletion(*args: Any, **kwargs: Any) -> Any: | |
| """ | |
| Backward-compatible async wrapper for the legacy `llm_acompletion` symbol. | |
| Delegates to an async implementation in `pageindex.core.llm` if present, | |
| or falls back to running the synchronous `llm_completion` in a thread | |
| executor to preserve behavior for async callers. | |
| """ | |
| if hasattr(_core_llm, "llm_acompletion"): | |
| return await _core_llm.llm_acompletion(*args, **kwargs) | |
| if hasattr(_core_llm, "acompletion"): | |
| return await _core_llm.acompletion(*args, **kwargs) | |
| # Fallback: run the sync implementation in a thread, if available. | |
| loop = asyncio.get_running_loop() | |
| return await loop.run_in_executor(None, llm_completion, *args, **kwargs) |
|
@copilot open a new pull request to apply changes based on the comments in this thread |
This PR addresses the feedback from previous reviews by splitting the monolithic utils.py into smaller, focused modules to improve maintainability.
Changes:
pageindex/core/package.pdf.py, tree.py, andlogging.py.pytestsuite for the new core modules.This change focuses strictly on the restructuring without altering the underlying logic, making it safe and easy to review.