
ModelBuilder with source_code + DJL LMI: /opt/ml/model becomes read-only, breaking HF Hub model downloads #5698

@dgallitelli

Description


When using ModelBuilder (SDK v3) with a pre-built DJL LMI container image and source_code (via SourceCode) to provide a custom requirements.txt, the model directory /opt/ml/model/ becomes read-only at runtime. This prevents the DJL container from downloading models from the HuggingFace Hub, because the download writes its cache files under /opt/ml/model/.

Additionally, ModelBuilder overrides a user-provided HF_MODEL_ID environment variable with the value from the model= parameter, making it impossible to point the container at the local model path (/opt/ml/model) when S3 model artifacts are also supplied via s3_model_data_url.
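
The override can be illustrated with a minimal sketch (the function below is hypothetical; it models the reported behavior, not the SDK's actual code):

```python
def build_container_env(user_env: dict, model_id: str) -> dict:
    """Hypothetical model of the reported behavior: the value derived from
    model= always wins, so a user-supplied HF_MODEL_ID is silently clobbered."""
    env = dict(user_env)
    env["HF_MODEL_ID"] = model_id  # overrides e.g. HF_MODEL_ID=/opt/ml/model
    return env

env = build_container_env({"HF_MODEL_ID": "/opt/ml/model"}, "chromadb/context-1")
print(env["HF_MODEL_ID"])  # → chromadb/context-1
```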

How to Reproduce

from sagemaker.serve import ModelBuilder, ModelServer
from sagemaker.serve.builder.schema_builder import SchemaBuilder
from sagemaker.serve.mode.function_pointers import Mode
from sagemaker.serve.model_builder import SourceCode

source_code = SourceCode(
    source_dir="./model_code",
    requirements="requirements.txt",  # e.g. transformers>=4.55.0
)

mb = ModelBuilder(
    model="chromadb/context-1",  # HF Hub model ID
    role_arn=ROLE,
    image_uri="763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.36.0-lmi22.0.0-cu129",
    model_server=ModelServer.DJL_SERVING,
    schema_builder=SchemaBuilder(
        {"inputs": "Hello", "parameters": {"max_new_tokens": 64}},
        [{"generated_text": "Hi"}],
    ),
    source_code=source_code,
    env_vars={"OPTION_TENSOR_PARALLEL_DEGREE": "4"},  # ... further env vars elided
    instance_type="ml.g6e.12xlarge",
    mode=Mode.SAGEMAKER_ENDPOINT,
)

model = mb.build()
endpoint = mb.deploy(endpoint_name="test", wait=True)
# FAILS: OSError: [Errno 30] Read-only file system: /opt/ml/model/models--chromadb--context-1

Observed Behavior

  1. ModelBuilder.build() packages the source_code directory into a model.tar.gz and uploads it to S3
  2. At deploy time, SageMaker mounts this tar.gz at /opt/ml/model/ — which becomes read-only
  3. ModelBuilder sets HF_MODEL_ID=chromadb/context-1 (from model=), overriding any user-provided value
  4. DJL LMI container sees HF_MODEL_ID=chromadb/context-1 and tries to download from HF Hub
  5. HF Hub download tries to write cache to /opt/ml/model/models--chromadb--context-1/
  6. Fails with OSError: [Errno 30] Read-only file system

CloudWatch logs confirm:

OSError: [Errno 30] Read-only file system: /opt/ml/model/models--chromadb--context-1
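
The failure mode can be reproduced in isolation with a small stdlib probe (the helper name and the probe directory are illustrative; on the failing endpoint, running it against /opt/ml/model would print the errno 30 error from the logs):

```python
import os
import tempfile

def probe_writable(model_dir: str) -> bool:
    """Check whether model_dir accepts new subdirectories, as the HF Hub
    cache layout (models--<org>--<name>/) requires."""
    probe = os.path.join(model_dir, "models--chromadb--context-1")
    try:
        os.makedirs(probe, exist_ok=True)
        os.rmdir(probe)
        return True
    except OSError as e:
        # On the endpoint this is [Errno 30] EROFS for /opt/ml/model
        print(f"OSError: [Errno {e.errno}] {e.strerror}: {e.filename}")
        return False

# A fresh temp dir is writable; probe_writable("/opt/ml/model") inside the
# container would return False with the errno 30 message above.
print(probe_writable(tempfile.mkdtemp()))  # → True
```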

Expected Behavior

Users should be able to use ModelBuilder with:

  • A pre-built container image (e.g. DJL LMI)
  • source_code with a custom requirements.txt to install additional dependencies at container startup
  • A HuggingFace Hub model ID that the container downloads at runtime

The requirements.txt installation should not make /opt/ml/model/ read-only, or the HF Hub cache should be redirected to a writable location (e.g. /tmp).
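
For context, the cache location huggingface_hub uses is derived from environment variables. The sketch below approximates the documented lookup order (it is not the library's exact code), showing why redirecting the cache to /tmp should in principle be possible:

```python
import os

def hf_cache_dir(env: dict) -> str:
    """Approximation of huggingface_hub's documented cache resolution:
    explicit cache vars win, then HF_HOME/hub, then the default path."""
    for var in ("HF_HUB_CACHE", "HUGGINGFACE_HUB_CACHE"):
        if env.get(var):
            return env[var]
    if env.get("HF_HOME"):
        return os.path.join(env["HF_HOME"], "hub")
    return os.path.expanduser("~/.cache/huggingface/hub")

# With the env_vars from the workaround below:
print(hf_cache_dir({"HF_HOME": "/tmp/hf_home",
                    "HUGGINGFACE_HUB_CACHE": "/tmp/hf_home/hub"}))
# → /tmp/hf_home/hub
```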

Workaround Attempted

Setting HF_HOME=/tmp/hf_home and HUGGINGFACE_HUB_CACHE=/tmp/hf_home/hub in env_vars does not help: the variables appear in the container environment, but the DJL container still writes to /opt/ml/model/.
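
A broader set of redirections worth trying is sketched below. The variable names are the documented HF cache knobs, but whether DJL LMI's own download path honors them is precisely what this issue reports it does not, so treat this as a diagnostic rather than a fix:

```python
# Candidate env_vars for ModelBuilder: redirect every documented HF cache
# location to writable /tmp paths.
hf_cache_env = {
    "HF_HOME": "/tmp/hf_home",
    "HF_HUB_CACHE": "/tmp/hf_home/hub",
    "HUGGINGFACE_HUB_CACHE": "/tmp/hf_home/hub",
    "TRANSFORMERS_CACHE": "/tmp/hf_home/transformers",
}
assert all(path.startswith("/tmp/") for path in hf_cache_env.values())
```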

Use Case

This is a common pattern for deploying newer models (e.g. OpenAI GPT-OSS-based models such as chromadb/context-1) that require a newer transformers version than the one bundled in the DJL LMI container. source_code with requirements.txt is the natural SDK v3 mechanism for this, but it is currently incompatible with HF Hub model downloads.

Environment

  • SageMaker Python SDK: 3.6.0
  • Container: djl-inference:0.36.0-lmi22.0.0-cu129
  • Instance: ml.g6e.12xlarge
  • Region: us-east-1
