load_pdf

Load a PDF file and return an iterator that yields page images in BGR format.

Pages are rendered lazily one at a time to avoid loading all pages into memory at once, preventing OOM errors for large PDFs with hundreds of pages.
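
The lazy-rendering behaviour described above can be illustrated with a minimal sketch. `LazyPageIterator` and `fake_render` are hypothetical names used only for illustration. They are not part of yomitoku, and the real `PdfPageIterator` may be implemented differently. The sketch only shows the general pattern: the page count is known up front, but each page is produced when the loop asks for it.

```python
from typing import Callable, Iterator, List


class LazyPageIterator:
    """Hypothetical stand-in for PdfPageIterator: renders one page per step."""

    def __init__(self, total_pages: int, render_page: Callable[[int], object]):
        self.total_pages = total_pages
        self._render_page = render_page  # called only when a page is requested

    def __len__(self) -> int:
        # The page count is available without rendering anything.
        return self.total_pages

    def __iter__(self) -> Iterator[object]:
        for index in range(self.total_pages):
            # Each page is produced on demand, so at most one rendered page
            # needs to be alive unless the caller keeps references to more.
            yield self._render_page(index)


rendered: List[int] = []  # records which page indices were actually rendered


def fake_render(index: int) -> str:
    """Placeholder renderer; a real one would return an image array."""
    rendered.append(index)
    return f"page-{index}"


pages = LazyPageIterator(total_pages=300, render_page=fake_render)
page_count = len(pages)    # no rendering happens here
first = next(iter(pages))  # renders only the first page
```

With this pattern, collecting every page with `list(pages)` would bring back the memory cost that lazy iteration avoids, so processing pages inside the loop is usually the better fit for large documents.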

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| pdf_path | str | The path to the PDF file to be loaded. | required |
| dpi | int | The resolution (dots per inch) used to render the PDF pages as images. Higher values produce higher-resolution images. | 200 |
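
To get a feel for what the `dpi` parameter implies, the following sketch estimates the pixel size and raw memory footprint of one rendered page. It assumes the default PDF user-space unit of 1/72 inch, 8-bit channels, and simple rounding. The exact rounding used by the library's renderer is not documented here, so treat the numbers as approximate. `rendered_size` and `bgr_bytes` are hypothetical helpers, not yomitoku functions.

```python
from typing import Tuple

POINTS_PER_INCH = 72  # default PDF user-space unit is 1/72 inch


def rendered_size(width_pt: float, height_pt: float, dpi: int = 200) -> Tuple[int, int]:
    """Approximate pixel dimensions of a page rendered at the given DPI."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def bgr_bytes(width_px: int, height_px: int, channels: int = 3) -> int:
    """Raw size in bytes of an 8-bit image with the given number of channels."""
    return width_px * height_px * channels


# A4 is 210 mm x 297 mm, which is roughly 595.276 x 841.890 points.
a4_width_px, a4_height_px = rendered_size(595.276, 841.890, dpi=200)
a4_bytes = bgr_bytes(a4_width_px, a4_height_px)
```

Under these assumptions an A4 page at the default 200 DPI comes out at about 1654 x 2339 pixels, or roughly 11.6 MB per page as an uncompressed BGR array. Memory grows with the square of `dpi`, which is why rendering one page at a time matters for documents with hundreds of pages.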

Returns:

| Type | Description |
| --- | --- |
| PdfPageIterator | An iterator yielding one NumPy array (BGR format) per page. Has a `total_pages` attribute and supports `len()`. |

Raises:

| Type | Description |
| --- | --- |
| FileNotFoundError | If the specified PDF file does not exist. |
| ValueError | If the file format is not supported, or if the file is not a valid PDF. |
| RuntimeError | If there is an error while processing the PDF file. |

Example
from yomitoku.data.functions import load_pdf

pages = load_pdf("example.pdf", dpi=200)
print(len(pages))  # Output: Number of pages in the PDF file
for page_img in pages:
    print(page_img.shape)  # Output: (height, width, channels)
Source code in src/yomitoku/data/functions.py
def load_pdf(pdf_path: str, dpi=200) -> PdfPageIterator:
    """
    Load a PDF file and return an iterator that yields page images in BGR format.

    Pages are rendered lazily one at a time to avoid loading all pages into
    memory at once, preventing OOM errors for large PDFs with hundreds of pages.

    Args:
        pdf_path (str): The path to the PDF file to be loaded.
        dpi (int, optional): The resolution (dots per inch) for rendering the PDF pages
            as images. Higher values result in higher resolution images. Defaults to 200.

    Returns:
        PdfPageIterator: An iterator yielding NumPy arrays (BGR format) for each page.
            Has a `total_pages` attribute and supports `len()`.

    Raises:
        FileNotFoundError: If the specified PDF file does not exist.
        ValueError:
            - If the file format is not supported.
            - If the file is not a valid PDF.
        RuntimeError: If there is an error while processing the PDF file.

    Example:
        ```python
        from yomitoku.data.functions import load_pdf

        pages = load_pdf("example.pdf", dpi=200)
        print(len(pages))  # Output: Number of pages in the PDF file
        for page_img in pages:
            print(page_img.shape)  # Output: (height, width, channels)
        ```
    """
    pdf_path = Path(pdf_path)
    if not pdf_path.exists():
        raise make_error(ErrorCode.IMAGE_FILE_NOT_FOUND)

    ext = pdf_path.suffix[1:].lower()
    if ext not in SUPPORT_INPUT_FORMAT:
        raise make_error(ErrorCode.UNSUPPORTED_IMAGE_FORMAT)

    if ext != "pdf":
        raise make_error(ErrorCode.NOT_PDF_FILE)

    return PdfPageIterator(pdf_path, dpi=dpi)
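
The order of the checks in the listing above can be reproduced in a standalone sketch. Several things here are assumptions: the contents of `ASSUMED_SUPPORTED_FORMATS` are illustrative (the real `SUPPORT_INPUT_FORMAT` is not shown in this section), and the mapping from `make_error(...)` to built-in exception types is taken from the docstring's Raises section rather than from the `make_error` implementation. `check_pdf_path` and `outcome` are hypothetical helpers.

```python
import tempfile
from pathlib import Path

# Assumption: illustrative values only; the real SUPPORT_INPUT_FORMAT may differ.
ASSUMED_SUPPORTED_FORMATS = {"jpg", "jpeg", "png", "bmp", "tiff", "tif", "pdf"}


def check_pdf_path(pdf_path: str) -> Path:
    """Mirror the three checks in order: existence, supported format, PDF."""
    path = Path(pdf_path)
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")

    ext = path.suffix[1:].lower()  # ".PDF" -> "pdf"
    if ext not in ASSUMED_SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {ext!r}")

    if ext != "pdf":
        raise ValueError(f"Not a PDF file: {path.name}")

    return path


def outcome(pdf_path: str) -> str:
    """Return "ok" or the name of the exception raised by check_pdf_path."""
    try:
        check_pdf_path(pdf_path)
        return "ok"
    except (FileNotFoundError, ValueError) as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    # Create empty placeholder files; only names and existence are checked.
    for name in ("scan.PDF", "photo.png", "notes.txt"):
        (Path(tmp) / name).touch()
    results = {
        name: outcome(str(Path(tmp) / name))
        for name in ("scan.PDF", "photo.png", "notes.txt", "missing.pdf")
    }
```

Two details follow from the ordering. The existence check runs first, so a missing file is reported as not found regardless of its extension. The remaining checks look only at the file extension (case-insensitively), so in the code shown here any content validation would have to happen later, for example inside `PdfPageIterator`.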