Module Usage

This page explains how to use YomiToku as a Python library.

Using Document Analyzer

Document Analyzer performs OCR and layout analysis and returns a single integrated result. It supports use cases such as paragraph and table structure analysis, extraction, and figure/table detection.

When loading a PDF file, use load_pdf, and when loading an image file, use load_image. Internally, load_image uses OpenCV. Note that the channel order is BGR.
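
Libraries that expect RGB input, such as Pillow or matplotlib, need the channels reversed before they can display a BGR image correctly. The following is a minimal sketch, assuming the loaded image is a NumPy array of shape (height, width, 3); the helper name `bgr_to_rgb` is hypothetical and not part of YomiToku.

```python
import numpy as np


def bgr_to_rgb(img: np.ndarray) -> np.ndarray:
    """Return a copy of an (H, W, 3) image with the channel axis reversed."""
    if img.ndim != 3 or img.shape[2] != 3:
        raise ValueError("expected an array of shape (height, width, 3)")
    # Reversing the last axis swaps B and R and leaves G in place.
    return np.ascontiguousarray(img[:, :, ::-1])


# A single pure-blue pixel in BGR order stands in for a loaded page.
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)
rgb = bgr_to_rgb(bgr)
```

Keep the original BGR array for `cv2.imwrite` and for the analyzers shown on this page; convert only the copy you hand to an RGB-based library.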

The module uses the following four models:

Text Recognizer
Text Detector
Layout Parser
Table Structure Recognizer

demo/simple_document_analysis.py

import cv2

from yomitoku import DocumentAnalyzer
from yomitoku.data.functions import load_pdf

PATH_IMAGE = "demo/samples/sample.pdf"
analyzer = DocumentAnalyzer(visualize=True, device="cuda")
# Load the PDF file
imgs = load_pdf(PATH_IMAGE)

# To load an image file instead, use load_image
# (import it with: from yomitoku.data.functions import load_image)
# imgs = load_image(PATH_IMAGE)

for i, img in enumerate(imgs):
    results, ocr_vis, layout_vis = analyzer(img)
    # Export the analysis results as HTML
    results.to_html(f"output_{i}.html", img=img)
    # Save the visualization images
    cv2.imwrite(f"output_ocr_{i}.jpg", ocr_vis)
    cv2.imwrite(f"output_layout_{i}.jpg", layout_vis)

analyzer.close()

Option Name	Type	Description	Notes
`visualize`	`bool`	Specifies whether to visualize the processing results.	`False` is recommended unless you are debugging. If `True`, the OCR visualization image is returned as the 2nd return value and the layout analysis visualization image as the 3rd. If `False`, `None` is returned in their place.
`device`	`str`	Specifies the device to be used for processing.	The default is `"cuda"`. If a GPU is unavailable, it automatically switches to `"cpu"`.
`configs`	`dict`	Used to set more detailed parameters for module processing.	Refer to Model Detailed Config for details.
`license_key`	`str`	Specifies the license key used for authentication.	This argument takes priority over the environment variable. If it is not set, the key from the environment variable is used.
`secret_key`	`str`	Specifies the secret key used for authentication.	This argument takes priority over the environment variable. If it is not set, the key from the environment variable is used.
`device_token`	`str`	Specifies the device token, which can be used with external services.	This argument takes priority over the environment variable. If it is not set, the key from the environment variable is used.
`ignore_ruby`	`bool`	Specifies whether to exclude ruby (furigana) text from the output.	Default is `False`.
`ruby_threshold`	`float`	Specifies the threshold for ruby detection as a ratio to the median line height. Text consisting solely of hiragana or katakana below this threshold is identified as ruby.	Default is `2.0`. Only effective when `ignore_ruby=True`.

The results of DocumentAnalyzer can be exported in the following formats:

Method	Output Format
`to_json()`	JSON format (*.json)
`to_html()`	HTML format (*.html)
`to_csv()`	Comma-separated CSV format (*.csv)
`to_markdown()`	Markdown format (*.md)
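
When the output format is chosen at run time, the export method can be selected from the file extension. This is a small sketch, not part of YomiToku: the helper `export_by_suffix` and the suffix mapping are hypothetical, and it assumes that every export method takes the output path as its first argument, as `to_json()` and `to_html()` do in the examples on this page.

```python
from pathlib import Path

# Method names come from the table above; the suffix mapping is this sketch's own.
EXPORTERS = {
    ".json": "to_json",
    ".html": "to_html",
    ".csv": "to_csv",
    ".md": "to_markdown",
}


def export_by_suffix(results, out_path, **kwargs):
    """Call the export method of `results` that matches the suffix of `out_path`."""
    suffix = Path(out_path).suffix.lower()
    if suffix not in EXPORTERS:
        raise ValueError(f"unsupported output format: {suffix!r}")
    # Assumption: each exporter accepts the output path as its first argument.
    method = getattr(results, EXPORTERS[suffix])
    return method(str(out_path), **kwargs)
```

Inside the loop of the demo script this would be called as `export_by_suffix(results, f"output_{i}.html", img=img)`; extra keyword arguments are passed through unchanged.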

Using AI-OCR Only

AI-OCR performs text detection and recognition on the detected text, returning the positions of the text within the image along with the recognition results.

The module uses the following two models:

Text Recognizer
Text Detector

demo/simple_ocr.py

import cv2

from yomitoku import OCR
from yomitoku.data.functions import load_pdf

ocr = OCR(visualize=True, device="cuda")
# Load the PDF file
imgs = load_pdf("demo/samples/sample.pdf")
for i, img in enumerate(imgs):
    results, ocr_vis = ocr(img)

    # Export the analysis results as JSON
    results.to_json(f"output_{i}.json")
    cv2.imwrite(f"output_ocr_{i}.jpg", ocr_vis)

ocr.close()

Option Name	Type	Description	Notes
`visualize`	`bool`	Specifies whether to visualize the processing results.	`False` is recommended unless you are debugging. If `True`, the OCR visualization image is returned as the 2nd return value. If `False`, `None` is returned in its place.
`device`	`str`	Specifies the device to be used for processing.	The default is `"cuda"`. If a GPU is unavailable, it automatically switches to `"cpu"`.
`configs`	`dict`	Used to set more detailed parameters for module processing.	Refer to Model Detailed Config for details.
`license_key`	`str`	Specifies the license key used for authentication.	This argument takes priority over the environment variable. If it is not set, the key from the environment variable is used.
`secret_key`	`str`	Specifies the secret key used for authentication.	This argument takes priority over the environment variable. If it is not set, the key from the environment variable is used.
`device_token`	`str`	Specifies the device token, which can be used with external services.	This argument takes priority over the environment variable. If it is not set, the key from the environment variable is used.

OCR results can be exported only in JSON format (to_json()).

Using Layout Analyzer Only

LayoutAnalyzer performs text detection, followed by AI-based detection of paragraphs, figures, and tables and by table structure analysis. It analyzes the layout structure within the document.

The module uses the following two models:

Layout Parser
Table Structure Recognizer

demo/simple_layout.py

import cv2

from yomitoku import LayoutAnalyzer
from yomitoku.data.functions import load_pdf

analyzer = LayoutAnalyzer(visualize=True, device="cuda")
# Load the PDF file
imgs = load_pdf("demo/sample.pdf")
for i, img in enumerate(imgs):
    results, layout_vis = analyzer(img)
    # Export the analysis results as JSON
    results.to_json(f"output_{i}.json")
    cv2.imwrite(f"output_layout_{i}.jpg", layout_vis)

analyzer.close()

Option Name	Type	Description	Notes
`visualize`	`bool`	Specifies whether to visualize the processing results.	`False` is recommended unless you are debugging. If `True`, the layout analysis visualization image is returned as the 2nd return value. If `False`, `None` is returned in its place.
`device`	`str`	Specifies the device to be used for processing.	The default is `"cuda"`. If a GPU is unavailable, it automatically switches to `"cpu"`.
`configs`	`dict`	Used to set more detailed parameters for module processing.	Refer to Model Detailed Config for details.
`license_key`	`str`	Specifies the license key used for authentication.	This argument takes priority over the environment variable. If it is not set, the key from the environment variable is used.
`secret_key`	`str`	Specifies the secret key used for authentication.	This argument takes priority over the environment variable. If it is not set, the key from the environment variable is used.
`device_token`	`str`	Specifies the device token, which can be used with external services.	This argument takes priority over the environment variable. If it is not set, the key from the environment variable is used.

LayoutAnalyzer results can be exported only in JSON format (to_json()).

Model Detailed Config

By providing a config, you can adjust the behavior in greater detail. The following parameters can be set for the model:

Option Name	Type	Description
`model_name`	`str`	Specifies the model name to be used.
`path_cfg`	`str`	Specifies the path to the config file containing the hyperparameters.
`device`	`str`	Specifies the device to be used for inference. (Allowed values: `cuda` | `cpu` | `mps`)
`visualize`	`bool`	Specifies whether to perform visualization.
`from_pretrained`	`bool`	Specifies whether to use a Pretrained Model (a previously trained model).
`infer_onnx`	`bool`	Specifies whether to use ONNX Runtime instead of PyTorch for inference.
`dynamic_width`	`bool`	(Text recognition only) Runs inference keeping each crop at its native content width instead of padding to a fixed width. Use together with a dynamic-width model (`parseqv4-tiny-dynw`). Defaults to `False`. Automatically disabled during ONNX inference (`infer_onnx=True`).
`batch_bucketing`	`bool`	(Text recognition only) Groups crops with similar widths into the same batch so the AR decoding loop is not gated by one long line per batch. Effective when combined with `dynamic_width=True`. Defaults to `False`.
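
To see why grouping crops by width helps, the toy model below compares batches formed in arrival order with batches formed after sorting by width. It is an illustration only, not YomiToku's implementation: the function names, the crop widths, and the cost model (each batch costs as much as its widest crop times its size) are all assumptions made for this sketch.

```python
def make_batches(widths, batch_size, bucket):
    """Split crop indices into batches, optionally grouping similar widths."""
    order = list(range(len(widths)))
    if bucket:
        # Sorting by width puts crops of similar width next to each other.
        order.sort(key=lambda i: widths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def batch_cost(widths, batches):
    """Total cost when every crop in a batch is processed at the widest width."""
    return sum(max(widths[i] for i in batch) * len(batch) for batch in batches)


# Hypothetical crop widths in pixels: three short lines mixed with three long ones.
widths = [40, 600, 50, 590, 45, 580]
plain = make_batches(widths, batch_size=3, bucket=False)
bucketed = make_batches(widths, batch_size=3, bucket=True)
```

With these numbers, `batch_cost(widths, plain)` is 3570 and `batch_cost(widths, bucketed)` is 1950, because the short lines no longer share a batch with a long one.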

How to Write Config

The config is provided in dictionary format. By using a config, you can execute processing on different devices for each model and set detailed parameters.

For example, the following config allows the OCR processing to run on a GPU, while the layout analysis is performed on a CPU:

from yomitoku import DocumentAnalyzer

if __name__ == "__main__":
    configs = {
        "ocr": {
            "text_detector": {
                "device": "cuda",
            },
            "text_recognizer": {
                "device": "cuda",
            },
        },
        "layout_analyzer": {
            "layout_parser": {
                "device": "cpu",
            },
            "table_structure_recognizer": {
                "device": "cpu",
            },
        },
    }

    DocumentAnalyzer(configs=configs)
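
If the same split is needed in several scripts, the nested dictionary can be generated by a small helper. This is a sketch under stated assumptions: `make_device_configs` is a hypothetical name, not a YomiToku function, and it only reproduces the key layout of the example above, checking the device strings against the allowed values listed under Model Detailed Config.

```python
ALLOWED_DEVICES = {"cuda", "cpu", "mps"}  # values listed under Model Detailed Config


def make_device_configs(ocr_device, layout_device):
    """Build a configs dict that assigns one device to each processing stage."""
    for device in (ocr_device, layout_device):
        if device not in ALLOWED_DEVICES:
            raise ValueError(f"unsupported device: {device!r}")
    return {
        "ocr": {
            "text_detector": {"device": ocr_device},
            "text_recognizer": {"device": ocr_device},
        },
        "layout_analyzer": {
            "layout_parser": {"device": layout_device},
            "table_structure_recognizer": {"device": layout_device},
        },
    }


# Same split as the example above: OCR on the GPU, layout analysis on the CPU.
configs = make_device_configs(ocr_device="cuda", layout_device="cpu")
```

The resulting dictionary is passed to the analyzer in the usual way, for example `DocumentAnalyzer(configs=configs)`.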

Using Lite Mode (Fast Processing) from the Python API

The lite mode that corresponds to the CLI --lite option can also be used from the Python API by specifying parseqv4-tiny-dynw as the text recognition model and enabling dynamic_width and batch_bucketing in configs. On CPU, switching text detection to ONNX inference speeds things up further.

import cv2
from yomitoku import DocumentAnalyzer

if __name__ == "__main__":
    configs = {
        "ocr": {
            "text_recognizer": {
                "model_name": "parseqv4-tiny-dynw",  # dynamic-width lite model
                "dynamic_width": True,                # keep each crop at its native width
                "batch_bucketing": True,              # batch similar-width crops together
                "device": "cpu",
            },
            "text_detector": {
                "device": "cpu",
                "infer_onnx": True,                   # ONNX text detection is faster on CPU
            },
        },
    }

    analyzer = DocumentAnalyzer(configs=configs, device="cpu")

    img = cv2.imread("sample.jpg")
    results, ocr_vis, layout_vis = analyzer(img)
    results.to_json("output.json")

When using the OCR module on its own, pass the same options to text_recognizer.

import cv2
from yomitoku import OCR

if __name__ == "__main__":
    configs = {
        "text_recognizer": {
            "model_name": "parseqv4-tiny-dynw",
            "dynamic_width": True,
            "batch_bucketing": True,
        },
    }

    ocr = OCR(configs=configs, device="cpu")

    img = cv2.imread("sample.jpg")
    results, ocr_vis = ocr(img)

Note

parseqv4-tiny-dynw is trained for dynamic-width batched inference, so use it together with dynamic_width=True (and batch_bucketing=True). ONNX inference (infer_onnx=True) uses a fixed input size, so dynamic_width is automatically disabled in that case.

Using as a REST API Server

Document Analyzer can also be started as a REST API server and used over HTTP.

uv pip install -e ".[server]"
yomitoku_server document_analyzer [--host 0.0.0.0] [--port 8000] [--device cuda]

After starting the server, you can send binary image or PDF data to the POST /invocations endpoint to obtain analysis results.

curl -X POST http://localhost:8000/invocations \
     -H "Content-Type: image/jpeg" \
     --data-binary @sample.jpg
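
The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server runs on the default host and port shown above; the function names are hypothetical, and because the response format is not described on this page, the body is returned as raw bytes.

```python
import urllib.request


def build_invocation_request(payload, content_type, host="localhost", port=8000):
    """Build (but do not send) a POST request for the /invocations endpoint."""
    return urllib.request.Request(
        f"http://{host}:{port}/invocations",
        data=payload,
        headers={"Content-Type": content_type},
        method="POST",
    )


def analyze_file(path, content_type="image/jpeg"):
    """Send a file to a running server and return the raw response body."""
    with open(path, "rb") as f:
        request = build_invocation_request(f.read(), content_type)
    # Raises urllib.error.URLError if the server is not reachable.
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `analyze_file("sample.jpg")` mirrors the curl command. For PDF input, a content type such as `application/pdf` is an assumption here; check the Server page for the value the server expects.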

Deployment via Docker is also supported. See the Server page for details.

Defining Parameters in a YAML File

By providing the path to a YAML file in the config, you can adjust detailed parameters for inference. Examples of YAML files can be found in the configs directory within the repository. While the model's network parameters cannot be modified, certain aspects like post-processing parameters and input image size can be adjusted. Refer to Model Config for configurable parameters.

For instance, you can define post-processing thresholds for the Text Detector in a YAML file and set its path in the config. The config file does not need to include all parameters; you only need to specify the parameters that require changes.

post_process:
  thresh: 0.1
  unclip_ratio: 2.5

Set the path to the YAML file in the config as follows:

demo/setting_document_anaysis.py

from yomitoku import DocumentAnalyzer
from yomitoku.data.functions import load_pdf

# Initialize DocumentAnalyzer with the path to the config file
configs = {"ocr": {"text_detector": {"path_cfg": "demo/text_detector.yaml"}}}

analyzer = DocumentAnalyzer(configs=configs, visualize=True, device="cuda")

PATH_IMAGE = "demo/samples/sample.pdf"

imgs = load_pdf(PATH_IMAGE)

for i, img in enumerate(imgs):
    analyzer(img)
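
Because the YAML file only needs to list the parameters that change, its effect can be pictured as a recursive merge of the overrides onto the defaults. The sketch below illustrates that idea with plain dictionaries; it is not YomiToku's loading code, and the default values and the `max_candidates` key are hypothetical placeholders invented for this example.

```python
import copy


def merge_config(defaults, overrides):
    """Return defaults with overrides applied recursively; inputs are not modified."""
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Descend into nested sections so untouched keys survive.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical defaults, invented for this example only.
defaults = {"post_process": {"thresh": 0.2, "unclip_ratio": 2.0, "max_candidates": 1500}}
# These overrides mirror the YAML file shown earlier in this section.
overrides = {"post_process": {"thresh": 0.1, "unclip_ratio": 2.5}}
effective = merge_config(defaults, overrides)
```

In this example `effective` takes `thresh` and `unclip_ratio` from the overrides, while `max_candidates` keeps its default value.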