Module Usage

This page explains how to use YomiToku as a Python library.

Using Document Analyzer

Document Analyzer performs OCR and layout analysis, returning an integrated analysis result. It can be used for various use cases, such as paragraph and table structure analysis, extraction, and figure/table detection.

Use load_pdf to load a PDF file and load_image to load an image file. Internally, load_image uses OpenCV, so note that the channel order of the loaded image is BGR.

The module uses the following four models:

  • Text Recognizer
  • Text Detector
  • Layout Parser
  • Table Structure Recognizer
import cv2

from yomitoku import DocumentAnalyzer
from yomitoku.data.functions import load_image, load_pdf

PATH_IMAGE = "demo/samples/sample.pdf"
analyzer = DocumentAnalyzer(visualize=True, device="cuda")
# Load the PDF file (one image per page)
imgs = load_pdf(PATH_IMAGE)

# To load an image file instead, use load_image
# imgs = load_image(PATH_IMAGE)

for i, img in enumerate(imgs):
    results, ocr_vis, layout_vis = analyzer(img)
    # Export the analysis results in HTML format
    results.to_html(f"output_{i}.html", img=img)
    # Save the visualization images
    cv2.imwrite(f"output_ocr_{i}.jpg", ocr_vis)
    cv2.imwrite(f"output_layout_{i}.jpg", layout_vis)

analyzer.close()
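When a batch of inputs mixes PDFs and images, the loader can be chosen from the file extension. The following is a minimal sketch rather than part of YomiToku: pick_loader is a hypothetical helper, and the two loaders are passed in as arguments so that the snippet runs without the library installed.

```python
from pathlib import Path


def pick_loader(path, load_pdf, load_image):
    """Return load_pdf for *.pdf paths and load_image for anything else.

    pick_loader is a hypothetical helper; the loaders are injected so the
    function itself has no dependency on yomitoku.
    """
    suffix = Path(path).suffix.lower()
    return load_pdf if suffix == ".pdf" else load_image
```

With the library installed, this could be used as pick_loader(path, load_pdf, load_image)(path), with both loaders imported from yomitoku.data.functions.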
Option Name | Type | Description | Notes
visualize | bool | Specifies whether to visualize the processing results. Setting this to False is recommended unless you are debugging. | If True, the OCR visualization is returned as the 2nd return value and the layout analysis visualization as the 3rd return value. If False, None is returned for both.
device | str | Specifies the device to be used for processing. | The default is "cuda". If a GPU is unavailable, it automatically switches to "cpu".
configs | dict | Used to set more detailed parameters for module processing. | Refer to Model Detailed Config for details.
license_key | str | The license key used for authentication. | This argument takes priority over the environment variable. If it is not set, the key in the environment variable is used.
secret_key | str | The secret key used for authentication. | This argument takes priority over the environment variable. If it is not set, the key in the environment variable is used.
device_token | str | The device token used with external services. | This argument takes priority over the environment variable. If it is not set, the key in the environment variable is used.
ignore_ruby | bool | Specifies whether to exclude ruby (furigana) text from the output. | Default is False.
ruby_threshold | float | Specifies the threshold for ruby detection as a ratio to the median line height. Text consisting solely of hiragana or katakana below this threshold is identified as ruby. | Default is 2.0. Only effective when ignore_ruby=True.
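The precedence described for license_key, secret_key, and device_token (explicit argument first, environment variable as fallback) can be sketched as follows. This only illustrates the rule and is not YomiToku's implementation: resolve_credential is a hypothetical helper, and EXAMPLE_LICENSE_KEY is a placeholder name, since this page does not state the actual environment variable names.

```python
import os


def resolve_credential(value, env_name, environ=None):
    """Return the explicit value if given, else the environment variable.

    Hypothetical helper: env_name is a placeholder, not a name that
    YomiToku is known to read.
    """
    env = os.environ if environ is None else environ
    if value is not None:
        # An explicitly passed argument takes priority.
        return value
    # Fall back to the environment; None if the variable is not set.
    return env.get(env_name)


# Example with an in-memory mapping instead of the real environment.
fake_env = {"EXAMPLE_LICENSE_KEY": "from-env"}
print(resolve_credential("from-arg", "EXAMPLE_LICENSE_KEY", fake_env))
print(resolve_credential(None, "EXAMPLE_LICENSE_KEY", fake_env))
```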

The results of DocumentAnalyzer can be exported in the following formats:

Method | Output Format
to_json() | JSON format (*.json)
to_html() | HTML format (*.html)
to_csv() | Comma-separated CSV format (*.csv)
to_markdown() | Markdown format (*.md)
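If the output format should follow the file name, the four export methods above can be selected by extension. This is a sketch under assumptions: export_results is a hypothetical helper, and it assumes each method accepts the output path as its first argument, as to_html and to_json do in the examples on this page.

```python
from pathlib import Path

# Extension to method name, taken from the export table above.
EXPORTERS = {
    ".json": "to_json",
    ".html": "to_html",
    ".csv": "to_csv",
    ".md": "to_markdown",
}


def export_results(results, out_path, **kwargs):
    """Call the export method matching out_path's extension.

    Hypothetical helper. Extra keyword arguments (for example img=img for
    to_html) are forwarded unchanged. Returns the method name that was used.
    """
    suffix = Path(out_path).suffix.lower()
    if suffix not in EXPORTERS:
        raise ValueError(f"unsupported export format: {suffix!r}")
    method_name = EXPORTERS[suffix]
    getattr(results, method_name)(str(out_path), **kwargs)
    return method_name
```

For example, export_results(results, f"output_{i}.html", img=img) would be equivalent to the to_html call in the example above.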

Using AI-OCR Only

AI-OCR performs text detection and recognition on the detected text, returning the positions of the text within the image along with the recognition results.

The module uses the following two models:

  • Text Recognizer
  • Text Detector
import cv2

from yomitoku import OCR
from yomitoku.data.functions import load_pdf

ocr = OCR(visualize=True, device="cuda")
# Load the PDF file (one image per page)
imgs = load_pdf("demo/samples/sample.pdf")
for i, img in enumerate(imgs):
    results, ocr_vis = ocr(img)

    # Export the analysis results in JSON format
    results.to_json(f"output_{i}.json")
    cv2.imwrite(f"output_ocr_{i}.jpg", ocr_vis)

ocr.close()
Option Name | Type | Description | Notes
visualize | bool | Specifies whether to visualize the processing results. Setting this to False is recommended unless you are debugging. | If True, the OCR visualization is returned as the 2nd return value. If False, None is returned.
device | str | Specifies the device to be used for processing. | The default is "cuda". If a GPU is unavailable, it automatically switches to "cpu".
configs | dict | Used to set more detailed parameters for module processing. | Refer to Model Detailed Config for details.
license_key | str | The license key used for authentication. | This argument takes priority over the environment variable. If it is not set, the key in the environment variable is used.
secret_key | str | The secret key used for authentication. | This argument takes priority over the environment variable. If it is not set, the key in the environment variable is used.
device_token | str | The device token used with external services. | This argument takes priority over the environment variable. If it is not set, the key in the environment variable is used.

The results of OCR processing support export in JSON format (to_json()) only.

Using Layout Analyzer Only

LayoutAnalyzer performs AI-based detection of paragraphs, figures, and tables, together with table structure analysis, to analyze the layout structure of a document.

The module uses the following two models:

  • Layout Parser
  • Table Structure Recognizer
import cv2

from yomitoku import LayoutAnalyzer
from yomitoku.data.functions import load_pdf

analyzer = LayoutAnalyzer(visualize=True, device="cuda")
# Load the PDF file (one image per page)
imgs = load_pdf("demo/sample.pdf")
for i, img in enumerate(imgs):
    results, layout_vis = analyzer(img)
    # Export the analysis results in JSON format
    results.to_json(f"output_{i}.json")
    cv2.imwrite(f"output_layout_{i}.jpg", layout_vis)

analyzer.close()
Option Name | Type | Description | Notes
visualize | bool | Specifies whether to visualize the processing results. Setting this to False is recommended unless you are debugging. | If True, the layout analysis visualization is returned as the 2nd return value. If False, None is returned.
device | str | Specifies the device to be used for processing. | The default is "cuda". If a GPU is unavailable, it automatically switches to "cpu".
configs | dict | Used to set more detailed parameters for module processing. | Refer to Model Detailed Config for details.
license_key | str | The license key used for authentication. | This argument takes priority over the environment variable. If it is not set, the key in the environment variable is used.
secret_key | str | The secret key used for authentication. | This argument takes priority over the environment variable. If it is not set, the key in the environment variable is used.
device_token | str | The device token used with external services. | This argument takes priority over the environment variable. If it is not set, the key in the environment variable is used.

The results of LayoutAnalyzer processing support export only in JSON format (to_json()).

Model Detailed Config

By providing a config, you can adjust the behavior in greater detail. The following parameters can be set for the model:

Option Name | Type | Description
model_name | str | Specifies the name of the model to be used.
path_cfg | str | Specifies the path to the config file containing the hyperparameters.
device | str | Specifies the device to be used for inference. (Allowed values: cuda, cpu, mps)
visualize | bool | Specifies whether to perform visualization.
from_pretrained | bool | Specifies whether to use a pretrained model.
infer_onnx | bool | Specifies whether to use ONNX Runtime instead of PyTorch for inference.

How to Write Config

The config is provided in dictionary format. By using a config, you can execute processing on different devices for each model and set detailed parameters.

For example, the following config allows the OCR processing to run on a GPU, while the layout analysis is performed on a CPU:

from yomitoku import DocumentAnalyzer

if __name__ == "__main__":
    configs = {
        "ocr": {
            "text_detector": {
                "device": "cuda",
            },
            "text_recognizer": {
                "device": "cuda",
            },
        },
        "layout_analyzer": {
            "layout_parser": {
                "device": "cpu",
            },
            "table_structure_recognizer": {
                "device": "cpu",
            },
        },
    }

    DocumentAnalyzer(configs=configs)
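Because the nested dictionary repeats the same device value for each model, it can be convenient to build it from two arguments. The following is a minimal sketch: build_device_configs is a hypothetical helper, not a YomiToku function, and it only reproduces the key layout shown in the example above while checking the device names listed under Model Detailed Config.

```python
# Device names listed as allowed values in the Model Detailed Config table.
ALLOWED_DEVICES = {"cuda", "cpu", "mps"}


def build_device_configs(ocr_device="cuda", layout_device="cpu"):
    """Build a configs dict that places OCR and layout models on devices.

    Hypothetical helper mirroring the dictionary layout shown above.
    """
    for device in (ocr_device, layout_device):
        if device not in ALLOWED_DEVICES:
            raise ValueError(f"unknown device: {device!r}")
    return {
        "ocr": {
            "text_detector": {"device": ocr_device},
            "text_recognizer": {"device": ocr_device},
        },
        "layout_analyzer": {
            "layout_parser": {"device": layout_device},
            "table_structure_recognizer": {"device": layout_device},
        },
    }
```

The returned dictionary can be passed as DocumentAnalyzer(configs=build_device_configs("cuda", "cpu")), which matches the hand-written config above.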

Using as a REST API Server

Document Analyzer can also be started as a REST API server and used over HTTP.

uv pip install -e ".[server]"
yomitoku_server document_analyzer [--host 0.0.0.0] [--port 8000] [--device cuda]

After starting the server, you can send binary image or PDF data to the POST /invocations endpoint to obtain analysis results.

curl -X POST http://localhost:8000/invocations \
     -H "Content-Type: image/jpeg" \
     --data-binary @sample.jpg

Deployment via Docker is also supported. See the Server page for details.
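The same request as the curl call can be built from Python with the standard library. This is a sketch under assumptions: build_invocation_request is a hypothetical helper, the Content-Type is guessed from the file name (sending "application/pdf" for PDFs is an assumption, as only image/jpeg is shown above), and the response format is not described on this page, so the body is read as raw bytes.

```python
import mimetypes
import urllib.request


def build_invocation_request(data, filename, base_url="http://localhost:8000"):
    """Build a POST request for /invocations carrying raw file bytes.

    Hypothetical helper; the Content-Type is derived from the file name.
    """
    content_type, _ = mimetypes.guess_type(filename)
    if content_type is None:
        raise ValueError(f"cannot determine content type for {filename!r}")
    return urllib.request.Request(
        url=f"{base_url}/invocations",
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )


def send(request):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running, send(build_invocation_request(open("sample.jpg", "rb").read(), "sample.jpg")) performs the same call as the curl example.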

Defining Parameters in a YAML File

By providing the path to a YAML file in the config, you can adjust detailed parameters for inference. Examples of YAML files can be found in the configs directory within the repository. While the model's network parameters cannot be modified, certain aspects like post-processing parameters and input image size can be adjusted. Refer to Model Config for configurable parameters.

For instance, you can define post-processing thresholds for the Text Detector in a YAML file and set its path in the config. The config file does not need to include all parameters; you only need to specify the parameters that require changes.

post_process:
  thresh: 0.1
  unclip_ratio: 2.5

The path to the YAML file can be set in the config as follows:

from yomitoku import DocumentAnalyzer
from yomitoku.data.functions import load_pdf

# Initialize DocumentAnalyzer with the path to the config file
configs = {"ocr": {"text_detector": {"path_cfg": "demo/text_detector.yaml"}}}

analyzer = DocumentAnalyzer(configs=configs, visualize=True, device="cuda")

PATH_IMAGE = "demo/samples/sample.pdf"

imgs = load_pdf(PATH_IMAGE)

for i, img in enumerate(imgs):
    analyzer(img)
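When thresholds are tuned from a script, the YAML file and the matching config entry can be generated together. The following is a sketch under assumptions: write_text_detector_config is a hypothetical helper, the file name text_detector.yaml is arbitrary, and only the two post-processing keys from the YAML example above are written.

```python
from pathlib import Path


def write_text_detector_config(directory, thresh=0.1, unclip_ratio=2.5):
    """Write a minimal Text Detector YAML file and return a configs dict.

    Hypothetical helper; it writes only the parameters being changed,
    which the text above says is sufficient.
    """
    path = Path(directory) / "text_detector.yaml"
    path.write_text(
        "post_process:\n"
        f"  thresh: {thresh}\n"
        f"  unclip_ratio: {unclip_ratio}\n",
        encoding="utf-8",
    )
    # Same dictionary layout as the path_cfg example above.
    return {"ocr": {"text_detector": {"path_cfg": str(path)}}}
```

The returned dictionary can be passed directly as DocumentAnalyzer(configs=...).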