CLI Usage

This page explains how to use YomiToku as a command-line interface (CLI).

When you run the command for the first time, the model weight files will be automatically downloaded from HuggingFace Hub. After that, you can analyze document images using the following command:

yomitoku ${path_data} -v -o results

| Option Name | Description |
|---|---|
| ${path_data} | Specifies the path to the directory containing images or the path to an image file. |
| -o, --outdir | Specifies the output directory (will be created if it doesn't exist). |
| -v, --vis | Outputs a visualization image of the analysis results. |

Supplement: About ${path_data}

  • An image file or a directory can be specified.
  • If a directory is specified, it will be processed recursively, including subdirectories.
  • The supported file formats are pdf, jpeg, png, bmp, and tiff.
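
Before running a large batch, it can help to preview which files a recursive scan would match. The following is a minimal sketch using `find`; the helper name `list_inputs` is hypothetical, and the extension list (including the `.jpg` and `.tif` spellings) is an assumption based on the formats named above, not necessarily the exact set YomiToku matches.

```shell
# list_inputs DIR: print files under DIR (recursively) whose extension
# suggests one of the supported formats.
# Assumption: the extension list below is illustrative, not YomiToku's own.
list_inputs() {
  find "$1" -type f \( \
    -iname '*.pdf' -o -iname '*.jpg' -o -iname '*.jpeg' -o \
    -iname '*.png' -o -iname '*.bmp' -o \
    -iname '*.tif' -o -iname '*.tiff' \) | sort
}
```

For example, `list_inputs ./scans | wc -l` gives a rough count of the inputs before starting the analysis.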

Note

  • OCR is generally divided into Document OCR and Scene OCR (e.g., text on signs or surfaces other than paper). YomiToku is optimized for Document OCR.
  • The accuracy of AI-OCR depends heavily on the resolution of the input image. For best results, we recommend using images with a minimum short edge of 1000px.
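
The resolution recommendation can be checked in a script before running OCR. This is a small sketch, assuming the image width and height in pixels are already known (for instance from ImageMagick's `identify -format '%w %h' image.png`); the helper name `short_edge_ok` is hypothetical.

```shell
# short_edge_ok WIDTH HEIGHT: succeed if the shorter side is at least
# 1000 px, the minimum recommended above.
short_edge_ok() {
  w=$1
  h=$2
  if [ "$w" -lt "$h" ]; then short=$w; else short=$h; fi
  [ "$short" -ge 1000 ]
}

# Example: warn about a small 800x1200 scan.
if ! short_edge_ok 800 1200; then
  echo "short edge below 1000px: consider rescanning at a higher resolution"
fi
```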

Displaying Help

To display the list of available options:

yomitoku --help
# or
yomitoku -h

License Key Authentication

You can also specify the license key and secret key directly when running the command:

yomitoku ${path_data} -k ${your_license_key} -s ${your_secret_key}
  • -k, --license_key: Specify your license key.
  • -s, --secret_key: Specify your secret key.

Lightweight Mode (Faster Processing)

Use the --lite option to run inference with a lightweight model. This allows faster analysis compared to normal mode, though text recognition accuracy may decrease.

yomitoku ${path_data} --lite -v

Specifying Visualization Output Directory

To specify the folder for saving visualized images:

yomitoku ${path_data} -f md -v --vis_dir ${folder_name}

Specifying Output Format

Use -f or --format to specify the output format of analysis results. Supported formats are: json, csv, html, md, pdf (searchable-pdf).

yomitoku ${path_data} -f md

If pdf is specified, YomiToku runs OCR on the image and embeds the recognized text as an invisible layer, producing a searchable PDF.

You can specify multiple formats at once, separated by commas:

yomitoku ${path_data} -f md,html,json,csv
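
When the format list is assembled by a script, it may be worth validating it before starting a long run. The sketch below checks a comma-separated list against the formats named above; the helper name `validate_formats` is hypothetical and is not part of YomiToku.

```shell
# validate_formats LIST: succeed if every comma-separated entry in LIST
# is one of json, csv, html, md, pdf.
# The body runs in a subshell ( ... ) so the IFS change does not leak;
# `exit` therefore only leaves the subshell, not the calling script.
validate_formats() (
  [ -n "$1" ] || exit 1
  IFS=,
  for f in $1; do
    case $f in
      json|csv|html|md|pdf) ;;
      *) echo "unsupported format: $f" >&2; exit 1 ;;
    esac
  done
)
```

A typical use would be `validate_formats "$formats" && yomitoku "$path_data" -f "$formats"`.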

Specifying Inference Device

Use the -d or --device option to specify the device for model execution. Supported values: cuda, cpu, mps. Default is cuda. If GPU is not available, it will fall back to cpu.

yomitoku ${path_data} -d cpu
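
In scripts that run on mixed hardware, the device can be chosen explicitly rather than relying on the fallback. The following sketch uses common heuristics (a working `nvidia-smi` for CUDA, Apple Silicon for mps); these checks and the helper name `pick_device` are assumptions for illustration, not how YomiToku itself detects devices.

```shell
# pick_device: print cuda, mps, or cpu based on simple host checks.
pick_device() {
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    echo cuda
  elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    echo mps
  else
    echo cpu
  fi
}

device=$(pick_device)
echo "selected device: $device"
```

The result can then be passed along as `yomitoku "$path_data" -d "$device"`.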

Ignoring Line Breaks

By default, line breaks follow the layout in the image. With the --ignore_line_break option, line breaks are ignored and sentences in the same paragraph are merged.

yomitoku ${path_data} --ignore_line_break

Extracting and Saving Figures/Graphs

Normally, figures and images in documents are not extracted. With the --figure option, they will be cropped and saved as separate image files, and links to them will be included in the output file.

yomitoku ${path_data} --figure

To specify the folder for saving images:

yomitoku ${path_data} --figure_dir ${folder_name}

If you set --figure_dir to an empty string, the images will be saved directly under the output folder:

yomitoku ${path_data} --figure_dir ""

Extracting Text in Figures or Images

By default, text contained within figures or images is not extracted. With the --figure_letter option, text in figures/images will also be included in the output.

yomitoku ${path_data} --figure --figure_letter

Specifying Output File Encoding

You can set the character encoding for the output file using --encoding. Supported encodings: utf-8, utf-8-sig, shift-jis, euc-jp, cp932. Characters that cannot be represented in the chosen encoding are ignored.

yomitoku ${path_data} --encoding utf-8-sig

Specifying Config File Paths

You can specify the YAML config file paths for each module:

| Option Name | Target Model |
|---|---|
| --td_cfg | Text Detector (TD) |
| --tr_cfg | Text Recognizer (TR) |
| --lp_cfg | Layout Parser (LP) |
| --tsr_cfg | Table Structure Recognizer (TSR) |

Example:

yomitoku ${path_data} --td_cfg ${path_yaml}

Excluding Metadata

Exclude metadata such as headers or footers from the output file:

yomitoku ${path_data} --ignore_meta

Combining Multiple PDF Pages into One File

If the input is a multi-page PDF, you can export all pages into a single output file:

yomitoku ${path_data} -f md --combine

Automatic Document Orientation Correction

If images are rotated (e.g., sideways), YomiToku can detect and automatically correct their orientation:

yomitoku ${path_data} --rotate_detection

Enabling Recognition Orientation Fallback

By default, orientation fallback is disabled. When enabled with --enable-rec-orientation-fallback, if the confidence score of text recognition is low, the system retries recognition with the ROI image rotated 180 degrees and adopts the result with the higher confidence.

yomitoku ${path_data} --enable-rec-orientation-fallback

You can specify the confidence threshold for triggering the fallback using --rec-orientation-fallback-thresh. (Default: 0.75)

yomitoku ${path_data} --enable-rec-orientation-fallback --rec-orientation-fallback-thresh 0.6
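
The decision rule described above can be sketched as a small function. This is an illustration of the logic only, not YomiToku's implementation: the helper name `pick_orientation` is hypothetical, both confidence scores are passed in as plain numbers, and treating "low" as strictly below the threshold is an assumption.

```shell
# pick_orientation CONF_ORIGINAL CONF_ROTATED [THRESHOLD]
# Prints which recognition result would be kept.
# THRESHOLD defaults to 0.75, the documented default.
pick_orientation() {
  awk -v a="$1" -v b="$2" -v t="${3:-0.75}" 'BEGIN {
    if (a >= t) { print "original"; exit }  # confident enough: no retry
    if (b > a)  { print "rotated" }         # retry scored higher
    else        { print "original" }
  }'
}

pick_orientation 0.50 0.80      # prints "rotated"
pick_orientation 0.70 0.90 0.6  # prints "original" (0.70 is above 0.6)
```

Lowering the threshold, as in the second call, means fewer regions trigger the retry.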

Checking Request Count

You can check the usage count linked to your YomiToku license key:

query_count --license_key ${YOMITOKU_LICENSE_KEY} --secret_key ${YOMITOKU_SECRET_KEY}

Both options are optional. If an option is omitted, its value is read from the corresponding environment variable.
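
If you rely on environment variables, a guard like the one below can catch a missing key early. This is a sketch under an assumption: the variable names YOMITOKU_LICENSE_KEY and YOMITOKU_SECRET_KEY are taken from the placeholders in the command above and may not be the exact names the tool reads. The helper name `require_keys` is hypothetical.

```shell
# require_keys: succeed only if both (assumed) variables are non-empty.
require_keys() {
  if [ -z "${YOMITOKU_LICENSE_KEY:-}" ] || [ -z "${YOMITOKU_SECRET_KEY:-}" ]; then
    echo "set YOMITOKU_LICENSE_KEY and YOMITOKU_SECRET_KEY first" >&2
    return 1
  fi
}

# Intended use (not executed here):
#   require_keys && query_count
```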


Specifying Reading Order

By default, the reading order option is set to auto.

When auto is specified, the system identifies the document's orientation (horizontal or vertical) and automatically estimates the reading order. Specifically, the order is estimated as top2bottom for horizontal documents and right2left for vertical documents.

| Setting Name | Preferred Reading Order | Valid Document Types |
|---|---|---|
| top2bottom | Top to Bottom | Column-formatted Word documents, etc. |
| left2right | Left to Right | Layouts where keys and values are in columns (e.g., receipts, insurance cards) |
| right2left | Right to Left | Vertically written documents |

You can also explicitly set it:

yomitoku ${path_data} --reading_order left2right

PDF Output Image Quality

You can specify the image quality preset for searchable PDF output using --pdf_quality. The default is high.

| Preset | Max Long Side | JPEG Quality | Description |
|---|---|---|---|
| high | No limit | 85 | High quality (default). Preserves the original image resolution. |
| middle | 2000px | 80 | Medium quality. Balances file size and image quality. |
| low | 1500px | 60 | Low quality. Minimizes file size. |

yomitoku ${path_data} -f pdf --pdf_quality middle
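
To get a feel for what a long-side limit means for a given page, the arithmetic can be sketched as follows. It assumes the image is scaled down with its aspect ratio preserved and never scaled up; the source states only the limits, so this is an illustration rather than YomiToku's exact resizing code. The helper name `fit_long_side` is hypothetical.

```shell
# fit_long_side WIDTH HEIGHT MAX: print the dimensions after limiting the
# longer side to MAX pixels (integer arithmetic, rounded to nearest).
fit_long_side() {
  w=$1
  h=$2
  max=$3
  if [ "$w" -ge "$h" ]; then long=$w; else long=$h; fi
  if [ "$long" -le "$max" ]; then
    echo "$w $h"   # already within the limit: unchanged
  else
    echo "$(( (w * max + long / 2) / long )) $(( (h * max + long / 2) / long ))"
  fi
}

fit_long_side 3000 2000 2000   # middle preset: prints "2000 1333"
fit_long_side 3000 2000 1500   # low preset: prints "1500 1000"
```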

Setting the PDF Reading Resolution

Use --dpi to specify the resolution (DPI) at which PDFs are read (default: 200). Increasing the DPI may improve recognition accuracy for PDFs containing fine text or small details.

yomitoku ${path_data} --dpi 250
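
The effect of the DPI value on image size follows from pixels = inches × DPI. The sketch below works this out for an A4 page (210 mm × 297 mm, an assumed page size for illustration); the helper name `mm_to_px` is hypothetical, and YomiToku's actual rendering may round differently.

```shell
# mm_to_px MM DPI: convert a length in millimetres to pixels at DPI,
# rounded to the nearest integer (1 inch = 25.4 mm).
mm_to_px() {
  echo $(( ($1 * $2 * 10 + 127) / 254 ))
}

mm_to_px 210 200   # A4 short edge at the default 200 DPI: prints 1654
mm_to_px 210 250   # A4 short edge at 250 DPI: prints 2067
```

Under these assumptions, an A4 page read at the default DPI already exceeds a 1000px short edge, so raising the DPI matters mainly for small or dense text.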

Excluding Ruby (Furigana) Text

You can exclude ruby (furigana) text from the output. When the --ignore_ruby option is set, text whose line height is below a certain threshold relative to the median line height within each paragraph or cell, and consists solely of hiragana or katakana characters, is identified as ruby and excluded.

yomitoku ${path_data} --ignore_ruby

You can adjust the ruby detection threshold using the --ruby_threshold option (default: 2.0). Increasing the value widens the range of text identified as ruby.

yomitoku ${path_data} --ignore_ruby --ruby_threshold 3.0

Specifying Pages to Process

You can choose to process only specific pages. Pages can be specified either as a comma-separated list or as a range using a hyphen.

yomitoku ${path_data} --pages 1,3-5,10
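
To make the page syntax concrete, the sketch below expands a specification such as 1,3-5,10 into the individual page numbers it selects. The helper name `expand_pages` is hypothetical and only illustrates how the list and range notation combine; it is not YomiToku's parser.

```shell
# expand_pages SPEC: print the page numbers selected by SPEC, where SPEC
# is a comma-separated list of single pages and hyphenated ranges.
# The body runs in a subshell ( ... ) so the IFS change stays local.
expand_pages() (
  out=""
  IFS=,
  for part in $1; do
    case $part in
      *-*)
        i=${part%-*}      # range start
        end=${part#*-}    # range end (inclusive)
        while [ "$i" -le "$end" ]; do
          out="$out $i"
          i=$((i + 1))
        done
        ;;
      *)
        out="$out $part"
        ;;
    esac
  done
  echo "${out# }"
)

expand_pages "1,3-5,10"   # prints "1 3 4 5 10"
```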

Specifying Models

You can run AI-OCR with specific models. Use --tr_name to select the text recognition model and --td_name to select the text detection model.

yomitoku ${path_data} --tr_name parseqv4-short --td_name dbnet

Model List and Key Features

| Category | Model Name | Version | Max Sequence Length | Supported Text Types | Description |
|---|---|---|---|---|---|
| Text Recognition | parseqv3 | v1.3.0 | 100 characters | Printed / Handwritten | Accuracy-optimized model providing high OCR performance for general documents. |
| Text Recognition | parseqv4 | v1.4.0 | 100 characters | Printed / Handwritten / Old-style / Variant Characters | High-accuracy model supporting a wide range of Japanese characters, including historical and variant forms. (★ Default) |
| Text Recognition | parseqv4-short | v1.4.0 | 75 characters | Printed / Handwritten / Old-style / Variant Characters | Balanced model optimized for both processing speed and accuracy. |
| Text Recognition | parseqv4-tiny | v1.4.0 | 50 characters | Printed / Handwritten / Old-style / Variant Characters | High-speed lightweight model optimized for CPU inference with broad versatility. |
| Text Recognition | parseqv4-large | v1.6 | 100 characters | Printed / Handwritten / Old-style / Variant Characters | Large-scale model with stronger language model correction. Improved recognition of fine characters, vertical text, and symbols. |
| Text Detection | dbnet | v1.0.0 | - | Printed | Detection model optimized for printed text. |
| Text Detection | dbnetv2 | v1.2.0 | - | Printed / Handwritten | Detection model optimized for both printed and handwritten text. |
| Text Detection | dbnetv2_1 | v1.6 | - | Printed / Handwritten | Improved version of dbnetv2 with enhanced detection of fine characters, vertical text, and symbols. (★ Default) |

  • For maximum accuracy: parseqv4-large + dbnetv2_1
  • For balanced speed and accuracy on printed documents: parseqv4-short + dbnetv2_1
  • For efficient CPU-based recognition of mixed printed and handwritten documents: parseqv4-tiny + dbnetv2_1