# CLI Usage
This page explains how to use YomiToku as a command-line interface (CLI).
When you run the command for the first time, the model weight files will be automatically downloaded from HuggingFace Hub. After that, you can analyze document images using the following command:
| Option Name | Description |
|---|---|
| ${path_data} | Specifies the path to the directory containing images or the path to an image file. |
| -o, --outdir | Specifies the output directory (will be created if it doesn't exist). |
| -v, --vis | Outputs a visualization image of the analysis results. |
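For scripted use, the basic invocation can be assembled from the options in the table above. The following is a minimal Python sketch, not part of YomiToku: the executable name yomitoku, the helper build_command, and the sample path are assumptions for illustration. Only the -o and -v options are taken from this page.

```python
import shlex


def build_command(path_data, outdir=None, vis=False, executable="yomitoku"):
    """Assemble the argument list for a basic analysis run.

    `executable` is an assumed command name; adjust it to your installation.
    """
    cmd = [executable, str(path_data)]
    if outdir is not None:
        cmd += ["-o", str(outdir)]  # output directory (created if missing)
    if vis:
        cmd.append("-v")  # also write a visualization image
    return cmd


# Hypothetical input path, printed as a copy-pasteable shell command.
print(shlex.join(build_command("sample.pdf", outdir="results", vis=True)))
```

Passing the returned list to subprocess.run(cmd, check=True) would execute it. The sketch only prints the command so that it can be inspected first.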
Supplement: About ${path_data}
- An image file or a directory can be specified.
- If a directory is specified, it will be processed recursively, including subdirectories.
- The supported file formats are pdf, jpeg, png, bmp, and tiff.
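To preview which files a directory argument would cover, the recursive lookup described above can be approximated with pathlib. This is an illustrative sketch rather than YomiToku's own implementation. The helper name collect_inputs is hypothetical, and accepting the .jpg and .tif spellings is an assumption not stated on this page.

```python
from pathlib import Path

# Formats listed on this page; ".jpg" and ".tif" are assumed aliases.
SUPPORTED_EXTENSIONS = {".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".tif"}


def collect_inputs(path_data):
    """Return supported files for a single file or, recursively, a directory."""
    root = Path(path_data)
    if root.is_file():
        return [root] if root.suffix.lower() in SUPPORTED_EXTENSIONS else []
    return sorted(
        p
        for p in root.rglob("*")  # rglob descends into subdirectories
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```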
Note
- OCR is generally divided into Document OCR and Scene OCR (e.g., text on signs or surfaces other than paper). YomiToku is optimized for Document OCR.
- The accuracy of AI-OCR depends heavily on the resolution of the input image. For best results, we recommend using images whose shorter side is at least 1000 px.
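The resolution recommendation can be checked before submitting an image. A minimal sketch, assuming the pixel dimensions are already known; the function name is hypothetical and the check is not performed by YomiToku itself as far as this page states.

```python
def meets_resolution_recommendation(width, height, min_short_edge=1000):
    """True if the shorter side reaches the recommended 1000 px."""
    return min(width, height) >= min_short_edge
```

For example, a 1240 x 1754 scan passes the check, while an 800 x 3000 strip does not, because only the shorter side counts.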
## Displaying Help
To display the list of available options:
## License Key Authentication
You can also specify the license key and secret key directly when running the command:
- -k, --license_key: Specify your license key.
- -s, --secret_key: Specify your secret key.
## Lightweight Mode (Faster Processing)
Use the --lite option to run inference with a lightweight model. This allows faster analysis compared to normal mode, though text recognition accuracy may decrease.
## Specifying Visualization Output Directory
To specify the folder for saving visualized images:
## Specifying Output Format
Use -f or --format to specify the output format of analysis results. Supported formats are: json, csv, html, md, pdf (searchable-pdf).
If pdf is specified, the text recognized by OCR is embedded as an invisible layer over the page image, producing a searchable PDF.
You can specify multiple formats at once, separated by commas:
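When the format list is built programmatically, it helps to validate it before calling the CLI. The sketch below splits a comma-separated value and checks each entry against the formats named above. The helper parse_formats is hypothetical, and how the CLI itself treats whitespace or unknown values is not documented here.

```python
SUPPORTED_FORMATS = ("json", "csv", "html", "md", "pdf")


def parse_formats(value):
    """Split a comma-separated format string and reject unknown entries."""
    formats = [item.strip().lower() for item in value.split(",") if item.strip()]
    unknown = [f for f in formats if f not in SUPPORTED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported format(s): {', '.join(unknown)}")
    return formats
```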
## Specifying Inference Device
Use the -d or --device option to specify the device for model execution. Supported values: cuda, cpu, mps. Default is cuda. If GPU is not available, it will fall back to cpu.
## Ignoring Line Breaks
By default, line breaks follow the layout in the image. With the --ignore_line_break option, line breaks are ignored and sentences in the same paragraph are merged.
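The effect of --ignore_line_break on a single paragraph can be pictured as follows. This is a rough sketch only: joining lines without a separator suits Japanese text and is an assumption here, since the tool's actual merging rules are not described on this page.

```python
def merge_paragraph_lines(paragraph):
    """Join the lines of one paragraph into a single line (no separator)."""
    return "".join(line.strip() for line in paragraph.splitlines())
```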
## Extracting and Saving Figures/Graphs
Normally, figures and images in documents are not extracted. With the --figure option, they will be cropped and saved as separate image files, and links to them will be included in the output file.
To specify the folder for saving images:
If you set --figure_dir to an empty string, the images will be saved directly under the output folder:
## Extracting Text in Figures or Images
By default, text contained within figures or images is not extracted. With the --figure_letter option, text in figures/images will also be included in the output.
## Specifying Output File Encoding
You can set the character encoding for the output file using --encoding. Supported encodings: utf-8, utf-8-sig, shift-jis, euc-jp, cp932. Characters that the selected encoding cannot represent are ignored.
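What "ignored" means for unsupported characters can be reproduced with Python's standard codecs. This sketch mirrors the described behavior with errors="ignore"; whether YomiToku uses exactly this mechanism is an assumption. Note that euc-jp is the Python codec name for EUC-JP.

```python
# Encoding names as accepted by Python's codec registry.
SUPPORTED_ENCODINGS = ("utf-8", "utf-8-sig", "shift-jis", "euc-jp", "cp932")


def encode_output(text, encoding="utf-8"):
    """Encode text, silently dropping characters the encoding cannot represent."""
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(f"unsupported encoding: {encoding}")
    return text.encode(encoding, errors="ignore")
```

With shift-jis, for instance, an emoji in the recognized text is dropped while the surrounding kanji are kept. With utf-8-sig, the output starts with a byte order mark, which some spreadsheet applications rely on to detect UTF-8.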
## Specifying Config File Paths
You can specify the YAML config file paths for each module:
| Option Name | Target Model |
|---|---|
| --td_cfg | Text Detector (TD) |
| --tr_cfg | Text Recognizer (TR) |
| --lp_cfg | Layout Parser (LP) |
| --tsr_cfg | Table Structure Recognizer (TSR) |
Example:
## Excluding Metadata
Exclude metadata such as headers or footers from the output file:
## Combining Multiple PDF Pages into One File
If the input is a multi-page PDF, you can export all pages into a single output file:
## Automatic Document Orientation Correction
If images are rotated (e.g., sideways), YomiToku can detect and automatically correct their orientation:
## Enabling Recognition Orientation Fallback
By default, orientation fallback is disabled. When enabled with --enable-rec-orientation-fallback, if the confidence score of text recognition is low, the system retries recognition with the ROI image rotated 180 degrees and adopts the result with the higher confidence.
You can specify the confidence threshold for triggering the fallback using --rec-orientation-fallback-thresh. (Default: 0.75)
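The fallback logic described above can be sketched as follows. This is an illustration under stated assumptions, not YomiToku's code: recognize and rotate_180 are hypothetical callables supplied by the caller, recognize is assumed to return a (text, confidence) pair, and treating a score strictly below the threshold as "low" is also an assumption.

```python
def recognize_with_fallback(roi, recognize, rotate_180, thresh=0.75):
    """Retry recognition on the 180-degree-rotated ROI when confidence is low.

    Returns the (text, score) pair with the higher confidence.
    """
    text, score = recognize(roi)
    if score >= thresh:
        return text, score  # confident enough: no retry
    alt_text, alt_score = recognize(rotate_180(roi))
    if alt_score > score:
        return alt_text, alt_score
    return text, score
```

Because the retry runs only for low-confidence regions, the extra cost is limited to those regions. Raising the threshold triggers more retries.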
## Checking Request Count
You can check the usage count linked to your YomiToku license key:
All options are optional. Any option that is omitted is read from environment variables.
## Specifying Reading Order
By default, the reading order option is set to auto.
When auto is specified, the system identifies the document's writing direction (horizontal or vertical) and automatically estimates the reading order. Specifically, the order is estimated as top2bottom for horizontal documents and right2left for vertical documents.
| Setting Name | Preferred Reading Order | Valid Document Types |
|---|---|---|
| top2bottom | Top to Bottom | Column-formatted Word documents, etc. |
| left2right | Left to Right | Layouts where keys and values are in columns (e.g., receipts, insurance cards) |
| right2left | Right to Left | Vertically written documents |
You can also explicitly set it:
## PDF Output Image Quality
You can specify the image quality preset for searchable PDF output using --pdf_quality. The default is high.
| Preset | Max Long Side | JPEG Quality | Description |
|---|---|---|---|
| high | No limit | 85 | High quality (default). Preserves the original image resolution. |
| middle | 2000px | 80 | Medium quality. Balances file size and image quality. |
| low | 1500px | 60 | Low quality. Minimizes file size. |
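To estimate the image size each preset produces, the long-side limits from the table can be applied as an aspect-preserving downscale. This is a sketch under assumptions: the function name is hypothetical, and the page does not state how YomiToku rounds dimensions or whether it ever upscales (this sketch never does).

```python
# (max long side in px or None for no limit, JPEG quality), from the table above.
PDF_QUALITY_PRESETS = {
    "high": (None, 85),
    "middle": (2000, 80),
    "low": (1500, 60),
}


def target_size(width, height, preset="high"):
    """Return the (width, height) after applying the preset's long-side limit."""
    max_long, _quality = PDF_QUALITY_PRESETS[preset]
    long_side = max(width, height)
    if max_long is None or long_side <= max_long:
        return width, height  # within the limit: keep the original size
    scale = max_long / long_side
    return round(width * scale), round(height * scale)
```

For a 3000 x 4000 px page image, this gives 1500 x 2000 with middle and 1125 x 1500 with low, while high leaves the image unchanged.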
## Setting the PDF Reading Resolution
You can specify the resolution (DPI) used when reading a PDF (default: 200). Increasing the DPI value may improve recognition accuracy for fine text or small details within the PDF.
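The relationship between DPI and the resulting image size follows from the PDF unit of 1/72 inch per point. The sketch below computes the pixel dimensions of a rasterized page. The function name is hypothetical, and the exact rounding used by YomiToku's PDF reader is an assumption.

```python
def rendered_size(width_pt, height_pt, dpi=200):
    """Pixel size of a PDF page (given in points) rasterized at `dpi`."""
    scale = dpi / 72  # one PDF point is 1/72 inch
    return round(width_pt * scale), round(height_pt * scale)
```

An A4 page of about 595 x 842 points comes out at roughly 1653 x 2339 px at the default 200 DPI and 2479 x 3508 px at 300 DPI. Processing time and memory use can be expected to grow with the pixel count.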
## Excluding Ruby (Furigana) Text
You can exclude ruby (furigana) text from the output. When the --ignore_ruby option is set, text is identified as ruby and excluded if it consists solely of hiragana or katakana characters and its line height is below a threshold relative to the median line height within its paragraph or cell.
You can adjust the ruby detection threshold using the --ruby_threshold option (default: 2.0). Increasing the value widens the range of text identified as ruby.
## Specifying Pages to Process
You can choose to process only specific pages. Pages can be specified either as a comma-separated list or as a range using a hyphen.
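A page specification of this form can be expanded into an explicit page list as sketched below. The helper parse_pages is hypothetical, and treating page numbers as 1-based with inclusive ranges is an assumption, since the page does not spell out these details.

```python
def parse_pages(spec):
    """Expand a spec such as "1,3-5" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)
```

For example, parse_pages("1,3-5") yields [1, 3, 4, 5]. Overlapping entries such as "2,1-3" are deduplicated.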
## Specify and Execute a Model
You can run AI-OCR by specifying particular models.
Use tr_name to define the text recognition model and td_name to define the text detection model.
### Model List and Key Features
| Category | Model Name | Version | Max Sequence Length | Supported Text Types | Description |
|---|---|---|---|---|---|
| Text Recognition | parseqv3 | v1.3.0 | 100 characters | Printed / Handwritten | Accuracy-optimized model providing high OCR performance for general documents. |
| Text Recognition | parseqv4 | v1.4.0 | 100 characters | Printed / Handwritten / Old-style / Variant Characters | High-accuracy model supporting a wide range of Japanese characters, including historical and variant forms. (★ Default) |
| Text Recognition | parseqv4-short | v1.4.0 | 75 characters | Printed / Handwritten / Old-style / Variant Characters | Balanced model optimized for both processing speed and accuracy. |
| Text Recognition | parseqv4-tiny | v1.4.0 | 50 characters | Printed / Handwritten / Old-style / Variant Characters | High-speed lightweight model optimized for CPU inference with broad versatility. |
| Text Recognition | parseqv4-large | v1.6 | 100 characters | Printed / Handwritten / Old-style / Variant Characters | Large-scale model with stronger language model correction. Improved recognition of fine characters, vertical text, and symbols. |
| Text Detection | dbnet | v1.0.0 | - | Printed | Detection model optimized for printed text. |
| Text Detection | dbnetv2 | v1.2.0 | - | Printed / Handwritten | Detection model optimized for both printed and handwritten text. |
| Text Detection | dbnetv2_1 | v1.6 | - | Printed / Handwritten | Improved version of dbnetv2 with enhanced detection of fine characters, vertical text, and symbols. (★ Default) |
### Recommended Combinations
- For maximum accuracy: parseqv4-large + dbnetv2_1
- For balanced speed and accuracy on printed documents: parseqv4-short + dbnetv2_1
- For efficient CPU-based recognition of mixed printed and handwritten documents: parseqv4-tiny + dbnetv2_1