Document Analyzer

Bases: BaseJob

A class for analyzing documents, including text detection, recognition, and layout analysis.

This class provides functionality to process and analyze documents by detecting text regions, recognizing text content, and analyzing the layout structure. It supports various configurations for customization, including device selection, preprocessing options, and reading order preferences.

Arguments:

Name	Type	Description	Default
`configs`	`dict`	A dictionary of configurations to override the default settings. Defaults to an empty dictionary.	`{}`
`device`	`str`	The device to use for computation, e.g., "cuda" or "cpu". Defaults to "cuda".	`'cuda'`
`visualize`	`bool`	Whether to enable visualization during processing. Defaults to False.	`False`
`ignore_meta`	`bool`	Whether to ignore metadata in the document. Defaults to False.	`False`
`reading_order`	`str`	The reading order for text extraction. Options include "auto", "left2right", "top2bottom", etc. Defaults to "auto".	`'auto'`
`split_text_across_cells`	`bool`	Whether to split text across table cells. Defaults to False.	`False`
`enable_preprocess`	`bool`	Whether to enable preprocessing steps such as rotation detection. Defaults to False.	`False`
`license_key`	`str`	The license key for using specific features or services. Defaults to None.	`None`
`secret_key`	`str`	The secret key for authentication with external services. Defaults to None.	`None`
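`device_token`	`str`	The device token for authentication with external services. Defaults to None.	`None`
`split_text_across_paragraphs`	`bool`	Whether to split text across paragraphs. Defaults to False.	`False`
`ignore_ruby`	`bool`	Whether to ignore ruby (furigana) text when assigning words to table cells and paragraphs. Defaults to False.	`False`
`ruby_threshold`	`float`	The threshold passed along with `ignore_ruby` when assigning words. Defaults to 2.0.	`2.0`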
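
The `configs` argument is merged into nested defaults keyed by module (`preprocess`, `ocr`, `layout_analyzer`), so an override only needs to name the keys it changes. The sketch below illustrates that kind of merge with a hypothetical `merge_configs` helper. It is a stand-in for the library's `recursive_update`, whose implementation is not shown on this page, and the defaults are trimmed for brevity.

```python
from copy import deepcopy


def merge_configs(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on `defaults` (hypothetical helper).

    Assumption: nested dicts are merged key by key and any other value
    replaces the default. The library's own recursive_update may differ.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Trimmed copy of the nested defaults built in __init__.
defaults = {
    "ocr": {
        "text_detector": {"device": "cuda"},
        "text_recognizer": {"device": "cuda"},
    },
    "layout_analyzer": {
        "layout_parser": {"device": "cuda", "visualize": False},
    },
}

# Override a single leaf; sibling keys keep their defaults.
overrides = {"ocr": {"text_detector": {"device": "cpu"}}}
merged = merge_configs(defaults, overrides)
```

Unlike this sketch, which returns a copy, the constructor applies the update to its `default_configs` dictionary in place. Passing anything other than a `dict` as `configs` raises `ValueError`.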

Attributes:

Name	Type	Description
`text_detector`	`TextDetector`	Instance of the text detection module, initialized based on the configurations.
`text_recognizer`	`TextRecognizer`	Instance of the text recognition module, initialized based on the configurations.
`layout`	`LayoutAnalyzer`	Instance of the layout analysis module, initialized based on the configurations.

Source code in src/yomitoku/document_analyzer.py

class DocumentAnalyzer(BaseJob):
    """
    A class for analyzing documents, including text detection, recognition, and layout analysis.

    This class provides functionality to process and analyze documents by detecting text regions,
    recognizing text content, and analyzing the layout structure. It supports various configurations
    for customization, including device selection, preprocessing options, and reading order preferences.

    Args:
        configs (dict, optional): A dictionary of configurations to override the default settings. Defaults to an empty dictionary.
        device (str, optional): The device to use for computation, e.g., "cuda" or "cpu". Defaults to "cuda".
        visualize (bool, optional): Whether to enable visualization during processing. Defaults to False.
        ignore_meta (bool, optional): Whether to ignore metadata in the document. Defaults to False.
        reading_order (str, optional): The reading order for text extraction. Options include "auto", "left2right", "top2bottom", etc. Defaults to "auto".
        split_text_across_cells (bool, optional): Whether to split text across table cells. Defaults to False.
        enable_preprocess (bool, optional): Whether to enable preprocessing steps such as rotation detection. Defaults to False.
        license_key (str, optional): The license key for using specific features or services. Defaults to None.
        secret_key (str, optional): The secret key for authentication with external services. Defaults to None.
        device_token (str, optional): The device token for authentication with external services. Defaults to None.

    Attributes:
        text_detector (TextDetector): Instance of the text detection module, initialized based on the configurations.
        text_recognizer (TextRecognizer): Instance of the text recognition module, initialized based on the configurations.
        layout (LayoutAnalyzer): Instance of the layout analysis module, initialized based on the configurations.
    """

    def __init__(
        self,
        configs={},
        device="cuda",
        visualize=False,
        ignore_meta=False,
        reading_order="auto",
        split_text_across_cells=False,
        split_text_across_paragraphs=False,
        enable_preprocess=False,
        license_key=None,
        secret_key=None,
        device_token=None,
        ignore_ruby=False,
        ruby_threshold=2.0,
    ):
        super().__init__(use_thread_loop=True)
        self.split_text_across_cells = split_text_across_cells
        self.split_text_across_paragraphs = split_text_across_paragraphs
        self.enable_preprocess = enable_preprocess
        self.reading_order = reading_order
        self.ignore_ruby = ignore_ruby
        self.ruby_threshold = ruby_threshold

        default_configs = {
            "preprocess": {
                "rotate_detector": {
                    "device": device,
                    "license_key": license_key,
                    "secret_key": secret_key,
                    "device_token": device_token,
                },
            },
            "ocr": {
                "text_detector": {
                    "device": device,
                    "license_key": license_key,
                    "secret_key": secret_key,
                    "device_token": device_token,
                },
                "text_recognizer": {
                    "device": device,
                    "license_key": license_key,
                    "secret_key": secret_key,
                    "device_token": device_token,
                },
            },
            "layout_analyzer": {
                "layout_parser": {
                    "device": device,
                    "visualize": visualize,
                    "license_key": license_key,
                    "secret_key": secret_key,
                    "device_token": device_token,
                },
                "table_structure_recognizer": {
                    "device": device,
                    "visualize": visualize,
                    "license_key": license_key,
                    "secret_key": secret_key,
                    "device_token": device_token,
                },
            },
        }

        if isinstance(configs, dict):
            recursive_update(default_configs, configs)
        else:
            self.close()
            raise ValueError(
                "configs must be a dict. See the https://kotaro-kinoshita.github.io/yomitoku-dev/usage/"
            )

        try:
            self.text_detector = TextDetector(
                **default_configs["ocr"]["text_detector"],
            )
            self.text_recognizer = TextRecognizer(
                **default_configs["ocr"]["text_recognizer"]
            )

            self.layout = LayoutAnalyzer(
                configs=default_configs["layout_analyzer"],
            )

            if self.enable_preprocess:
                self.preprocessor = Preprocessor(
                    configs=default_configs["preprocess"],
                )
            self.visualize = visualize
            self.ignore_meta = ignore_meta
        except Exception as e:
            self.close()
            raise e

    def aggregate(self, ocr_res, layout_res, preprocess_res, img):
        word_assigned_list = [False] * len(ocr_res.words)

        word_assigned_list = _assign_words_to_table_cells(
            layout_res.tables,
            ocr_res.words,
            word_assigned_list,
            ignore_ruby=self.ignore_ruby,
            ruby_threshold=self.ruby_threshold,
        )

        paragraphs, word_assigned_list = _assign_words_to_paragraph(
            layout_res.paragraphs,
            ocr_res.words,
            layout_res.figures,
            word_assigned_list,
            ignore_ruby=self.ignore_ruby,
            ruby_threshold=self.ruby_threshold,
        )

        _convert_words_to_paragraphs(
            ocr_res.words,
            paragraphs,
            word_assigned_list,
        )

        figures, paragraph_assigned_list = _extract_paragraph_within_figure(
            paragraphs,
            layout_res.figures,
            img,
        )

        _assign_caption(figures, layout_res.tables, paragraphs, paragraph_assigned_list)

        paragraphs = [
            paragraph
            for paragraph, flag in zip(paragraphs, paragraph_assigned_list)
            if not flag and paragraph.contents is not None
        ]

        page_direction = _judge_page_direction(paragraphs)
        paragraphs, figures, tables = self.sort_reading_order(
            paragraphs, figures, layout_res.tables, page_direction, img
        )

        font_size = _calc_median_font_size(ocr_res.words)
        _postprocessing_list_item(paragraphs, font_size)

        outputs = {
            "preprocess": preprocess_res,
            "paragraphs": paragraphs,
            "tables": tables,
            "figures": figures,
            "words": ocr_res.words,
        }

        return outputs

    def sort_reading_order(self, paragraphs, figures, tables, page_direction, img):
        headers = [
            paragraph
            for paragraph in paragraphs
            if paragraph.role == "page_header" and not self.ignore_meta
        ]

        footers = [
            paragraph
            for paragraph in paragraphs
            if paragraph.role == "page_footer" and not self.ignore_meta
        ]

        indies = [
            paragraph
            for paragraph in paragraphs
            if paragraph.role == "index" and not self.ignore_meta
        ]

        page_contents = [
            paragraph
            for paragraph in paragraphs
            if paragraph.role is None
            or paragraph.role
            in [
                "section_headings",
                "list_item",
                "caption",
                "display_formula",
                "inline_formula",
            ]
        ]

        elements = page_contents + tables + figures

        prediction_reading_order(headers, "left2right")
        prediction_reading_order(indies, "top2bottom")
        prediction_reading_order(footers, "left2right")

        if self.reading_order == "auto":
            reading_order = (
                "right2left" if page_direction == "vertical" else "top2bottom"
            )
        else:
            reading_order = self.reading_order

        prediction_reading_order(elements, reading_order, img)

        for index in indies:
            index.order += len(headers)

        for element in elements:
            element.order += len(headers) + len(indies)

        for footer in footers:
            footer.order += len(elements) + len(headers) + len(indies)

        paragraphs = headers + indies + page_contents + footers
        paragraphs = sorted(paragraphs, key=lambda x: x.order)
        figures = sorted(figures, key=lambda x: x.order)
        tables = sorted(tables, key=lambda x: x.order)

        return paragraphs, figures, tables

    async def run_tasks(self, img):
        results_preprocess = None
        if self.enable_preprocess:
            results_preprocess, corrected_img = self.preprocessor(img)
            img = corrected_img

        with ThreadPoolExecutor(max_workers=2) as executor:
            tasks = [
                asyncio.get_running_loop().run_in_executor(
                    executor, self.text_detector, img
                ),
                asyncio.get_running_loop().run_in_executor(executor, self.layout, img),
            ]

            results = await asyncio.gather(*tasks)

            results_det, det_score = results[0]
            results_layout, layout = results[1]

            if self.split_text_across_paragraphs:
                results_det = _split_text_across_paragraphs(results_det, results_layout)

            if self.split_text_across_cells:
                results_det = _split_text_across_cells(results_det, results_layout)

            results_rec = self.text_recognizer(
                img, results_det.points, results_det.scores
            )

            outputs = {"words": ocr_aggregate(results_rec)}
            results_ocr = OCRSchema(**outputs)

            ocr_vis = None
            if self.visualize:
                ocr_vis = ocr_visualizer(
                    results_ocr.words,
                    img,
                    font_path=self.text_recognizer._cfg.visualize.font,
                    det_score=det_score,
                    vis_heatmap=self.text_detector._cfg.visualize.heatmap,
                )

            outputs = self.aggregate(
                results_ocr,
                results_layout,
                results_preprocess,
                img,
            )

            results = DocumentAnalyzerSchema(**outputs)
            return results, ocr_vis, layout, img

    def __call__(
        self, img: np.ndarray
    ) -> tuple[DocumentAnalyzerSchema, np.ndarray | None, np.ndarray | None]:
        """
        Perform document analysis on the given image.

        This method processes the input image by running text detection, recognition,
        layout analysis, and other preprocessing tasks. It also supports visualization
        of the layout and reading order if the `visualize` attribute is enabled.

        Args:
            img (np.ndarray): The input image in BGR format

        Returns:
            tuple: A tuple containing:

                - results (DocumentAnalyzerSchema): The aggregated results of the document analysis,
                  including OCR, layout, and preprocessing outputs.

                - ocr (np.ndarray or None): The visualized OCR results if `visualize` is enabled, otherwise `None`.

                - layout (np.ndarray or None): The visualization of the layout and reading order if
                  `visualize` is enabled, otherwise `None`.
        """

        future = self.thread_loop.run_coroutine(self.run_tasks(img))
        try:
            results, ocr, _, processed_img = future.result()
        except torch.cuda.OutOfMemoryError as e:
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            logger.error("GPU out of memory in DocumentAnalyzer: %s", e)
            raise make_error(ErrorCode.GPU_OUT_OF_MEMORY) from e
        layout = None

        if self.visualize:
            layout = layout_visualizer_detail(results, processed_img)
            layout = reading_order_visualizer(layout, results)

        return results, ocr, layout
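
The ordering logic in `sort_reading_order` can be sketched in isolation. `"auto"` resolves to `"right2left"` for vertical pages and `"top2bottom"` otherwise, and each group's local order is then shifted so that headers come first, followed by index entries, body elements, and footers. The `Item` class and the two functions below are hypothetical stand-ins, and the sketch assumes each group already carries 0-based `order` values, which is an assumption about what `prediction_reading_order` assigns.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical stand-in for a paragraph, table, or figure."""
    name: str
    order: int  # assumed 0-based position within its own group


def resolve_reading_order(setting: str, page_direction: str) -> str:
    """Mirror the "auto" rule: vertical pages read right to left."""
    if setting == "auto":
        return "right2left" if page_direction == "vertical" else "top2bottom"
    return setting


def offset_orders(headers, indices, elements, footers):
    """Shift each group's local order into one page-wide sequence."""
    for item in indices:
        item.order += len(headers)
    for item in elements:
        item.order += len(headers) + len(indices)
    for item in footers:
        item.order += len(headers) + len(indices) + len(elements)
    return sorted(headers + indices + elements + footers, key=lambda x: x.order)


headers = [Item("header", 0)]
indices = [Item("index", 0)]
elements = [Item("body-2", 1), Item("body-1", 0)]
footers = [Item("footer", 0)]
ordered = [item.name for item in offset_orders(headers, indices, elements, footers)]
```

With these inputs the page-wide sequence is header, index, the two body items in their local order, then footer.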

`__call__(img)`

Perform document analysis on the given image.

This method runs optional preprocessing, text detection, text recognition, and layout analysis on the input image. If the `visualize` attribute is enabled, it also returns visualizations of the OCR results and of the layout and reading order.
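
Text detection and layout analysis do not depend on each other, so `run_tasks` dispatches both to a two-worker thread pool and awaits them together before recognition starts. A minimal sketch of that pattern follows, using hypothetical `detect_text` and `analyze_layout` functions in place of the real modules:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def detect_text(img):
    """Hypothetical stand-in for the text detector (blocking call)."""
    return f"boxes for {img}"


def analyze_layout(img):
    """Hypothetical stand-in for the layout analyzer (blocking call)."""
    return f"layout for {img}"


async def run_stages(img):
    """Run both blocking stages concurrently and wait for both results."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=2) as executor:
        tasks = [
            loop.run_in_executor(executor, detect_text, img),
            loop.run_in_executor(executor, analyze_layout, img),
        ]
        # gather preserves the order of `tasks`, not completion order.
        detection, layout = await asyncio.gather(*tasks)
    return detection, layout


detection, layout = asyncio.run(run_stages("page-1"))
```

Because `asyncio.gather` returns results in submission order, the caller can unpack detection and layout outputs by position regardless of which stage finishes first.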

Arguments:

Name	Type	Description	Default
`img`	`ndarray`	The input image in BGR format.	required

Returns:

Name	Type	Description
`tuple`	`tuple[DocumentAnalyzerSchema, ndarray | None, ndarray | None]`	A tuple `(results, ocr, layout)`. `results` (DocumentAnalyzerSchema): the aggregated results of the document analysis, including OCR, layout, and preprocessing outputs. `ocr` (np.ndarray or None): the visualized OCR results if `visualize` is enabled, otherwise `None`. `layout` (np.ndarray or None): the visualization of the layout and reading order if `visualize` is enabled, otherwise `None`.
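
A short usage sketch, assuming the package exposes `DocumentAnalyzer` at the top level (`from yomitoku import DocumentAnalyzer`) and that OpenCV is installed for image loading. The `analyze_file` helper is hypothetical. `cv2.imread` returns images in BGR channel order, which matches the expected input.

```python
def analyze_file(path, device="cpu", visualize=True):
    """Run the analyzer on one image file (hypothetical helper)."""
    # Imported lazily because both packages are heavy optional dependencies.
    import cv2
    from yomitoku import DocumentAnalyzer  # assumed import path

    img = cv2.imread(path)  # BGR ndarray, or None if the file cannot be read
    if img is None:
        raise FileNotFoundError(f"could not read image: {path}")

    analyzer = DocumentAnalyzer(device=device, visualize=visualize)
    results, ocr_vis, layout_vis = analyzer(img)

    # Both visualizations are None when visualize=False.
    return results, ocr_vis, layout_vis
```

The first element carries the structured output (for example `results.paragraphs`, `results.tables`, and `results.figures`), while the other two are images intended for inspection only.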

