
Document Analyzer

Bases: BaseJob

A class for analyzing documents, including text detection, recognition, and layout analysis.

This class provides functionality to process and analyze documents by detecting text regions, recognizing text content, and analyzing the layout structure. It supports various configurations for customization, including device selection, preprocessing options, and reading order preferences.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| configs | dict | A dictionary of configurations to override the default settings. Defaults to an empty dictionary. | {} |
| device | str | The device to use for computation, e.g., "cuda" or "cpu". Defaults to "cuda". | 'cuda' |
| visualize | bool | Whether to enable visualization during processing. Defaults to False. | False |
| ignore_meta | bool | Whether to ignore metadata in the document. Defaults to False. | False |
| reading_order | str | The reading order for text extraction. Options include "auto", "left2right", "top2bottom", etc. Defaults to "auto". | 'auto' |
| split_text_across_cells | bool | Whether to split text across table cells. Defaults to False. | False |
| enable_preprocess | bool | Whether to enable preprocessing steps such as rotation detection. Defaults to False. | False |
| license_key | str | The license key for using specific features or services. Defaults to None. | None |
| secret_key | str | The secret key for authentication with external services. Defaults to None. | None |
| device_token | str | The device token for authentication with external services. Defaults to None. | None |
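
The `configs` argument overrides only the keys it names; every other setting keeps the default derived from `device`, `visualize`, and the credential arguments. The sketch below illustrates that merge behavior. `deep_update` is a hypothetical stand-in for the library's `recursive_update`, whose implementation is not shown on this page, and the nested keys are a trimmed copy of the `default_configs` layout in the constructor source.

```python
from copy import deepcopy


def deep_update(base: dict, overrides: dict) -> dict:
    """Merge `overrides` into `base` in place, descending into nested dicts.

    Hypothetical stand-in for yomitoku's `recursive_update`; the real
    helper may differ in detail.
    """
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


# Defaults shaped like the ones DocumentAnalyzer builds (trimmed for brevity).
defaults = {
    "ocr": {
        "text_detector": {"device": "cuda"},
        "text_recognizer": {"device": "cuda"},
    },
    "layout_analyzer": {
        "layout_parser": {"device": "cuda", "visualize": False},
    },
}

# Override a single nested key; sibling keys are left untouched.
user_configs = {"ocr": {"text_detector": {"device": "cpu"}}}

merged = deep_update(deepcopy(defaults), user_configs)
print(merged["ocr"]["text_detector"]["device"])    # cpu
print(merged["ocr"]["text_recognizer"]["device"])  # cuda
```

Under this assumption, a partial dictionary is enough to change one module's settings without restating the rest.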

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| text_detector | TextDetector | Instance of the text detection module, initialized based on the configurations. |
| text_recognizer | TextRecognizer | Instance of the text recognition module, initialized based on the configurations. |
| layout | LayoutAnalyzer | Instance of the layout analysis module, initialized based on the configurations. |

Source code in src/yomitoku/document_analyzer.py
class DocumentAnalyzer(BaseJob):
    """
    A class for analyzing documents, including text detection, recognition, and layout analysis.

    This class provides functionality to process and analyze documents by detecting text regions,
    recognizing text content, and analyzing the layout structure. It supports various configurations
    for customization, including device selection, preprocessing options, and reading order preferences.

    Args:
        configs (dict, optional): A dictionary of configurations to override the default settings. Defaults to an empty dictionary.
        device (str, optional): The device to use for computation, e.g., "cuda" or "cpu". Defaults to "cuda".
        visualize (bool, optional): Whether to enable visualization during processing. Defaults to False.
        ignore_meta (bool, optional): Whether to ignore metadata in the document. Defaults to False.
        reading_order (str, optional): The reading order for text extraction. Options include "auto", "left2right", "top2bottom", etc. Defaults to "auto".
        split_text_across_cells (bool, optional): Whether to split text across table cells. Defaults to False.
        enable_preprocess (bool, optional): Whether to enable preprocessing steps such as rotation detection. Defaults to False.
        license_key (str, optional): The license key for using specific features or services. Defaults to None.
        secret_key (str, optional): The secret key for authentication with external services. Defaults to None.
        device_token (str, optional): The device token for authentication with external services. Defaults to None.

    Attributes:
        text_detector (TextDetector): Instance of the text detection module, initialized based on the configurations.
        text_recognizer (TextRecognizer): Instance of the text recognition module, initialized based on the configurations.
        layout (LayoutAnalyzer): Instance of the layout analysis module, initialized based on the configurations.
    """

    def __init__(
        self,
        configs={},
        device="cuda",
        visualize=False,
        ignore_meta=False,
        reading_order="auto",
        split_text_across_cells=False,
        split_text_across_paragraphs=False,
        enable_preprocess=False,
        license_key=None,
        secret_key=None,
        device_token=None,
        ignore_ruby=False,
        ruby_threshold=2.0,
    ):
        super().__init__(use_thread_loop=True)
        self.split_text_across_cells = split_text_across_cells
        self.split_text_across_paragraphs = split_text_across_paragraphs
        self.enable_preprocess = enable_preprocess
        self.reading_order = reading_order
        self.ignore_ruby = ignore_ruby
        self.ruby_threshold = ruby_threshold

        default_configs = {
            "preprocess": {
                "rotate_detector": {
                    "device": device,
                    "license_key": license_key,
                    "secret_key": secret_key,
                    "device_token": device_token,
                },
            },
            "ocr": {
                "text_detector": {
                    "device": device,
                    "license_key": license_key,
                    "secret_key": secret_key,
                    "device_token": device_token,
                },
                "text_recognizer": {
                    "device": device,
                    "license_key": license_key,
                    "secret_key": secret_key,
                    "device_token": device_token,
                },
            },
            "layout_analyzer": {
                "layout_parser": {
                    "device": device,
                    "visualize": visualize,
                    "license_key": license_key,
                    "secret_key": secret_key,
                    "device_token": device_token,
                },
                "table_structure_recognizer": {
                    "device": device,
                    "visualize": visualize,
                    "license_key": license_key,
                    "secret_key": secret_key,
                    "device_token": device_token,
                },
            },
        }

        if isinstance(configs, dict):
            recursive_update(default_configs, configs)
        else:
            self.close()
            raise ValueError(
                "configs must be a dict. See https://kotaro-kinoshita.github.io/yomitoku-dev/usage/"
            )

        try:
            self.text_detector = TextDetector(
                **default_configs["ocr"]["text_detector"],
            )
            self.text_recognizer = TextRecognizer(
                **default_configs["ocr"]["text_recognizer"]
            )

            self.layout = LayoutAnalyzer(
                configs=default_configs["layout_analyzer"],
            )

            if self.enable_preprocess:
                self.preprocessor = Preprocessor(
                    configs=default_configs["preprocess"],
                )
            self.visualize = visualize
            self.ignore_meta = ignore_meta
        except Exception:
            self.close()
            raise

    def aggregate(self, ocr_res, layout_res, preprocess_res, img):
        word_assigned_list = [False] * len(ocr_res.words)

        word_assigned_list = _assign_words_to_table_cells(
            layout_res.tables,
            ocr_res.words,
            word_assigned_list,
            ignore_ruby=self.ignore_ruby,
            ruby_threshold=self.ruby_threshold,
        )

        paragraphs, word_assigned_list = _assign_words_to_paragraph(
            layout_res.paragraphs,
            ocr_res.words,
            layout_res.figures,
            word_assigned_list,
            ignore_ruby=self.ignore_ruby,
            ruby_threshold=self.ruby_threshold,
        )

        _convert_words_to_paragraphs(
            ocr_res.words,
            paragraphs,
            word_assigned_list,
        )

        figures, paragraph_assigned_list = _extract_paragraph_within_figure(
            paragraphs,
            layout_res.figures,
            img,
        )

        _assign_caption(figures, layout_res.tables, paragraphs, paragraph_assigned_list)

        paragraphs = [
            paragraph
            for paragraph, flag in zip(paragraphs, paragraph_assigned_list)
            if not flag and paragraph.contents is not None
        ]

        page_direction = _judge_page_direction(paragraphs)
        paragraphs, figures, tables = self.sort_reading_order(
            paragraphs, figures, layout_res.tables, page_direction, img
        )

        font_size = _calc_median_font_size(ocr_res.words)
        _postprocessing_list_item(paragraphs, font_size)

        outputs = {
            "preprocess": preprocess_res,
            "paragraphs": paragraphs,
            "tables": tables,
            "figures": figures,
            "words": ocr_res.words,
        }

        return outputs

    def sort_reading_order(self, paragraphs, figures, tables, page_direction, img):
        headers = [
            paragraph
            for paragraph in paragraphs
            if paragraph.role == "page_header" and not self.ignore_meta
        ]

        footers = [
            paragraph
            for paragraph in paragraphs
            if paragraph.role == "page_footer" and not self.ignore_meta
        ]

        indies = [
            paragraph
            for paragraph in paragraphs
            if paragraph.role == "index" and not self.ignore_meta
        ]

        page_contents = [
            paragraph
            for paragraph in paragraphs
            if paragraph.role is None
            or paragraph.role
            in [
                "section_headings",
                "list_item",
                "caption",
                "display_formula",
                "inline_formula",
            ]
        ]

        elements = page_contents + tables + figures

        prediction_reading_order(headers, "left2right")
        prediction_reading_order(indies, "top2bottom")
        prediction_reading_order(footers, "left2right")

        if self.reading_order == "auto":
            reading_order = (
                "right2left" if page_direction == "vertical" else "top2bottom"
            )
        else:
            reading_order = self.reading_order

        prediction_reading_order(elements, reading_order, img)

        for index in indies:
            index.order += len(headers)

        for element in elements:
            element.order += len(headers) + len(indies)

        for footer in footers:
            footer.order += len(elements) + len(headers) + len(indies)

        paragraphs = headers + indies + page_contents + footers
        paragraphs = sorted(paragraphs, key=lambda x: x.order)
        figures = sorted(figures, key=lambda x: x.order)
        tables = sorted(tables, key=lambda x: x.order)

        return paragraphs, figures, tables

    async def run_tasks(self, img):
        results_preprocess = None
        if self.enable_preprocess:
            results_preprocess, corrected_img = self.preprocessor(img)
            img = corrected_img

        with ThreadPoolExecutor(max_workers=2) as executor:
            tasks = [
                asyncio.get_running_loop().run_in_executor(
                    executor, self.text_detector, img
                ),
                asyncio.get_running_loop().run_in_executor(executor, self.layout, img),
            ]

            results = await asyncio.gather(*tasks)

            results_det, det_score = results[0]
            results_layout, layout = results[1]

            if self.split_text_across_paragraphs:
                results_det = _split_text_across_paragraphs(results_det, results_layout)

            if self.split_text_across_cells:
                results_det = _split_text_across_cells(results_det, results_layout)

            results_rec = self.text_recognizer(
                img, results_det.points, results_det.scores
            )

            outputs = {"words": ocr_aggregate(results_rec)}
            results_ocr = OCRSchema(**outputs)

            ocr_vis = None
            if self.visualize:
                ocr_vis = ocr_visualizer(
                    results_ocr.words,
                    img,
                    font_path=self.text_recognizer._cfg.visualize.font,
                    det_score=det_score,
                    vis_heatmap=self.text_detector._cfg.visualize.heatmap,
                )

            outputs = self.aggregate(
                results_ocr,
                results_layout,
                results_preprocess,
                img,
            )

            results = DocumentAnalyzerSchema(**outputs)
            return results, ocr_vis, layout, img

    def __call__(
        self, img: np.ndarray
    ) -> tuple[DocumentAnalyzerSchema, np.ndarray | None, np.ndarray | None]:
        """
        Perform document analysis on the given image.

        This method processes the input image by running text detection, recognition,
        layout analysis, and other preprocessing tasks. It also supports visualization
        of the layout and reading order if the `visualize` attribute is enabled.

        Args:
            img (np.ndarray): The input image in BGR format

        Returns:
            tuple: A tuple containing:

                - results (DocumentAnalyzerSchema): The aggregated results of the document analysis,
                  including OCR, layout, and preprocessing outputs.

                - ocr (np.ndarray or None): The visualized OCR results if `visualize` is enabled, otherwise `None`.

                - layout (np.ndarray or None): The visualization of the layout and reading order if
                  `visualize` is enabled, otherwise `None`.
        """

        future = self.thread_loop.run_coroutine(self.run_tasks(img))
        try:
            results, ocr, _, processed_img = future.result()
        except torch.cuda.OutOfMemoryError as e:
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            logger.error("GPU out of memory in DocumentAnalyzer: %s", e)
            raise make_error(ErrorCode.GPU_OUT_OF_MEMORY) from e
        layout = None

        if self.visualize:
            layout = layout_visualizer_detail(results, processed_img)
            layout = reading_order_visualizer(layout, results)

        return results, ocr, layout
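
The ordering bookkeeping in `sort_reading_order` can be hard to follow from the source alone. The sketch below reproduces only that part, under simplifying assumptions: `Element` is a hypothetical stand-in for the library's paragraph, table, and figure objects, and `assign_local_order` is a hypothetical placeholder for `prediction_reading_order` that numbers each group from 0 in list order. It shows how "auto" resolves to a concrete direction and how each group's local order is offset so that headers come first, then index entries, then body elements, then footers.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a paragraph/table/figure with an order slot."""
    name: str
    order: int = 0


def resolve_reading_order(requested: str, page_direction: str) -> str:
    """Mirror the 'auto' rule from the source: vertical pages read right to left."""
    if requested == "auto":
        return "right2left" if page_direction == "vertical" else "top2bottom"
    return requested


def assign_local_order(group: list[Element]) -> None:
    """Placeholder for prediction_reading_order: number items 0..n-1 in list order."""
    for i, element in enumerate(group):
        element.order = i


def merge_orders(headers, indices, body, footers):
    """Offset each group's local order so the groups follow one another."""
    for group in (headers, indices, body, footers):
        assign_local_order(group)
    for element in indices:
        element.order += len(headers)
    for element in body:
        element.order += len(headers) + len(indices)
    for element in footers:
        element.order += len(headers) + len(indices) + len(body)
    return sorted(headers + indices + body + footers, key=lambda e: e.order)


ordered = merge_orders(
    headers=[Element("header")],
    indices=[Element("index")],
    body=[Element("para-1"), Element("para-2")],
    footers=[Element("footer")],
)
print([(e.name, e.order) for e in ordered])
print(resolve_reading_order("auto", "vertical"))
```

In the real method the body group also contains tables and figures, which is why their `order` values share one numbering with the paragraphs.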

__call__(img)

Perform document analysis on the given image.

This method processes the input image by running text detection, recognition, layout analysis, and other preprocessing tasks. It also supports visualization of the layout and reading order if the visualize attribute is enabled.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| img | ndarray | The input image in BGR format. | required |

Returns:

tuple[DocumentAnalyzerSchema, ndarray | None, ndarray | None]

A tuple containing:

  • results (DocumentAnalyzerSchema): The aggregated results of the document analysis, including OCR, layout, and preprocessing outputs.

  • ocr (np.ndarray or None): The visualized OCR results if visualize is enabled, otherwise None.

  • layout (np.ndarray or None): The visualization of the layout and reading order if visualize is enabled, otherwise None.
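
Behind this call, `run_tasks` runs text detection and layout analysis concurrently on a two-worker thread pool and then runs recognition on the detected regions. The sketch below shows that pattern in isolation. `detect_text`, `analyze_layout`, and `recognize` are hypothetical placeholders rather than yomitoku functions, and `asyncio.run` stands in for the library's thread loop.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def detect_text(img):
    """Hypothetical detector: returns fake text-region boxes."""
    return [(0, 0, 10, 10), (0, 20, 10, 30)]


def analyze_layout(img):
    """Hypothetical layout model: returns fake layout counts."""
    return {"paragraphs": 2, "tables": 0}


def recognize(img, boxes):
    """Hypothetical recognizer: one string per detected box."""
    return [f"text-{i}" for i, _ in enumerate(boxes)]


async def run_tasks(img):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=2) as executor:
        # Detection and layout analysis are independent, so run them in parallel.
        boxes, layout = await asyncio.gather(
            loop.run_in_executor(executor, detect_text, img),
            loop.run_in_executor(executor, analyze_layout, img),
        )
    # Recognition depends on the detected boxes, so it runs afterwards.
    words = recognize(img, boxes)
    return words, layout


words, layout = asyncio.run(run_tasks(img=None))
print(words, layout)
```

On the caller's side, the second and third items of the returned tuple are `None` unless the analyzer was created with `visualize=True`, so check them before writing them to disk.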
