OCR

Bases: BaseJob

A class for performing Optical Character Recognition (OCR) on images.

This class integrates text detection and text recognition components to extract text from images. It supports customization through configurations and allows asynchronous processing of images.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| configs | dict | A dictionary of configurations to override the default settings. The configs dictionary can include keys such as "text_detector" and "text_recognizer" to customize specific components. Defaults to an empty dictionary. | {} |
| device | str | The device to use for computation, e.g., "cuda" or "cpu". Defaults to "cuda". | 'cuda' |
| visualize | bool | Whether to enable visualization during OCR processing. Defaults to False. | False |
| license_key | str | The license key for using specific features or services. Defaults to None. | None |
| secret_key | str | The secret key for authentication with external services. Defaults to None. | None |
| device_token | str | The device token for authentication with external services. Defaults to None. | None |
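
As a rough illustration of the override behavior described for configs, the sketch below mirrors how the constructor source assembles keyword arguments for the two components: shared defaults first, then any per-component entries on top. `build_component_kwargs` is a hypothetical helper written for this example and is not part of yomitoku. The only override key used is `device`, because that is a key the constructor is shown to pass.

```python
def build_component_kwargs(
    configs=None,
    device="cuda",
    license_key=None,
    secret_key=None,
    device_token=None,
):
    """Hypothetical helper: return (detector_kwargs, recognizer_kwargs)."""
    configs = {} if configs is None else configs
    if not isinstance(configs, dict):
        raise ValueError("configs must be a dict")

    # Defaults shared by both components.
    shared = {
        "device": device,
        "license_key": license_key,
        "secret_key": secret_key,
        "device_token": device_token,
    }
    detector_kwargs = dict(shared)
    recognizer_kwargs = dict(shared)

    # Per-component entries are applied last, so they win over the defaults.
    detector_kwargs.update(configs.get("text_detector", {}))
    recognizer_kwargs.update(configs.get("text_recognizer", {}))
    return detector_kwargs, recognizer_kwargs


det_kwargs, rec_kwargs = build_component_kwargs(
    configs={"text_detector": {"device": "cpu"}},
    device="cuda",
)
print(det_kwargs["device"])  # cpu
print(rec_kwargs["device"])  # cuda
```

Under these assumptions, a per-component entry takes precedence over the top-level argument, so the detector could be placed on the CPU while the recognizer stays on the GPU.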

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| detector | TextDetector | An instance of the text detection module used to detect text regions in images. |
| recognizer | TextRecognizer | An instance of the text recognition module used to recognize text content from detected regions. |

Source code in src/yomitoku/ocr.py
class OCR(BaseJob):
    """
    A class for performing Optical Character Recognition (OCR) on images.

    This class integrates text detection and text recognition components to extract text
    from images. It supports customization through configurations and allows asynchronous
    processing of images.

    Args:
        configs (dict, optional): A dictionary of configurations to override the default settings.
            The `configs` dictionary can include keys such as "text_detector" and "text_recognizer"
            to customize specific components. Defaults to an empty dictionary.
        device (str, optional): The device to use for computation, e.g., "cuda" or "cpu". Defaults to "cuda".
        visualize (bool, optional): Whether to enable visualization during OCR processing. Defaults to False.
        license_key (str, optional): The license key for using specific features or services. Defaults to None.
        secret_key (str, optional): The secret key for authentication with external services. Defaults to None.
        device_token (str, optional): The device token for authentication with external services. Defaults to None.

    Attributes:
        detector (TextDetector): An instance of the text detection module used to detect text regions in images.
        recognizer (TextRecognizer): An instance of the text recognition module used to recognize text content
            from detected regions.
    """

    def __init__(
        self,
        configs={},
        device="cuda",
        visualize=False,
        license_key=None,
        secret_key=None,
        device_token=None,
    ):
        super().__init__()
        text_detector_kwargs = {
            "device": device,
            "license_key": license_key,
            "secret_key": secret_key,
            "device_token": device_token,
        }
        text_recognizer_kwargs = {
            "device": device,
            "license_key": license_key,
            "secret_key": secret_key,
            "device_token": device_token,
        }

        if isinstance(configs, dict):
            if "text_detector" in configs:
                text_detector_kwargs.update(configs["text_detector"])
            if "text_recognizer" in configs:
                text_recognizer_kwargs.update(configs["text_recognizer"])
        else:
            raise ValueError(
                "configs must be a dict. See the https://kotaro-kinoshita.github.io/yomitoku-dev/usage/"
            )

        self.detector = TextDetector(**text_detector_kwargs)
        self.recognizer = TextRecognizer(**text_recognizer_kwargs)
        self.visualize = visualize

    async def run(self, img) -> tuple[dict[str, Any], np.ndarray]:
        """
        Perform OCR on the given image asynchronously.

        This method detects text regions in the image and recognizes the text content
        from those regions. It does not render a visualization itself; `__call__` does
        that using the detection score returned here.

        Args:
            img (np.ndarray): The input image in BGR format.

        Returns:
            tuple: A tuple containing:

                - outputs (dict): A dictionary with a single key, "words", holding the
                  recognized words and their positions.

                - det_score (np.ndarray): The detection score array returned by the text
                  detector, which `__call__` passes to `ocr_visualizer`.
        """
        det_outputs, det_score = self.detector(img)
        rec_outputs = self.recognizer(
            img, det_outputs.points, det_scores=det_outputs.scores
        )
        outputs = {"words": ocr_aggregate(rec_outputs)}
        return outputs, det_score

    def __call__(self, img) -> tuple[OCRSchema, np.ndarray | None]:
        """
        Perform OCR on the given image.

        This method is a synchronous wrapper for the `run` method, allowing direct
        invocation of the OCR process.

        Args:
            img (np.ndarray): The input image in BGR format (as loaded by OpenCV).

        Returns:
            tuple: A tuple containing:

                - outputs (OCRSchema): An OCR output with the recognized words and their positions.

                - vis (np.ndarray or None): The visualization image (if visualization is enabled).
        """
        outputs, det_score = asyncio.run(self.run(img))
        results = OCRSchema(**outputs)

        ocr_vis = None
        if self.visualize:
            ocr_vis = ocr_visualizer(
                results.words,
                img,
                font_path=self.recognizer._cfg.visualize.font,
                det_score=det_score,
                vis_heatmap=self.detector._cfg.visualize.heatmap,
            )

        return results, ocr_vis

__call__(img)

Perform OCR on the given image.

This method is a synchronous wrapper for the run method, allowing direct invocation of the OCR process.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| img | ndarray | The input image in BGR format (as loaded by OpenCV). | required |

Returns:

Name Type Description
tuple tuple[OCRSchema, ndarray | None]

A tuple containing:

  • outputs (OCRSchema): An OCR output with the recognized words and their positions.

  • vis (np.ndarray or None): The visualization image (if visualization is enabled).
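
Because the synchronous wrapper drives run through asyncio.run, the call form is meant for plain synchronous code. The sketch below uses `StubJob`, a hypothetical stand-in with the same sync/async shape as OCR but no models, to show the two entry points and why code that is already inside an event loop would await run directly.

```python
import asyncio


class StubJob:
    """Hypothetical stand-in for OCR: same call shape, no models loaded."""

    async def run(self, img):
        return {"words": []}, None

    def __call__(self, img):
        # Same pattern as the wrapper above: run the coroutine on a new event loop.
        return asyncio.run(self.run(img))


job = StubJob()

# From synchronous code, the call form works as expected.
sync_result = job("image")


async def main():
    # Inside a running event loop, await run() directly.
    awaited = await job.run("image")

    # asyncio.run() refuses to start when a loop is already running,
    # which is what the call form would attempt here.
    coro = job.run("image")
    try:
        asyncio.run(coro)
        nested_ok = True
    except RuntimeError:
        nested_ok = False
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
    return awaited, nested_ok


async_result, nested_ok = asyncio.run(main())
print(nested_ok)  # False
```

Note that awaiting run skips the steps the wrapper adds afterwards, namely building the OCRSchema and rendering the optional visualization.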

Source code in src/yomitoku/ocr.py
def __call__(self, img) -> tuple[OCRSchema, np.ndarray | None]:
    """
    Perform OCR on the given image.

    This method is a synchronous wrapper for the `run` method, allowing direct
    invocation of the OCR process.

    Args:
        img (np.ndarray): The input image in BGR format (as loaded by OpenCV).

    Returns:
        tuple: A tuple containing:

            - outputs (OCRSchema): An OCR output with the recognized words and their positions.

            - vis (np.ndarray or None): The visualization image (if visualization is enabled).
    """
    outputs, det_score = asyncio.run(self.run(img))
    results = OCRSchema(**outputs)

    ocr_vis = None
    if self.visualize:
        ocr_vis = ocr_visualizer(
            results.words,
            img,
            font_path=self.recognizer._cfg.visualize.font,
            det_score=det_score,
            vis_heatmap=self.detector._cfg.visualize.heatmap,
        )

    return results, ocr_vis

run(img) async

Perform OCR on the given image asynchronously.

This method detects text regions in the image and recognizes the text content from those regions. It does not render a visualization itself; __call__ does that using the detection score returned here.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| img | ndarray | The input image in BGR format. | required |

Returns:

Name Type Description
tuple tuple[dict[str, Any], ndarray]

A tuple containing:

  • outputs (dict): A dictionary with a single key, "words", holding the recognized words and their positions.

  • det_score (np.ndarray): The detection score array returned by the text detector, which __call__ passes to ocr_visualizer.
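
The body of run chains two stages: detection produces regions plus a score output, and recognition consumes the image together with those regions. The following sketch reproduces that data flow with stand-in functions. `detect`, `recognize`, and `aggregate` are hypothetical, and the per-word field names (`points`, `content`, `det_score`) are assumptions made for illustration, not the library's confirmed schema.

```python
import asyncio
from types import SimpleNamespace


def detect(img):
    """Stand-in detector: returns (regions, score_output)."""
    regions = SimpleNamespace(
        points=[[(0, 0), (10, 0), (10, 5), (0, 5)]],  # one quadrilateral
        scores=[0.9],                                 # one score per region
    )
    score_output = [[0.0, 0.9], [0.1, 0.8]]  # toy 2x2 score grid
    return regions, score_output


def recognize(img, points, det_scores):
    """Stand-in recognizer: one text string per detected region."""
    return SimpleNamespace(points=points, det_scores=det_scores, contents=["hello"])


def aggregate(rec):
    """Stand-in for ocr_aggregate: flatten results into one dict per word."""
    return [
        {"points": p, "content": c, "det_score": s}
        for p, c, s in zip(rec.points, rec.contents, rec.det_scores)
    ]


async def run(img):
    # Stage 1: find text regions. Stage 2: read the text inside them.
    det_outputs, det_score = detect(img)
    rec_outputs = recognize(img, det_outputs.points, det_scores=det_outputs.scores)
    return {"words": aggregate(rec_outputs)}, det_score


outputs, det_score = asyncio.run(run("image"))
print(outputs["words"][0]["content"])  # hello
```

The second element of the returned tuple is the detector's score output rather than a rendered image, which is why a caller that wants a visualization goes through the synchronous wrapper.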

Source code in src/yomitoku/ocr.py
async def run(self, img) -> tuple[dict[str, Any], np.ndarray]:
    """
    Perform OCR on the given image asynchronously.

    This method detects text regions in the image and recognizes the text content
    from those regions. It does not render a visualization itself; `__call__` does
    that using the detection score returned here.

    Args:
        img (np.ndarray): The input image in BGR format.

    Returns:
        tuple: A tuple containing:

            - outputs (dict): A dictionary with a single key, "words", holding the
              recognized words and their positions.

            - det_score (np.ndarray): The detection score array returned by the text
              detector, which `__call__` passes to `ocr_visualizer`.
    """
    det_outputs, det_score = self.detector(img)
    rec_outputs = self.recognizer(
        img, det_outputs.points, det_scores=det_outputs.scores
    )
    outputs = {"words": ocr_aggregate(rec_outputs)}
    return outputs, det_score