
OCR

Bases: BaseJob

A class for performing Optical Character Recognition (OCR) on images.

This class integrates text detection and text recognition components to extract text from images. It supports customization through configurations and allows asynchronous processing of images.

Parameters:

Name Type Description Default
configs dict

A dictionary of configurations to override the default settings. The configs dictionary can include keys such as "text_detector" and "text_recognizer" to customize specific components. Defaults to an empty dictionary.

{}
device str

The device to use for computation, e.g., "cuda" or "cpu". Defaults to "cuda".

'cuda'
visualize bool

Whether to enable visualization during OCR processing. Defaults to False.

False
license_key str

The license key for using specific features or services. Defaults to None.

None
secret_key str

The secret key for authentication with external services. Defaults to None.

None
device_token str

The device token for authentication with external services. Defaults to None.

None
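
The way `configs` interacts with the other constructor arguments can be sketched in plain Python. The helper below, `build_component_kwargs`, is a hypothetical name and not part of the library. It only mirrors the merge logic in the constructor source listed further down: both components start from the same shared arguments, and any entry under "text_detector" or "text_recognizer" overrides them for that component alone.

```python
from typing import Any, Optional


def build_component_kwargs(
    configs: Optional[dict] = None,
    device: str = "cuda",
    license_key: Optional[str] = None,
    secret_key: Optional[str] = None,
    device_token: Optional[str] = None,
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Hypothetical helper mirroring how OCR.__init__ merges `configs`."""
    if configs is None:
        configs = {}
    if not isinstance(configs, dict):
        raise ValueError("configs must be a dict")

    shared = {
        "device": device,
        "license_key": license_key,
        "secret_key": secret_key,
        "device_token": device_token,
    }
    # Each component starts from a copy of the shared arguments...
    detector_kwargs = dict(shared)
    recognizer_kwargs = dict(shared)
    # ...and per-component entries in `configs` take precedence over them.
    detector_kwargs.update(configs.get("text_detector", {}))
    recognizer_kwargs.update(configs.get("text_recognizer", {}))
    return detector_kwargs, recognizer_kwargs


det, rec = build_component_kwargs(
    configs={"text_detector": {"device": "cpu"}},
    device="cuda",
)
print(det["device"])  # cpu
print(rec["device"])  # cuda
```

One consequence worth noting: a "device" entry inside a component's config wins over the top-level `device` argument, so the detector and the recognizer can be placed on different devices.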

Attributes:

Name Type Description
detector TextDetector

An instance of the text detection module used to detect text regions in images.

recognizer TextRecognizer

An instance of the text recognition module used to recognize text content from detected regions.

Source code in src/yomitoku/ocr.py
class OCR(BaseJob):
    """
    A class for performing Optical Character Recognition (OCR) on images.

    This class integrates text detection and text recognition components to extract text
    from images. It supports customization through configurations and allows asynchronous
    processing of images.

    Args:
        configs (dict, optional): A dictionary of configurations to override the default settings.
            The `configs` dictionary can include keys such as "text_detector" and "text_recognizer"
            to customize specific components. Defaults to an empty dictionary.
        device (str, optional): The device to use for computation, e.g., "cuda" or "cpu". Defaults to "cuda".
        visualize (bool, optional): Whether to enable visualization during OCR processing. Defaults to False.
        license_key (str, optional): The license key for using specific features or services. Defaults to None.
        secret_key (str, optional): The secret key for authentication with external services. Defaults to None.
        device_token (str, optional): The device token for authentication with external services. Defaults to None.

    Attributes:
        detector (TextDetector): An instance of the text detection module used to detect text regions in images.
        recognizer (TextRecognizer): An instance of the text recognition module used to recognize text content
            from detected regions.
    """

    def __init__(
        self,
        configs={},
        device="cuda",
        visualize=False,
        license_key=None,
        secret_key=None,
        device_token=None,
    ):
        super().__init__()
        text_detector_kwargs = {
            "device": device,
            "license_key": license_key,
            "secret_key": secret_key,
            "device_token": device_token,
        }
        text_recognizer_kwargs = {
            "device": device,
            "license_key": license_key,
            "secret_key": secret_key,
            "device_token": device_token,
        }

        if isinstance(configs, dict):
            if "text_detector" in configs:
                text_detector_kwargs.update(configs["text_detector"])
            if "text_recognizer" in configs:
                text_recognizer_kwargs.update(configs["text_recognizer"])
        else:
            raise ValueError(
                "configs must be a dict. See the https://kotaro-kinoshita.github.io/yomitoku-dev/usage/"
            )

        self.detector = TextDetector(**text_detector_kwargs)
        self.recognizer = TextRecognizer(**text_recognizer_kwargs)
        self.visualize = visualize

    async def run(self, img) -> tuple[dict[str, Any], np.ndarray]:
        """
        Perform OCR on the given image asynchronously.

        This method detects text regions in the image and recognizes the text content
        from those regions. It does not render a visualization itself; `__call__` does.

        Args:
            img (np.ndarray): The input image in BGR format.

        Returns:
            tuple: A tuple containing:

                - outputs (dict): A dictionary with the recognized words and their positions.

                - det_score (np.ndarray): The detection score output of the text detector, which `__call__` passes to the visualizer.
        """
        det_outputs, det_score = self.detector(img)
        rec_outputs = self.recognizer(
            img, det_outputs.points, det_scores=det_outputs.scores
        )
        outputs = {"words": ocr_aggregate(rec_outputs)}
        return outputs, det_score

    def __call__(self, img) -> tuple[OCRSchema, np.ndarray | None]:
        """
        Perform OCR on the given image.

        This method is a synchronous wrapper for the `run` method, allowing direct
        invocation of the OCR process.

        Args:
            img (np.ndarray): The input image in BGR format (as loaded by OpenCV).

        Returns:
            tuple: A tuple containing:

                - outputs (OCRSchema): An OCR output with the recognized words and their positions.

                - vis (np.ndarray or None): The visualization image (if visualization is enabled).
        """
        outputs, det_score = asyncio.run(self.run(img))
        results = OCRSchema(**outputs)

        ocr_vis = None
        if self.visualize:
            ocr_vis = ocr_visualizer(
                results.words,
                img,
                font_path=self.recognizer._cfg.visualize.font,
                det_score=det_score,
                vis_heatmap=self.detector._cfg.visualize.heatmap,
            )

        return results, ocr_vis

__call__(img)

Perform OCR on the given image.

This method is a synchronous wrapper for the run method, allowing direct invocation of the OCR process.

Parameters:

Name Type Description Default
img ndarray

The input image in BGR format (as loaded by OpenCV).

required

Returns:

Name Type Description
tuple tuple[OCRSchema, ndarray | None]

A tuple containing:

  • outputs (OCRSchema): An OCR output with the recognized words and their positions.

  • vis (np.ndarray or None): The visualization image (if visualization is enabled).
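
A minimal usage sketch for the synchronous call, under a few assumptions: the package and OpenCV are installed, the class is importable from `yomitoku.ocr` (inferred from the source path shown on this page), and the default `None` credentials are acceptable for your setup. `run_ocr` and the output file name `ocr_vis.jpg` are hypothetical names chosen for illustration.

```python
def run_ocr(image_path: str, device: str = "cpu"):
    """Run OCR on one image file and return (results, visualization)."""
    # Imported lazily so this sketch can be loaded without the packages installed.
    import cv2
    from yomitoku.ocr import OCR  # assumed import path, from src/yomitoku/ocr.py

    img = cv2.imread(image_path)  # OpenCV loads images in BGR channel order
    if img is None:
        # cv2.imread returns None instead of raising when the file is unreadable.
        raise FileNotFoundError(f"could not read image: {image_path}")

    ocr = OCR(device=device, visualize=True)
    results, ocr_vis = ocr(img)

    # ocr_vis is None unless visualize=True was passed to the constructor.
    if ocr_vis is not None:
        cv2.imwrite("ocr_vis.jpg", ocr_vis)  # hypothetical output path
    return results, ocr_vis


# Example call (the path is a placeholder):
# results, ocr_vis = run_ocr("sample.jpg")
```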

Source code in src/yomitoku/ocr.py
def __call__(self, img) -> tuple[OCRSchema, np.ndarray | None]:
    """
    Perform OCR on the given image.

    This method is a synchronous wrapper for the `run` method, allowing direct
    invocation of the OCR process.

    Args:
        img (np.ndarray): The input image in BGR format (as loaded by OpenCV).

    Returns:
        tuple: A tuple containing:

            - outputs (OCRSchema): An OCR output with the recognized words and their positions.

            - vis (np.ndarray or None): The visualization image (if visualization is enabled).
    """
    outputs, det_score = asyncio.run(self.run(img))
    results = OCRSchema(**outputs)

    ocr_vis = None
    if self.visualize:
        ocr_vis = ocr_visualizer(
            results.words,
            img,
            font_path=self.recognizer._cfg.visualize.font,
            det_score=det_score,
            vis_heatmap=self.detector._cfg.visualize.heatmap,
        )

    return results, ocr_vis

run(img) async

Perform OCR on the given image asynchronously.

This method detects text regions in the image and recognizes the text content from those regions. It does not render a visualization itself; __call__ does.

Parameters:

Name Type Description Default
img ndarray

The input image in BGR format.

required

Returns:

Name Type Description
tuple tuple[dict[str, Any], ndarray]

A tuple containing:

  • outputs (dict): A dictionary with the recognized words and their positions.

  • det_score (np.ndarray): The detection score output of the text detector, which __call__ passes to the visualizer.
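
Because `__call__` drives `run` through `asyncio.run`, it cannot be used from code that is already inside a running event loop (an async web handler or a notebook cell, for instance); in that situation `run` should be awaited directly. The sketch below illustrates the pattern with `StubOCR`, a hypothetical stand-in that only imitates the call shape of the class and performs no recognition.

```python
import asyncio


class StubOCR:
    """Stand-in with the same call shape as OCR; it does no real recognition."""

    async def run(self, img):
        # Mirrors the shape of the real return value: (outputs, det_score).
        return {"words": []}, None

    def __call__(self, img):
        # Same pattern as OCR.__call__: drive the coroutine with asyncio.run.
        outputs, det_score = asyncio.run(self.run(img))
        return outputs, det_score


async def process(ocr, img):
    # Inside a running event loop, await run() directly. Calling ocr(img)
    # here would raise RuntimeError, because asyncio.run() refuses to start
    # while another loop is already running in the same thread.
    outputs, det_score = await ocr.run(img)
    return outputs


outputs = asyncio.run(process(StubOCR(), img=None))
print(outputs)  # {'words': []}
```

Note that awaiting `run` yields the raw dictionary and the detection score, not an `OCRSchema` or a visualization; those are produced only by `__call__`.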

Source code in src/yomitoku/ocr.py
async def run(self, img) -> tuple[dict[str, Any], np.ndarray]:
    """
    Perform OCR on the given image asynchronously.

    This method detects text regions in the image and recognizes the text content
    from those regions. It does not render a visualization itself; `__call__` does.

    Args:
        img (np.ndarray): The input image in BGR format.

    Returns:
        tuple: A tuple containing:

            - outputs (dict): A dictionary with the recognized words and their positions.

            - det_score (np.ndarray): The detection score output of the text detector, which `__call__` passes to the visualizer.
    """
    det_outputs, det_score = self.detector(img)
    rec_outputs = self.recognizer(
        img, det_outputs.points, det_scores=det_outputs.scores
    )
    outputs = {"words": ocr_aggregate(rec_outputs)}
    return outputs, det_score