SYSTEMS AND METHODS FOR SEPARATING LIGATURE CHARACTERS IN DIGITIZED DOCUMENT IMAGES
Embodiments disclosed herein provide for systems and methods of separating characters associated with ligatures in digitized documents. The systems and methods provide for a ligature detection engine configured to identify the ligatures, and a ligature processing engine configured to identify and remove the glyphs attaching the separate characters forming the ligature.
The present application relates to systems and methods for separating characters associated with ligatures in digitized document images.
BACKGROUND
Ligatures generally refer to characters consisting of two or more joined letters or graphemes. Although ligatures can be utilized as design choices, they are often the result of image processing errors in digitized documents. For example, if a digitized document image is of low quality, it may be difficult for the optical character recognition (OCR) engine processing the image to distinguish a ligature from the characters of which it is composed. As such, the OCR engine can fail to convert low-quality images containing ligatures into accurate textual representations, e.g., American Standard Code for Information Interchange (ASCII) text. Currently, inefficient techniques such as connected component analysis, segmentation, and ligature recognition are utilized to identify and separate the characters associated with ligatures.
Accordingly, there is a need to efficiently identify and separate the characters associated with ligatures.
The following description of embodiments provides non-limiting representative examples referencing numerals to particularly describe features and teachings of different aspects of the invention. The embodiments described should be recognized as capable of implementation separately, or in combination, with other embodiments from the description of the embodiments. A person of ordinary skill in the art reviewing the description of embodiments should be able to learn and understand the different described aspects of the invention. The description of embodiments should facilitate understanding of the invention to such an extent that other implementations, not specifically covered but within the knowledge of a person of skill in the art having read the description of embodiments, would be understood to be consistent with an application of the invention.
One aspect of the present disclosure is to provide a system and method for separating characters associated with ligatures. The systems and methods herein address at least one of the problems discussed above.
According to an embodiment, a system for separating characters associated with ligatures in digitized document images includes: (i) a ligature detection engine, wherein the ligature detection engine is configured to: receive at least one digitized document image including a plurality of characters; determine which of the plurality of characters in the at least one digitized document image are associated with ligatures; and generate a contour around each of the ligatures, wherein the contour includes a pixelated version of the ligature, wherein pixels associated with glyphs of the ligature are darkened; and (ii) a ligature processing engine, wherein the ligature processing engine is configured to: scan each column of the contour; determine, for each scanned column including at least one darkened pixel, a height of a respective glyph associated with the at least one darkened pixel; identify a pinch point for the ligature based on a comparison between a plurality of adjacent scanned columns including at least one darkened pixel; remove the glyph associated with the pinch point; and separate the characters associated with the ligature based on the removed glyph.
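By way of non-limiting illustration, the column scan and height determination described above can be sketched as follows. The sketch assumes the contour is available as a list of rows of 0/1 values in which 1 marks a darkened pixel, and it takes the height of a column to be the contour height minus the distances found by one scan from the top and one scan from the bottom. The name column_heights and the Grid alias are hypothetical and are used only for this example.

```python
from typing import List, Optional

# Assumed encoding: rows of 0/1 values, where 1 marks a darkened pixel.
Grid = List[List[int]]


def column_heights(contour: Grid) -> List[Optional[int]]:
    """Return the glyph height per column, or None for a column with no darkened pixel."""
    if not contour:
        return []
    n_rows, n_cols = len(contour), len(contour[0])
    heights: List[Optional[int]] = []
    for col in range(n_cols):
        # First scan: walk down from the top until a darkened pixel is found.
        top = 0
        while top < n_rows and contour[top][col] == 0:
            top += 1
        if top == n_rows:
            # No darkened pixel in this column, so no height is recorded.
            heights.append(None)
            continue
        # Second scan: walk up from the bottom until a darkened pixel is found.
        bottom = 0
        while contour[n_rows - 1 - bottom][col] == 0:
            bottom += 1
        heights.append(n_rows - top - bottom)
    return heights


# Two tall strokes joined by a thin bridge in the middle column.
example = [
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
]
print(column_heights(example))  # [4, 2, 1, 2, 4]
```

In this example the heights fall toward the middle column and rise again, which is the kind of profile the comparison between adjacent columns is intended to detect.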
According to an embodiment, a method for separating characters associated with ligatures in digitized document images includes: receiving, with a processor, at least one digitized document image including a plurality of characters; determining, with the processor, which of the plurality of characters in the at least one digitized document image are associated with ligatures; generating, with the processor, a contour around each of the ligatures, wherein the contour includes a pixelated version of the ligature, wherein pixels associated with glyphs of the ligature are darkened; scanning, with the processor, each column of the contour; determining, with the processor, for each scanned column including at least one darkened pixel, a height of a respective glyph associated with the at least one darkened pixel; identifying, with the processor, a pinch point for the ligature based on a comparison between a plurality of adjacent scanned columns including at least one darkened pixel; removing, with the processor, the glyph associated with the pinch point; and separating, with the processor, the characters associated with the ligature based on the removed glyph.
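As a non-limiting example, the comparison between adjacent scanned columns can be sketched as a search for a column whose height is lower than that of both of its neighbors. The sketch assumes the per-column heights have already been computed and are supplied as a list in which None marks a column with no darkened pixel. The function name find_pinch_points is hypothetical, and the strict-minimum test is one possible reading of the comparison, not the only one.

```python
from typing import List, Optional


def find_pinch_points(heights: List[Optional[int]]) -> List[int]:
    """Return indices of columns whose height is strictly below both neighbors."""
    candidates: List[int] = []
    for i in range(1, len(heights) - 1):
        prev, cur, nxt = heights[i - 1], heights[i], heights[i + 1]
        # Only columns containing at least one darkened pixel are compared.
        if prev is None or cur is None or nxt is None:
            continue
        # Decrease from the first column to the second, then increase to the third.
        if prev > cur and cur < nxt:
            candidates.append(i)
    return candidates


print(find_pinch_points([4, 2, 1, 2, 4]))  # [2]
```

A strict comparison of this kind does not report a flat run of equally low columns, and it may report dips that lie inside a single character. An implementation could therefore filter the candidates further, for example by having an OCR engine verify the separated characters.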
According to an embodiment, a system for separating characters associated with ligatures in digitized document images includes: a processor, wherein the processor is configured to: receive a contour including a ligature, wherein the ligature is pixelated, wherein pixels associated with the ligature are darkened; scan each column of the contour; determine, for each scanned column including at least one darkened pixel, a height of a respective glyph associated with the at least one darkened pixel; identify a pinch point for the ligature based on a comparison between a plurality of adjacent scanned columns including at least one darkened pixel; remove the glyph associated with the pinch point; and separate the characters associated with the ligature based on the removed glyph.
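For illustration only, the removal and separation steps can be sketched as below. The sketch assumes the contour is a list of rows of 0/1 values in which 1 marks a darkened pixel, and that a pinch column index has already been identified. Removal of the connecting glyph is modeled here simply by dropping the pinch column, which is an assumption of this example. The name split_at_pinch is hypothetical.

```python
from typing import List, Tuple

# Assumed encoding: rows of 0/1 values, where 1 marks a darkened pixel.
Grid = List[List[int]]


def split_at_pinch(contour: Grid, pinch_col: int) -> Tuple[Grid, Grid]:
    """Drop the pinch column and cut the contour vertically into left and right parts."""
    if not contour or not 0 <= pinch_col < len(contour[0]):
        raise ValueError("pinch_col must index a column of a non-empty contour")
    # Columns before the pinch column form the first separated contour.
    left = [row[:pinch_col] for row in contour]
    # Columns after the pinch column form the second separated contour.
    right = [row[pinch_col + 1:] for row in contour]
    return left, right


joined = [
    [1, 0, 0, 0, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
]
left, right = split_at_pinch(joined, 2)
print(left)   # [[1, 0], [1, 1], [1, 1], [1, 0]]
print(right)  # [[0, 1], [1, 1], [1, 1], [0, 1]]
```

Each returned contour can then be handled as a separate character image.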
The various techniques described herein may be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. Implementations may take the form of a computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, data processing apparatus, e.g., a programmable processor, a computer, or multiple computers. A computer program, such as the computer program(s) described above, can be written in any form of programming language, including compiled or interpreted languages, and can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.
Method steps may be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Method steps also may be performed by, and an apparatus may be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit).
In the foregoing Description of Embodiments, various features may be grouped together in a single embodiment for purposes of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claims require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the following claims are hereby incorporated into this Description of Embodiments, with each claim standing on its own as a separate embodiment of the invention.
Moreover, it will be apparent to those skilled in the art from consideration of the specification and practice of the present disclosure that various modifications and variations can be made to the disclosed systems without departing from the scope of the disclosure, as claimed. Thus, it is intended that the specification and examples be considered as exemplary only, with a true scope of the present disclosure being indicated by the following claims and their equivalents.
Claims
1. A system for separating characters associated with ligatures in digitized document images, the system comprising:
- a ligature detection engine, wherein the ligature detection engine is configured to: receive at least one digitized document image including a plurality of characters; determine which of the plurality of characters in the at least one digitized document image are associated with ligatures; and generate a contour around each of the ligatures, wherein the contour includes a pixelated version of the ligature, wherein pixels associated with glyphs of the ligature are darkened; and
- a ligature processing engine, wherein the ligature processing engine is configured to: scan each column of the contour; determine, for each scanned column including at least one darkened pixel, a height of a respective glyph associated with the at least one darkened pixel by scanning from top and bottom of the column of the contour, wherein the scanning determines an imaginary vertical line so as to separate one or more characters from the plurality of characters, the imaginary vertical line based on a transition from a first portion of a first slope of pixels to a second portion of a second slope of pixels; identify a pinch point for the ligature based on the imaginary vertical line and a comparison between a plurality of adjacent scanned columns including at least one darkened pixel; remove the glyph associated with the pinch point; and separate the one or more characters associated with the ligature based on the removed glyph and the imaginary vertical line.
2. The system of claim 1, further comprising:
- an optical character recognition (OCR) engine, wherein the OCR engine is configured to (i) receive the separated characters and (ii) verify the accuracy of the separated characters.
3. The system of claim 1, wherein the columns are scanned from at least one selected from the group of: (i) from left to right and (ii) right to left.
4. The system of claim 1, wherein the contour includes (i) a height based, at least in part, on a height of the ligature and (ii) a width of the ligature.
5. The system of claim 1, wherein the height of the respective glyph is determined based on (i) a first distance from the top of the contour to a topmost darkened pixel in the column and (ii) a second distance from the bottom of the contour to a bottommost darkened pixel in the column.
6. The system of claim 5, wherein (i) the first distance is determined based on a first scan from the top of the contour to the topmost darkened pixel in the column and (ii) the second distance is determined based on a second scan from the bottom of the contour to the bottommost darkened pixel in the column.
7. The system of claim 1, wherein the pinch point is identified upon determining (i) a decrease in height of the respective glyph from a first column to a second column and (ii) an increase in height of the respective glyph from a second column to a third column.
8. The system of claim 7, wherein the ligature processing engine is further configured to segment the contour vertically at the pinch point and separate the contour into separate contours.
9. The system of claim 1, wherein the ligature processing engine is further configured to store a height of a glyph in a previously scanned column upon determining a change in height of the glyph from the previously scanned column to another glyph in a currently scanned column.
10. The system of claim 1, wherein the ligature detection engine is further configured to convert the at least one digitized document image into monochrome.
11. A method for separating characters associated with ligatures in digitized document images, the method comprising:
- receiving, with a processor, at least one digitized document image including a plurality of characters;
- determining, with the processor, which of the plurality of characters in the at least one digitized document image are associated with ligatures;
- generating, with the processor, a contour around each of the ligatures, wherein the contour includes a pixelated version of the ligature, wherein pixels associated with glyphs of the ligature are darkened; and
- scanning, with the processor, each column of the contour;
- determining, with the processor, for each scanned column including at least one darkened pixel, a height of a respective glyph associated with the at least one darkened pixel by scanning from top and bottom of the column of the contour, wherein the scanning determines an imaginary vertical line so as to separate one or more characters from the plurality of characters, the imaginary vertical line based on a transition from a first portion of a first slope of pixels to a second portion of a second slope of pixels;
- identifying, with the processor, a pinch point for the ligature based on the imaginary vertical line and a comparison between a plurality of adjacent scanned columns including at least one darkened pixel;
- removing, with the processor, the glyph associated with the pinch point; and
- separating, with the processor, the one or more characters associated with the ligature based on the removed glyph and the imaginary vertical line.
12. The method of claim 11, wherein the columns are scanned from at least one selected from the group of: (i) from left to right and (ii) right to left.
13. The method of claim 11, wherein the contour includes (i) a height of the ligature and (ii) a width of the ligature.
14. The method of claim 11, wherein the height of the respective glyph is determined based on (i) a first distance from the top of the contour to a topmost darkened pixel in the column and (ii) a second distance from the bottom of the contour to a bottommost darkened pixel in the column.
15. The method of claim 14, wherein (i) the first distance is determined based on a first scan from the top of the contour to the topmost darkened pixel in the column and (ii) the second distance is determined based on a second scan from the bottom of the contour to the bottommost darkened pixel in the column.
16. The method of claim 11, wherein the pinch point is identified upon determining (i) a decrease in height of the respective glyph from a first column to a second column and (ii) an increase in height of the respective glyph from a second column to a third column.
17. The method of claim 16, further comprising:
- segmenting, with the processor, the contour vertically at the pinch point and separating the contour into separate contours.
18. The method of claim 11, further comprising:
- storing, with the processor, a height of a glyph associated with a previously scanned column in a memory database upon determining a change in height of the glyph from the previously scanned column to another glyph in a currently scanned column.
19. A system for separating characters associated with ligatures in digitized document images, the system comprising:
- a processor, wherein the processor is configured to: receive a contour including a ligature, wherein the ligature is pixelated, wherein pixels associated with the ligature are darkened; scan each column of the contour; determine, for each scanned column including at least one darkened pixel, a height of a respective glyph associated with the at least one darkened pixel by scanning from top and bottom of the column of the contour, wherein the scanning determines an imaginary vertical line so as to separate one or more characters from the plurality of characters, the imaginary vertical line based on a transition from a first portion of a first slope of pixels to a second portion of a second slope of pixels; identify a pinch point for the ligature based on the imaginary vertical line and a comparison between a plurality of adjacent scanned columns including at least one darkened pixel; remove the glyph associated with the pinch point; and separate the one or more characters associated with the ligature based on the removed glyph and the imaginary vertical line.
20. The system of claim 19, wherein the processor is further configured to verify the accuracy of the separated characters.
Type: Application
Filed: Mar 19, 2019
Publication Date: Sep 24, 2020
Patent Grant number: 10878271
Inventor: Douglas SLATTERY (McKinney, TX)
Application Number: 16/357,878