Patents Assigned to PDFlib GmbH
  • Patent number: 7705848
    Abstract: A method of identifying semantic units in an electronic document includes the steps of: providing an electronic document being described in a page description language, the document having at least one page having a plurality of text fragments, each text fragment including a plurality of glyphs that have not been identified as semantic units, the document further including geometric information and page description language parameters; determining strips of at least one glyph by comparing the geometric position of subsequent glyphs; determining zones of at least one strip, wherein a zone is defined by the combined area of strips, the geometrical areas of which overlap with each other; determining a boundary between two semantic units in a zone based on the geometric properties of the glyphs; sorting the identified semantic units in the zone into a sorted list; and combining subsequent semantic units in the sorted list according to geometric considerations.
    Type: Grant
    Filed: April 18, 2006
    Date of Patent: April 27, 2010
    Assignee: PDFlib GmbH
    Inventor: Serge Bronstein
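
The zone-building step in the abstract of patent 7705848 (strips whose areas overlap are combined into one zone) can be illustrated with a minimal sketch. It assumes strips are axis-aligned rectangles and treats a zone as a group of transitively overlapping strips; the names `Rect`, `overlaps`, `build_zones`, and `zone_bbox` are hypothetical and do not come from the patent or from any PDFlib product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle (an assumption; the abstract only says 'area')."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True if the two rectangles share a region of positive area."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def build_zones(strips: list[Rect]) -> list[list[Rect]]:
    """Group strips into zones: strips land in the same zone when they
    overlap directly or through a chain of overlapping strips."""
    parent = list(range(len(strips)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(strips)):
        for j in range(i + 1, len(strips)):
            if overlaps(strips[i], strips[j]):
                parent[find(i)] = find(j)

    groups: dict[int, list[Rect]] = {}
    for i, strip in enumerate(strips):
        groups.setdefault(find(i), []).append(strip)
    return list(groups.values())


def zone_bbox(zone: list[Rect]) -> Rect:
    """Bounding rectangle of a zone, used here as a stand-in for its combined area."""
    return Rect(
        min(r.x0 for r in zone),
        min(r.y0 for r in zone),
        max(r.x1 for r in zone),
        max(r.y1 for r in zone),
    )


strips = [
    Rect(0, 0, 10, 2),
    Rect(5, 1, 15, 3),
    Rect(20, 0, 30, 2),
    Rect(14, 2.5, 16, 5),
]
zones = build_zones(strips)
```

The pairwise comparison is quadratic in the number of strips, which is acceptable for a sketch at page scale; the abstract does not say how the overlap search is organized.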
  • Patent number: 7643682
    Abstract: A method of identifying redundant text fragments, which only create artifacts, in an electronic page description language document includes: a) providing a page having a plurality of text fragments, each text fragment comprising at least one glyph, the document including Unicode values for all glyphs, geometric information of all text fragments on the page, and page description language parameters of all glyphs; b) identifying two text fragments as redundant candidates if they have identical Unicode sequences; c) defining a bounding box of quadrangular shape for each of the two redundant candidates according to their font characteristics; d) calculating the overlapping area of the two bounding boxes; and e) determining whether the two candidates form redundant text fragments by comparing the ratio of the overlapping area to the area of the smaller of the two bounding boxes with a predetermined threshold.
    Type: Grant
    Filed: April 18, 2006
    Date of Patent: January 5, 2010
    Assignee: PDFlib GmbH
    Inventor: Serge Bronstein
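
Steps b) to e) of the abstract of patent 7643682 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: bounding boxes are simplified to axis-aligned rectangles (the abstract allows any quadrangle), the threshold of 0.8 and the direction of the comparison (`>=`) are placeholders, and the names `Fragment`, `overlap_area`, and `is_redundant` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A text fragment with its Unicode text and an axis-aligned bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def area(f: Fragment) -> float:
    return max(0.0, f.x1 - f.x0) * max(0.0, f.y1 - f.y0)


def overlap_area(a: Fragment, b: Fragment) -> float:
    """Area of the intersection of the two bounding boxes (0 if disjoint)."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def is_redundant(a: Fragment, b: Fragment, threshold: float = 0.8) -> bool:
    """Step b: same Unicode sequence; steps d and e: overlap ratio against
    the smaller box, compared with an assumed threshold."""
    if a.text != b.text:
        return False
    smaller = min(area(a), area(b))
    if smaller <= 0:
        return False
    return overlap_area(a, b) / smaller >= threshold


# A heading drawn twice with a small offset, as in simulated bold or shadow text.
title = Fragment("Title", 0, 0, 100, 12)
shadow = Fragment("Title", 0.5, -0.5, 100.5, 11.5)
elsewhere = Fragment("Title", 0, 50, 100, 62)
```

Dividing by the smaller box rather than the union means a short fragment that lies entirely inside a longer one still yields a ratio of 1, which matches the wording of step e).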
  • Patent number: 7636885
    Abstract: A method of determining Unicode values corresponding to the text in digital documents includes: providing a digital document containing information related to the text in the document, the information including at least one set of data selected from the group consisting of: the numerical character code comprised by a single byte value or a sequence of multiple bytes, the glyph name corresponding to the character code for simple fonts, the code-to-Unicode mapping provided by a ToUnicode CMap, and font outline data embedded in the document; obtaining the information related to the text from the document; and determining the Unicode values corresponding to a specific code of a specific font on a per-glyph basis by executing a cascade of determination steps for each code separately, the cascade being executed in a predetermined sequence using different sources of information.
    Type: Grant
    Filed: June 6, 2006
    Date of Patent: December 22, 2009
    Assignee: PDFlib GmbH
    Inventors: Thomas Merz, Kurt Stützer
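
The per-glyph cascade described in the abstract of patent 7636885 can be sketched as an ordered list of lookup steps, where the first step that yields a value wins. The abstract does not state the order of the cascade, so the order below (ToUnicode CMap first, glyph name second) is an assumption, as are the dictionary-based font representation and the names `from_to_unicode`, `from_glyph_name`, and `resolve`. The glyph-name table is a three-entry excerpt in the style of the Adobe Glyph List, and the `uniXXXX` pattern follows the Adobe glyph naming convention; font outline analysis is omitted.

```python
from typing import Callable, Optional

# Tiny excerpt of a glyph-name table (illustrative, not complete).
GLYPH_NAMES = {"A": "A", "Euro": "\u20ac", "fi": "\ufb01"}

Font = dict  # hypothetical layout: {"to_unicode": {code: str}, "glyph_names": {code: str}}


def from_to_unicode(font: Font, code: int) -> Optional[str]:
    """Look the code up in the font's code-to-Unicode mapping, if present."""
    return font.get("to_unicode", {}).get(code)


def from_glyph_name(font: Font, code: int) -> Optional[str]:
    """Derive a Unicode value from the glyph name assigned to the code."""
    name = font.get("glyph_names", {}).get(code)
    if name is None:
        return None
    if name in GLYPH_NAMES:
        return GLYPH_NAMES[name]
    # Names of the form uniXXXX carry the code point as four uppercase hex digits.
    digits = name[3:]
    if name.startswith("uni") and len(digits) == 4 and all(
        c in "0123456789ABCDEF" for c in digits
    ):
        return chr(int(digits, 16))
    return None


# Assumed order of the cascade; each code is resolved independently.
CASCADE: list[Callable[[Font, int], Optional[str]]] = [
    from_to_unicode,
    from_glyph_name,
]


def resolve(font: Font, code: int) -> Optional[str]:
    """Run the cascade for one code and return the first result, or None."""
    for step in CASCADE:
        value = step(font, code)
        if value is not None:
            return value
    return None


font = {
    "to_unicode": {0x41: "A"},
    "glyph_names": {0x41: "Euro", 0x80: "Euro", 0x81: "uni00E9", 0x82: "g123"},
}
```

Because each code runs through the cascade separately, one font can have some glyphs resolved by its mapping table and others by glyph name, which is the per-glyph behavior the abstract emphasizes.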