Spreadsheet Flat Data Extractor
Systems and methods extract flat data units from a non-flat input, such as a spreadsheet comprising tables organized according to a hierarchy. First, the non-flat input is read (e.g., using pandas in combination with openpyxl) to create a flat dataframe comprising the content of the non-flat input. Next, individual flat data units (e.g., spreadsheet tables) are recognized and split from the dataframe based upon the appearance of blank rows and/or columns. Headers present in the flat data units are determined (e.g., based upon alphabetic cell text, bolded cell text, and/or early position of the cell in a column), and then connections between the flat data units are identified. Based upon the connections, individual flat data units are merged together. The resulting merged flat data units are subsequently available for consumption, for example user reports of content, and/or conversion to a new non-flat format (e.g., relational database schema).
Unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
A table comprising rows and columns is a common way to organize and present related data. A table may represent a single, discrete flat data unit comprising such rows and columns.
Alternatively, a table may be a part of an overall entity such as a spreadsheet. There, one table may be intrinsically related to other flat units according to a larger schema.
SUMMARYEmbodiments relate to systems and methods that extract flat data units from a non-flat input, such as a spreadsheet comprising tables organized according to a hierarchy. First, the non-flat input is read (e.g., using pandas in combination with openpyxl) to create a flat dataframe comprising the content of the non-flat input. Next, individual flat data units (e.g., spreadsheet tables) are recognized and split from the dataframe based upon the appearance of blank rows and/or columns. Headers present in the flat data units are determined (e.g., based upon alphabetic cell text, bolded cell text, and/or early position of the cell in a column), and then connections between the flat data units are identified. Based upon the connections, individual flat data units are merged together. The resulting merged flat data units are subsequently available for consumption, for example user reports of content, and/or conversion to a new non-flat format (e.g., relational database schema).
The following detailed description and accompanying drawings provide a better understanding of the nature and advantages of various embodiments.
Described herein are methods and apparatuses that implement flat data extraction. In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of embodiments according to the present invention. It will be evident, however, to one skilled in the art that embodiments as defined by the claims may include some or all of the features in these examples alone or in combination with other features described below, and may further include modifications and equivalents of the features and concepts described herein.
This incoming data is in the form of a spreadsheet that is organized according to a hierarchical structure 106. That is, the data is not simply organized as a flat table in rows and columns. Such non-flat input may be a spreadsheet, or may be emails, PDFs, text documents, or other formats.
According to a first processing phase 108, the extraction engine reads the non-flat data in order to produce a dataframe 112. This dataframe includes the data from the spreadsheet, but in a flat, non-hierarchical format.
As is described in detail below, this reading of the non-flat spreadsheet to produce the flat dataframe, may efficiently be performed by resorting to certain specialized functionality. In specific embodiments, this can include the use of pandas in combination with openpyxl.
In a next processing phase 114, individual data units 116 are identified from the flat dataframe. These data units may comprise discrete tables that are present within the spreadsheet.
In particular, such splitting of the dataframe into individual flat data units may be accomplished by recognizing empty rows 118 and/or columns 120 within the datasheet. These empty rows/columns are understood to delineate individual spreadsheet tables from one another.
In a next phase of processing, the extraction engine seeks to determine 122 column headers 124 within the individual flat data units. Such headers may be explicitly apparent from the two-dimensional structure of the data unit.
Alternatively, column headers may need to be inferred from characteristics of the nature of a table cell. Examples of such characteristics can include but are not limited to alphabetic content, bold font, and/or position as an early row within a column.
Then, based upon recognized headers as well as other available information, during a next processing phase 126 the extraction engine seeks to identify connections 128 between the various flat data units that have been split from the dataframe.
These connections can be identified on the basis of the headers. Moreover, connections between flat data units can also be identified on the basis of other factors, for example table metadata and physical proximity between flat data units within the dataframe.
Having identified connections between the flat data units, in a next processing phase 130 the data extraction engine then operates to merge those related flat data units into a fewer number of merged flat data units 132. This merger may be based upon merge criteria 134, for example calculation of a matching percentage between the flat data units.
The merged flat data units are then stored in a database 136, for future consumption 138 by a user 140. Such consumption can involve the generation of report(s) 142 regarding the merged flat data units and their respective content and/or metadata.
The merged flat data units can also be consumed in a process of converting 144 to a hierarchical structure different from the original spreadsheet. Such a hierarchical structure may be a particular (relational) database schema.
At 204, the dataframe is split into a plurality of flat data units. This splitting may be based upon the presence of empty rows and/or columns.
At 206, headers of the flat data units are determined. At 208, connections between the flat data units are determined (based, e.g., upon headers).
At 210, two flat data units are merged together to form a merged flat data unit. This merging may be determined based upon merge criteria that considers a percentage matching.
Systems and methods according to embodiments, may avoid one or more issues. In particular, manually extracting flat data units and establishing connections between them, can be difficult and lead to errors. Embodiments avoid the time, effort, and expense of manually extracting flat data units such as tables from non-flat input.
Further details regarding extraction of flat data units according to various embodiments, are now provided in connection with the following example.
ExampleThe system of this exemplary embodiment relates to processing of an invoice within the EXCEL spreadsheet available from Microsoft Corporation of Redmond, Wash. In such an EXCEL spreadsheet there are multiple flat data units present.
Finding, relating, and merging of those flat data units in order to extract a connected table can pose a challenge, for several possible reasons. One reason is that reading the spreadsheet may be time consuming. Another mason is that a spreadsheet can have any number of tables, and identifying the valid tables from the spreadsheet is required.
In order to address these challenges, embodiments utilize python libraries in order to read and write the spreadsheet. In particular, this exemplary embodiment reads the spreadsheet using both Python Data Analysis Libraries (pandas) and openpyxl.
Processing is taken care of with the help of pandas. The fastest way to read the spreadsheet is with pandas as it works on C bottleneck libraries. Pandas read data in binary format, and covert data into Numpy array.
The Read_excel method of pandas is used to read the spreadsheet and convert it to the dataframe. This approach intelligently reads the spreadsheet until the maximum row and maximum column of the EXCEL spreadsheet.
Formatting is taken care with the openpyxl python library. In particular, after processing with pandas, coordinates of dataframes or flat data in spreadsheet are stored for each flat data unit. The code snippet of
A separate challenge to extracting flat data table units from a spreadsheet, arises from identifying how many flat data units are present in the spreadsheet. In particular, the spreadsheet file can contain unstructured data/flat data as well, in which case identifying the tables and the number of columns may not be straightforward.
In order to address this challenge, embodiments provide a procedure for finding a flat data unit. Specifically, after reading the spreadsheet the output will be the main dataframe—the entire flat data present in the EXCEL spreadsheet.
Then, this main dataframe that is read, is put in the queue data structure. This is shown in
In the intersection of the three separated connected units, there will be either a column empty or a row empty.
Embodiments thus find the empty columns in the main dataframe, and stores their location in one list. This is shown in
Next, the procedure splits the connected units on the basis of the location of empty columns. The procedure creates a dataframe on the range of empty columns and main dataframe row range. This is shown in
Now, embodiments again put the split dataframes location into the queue. As the split dataframes in the queue are split by column, there is no need to check the empty column. Rather, the procedure can just check the empty row in every dataframe one by one.
The procedure continues to store split dataframes location to the queue on the basis of separation. This is run until the queue is empty.
After the queue is empty, the output will be the locations of all the connected units. It is noted that there may not be a need to store connected units in the memory.
Rather, locations of connected units can be accessed using the panda “.iloc” function. This function is shown in the particular code snippet of
As one example, consider the spreadsheet content shown in
Other challenges to extracting flat data table units from a spreadsheet, arise from the need to identifying table header fields irrespective of language and format. For example, table headers may be in any language, may not be bold, and headers need not be in the first row but rather may lie anywhere in first few rows.
In order to address these challenges, embodiments iterate the first few rows of the table based on the table length. For each row, it is checked:
if the row has 75% of values,
if values are alphabets,
if the values are bold.
Rows matching any of these criteria are noted. Then, the row is selected as a line item row of the table based on highest criteria matched row
Returning to the specific example of
Still other challenges to extracting flat data table units from a spreadsheet, arise from identifying related tables within a spreadsheet. In particular, tables with no headers may be the continuation of a previous table. It may be required to identify the possible combinations of a table without headers that can be merged with a table having headers.
To address these challenges, embodiments perform specific steps to find related tables. Then, the related table without the header can be merged to the table with the header.
As a main condition in this exemplary embodiment, the table without the header should be below the table with header. And, the column range of the table without a header should be within the column range of table with a header.
The list has tables that have no header. These could be merged with the table that has header.
Returning again to the particular example of
Yet other challenges to extracting flat data table units from a spreadsheet, relate to the existence of metadata. Specifically, each column of a table may have associated metadata.
That metadata can be used to determine, for example:
the data type,
are the column values unique?
are the column values a series?
minimum and maximum data length of the column values.
Moreover, merging tables without headers to tables having headers, may be based on a calculation performed using the meta data of each column and overall table confidence calculation.
In order to address these issues, embodiments may perform processing according to a series of modules (Module1-Module4 in this example) in order to merge the table without the header to the table with the header.
In a Module 1, metadata of the table is obtained. In this module, the detailed information of the table is found.
For example, for each column of the table the following are computed:
Data Type;Data length (minimum and maximum data length);
list of Unique Values:
are values of column series.
Note: Series are not checked if the column values are unique and vice-versa.
A next Module2 iterates the list of lists obtained from the related table finder. The first table of the sub list is always the table with line item headers.
The meta data for each column of the first table is calculated and stored as: main_table_metadata. The meta data for each column of the rest of the tables in the sub list is calculated.
Then, a Module3 compares the metadata. In particular, the meta data of each columns of the sub tables is compared with the metadata of the corresponding columns of main_table_metadata.
How many columns of the sub table match with the main table metadata, is calculated. If the column match percent is more than 75%, then merge the sub table with main table; else reject the sub table.
This step is repeated for all sub tables of the sub list. In this manner, the Module3 returns the list of tables with all tables having a header.
Returning again to the example of
Then, the table 2 and the table 3 of
Returning now to
Rather, alternative embodiments could leverage the processing power of an in-memory database engine (e.g., the in-memory database engine of the HANA in-memory database available from SAP SE), in order to perform various functions.
Thus
An example computer system 2000 is illustrated in
Computer system 2010 may be coupled via bus 2005 to a display 2012, such as a Light Emitting Diode (LED) or liquid crystal display (LCD), for displaying information to a computer user. An input device 2011 such as a keyboard and/or mouse is coupled to bus 2005 for communicating information and command selections from the user to processor 2001. The combination of these components allows the user to communicate with the system. In some systems, bus 2005 may be divided into multiple specialized buses.
Computer system 2010 also includes a network interface 2004 coupled with bus 2005. Network interface 2004 may provide two-way data communication between computer system 2010 and the local network 2020. The network interface 2004 may be a digital subscriber line (DSL) or a modem to provide data communication connection over a telephone line, for example. Another example of the network interface is a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links are another example. In any such implementation, network interface 2004 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.
Computer system 2010 can send and receive information, including messages or other interface actions, through the network interface 2004 across a local network 2020, an Intranet, or the Internet 2030. For a local network, computer system 2010 may communicate with a plurality of other computer machines, such as server 2015. Accordingly, computer system 2010 and server computer systems represented by server 2015 may form a cloud computing network, which may be programmed with processes described herein. In the Internet example, software components or services may reside on multiple different computer systems 2010 or servers 2031-2035 across the network. The processes described above may be implemented on one or more servers, for example. A server 2031 may transmit actions or messages from one component, through Internet 2030, local network 2020, and network interface 2004 to a component on computer system 2010. The software components and processes described above may be implemented on any computer system and send and/or receive information across a network, for example.
Example 1. Computer implemented system and methods comprising:
receiving input data in a non-flat format;
reading the input data to produce a flat dataframe;
splitting the dataframe into a first flat data unit and a second flat data unit based upon a separation in the flat dataframe;
determining a header of the first flat data unit;
identifying a connection between the first flat data unit and the second flat data unit;
merging the first flat data unit and the second flat data unit to produce a merged flat data unit; and
storing the merged flat data unit in a database of a non-transitory computer readable storage medium.
Example 2. The computer implemented system and method of Example 1 wherein the splitting is based upon at least one of an empty row and an empty column as the separation.
Example 3. The computer implemented systems and methods of Examples 1 or 2 wherein:
the non-flat format comprises a spreadsheet;
the first flat data unit comprises a first table of the spreadsheet; and
the second flat data unit comprises a second table of the spreadsheet.
Example 4. The computer implemented systems and methods of Examples 1, 2, or 3 wherein determining the header references at least one of:
alphabetic content;
a bold font; and
an early position of a cell within a column of the first flat data unit.
Example 5. The computer implemented systems and methods of Examples 1, 2, 3, or 4 wherein the merging is based upon a percentage match between the first flat data unit and the second flat data unit.
Example 6. The computer implemented systems and methods of Example 5 wherein the percentage match is based upon at least one of: data type, data length, unique values, and series values.
Example 7. The computer implemented systems and methods of Examples 1, 2, 3, 4, 5, or 6 wherein the merging references a first location of the first flat data unit in the dataframe, and a second location of the second flat data unit in the dataframe.
Example 8. The computer implemented systems and methods of Examples 1, 2, 3, 4, 5, 6, or 7 wherein:
Example 9. The computer implemented systems and methods of Examples 1, 2, 3, 4, 5, 6, 7, or 8 wherein the reading utilizes a combination of a Python Data Analysis Library (panda) and openpyxl.
Example 10. The computer implemented systems and methods of Examples 7 or 9 wherein the first location and the second location are obtained using an .iloc function of panda.
The above description illustrates various embodiments of the present invention along with examples of how aspects of the present invention may be implemented. The above examples and embodiments should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the present invention as defined by the following claims. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations and equivalents will be evident to those skilled in the art and may be employed without departing from the spirit and scope of the invention as defined by the claims.
Claims
1. A method comprising:
- receiving input data in a non-flat format;
- reading the input data to produce a flat dataframe;
- splitting the dataframe into a first flat data unit and a second flat data unit based upon a separation in the flat dataframe;
- determining a header of the first flat data unit by referencing at least one of alphabetic content, and a font;
- identifying a connection between the first flat data unit and the second flat data unit;
- merging the first flat data unit and the second flat data unit to produce a merged flat data unit; and
- storing the merged flat data unit in a database of a non-transitory computer readable storage medium,
- wherein,
- the merging is based upon a percentage match between the first flat data unit and the second flat data unit, and
- the percentage match is based upon series values.
2. A method as in claim 1 wherein the reading utilizes a combination of a Python Data Analysis Library (panda) and openpyxl.
3. A method as in claim 1 wherein the splitting is based upon at least one of an empty row and an empty column as the separation.
4. A method as in claim 1 wherein:
- the non-flat format comprises a spreadsheet;
- the first flat data unit comprises a first table of the spreadsheet; and
- the second flat data unit comprises a second table of the spreadsheet.
5. A method as in claim 1 wherein determining the header references a bold font.
6. A method as in claim 1 wherein the merging references a first location of the first flat data unit in the dataframe, and a second location of the second flat data unit in the dataframe.
7. A method as in claim 6 wherein:
- the reading utilizes a Python Data Analysis Library (panda); and
- the first location and the second location are obtained using an.iloc function of panda.
8. (canceled)
9. A method as in claim 1 wherein the percentage match is based upon data length.
10. A method as in claim 1 wherein the database comprises an in-memory database, and at least one of the reading, splitting, determining, identifying, and merging are performed by an in-memory database engine of the in-memory database.
11. A non-transitory computer readable storage medium embodying a computer program for performing a method, said method comprising:
- receiving a spreadsheet;
- reading the spreadsheet to produce a flat dataframe;
- splitting the dataframe into a first flat data unit comprising a first spreadsheet table and a second flat data unit comprising a second spreadsheet table, based upon a separation in the flat dataframe;
- determining a header of the first spreadsheet table by referencing at least one of alphabetic content of a cell in the first spreadsheet table, and a font of a cell in the first spreadsheet table;
- identifying a connection between the first spreadsheet table and the second spreadsheet table;
- merging the first spreadsheet table and the second spreadsheet table to produce a first merged spreadsheet table; and
- storing the merged spreadsheet table in a database of a non-transitory computer readable storage medium,
- wherein,
- the merging is based upon a percentage match between the first flat data unit and the second flat data unit, and
- the percentage match is based upon series values.
12. A non-transitory computer readable storage medium as in claim 11 wherein:
- the reading utilizes a combination of a Python Data Analysis Library (panda) and openpyxl; and
- the merging references a first location of the first spreadsheet table in the dataframe, and a second location of the second spreadsheet table in the dataframe, the first location and the second location obtained using an.iloc function of panda.
13. A non-transitory computer readable storage medium as in claim 11 wherein the splitting is based upon at least one of an empty row and an empty column as the separation.
14. A non-transitory computer readable storage medium as in claim 11 wherein determining the header references a bold font of a cell in the first spreadsheet table.
15. A non-transitory computer readable storage medium as in claim 11 wherein the merging is based upon the percentage match between the first spreadsheet table and the second spreadsheet table of data length.
16. A computer system comprising:
- one or more processors;
- a software program, executable on said computer system, the software program configured to cause an in-memory database engine of an in-memory database to:
- receive input data in a non-flat format;
- read the input data to produce a flat dataframe;
- split the dataframe into a first flat data unit and a second flat data unit based upon a separation in the flat dataframe;
- determine a header of the first flat data unit by referencing at least one of alphabetic content of a cell in the first spreadsheet table, and a font of a cell in the first spreadsheet table;
- identify a connection between the first flat data unit and the second flat data unit;
- merge the first flat data unit and the second flat data unit to produce a merged flat data unit; and
- store the merged flat data unit in the in-memory database,
- wherein,
- the merge is based upon a percentage match between the first flat data unit and the second flat data unit, and
- the percentage match is based upon series values.
17. A computer system as in claim 16 wherein the in-memory database engine is configured to:
- read the input data utilizing a combination of a Python Data Analysis Library (panda) and openpyxl; and
- perform the merge referencing a first location of the first spreadsheet table in the dataframe, and a second location of the second spreadsheet table in the dataframe, the first location and the second location obtained using an.iloc function of panda.
18. A computer system as in claim 16 wherein the in-memory database engine is configured to perform the split based upon at least one of an empty row and an empty column as the separation.
19. A computer system as in claim 16 wherein the in-memory database engine is configured to determine the header by referencing a bold font of a cell in the first spreadsheet table.
20. A computer system as in claim 16 wherein the in-memory database engine is configured to perform the merge based upon the percentage match between the first spreadsheet table and the second spreadsheet table of data length.
Type: Application
Filed: Jul 14, 2021
Publication Date: Jan 19, 2023
Inventors: Kanupriya Goyal (Bengaluru), Swagath (Bengaluru)
Application Number: 17/375,823