Our research consists of the following parts.
PDF table extraction
Many non-editable documents are shared in PDF (Portable Document Format). Their pages typically carry no tags annotating layout structure. One of the challenging tasks is extracting tables from such documents. We develop a tool for automatic PDF table detection and cell structure recognition that combines artificial neural networks with ad-hoc heuristics. Preliminary results on the test datasets show that the performance of this approach is comparable with state-of-the-art solutions.
Available on GitHub: https://github.com/tabbydoc/tabbypdf2
A web-based tool for semi-automatic, heuristic-based PDF table extraction is accessible at http://cells.icc.ru/pdfte
Rule-based spreadsheet data extraction and transformation
Spreadsheets are widely used in science, engineering, business, and other activities. Overall, they hold a large volume of data in an unstructured form. We develop a novel software platform for liberating the data from this form. The platform provides rule-based spreadsheet data extraction and transformation to a structured form. Its core consists of a flexible table object model and a domain-specific rule language for table analysis. Together they represent knowledge about table layout and content features, as well as their interpretation, which depends on the transformation goals. This enables processing arbitrary tables originating from various domains. Our empirical results demonstrate that the platform can be used to develop programs for rule-based conversion of data from arbitrary spreadsheet tables.
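The idea of rules operating on a table object model can be illustrated with a minimal Python sketch. It is not the platform's actual object model or rule language: the names `Cell`, `Rule`, `apply_rules`, and `extract_records`, the two roles "label" and "entry", and the sample grid are all hypothetical and chosen only for illustration. The sketch assumes the simplest possible layout, where the first row holds column labels and every other row holds data entries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Cell:
    """One cell of a toy table model (hypothetical, for illustration only)."""
    row: int
    col: int
    text: str
    role: Optional[str] = None  # "label" or "entry", assigned by rules


@dataclass
class Rule:
    """A rule pairs a condition on a cell with an action applied to it."""
    condition: Callable[[Cell], bool]
    action: Callable[[Cell], None]


def set_role(role: str) -> Callable[[Cell], None]:
    """Build an action that marks a cell with the given role."""
    def action(cell: Cell) -> None:
        cell.role = role
    return action


def apply_rules(cells: List[Cell], rules: List[Rule]) -> None:
    """Apply every rule to every cell whose condition holds."""
    for rule in rules:
        for cell in cells:
            if rule.condition(cell):
                rule.action(cell)


# Assumed layout knowledge: first row = labels, remaining rows = entries.
SIMPLE_LAYOUT_RULES = [
    Rule(lambda c: c.row == 0, set_role("label")),
    Rule(lambda c: c.row > 0, set_role("entry")),
]


def extract_records(grid: List[List[str]]) -> List[Dict[str, str]]:
    """Turn a grid of strings into records, one per data row."""
    cells = [Cell(r, c, text)
             for r, row in enumerate(grid)
             for c, text in enumerate(row)]
    apply_rules(cells, SIMPLE_LAYOUT_RULES)

    # Map each column index to its label text.
    labels = {c.col: c.text for c in cells if c.role == "label"}
    records: Dict[int, Dict[str, str]] = {}
    for cell in cells:
        if cell.role == "entry":
            key = labels.get(cell.col, f"col{cell.col}")
            records.setdefault(cell.row, {})[key] = cell.text
    return [records[r] for r in sorted(records)]


if __name__ == "__main__":
    sample = [["item", "qty"], ["apple", "3"], ["pear", "5"]]
    print(extract_records(sample))
```

In this sketch the layout knowledge lives entirely in the rule list, so supporting a different table layout would mean swapping the rules rather than rewriting the extraction code.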
Available on GitHub: https://github.com/tabbydoc/tabbyxl
Semantic table interpretation
Available on GitHub: https://github.com/tabbydoc/tabbyld