Our research consists of the following parts.
Web-based tool for semi-automatic PDF table extraction and annotation
The demo is accessible at http://cells.icc.ru/pdfte
Available on GitHub:
- Core, https://github.com/cellsrg/tabbypdf
- Server-side, https://github.com/cellsrg/tabbypdf-web
- Client-side, https://github.com/cellsrg/tabbypdf-front
PDF table extraction
Many non-editable documents are shared in PDF (Portable Document Format). Their pages typically carry no tags that annotate the layout structure. Extracting tables from such documents is therefore a challenging task. We are developing a tool for automatic PDF table detection and cell structure recognition that combines artificial neural networks with ad-hoc heuristics. Preliminary results show that, on the test datasets, this approach performs comparably to state-of-the-art solutions.
Available on GitHub: https://github.com/tabbydoc/tabbypdf2
Prototype of a web-based system for spreadsheet data canonicalization
The demo is accessible at http://cells.icc.ru/ssdc
Rule-based spreadsheet data extraction and transformation
Spreadsheets are widely used in science, engineering, business, and other activities. Overall, they hold a large volume of data in an unstructured form. We are developing a novel software platform designed to liberate the data from this form. The platform provides rule-based spreadsheet data extraction and transformation to a structured form. Its core consists of a flexible table object model and a domain-specific rule language for table analysis. Together they represent knowledge about table layout and content features, as well as the interpretation of those features, which depends on the transformation goals. This enables the processing of arbitrary tables originating from various domains. Our empirical results demonstrate that the platform can be used to develop programs for rule-based conversion of data from arbitrary spreadsheet tables.
Available on GitHub: https://github.com/tabbydoc/tabbyxl
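To make the idea of a table object model combined with analysis rules more concrete, here is a minimal sketch in Python. It is not the platform's actual API or rule language: every name in it (Cell, Rule, apply_rules, mark, attach_labels, to_records) is hypothetical, and the layout assumption (labels in the first row and first column, entries elsewhere) is chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Cell:
    """One cell of a hypothetical table object model."""
    row: int
    col: int
    text: str
    role: Optional[str] = None                       # "label" or "entry", set by rules
    labels: List[str] = field(default_factory=list)  # labels attached to an entry


@dataclass
class Rule:
    """A condition/action pair; the action fires for every matching cell."""
    condition: Callable[[Cell], bool]
    action: Callable[[Cell, List[Cell]], None]


def apply_rules(cells: List[Cell], rules: List[Rule]) -> None:
    # Rules run in order, so later rules can rely on roles set by earlier ones.
    for rule in rules:
        for cell in cells:
            if rule.condition(cell):
                rule.action(cell, cells)


def mark(role: str) -> Callable[[Cell, List[Cell]], None]:
    # Build an action that assigns the given role to the matched cell.
    def action(cell: Cell, _cells: List[Cell]) -> None:
        cell.role = role
    return action


def attach_labels(cell: Cell, cells: List[Cell]) -> None:
    # Attach the column header (row 0) and the row header (column 0) to an entry.
    for other in cells:
        if other.role != "label":
            continue
        same_column_head = other.row == 0 and other.col == cell.col
        same_row_head = other.col == 0 and other.row == cell.row
        if same_column_head or same_row_head:
            cell.labels.append(other.text)


# Assumed layout: non-empty cells in the first row or column are labels,
# all remaining cells are entries.
RULES = [
    Rule(lambda c: (c.row == 0 or c.col == 0) and c.text != "", mark("label")),
    Rule(lambda c: c.row > 0 and c.col > 0, mark("entry")),
    Rule(lambda c: c.role == "entry", attach_labels),
]


def to_records(grid: List[List[str]]) -> List[Dict[str, object]]:
    # Convert a grid of strings into structured records via the rules above.
    cells = [Cell(r, c, text)
             for r, row in enumerate(grid)
             for c, text in enumerate(row)]
    apply_rules(cells, RULES)
    return [{"value": c.text, "labels": c.labels}
            for c in cells if c.role == "entry"]


grid = [
    ["",       "2019", "2020"],
    ["Export", "10",   "12"],
    ["Import", "7",    "9"],
]
records = to_records(grid)
```

Keeping the layout knowledge in the rule list rather than in the conversion code mirrors the separation described above: a table with a different arrangement of labels would need a different set of rules, while the object model and the rule runner stay the same.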