TabbyDOC

Table Understanding Research

About

This research project aims to develop methods and software for extracting entities and their relationships from tables represented in unstructured and semi-structured data formats.

This work was supported by the Russian Science Foundation (grant no. 18-71-10001). Our earlier work was supported by the Russian Foundation for Basic Research (grants no. 12-07-31051 and no. 15-37-20042) and the Council for Grants of the President of the Russian Federation (scholarship no. SP-3387.2013.5).

Source code

tabbypdf, Rule-based PDF table extraction
tabbypdf2, Deep-learning-based PDF table extraction
tabbyxl, Rule-based spreadsheet data extraction
tabbyld, Semantic table interpretation using open knowledge graphs

Publications

2022

Shigarov, A. (2022). Table understanding: Problem overview. WIREs Data Mining and Knowledge Discovery, 13(1), e1482. https://doi.org/10.1002/widm.1482
Kostyleva, O., Paramonov, V., Shigarov, A., Vetrova, V. (2022). Towards comparison of table type taxonomies. 45th Jubilee Int. Conv. on Information, Communication and Electronic Technology (MIPRO), 1461-1465. https://doi.org/10.23919/MIPRO55190.2022.9803520

2021

Dorodnykh, N., Yurin, A., Shigarov, A., Turdakov, D. (2021). Ontology engineering at the assertion level based on semantic annotation of tabular data. 2021 Ivannikov Memorial Workshop (IVMEM). 28-34. https://doi.org/10.1109/IVMEM53963.2021.00011
Yurin, A., Dorodnykh, N., Shigarov, A. (2021). Semi-automated formalization and representation of the engineering knowledge extracted from spreadsheet data. IEEE Access. 9, 157468-157481. https://doi.org/10.1109/ACCESS.2021.3130172
Paramonov, V., Shigarov, A., Vetrova, V. (2021). Rule-driven spreadsheet data extraction from statistical tables: case study. Information and Software Technologies. ICIST 2021. CCIS 1486, 84-95. https://doi.org/10.1007/978-3-030-88304-1_7
Dorodnykh, N., Yurin, A. (2021). TabbyLD: a tool for semantic interpretation of spreadsheets data. Modelling and Development of Intelligent Systems. MDIS 2020. CCIS 1341, 315-333. https://doi.org/10.1007/978-3-030-68527-0_20
Dorodnykh, N., Shigarov, A., Yurin, A. (2022). Using the semantic annotation of web table data for knowledge base construction. Proc. 4th Artificial Intelligence and Cloud Computing Conference. AICCC’21, 122-129. https://doi.org/10.1145/3508259.3508277
Dorodnykh, N., Yurin, A. (2022). Extraction of facts from web-tables based on semantic interpretation tabular data. 2022 Ivannikov Memorial Workshop (IVMEM), 7-17. https://doi.org/10.1109/IVMEM57067.2022.9983959
Mikhailov, A., Shigarov, A. (2021). Page layout analysis for refining table extraction from PDF documents. 2021 Ivannikov ISPRAS Open Conference (ISPRAS), 114-119. https://doi.org/10.1109/ISPRAS53967.2021.00021

2020

Mikhailov, A., Shigarov, A., Rozhkov, E., Cherepanov, I. (2020). On graph-based verification for PDF table detection. 2020 Ivannikov ISPRAS Open Conference (ISPRAS). 91-95. https://doi.org/10.1109/ISPRAS51486.2020.00020
Cherepanov, I., Mikhailov, A., Shigarov, A., Paramonov, V. (2020). On automated workflow for fine-tuning deep neural network models for table detection in document images. 2020 43rd International Convention on Information, Communication and Electronic Technology. 1130-1133. https://doi.org/10.23919/MIPRO48935.2020.9245241
Dorodnykh, N., Yurin, A. (2020). Towards a universal approach for semantic interpretation of spreadsheets data. Proc. 24th Symposium on International Database Engineering & Applications. Article 22, 1-9. https://doi.org/10.1145/3410566.3410609
Paramonov, V., Shigarov, A., Vetrova, V. (2020). Table header correction algorithm based on heuristics for improving spreadsheet data extraction. Information and Software Technologies. 1283 CCIS, 147-158. https://doi.org/10.1007/978-3-030-59506-7_13
Yurin, A., Dorodnykh, N. (2020). Experimental evaluation of a spreadsheets transformation in the context of domain model engineering. Ural Symposium on Biomedical Engineering, Radioelectronics and Information Technology (USBEREIT). 0388-0391. https://doi.org/10.1109/USBEREIT48449.2020.9117674
Dorodnykh, N., Yurin, A., Shigarov, A. (2020). Conceptual model engineering for industrial safety inspection based on spreadsheet data analysis. Modelling and Development of Intelligent Systems. 1126 CCIS, 51-65. https://doi.org/10.1007/978-3-030-39237-6_4
Paramonov, V., Shigarov, A., Vetrova, V., Mikhailov, A. (2020). Heuristic algorithm for recovering a physical structure of spreadsheet header. Information Systems Architecture and Technology. 1050 AISC, 140-149. https://doi.org/10.1007/978-3-030-30440-9_14

2019

Yurin, A., Dorodnykh, N. (2019). A reverse engineering process for inferring conceptual models from canonicalized tables. 2019 Int. Multi-Conf. on Engineering, Computer and Information Sciences (SIBIRCON). 0485-0490. https://doi.org/10.1109/SIBIRCON48586.2019.8958458
Shigarov, A., Khristyuk, V., Mikhailov, A., Paramonov, V. (2019). TabbyXL: rule-based spreadsheet data extraction and transformation. Information and Software Technologies. 1078 CCIS, 59-75. https://doi.org/10.1007/978-3-030-30275-7_6
Preprint
Presentation
Shigarov, A., Khristyuk, V., Mikhailov, A. (2019). TabbyXL: software platform for rule-based spreadsheet data extraction and transformation. SoftwareX, 10. https://doi.org/10.1016/j.softx.2019.100270
Preprint
Shigarov, A., Cherepanov, I., Cherkashin, E., Dorodnykh, N., Khristyuk, V., Mikhailov, A., Paramonov, V., Rozhkov, E., Yurin, A. (2019). Towards end-to-end transformation of arbitrary tables from untagged portable documents (PDF) to linked data. CEUR-WS Proc. 2463, 1-12.
Article
Shigarov, A., Khristyuk, V., Mikhailov, A., Paramonov, V. (2019). Software development for rule-based spreadsheet data extraction and transformation. Proc. 42nd Int. Convention on Information and Communication Technology, Electronics and Microelectronics. 1132-1137. https://doi.org/10.23919/MIPRO.2019.8756829
Preprint
Cherkashin, E., Shigarov, A., Paramonov, V., Mikhailov, A. (2019). Digital archives supporting document content inference. Proc. 42nd Int. Convention on Information and Communication Technology, Electronics and Microelectronics. 1037-1042. https://doi.org/10.23919/MIPRO.2019.8757196
Preprint
Dorodnykh, N., Yurin, A. (2019). Towards ontology engineering based on transformation of conceptual models and spreadsheet data: a case study. Intelligent Systems Applications in Software Engineering. 1046 AISC, 233-247. https://doi.org/10.1007/978-3-030-30329-7_22
Preprint
Paramonov, V., Shigarov, A., Ruzhnikov, G., Cherkashin, E. (2019). Phonetic string matching for languages with Cyrillic alphabet. Information Systems Architecture and Technology. 852 AISC, 301-311. https://doi.org/10.1007/978-3-319-99981-4_28
Preprint

2018

Shigarov, A., Altaev, A., Mikhailov, A., Paramonov, V., Cherkashin, E. (2018). TabbyPDF: web-based system for PDF table extraction. Information and Software Technologies. 920 CCIS, 257-269. https://doi.org/10.1007/978-3-319-99972-2_20
Preprint
Yang, S., Wei, R., Shigarov, A. (2018). Semantic interoperability for electronic business through a novel cross-context semantic document exchange approach. Proc. 18th ACM Symposium on Document Engineering. 28:1-28:10. https://doi.org/10.1145/3209280.3209523
Cherkashin, E., Kopaygorodsky, A., Kazi, L., Shigarov, A., Paramonov, V. (2018). Model driven architecture implementation using linked data. Information and Software Technologies. 920 CCIS, 412-423. https://doi.org/10.1007/978-3-319-99972-2_34
Preprint

2017

Shigarov, A., Mikhailov, A. (2017). Rule-based spreadsheet data transformation from arbitrary to relational tables. Information Systems. 71, 123-136. https://doi.org/10.1016/j.is.2017.08.004
Preprint

2016

Shigarov, A., Mikhailov, A., Altaev, A. (2016). Configurable table structure recognition in untagged PDF documents. Proc. 16th ACM Symposium on Document Engineering. 119-122. https://doi.org/10.1145/2960811.2967152
Preprint
Poster
Shigarov, A., Paramonov, V., Belykh, P., Bondarev, A. (2016). Rule-based canonicalization of arbitrary tables in spreadsheets. Information and Software Technologies. 639 CCIS, 78-91. https://doi.org/10.1007/978-3-319-46254-7_7
Preprint
Paramonov, V., Shigarov, A., Ruzhnikov, G., Belykh, P. (2016). Polyphon: an algorithm for phonetic string matching in Russian language. Information and Software Technologies. 639 CCIS, 568-579. https://doi.org/10.1007/978-3-319-46254-7_46
Preprint
Shigarov, A. (2016). Methodological and software support for transforming tabular data from arbitrary to relational form. Scientific session of the meeting of the Joint Scientific Council of SB RAS on Nanotechnology and Information Technology (in Russian).
Presentation

2015

Shigarov, A. (2015). Table understanding using a rule engine. Expert Systems with Applications. 42(2), 929-937. https://doi.org/10.1016/j.eswa.2014.08.045
Preprint
Presentation
Shigarov, A. (2015). Rule-based table analysis and interpretation. Information and Software Technologies. 538 CCIS, 175-186. https://doi.org/10.1007/978-3-319-24770-0_16
Preprint
Shigarov, A. O., Bychkov, I. V., Paramonov, V. V., Belykh, P. V. (2015). Analysis and interpretation of arbitrary tables based on the execution of CRL rules. Vychislitelnye Tekhnologii (Computational Technologies). 20(6), 87-112 (in Russian).
Preprint
Shigarov, A., Paramonov, V. (2015). CRL: a rule language for analysis and interpretation of arbitrary tables. CEUR-WS Proc. 1536, 22-29.
Article
Presentation

2014

Shigarov, A. O. (2014). Recovering the logical structure of tables from unstructured texts based on logical inference. Vychislitelnye Tekhnologii (Computational Technologies). 19(1), 87-99 (in Russian).
Preprint
Shigarov, A. (2014). Automated table understanding using a rule engine. CEUR-WS Proc. 1297, 216-223.
Article

2013

Shigarov, A. O., Bychkov, I. V., Ruzhnikov, G. M., Hmelnov, A. E., Fedorov, R. K. (2013). A table transformation system. Informatsionnye Tekhnologii i Vychislitelnye Sistemy (Information Technologies and Computing Systems). 3, 15-26 (in Russian).
Preprint

2011

Shigarov, A., Fedorov, R. (2011). Simple algorithm for page layout analysis. Pattern Recognition and Image Analysis. 21(2), 324-327. https://doi.org/10.1134/S1054661811021008
Preprint

2009

Shigarov, A. O. (2009). A technology for extracting tabular information from electronic documents of different formats. Candidate of Technical Sciences dissertation (in Russian).
PhD Thesis
PhD Abstract
Presentation
Shigarov, A., Bychkov, I., Hmelnov, A., Ruzhnikov, G. (2009). A method for table detection in metafiles. Pattern Recognition and Image Analysis. 19(4), 693-697. https://doi.org/10.1134/S1054661809040191
Preprint
Poster
Bychkov, I. V., Ruzhnikov, G. M., Hmelnov, A. E., Shigarov, A. O. (2009). A heuristic method for detecting tables in documents of different formats. Vychislitelnye Tekhnologii (Computational Technologies). 14(2), 58-73 (in Russian).
Preprint

2008

Hmelnov, A. E., Shigarov, A. O. (2008). A method for extracting tables from unformatted text. Vychislitelnye Tekhnologii (Computational Technologies). 13(1), 93-101 (in Russian).
Preprint

Contacts

Department of Information Technology and Systems, Matrosov Institute for System Dynamics and Control Theory, Siberian Branch of the Russian Academy of Sciences
Office 222, Block EVM, Lermontov st. 134, Irkutsk, Russia, 664033

Alexey Shigarov (e-mail: shigarov@gmail.com)

tabbydoc.github.io

Tabular Document Analysis Research Group at ISDCT SB RAS