Keynote Speakers – ICDAR 2026

We are happy to announce the keynote speakers of ICDAR 2026!

Keynotes

August 31 – The Changing Landscape of Document Understanding – Prof. Dr. C. V. Jawahar
September 1 – From Archives to Algorithms and Back: What Historians Need from Document Understanding – Prof. Dr. Christoph Rass
September 2 – The Next Frontier: From Document Understanding to Document Agency – Dr. Tong Sun

Young Investigator Award

September 1 – The paraverbal language of text images: how your text looks matters as much as what it reads – Dr. Silvia Cascianelli
September 1 – Advances and Open Challenges in Optical Music Recognition – Prof. Dr. Jorge Calvo-Zaragoza

Prof. Dr. C. V. Jawahar

IIIT Hyderabad, India
Co-Founder, Centre for Visual Information Technology (CVIT)

The Changing Landscape of Document Understanding

Abstract

Document understanding has changed dramatically over the past two decades. Early systems were built as tightly connected pipelines combining optical character recognition, layout analysis, and rule-based post-processing. Over time, the field has moved toward richer representations that aim to capture not just text, but also structure, semantics, and intent. This shift is now accelerating, driven by large-scale data, self-supervised learning, and foundation models that increasingly integrate vision, language, and reasoning.

This keynote reflects on how document understanding research has evolved and what has driven these changes. It revisits key shifts in perspective, from handcrafted features to learned representations, from narrowly defined tasks to more unified models, and from explicit structural modeling to implicit multimodal reasoning. Despite this progress, documents continue to pose distinctive challenges. Complex layouts, long-range dependencies, multilingual content, and domain-specific conventions still make document understanding resistant to simple end-to-end solutions.

The talk highlights emerging directions that are shaping both research and practice in the field. These include the growing influence of foundation models, new forms of supervision and data creation, and evolving evaluation practices that move beyond recognition accuracy alone. We also discuss how these developments affect ongoing research and the researchers in the document community, as well as under-represented languages and geographies. The keynote also touches on open questions and opportunities for the community, emphasizing the importance of benchmarks that enable deeper and more meaningful document intelligence.

Bio

Dr. C. V. Jawahar is a Professor at IIIT Hyderabad and the co-founder of the Centre for Visual Information Technology (CVIT). His research spans document image analysis, computer vision, and multimodal learning, with a long-standing focus on building intelligent systems that go beyond recognition to enable higher-level understanding and reasoning over visual and textual data.

Prof. Jawahar has made significant contributions to document understanding, multilingual technologies, and real-world vision systems, particularly in the context of large-scale document processing for societal applications. He has been an active contributor to the ICDAR community through research, tutorials, and leadership roles.

His work has influenced both academic research and systems operating at population scale. He is a recipient of several national and international recognitions, including the ACM’s Outstanding Contribution to Computing Education (OCCE) and Outstanding Applied Innovation in Computing (OAIC). His current research interests include vision-language models, document intelligence, and next-generation AI systems that integrate perception with reasoning. He is passionate about applications in road safety, assistive technology, education, entertainment, and cultural heritage.

Prof. Dr. Christoph Rass

Osnabrück University

From Archives to Algorithms and Back: What Historians Need from Document Understanding

Abstract

Historians may be among the most demanding users of document analysis technologies – a point often overlooked on both sides of the aisle. The nature of the challenge, however, differs by period. Modern history is drowning in sources: typed reports, standardized forms, administrative registries running into millions of pages – often legible enough, yet far exceeding any human capacity to process. Earlier periods present the inverse problem: fewer documents, but harder to read, written in historical hands and damaged by centuries of use. What unites both is that understanding a historical document means more than recognizing characters. It requires reconstructing meaning, context, and intent; it often demands that we first recover the very categories – administrative, legal, racial – through which contemporaries made sense of their world.

This keynote offers a historian’s perspective on document understanding, drawing on nearly three decades of data-driven research. When I began building databases from Wehrmacht personnel files in 1996, “electronic data processing” meant relational databases and statistical analysis of manually transcribed records – laborious, yet transformative. Since then, the field has passed through several technological generations: geographical information systems for modeling spatial patterns; automated extraction from archival card files; and now Large Vision Language Models capable of processing document images directly. Each transition opened new possibilities. Each also raised – difficult – questions about what we gain and what we lose when machines mediate our access to the past.

Three themes structure this reflection. First, scale: digital methods allow us to process millions of documents, changing what questions become tractable – yet scale requires new forms of judgment about what patterns mean and what they obscure. Second, representation: how do we model historical knowledge in graphs and ontologies without flattening ambiguity, or worse, reproducing the categories of those who created the documents? How do we encode uncertainty in ways that remain legible to future researchers? Third, validation: when extraction volumes exceed any possibility of comprehensive review, what role can domain expertise still play – and where does it reach its limits?

Document analysis and historical research need each other – not as a polite formula for interdisciplinary funding applications, but as a methodological necessity. Large Vision Language Models can enable “deep indexing” of archival holdings at unprecedented scale. Whether this proves meaningful depends on sustained exchange across disciplinary boundaries. Historians bring questions, contextual knowledge, and a trained suspicion toward confident claims; computer scientists bring methods and the capacity to implement solutions that work on millions of pages. The challenge is neither simply technical nor simply interpretive. It lies in learning to ask, together, what “understanding” a document actually requires.

Bio

Christoph Rass (Dr. rer. pol.) is Professor of Modern History and Historical Migration Research at Osnabrück University, Germany. His path toward data-driven historical research began at Saarland University in the early 1990s, where he studied history alongside information science. His dissertation (2001), a database-driven collective biography of Wehrmacht personnel records, established an approach he has pursued through several technological generations since: from relational databases and statistical social profile analysis, through geographical information systems for modeling migration and organized violence, to automated extraction from historical card files. Projects on the Osnabrück Gestapo registry and the city’s foreign resident registry developed pipelines for extracting structured data from administrative documents – typed, handwritten, and hybrid – at scale. Current work aims to extend these approaches to Large Vision Language Models. Funding has come from the German Research Foundation (DFG), the Volkswagen Foundation, and the Foundation Remembrance, Responsibility and Future. Collaborative work with the Pattern Recognition Group at TU Dortmund on few-shot information extraction was presented at ICDAR 2025.

Rass is Principal Investigator in the DFG Collaborative Research Center 1604 “Production of Migration,” where his subproject applies digital corpus analysis to trace the historical semantics of categories such as ‘Gastarbeiterkinder’ in educational discourse. He contributes to the Lower Saxony research consortium FuturMig (2025–2029) and serves on the board of the Institute for Migration Research and Intercultural Studies.

Dr. Tong Sun

Senior Director, Adobe

The Next Frontier: From Document Understanding to Document Agency

Abstract

For decades, the ICDAR community has advanced the state of the art in document analysis—transforming static pages into machine-readable and structured representations. Today, however, rapid progress in large language models, multimodal foundation models, and agentic AI systems is reshaping the landscape of knowledge work.

Understanding alone is no longer the end goal.

This keynote introduces the concept of document agency—a new paradigm in which documents evolve from passive information containers into dynamic, interactive, and executable knowledge environments. Rather than serving merely as inputs to AI systems, documents become structured state spaces that support reasoning, planning, personalization, and tool use.

I will explore the technical foundations required for this shift, including multimodal structural representations, long-context reasoning architectures, generative interface layers, and new evaluation frameworks that measure collaborative intelligence rather than extraction accuracy.

As AI transforms the knowledge workplace, ICDAR has an opportunity to lead the next frontier: building document systems that not only understand content, but actively participate in cognition and action.

Bio

Dr. Tong Sun is Senior Director of the Document Intelligence Lab at Adobe, where she leads research at the intersection of multimodal AI, document foundation models, and agentic systems. Her work focuses on transforming documents from static files into interactive, intelligent media that power the next generation of knowledge work. Bridging cutting-edge research with large-scale product impact, she and her team at Adobe Research have advanced structured multimodal reasoning, efficient on-device models, and generative interfaces—shaping a future where document systems actively collaborate with humans in thinking, decision-making, and action.

Dr. Silvia Cascianelli

Young Investigator Award

University of Modena and Reggio Emilia, Italy

The paraverbal language of text images: how your text looks matters as much as what it reads

Abstract

Every written artifact (a handwritten letter, a music score, a printed sign photographed in the wild) is a visual object, and how it looks carries meaning of its own beyond what it reads: the hand of a writer, the conventions of a notation system, an aesthetic, a historical trace. Learning to model this visual dimension is what allows us not only to read text, but to reproduce it, restyle it, and edit it, with direct consequences for document analysis, digital humanities, cultural heritage, and accessibility.

This talk traces a research line that treats text as visual content to be modeled, generated, and edited. It will begin with handwritten text generation, from transformer-based GANs that render arbitrary content to more recent autoregressive formulations that improve controllability and faithfulness. I will then turn to controllable diffusion models to perform style transfer across images to support automatic recognition. Finally, I will present a pipeline for editing printed text directly within images, built on flow-matching models and extended to the challenging multi-instance setting.

The aim of the talk is to convey the idea that faithfully capturing how text looks is at once a demanding generative problem and a powerful instrument that complements recognition and opens new ways for the document community to preserve, understand, and interact with visually rich text.

Bio

Silvia Cascianelli is an Assistant Professor in the Department of Engineering “Enzo Ferrari” at the University of Modena and Reggio Emilia, Italy. She is a member of the Equality, Diversity and Inclusion Committee of the International Association for Pattern Recognition (IAPR) and of the European Laboratory for Learning and Intelligent Systems (ELLIS). She received her European PhD in Industrial and Information Engineering from the University of Perugia in 2019. Her research lies at the intersection of Computer Vision, Generative AI, Document AI, Multimodal AI, and Digital Humanities, with a particular focus on generating, editing, and understanding visually rich documents. She has published extensively in top-tier venues, including CVPR, ICML, ECCV, and IEEE T-PAMI, and fosters active research collaborations with international universities and industrial partners, including Google. She contributes to the community as an Associate Editor for IEEE Robotics and Automation Letters and as an Area Chair for conferences such as CVPR, ECCV, AAAI, and ICDAR. She is passionate about applying generative models to cultural heritage, digital humanities, and accessibility.

Prof. Dr. Jorge Calvo-Zaragoza

Young Investigator Award

University of Alicante, Spain

Advances and Open Challenges in Optical Music Recognition

Abstract

Optical Music Recognition (OMR) addresses the automatic reading of music score images and their conversion into machine-readable representations. While often compared to OCR, OMR poses a distinctive document-analysis problem: music notation is not primarily a linear sequence of characters, but a structured visual language in which meaning depends on spatial organization, graphical relationships, simultaneity, notation conventions, and musical context. As a result, OMR requires not only recognizing visual elements on the page, but also reconstructing the musical structure that makes those elements interpretable and usable.

This keynote will introduce OMR from the perspective of document image analysis, outlining what makes music scores different from other document-reading problems and why they remain a challenging test case for visual recognition and structured interpretation. It will provide an overview of the evolution of the field, highlighting representative advances in approaches, methods, representations, datasets, evaluation, and practical tools. The talk will use selected examples to show how OMR has evolved from early recognition pipelines to current learning-based end-to-end approaches.

The talk will also discuss open challenges that continue to define the field, including the recovery of musical structure beyond symbol recognition, evaluation, data limitations, robustness across domains, and the role of human correction in real-world workflows. More broadly, it will present OMR as a relevant case study for document analysis problems in which reading means moving from visual objects to structured, domain-specific meaning.

Bio

Jorge Calvo-Zaragoza is Full Professor at the University of Alicante, Spain, where he leads the Pattern Recognition and Artificial Intelligence Group (PRAIG), devoted to machine learning, computer vision, document analysis, and music information retrieval. He received his PhD from the University of Alicante in 2016 and subsequently held postdoctoral positions at McGill University, Canada, and the Universitat Politècnica de València, Spain.

His research focuses especially on Optical Music Recognition (OMR), addressing how to transform digitised music scores into structured, machine-readable musical documents that support intelligent access to musical heritage. He has contributed to the consolidation of OMR through deep learning methods for printed and handwritten music recognition, structured representations, benchmarks, evaluation frameworks, software tools, and community-building initiatives such as the Workshop on Reading Music Systems (WoRMS). In this area, he has led several national and European research projects, supervised doctoral theses, and developed collaborations with cultural heritage institutions that have helped bring OMR closer to real-world applications.