Abstract: Information, stored or transmitted in digital form, is often structured. Individual data records are usually represented as hierarchies of their elements. Together, records form larger structures. Information processing applications have to take account of this structuring, which assigns different semantics to different data elements or records. The wide variety of structural schemata in use today often demands considerable flexibility from applications, for example when processing information that comes from different sources. To ensure application interoperability, translators are needed that can convert one structure into another. This paper puts forward a formal data model aimed at supporting hierarchical data processing in a simple and flexible way. The model builds on and extends results from two classical theories: those of finite string automata and finite tree automata. The concept of finite automata and regular languages is applied to the case of arbitrarily structured tree-like hierarchical data records, represented as "structured strings." These automata are compared with classical string and tree automata; the model is shown to be a superset of the classical models. Regular grammars and expressions over structured strings are introduced. Regular expression matching and substitution have been widely used for efficient unstructured text processing; the model described here brings the power of this proven technique to applications that deal with information trees. A simple generic alternative is offered to replace today's specialised ad-hoc approaches. The model unifies structural and content transformations, providing applications with a single data type. An example scenario of how to build applications based on this theory is discussed. Further research directions are outlined.

Data Language Specification via Terminal Attribution

Saggitarius: A DSL for Specifying Grammatical Domains

Parsing methods streamlined

Specification and Verification for Semi-Structured Data

Generalizing input-driven languages: theoretical and practical benefits

Managing Complex Structured Data In a Fast Evolving Environment

3DGen: AI-Assisted Generation of Provably Correct Binary Format Parsers

A Chart-Parsing Algorithm for Efficient Semantic Analysis

NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System

Exploiting limited data for parsing

Using Tree Automata and Regular Expressions to Manipulate Hierarchically Structured Data

LL(2) parsing approach based on LL(1)

Lexicalization and Grammar Development

A Hybrid Semantic Parsing Approach for Tabular Data Analysis

Hal: A Language-General Framework for Analysis of User-Specified Monotone Frameworks [DRAFT]

From Business Modeling to Software Design

Aperture synthesis for gravitational-wave data analysis: Deterministic Sources

A Biologically Plausible Parser

Nominal Tree Automata With Name Allocation

Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction

Types and Semantics for Extensible Data Types (Extended Version)