2010
DOI: 10.1145/1809028.1806622

A context-free markup language for semi-structured text

Abstract: An ad hoc data format is any non-standard, semi-structured data format for which robust data processing tools are not available. In this paper, we present ANNE, a new kind of mark-up language designed to help users generate documentation and data processing tools for ad hoc text data. More specifically, given a new ad hoc data source, an ANNE programmer will edit the document to add a number of simple annotations, which serve to specify its syntactic structure. Annotations include elements that specify constan…

Citations: cited by 4 publications (3 citation statements)
References: 23 publications (8 reference statements)
“…The PADS project has enabled simplification of ad hoc data processing tasks for programmers by contributing along several dimensions: development of domain specific languages for describing text structure or data format [2,3], learning algorithms for automatically inferring such formats [4], and a markup language to allow users to add simple annotations to enable more effective learning of text structure [23]. The learned format can then be used by programmers for documentation or implementation of custom data analysis tools.…”
Section: Data Processing For Programmers (mentioning)
confidence: 99%
“…It even supports automatic grammar inference [14]. ANNE [15] is an eclectic tool that derives PADS [13] data format specifications from user-annotated data sources. Data Format Description Language (DFDL) [16] is a complex language for the specification of data formats.…”
Section: Related Work (mentioning)
confidence: 99%
“…Data Extraction from Log Files. The PADS project [7] has enabled simplification of ad hoc data processing tasks for programmers by contributing along several dimensions: development of domain specific languages for describing text structure or data format, learning algorithms for automatically inferring such formats [8], and a markup language to allow users to add simple annotations to enable more effective learning of text structure [23]. While PADS supports parsing of entire files, FlashExtract allows users to extract only parts of the file, thereby avoiding unnecessary complications. PADS's learner only supports a fixed line-by-line chunking strategy to split the records; in contrast, FlashExtract can learn chunking (aka structure boundaries) from examples, making it suitable for extracting data fields and records that have arbitrary length (and might cross multiple lines).…”
Section: Related Work (mentioning)
confidence: 99%
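
The last citation statement contrasts a fixed line-by-line chunking strategy with chunking at structure boundaries. The following is a minimal Python sketch of that difference, not code from PADS, ANNE, or FlashExtract. The function names, the sample log, and the date-prefix pattern are all hypothetical, and the boundary pattern is written by hand here, whereas the statement describes FlashExtract as learning boundaries from examples.

```python
import re


def chunk_by_line(text):
    """Fixed strategy: every non-empty line is treated as one record."""
    return [line for line in text.splitlines() if line.strip()]


def chunk_by_boundary(text, start_pattern):
    """Boundary strategy: a record starts at each line matching
    start_pattern and runs until the next such line, so one record
    may span several lines."""
    records, current = [], []
    for line in text.splitlines():
        # A new record begins here; flush the one collected so far.
        if re.match(start_pattern, line) and current:
            records.append("\n".join(current))
            current = []
        if line.strip():
            current.append(line)
    if current:
        records.append("\n".join(current))
    return records


# Hypothetical log excerpt: two of the three entries continue on indented lines.
LOG = """2011-03-01 ERROR disk full
  at volume /dev/sda1
2011-03-01 INFO retry scheduled
2011-03-02 ERROR timeout
  at host 10.0.0.7
  after 30s
"""

line_records = chunk_by_line(LOG)
# Assumption: a record starts with an ISO-style date followed by a space.
boundary_records = chunk_by_boundary(LOG, r"\d{4}-\d{2}-\d{2} ")

print(len(line_records), len(boundary_records))
```

On this sample, line-by-line chunking splits each multi-line entry into separate fragments, while boundary-based chunking keeps the continuation lines together with the entry they belong to.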