DERIN: A data extraction method based on rendering information and n-gram

Figueiredo, Leandro Neiva Lopes; Assis, Guilherme Tavares de; Ferreira, Anderson A.

doi:10.1016/j.ipm.2017.04.007

Cited by 16 publications

(4 citation statements)

References 11 publications


“…In the literature, there are many proposals to extract data from HTML documents in general, not specifically tables (Ferrara, de Meo, Fiumara, & Baumgartner, 2014; Sleiman & Corchuelo, 2013a). They rely on text alignment (Sleiman & Corchuelo, 2013b), neural networks (Sleiman & Corchuelo, 2014), learning first-order rules (Jiménez & Corchuelo, 2016a), inferring propositio-relational rules (Jiménez & Corchuelo, 2016b), learning decision trees (Uzun, Agun, & Yerlikaya, 2013), embedding graphs (Jiménez, Roldán, Gallego, & Corchuelo, 2020), or using n-grams and rendering information (Figueiredo, Assis, & Ferreira, 2017), to mention a few. Unfortunately, they do not seem to be appropriate to extract the underlying relationships between the cells in HTML tables (Cafarella et al., 2018), which motivated much work on table-understanding (Roldán et al., 2020; Zhang & Balog, 2020).…”

Section: Context and Motivation (mentioning)

confidence: 99%

A clustering approach to extract data from HTML tables

Jiménez, Roldán, Corchuelo

2021

Information Processing & Management


“…Data mining principles can be independent of a particular domain for knowledge extraction [11] since their methods are able to learn how to extract the data, perform a given analysis domain independently and detect different record structures and their attributes based on rendering information [18]. It is increased the importance of understanding correlations between data, and data mining methods are interesting to find some patterns and association rules for various analyses and decision aids such as product category recommendations and determination of possible behavioral changes [31].…”

Section: Data Mining and Meteorology (mentioning)

confidence: 99%

Explainability with Association Rule Learning for Weather Forecast

2021


The reliability of the weather forecast models is a complex issue since it depends on numerous parameters and the technical infrastructure which supports them. In doing so, there is a need for advanced works oriented towards a better understanding of these models and the analysis of main associated parameters. Our approach is to study the applicability of the extracted association rules to provide a clearer understanding of atmospheric exchanges. In this work, the proposed methodology is based on the discovery of the interesting interpretable relationships between measured meteorological parameters at the Atmospheric Research Center of Lannemezan (South-West of France). In the preprocessing step, the proposed method is considered to be effectively flexible to account for data uncertainties, unlike the majority of classical evaluation methods mainly directed towards the reduction of variables and data redundancy. In postprocessing, the advantage of our approach is that the extracted rules are a metamodeling of interpretable useful knowledge for the clarity and conciseness of its representation. Moreover, in the processing, the interpretability in data sciences is recent and still in its infancy. The generated association rules with their statistical and semantic interpretations have globally highlighted the possibilities of explicit analysis of meteorological parameters. This study showed that among the generated relevant rules, three parameters (temperature, humidity, wind speed) have a high frequency in the antecedents of the rules and that the only consequence is rain. This is useful for the identification of potential improvements and gaps in the existing models of atmospheric observations, in particular, to understand the related parameterizations to the productivity of the rain phenomenon.


“…n-Gram Models help determine the probability of a sequence of words in a sentence or in a text. Their application varies from identifying patterns in text [42] to data extraction [43], automatic speech recognition, machine translation, and spell checking [44,45]. Neural Network Language models offer an improved version [46], both having the potential to be integrated into computer-assisted tools for supporting text reviewers.…”

Section: Natural Language Processing Approaches (mentioning)

confidence: 99%

Hit or Miss? Evaluating the Potential of a Research Niche: A Case Study in the Field of Virtual Quality Management

et al. 2019


When knowledge is developed fast, as it is the case so often nowadays, one of the main difficulties in initiating new research in any field is to identify the domain's specific state-of-the-art and trends. In this context, to evaluate the potential of a research niche by assisting the literature review process and to add a new and modern large-scale and automated dimension to it, the paper proposes a methodology that uses "Latent Semantic Analysis" (LSA) for identifying trends, focused within the knowledge space created at the intersection of three sustainability-related methodologies/concepts: "virtual Quality Management" (vQM), "Industry 4.0", and "Product Life-Cycle" (PLC). The LSA was applied to a significant number of scientific papers published around these concepts to generate ontology charts that describe the knowledge structure of each by the frequency, position, and causal relation of associated notions. These notions are combined for defining the common high-density knowledge zone from where new technological solutions are expected to emerge throughout the PLC. The authors propose the concept of the knowledge space, which is characterized through specific descriptors with their own evaluation scales, obtained by processing the emerging information as identified by a combination of classic and innovative techniques. The results are validated through an investigation that surveys a relevant number of general managers, specialists, and consultants in the field of quality in the automotive sector from Romania. This practical demonstration follows each step of the theoretical approach and yields results that prove the capability of the method to contribute to the understanding and elucidation of the scientific area to which it is applied. Once validated, the method could be transferred to fields with similar characteristics. 
Even if their creators endowed them with a clear meaning at an incipient stage, when they become more popular in an emerging area, these concepts are quickly surrounded by a large amount of new knowledge that is developed with an amazing speed, enriching and enlarging their initial sphere. The "virtual Quality Management" (vQM) concept could be a significant example for the circumstances described previously. It is born through a semantic operation, joining two established and mature concepts: "virtual" and "QM", thus it is representative for an area which is in a period of high dynamic development and of interest for companies preoccupied with sustainability from the perspective of operations management and organizational culture. In this context in which the amount of information relating to new concepts quickly reaches unmanageable levels, regardless of the field, solutions that can analyze extended documentation with the purpose of disambiguating information and capturing the essentials, thus creating knowledge, become the focus of attention and gain in importance. Traditional solutions for that purpose lay in the literature review process, trying to collect, select, filter, and struc...

