Data motifs

Gao, Wanling; Zhan, Jianfeng; Wang, Lei; Luo, Chunjie; Zheng, Daoyi; Tang, Fei; Xie, Bin; Zheng, Chen; Xu, Wen; He, Xiwen; Ye, Hainan; Ren, Rui

doi:10.1145/3243176.3243190

Cited by 28 publications

(3 citation statements)

References 25 publications

(18 reference statements)


“…For handling the big data use cases, graph traversal, machine learning, text analytics, and statistics algorithms are commonly used [72,73]. In this work, only those applications have been tested to explore compiler search space for 3Vs (Volume, Velocity, Variety) which are part of standard big data benchmarks, representing graph mining, machine learning, and text search categories [72,73]. These applications have been selected through well‐known C/C++ based benchmark suites including Rodinia [27], Graphbig [28], Phoenix [29], CortexSuite [31], genann [32], and grep‐bench [30].…”

Section: Methods (mentioning)

confidence: 99%

“…Big data has been emerging in domains like social networks, search engines, ecommerce, multimedia processing, and bioinformatics. For handling the big data use cases, graph traversal, machine learning, text analytics, and statistics algorithms are commonly used [72,73]. In this work, only those applications have been tested to explore compiler search space for 3Vs (Volume, Velocity, Variety) which are part of standard big data benchmarks, representing graph mining, machine learning, and text search categories [72,73].…”

Section: Methods (mentioning)

confidence: 99%


Toward a novel engine for compiler optimization space exploration of big data workloads

Ahmed; Ismail. 2021. Softw Pract Exp.


Recently, big data specific technologies have been emerging, including domain-specific languages, software frameworks, databases, third-party libraries, and so forth. These techniques are successful in concealing the low-level details by producing high-level code, which is passed through the conventional compilation cycle for generating hardware operable code. Several optimization opportunities exist in the compiler which can assist in meeting the processing deadlines of big data workloads, through optimized machine code. However, the existing iterative compilation techniques are insufficient for the exploration of big data applications. In this regard, a novel engine has been presented for exploiting the compiler optimization space of big data workloads. The engine comprises training and testing phases. During the training stage, the big data application is optimized with Mitigates the Compiler Phase-ordering (MiCOMP) and genetic algorithm (GA) optimization sequences, which are executed with train datasets. In the testing stage, the test datasets are executed only for the best 300 optimization sequences discovered at the training stage. The proposed engine has been tested with graph mining, machine learning, and text search categories of big data applications using a wide range of real-world and synthetic datasets. Overall, the engine is 56.8×, 47×, and 9.8× faster than Iterative Optimization for the Data Center (IODC), MiCOMP, and GA, respectively, in exploiting the compiler search space for big data workloads. Further, the integration of best-10 and best-3 techniques with the engine brings a speedup of 5.9× and 7.8×, respectively. The compiler level exploitation of general-purpose machines incurs no extra overhead, no heavy computing, and no personnel cost. Also, the overall performance of big data specialized software solutions can be enhanced by compiling their high-level code with suitable compiler optimizations.


“…In the Semantic Web (SW), ontologies have traditionally been used to act as lenses over data eventually leading to an entire field of study: Ontology-Based Data Access (Calvanese et al, 2015). More recently, lenses have been used to create multiple views over chemistry data (Batchelor et al, 2014), help manage large data sets (Lenzerini, 2018), or understand big data and AI workloads (Gao et al, 2018).…”

Section: Related Work (mentioning)

confidence: 99%

In Media Res: A Corpus for Evaluating Named Entity Linking with Creative Works

Brașoveanu; Weichselbraun; Nixon. 2020. Proceedings of the 24th Conference on Computational Natural Language Learning.


Annotation styles express guidelines that direct human annotators by explicitly stating the rules to follow when creating gold standard annotations of text corpora. These guidelines not only shape the gold standards they help create, but also influence the training and evaluation of Named Entity Linking (NEL) tools, since different annotation styles correspond to divergent views on the entities present in a document. Such divergence is particularly relevant for texts from the media domain containing references to creative works. This paper presents a corpus of 1000 annotated documents from sources such as Wikipedia, TVTropes and WikiNews that are organized in ten partitions. Each document contains multiple gold standard annotations representing various annotation styles. The corpus is used to evaluate a series of Named Entity Linking tools in order to understand the impact of the differences in annotation styles on the reported accuracy when processing highly ambiguous entities such as names of creative works. Relaxed annotation guidelines that include overlap styles, for instance, lead to better results across all tools.
