Ben Blamey scite author profile

Crick

Oatley

2013

Detecting and understanding temporal expressions are key tasks in natural language processing (NLP), and are important for event detection and information retrieval. In the existing approaches, temporal semantics are typically represented as discrete ranges or specific dates, and the task is restricted to text that conforms to this representation. We propose an alternate paradigm: that of distributed temporal semantics, where a probability density function models relative probabilities of the various interpretations. We extend SUTime, a state-of-the-art NLP system, to incorporate our approach, and build definitions of new and existing temporal expressions. A worked example is used to demonstrate our approach: the estimation of the creation time of photos in online social networks (OSNs), with a brief discussion of how the proposed paradigm relates to the point- and interval-based systems of time. An interactive demonstration, along with source code and datasets, is available online.
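The idea of a density over interpretations can be illustrated with a minimal sketch. It assumes a single expression (for example, a holiday) is modelled as a Gaussian over day-of-year, discretised to one probability per day and wrapped around the year boundary. The function name and parameters are hypothetical and are not taken from SUTime or from the paper's implementation.

```python
import math

def gaussian_day_density(mean_day, std_days, days_in_year=365):
    """Return a list of per-day probabilities (index 0 is day 1).

    Assumption: the expression is modelled as a Gaussian centred on
    mean_day, with circular distance so that late December and early
    January count as close together.
    """
    weights = []
    for day in range(1, days_in_year + 1):
        distance = abs(day - mean_day)
        distance = min(distance, days_in_year - distance)  # wrap around the year
        weights.append(math.exp(-0.5 * (distance / std_days) ** 2))
    total = sum(weights)
    # Normalise so the per-day probabilities sum to 1
    return [w / total for w in weights]

# Hypothetical usage: an expression centred on day 359 with a 3-day spread
density = gaussian_day_density(mean_day=359, std_days=3)
most_likely_day = density.index(max(density)) + 1
```

Unlike a fixed date or interval, every day receives some probability, so a downstream task can weigh the competing interpretations rather than commit to one.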

R U :-) or :-( ? Character- vs. Word-Gram Feature Selection for Sentiment Classification of OSN Corpora

Crick

Oatley

2012

Binary sentiment classification, or sentiment analysis, is the task of computing the sentiment of a document, i.e. whether it contains broadly positive or negative opinions. The topic is well-studied, and the intuitive approach of using words as classification features is the basis of most techniques documented in the literature. The alternative character n-gram language model has been applied successfully to a range of NLP tasks, but its effectiveness at sentiment classification seems to be under-investigated, and results are mixed. We present an investigation of the application of the character n-gram model to text classification of corpora from online social networks, the first such documented study, where text is known to be rich in so-called unnatural language, also introducing a novel corpus of Facebook photo comments. Despite hoping that the flexibility of the character n-gram approach would be well-suited to unnatural language phenomena, we find little improvement over the baseline algorithms employing the word n-gram language model.
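The difference between the two feature types can be shown with a short sketch. It assumes lowercasing and whitespace tokenisation, which are illustrative choices and not necessarily the preprocessing used in the study; the function names are hypothetical.

```python
from collections import Counter

def word_ngrams(text, n):
    """Count word n-grams, using whitespace tokenisation (an assumption)."""
    tokens = text.lower().split()
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def char_ngrams(text, n):
    """Count character n-grams over the raw lowercased string."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

comment = "gr8 pic :-)"
word_features = word_ngrams(comment, 1)   # features: "gr8", "pic", ":-)"
char_features = char_ngrams(comment, 3)   # includes "gr8", ":-)", "pic" and overlaps such as "r8 "
```

Character n-grams need no tokeniser and can capture emoticons or creative spellings as fragments, which is the flexibility the abstract refers to.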

Differentiated Assessments for Advanced Courses that Reveal Issues with Prerequisite Skills

Nelson

Strömbäck

Korhonen

et al. 2020

Computing learners may not master basic concepts, or forget them between courses or from infrequent use. Learners also often struggle with advanced computing courses, perhaps from weakness with prerequisite concepts. One underlying challenge for researchers and instructors is determining the reason why a learner gets an advanced question wrong. Was the wrong answer because the learner lacked prerequisite skills, has not mastered the advanced skill, or some combination of the two? We contribute a design investigation into how to create differentiated questions which diagnose prerequisite and advanced skills at the same time. We focused on tracing and related skills as prerequisites, and on advanced object-oriented programming, concurrency, algorithms and data structures as the advanced skills. We conducted an inductive qualitative analysis of existing assessment questions from instructors and from a concept inventory with a validity argument (the Basic Data Structures Inventory). We found dependencies on a variety of prerequisite knowledge and mixed potential for diagnosing difficulties with prerequisites. Inspired by this analysis, we developed examples

HarmonicIO: Scalable Data Stream Processing for Scientific Datasets

Torruangwatthana

Wieslander

et al. 2018

Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing

Hellander

Toor

2020

Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing

Blamey

Hellander

Toor

2018

Preprint

This paper presents a benchmark of stream processing throughput comparing Apache Spark Streaming (under file-, TCP socket- and Kafka-based stream integration), with a prototype P2P stream processing framework, HarmonicIO. Maximum throughput for a spectrum of stream processing loads is measured, specifically, those with large message sizes (up to 10MB), and heavy CPU loads, more typical of scientific computing use cases (such as microscopy) than enterprise contexts. A detailed exploration of the performance characteristics with these streaming sources, under varying loads, reveals an interplay of performance trade-offs, uncovering the boundaries of good performance for each framework and streaming source integration. We compare with theoretic bounds in each case. Based on these results, we suggest which frameworks and streaming sources are likely to offer good performance for a given load. Broadly, the advantages of Spark's rich feature set come at a cost of sensitivity to message size in particular: common stream source integrations can perform poorly in the 1MB-10MB range. The simplicity of HarmonicIO offers more robust performance in this region, especially for raw CPU utilization.

Adapting the Secretary Hiring Problem for Optimal Hot-Cold Tier Placement Under Top-K Workloads

Wrede

Karlsson

et al. 2019

Top-K queries are an established heuristic in information retrieval. This paper presents an approach for optimal tiered storage allocation under stream processing workloads using this heuristic: those requiring the analysis of only the top-K ranked most relevant, or most interesting, documents from a fixed-length stream, stream window, or batch job. In this workflow, documents are analyzed for relevance with a user-specified interestingness function, on which they are ranked, the top-K being selected (and hence stored) for further processing. This workflow allows human-in-the-loop systems, including supervised machine learning, to prioritize documents. This scenario bears similarity to the classic Secretary Hiring Problem (SHP), and the expected rate of document writes, and document lifetime, can be modelled as a function of document index. We present parameter-based algorithms for storage tier placement, minimizing document storage and transport costs. We show that optimal parameter values are a function of these costs. It is possible to model application IO characteristics analytically for this class of workloads. When combined with tiered storage, the tractability of the probabilistic model of IO makes it possible to optimize (and budget for) storage tier allocation a priori, without needing to monitor the application. This contrasts with (often complex) existing work on tiered storage optimization, which is either tightly coupled to specific use cases, or requires active monitoring of application IO load (a reactive approach), ill-suited to long-running or one-off operations common in the scientific computing domain. We evaluate our model with a trace-driven simulation of a bio-chemical model exploration, and give two cloud storage case studies.
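Why the write rate depends on document index can be seen from a standard result, sketched here under the assumption that interestingness scores arrive in uniformly random order with no ties: the i-th document (1-indexed) ranks among the top K seen so far with probability min(1, K/i). This is an illustration of the kind of analytic model the abstract describes, not the paper's own formulation, and the function names are hypothetical.

```python
import heapq
import random

def write_probability(i, k):
    """Probability that the i-th document enters the running top-k
    (assumes randomly ordered, tie-free scores)."""
    return min(1.0, k / i)

def expected_writes(n, k):
    """Expected number of documents written for a stream of length n."""
    return sum(write_probability(i, k) for i in range(1, n + 1))

def simulate_writes(n, k, rng):
    """Count the writes that actually occur for one random stream."""
    heap = []  # min-heap holding the k best scores seen so far
    writes = 0
    for _ in range(n):
        score = rng.random()
        if len(heap) < k:
            heapq.heappush(heap, score)
            writes += 1
        elif score > heap[0]:
            # New document displaces the weakest of the current top-k
            heapq.heapreplace(heap, score)
            writes += 1
    return writes
```

Because the write probability falls as 1/i, most writes happen early in the stream, which is what makes it possible to plan tier placement before the workload runs.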

Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit

Toor

Dahlö

et al. 2021

Background: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered “data hierarchy”. We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources. Findings: In our pipeline model, an “interestingness function” assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a “policy” guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools to adopt this approach. We evaluate with 2 microscopy imaging case studies. The first is a high content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope. Conclusions: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between scientific concerns of data priority, and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be “bolted on” to new and existing systems, and is intended for use with a range of technologies in different deployment scenarios.
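The separation between an interestingness function and a policy can be sketched as follows. This is a minimal illustration with hypothetical names and tier labels; it does not use or reproduce the HASTE Toolkit's actual API.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

def make_threshold_policy(thresholds: List[Tuple[float, str]],
                          default: str) -> Callable[[float], str]:
    """Build a policy mapping a score to a tier name.

    thresholds: (minimum_score, tier) pairs; the highest matching
    minimum wins. The tier names are illustrative assumptions.
    """
    ordered = sorted(thresholds, reverse=True)

    def policy(score: float) -> str:
        for minimum, tier in ordered:
            if score >= minimum:
                return tier
        return default

    return policy

def prioritize(stream: Iterable, interestingness: Callable[[object], float],
               policy: Callable[[float], str]) -> Iterator[Tuple[object, float, str]]:
    """Score each object, then let the policy decide where it goes."""
    for obj in stream:
        score = interestingness(obj)
        yield obj, score, policy(score)

# Hypothetical usage: score strings by length and assign storage tiers
policy = make_threshold_policy([(0.8, "fast"), (0.3, "standard")], default="archive")
decisions = list(prioritize(["a", "abcd"], lambda s: len(s) / 4, policy))
```

Swapping the scoring function changes the scientific notion of priority, while swapping the policy changes how a given deployment acts on it, and neither needs to know about the other.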