Claudia Perlich scite author profile

Distribution-based aggregation for relational learning with identifier attributes

Identifier attributes (very high-dimensional categorical attributes such as particular product ids or people's names) are rarely incorporated in statistical modeling. However, they can play an important role in relational modeling: it may be informative to have communicated with a particular set of people or to have purchased a particular set of products. A key limitation of existing relational modeling techniques is how they aggregate bags (multisets) of values from related entities. The aggregations used by existing methods are simple summaries of the distributions of features of related entities: e.g., MEAN, MODE, SUM, or COUNT. This paper's main contribution is the introduction of aggregation operators that capture more information about the value distributions, by storing meta-data about value distributions and referencing this meta-data when aggregating, for example by computing class-conditional distributional distances. Such aggregations are particularly important for aggregating values from high-dimensional categorical attributes, for which the simple aggregates provide little information. In the first half of the paper we provide general guidelines for designing aggregation operators, introduce the new aggregators in the context of the relational learning system ACORA (Automated Construction of Relational Attributes), and provide theoretical justification. We also conjecture special properties of identifier attributes, e.g., they proxy for unobserved attributes and for information deeper in the relationship network. In the second half of the paper we provide extensive empirical evidence that the distribution-based aggregators indeed do facilitate modeling with high-dimensional categorical attributes, and in support of the aforementioned conjectures.
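
To make the idea of a class-conditional distributional distance concrete, here is a minimal Python sketch. It is not the ACORA implementation: the helper names, the toy product ids, and the choice of Euclidean distance are all assumptions made for illustration. Each bag of identifiers is turned into one numeric feature per class by comparing the bag's value distribution with a stored per-class reference distribution.

```python
from collections import Counter
import math

def normalize(counts):
    """Convert raw counts into a probability distribution (empty bag -> {})."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def class_reference_distributions(bags, labels):
    """Pool the values of all training bags per class (the stored 'meta-data')."""
    pooled = {}
    for bag, label in zip(bags, labels):
        pooled.setdefault(label, Counter()).update(bag)
    return {label: normalize(counts) for label, counts in pooled.items()}

def euclidean_distance(p, q):
    """Distance between two sparse distributions; missing keys count as 0."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def aggregate(bag, references):
    """Map one bag of identifiers to one distance feature per class."""
    dist = normalize(Counter(bag))
    return {label: euclidean_distance(dist, ref) for label, ref in references.items()}

# Hypothetical toy data: each bag is the multiset of product ids of one customer.
train_bags = [["p1", "p2", "p2"], ["p2", "p3"], ["p7", "p8"], ["p8", "p9", "p9"]]
train_labels = [1, 1, 0, 0]

references = class_reference_distributions(train_bags, train_labels)
features = aggregate(["p2", "p3", "p3"], references)
print(features)  # the bag lies closer to the class-1 reference than to class 0
```

The reference distributions here are built from training bags only; a simple COUNT or MODE over the same bag would say nothing about which class the purchased ids resemble.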

Leakage in data mining

Kaufman, Rosset, Perlich et al. 2012

ACM Trans. Knowl. Discov. Data

267

Deemed "one of the top ten data mining mistakes", leakage is the introduction of information about the data mining target that should not be legitimately available to mine from. In addition to our own industry experience with real-life projects, controversies around several major public data mining competitions held recently such as the INFORMS 2010 Data Mining Challenge and the IJCNN 2011 Social Network Challenge are evidence that this issue is as relevant today as it has ever been. While acknowledging the importance and prevalence of leakage in both synthetic competitions and real-life data mining projects, existing literature has largely left this idea unexplored. What little has been said turns out not to be broad enough to cover more complex cases of leakage, such as those where the classical independently and identically distributed (i.i.d.) assumption is violated, that have been recently documented. In our new approach, these cases and others are explained by explicitly defining modeling goals and analyzing the broader framework of the data mining problem. The resulting definition enables us to derive general methodology for dealing with the issue. We show that it is possible to avoid leakage with a simple specific approach to data management followed by what we call a learn-predict separation, and present several ways of detecting leakage when the modeler has no control over how the data have been collected. We also offer an alternative point of view on leakage that is based on causal graph modeling concepts.
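
As a rough illustration of the kind of separation the abstract describes, the Python sketch below fits a preprocessing step on a "learn" partition only and then applies it unchanged to a "predict" partition. This is a generic example, not the paper's procedure: the timestamp cutoff, the toy records, and the helper names are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Estimate standardization parameters from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid division by zero on constant data
    return mu, sigma

def apply_scaler(values, scaler):
    """Standardize values with previously fitted parameters."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

# Hypothetical records: (timestamp, value). Everything after the cutoff is
# treated as unavailable while learning.
records = [(1, 10.0), (2, 12.0), (3, 11.0), (4, 13.0), (5, 40.0), (6, 42.0)]
cutoff = 4
learn = [v for t, v in records if t <= cutoff]
predict = [v for t, v in records if t > cutoff]

# Separated: parameters come from the learn partition alone.
scaler = fit_scaler(learn)
learn_scaled = apply_scaler(learn, scaler)
predict_scaled = apply_scaler(predict, scaler)

# Leaky variant, shown only for contrast: the statistics see the predict data.
leaky_scaler = fit_scaler(learn + predict)
print(scaler, leaky_scaler)
```

The two fitted means differ because the leaky variant has already absorbed information from the records it is later asked to predict on.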

Causally motivated attribution for online advertising

D'Alessandro, Perlich, Stitelman et al. 2012

Machine learning for targeted display advertising: transfer learning in action

Perlich, D'Alessandro, Raeder et al. 2013

Mach Learn

137

This paper presents a detailed discussion of problem formulation and data representation issues in the design, deployment, and operation of a massive-scale machine learning system for targeted display advertising. Notably, the machine learning system itself is deployed and has been in continual use for years, for thousands of advertising campaigns (in contrast to simply having the models from the system be deployed). In this application, acquiring sufficient data for training from the ideal sampling distribution is prohibitively expensive. Instead, data are drawn from surrogate domains and learning tasks, and then transferred to the target task. We present the design of this multistage transfer learning system, highlighting the problem formulation aspects. We then present a detailed experimental evaluation, showing that the different transfer stages indeed each add value. We next present production results across a variety of advertising clients from a variety of industries, illustrating the performance of the system in use. We close the paper with a collection of lessons learned from the work over half a decade on this complex, deployed, and broadly used machine learning system.

Spatial-temporal causal modeling for climate change attribution

Lozano, Niculescu-Mizil et al. 2009

Attribution of climate change to causal factors has been based predominantly on simulations using physical climate models, which have inherent limitations in describing such a complex and chaotic system. We propose an alternative, data-centric approach that relies on actual measurements of climate observations and human and natural forcing factors. Specifically, we develop a novel method to infer causality from spatial-temporal data, as well as a procedure to incorporate extreme value modeling into our method in order to address the attribution of extreme climate events, such as heatwaves. Our experimental results on a real-world dataset indicate that changes in temperature are not solely accounted for by solar radiance, but are attributed more significantly to CO2 and other greenhouse gases. Combined with extreme value modeling, we also show that there has been a significant increase in the intensity of extreme temperatures, and that such changes in extreme temperature are also attributable to greenhouse gases. These preliminary results suggest that our approach can offer a useful alternative to the simulation-based approach to climate modeling and attribution, and provide valuable insights from a fresh perspective.

A market-based framework for bankruptcy prediction

Reisz, Perlich 2007

Journal of Financial Stability

Leave-One-Out Cross-Validation

Webb, Sammut, Perlich et al. 2011

104

Claudia Perlich

Bid optimizing and inventory scoring in targeted online advertising

Distribution-based aggregation for relational learning with identifier attributes

Leakage in data mining

Causally motivated attribution for online advertising

Machine learning for targeted display advertising: transfer learning in action

Spatial-temporal causal modeling for climate change attribution

A market-based framework for bankruptcy prediction

Leave-One-Out Cross-Validation
