Abstract: A major breakthrough in the visualization of dissimilarities between pairs of objects was the formulation of the least-squares multidimensional scaling (MDS) model as defined by the Stress function. This function is quite flexible in that it allows possibly nonlinear transformations of the dissimilarities to be represented by distances between points in a low dimensional space. To obtain the visualization, the Stress function should be minimized over the coordinates of the points and over the transformatio…
“…Weights are useful when we have input data with missing values. Since there is no restriction on any distance X, we can define fixed values of w ij = 0 if δ ij is missing and w ij = 1 otherwise [20].…”
Accurate and efficient entity resolution (ER) has been a problem in data analysis and data mining projects for decades. In our work, we are interested in developing ER methods to handle big data. Good public datasets are restricted in this area and usually small in size. Simulation is one technique for generating datasets for testing. Existing simulation tools have problems of complexity, scalability and limitations of resampling. We address these problems by introducing a better way of simulating testing data for big data ER. Our proposed simulation model is simple, inexpensive and fast. We focus on avoiding the detail-level simulation of records using a simple vector representation. In this paper, we will discuss how to simulate simple vectors that approximate the properties of names (commonly used as identification keys).
“…LSMDS initially maps each item in the non-metric or metric-space to a 𝐾-dimensional point. Then minimises the discrepancy between the actual dissimilarities and the estimated distances in the 𝐾-dimensional space by optimisation [13]. This discrepancy is measured using raw stress (𝜎 𝑟𝑎𝑤 ) given by the relative error where 𝛿 𝑖 𝑗 is the dissimilarity between the two objects and 𝑑 𝑖 𝑗 is the Euclidean distance between their estimated points.…”
Section: Problem Formulation (mentioning, confidence: 99%)
“…Possible weights for each pair of points are denoted by 𝑤 𝑖 𝑗 . Weights are useful in handling missing values and the default values are 𝑤 𝑖 𝑗 = 0, if 𝛿 𝑖 𝑗 is missing and 𝑤 𝑖 𝑗 = 1, otherwise [13]. We do not apply weights in this work, hence, 𝑤 𝑖 𝑗 = 1 always.…”
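The weighting convention quoted above can be illustrated with a minimal sketch. It assumes the standard least-squares form of raw stress, the sum over pairs i < j of w_ij (δ_ij − d_ij)², because the quoted excerpts elide the formula itself. The function name `weighted_raw_stress` and the use of `None` to mark a missing dissimilarity are illustrative choices, not taken from the cited papers.

```python
import math

def weighted_raw_stress(delta, points):
    """Weighted raw stress: sum over i < j of w_ij * (delta_ij - d_ij)^2.

    delta  : square list of dissimilarities; None marks a missing entry
             (an assumption made for this sketch).
    points : list of coordinate tuples, i.e. the K-dimensional configuration.
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # Convention from the quoted text: w_ij = 0 if delta_ij is missing, 1 otherwise.
            w = 0.0 if delta[i][j] is None else 1.0
            if w == 0.0:
                continue  # a missing pair contributes nothing to the stress
            d = math.dist(points[i], points[j])  # Euclidean distance d_ij
            total += w * (delta[i][j] - d) ** 2
    return total

# Three points in the plane; the dissimilarity for the pair (1, 2) is missing.
points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0)]
delta = [
    [0.0, 5.0, 2.0],
    [5.0, 0.0, None],
    [2.0, None, 0.0],
]
stress = weighted_raw_stress(delta, points)
```

Setting every weight to 1, as the citing paper states it does, reduces this to the unweighted sum of squared residuals over all pairs.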
Accurate and efficient entity resolution (ER) is a significant challenge in many data mining and analysis projects requiring integrating and processing massive data collections. It is becoming increasingly important in real-world applications to develop ER solutions that produce prompt responses for entity queries on large-scale databases. Some of these applications demand entity query matching against large-scale reference databases within a short time. We define this as the query matching problem in ER in this work. Indexing or blocking techniques reduce the search space and execution time in the ER process. However, approximate indexing techniques that scale to very large-scale datasets remain open to research. In this paper, we investigate the query matching problem in ER to propose an indexing method suitable for approximate and efficient query matching. We first use spatial mappings to embed records in a multidimensional Euclidean space that preserves the domain-specific similarity. Among the various mapping techniques, we choose multidimensional scaling. Then, using a Kd-tree and the nearest neighbour search, the method returns a block of records that includes potential matches for a query. Our method can process queries against a large-scale dataset using only a fraction of the data 𝐿 (given the dataset size is 𝑁), with 𝑂(𝐿²) complexity, where 𝐿 ≪ 𝑁. The experiments conducted on several datasets showed the effectiveness of the proposed method.
“…The non-negative weights w i, j in (39) were originally included and suggested by De Leeuw to provide more flexibility. They can be used to express the importance of the residualŝ (X) or can be used to handle missing data (Groenen and van de Velden 2016). For multidimensional unfolding, the configuration matrix X can be decomposed in two matrices X 1 and X 2 , which are of dimensionality n 1 × p and n 2 × p, respectively.…”
Label ranking is a specific type of preference learning problem, namely the problem of learning a model that maps instances to rankings over a finite set of predefined alternatives. Like in conventional classification, these alternatives are identified by their name or label while not being characterized in terms of any properties or features that could be potentially useful for learning. In this paper, we consider a generalization of the label ranking problem that we call dyad ranking. In dyad ranking, not only the instances but also the alternatives are represented in terms of attributes. For learning in the setting of dyad ranking, we propose an extension of an existing label ranking method based on the Plackett-Luce model, a statistical model for rank data. This model is combined with a suitable feature representation of dyads. Concretely, we propose a method based on a bilinear extension, where the representation is given in terms of a Kronecker product, as well as a method based on neural networks, which allows for learning a (highly nonlinear) joint feature representation. The usefulness of the additional information provided by the feature description of alternatives is shown in several experimental studies. Finally, we propose a method for the visualization of dyad rankings, which is based on the technique of multidimensional unfolding.