Characterizing the Impact of Geometric Properties of Word Embeddings on Task Performance

Analysis of word embedding properties to inform their use in downstream NLP tasks has largely been studied by assessing nearest neighbors. However, geometric properties of the continuous feature space contribute directly to the use of embedding features in downstream models, and are largely unexplored. We consider four properties of word embedding geometry, namely: position relative to the origin, distribution of features in the vector space, global pairwise distances, and local pairwise distances. We define a sequence of transformations to generate new embeddings that expose subsets of these properties to downstream models and evaluate change in task performance to understand the contribution of each property to NLP models. We transform publicly available pretrained embeddings from three popular toolkits (word2vec, GloVe, and FastText) and evaluate on a variety of intrinsic tasks, which model linguistic information in the vector space, and extrinsic tasks, which use vectors as input to machine learning models. We find that intrinsic evaluations are highly sensitive to absolute position, while extrinsic tasks rely primarily on local similarity. Our findings suggest that future embedding models and post-processing techniques should focus primarily on similarity to nearby points in vector space.


Introduction
Learned vector representations of words, known as word embeddings, have become ubiquitous throughout natural language processing (NLP) applications. As a result, analysis of embedding spaces to understand their utility as input features has emerged as an important avenue of inquiry, in order to facilitate proper use of embeddings in downstream NLP tasks. Many analyses have focused on nearest neighborhoods, as a viable proxy for semantic information (Rogers et al., 2018; Pierrejean and Tanguy, 2018). However, neighborhood-based analysis is limited by the unreliability of nearest neighborhoods (Wendlandt et al., 2018). Further, it is intended to evaluate the semantic content of embedding spaces, as opposed to characteristics of the feature space itself.
* These authors contributed equally to this work.
Geometric analysis offers another recent angle from which to understand the properties of word embeddings, both in terms of their distribution (Mimno and Thompson, 2017) and correlation with downstream performance (Chandrahas et al., 2018). Through such geometric investigations, neighborhood-based semantic characterizations are augmented with information about the continuous feature space of an embedding. Geometric features offer a more direct connection to the assumptions made by neural models about continuity in input spaces (Szegedy et al., 2014), as well as the use of recent contextualized representation methods using continuous language models (Peters et al., 2018; Devlin et al., 2018).
In this work, we aim to bridge the gap between neighborhood-based semantic analysis and geometric performance analysis. We consider four components of the geometry of word embeddings, and transform pretrained embeddings to expose only subsets of these components to downstream models. We transform three popular sets of embeddings, trained using word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and FastText (Bojanowski et al., 2017), and use the resulting embeddings in a battery of standard evaluations to measure changes in task performance.
We find that intrinsic evaluations, which model linguistic information directly in the vector space, are highly sensitive to absolute position in pretrained embeddings; while extrinsic tasks, in which word embeddings are passed as input features to a trained model, are more robust and rely primarily on information about local similarity between word vectors. Our findings, including evidence that global organization of word vectors is often a major source of noise, suggest that further development of embedding learning and tuning methods should focus explicitly on local similarity, and help to explain the success of several recent methods.

Related Work
Word embedding models and outputs have been analyzed from several angles. In terms of performance, evaluating the "quality" of word embedding models has long been a thorny problem. While intrinsic evaluations such as word similarity and analogy completion are intuitive and easy to compute, they are limited by both confounding geometric factors (Linzen, 2016) and task-specific factors (Faruqui et al., 2016; Rogers et al., 2017). Chiu et al. (2016) show that these tasks, while correlated with some semantic content, do not always predict downstream performance. Thus, it is necessary to use a more comprehensive set of intrinsic and extrinsic evaluations for embeddings.
Nearest neighbors in sets of embeddings are commonly used as a proxy for qualitative semantic information. However, their instability across embedding samples (Wendlandt et al., 2018) is a limiting factor, and they do not necessarily correlate with linguistic analyses (Hellrich and Hahn, 2016). Modeling neighborhoods as a graph structure offers an alternative analysis method (Cuba Gyllensten and Sahlgren, 2015), as does 2-D or 3-D visualization (Heimerl and Gleicher, 2018). However, both of these methods provide qualitative insights only. By systematically analyzing geometric information with a wide variety of evaluations, we provide a quantitative counterpart to these understandings of embedding spaces.

Methods
In order to investigate how different geometric properties of word embeddings contribute to model performance on intrinsic and extrinsic evaluations, we consider the following attributes of word embedding geometry:
• position relative to the origin;
• distribution of feature values in $\mathbb{R}^d$;
• global pairwise distances, i.e., distances between any pair of vectors;
• local pairwise distances, i.e., distances between nearby pairs of vectors.
Using each of our sets of pretrained word embeddings, we apply a variety of transformations to induce new embeddings that only expose subsets of these attributes to downstream models. These are: affine transformation, which obfuscates the original position of the origin; cosine distance encoding, which obfuscates the original distribution of feature values in $\mathbb{R}^d$; nearest neighbor encoding, which obfuscates global pairwise distances; and random encoding. This sequence is illustrated in Figure 1, and the individual transformations are discussed in the following subsections.
General notation for defining our transformations is as follows. Let $W$ be our vocabulary of words taken from some source corpus. We associate with each word $w \in W$ a vector $v \in \mathbb{R}^d$ resulting from training via one of our embedding generation algorithms, where $d$ is an arbitrary dimensionality for the embedding space. We define $V$ to be the set of all pretrained word vectors $v$ for a given corpus, embedding algorithm, and parameters. The matrix of embeddings $M_V$ associated with this set then has shape $|V| \times d$. For simplicity, we restrict our analysis to transformed embeddings of the same dimensionality $d$ as the original vectors.

Affine transformations
Affine transformations have been previously utilized for post-processing of word embeddings. For example, Artetxe et al. (2016) learn a matrix transform to align multilingual embedding spaces, and Faruqui et al. (2015) use a linear sparsification to better capture lexical semantics. In addition, the simplicity of affine functions in machine learning contexts (Hofmann et al., 2008) makes them a good starting point for our analysis.
Given a set of embeddings in $\mathbb{R}^d$, referred to as an embedding space, affine transformations change positions of points relative to the origin.
While prior work has typically focused on linear transformations, which fix the origin, we consider the broader class of affine transformations, which do not. Thus, affine transformations such as translation cannot in general be represented as a square matrix for finite-dimensional spaces.
We use the following affine transformations:
• translations;
• reflections over a hyperplane;
• rotations about a subspace;
• homotheties.
We give brief definitions of each transformation.
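Definition 1. For every $x \in \mathbb{R}^d$, we call the map $T_x : \mathbb{R}^d \to \mathbb{R}^d$ given by $T_x(v) = v + x$ the translation by $x$.
Definition 2. For every nonzero $a \in \mathbb{R}^d$, we call the map $\mathrm{Refl}_a : \mathbb{R}^d \to \mathbb{R}^d$ given by
$$\mathrm{Refl}_a(v) = v - 2\,\frac{v \cdot a}{a \cdot a}\,a$$
the reflection over the hyperplane through the origin orthogonal to $a$.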
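Definition 3. A rotation through the span of orthonormal vectors $u, x$ by angle $\theta$ is a map $\mathrm{Rot}_{u,x} : \mathbb{R}^d \to \mathbb{R}^d$ given by $\mathrm{Rot}_{u,x}(v) = Av$, where
$$A = I + \sin\theta\,(x u^T - u x^T) + (\cos\theta - 1)(u u^T + x x^T)$$
and $I \in \mathrm{Mat}_{d,d}(\mathbb{R})$ is the identity matrix.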
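Definition 4. For every $a \in \mathbb{R}^d$ and $\lambda \in \mathbb{R} \setminus \{0\}$, we call the map $H_{a,\lambda} : \mathbb{R}^d \to \mathbb{R}^d$ given by
$$H_{a,\lambda}(v) = a + \lambda (v - a)$$
a homothety of center $a$ and ratio $\lambda$. A homothety centered at the origin is called a dilation.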
Parameters used in our analysis for each of these transformations are provided in Appendix A.
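The four affine transformation families can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names are hypothetical, the formulas are the standard textbook forms of each map, embeddings are assumed to be stored as rows of a matrix, and the rotation helper orthonormalizes its two spanning vectors before use.

```python
import numpy as np

def translate(M, x):
    """Translation: v -> v + x, applied to every row of M."""
    return M + x

def reflect(M, a):
    """Reflection over the hyperplane through the origin orthogonal to a."""
    a = a / np.linalg.norm(a)
    return M - 2.0 * np.outer(M @ a, a)

def rotate(M, u, x, theta):
    """Rotation by theta in the plane spanned by u and x."""
    u = u / np.linalg.norm(u)
    x = x - (x @ u) * u            # Gram-Schmidt step: make x orthogonal to u
    x = x / np.linalg.norm(x)
    d = M.shape[1]
    A = (np.eye(d)
         + np.sin(theta) * (np.outer(x, u) - np.outer(u, x))
         + (np.cos(theta) - 1.0) * (np.outer(u, u) + np.outer(x, x)))
    return M @ A.T                 # rows are vectors, so v -> A v becomes M A^T

def homothety(M, a, lam):
    """Homothety of center a and ratio lam: v -> a + lam * (v - a)."""
    return a + lam * (M - a)

# Toy usage on random vectors standing in for pretrained embeddings.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 4))
shifted = translate(M, np.ones(4))
scaled = homothety(M, np.zeros(4), 2.0)   # center at the origin: a dilation
```

Reflections and rotations are linear and preserve vector norms, whereas translations and homotheties with a non-zero center move points relative to the origin.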

Cosine distance encoding (CDE)
Our cosine distance encoding transformation obfuscates the distribution of features in $\mathbb{R}^d$ by representing a set of word vectors as a pairwise distance matrix. Such a transformation might be used to avoid the non-interpretability of embedding features (Fyshe et al., 2015) and compare embeddings based on relative organization alone.
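The cosine distance between two vectors $u, v \in \mathbb{R}^d$ is
$$d_{\cos}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2},$$
where the second term is the cosine similarity.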
As all three sets of embeddings evaluated in this study have vocabulary size on the order of $10^6$, use of the full distance matrix is impractical. We use a subset consisting of the distance from each point to the embeddings of the 10K most frequent words from each embedding set, yielding a $|V| \times 10^4$ distance matrix. This is not dissimilar to the global frequency-based negative sampling approach of word2vec (Mikolov et al., 2013). We then use an autoencoder to map this back to $\mathbb{R}^d$ for comparability.
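An autoencoder over $\mathbb{R}^{|V|}$ maps an input $v$ to a hidden vector $h \in \mathbb{R}^d$ and then to a reconstruction $\hat{v}$ of the input. The vector $h$ is then used as the compressed representation of $v$.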
In our experiments, we use ReLU as our activation function $\phi$, and train the autoencoder for 50 epochs to minimize $L_2$ distance between $v$ and $\hat{v}$. We recognize that low-rank compression using an autoencoder is likely to be noisy, thus potentially inducing additional loss in evaluations. However, precedent for capturing geometric structure with autoencoders (Li et al., 2017b) suggests that this is a viable model for our analysis.
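The distance-encoding step, before autoencoder compression, can be sketched as below. This is an illustrative NumPy sketch only: the function name is hypothetical, the toy matrix stands in for pretrained embeddings, and rows are assumed to be sorted by word frequency so that the first rows serve as the frequent-word anchors. The autoencoder that maps the result back to $d$ dimensions is not shown.

```python
import numpy as np

def cosine_distance_encoding(M, anchor_idx):
    """Return the |V| x len(anchor_idx) matrix of cosine distances from
    every row of M to the selected anchor rows."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    unit = M / np.clip(norms, 1e-12, None)   # guard against zero vectors
    sims = unit @ unit[anchor_idx].T         # cosine similarities
    return 1.0 - sims                        # cosine distances

rng = np.random.default_rng(0)
M = rng.normal(size=(1000, 50))   # toy stand-in for a pretrained embedding matrix
anchors = np.arange(100)          # assumption: rows are sorted by frequency
D = cosine_distance_encoding(M, anchors)
```

Each row of `D` describes a word only through its distances to the anchor words, so the original feature values are no longer directly visible to a downstream model.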

Nearest neighbor encoding (NNE)
Our nearest neighbor encoding transformation discards the majority of the global pairwise distance information modeled in CDE, and retains only information about nearest neighborhoods.
The output of $f_{\mathrm{NNE}}(v)$ is a sparse vector.
This transformation relates to the common use of nearest neighborhoods as a proxy for semantic information (Wendlandt et al., 2018; Pierrejean and Tanguy, 2018). We take the previously proposed approach of combining the output of $f_{\mathrm{NNE}}(v)$ for each $v \in V$ to form a sparse adjacency matrix, which describes a directed nearest neighbor graph (Cuba Gyllensten and Sahlgren, 2015; Newman-Griffis and Fosler-Lussier, 2017), using three versions of $f_{\mathrm{NNE}}$ defined below.
Thresholded: The set of non-zero indices in $f_{\mathrm{NNE}}(v)$ corresponds to word vectors $\tilde{v}$ such that the cosine similarity of $v$ and $\tilde{v}$ is greater than or equal to an arbitrary threshold $t$. In order to ensure that every word has non-zero out degree in the graph, we also include the $k$ nearest neighbors by cosine similarity for every word vector. Non-zero values in $f_{\mathrm{NNE}}(v)$ are set to the cosine similarity of $v$ and the relevant neighbor vector.
Weighted: The set of non-zero indices in $f_{\mathrm{NNE}}(v)$ corresponds to only the set of $k$ nearest neighbors to $v$ by cosine similarity. Cosine similarity values are used for edge weights.
Unweighted: As in the previous case, only $k$ nearest neighbors are included in the adjacency matrix. All edges are weighted equally, regardless of cosine similarity.
We report results using k = 5 and t = 0.05; other settings are discussed in Appendix B.
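The three variants can be sketched with a single helper that builds the adjacency matrix of the directed nearest neighbor graph. This is a toy illustration under assumptions: the function name and signature are hypothetical, a dense matrix is used for readability (a vocabulary of realistic size would need a sparse representation and approximate search), and a word is assumed not to count as its own neighbor.

```python
import numpy as np

def nne_adjacency(M, k=5, t=None, weighted=True):
    """Adjacency matrix of a directed nearest neighbor graph over rows of M.

    t=None, weighted=True   -> weighted variant
    t=None, weighted=False  -> unweighted variant
    t=<float>               -> thresholded variant (k nearest neighbors are
                               always kept so every node has out-degree >= k)
    """
    unit = M / np.linalg.norm(M, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)        # assumption: exclude self-loops
    n = M.shape[0]
    keep = np.zeros((n, n), dtype=bool)
    # Indices of the k largest similarities in each row.
    top_k = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    keep[np.arange(n)[:, None], top_k] = True
    if t is not None:
        keep |= sims >= t                  # add every neighbor above threshold
    return np.where(keep, sims if weighted else 1.0, 0.0)

rng = np.random.default_rng(0)
M = rng.normal(size=(200, 20))             # toy stand-in for embeddings
A_unweighted = nne_adjacency(M, k=5, weighted=False)
A_thresholded = nne_adjacency(M, k=5, t=0.05)
```

Row sums of the unweighted matrix give each word's out degree, which is the quantity that grows for a small number of nodes once a threshold is added.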
Finally, much like the CDE method, we use a second mapping function $\psi : \mathbb{R}^{|V|} \to \mathbb{R}^d$ to transform the nearest neighbor graph back to $d$-dimensional vectors for evaluation. Following Newman-Griffis and Fosler-Lussier (2017), we use node2vec (Grover and Leskovec, 2016) with default parameters to learn this mapping. Like the autoencoder, this is a noisy map, but the intent of node2vec to capture patterns in local graph structure makes it a good fit for our analysis.

Random encoding
Finally, as a baseline, we use a random encoding that replaces each word vector with a randomly generated vector. While intrinsic evaluations rely only on input embeddings, and thus lose all source information in this case, extrinsic tasks learn a model to transform input features, making even randomly initialized vectors a common baseline (Lample et al., 2016; Kim, 2014). For fair comparison, we generate one set of random baselines for each embedding set and re-use these across all tasks.

Other transformations
Many other transformations of a word embedding space could be included in our analysis, such as arbitrary vector-valued polynomial functions, rational vector-valued functions, or common decomposition methods such as principal components analysis (PCA) or singular value decomposition (SVD). Additionally, though they cannot be effectively applied to the unordered set of word vectors in a raw embedding space, transformations for sequential data such as discrete Fourier transforms or discrete wavelet transforms could be used for word sequences in specific text corpora.
For this study, we limit our scope to the transformations listed above. These transformations align with prior work on analyzing and post-processing embeddings for specific tasks, and are highly interpretable with respect to the original embedding space. However, other complex transformations represent an intriguing area of future work.

Evaluation
In order to measure the contributions of each geometric aspect described in Section 3 to the utility of word embeddings as input features, we evaluate embeddings transformed using our sequence of operations on a battery of standard intrinsic evaluations, which model linguistic information directly in the vector space; and extrinsic evaluations, which use the embeddings as input to learned models for downstream applications.
Our intrinsic evaluations include:
We follow Rogers et al. (2018) in evaluating on a set of five extrinsic tasks:
• Relation classification: SemEval-2010 Task 8 (Hendrickx et al., 2010), using a CNN with word and distance embeddings (Zeng et al., 2014).
• Sentence-level sentiment polarity classification: MR movie reviews (Pang and Lee, 2005), with a simplified CNN model from Kim (2014).
• Subjectivity/objectivity classification: Rotten Tomato snippets (Pang and Lee, 2004), using a logistic regression over summed word embeddings (Li et al., 2017a).
• Natural language inference: SNLI (Bowman et al., 2015), using separate LSTMs for premise and hypothesis, combined with a feed-forward classifier.
Figure 2 presents the results of each intrinsic and extrinsic evaluation on the transformed versions of our three sets of word embeddings. The largest drops in performance across all three sets for intrinsic tasks occur when explicit embedding features are removed with the CDE transformation. While some cases of NNE-transformed embeddings recover a measure of this performance, they remain far under affine-transformed embeddings. Extrinsic tasks are similarly affected by the CDE transformation; however, NNE-transformed embeddings recover the majority of performance.

Analysis and Discussion
Comparing within the set of affine transformations, the innocuous effect of rotations, dilations, and reflections on both intrinsic and extrinsic tasks suggests that the models used are robust to simple linear transformations. Extrinsic evaluations are also relatively insensitive to translations, which can be modeled with bias terms, though the lack of learned models and reliance on cosine similarity for the intrinsic tasks makes them more sensitive to shifts relative to the origin. Interestingly, homothety, which effectively combines a translation and a dilation, leads to a noticeable drop in performance across all tasks. Intuitively, this result makes sense: by both shifting points relative to the origin and changing their distribution in the space, angular similarity values used for intrinsic tasks can be changed significantly, and the zero mean feature distribution preferred by neural models (Clevert et al., 2016) becomes harder to achieve. This suggests that methods for tuning embeddings should attempt to preserve the origin whenever possible.
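The sensitivity of cosine-based intrinsic measures to shifts relative to the origin can be checked numerically. The sketch below is an illustration on random vectors, not a reproduction of the reported experiments; the shift vector, center, and ratio are arbitrary values chosen for the example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
u, v = rng.normal(size=(2, 50))
Q, _ = np.linalg.qr(rng.normal(size=(50, 50)))   # random orthogonal matrix
shift = np.full(50, 5.0)                          # arbitrary translation vector
center = np.full(50, 5.0)                         # arbitrary homothety center

base = cosine(u, v)
rotated = cosine(Q @ u, Q @ v)                    # orthogonal map
dilated = cosine(3.0 * u, 3.0 * v)                # dilation about the origin
translated = cosine(u + shift, v + shift)         # translation
homothetic = cosine(center + 3.0 * (u - center),
                    center + 3.0 * (v - center))  # homothety, non-zero center

print(f"base={base:.3f} rotated={rotated:.3f} dilated={dilated:.3f}")
print(f"translated={translated:.3f} homothetic={homothetic:.3f}")
```

Orthogonal maps and dilations about the origin leave the cosine similarity unchanged, while the translation and the off-origin homothety both change it, since every vector acquires a large shared component.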
The large drops in performance observed when using the CDE transformation are likely related to the instability of nearest neighborhoods and the importance of locality in embedding learning (Wendlandt et al., 2018), although the effects of the autoencoder component also bear further investigation. By effectively increasing the size of the neighborhood considered, CDE introduces additional sources of semantic noise. The similar drops from thresholded-NNE transformations, by the same token, are likely related to observations of the relationship between the frequency ranks of a word and its nearest neighbors (Faruqui et al., 2016). With thresholded-NNE, we find that the words with highest out degree in the nearest neighbor graph are rare words (e.g., "Chanterelle" and "Courtier" in FastText, "Tiegel" and "demangler" in GloVe), which link to other rare words. Thus, node2vec's random walk method is more likely to traverse these dense subgraphs of rare words, adding noise to the output embeddings. (Due to their large vocabulary size, we were unable to run thresholded-NNE experiments with word2vec embeddings.)
Finally, we note that Melamud et al. (2016) showed significant variability in downstream task performance when using different embedding dimensionalities. While we fixed vector dimensionality for the purposes of this study, varying d in future work represents a valuable follow-up.
Our findings suggest that methods for training and tuning embeddings, especially for downstream tasks, should explicitly focus on local geometric structure in the vector space. One concrete example of this comes from Chen et al. (2018), who demonstrate empirical gains when changing the negative sampling approach of word2vec to choose negative samples that are currently near to the target word in vector space, instead of the original frequency-based sampling (which ignores geometric structure). Similarly, successful methods for tuning word embeddings for specific tasks have often focused on enforcing a specific neighborhood structure (Faruqui et al., 2015). We demonstrate that by doing so, they align qualitative semantic judgments with the primary geometric information that downstream models learn from.

Conclusion
Analysis of word embeddings has largely focused on qualitative characteristics such as nearest neighborhoods or relative distribution. In this work, we take a quantitative approach analyzing geometric attributes of embeddings in $\mathbb{R}^d$, in order to understand the impact of geometric properties on downstream task performance. We characterized word embedding geometry in terms of absolute position, vector features, global pairwise distances, and local pairwise distances, and generated new embedding matrices by removing these attributes from pretrained embeddings. By evaluating the performance of these transformed embeddings on a variety of intrinsic and extrinsic tasks, we find that while intrinsic evaluations are sensitive to absolute position, downstream models rely primarily on information about local similarity.
As embeddings are used for increasingly specialized applications, and as recent contextualized embedding methods such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) allow for dynamic generation of embeddings from specific contexts, our findings suggest that work on tuning and improving these embeddings should focus explicitly on local geometric structure in sampling and evaluation methods. The source code for our transformations and complete tables of our results are available online at https://github.com/OSU-slatelab/geometric-embedding-properties.

Appendix A Parameters
We give the following library of vectors in $\mathbb{R}^d$ used as parameter values:

Appendix B NNE settings
We experimented with k ∈ {5, 10, 15} for our weighted and unweighted NNE transformations. For thresholded NNE, in order to best evaluate the impact of thresholding over uniform k, we used the minimum k = 5 and experimented with t ∈ {0.01, 0.05, 0.075}; higher values of t increased graph size sufficiently to be impractical. We report using k = 5 for weighted and unweighted settings in our main results for fairer comparison with the thresholded setting. The effect of thresholding on nearest neighbor graphs was a strongly right-tailed increase in out degree for a small portion of nodes. Our reported value of t = 0.05 increased the out degree of 20,229 nodes for FastText (out of 1M total nodes), with the maximum increase being 819 ("Chanterelle"), and 1,354 nodes increasing out degree by only 1. For GloVe, 7,533 nodes increased in out degree (out of 2M total), with maximum increase 240 ("Tiegel"), and 372 nodes increasing out degree by only 1.