Measuring lexical co-occurrence statistics against a part-of-speech baseline

Abstract

Analysing the strength of lexical co-occurrence using Mutual Information (MI) and Pearson’s chi-square test is standard in corpus linguistics; typically, such analyses are conducted using a statistical baseline of all tokens in the data set (cf. Manning & Schuetze 1999). That is, the probability of a given type or lemma is measured as the number of occurrences of that type or lemma against the total number of tokens in the data. This baseline, however, is not ideal as a measure of linguistic probability: the denominator representing all tokens is artificially high because not every token represents an opportunity for the given lemma to occur (cf. Wallis 2012). This high denominator in turn results in an artificially low probability and suggests an artificially high degree of confidence in the measurement. This paper reports an experiment in employing a grammatical part-of-speech (POS) baseline for calculating the statistical probability of co-occurrence, asking: in what ways does a POS-baseline differ from a traditional baseline of all tokens when calculating chi-square and MI? The experiment is conducted in the context of a major research project studying meaning through lexical co-occurrence in Early Modern English texts, and the data is drawn from Early English Books Online (Text Creation Partnership edition). I demonstrate that the traditional baseline of all tokens yields higher MI values and more ‘significant’ results than a POS-baseline. I argue that the traditional baseline can be interpreted as yielding both artificially high MI values and an artificially high number of significant results; but I also illustrate that the improvements of the POS-baseline may be negligible for the typical task of ranking the top ten co-occurrence pairs for a given node word.
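The effect of the denominator can be illustrated with a minimal sketch. It is not the procedure used in the paper: all counts below are hypothetical, the function names are invented for illustration, and it assumes that switching baselines changes only the total against which probabilities are measured (all tokens versus tokens of the relevant part of speech), leaving the node, collocate and pair counts unchanged. MI is computed here as pointwise MI in log base 2, and chi-square as the uncorrected Pearson statistic on a 2x2 table.

```python
import math


def pointwise_mi(pair_count, node_count, collocate_count, baseline):
    """Pointwise MI (log base 2): observed vs. expected co-occurrence."""
    # Expected co-occurrences if node and collocate were independent.
    expected = node_count * collocate_count / baseline
    return math.log2(pair_count / expected)


def pearson_chi_square(pair_count, node_count, collocate_count, baseline):
    """Pearson chi-square for a 2x2 table, without continuity correction."""
    a = pair_count                      # node and collocate together
    b = node_count - pair_count         # node without collocate
    c = collocate_count - pair_count    # collocate without node
    d = baseline - node_count - collocate_count + pair_count  # neither
    numerator = baseline * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Hypothetical counts, chosen only for illustration (not from EEBO-TCP).
PAIR, NODE, COLLOCATE = 30, 200, 500
ALL_TOKENS = 1_000_000   # traditional baseline: every token in the data
POS_TOKENS = 250_000     # assumed POS baseline: tokens of the relevant POS

mi_all = pointwise_mi(PAIR, NODE, COLLOCATE, ALL_TOKENS)
mi_pos = pointwise_mi(PAIR, NODE, COLLOCATE, POS_TOKENS)
chi_all = pearson_chi_square(PAIR, NODE, COLLOCATE, ALL_TOKENS)
chi_pos = pearson_chi_square(PAIR, NODE, COLLOCATE, POS_TOKENS)

print(f"MI   all tokens: {mi_all:.3f}   POS baseline: {mi_pos:.3f}")
print(f"chi2 all tokens: {chi_all:.1f}   POS baseline: {chi_pos:.1f}")
```

Under these assumptions the two MI values differ by the base-2 logarithm of the ratio between the two baselines, whatever the pair is. Every pair measured against the same two baselines shifts by the same amount, which is consistent with the observation that a ranking of the top co-occurrence pairs for a node word may change little.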

Metadata

Item Type:	Book Section
Authors/Creators:	Mehl, S. https://orcid.org/0000-0003-3036-8132
Editors:	Parviainen, H., Kaunisto, M. and Pahta, P.
Copyright, Publisher and Additional Information:	© 2019 The Author.
Dates:	Published (online): 12 December 2019; Published: 12 December 2019
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Arts and Humanities (Sheffield)
Date Deposited:	13 Mar 2020 15:26
Last Modified:	13 Mar 2020 15:26
Published Version:	http://www.helsinki.fi/varieng/series/volumes/20/i...
Status:	Published
Publisher:	VARIENG, University of Helsinki
Series Name:	Studies in Variation, Contacts and Change in English
Refereed:	Yes
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:158350
