Mehl, S. orcid.org/0000-0003-3036-8132 (2019) Measuring lexical co-occurrence statistics against a part-of-speech baseline. In: Parviainen, H., Kaunisto, M. and Pahta, P. (eds.) Corpus Approaches into World Englishes and Language Contrasts. Studies in Variation, Contacts and Change in English, 20. VARIENG, University of Helsinki, Helsinki
Abstract
Analysing strength of lexical co-occurrence using Mutual Information (MI) and Pearson’s chi-square test is standard in corpus linguistics; typically, such analyses are conducted using a statistical baseline of all tokens in the data set (cf. Manning & Schuetze 1999). That is, the probability of a given type or lemma is measured as the number of occurrences of that type or lemma against the total number of tokens in the data. This baseline, however, is not ideal as a measure of linguistic probability: the denominator representing all tokens is artificially high because not every token represents an opportunity for the given lemma to occur (cf. Wallis 2012). This high denominator in turn results in an artificially low probability and suggests an artificially high degree of confidence in the measurement. This paper reports an experiment in employing a grammatical part-of-speech (POS) baseline for calculating statistical probability of co-occurrence, asking: In what ways does a POS-baseline differ from a traditional baseline of all tokens, when calculating chi-square and MI? The experiment is conducted in the context of a major research project studying meaning through lexical co-occurrence in Early Modern English texts, and the data is drawn from Early English Books Online (Text Creation Partnership edition). I demonstrate that the traditional baseline of all tokens yields higher MI values and more ‘significant’ results than a POS-baseline. I argue that the traditional baseline of all tokens can be interpreted as yielding both artificially high MI values and an artificially high number of significant results, but I also illustrate that the improvements of the POS-baseline may be negligible for the typical task of ranking the top ten co-occurrence pairs for a given node word.
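The effect of the baseline on both statistics can be illustrated with a minimal Python sketch. It is not the paper's procedure. It assumes, for simplicity, that the node and the collocate share one part of speech, so a single baseline figure `N` serves as the denominator for every probability. The function names and all counts are hypothetical and chosen only for illustration.

```python
import math


def pointwise_mi(pair_count, node_count, collocate_count, baseline):
    """MI = log2(P(node, collocate) / (P(node) * P(collocate))).

    Every probability is a raw count divided by the same baseline.
    """
    p_pair = pair_count / baseline
    p_node = node_count / baseline
    p_collocate = collocate_count / baseline
    return math.log2(p_pair / (p_node * p_collocate))


def chi_square_2x2(pair_count, node_count, collocate_count, baseline):
    """Pearson's chi-square for the 2x2 table of node vs. collocate."""
    a = pair_count                                    # node and collocate
    b = node_count - pair_count                       # node without collocate
    c = collocate_count - pair_count                  # collocate without node
    d = baseline - node_count - collocate_count + pair_count  # neither
    numerator = baseline * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Hypothetical counts, invented for illustration only.
PAIR, NODE, COLLOCATE = 30, 1_000, 500
ALL_TOKENS = 1_000_000   # traditional baseline: every token in the data
POS_TOKENS = 200_000     # assumed POS-baseline: tokens of the relevant POS

mi_all = pointwise_mi(PAIR, NODE, COLLOCATE, ALL_TOKENS)
mi_pos = pointwise_mi(PAIR, NODE, COLLOCATE, POS_TOKENS)
chi_all = chi_square_2x2(PAIR, NODE, COLLOCATE, ALL_TOKENS)
chi_pos = chi_square_2x2(PAIR, NODE, COLLOCATE, POS_TOKENS)

print(f"MI         all tokens: {mi_all:.3f}   POS-baseline: {mi_pos:.3f}")
print(f"chi-square all tokens: {chi_all:.1f}   POS-baseline: {chi_pos:.1f}")
```

Because the baseline enters the MI ratio as a multiplier, shrinking it by a factor of five lowers MI by log2(5), about 2.32 bits, for every pair. Under this single-baseline assumption the ranking of collocates for a given node is therefore unchanged, which fits the abstract's remark that the gain may be negligible for top-ten rankings.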
Metadata
Item Type: | Book Section |
---|---|
Authors/Creators: | Mehl, S. |
Editors: | Parviainen, H., Kaunisto, M. and Pahta, P. |
Copyright, Publisher and Additional Information: | © 2019 The Author. |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Arts and Humanities (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 13 Mar 2020 15:26 |
Last Modified: | 13 Mar 2020 15:26 |
Published Version: | http://www.helsinki.fi/varieng/series/volumes/20/i... |
Status: | Published |
Publisher: | VARIENG, University of Helsinki |
Series Name: | Studies in Variation, Contacts and Change in English |
Refereed: | Yes |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:158350 |