Edge Hill Corpus Research Group


Language looks rather different when you look at a lot of it at once. (Sinclair 1991: 100)

Every corpus that I’ve had a chance to examine, however small, has taught me facts that I couldn’t imagine finding out about in any other way. (Fillmore, 1992: 35)

A single text on its own is quite insignificant: the effects of media power are cumulative, working through the repetition of particular ways of handling causality and agency, particular ways of positioning the reader. (Fairclough, 1989: 54)

The Edge Hill Corpus Research Group (EHU CRG) was established in October 2021 by Costas Gabrielatos. EHU CRG aims to become an interdisciplinary forum for academics and students who are using, or interested in using, corpora and corpus linguistic approaches. In particular, it aims to encourage and facilitate the use of corpus approaches at Edge Hill University.

As corpus linguistics is essentially a methodological approach which integrates quantitative and qualitative techniques, it can be employed by academics engaged in any type of research involving the analysis of naturally-occurring or elicited language (e.g. Education, History, Law, Linguistics, Literature, Politics, Psychology, Sociology). Also, its computational and quantitative aspects make corpus linguistics relevant to academics working in Computing and Statistics.

If you are interested in presenting at EHU CRG, please send an abstract to the coordinator, Costas Gabrielatos ([email protected]). EHU CRG will also feature invited presentations and demonstrations.

Meetings will have a variety of foci and formats, and will involve:

  • demonstrations of, and workshops on, corpora and corpus tools;
  • presentations reporting on completed or ongoing research projects (including PhD and MRes projects);
  • presentations and discussions on corpus linguistics theoretical constructs, analytical techniques, and metrics;
  • critical examination of published studies.

Meetings will take place online on Microsoft Teams. The online format will enable a much larger number of academics and students from around the world to attend and contribute. Meetings will last 1-2 hours, depending on their type, and will allow ample time for discussion.

Currently, the plan is to have two meetings per semester (on Wednesdays), but additional meetings can also be arranged.


EHU CRG Meeting, Friday 8 April 2022, 2-4 pm

Topics: Corpus tools (manual annotation) & Appraisal Theory

Encarnación Hidalgo-Tenorio (University of Granada)

Miguel-Ángel Benítez-Castro (Universidad de Zaragoza)

Workshop: Manual Annotation with UAM Corpus Tool, http://www.corpustool.com [e-copy]

Presentation: Analysing Extremism under the Lens of Appraisal Theory


Appraisal Theory seeks to understand how social relations are negotiated through alignment, as linguistically realised by the axes of ENGAGEMENT, GRADUATION and ATTITUDE (Martin & White 2005). Of the three subsystems, the last has attracted the most attention so far. ATTITUDE helps classify instances of emotion/al talk through the meaning domains of AFFECT, JUDGEMENT and APPRECIATION. As argued by White (2004) and Bednarek (2009), emotional talk may entail the more indirect expression of emotion by attending to ethical and aesthetic values. Given the omnipresence of affect in language (e.g. Ochs & Schieffelin 1989; Barrett 2017), there is growing consensus about treating AFFECT as a superordinate category, now taken to include the expression of EMOTION (emotional evaluation) and OPINION (ethical and aesthetic evaluation). As emotion permeates all levels of linguistic description (e.g. Alba-Juez & Thompson 2014; Alba-Juez & Mackenzie 2019), and all utterances are produced and interpreted through emotions (Klann-Delius 2015), AFFECT may be enriched through a more explicit focus on affective psychology, thereby proposing more sharply defined categories that may better describe any instance of emotive language (Thompson 2014). This paper shows how Benítez-Castro & Hidalgo-Tenorio’s (2019) more psychologically-driven Appraisal EMOTION sub-system can lead to a user-generated Appraisal scheme allowing a more fine-grained analysis of the complex interplay between (explicit and implicit) EMOTION and OPINION in discourse. To do so, we draw on examples and findings from two research strands we have covered so far: American right-wing populist discourse (Hidalgo-Tenorio & Benítez-Castro 2021b) and Jihadist propaganda (Benítez-Castro & Hidalgo-Tenorio Forthcoming).

Encarnación Hidalgo-Tenorio is Professor in English Linguistics at the University of Granada, Spain. Her main research area is corpus-based CDA, where she focuses on the notions of representation and power enactment in public discourse. She has published on language and gender, Irish studies, political communication, and has also paid attention to the analysis of the way identity is discursively constructed. She has tried to develop, or reconsider, some interesting aspects taken from SFL such as Transitivity, Modality, or Appraisal. Currently, she is working on the lexicogrammar of radicalization. Address for correspondence: Departamento de Filologías Inglesa y Alemana, Facultad de Filosofía y Letras, Campus de Cartuja s/n, 18071, University of Granada, Spain. <[email protected]>  

Miguel-Ángel Benítez-Castro is lecturer in English Language at the University of Zaragoza, Spain. His main research interest lies in SFL-inspired discourse analysis, based on corpus-driven methodologies, which he has managed to apply to his general focus on the interface between lexical choice, discourse structure, and evaluation. This is reflected in his previous and ongoing research on shell-noun phrases, on the evaluation of social minorities in public discourse and on the refinement of SFL’s linguistic theory of evaluation. Address for correspondence: Department of English and German Studies, Facultad de Ciencias Sociales y Humanas, Universidad de Zaragoza, Ciudad Escolar, s/n, 44003 Teruel. <[email protected]>

EHU CRG Meeting, Wednesday 9 February 2022

Topic: Corpus Tools

What’s new in CQPweb – 2022 edition

Andrew Hardie

(Lancaster University)


In this informal workshop / presentation, Andrew Hardie will give an overview of the new features in CQPweb version 3.3. These include, most notably, the option for users to upload their own corpora to the system, tagging the data using either CLAWS or TreeTagger – plus the new system that enables other users on the same server to share access to these uploaded corpora. Participants are welcome to try it out “in real time” during the session.

EHU CRG Meeting, Wednesday 15 December 2021

Topic: Corpus tools and automated annotation

Counting words or wording counts?

Paul Rayson

(Lancaster University)



A wide variety of tools and methods are available across a number of disciplines (e.g. Education, History, Linguistics, Literature, Psychology) for the analysis of text, and many of the techniques (e.g. content analysis, topic modelling, sentiment analysis) rely on counting words. However, words can take different meanings in different contexts, and around 16% of running text counts as semantically meaningful multiword expressions (where the meaning of the whole expression is different from the collection of individual words). In this talk, I will describe what can be achieved by combining methods from computer science (natural language processing) with linguistics (corpus linguistics) to address these issues. The talk will cover the basics of semantic annotation where words and multiword expressions are automatically labelled with coarse-grained word senses using the UCREL Semantic Analysis System (USAS). Then, via a demonstration of the web-based Wmatrix tool, I will show how counting USAS categories and comparing the frequency profiles with those from other texts can be used to quickly gist a text or corpus. Along the way, I will provide some pointers to example case studies in psychology, political discourse analysis, and beyond, describe current research and development on open source USAS multilingual taggers, and provide attendees with pointers for Wmatrix access and further tutorials to follow up later using their own corpora.
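The comparison of frequency profiles described above can be illustrated with a small sketch. It assumes log-likelihood as the comparison statistic, which the abstract does not specify, and the category labels and counts are hypothetical placeholders, not actual USAS tags or results.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood score for one category observed in two corpora.

    freq_a, freq_b: frequency of the category in corpus A and corpus B.
    size_a, size_b: total number of tagged items in each corpus.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    # A zero frequency contributes nothing to the sum (0 * log 0 is taken as 0).
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def rank_categories(profile_a, size_a, profile_b, size_b):
    """Return (category, score) pairs, largest frequency difference first."""
    categories = set(profile_a) | set(profile_b)
    scored = [
        (cat, log_likelihood(profile_a.get(cat, 0), size_a,
                             profile_b.get(cat, 0), size_b))
        for cat in categories
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical category counts for a study text and a reference corpus.
study_profile = {"emotion": 20, "money": 10}
reference_profile = {"emotion": 5, "money": 10}
ranked = rank_categories(study_profile, 1000, reference_profile, 1000)
```

The categories at the top of `ranked` are those whose relative frequency differs most between the two profiles, which is one way of getting a quick gist of what a text is about.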

Bio Note

I am a Professor in Computer Science at Lancaster University, UK and Director of the UCREL interdisciplinary research centre which carries out research in corpus linguistics and natural language processing (NLP). A long-term focus of my work is semantic multilingual NLP in extreme circumstances where language is noisy, e.g. in historical, learner, speech, email, txt and other CMC varieties. Along with domain experts, I have applied my research in the areas of dementia detection, mental health, online child protection, cyber security, learner dictionaries, and text mining of biomedical literature, historical corpora, and financial narratives. I was a co-investigator of the five-year ESRC Centre for Corpus Approaches to Social Science (CASS) which is designed to bring the corpus approach to bear on a range of social sciences. I’m also a member of the multidisciplinary Institute Security Lancaster, the Lancaster Digital Humanities Hub, and the Data Science Institute.

EHU CRG Meeting, Wednesday 10 November 2021

Corpus Compilation for Critical Discourse Studies

Constructing the Lone Wolf Corpus: Using polysemous query terms to compile a topic-specific corpus

Dan Malone (Edge Hill University)


This paper is concerned with the process of developing a query to compile a topic-specific corpus from a text database. For a corpus to be topic-specific, its texts must be relevant to the topic(s) it was compiled to investigate. However, polysemous query terms are more likely than monosemous query terms to retrieve nonrelevant texts and, therefore, to reduce query precision, that is, the proportion of retrieved texts that are relevant. More specifically, this paper suggests that the issue of polysemous query terms can be addressed through the implementation of a dual-group complex query (hereafter, DGQ).

The motivation for this paper arose while compiling the corpus for my PhD project ‘Constructing the Lone Wolf Terrorist: A corpus-driven study of the British press’. In its actor-based approach, corpus compilation was underpinned by an onomasiological perspective on the connection between lexical items and the concept of ‘the lone wolf terrorist’. According to Geeraerts (2010: 23), “onomasiology takes its starting point in a concept and investigates the different expressions via which the concept can be designated or named”. This is opposed to semasiology, which “takes its starting point in the word as a form and charts the meanings that the word can occur with” (ibid.). Indeed, the concept ‘lone wolf terrorist’ can be expressed via a number of polysemous lexical items, such as lone actor, lone attacker, and solo actor, with the specificity of their meaning being derived from context.

To compile the Lone Wolf Corpus (LWC), rather than employing a simple query-string, the DGQ was devised to mitigate the polysemy of its query terms. It comprises two distinct groups of terms, with each based around a core semantic component of ‘lone-wolf terrorist’; Group A terms represented lone-wolf actors or actions, whereas Group B represented terrorism. By linking terms within each group with the Boolean operator OR and by then linking the two groups using AND, the query retrieved texts containing at least one term from each group. By drawing on textual context in the form of collocation, the potential for multiplicity of meaning of the polysemous query terms is restricted, leading to a reduction in the number of nonrelevant texts being retrieved by the query.
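The dual-group structure described above can be sketched as follows. This is a minimal illustration only: the term lists are short and partly hypothetical (the actual LWC query is not reproduced here), and the matching is simple whole-phrase search, not the syntax of any particular text database.

```python
import re

# Illustrative terms only; "terrorist" in Group B is a hypothetical addition.
GROUP_A = ["lone wolf", "lone actor", "lone attacker", "solo actor"]
GROUP_B = ["terrorism", "terrorist"]


def build_query(group_a, group_b):
    """Return a Boolean query string: (A1 OR A2 ...) AND (B1 OR B2 ...)."""
    def or_group(terms):
        return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"
    return or_group(group_a) + " AND " + or_group(group_b)


def contains_any(text, terms):
    """True if the text contains at least one term as a whole word or phrase."""
    return any(
        re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        for term in terms
    )


def matches_dgq(text, group_a=GROUP_A, group_b=GROUP_B):
    """A text is retrieved only if it contains a term from each group."""
    return contains_any(text, group_a) and contains_any(text, group_b)
```

For example, a text mentioning a "lone actor" in a theatre review would match Group A but not Group B, and so would not be retrieved.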

This paper develops the query formulation technique outlined by Gabrielatos (2007). Central to Gabrielatos’s technique is the metric of relative query term relevance (RQTR), which establishes the degree of relevance of candidate query terms to the topic being investigated. The RQTR technique has been adopted in a number of studies, such as Prentice (2010), Steiger, Schünemann & Dimmroth (2017), and Kreischer (2019), as a means of both expanding queries and establishing the relevance of candidate terms. This paper expands the applicability of the RQTR method by illustrating how it can be applied to the DGQ and, therefore, cater for polysemous query terms.

From the initial core query terms lone wolf and terrorism, the LWC query was expanded to seventy query terms. Relative to a simple query, the DGQ improved query precision at minimal expense to recall: based on a systematic sample, precision rose from 0.46 to 0.89, at the cost of a 0.08 decrease in retrieved relevant texts.
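As a rough illustration of how such figures are computed, the sketch below uses the standard information-retrieval definitions: precision as the share of retrieved texts that are relevant, and recall as the share of relevant texts that are retrieved. The text IDs are hypothetical toy data and do not reproduce the study's sample or results.

```python
def precision(retrieved, relevant):
    """Share of retrieved texts that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Share of relevant texts that were retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical sample: IDs of texts judged relevant by a human reader,
# and the IDs returned by a simple query and by a dual-group query.
relevant_ids = {1, 2, 3, 4, 5}
simple_query_ids = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
dual_group_ids = {1, 2, 3, 4, 6}

simple_scores = (precision(simple_query_ids, relevant_ids),
                 recall(simple_query_ids, relevant_ids))
dual_scores = (precision(dual_group_ids, relevant_ids),
               recall(dual_group_ids, relevant_ids))
```

In this toy case the stricter query retrieves fewer nonrelevant texts (higher precision) but misses one relevant text (lower recall), which is the kind of trade-off the reported figures describe.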


  • Gabrielatos, C. (2007). Selecting query terms to build a specialised corpus from a restricted access database. ICAME Journal, 31, 5-44.
  • Geeraerts, D. (2010). Theories of Lexical Semantics. Oxford University Press.
  • Kreischer, K. S. (2019). The relation and function of discourses: a corpus-cognitive analysis of the Irish abortion debate. Corpora, 14(1), 105-130.
  • Prentice, S. (2010). Using automated semantic tagging in Critical Discourse Analysis: A case study on Scottish independence from a Scottish nationalist perspective. Discourse & Society, 21(4), 405-437.
  • Steiger, S., Schünemann, W. J., & Dimmroth, K. (2017). Outrage without consequences? Post-Snowden discourses and governmental practice in Germany. Media and Communication, 5(1), 7-16.

Building bilingual corpora for Critical Discourse Analysis: Mexican immigration to the US

Katia Adimora (Edge Hill University)


This talk will address the building of topic-specific corpora about Mexican immigration to the US during the Donald Trump era. The corpora contain American and Mexican newspaper articles that cover Mexican immigration (44,779 articles, 30 million words). The aim is to analyse how immigrants are represented in them.

 The newspapers included in the corpora are:

  • US newspapers: New York Times, Washington Post, USA Today, Los Angeles Times, The Arizona Republic and Chicago Tribune.    
  • Mexican newspapers: El Universal, Elimparcial.com, Reforma, El Norte, Lacronica.com and Mural.   

To gather the relevant articles, a three-part query was formulated by reading through various American and Mexican articles and identifying the words used to talk about immigrants or immigration. Queries needed to be constructed in both English and Spanish. The Spanish query terms are equivalents of the English ones, but not necessarily literal translations, as Mexican newspapers do not always use the counterpart of a specific English expression, or use different expressions to talk about immigration and immigrants.

Articles were transferred from the online database ProQuest (Global Newsstream) to the software tool Sketch Engine (https://www.sketchengine.eu/).

The American and Mexican corpora were divided into subcorpora in order to compare how newspapers in the American states with the highest numbers of Mexican immigrants represent them, relative to national newspapers. Similarly, Mexican subcorpora were formed to compare how newspapers from the regions of Mexico with high numbers of migrants moving to the US represent them, relative to national newspapers.

This division into subcorpora differs from those commonly applied to the British press, namely by format (broadsheet vs. tabloid) and by political leaning (left, right and centre). This is because it is difficult to draw such lines between American and, especially, Mexican newspapers.


Kilgarriff, Adam, Vít Baisa, Jan Bušta, Miloš Jakubíček, Vojtěch Kovář, Jan Michelfeit, Pavel Rychlý, Vít Suchomel (2014) The Sketch Engine: ten years on. Lexicography, 1: 7-36. 

