Methodological approaches used in the course

Introducing corpus software and data as a resource for academic writing

James Thomas

This article describes one part of the language development work as it was undertaken within the IMPACT project. Some of it was experimental inasmuch as aspects of the approach and some of the activities had not been previously used with students who were not majoring in languages. The amount of time available within the teaching semester was quite limited as the multi-faceted nature of the course gave the students a wide variety of language experiences. The work described in this article represents only one aspect of the course.

The majority of scientists the world over are required to describe their work in English, which many find demanding, frustrating, time-consuming and expensive – others resent the fact completely. There are those, on the other hand, for whom writing formal papers in English is no more problematic than it is for native speakers – remember that few native speakers of any language are taught to write academic prose.

The language of academic prose involves all aspects of the so-called "Hierarchy of Language", namely morphology, words, phrases, clauses, sentences and text.


To varying extents, the language of each of these levels is influenced by factors such as the text's tenor, field and mode (from Halliday's Systemic Functional Grammar, 1985).

  • Tenor refers to the participants in a discourse, their relationships to each other, and their purposes.
  • Field refers to the subject matter or content being discussed.
  • Mode refers to the channel (e.g. writing, video-conference) of the communication.

These contextual features can be observed in science papers when authors take into account the assumed knowledge of their readers, and whether the text is to sound encyclopaedic or like a discussion paper, how much detail to include in the required length, the differences between conference proposals and book chapters, and what their paper is announcing to the world. Furthermore, there are stylistic requirements and conventions of editors and publishers which must be obeyed.

Taken together, this means that authors have to make many choices at every stage and level of the writing process. Scientists who have a sound knowledge of general English across the hierarchy of language are well placed to make genre choices that meet the requirements of science writing. To teach such scientists to write academic prose in English, the first step is to raise their awareness of the scientific linguistic menus that they can choose from. Those who read widely in their field in English typically develop a reliable intuition about academic prose. This is a good starting point.

Making choices requires criteria. For example, how is the decision made to use sped up instead of speeded up? Or whether to use mouses instead of mice for the computer peripheral? These are basic morphological choices. Do sentences begin First…, Second…, or Firstly…, Secondly…? This is a matter of stylistic convention. In the following extract, the author chose the past tense for a 1957 event and the present for 1959, even though the text was published in 1986, when the time distance did not warrant this contrastive use of tense.

Skinner (1957) argued that language was learned through a process of stimulus-response, with large amounts of controlled repetition. Chomsky (1959) argues that language could never be learned in this way, and that we are all endowed at birth with a language acquisition device which provides essential assistance in the learning process. (Riddle 1986)

When scientists write for the general interested public, the text says that something is the case, but when writing for other specialists, something seems to be the case. Linguists refer to this sort of linguistic caution as "hedging", which can be expressed using certain words and phrases, by using modal verbs and adverbs and by other grammatical resources. Authors need criteria for choosing from this menu.

It is therefore necessary to provide science writers with criteria. Language is a multi-faceted phenomenon and its facets are often interdependent. They require considerable deconstructing to reveal the discrete units that are employed to meet the requirements of the genre. This is what teachers and textbooks aim to do. But the richness of language comes at a cost. No teacher, no textbook and no course can cater for every writer's needs in every situation that they find themselves in during their professional lives. Teachers and textbooks can, however, equip learners with the skills to become independent. This involves such metacognitive strategies as selecting what is important to learn, planning one's learning and, most importantly, becoming familiar with resources and online tools.

The most standard, traditional resources in use are dictionaries and grammars. People need to know what information they can find in them and how to use them. They also need to understand that if these resources do not answer their questions, they can search corpora, as we are about to see. In fact, contemporary dictionaries and grammars are written using corpus data, but space does not permit them to include every piece of language information that is available. The authors of these resources also have to make choices. What we find in our own corpus searches is the raw data that these published resources use.

Fortunately, scientists are accustomed to working with data. They form research questions, obtain data, process it and draw conclusions. They share their conclusions, get feedback and reconsider them. Language, too, can be treated as data, especially when stored in databases. Databases of texts, so-called corpora (singular corpus), are constructed for specific purposes. A corpus might be a large sample of general language that was produced between 2000 and 2005, or it might contain a collection of texts concerning black holes or child soldiers or Roma integration or eutrophication or any tenor, mode or field in the world, as long as there are texts that can go into a corpus. There are specialised search tools for corpora called concordancers, which search for words and phrases, reveal language patterns that intuition generally cannot, and furnish examples corresponding to the language question. The biggest challenge is knowing how to formulate answerable questions.
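To make the idea of a concordancer concrete, here is a minimal sketch in Python. It is not how Sketch Engine or any other concordancer is implemented; the function name kwic, the width parameter and the sample text are all invented for illustration. It simply finds each occurrence of a phrase and prints it with a fixed amount of context on either side, so that the search phrase lines up in a column.

```python
import re


def kwic(text, phrase, width=30):
    """Return one aligned context line for every occurrence of `phrase`."""
    # Collapse whitespace so that line breaks do not disturb the alignment.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the search phrase forms a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Invented sample text, for illustration only (not taken from any corpus).
sample = (
    "The outcome depends on the context of the experiment. "
    "Growth rate depends on temperature and light. "
    "Whether this holds depends on the sample size."
)

for line in kwic(sample, "depends on"):
    print(line)
```

Reading down the aligned column is what lets a writer compare many uses of the same phrase at a glance; a real concordancer adds sorting, filtering and grammatical annotation on top of this basic view.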


Here are two screenshots from Sketch Engine, a tool that is developed at Masaryk University. It is not only a multipurpose concordancer; it also has many corpora ready to use and tools for making your own corpora. The first screenshot shows the phrase depends on… context in its Key Word in Context (KWIC) view. Using this, authors can see a range of examples to help them decide whether to incorporate some form of it into their writing. Live example:

The second screenshot shows a word sketch of the word data. It shows the most significant adjectives, verbs and nouns that are typically used with it, and it shows these in the grammatical relationships that they have with data. Clicking on an underlined number opens a KWIC view of the two words. Live example:


Knowing how to learn from corpus data is a skill that equips people for life. There was a modest attempt in the IMPACT course to introduce science students to using corpora.

Here are three sentences from one abstract submitted by four students participating in the IMPACT project. This text, if truly collaborative, indicates that there are four students who do not have a sound grasp of general English. The spelling mistakes suggest that they either do not know how to use a word processor's spell checker or do not care, both of which raise some concern.

The aim of this section is to show corpus data demonstrating standard and non-standard usage.

1. Algal bloom is a rapid increase in the population of algea and its occurence in the Brno reservoir is problematic.

While rapid increase in is well attested in the BNC with 63 hits, it never fills the complement role in the standard English structure subject – verb – complement. The clause at the beginning of their sentence, be a rapid increase in, occurs 9 times, but always preceded by there. Of the 317 times that be an increase in occurs (without an adjective), 216 are preceded by there.
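The kind of count reported above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular expression, the function name there_share and the four sample sentences are invented here, and a real corpus query would rely on lemmatised and tagged text rather than a list of verb forms. The pattern only detects there when it stands immediately before the verb.

```python
import re

# A form of "be", then "a"/"an", an optional adjective, then "increase in".
# The optional named group records whether "there" comes directly before.
PATTERN = re.compile(
    r"\b(?P<there>there\s+)?(?:is|was|are|were|be|been|being)\s+"
    r"an?\s+(?:\w+\s+)?increase\s+in\b",
    re.IGNORECASE,
)


def there_share(sentences):
    """Return (hits preceded by 'there', all hits) for the pattern."""
    with_there = 0
    total = 0
    for sentence in sentences:
        for match in PATTERN.finditer(sentence):
            total += 1
            if match.group("there"):
                with_there += 1
    return with_there, total


# Invented sentences, for illustration only (not corpus data).
sentences = [
    "There was a rapid increase in unemployment.",
    "There is an increase in demand.",
    "The result was an increase in costs.",
    "Algal bloom is a rapid increase in the population of algae.",
]

with_there, total = there_share(sentences)
print(f"{with_there} of {total} hits are preceded by 'there'")

# The same proportion for the BNC figures quoted in the text.
print(f"{216 / 317:.1%}")
```

Applied to the figures in the text, 216 of 317 works out at roughly 68%, which is the sort of proportion that turns a vague impression into a criterion for choosing a wording.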

Another problem in this sentence is the collocation problematic occurrence. This does not occur in the BNC at all. The nouns that are described as problematic include so-called "general nouns", e.g. nature, area, aspect, situation, concept, relationship, issue.

Sentence 2 is the next sentence in the same abstract.

2. Such growth not only affects recreation or drinking water supply, but as well the aquatic ecosystem.

The relationship between increase and growth is admirable, as is the use of such. So is the use of not only … but also … But the use of or is strange because they mean and. The most important language to learn from this sentence, however, is the use of as well, which typically occurs in final position. This is indicated in corpora by punctuation. Of the 558 occurrences of as well in the BAWE corpus not followed by as, 367 are followed by punctuation. Those that are not are mostly followed by conjunctions and auxiliary verbs, almost never by lexical words [].
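A rough way to reproduce this kind of check on one's own texts is sketched below in Python. The function name classify_as_well and the sample text are invented for illustration, and the approach is deliberately crude: it skips as well as and then looks only at whether the next character is punctuation (or the end of the text) or the start of another word.

```python
import re
from collections import Counter

# "as well" that is not the start of "as well as".
AS_WELL = re.compile(r"\bas well\b(?!\s+as\b)", re.IGNORECASE)


def classify_as_well(text):
    """Count what follows each 'as well': punctuation/end of text, or a word."""
    counts = Counter()
    for match in AS_WELL.finditer(text):
        rest = text[match.end():].lstrip()
        if not rest or not rest[0].isalnum():
            counts["punctuation"] += 1
        else:
            counts["word"] += 1
    return counts


# Invented sample text, for illustration only (not corpus data).
sample = (
    "The water supply is affected as well. "
    "It harms fish as well as plants. "
    "They measured depth and as well record the flow. "
    "Recreation suffers as well, which matters."
)

print(classify_as_well(sample))
```

In this invented sample the only hit followed directly by a lexical word is the non-standard one, which mirrors the pattern described above; with the BAWE figures, 367 of 558 comes to roughly 66% followed by punctuation.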

It is almost inconceivable that four master's-level students could be responsible for the next sentence.

3. Collected data we analyze take in consideration the depth of water, the location in the reservoir and as well record the influence of the reservoir on the river Svratka.

English is an S V O language. Only under the influence of certain discourse-level constructions is O S V used, which is not the case with "Collected data we analyze". In the BNC collected data occurs 14 times, never at the beginning of a sentence or clause. Of these 14, collected is an adjective only three times. In the word sketch above, we saw the verb <=> noun collocation collect data 389 times, and the compound data collection 2,506 times.

Their O S V structure is clearly a direct translation from their L1, which readily permits this syntax.

It is difficult to know who takes and who records. The intended chunk is take into consideration (238 hits), not take in consideration. Once again, the use of as well is problematic.

The last part of the sentence, record the influence of something on something, is fine. It would be even better with an explicit grammatical subject.

What we take from these corpus-based analyses, apart from the linguistic information about patterns of normal usage, is that a considerable amount of linguistic metalanguage is required. This is not just terminology for its own sake, but a sophisticated conceptualisation of language.

It helps to think of language as a network of probabilistic patterns rather than rules, which is not a new conceptualisation of language. The empirical language data that corpora provide has driven this pattern approach in most fields of contemporary linguistics, but it has not made much of an impression on the last half century's teaching practices. An advantage of thinking of language as patterns is that we have something concrete to search for in corpora relevant to our field.

In fact, at the beginning of the project we started making an "IMPACT" corpus consisting of recently published articles from the fields in which our scientists work. Unfortunately, they did not provide many texts and this aspect of the project was abandoned. There was also a problem converting PDFs and assigning metadata such as author, title and field to each document. The aim of working with a subject-specific corpus is to facilitate the contextual observation of words and phrases that are peculiar to a field. Some of them are found in general corpora, but mostly in their general, non-scientific uses.

We also find, in the analysis of the three student sentences, that learning to use Sketch Engine software to find this information requires time and training. The same is true of learning to use all the features of any new piece of equipment to best effect. As with any piece of equipment, however, the basic features of Sketch Engine can be used immediately and to great effect.

Although such training was not possible within the IMPACT project, the students were introduced to the idea of treating language as data and of querying it to answer questions that involve choosing between possible and probable wordings. In fact, the very notion of possible vs. probable is central to modern linguistics. We are mostly interested in what is said, not what can be said. This follows from the above-mentioned dichotomy of patterns and rules.

There are many coursebooks that teach academic prose, but they generally assume that learners have a solid grasp of the basics of English, and that they are ready to learn features of academic prose such as paragraph structure, hedging, the first person (to be or not to be), sign-posting, and more mechanical matters such as indentation, footnotes and citations. As the above extracts of students' work demonstrate, a great deal of work on the basics of English is also necessary.

The main writing task the students undertook was the writing of abstracts. They were submitted in Word and corrected with Track Changes. This allows novice authors to see their original and the corrections and suggestions at the same time. Upon receiving their work back, they need to decide which changes to accept and which to reject in order to make their final copy. This requires them to process the teacher's comments, which involves higher-order thinking skills. The process of commenting on their work was captured using a screencast programme called Jing. Not only did the students see suggestions in Word, but they could also listen to the teacher discussing language issues. Furthermore, the process of making these mini-videos permits pausing, during which the teacher can find corpus examples and online dictionary pages, and demonstrate how these resources can be gainfully employed at a specific point in their writing. Viewing statistics, however, reveal that some students did not open these videos even once, and very few looked at anyone else's, which was a wasted opportunity. Links to some of these videos can be found on this page:

It was not only students who were introduced to this contemporary approach to language study. So were the scientists and the other language teachers involved in this project. Few seemed to be in any doubt as to its worth, but there is no evidence of their use of corpus data in the worksheets they produced or in checking written work that the students produced. Most people are still content with their non-native speaker intuition.

The project also offered several one-day courses in the use of corpora through Sketch Engine. Separate days were offered to language teachers and to scientists. These groups exhibited great enthusiasm during the actual courses, although it is not possible to assess the long-term impact.

To conclude, it is acknowledged that incorporating new thinking about language and learning to use such software requires systematic training over an extended period. There is only one book that teaches people how to use Sketch Engine to ask and answer language questions. It is called Discovering English with Sketch Engine [] and was published in May 2015, too late for the IMPACT project. The onus for such training does not lie with students, but with teachers. Students will study language in any way that their teachers advocate, and not inducting students into corpus use is depriving them of a resource that will equip them for life.

© LANGUAGE CENTRE, MASARYK UNIVERSITY, Brno 2014