Tbilisi, September 14, 2015, Workshop on Specialized Corpora, Noah Bubenhofer
CQP is an important query language for corpora; it is used, for example, in the Open Corpus Workbench (CWB). Combined with CQPweb, it makes querying corpora easy. The following introduction shows some basic functions of CQP in CQPweb. The Text+Berg-Korpora are just one example of corpora using the CWB system. A more in-depth tutorial in German using CWB can be found here: Einführung in die Korpuslingustik: CQPweb




We want to define two subcorpora: one containing all texts from the 1970s and one containing the texts from the years after 2000. The idea is to find the vocabulary that is specific to each of the two time periods.




Regular expressions make it possible to search for complex patterns in a text and to replace each match with something else or with a modified version of the match. Consider a file containing many texts, each of which begins with a header like the following:
Title of Text No 1

Author Name

Here begins the text with a lot of paragraphs.

Here is another paragraph.

At the end of the text there are let's say three blank lines, followed by the next text.



Title of Text No 2

Author Name

Here comes the body of the second text.



Title of Text No 3...
You want to convert these texts into the following XML form:
<TEI xml:id="MyTextNumber1">
<teiHeader>
<fileDesc>
<titleStmt>
<title>Title of Text No 1</title>
<author>Author Name</author>
</titleStmt>
</fileDesc>
</teiHeader>
<text>
<body>
<div>
Here begins the text with a lot of paragraphs.
Here is another paragraph.
At the end of the text there are let's say three blank lines,
followed by the next text.
</div>
</body>
</text>
</TEI>
To produce this structure, we need to identify the text units in the text file, identify their parts (title, author, body) and put them into the XML format. The approach is to say something like:
Search for a line of text (the title) followed by a blank line, followed by another line of text (the author line), followed by a blank line, followed by several lines, also with blank lines (the body of the text), followed at the end by three blank lines.
Put this statement in a regular expression:
.+\n\n.+\n\n[\w\W]+?\n\n\n\n
Let's analyze this expression:
. (the dot): matches any single character (letters, digits and also blanks), but by default not a line break
+: the preceding element must appear at least once (one or more times)
\n: line break
Therefore the expression .+\n\n matches a line of text followed by a blank line. A blank line is produced by two line breaks: one line break to end the line of text, another to end the blank line. This pattern appears twice in the expression, to match the title line and the author line. After that we see the expression [\w\W]+:
[]: character class; matches characters of the designated class, e.g. [0-9] matches a digit between 0 and 9
\w: shorthand for any word character
\W: any non-word character
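The difference between the dot and the class [\w\W] can be checked with a few lines of Python. This is only a minimal sketch using Python's standard re module; the sample string and the variable names are made up for illustration, and other regex engines or text editors may treat line breaks differently.

```python
import re

# Hypothetical sample: title line, blank line, author line, blank line, body.
sample = "Title of Text No 1\n\nAuthor Name\n\nFirst paragraph.\n"

# .+ stops at the first line break, so it matches only the title line.
title_match = re.match(r".+", sample)
print(repr(title_match.group()))

# .+\n\n matches the title line plus the blank line that follows it.
title_and_blank = re.match(r".+\n\n", sample)
print(repr(title_and_blank.group()))

# [\w\W]+ also crosses line breaks, so it consumes the whole sample.
everything = re.match(r"[\w\W]+", sample)
print(everything.group() == sample)
```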
Using [\w\W]+ we match any number of lines consisting of word or non-word characters, including line breaks. So it matches all the paragraphs of the text body. Now we have to stop matching paragraphs where the three blank lines appear. As three blank lines are produced by four line break characters (the first line break ends the last line of text), we could write [\w\W]+\n\n\n\n. But since [\w\W]+ also matches several blank lines in sequence, this expression is greedy and matches everything (including "three blank lines" sequences) up to the last "three blank lines" sequence in the file. That's not what we want. Instead we want the expression to stop at the first place where the sequence of paragraphs ends with three blank lines. Therefore we add a "laziness" (or reluctant) sign after the plus. That is the question mark:
[\w\W]+?\n\n\n\n
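The effect of the question mark can be seen by running both variants on the same input. The following is a small sketch in Python (standard re module); the string two_texts is an invented example containing two text bodies, each followed by three blank lines.

```python
import re

# Two invented text bodies, each terminated by four line breaks
# (that is, three blank lines).
two_texts = (
    "Body of text one.\n\nSecond paragraph.\n\n\n\n"
    "Body of text two.\n\n\n\n"
)

# Greedy: runs on to the LAST sequence of four line breaks.
greedy = re.match(r"[\w\W]+\n\n\n\n", two_texts)

# Lazy: stops at the FIRST sequence of four line breaks.
lazy = re.match(r"[\w\W]+?\n\n\n\n", two_texts)

print(len(greedy.group()), len(lazy.group()))
print(repr(lazy.group()))
```

The greedy match swallows both bodies at once, while the lazy match ends after the first body.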
So now with .+\n\n.+\n\n[\w\W]+?\n\n\n\n we match one text, and we know which parts of the expression correspond to the title, the author line and the body of the text. As we want to separate these parts and wrap them in XML elements, we need to memorize them. We use round brackets (capturing groups) for that:
(.+)\n\n(.+)\n\n([\w\W]+?)\n\n\n\n
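To see what the three bracket groups memorize, the expression can be tried out in Python. This is an illustrative sketch with the standard re module; the sample string is a shortened, made-up version of the file described above.

```python
import re

pattern = re.compile(r"(.+)\n\n(.+)\n\n([\w\W]+?)\n\n\n\n")

# Shortened, invented sample with two texts in the described layout.
sample = (
    "Title of Text No 1\n\nAuthor Name\n\n"
    "Here begins the text.\n\nHere is another paragraph.\n\n\n\n"
    "Title of Text No 2\n\nAuthor Name\n\n"
    "Here comes the body of the second text.\n\n\n\n"
)

# The first match: groups 1, 2 and 3 hold title, author line and body.
first = pattern.search(sample)
title, author, body = first.groups()
print(title)
print(author)
print(repr(body))

# findall returns one (title, author, body) tuple per text unit.
all_units = pattern.findall(sample)
print(len(all_units))
```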
Now we define a replace pattern in which we construct the XML structure. The memorized content of the brackets can be recalled in the replace expression using \1 for the first bracket, \2 for the second and so on. Depending on the text editor or programming language you are using, you may have to write $1, $2 etc. instead.
For the replace expression we just use the desired XML structure, putting \1, \2 and \3 at the locations where we want to fill in the title, the author line and the body of the text:
<TEI xml:id="MyTextNumber1">
<teiHeader>
<fileDesc>
<titleStmt>
<title>\1</title>
<author>\2</author>
</titleStmt>
</fileDesc>
</teiHeader>
<text>
<body>
<div>
\3
</div>
</body>
</text>
</TEI>\n
Please note: depending on the behaviour of your text editor, you should omit the line breaks and tabs from the replace expression and replace them by \n (line breaks) and \t (tabs):
\t<TEI xml:id="MyTextNumber1">\n\t\t\t<teiHeader>\n\t\t\t\t<fileDesc>\n\t\t\t\t\t<titleStmt>\n\t\t\t\t\t\t\t<title>\1</title>\n\t\t\t\t\t\t\t<author>\2</author>\n\t\t\t\t\t</titleStmt>\n\t\t\t\t</fileDesc>\n\t\t\t</teiHeader>\n\t\t\t<text>\n\t\t\t\t<body>\n\t\t\t\t\t<div>\n\t\t\t\t\t\t\t\3\n\t\t\t\t\t</div>\n\t\t\t\t</body>\n\t\t\t</text>\n\t</TEI>\n
This is an ugly single-line expression, but it works as expected...
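The same search and replace can also be scripted instead of run in a text editor. The following Python sketch is one possible way to do it with the standard re module, under a few assumptions: the names PATTERN, TEMPLATE and convert are hypothetical, the sample input is invented, every text (including the last one) ends with three blank lines, and the texts contain no characters such as < or & that would need escaping in XML. As in the replace expression above, every text receives the same fixed xml:id.

```python
import re

# Search expression from the tutorial: title, author line, lazy body,
# terminated by four line breaks (three blank lines).
PATTERN = re.compile(r"(.+)\n\n(.+)\n\n([\w\W]+?)\n\n\n\n")

# Replace expression: \1, \2 and \3 recall the three memorized groups.
# A raw string keeps the backslashes intact for re.sub.
TEMPLATE = r"""<TEI xml:id="MyTextNumber1">
<teiHeader>
<fileDesc>
<titleStmt>
<title>\1</title>
<author>\2</author>
</titleStmt>
</fileDesc>
</teiHeader>
<text>
<body>
<div>
\3
</div>
</body>
</text>
</TEI>
"""


def convert(raw_text: str) -> str:
    """Wrap every matched text unit in the TEI skeleton (hypothetical helper)."""
    return PATTERN.sub(TEMPLATE, raw_text)


# Invented sample input with two texts in the described layout.
sample = (
    "Title of Text No 1\n\nAuthor Name\n\n"
    "Here begins the text with a lot of paragraphs.\n\n"
    "Here is another paragraph.\n\n\n\n"
    "Title of Text No 2\n\nAuthor Name\n\n"
    "Here comes the body of the second text.\n\n\n\n"
)

converted = convert(sample)
print(converted)
```

In a real workflow the sample string would be replaced by the contents of the text file; a text that is not followed by three blank lines is left unconverted by this expression.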
Here you have a screenshot of this search and replace operation using Sublime Text editor:

A lot more is possible using regular expressions. Please consider the following tutorials to learn more about that!