Linguistics 300, F12, Assignment 3
Assignment
As discussed in Coded corpora,
CorpusSearch coding queries encode the results of many ordinary
CorpusSearch queries as a single string of characters. For this
assignment, you will download a file containing all of the coding
strings from the coded versions of the Penn Parsed Corpus of Historical
English and the Parsed Corpus of Early English Correspondence. The
meaning of the various symbols is described in the
coding conventions.
To analyze the coding strings, you will use a Linux/Unix
command called grep (an acronym based on
'global regular expression print'). Not
surprisingly, you will need to learn a bit about regular expressions in
order to use it. The tutorial on grep provides a summary of the
regular expressions that you will use.
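To get a feel for what such a search looks like, here is a minimal sketch. The file name sample.cod, the colon-separated layout, and the three coding strings are made up for illustration; the real file and the meaning of each column are defined by the course materials and the coding conventions. The syntax is sh/bash.

```shell
# Hypothetical sample data: three made-up coding strings, one per line.
workdir=$(mktemp -d)
cat > "$workdir/sample.cod" <<'EOF'
a:b:v:1550
a:c:V:1560
x:b:k:1570
EOF

# Count the lines whose third colon-separated field is "v" or "V".
#   ^      anchors the match at the start of the line
#   [^:]*  matches one field (any run of characters other than a colon)
#   [vV]   matches either letter
count=$(grep -c '^[^:]*:[^:]*:[vV]:' "$workdir/sample.cod")
echo "$count"
```

The -c option makes grep print the number of matching lines rather than the lines themselves; leave it out to see the matches.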
Once you're somewhat familiar with the
coding conventions and
grep, start
working on the exercises below.
As you'll soon see, the grep searches that you need for the
exercises are quite repetitive. In order to save time and to
avoid making typos in your command-line input, it is very convenient to
run the searches in batches rather than on a one-by-one basis. The
tutorial on shell scripts provides you with two sample
scripts. Once you understand how these scripts work, you will be able
to edit them to complete the exercises.
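The idea of a batch search can be sketched as follows. This is not one of the tutorial's sample scripts; it is an illustration under assumptions. The file sample.cod, its layout (a year as the last field), and the two period patterns are all hypothetical, and the syntax is sh/bash rather than tcsh.

```shell
# Hypothetical sample data: made-up coding strings ending in a year.
workdir=$(mktemp -d)
cat > "$workdir/sample.cod" <<'EOF'
a:b:v:1550
a:c:V:1560
x:b:k:1570
x:b:v:1575
EOF

# Run the same search once per time period instead of typing each
# grep command by hand. Each pattern covers a 20-year span.
total=0
for period in '15[56][0-9]' '15[78][0-9]'; do
    # \$ keeps a literal $ in the pattern, anchoring the year at line end.
    # "|| true" keeps the loop going when grep finds no match
    # (grep -c still prints 0 in that case).
    n=$(grep -c ":$period\$" "$workdir/sample.cod" || true)
    echo "$period: $n"
    total=$((total + n))
done
echo "total: $total"
```

Adding a search then means adding one pattern to the list, which is less error-prone than retyping the whole command.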
- A very important concept in dealing with large corpora is the
concept of the sanity check. It isn't humanly possible to check every
single last piece of data for errors in a large corpus, but it is
possible to identify types of data that are logically impossible and
then to search for them (hoping that we don't find them). For example,
no clause in the corpus should be both a negative declarative and a
question at the same time. Can you think of some more sanity checks
like this? Search for the relevant coding strings, and report any
"insane" instances.
- Track the rise of do-support with ordinary verbs in negative
declarative sentences by convenient time periods (say, 20-year time
periods). "Ordinary" verbs are ones that are coded as "v" or "V" in the
coding strings; see the coding conventions for details.
- Ellegård 1953:199 distinguishes a "know" class of verbs (see the coding conventions for details).
Document how these verbs behave differently from ordinary verbs.
- Repeat the previous two exercises for questions. If you wish, you can
take into account different subtypes of questions.
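A sanity check of the kind described in the first exercise might look like the sketch below. The codes are invented for illustration: it assumes, hypothetically, that "q" in the first field marks a question and "n" in the second field marks a negative declarative. Consult the coding conventions for the actual symbols and columns. The syntax is sh/bash.

```shell
# Hypothetical sample data with made-up codes.
workdir=$(mktemp -d)
cat > "$workdir/sample.cod" <<'EOF'
q:-:v
-:n:v
-:-:v
EOF

# Search for the logically impossible combination: a line coded as
# both a question and a negative declarative. We hope to find none.
# "|| true" is needed because grep exits with status 1 on no match.
insane=$(grep -c '^q:n:' "$workdir/sample.cod" || true)

if [ "$insane" -eq 0 ]; then
    echo "sanity check passed: no impossible strings"
else
    # Print the offending lines with their line numbers for the report.
    grep -n '^q:n:' "$workdir/sample.cod"
fi
```

A count of zero is the good outcome here; any line that is printed instead is an "insane" instance to report.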
Troubleshooting
- The preferences on your text editor should be set as "primitively"
as possible.
  - For instance, in TextEdit, under "New Document", set the
  "Format" option to "Plain text" (not to "Rich text").
  - Under "Open and Save", deselect all button options.
  - Under "Plain Text File Encoding", select UTF-8 or UTF-16.
- The names of your shell scripts should contain no extensions (no .txt,
.rtf, etc.).
- The commands in the assignment assume that you are using
the tcsh shell. The default shell on most systems seems
to be a slightly different shell, the bash shell. You can
find out which shell you are running by typing
echo $SHELL
If typing the simple filename to run a shell script gives an
error message, chances are that the bash shell doesn't know
where to find your command. Help it out by prefacing the
command you are trying to run with ./ (which explicitly
tells the shell to look in the current directory) or
with bash. That is, instead of typing
batchGrep
try one of the following two options:
./batchGrep
bash batchGrep
There is no space in the first option, but there is a space
after bash in the second option.
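The two ways of running a script can be tried out with the short sketch below. The script created here is a stand-in that only prints a message, not the real batchGrep. One detail is not mentioned in the list above: the ./ form works only if the file has execute permission, which is what the chmod line takes care of. The bash form has no such requirement.

```shell
# Create a stand-in script named batchGrep in a scratch directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '#!/bin/sh' 'echo "batchGrep ran"' > batchGrep

# Option 2: hand the file to bash explicitly (no execute permission needed).
out1=$(bash batchGrep)
echo "$out1"

# Option 1: run it from the current directory. This requires the
# file to be executable, so set that permission first.
chmod +x batchGrep
out2=$(./batchGrep)
echo "$out2"
```

If ./batchGrep answers with "Permission denied", the missing chmod +x is the most likely cause.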