A First Search on babel

contents of this chapter:

what is babel?
useful print-outs
your .cshrc file
your command/output directory
a first command file
running the search
your output file
more about running the search
searching output

what is babel?

babel is a mainframe computer run by the Linguistics Department at the University of Pennsylvania. The following instructions are for those who have an account on babel.

useful print-outs

It's a good idea to print out a copy of the CorpusSearch Reference Page. You may also want to print out copies of the PPCME2 label lists.

your .cshrc file

Add these lines to your .cshrc file:

prepend PATH /pkg/java-1.2ea6/bin
setenv CLASSPATH /pkg/ling/MIDENG/PPCME2/clean_search
set mecorpus = /home/ataylor/MIDENG/PPCME2/SearchMe
set me = /home/ataylor/MIDENG/PPCME2/SearchMe
set all = /home/ataylor/MIDENG/PPCME2/SearchMe/*
alias CS 'java CorpusSearch'

The line beginning "prepend PATH" enables your account to run java programs.

The line beginning "setenv CLASSPATH" ensures that java will be able to find CorpusSearch when you call it from any directory in your account.

The lines beginning "set" and "alias" save typing. Instead of typing "/home/ataylor/MIDENG/PPCME2/SearchMe" (the directory where the corpus is stored) in your java command, you can type "$me"; "$mecorpus" is set to the same path. Similarly, to search the entire corpus, you can type "$all" instead of "/home/ataylor/MIDENG/PPCME2/SearchMe/*". And to run the program, you can type "CS" instead of "java CorpusSearch".

your command/output directory

Make a new directory in your account; you might call it "corpus_stuff". This directory will hold your command files (ending in ".q") and your output files (ending in ".out").
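
As a minimal sketch, the directory can be created and entered from the prompt with standard Unix commands. The name "corpus_stuff" is only the example suggested above; any name will do.

```shell
# Create the command/output directory (-p avoids an error if it already exists).
mkdir -p corpus_stuff

# Move into it so that command files and output files end up here.
cd corpus_stuff

# Confirm where you are.
pwd
```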

a first command file

In your command/output directory, make a new file named "first.q". Copy and paste this line into the file:

query:  (CP* iDominates *-LFD)

This query looks for left-dislocated constituents.

Save "first.q". This is your first command file.
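
If you would rather not open an editor, a one-line command file like this one can also be written straight from the prompt. This is only a sketch of one way to do it. The single quotes matter, because they stop the shell from interpreting the asterisks and parentheses in the query.

```shell
# Write the query line into first.q; single quotes keep the text literal.
echo 'query:  (CP* iDominates *-LFD)' > first.q

# Display the file to check that the query was saved as intended.
cat first.q
```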

running the search

Enter this command at the babel prompt:

CS first.q $all

This command will search the entire corpus. You should see a message like this:

Searching.  Please be patient.

After some time, you will see a message like this:

Search completed.  Output file is first.out.

time taken:  63913 milliseconds.  1 minutes, 3 seconds.

your output file

Open up your output file, "first.out". Scroll down to the bottom of the file and have a look at the summary block. This sums up the statistics of the search. The very end of your output file should look like this:

  
    grand total hits :  45
    grand total tokens:  45
    grand total tokens searched:  83487
*/

So there were 45 distinct nodes containing the structure, 45 tokens containing those nodes, and a total of 83487 tokens searched. If you scroll up through the summary block, you'll see which corpus files contained the structure. Of course, the numbers aren't the whole story! Take some time to look at the output sentences too.
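
To pull out just the totals without scrolling, a standard tool such as grep can be pointed at the output file. The sketch below first builds a small stand-in file from the lines quoted above so that it runs anywhere. The file name "summary_demo.out" is made up for this example; on babel you would run grep against your own "first.out" instead.

```shell
# Stand-in for the end of a real output file, copied from the lines quoted above.
# On babel, skip this step and use your own first.out.
cat > summary_demo.out <<'EOF'
    grand total hits :  45
    grand total tokens:  45
    grand total tokens searched:  83487
*/
EOF

# Print only the grand-total lines of the summary block.
grep 'grand total' summary_demo.out
```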

more about running the search

A search of the entire corpus may take several minutes, depending on the traffic on babel. You may want to run your search in the background, so you can do other work while the program is running. To do so, add "&" at the end of your command:

CS first.q $all &

To search, for instance, only the files from the "m4" period in the corpus, use this command:

CS first.q $me/*m4*

CorpusSearch can handle any number of source files, and they may be listed on the command line like this:

CS <command_file> <first_source> <second_source> <third_source> ...

To give your own name to the output file, instead of using the automatic name "first.out", use this command:

CS first.q -out <your_name>

searching output

The output of one search may be used directly as input to the next search.

In fact, CorpusSearch can search any file that contains parsed sentences. So, for instance, you could search a file of sentences you have collected from various output files. Just be sure that if you have added your own comments they are enclosed in comment markers /* and */.
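
One rough way to check a hand-assembled file before searching it is to count the lines that contain each comment marker. The sketch below uses an invented stand-in file: the name "collected_demo.out", the notes, and the placeholder lines are all assumptions for illustration, and a real file would hold parsed sentences copied from your output files.

```shell
# Stand-in for a hand-assembled file. The notes and placeholder lines are
# invented for illustration; a real file would hold parsed sentences.
cat > collected_demo.out <<'EOF'
/* sentences gathered from several output files */
placeholder for a parsed sentence
/* check this one again later */
placeholder for another parsed sentence
EOF

# Count the lines containing an opening marker, then a closing marker.
# Matching counts suggest that every comment is properly enclosed.
grep -c '/\*' collected_demo.out
grep -c '\*/' collected_demo.out
```

Here both commands print 2. Because grep -c counts lines rather than individual markers, this is a quick sanity check and not a full validation.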
