By convention, I use .txt for unannotated text files and other extensions, such as .pos, .lex, and .psd, for annotated files. Neither the extension nor the linguistic content of a file changes its data format.
(1) ... | tr -s '\015' '\012' | ...
(2) a. cat interview22Tape1Side1.txt | tr -s '\015' '\012' | tr -s '\011' ' '
    b. cat interview22Tape1Side1.txt | tr -s '\015' '\012' | tr -s '\011' ' ' | sort | uniq
    c. cat interview22Tape1Side1.txt | tr -s '\015' '\012' | tr -s '\011' ' ' > FOO
For further examples, see the commands in the remainder of this section that don't contain "| ...".
cat interview22Tape1Side1-utf8.txt | tr -s '\015' '\012' | ...
cat interview.txt | tr '\015' '\012' | fmt -w 60 | ...
... | tr -s '\012' | ...
... | tr -s '\011' ' ' | ...
... | tr -s ' ' '\012' | ...
... | sort | uniq > FOO
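If it is useful in connection with the last step, it is possible to ignore punctuation (by converting it to newline). Period does not need to be "escaped": tr does not use regular expressions, so a quoted period is taken literally. (For the same reason, the brackets in the following commands are taken literally and are converted to newline as well.)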
... | tr -s '[.,;:?!]' '\012' | ...
It's possible to convert space and punctuation to newline in one fell swoop by adding space to the alternatives in brackets.
... | tr -s '[ .,;:?!]' '\012' | ...
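The steps above can be strung together into one pipeline. What follows is only a sketch on an invented sample string, not one of the interview files. It leaves out the brackets, on the assumption that tr treats the listed characters literally either way.

```shell
# Invented sample with carriage-return line endings and one tab.
sample=$(printf 'Well, I went home.\rI went\tthere; well, yes!\r')

words=$(printf '%s' "$sample" |
  tr -s '\015' '\012' |      # carriage return -> newline
  tr -s '\011' ' ' |         # tab -> space
  tr -s ' .,;:?!' '\012' |   # space and punctuation -> newline, repeats squeezed
  sort | uniq)               # one line per distinct word form

printf '%s\n' "$words"
```

Case is preserved, so "Well" and "well" are listed as separate entries. The order of the output depends on the locale that sort runs in.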
Clean up the file manually and/or with "grep -v" commands modeled on the following. For safety, the caret and dollar sign "anchor" the expressions to the beginning and end of line, respectively.
cat interview22-utf8.txt | tr -s '\015' '\012' | tr -s '\011' ' ' | tr -s ' ' '\012' | grep -v '^GRAM$' | grep -v '^LEX$' | grep -v '^PHON$' | grep -v '^[IB]:$' > interview22.txt
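To see why the anchors matter, here is a small sketch that runs the same filters with and without them. The token list is invented for illustration. GRAMMAR and LEXINGTON stand in for ordinary words that happen to contain a label.

```shell
# Invented one-token-per-line sample.
tokens=$(printf '%s\n' 'I:' 'the' 'GRAM' 'GRAMMAR' 'LEX' 'LEXINGTON' 'B:')

# Anchored: only lines that consist of exactly the label are dropped.
anchored=$(printf '%s\n' "$tokens" |
  grep -v '^GRAM$' | grep -v '^LEX$' | grep -v '^[IB]:$')

# Unanchored: any line that contains the label is dropped, real words included.
unanchored=$(printf '%s\n' "$tokens" |
  grep -v 'GRAM' | grep -v 'LEX' | grep -v '[IB]:')

printf 'anchored:\n%s\n\nunanchored:\n%s\n' "$anchored" "$unanchored"
```

The anchored version keeps the, GRAMMAR and LEXINGTON. The unanchored version keeps only the.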
If you want to change all instances in one session (you don't have to), you can grep for "UT" and edit the path in every file that turns up. You can also edit the path one script at a time, for just the scripts that you are using at the moment.
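As a rough sketch of the grepping step, the following lists every file that sets the path, together with the line number of the setting. It assumes that the path is set by a line beginning with UT=. The scratch directory, the script names, and the path /old/path/ut are all made up for the demonstration.

```shell
# Hypothetical setup: a scratch directory with three small scripts.
dir=$(mktemp -d)
printf '#!/bin/sh\nUT="/old/path/ut"\n$UT/make-tok "$1"\n' > "$dir/script-a"
printf '#!/bin/sh\nUT="/old/path/ut"\necho done\n' > "$dir/script-b"
printf '#!/bin/sh\necho no path here\n' > "$dir/script-c"

# -n prints the line number; the caret keeps later uses of $UT out of the list.
hits=$(cd "$dir" && grep -n '^UT=' script-*)
printf '%s\n' "$hits"

rm -rf "$dir"
```

Each line of output names a file and a line number, which is enough to open the file in an editor and change the path by hand.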
The path variable is near the top of the scripts, and the relevant line looks something like this.
UT="/home/beatrice/appalachian/ut"
You would edit such a line to read something like this:
grep '[A-Z][A-Z][A-Z]' FILE.TextGrid
$UT/make-tok FILE.txt
The make-tok script assumes that the input file has a .txt extension and generates output with the same filename, but with a .tok extension.
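The naming convention can be sketched with shell parameter expansion. The helper tok_name is hypothetical and is not part of the scripts. It only shows which output name to expect for a given input name.

```shell
# Hypothetical helper: print the .tok name that corresponds to a .txt name.
tok_name() {
  # ${1%.txt} removes a trailing ".txt" from the first argument.
  printf '%s\n' "${1%.txt}.tok"
}

tok_name FILE.txt
tok_name interview22.txt
```

A name without a .txt extension is left as it is and simply gets .tok appended, which is one way to spot an input file that does not fit the script's assumption.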
The forward slash in end tags is replaced by a double dollar sign. This is to avoid confusion with the forward slash that delimits POS tags in the tagged version.
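A minimal sketch of such a replacement, using sed. It is not necessarily how the scripts themselves do it. The tag name laugh and the resulting shape <$$laugh> are assumptions made for illustration.

```shell
# Invented input line with one start tag and one end tag.
line='he said <laugh> well </laugh> and left'

# Rewrite "</" as "<$$"; "|" is the sed delimiter, so "/" needs no escaping.
converted=$(printf '%s\n' "$line" | sed 's|</|<$$|g')
printf '%s\n' "$converted"
```

The single quotes keep the shell from expanding $$, and sed treats the dollar signs in the replacement as ordinary characters.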
The pipe symbol that indicates variants (as in there | they) used to be replaced in the .tok files by "XOR". This is no longer necessary, since the text grids are no longer being compared to the original transcripts and there is no need to generate a diff file, where the pipe character has a meaning of its own.
$UT/sanity-check-tok FILE.tok
You can zero in on or ignore particular error types with commands like the following:
$UT/sanity-check-tok *.tok | grep -v '-'
$UT/sanity-check-tok *.tok | grep -v '-' | grep -v 'Mr.' | sort | uniq
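Two grep options can make filters like these a little more robust. This is a sketch on invented lines, since the real output format of sanity-check-tok is not shown here. The option -e marks the next argument as a pattern even when it begins with a hyphen, and -F matches a fixed string, so that the period in "Mr." is not treated as a wildcard.

```shell
# Invented stand-in for sanity-check-tok output.
report=$(printf '%s\n' 'well-known' 'Mr.' 'Mrs' 'Mr.' 'odd@token')

remaining=$(printf '%s\n' "$report" |
  grep -v -e '-' |      # drop lines that contain a hyphen
  grep -v -F 'Mr.' |    # drop lines that contain the literal string "Mr."
  sort | uniq)

printf '%s\n' "$remaining"
```

Without -F, the pattern 'Mr.' also matches Mrs, because an unescaped period in a grep pattern matches any single character.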