- replace bad line breaks from Word (carriage returns become newlines)
cat $input | tr '\015' '\012' > $output
- protect CODE material (line numbers, comments, and the like) by
enclosing it in angle brackets and tagging as CODE
enclose-code-material $input > $output
- replace spaces inside CODE material with underbars
fix-bad-spaces $input > $output
- put #lem line on same line as word forms (in emacs)
- align lemmas with words in short lines (<= 5 words/lemmas)
align-lemmas $input > $output
- prepare remaining long lines
prepare-long-lines $input > $output
- align lemmas with words, one word and its lemma per line
- find spaces in lemmas and replace by underbars
grep -E '<[^>]*[ ]+[^>]*>/LEMMA' $input
- convert spaces to newlines
cat $input | tr -s ' ' '\012' > $output
- sanity checks
grep '>' $input | grep -v '<'
grep '<' $input | grep -v '>'
grep LEMMA $input | grep -v '<'
grep LEMMA $input | grep -v '>'
grep LEMMA $input | grep ' '
grep 'LEMMA.' $input
grep 'LEMMA.*LEMMA' $input
grep '/.*/' $input
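The checks above can be run in one pass. The following is a minimal sketch, not one of the existing scripts: the names `check` and `run_sanity_checks` are hypothetical, and it assumes that any output from a check means a problem.

```shell
# Hypothetical helper: report a check that flagged at least one line.
# $1 = description, $2 = flagged lines (empty if the check passed)
check() {
    if [ -n "$2" ]; then
        printf 'FAILED: %s\n%s\n' "$1" "$2"
        failed=1
    fi
}

# Run every sanity check from the list above on one file.
# Returns 0 if all checks are silent, 1 otherwise.
run_sanity_checks() {
    input=$1
    failed=0
    check "'>' without '<'"       "$(grep '>' "$input" | grep -v '<')"
    check "'<' without '>'"       "$(grep '<' "$input" | grep -v '>')"
    check "LEMMA without '<'"     "$(grep LEMMA "$input" | grep -v '<')"
    check "LEMMA without '>'"     "$(grep LEMMA "$input" | grep -v '>')"
    check "space on LEMMA line"   "$(grep LEMMA "$input" | grep ' ')"
    check "text after LEMMA"      "$(grep 'LEMMA.' "$input")"
    check "two LEMMAs on a line"  "$(grep 'LEMMA.*LEMMA' "$input")"
    check "two slashes on a line" "$(grep '/.*/' "$input")"
    return "$failed"
}
```

Usage: `run_sanity_checks $input`; no output and a zero exit status mean the file passed.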
- put lemmas on same line as their words (in emacs); not all lines will have a lemma
- construct tagging scripts for open-class items from lists of lemmas
- tag-adjectives
- tag-adverbs
- tag-nouns
- tag-verbs
- tag text with above scripts and tag-other
- split off case markers
fix-case-markers $input > $output
- check for remaining case markers
grep ']/[a-]z>/' $file
- deal with verb complex - NEEDS FURTHER WORK
- if not yet done, divide sentence tokens
- if using blanklines to divide sentence tokens, replace blanklines by
BLANKLINE/CODE, EOS/CODE, or the like (blanklines are gobbled up by the
following script)
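One way to do this replacement, as a minimal sketch: it assumes the separator lines are completely empty (no stray spaces or tabs) and uses BLANKLINE/CODE as the placeholder. The function name `mark_blanklines` is hypothetical, not one of the existing scripts.

```shell
# Hypothetical helper: replace each empty line with a BLANKLINE/CODE
# placeholder so that the line survives later processing.
# Assumes blank lines contain no whitespace at all.
mark_blanklines() {
    sed 's|^$|BLANKLINE/CODE|' "$1"
}
```

Usage: `mark_blanklines $input > $output`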
- sanity check - does each line have exactly one slash?
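The one-slash check can be sketched with awk, assuming one word/tag pair per line. The function name `check_one_slash` is hypothetical. Splitting on "/" gives exactly two fields when a line has exactly one slash, so any other field count is reported; empty lines are reported too, which fits the requirement that blanklines be replaced first.

```shell
# Hypothetical helper: print every line that does not contain exactly
# one slash, prefixed with the file name and line number.
# Exit status is 1 if any such line was found, 0 otherwise.
check_one_slash() {
    awk -F/ 'NF != 2 { print FILENAME ":" NR ": " $0; bad = 1 }
             END { exit bad + 0 }' "$1"
}
```

Usage: `check_one_slash $input`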
- replace special characters (parens, slash)
protect-special-characters $input > $output
- convert from word/tag to (tag word) format
pos-to-psd2 $input > $output
- restore blanklines between sentence tokens
- add wrapper parens and root node (S-MAT) (in emacs)
- add syntactic structure
PARS1