$MF = /home/migration/other/MIDENG/MIDDLE-FRENCH
(In the exceptional case of commynes, the training data is constructed from a parsed file. sanity-check-lex can be adapted to this case; after the sanity check, the file needs to be converted from parsed to pos-tagged format by using some form of ref-to-pos.)
$MF/ut/sanity-check-lex
Important: The updated .tok file will eventually have to be glued back together with its .header and .footer files to make an updated .xml file. Editing the .tok file is easier and safer in several respects than editing the .xml file. Because of its format, the .tok file is easier to read. Also, editing the .tok file saves having to retokenize the .xml file after every update. The updated .tok file will be the second argument of the script that reinserts TEI codes into pos-tagged files ($MF/reinsert-tei-codes).
$MF/ut/make-pos TRAINING.lex
$MF/ut/sanity-check-pos
egrep . <training-data> | egrep -v ' '
egrep . <training-data> | egrep '[ ].*[ ]'
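The two egrep checks above look for non-blank lines with no space and for lines with two or more spaces. That suggests each non-blank line of the training data should hold one word, one space, and one tag. The sketch below makes that check in a single pass and reports line numbers. The function name check_pos_fields is hypothetical and is not one of the $MF/ut scripts.

```shell
# Hypothetical helper: print "LINE: TEXT" for every non-blank line
# that does not contain exactly one space.
# $1: pos-tagged training data (assumed format: word, one space, tag)
check_pos_fields() {
    awk 'length($0) > 0 {
        n = gsub(/ /, " ")      # count spaces; the line itself is unchanged
        if (n != 1) printf "%d: %s\n", FNR, $0
    }' "$1"
}
```

No output means every non-blank line has exactly one space.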
cut -d ' ' -f 2 <training-data> | tr '+' '\012' | sort | uniq > TMP
diff TMP $MF/training-tagger/LEGAL-TAGS | grep '<'
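The tag check above can also be wrapped in a function that cleans up after itself. This is only a sketch: the name list_illegal_tags is hypothetical, it uses comm instead of diff, and it sorts the legal-tags list itself rather than assuming that list is already sorted. Unlike the commands above, it also drops blank lines before comparing.

```shell
# Hypothetical helper: list tags used in the training data that are
# missing from the list of legal tags.
# $1: pos-tagged training data (word, space, tag; compound tags joined by '+')
# $2: legal tags, one per line
list_illegal_tags() {
    used=$(mktemp) || return 1
    legal=$(mktemp) || return 1
    # Second field is the tag; split compound tags at '+'.
    cut -d ' ' -f 2 "$1" | tr '+' '\012' | grep . | sort -u > "$used"
    sort -u "$2" > "$legal"
    # Lines found only in the first file, i.e. tags that are not legal.
    comm -23 "$used" "$legal"
    rm -f "$used" "$legal"
}
```

A possible call would be: list_illegal_tags TRAINING.pos $MF/training-tagger/LEGAL-TAGS (the file name TRAINING.pos is a placeholder). No output means every tag is legal.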
$MF/ut/add-blanklines-after-ponfp
Otherwise, the training will take forever.
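Judging only from its name, add-blanklines-after-ponfp inserts a blank line after every token tagged PONFP, so that the trainer sees shorter units. The following is a guess at that behaviour, not the actual script. The function name is hypothetical, and the assumed input is one word, a space, and a tag per line.

```shell
# Hypothetical re-creation: copy the input and add one blank line after
# every line whose second field (the tag) is exactly PONFP.
# $1: pos-tagged file (assumed format: word, space, tag)
add_blank_after_ponfp() {
    awk '{ print } $2 == "PONFP" { print "" }' "$1"
}
```

This sketch is not idempotent: running it twice adds a second blank line after each PONFP token.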
train-fntbl-mf <training-data>
This unpacks to:
cd /home/beatrice/fnTBL/test-cases/pos-tagging
../../exec/pos-train.prl -F tbl.lexical.train.params,tbl.context.pos.params -r 0.3 -f 2 -t NCS,NPRS -T 2,2 <training-data>
$MF/xml-ottawa-originals
$MF/xml-penn
cat $file | tr '\015' '\012'
$MF/ut/sanity-check-xml
The headers and footers are saved to .header and .footer files, which eventually will be glued back together with the corrected POS-tagged file. The main body of the text is written to a tokenized (.tok) file.
$MF/ut/make-tok
$MF/ut/sanity-check-tok
The postprocessing might be improved.
$MF/ut/make-lex
grep '//' FILE-golden.lex
cat FILE-golden.lex | $MF/ut/clean-up-lex | tr ' ' '\012' | grep . | grep -v '/'
cat FILE-golden.lex | $MF/ut/clean-up-lex | tr ' ' '\012' | grep . | cut -d '/' -f1 > LHS
grep -E '>.+$' FILE.tok
grep . FILE.tok | grep -v '<' > RHS
diffils LHS RHS
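The LHS/RHS comparison above checks that the words in the golden .lex file match the non-markup tokens in the .tok file, in the same order. A minimal sketch of the same idea as one function follows. The name compare_words is hypothetical, plain diff stands in for diffils, and the .lex input is assumed to have already been passed through clean-up-lex, which the sketch does not reproduce.

```shell
# Hypothetical helper: compare the words of a cleaned .lex file with
# the non-markup tokens of a .tok file.
# $1: cleaned lex file (space-separated word/tag pairs)
# $2: tok file (one token per line; markup lines contain '<')
# Returns diff's exit status: 0 if the two word lists are identical.
compare_words() {
    lhs=$(mktemp) || return 1
    rhs=$(mktemp) || return 1
    # One word/tag pair per line, then keep only the word before '/'.
    tr ' ' '\012' < "$1" | grep . | cut -d '/' -f 1 > "$lhs"
    # Non-blank token lines without markup.
    grep . "$2" | grep -v '<' > "$rhs"
    diff "$lhs" "$rhs"
    status=$?
    rm -f "$lhs" "$rhs"
    return $status
}
```

Any diff output points to a word that was changed, dropped, or added while the .lex file was being corrected.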
cat FILE-golden.lex | $MF/ut/clean-up-lex | tr ' ' '\012' | grep . > FILE.1
grep . FILE.tok > FILE.2
$MF/ut/reinsert-tei-codes.py FILE.1 ../FILE.2 > FILE.lex.xml
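Once the TEI codes have been reinserted, the body still has to be glued back together with its header and footer. A minimal sketch of that last step is given below. It assumes the three pieces share a base name and are called BASE.header, BASE.lex.xml, and BASE.footer. The function name glue_xml and that naming scheme are assumptions, not part of the $MF/ut scripts.

```shell
# Hypothetical helper: write header, tagged body, and footer to stdout
# in that order, refusing to run if any piece is missing.
# $1: base name shared by the three files (assumed naming scheme)
glue_xml() {
    base=$1
    for part in "$base.header" "$base.lex.xml" "$base.footer"; do
        if [ ! -r "$part" ]; then
            echo "missing: $part" >&2
            return 1
        fi
    done
    cat "$base.header" "$base.lex.xml" "$base.footer"
}
```

A possible call would be: glue_xml FILE > FILE.xml, followed by a well-formedness check on the resulting file.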