$PPCMBE = /home/migration/other/MIDENG/PPCMBE
Check for formatting errors in the .lex files with
Concatenate the .lex files into a single .lex file (TRAINING.lex), reserving some data for testing if desired.
Remove superfluous tags and comments with
The resulting training data are in $PPCMBE/pos/TRAINING.pos
Perform sanity checks on training data
Tags should be separated from the word by exactly one space; one word-tag pair per line.
$PPCMBE/ut/sanity-check-pos
Non-blank lines containing no space:
egrep . | egrep -v ' '
Non-blank lines containing more than one space:
egrep . | egrep '[ ].*[ ]'
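The format rule above (one word-tag pair per line, exactly one space) can also be checked in a single pass that reports line numbers. This is only a sketch, not the contents of sanity-check-pos: the function name check_pos_format and the sample words are invented for illustration.

```shell
# Hypothetical helper, not part of $PPCMBE/ut.
# Prints "line-number: line" for every non-empty line that does not
# contain exactly one space. Empty lines (sentence separators) are skipped.
check_pos_format() {
    awk 'length($0) > 0 {
             n = gsub(/ /, " ")      # count the spaces; the line is unchanged
             if (n != 1) print NR ": " $0
         }' "$1"
}

# Made-up sample: line 3 has no tag, line 4 has two spaces.
sample=$(mktemp)
printf '%s\n' 'the D' 'kynge N' 'of' 'Englond  NPR' '' 'came VBD' > "$sample"
check_pos_format "$sample"
rm -f "$sample"
```

The line numbers refer to the training data, so the offending token still has to be located in the .lex file it came from.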
Correct any formatting errors in the .lex files and reconstruct the training data until no more errors are found.
Check for illegal tags in training data.
cut -d ' ' -f 2 <training-data> | sort | uniq > TMP
diff TMP $PPCMBE/training-tagger/LEGAL-TAGS-ALL | grep '[<|]'
Check that components of complex tags are themselves legal tags:
cut -d ' ' -f 2 <training-data> | tr '+' '\012' | sort | uniq > TMP
diff TMP $PPCMBE/training-tagger/LEGAL-TAGS-SIMPLEX | grep '[<|]'
Be sure to perform both sanity checks, since the simplex check finds errors that the complex one doesn't (e.g. VBP+PRO).
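The two tag checks can be wrapped in one function. The sketch below swaps diff for comm -23, which prints only the tags that occur in the data but not in the legal list. The function name illegal_tags, the "split" argument, and the small tag lists are assumptions made for illustration; they are not the real LEGAL-TAGS files.

```shell
# Hypothetical helper, not part of $PPCMBE.
# $1: training data (word tag per line)
# $2: file of legal tags, one per line
# $3: "split" to break complex tags at "+" before comparing (optional)
illegal_tags() {
    tags_used=$(mktemp) || return 1
    tags_legal=$(mktemp) || return 1
    cut -d ' ' -f 2 "$1" |
        if [ "${3:-whole}" = split ]; then tr '+' '\n'; else cat; fi |
        grep . |                          # drop blank separator lines
        LC_ALL=C sort -u > "$tags_used"
    LC_ALL=C sort -u "$2" > "$tags_legal"
    # Column 1 of comm: lines found only in the first file.
    LC_ALL=C comm -23 "$tags_used" "$tags_legal"
    rm -f "$tags_used" "$tags_legal"
}

# Made-up data and tag lists for demonstration only.
demo=$(mktemp -d)
printf '%s\n' 'the D' 'hastow VBP+PRX' '' 'cam VBDD' > "$demo/data"
printf '%s\n' D VBP PRO VBD 'VBP+PRO' > "$demo/all"
printf '%s\n' D VBP PRO VBD > "$demo/simplex"
illegal_tags "$demo/data" "$demo/all"            # whole-tag check
illegal_tags "$demo/data" "$demo/simplex" split  # component check
rm -rf "$demo"
```

On this sample the whole-tag check reports the complex tag as a unit, while the component check names the offending part, which is one reason to run both.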
Once again, correct any errors in the .lex files and reconstruct the training data until no errors are found.
If the training data is being constructed from parsed rather than POS-tagged files, filter out any instances of ID.
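A minimal sketch of the ID filter, assuming the data are already in the one-pair-per-line format described above, so that ID appears as the tag in the second field. The function name drop_id_lines and the sample line are invented for illustration.

```shell
# Hypothetical helper, not part of $PPCMBE.
drop_id_lines() {
    # Print every line whose second whitespace-separated field is not "ID".
    # Blank lines have no second field and are therefore kept.
    awk '$2 != "ID"' "$1"
}

# Made-up sample with one ID line.
sample=$(mktemp)
printf '%s\n' 'came VBD' 'SAMPLE,1.1 ID' '' 'the D' > "$sample"
drop_id_lines "$sample"
rm -f "$sample"
```

Matching on the tag field rather than on the string "ID" anywhere in the line avoids dropping a word that happens to contain those letters.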
Finally, separate sentences in the training data by blank lines with
Otherwise, the training will take forever.
Training the tagger
See the xwiki under
for further details.
TRAINING_DIR = /home/beatrice/fnTBL/test-cases/pos-tagging
ORIG = /home/beatrice/fnTBL/test-cases/pos-tagging-orig
Adjust these paths to your local setup.
this unpacks to:
cd /home/beatrice/fnTBL/test-cases/pos-tagging
../../exec/pos-train.prl -v -F tbl.lexical.train.params,tbl.context.pos.params -t N,NPR -r 0.7 -T 1,100 <training data>