How to tag MBE files

This page gives a step-by-step overview of the training and tagging process for Modern British English. Details on training, running, and evaluating the tagger are available at How to train, test and run the fnTBL tagger.

$PPCMBE = /home/migration/other/MIDENG/PPCMBE

Constructing the training data

The training data is ordinarily constructed from corrected .lex files in $PPCMBE/lex/DONE.

Check for formatting errors in the .lex files with

$PPCMBE/ut/sanity-check-lex

Concatenate the .lex files into a single .lex file (TRAINING.lex), reserving some data for testing if desired.
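The concatenation step can be sketched as a small shell function. This is only an illustration, not one of the scripts in $PPCMBE/ut: the function name build_training_lex and the held-out list file (one .lex basename per line) are assumptions introduced here.

```shell
#!/bin/sh
# Hypothetical helper: split the .lex files in a directory into a training
# file and a test file.
# Usage: build_training_lex LEX_DIR HELDOUT_LIST OUT_TRAIN OUT_TEST
# HELDOUT_LIST names the files to reserve for testing, one basename per line.
# Write the output files outside LEX_DIR so they are not picked up as input.
build_training_lex() {
    lex_dir=$1; heldout=$2; out_train=$3; out_test=$4
    : > "$out_train"                      # start with empty output files
    : > "$out_test"
    for f in "$lex_dir"/*.lex; do
        [ -e "$f" ] || continue           # the glob matched nothing
        if grep -qxF -- "$(basename "$f")" "$heldout"; then
            cat "$f" >> "$out_test"       # reserved for testing
        else
            cat "$f" >> "$out_train"      # goes into the training data
        fi
    done
}
```

For example, build_training_lex $PPCMBE/lex/DONE heldout.txt TRAINING.lex TEST.lex would leave the files listed in heldout.txt out of TRAINING.lex. An empty heldout.txt puts everything into the training file.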

Remove superfluous tags and comments with

$PPCMBE/ut/make-pos TRAINING.lex

The resulting training data is in $PPCMBE/pos/TRAINING.pos.

Perform sanity checks on training data

Each line should contain exactly one word-tag pair, with the tag separated from the word by exactly one space.

$PPCMBE/ut/sanity-check-pos
egrep . | egrep -v ' '
egrep . | egrep '[ ].*[ ]'

Correct any formatting errors in the .lex files and reconstruct the training data until no more errors are found.
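The two format conditions (no non-blank line without a space, no line with more than one space) can also be checked in a single pass. The following is a sketch under assumptions, not the contents of sanity-check-pos: the function name check_pos_format is made up for illustration, and the check is slightly stricter than the egrep commands because it also rejects lines with a leading or trailing space.

```shell
#!/bin/sh
# Hypothetical helper: report every non-blank line that is not of the form
# "WORD TAG" with exactly one space between the two.
# Prints "LINE_NUMBER: LINE" for each offending line and returns 1 if any
# were found, 0 otherwise. Blank lines (sentence separators) are ignored.
check_pos_format() {
    awk 'BEGIN { bad = 0 }
         $0 != "" && $0 !~ /^[^ ]+ [^ ]+$/ { print NR ": " $0; bad = 1 }
         END { exit bad }' "$@"
}
```

Running check_pos_format $PPCMBE/pos/TRAINING.pos prints nothing when the file is well formed. The reported line numbers make it easier to locate the corresponding spot in the .lex files.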

Check for illegal tags in training data.

cut -d ' ' -f 2 <training-data> | sort | uniq > TMP
diff TMP $PPCMBE/training-tagger/LEGAL-TAGS-ALL | grep '[<|]'

Check that components of complex tags are themselves legal tags:

cut -d ' ' -f 2 <training-data> | tr '+' '\012' | sort | uniq > TMP
diff TMP $PPCMBE/training-tagger/LEGAL-TAGS-SIMPLEX | grep '[<|]'

Be sure to perform both sanity checks, since the simplex check finds errors that the complex one doesn't (e.g. VBP+PRO).

Once again, correct any errors in the .lex files and reconstruct the training data until no errors are found.
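Both tag checks can be combined into one function so that neither is forgotten. This is a minimal sketch, assuming that the legal-tag files list one tag per line; the function name illegal_tags is hypothetical, and grep -vxF -f is used here in place of the diff-based comparison above, so the tag lists do not need to be sorted.

```shell
#!/bin/sh
# Hypothetical helper: list tags in a .pos file that are not legal.
# Usage: illegal_tags POS_FILE LEGAL_TAGS_ALL LEGAL_TAGS_SIMPLEX
# Assumes one "WORD TAG" pair per line and one tag per line in both lists.
illegal_tags() {
    pos=$1; legal_all=$2; legal_simplex=$3
    # Whole tags (second field) that are missing from the full tag list.
    awk 'NF == 2 { print $2 }' "$pos" | LC_ALL=C sort -u |
        grep -vxF -f "$legal_all" | sed 's/^/illegal tag: /'
    # Components of complex tags that are missing from the simplex list.
    awk 'NF == 2 { print $2 }' "$pos" | tr '+' '\012' | LC_ALL=C sort -u |
        grep -vxF -f "$legal_simplex" | sed 's/^/illegal component: /'
}
```

An empty result from illegal_tags TRAINING.pos $PPCMBE/training-tagger/LEGAL-TAGS-ALL $PPCMBE/training-tagger/LEGAL-TAGS-SIMPLEX means that both checks pass.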

If the training data is being constructed from parsed rather than POS-tagged files, filter out any instances of ID.
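One possible way to filter out the ID instances is sketched below. It assumes that each ID appears in the training data as an ordinary line whose second field is the tag ID; if the IDs take a different form in your data, the condition needs to be adapted. The function name drop_id_lines is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: drop every line whose tag (second field) is ID.
# All other lines, including blank lines, are passed through unchanged.
# Reads the named files, or standard input if none are given.
drop_id_lines() {
    awk '$2 != "ID"' "$@"
}
```

For example, drop_id_lines TRAINING.pos > TRAINING-noid.pos writes a filtered copy and leaves the original file untouched.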

Finally, separate sentences in the training data by blank lines with

$PPCMBE/ut/add-blanklines-after-period

Otherwise, the training will take forever.
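To illustrate what the sentence-separation step produces, here is a rough equivalent in awk. It is not the add-blanklines-after-period script itself and rests on the assumption that sentence-final punctuation carries the tag "." in the second field; the function name add_blank_lines is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print every input line and add an empty line after
# each line whose tag (second field) is ".", i.e. after each sentence end.
# Reads the named files, or standard input if none are given.
add_blank_lines() {
    awk '{ print } $2 == "." { print "" }' "$@"
}
```

The output has one word-tag pair per line, with each sentence terminated by an empty line.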

Training the tagger

See the xwiki under Fntbl and Results for further details.

TRAINING_DIR = /home/beatrice/fnTBL/test-cases/pos-tagging
ORIG = /home/beatrice/fnTBL/test-cases/pos-tagging-orig
(Adjust these paths to your local setup.)

Tagging new text