VARBRUL PROGRAMS
Susan Pintzuk
1987
TABLE OF CONTENTS
I. Introduction
II. Files and File Formats
    A. Data File
    B. Factor Specification File
    C. Token File
    D. Condition File (MAKECELL, MULTICEL)
    E. Factor Definition File
    F. Cell File (MAKECELL)
    G. Cell File (MULTICEL)
    H. Sorting Condition File (TSORT)
III. Program Descriptions
    A. CHECKTOK
    B. READTOK
    C. MAKECELL
    D. MULTICEL
    E. COUNTUP
    F. CROSSTAB
    G. IVARB
    H. TSORT
    I. TEXTSORT
IV. Running the Programs
V. For VAX Users: A Comparison of the Old and New Varbrul Systems
VI. For IBM PC Users: Input/Output Error Messages
I. INTRODUCTION
This document describes the PASCAL and FORTRAN varbrul programs for the
IBM personal computer and for the VAX. These programs duplicate the operation
of the LISP and FORTRAN programs running on the University of Pennsylvania
Moore School VAX in October 1984; some additional functions were provided, and
some functions were slightly modified (see Section V).
Three PASCAL programs perform the set-up functions necessary to run IVARB,
the variable rule program: CHECKTOK takes as input a data file and,
optionally, a factor specification file; it compares the contents of the
coding strings to sets of legal factors for each factor group, replaces default
characters with default values, pads short coding strings, and creates a new
data file with these coding string modifications. READTOK takes as input one
or more data files created by CHECKTOK and creates a token file. MAKECELL
takes as input a token file, a condition file, and, optionally, a factor
definition file and creates a cell file to be used as input for IVARB. A
fourth PASCAL program, COUNTUP, takes a token file as input and counts the
occurrences and calculates percentages of factors within a factor group.
CROSSTAB, a FORTRAN program, takes a cell file created by MAKECELL as input
and calculates a two-dimensional cross-tabulation of cell data.
MULTICEL creates a cell file to be used for multinomial analysis. MVARB,
the FORTRAN variable rule program which uses this cell file as input, will be
available within the next several months; in the meantime, MULTICEL can be used
to calculate percentages for a multi-valued variable.
Two additional PASCAL programs enable the user to sort data files: TSORT
copies to a separate disk file all tokens with user-specified coding string
values; TEXTSORT copies to a separate disk file all tokens containing
user-specified text strings.
A general note on user input: all PASCAL and FORTRAN programs accept both
upper and lower case input from the console and from disk files. All file
names and all pre-defined words ('nil', 'and', 'or', 'not', 'col', 'elsewhere',
'yes', 'no', etc.) may be in either upper or lower case and will still be
handled correctly by the programs, e.g. 'XYZ.DAT' refers to the same disk file
as 'xyz.dat'. However, the user should adopt a consistent use of either upper
or lower case for condition files, factor definition files, and coding string
factors: the predicate 'NP-OBJ' in a condition file does not match the label
'np-obj' in a factor definition file; 'AB1X' and 'ab1x' are not treated as two
instances of the same coding string.
File names for both the IBM PC and the VAX are a maximum of 8 characters
in length, with an additional 3 characters for extension. The programs use
the first 50 characters input by the user for file name specification,
including directory or path; any input after the first 50 characters is
ignored. The programs do not perform any checks on file names: if the format
of the file name specified by the user is illegal, an attempted disk read from
that file will generate a system error and the program will be aborted; disk
writes to that file may create a file with a name somewhat different from that
specified by the user. An attempted disk read from a non-existent file will
generate a system error and the program will be aborted. If the output file
specified by the user for any program already exists, disk writes to that file
on the IBM PC will overwrite the original contents; on the VAX, a new version
of the file will be created.
II. FILES AND FILE FORMATS
A. DATA FILE:
A data file is a file of elements consisting of paired coding strings and
tokens. The accepted format for a data file element is shown below:
(coding-string (token...........................................
     ...........................................................
     .
     .
     .
     ........................................) source and notes)
CHECKTOK, READTOK, TSORT, and TEXTSORT (the programs that use data files as
input) require each data file element to begin with an open parenthesis in
column 1; and any open parenthesis in column 1, even if the user intends it to
be a non-initial open parenthesis within a data file element, signals the
start of a new element. It is therefore suggested that within each data file
element, all lines but the first be indented. The maximum length for data
file lines is 120 characters. If CHECKTOK, by padding a short coding string,
generates a line longer than 120 characters, that line is flagged with an
error message, even if the original line was less than 120 characters long.
The coding string for each data file element must start in column 2. The
maximum length of a coding string is 78 characters. Each coding string must
be terminated by space, open parenthesis, or end of line. For obvious reasons,
space and open parenthesis are not legal factor values.
CHECKTOK and READTOK process only the initial open parenthesis in column 1
plus the coding string starting in column 2. All lines without an open
parenthesis in column 1 are, in effect, ignored, except for length checks:
CHECKTOK simply writes these lines to the new data file, and READTOK skips
them. Therefore, comments and blank lines may be freely interspersed among
the data; and it is not necessary that open and close parentheses within each
data file element match. Note that TSORT and TEXTSORT interpret comments and
blank lines as part of the preceding, rather than the following, token. The
format of a data file element, except for the position of the initial open
parenthesis and the coding string, may be different from that shown above.
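The column rules above can be illustrated with a short Python sketch. It is not part of the varbrul programs: the function name extract_coding_strings and the sample lines are invented for illustration. It imitates only what this section says READTOK does with a data file, namely picking up the coding string after an open parenthesis in column 1 and skipping every other line.

```python
def extract_coding_strings(lines):
    """Return the coding string of every data file element in `lines`."""
    coding_strings = []
    for line in lines:
        line = line.rstrip("\n")
        # Only an open parenthesis in column 1 starts a new element.
        if not line.startswith("("):
            continue
        rest = line[1:]  # the coding string starts in column 2
        end = len(rest)
        # The coding string ends at a space, an open parenthesis,
        # or the end of the line.
        for i, ch in enumerate(rest):
            if ch in " (":
                end = i
                break
        coding_strings.append(rest[:end])
    return coding_strings


# Hypothetical data file lines, for illustration only.
sample = [
    "(ab1x (he saw the dog",
    "    (yesterday)) speaker 12, tape 3)",
    "",
    "  a comment line, skipped by this sketch",
    "(cd2y(second token) speaker 4)",
]
print(extract_coding_strings(sample))  # ['ab1x', 'cd2y']
```

The second sample line is indented, so its embedded open parenthesis is never mistaken for the start of a new element.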
In addition to the user-defined factors for each factor group, two special
characters may be used within a coding string: the 'doesn't apply' character
'/' (slash) and the default character '.' (period). '/' is used when the
factor group is not relevant for the token. For example, if factor group 2
represents type of direct object, and some tokens do not contain a direct
object, those tokens may be coded '/' for factor group 2. MAKECELL and
MULTICEL convert '/' to space; '/' is not included in the calculation of
probabilities for each factor in the group. If the factor group containing the
dependent variable is coded '/', either in the original coding string or by a
recode in the condition file, that coding string is not used by MAKECELL or
MULTICEL in building cells.
The default character is intended to be a convenience for the user when
one factor within a group is used for most tokens. For example, suppose that
most tokens are coded 'q' for factor group 3. For those tokens, if the user
codes that group as '.' and specifies 'q' as the default value for factor
group 3 when running CHECKTOK, the coding strings in the newly created data
file will
contain 'q' as a replacement for '.'. It is not necessary that each 'q' in
factor group 3 throughout the data file be coded as '.': the user may not
discover until half-way through the coding process that 'q' should be the
default value for the group; '.' may be used from that point on.
Note the distinction between default character and default value. The
default character is the character '.'; it can be used in more than one factor
group. Default values are the characters with which CHECKTOK replaces the
default character. Each factor group has its own default value; the user
specifies the default value for each group when running CHECKTOK. The default
value must be either a member of the set of legal factors for the factor group
or the 'doesn't apply' character '/'.
The default character has one additional use: if the user can determine
before starting to code that one or more factor groups will have default
values, these groups should be placed last within the coding string. If the user
then specifies when running CHECKTOK that short coding strings are to be
filled with the default value for each group, these groups do not need to be
coded at all in the original data file. For example, if the coding strings in
the data file consist of four factor groups, and factor groups 3 and 4 have
default values 'q' and 'w', the two-character coding string 'ab' in the
original data file will be replaced by 'abqw' in the new data file.
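The replacement and padding rules can be sketched in Python as follows. This is an illustration under stated assumptions, not the CHECKTOK source: the function name apply_defaults and its arguments are invented, a group with no default value is represented by None, and the two error cases marked as assumptions are not spelled out in this section.

```python
def apply_defaults(coding, defaults, pad_with_defaults=True):
    """Replace '.' with default values and pad a short coding string.

    `defaults` has one entry per factor group: a single character,
    or None when the group has no default value.
    """
    if len(coding) > len(defaults):
        # Assumption: a string longer than the number of groups is an error.
        raise ValueError("coding string is longer than the number of factor groups")
    result = []
    for index, default in enumerate(defaults):
        group = index + 1
        if index < len(coding):
            ch = coding[index]
            if ch == ".":
                if default is None:
                    raise ValueError(f"group {group} coded '.' but has no default value")
                ch = default
        elif not pad_with_defaults:
            ch = "/"  # pad with the 'doesn't apply' character
        elif default is None:
            # Assumption: the text does not say what happens in this case.
            raise ValueError(f"group {group} has no default value to pad with")
        else:
            ch = default
        result.append(ch)
    return "".join(result)


defaults = [None, None, "q", "w"]
print(apply_defaults("ab", defaults))         # abqw
print(apply_defaults("ab.w", defaults))       # abqw
print(apply_defaults("ab", defaults, False))  # ab//
```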
B. FACTOR SPECIFICATION FILE
The factor specification file is used as optional input to CHECKTOK (all
data in the factor specification file may instead be input through the
console). The factor specification file contains the following information:
number of factor groups, fill character for short coding strings, and legal
factors and default value for each factor group.
Within the factor specification file, blank lines and lines with spaces in
column 1 are ignored, so that comments may be interspersed among the data. The
number of factor groups, the fill character, the string of legal factors for
each group, and the default value for each group must be on separate lines,
beginning in column 1. A sample factor specification file is shown below:
----------
  Factor specifications for data file xyz.dat
  Number of factor groups =
4
  Fill short coding strings with doesn't apply:
/
  Group 1
  Legal factors:
ds
  Default value:
nil
  Group 2
  Legal factors:
abcdefg
  Default value:
d
  Group 3
  Legal factors:
xyz
  Default value:
/
  Group 4
  Legal factors:
1234
  Default value:
3
----------
Note that the above factor specification file is equivalent to that shown
below:
4
/
ds
nil
abcdefg
d
xyz
/
1234
3
Restrictions on the data in the factor specification file are described in
Section III.A.
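The layout of the factor specification file can be read with a few lines of Python. The sketch below only illustrates the rules stated in this section (blank lines and lines with a space in column 1 are ignored; the remaining lines appear in a fixed order). The name parse_factor_spec and the returned tuple are assumptions made for the example, and none of the restrictions of Section III.A are checked.

```python
def parse_factor_spec(text):
    """Return (number of groups, fill character, [(legal factors, default)])."""
    # Keep only data lines: non-blank lines that begin in column 1.
    data = [ln for ln in text.splitlines() if ln.strip() and not ln.startswith(" ")]
    n_groups = int(data[0])
    fill = data[1].strip()
    if len(data) < 2 + 2 * n_groups:
        raise ValueError("factor specification file is too short")
    groups = []
    for g in range(n_groups):
        legal = data[2 + 2 * g].rstrip()
        default = data[3 + 2 * g].strip()
        # 'nil' (in either case) means the group has no default value.
        groups.append((legal, None if default.lower() == "nil" else default))
    return n_groups, fill, groups


# Comment lines start with spaces; data lines start in column 1.
SAMPLE = """\
  Factor specifications for data file xyz.dat
  Number of factor groups =
4
  Fill character:
/

  Group 1
ds
nil
  Group 2
abcdefg
d
  Group 3
xyz
/
  Group 4
1234
3
"""

print(parse_factor_spec(SAMPLE))
```

The commented sample and the bare ten-line form produce the same result, which is the sense in which the two files shown above are equivalent.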
C. TOKEN FILE
A token file consists of a list of coding strings; it is created by
READTOK from the coding strings in one or more data files. (Note the ambiguity
in 'token': 'token' is used in different places to mean coding string, text,
or element of data file; the meaning should be clear from the context.) The
token file is used as input to MAKECELL and MULTICEL. Within the token file,
coding strings are written one per line, starting in column 1; they are all
the same length (since all the coding strings in the data file created by
CHECKTOK are the same length), and are a maximum of 78 characters long.
D. CONDITION FILE (MAKECELL, MULTICEL)
A condition file is obligatory input to MAKECELL and MULTICEL; it
specifies the factor group containing the dependent variable, the factor
groups containing the independent variables, and the recodes which are
performed on the factor groups.
The maximum length of a line within the condition file is 78 characters.
The data within a condition file is in the form of a LISP list. Each element
of the list is itself a LISP list consisting of two parts: a factor group
number (column number within the coding string) and an optional set of recode
conditions. If no recode conditions are specified, the factor group is used
exactly as it is coded in the token file coding strings. All factor groups
specified in the condition file, and only those factor groups, are used by
MAKECELL and MULTICEL to build cells. The first factor group within the
condition file list is used as the group containing the dependent variable.
The order of factor groups specified within the condition file determines the
order of factor groups within the cell.
Each recode condition is again a LISP list consisting of two parts: the
first part is either a single character (the value to be used for the factor
group for those tokens which meet the second part of the condition, the test
clause), or 'nil'. If it is 'nil', that token is excluded from the list used
by MAKECELL and MULTICEL to build cells.
There are 5 test clause predicates: 'and', 'or', 'not', 'col', and
'elsewhere'. 'and', 'or', and 'not' are the standard logical operators; 'and'
and 'or' take 2 to 20 predicates as arguments, 'not' takes a single predicate
as argument. If it is necessary to define a recode condition with more than 20
arguments for 'and' or 'or', two or more of the arguments can be combined into
a more deeply embedded predicate, e.g.
(and a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18
a19 (and a20 a21 a22 a23 a24))
'col' takes 2 arguments, a factor group number (column number within the
coding strings) and a single-character value; 'col' is true if and only if
that column of the coding string contains the specified value. 'elsewhere' is
always true; it is used as the last test clause within a set of test clauses
for a factor group, and forces the recoding of the factor group to the
specified value if none of the previous conditions for that factor group have
been met.
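The meaning of the five predicates can be sketched in Python. This is an illustration, not the MAKECELL source: test clauses are represented as nested Python lists instead of text, and the function name holds is invented for the example. The argument counts enforced are the ones stated above.

```python
def holds(clause, coding):
    """Return True if the test clause is true for the coding string."""
    name, args = clause[0].lower(), clause[1:]
    if name == "elsewhere":
        return True  # always true
    if name == "col":
        group, value = int(args[0]), args[1]
        # Factor group numbers are column numbers, counted from 1.
        return coding[group - 1] == value
    if name == "not":
        if len(args) != 1:
            raise ValueError("'not' takes a single predicate")
        return not holds(args[0], coding)
    if name in ("and", "or"):
        if not 2 <= len(args) <= 20:
            raise ValueError(f"'{name}' takes 2 to 20 predicates")
        results = [holds(arg, coding) for arg in args]
        return all(results) if name == "and" else any(results)
    raise ValueError(f"unknown predicate: {clause[0]}")


coding = "dxqqqqqa"  # hypothetical eight-group coding string
print(holds(["col", 1, "d"], coding))                               # True
print(holds(["and", ["col", 2, "x"], ["col", 8, "b"]], coding))     # False
print(holds(["or", ["col", 2, "x"], ["col", 8, "b"]], coding))      # True
print(holds(["not", ["col", 3, "s"]], coding))                      # True
```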
Test clause predicates can be defined within a factor definition file (see
Section II.E.).
A sample condition file is shown below:
----------
(
 (1 (d (col 1 d))
    (s (col 1 s))
    (/ (elsewhere)))
 (5)
 (3 (/ (or (col 3 s) (col 3 t) (col 3 u)))
    (m (or np-obj pro-obj))
    (x np-subj)
    (nil (elsewhere)))
 (9 (1 (and (col 2 x) (col 8 a)))
    (2 (and (col 2 x) (col 8 b)))
    (3 (and (col 2 y) (col 8 a)))
    (4 (and (col 2 y) (col 8 b)))
    (/ (elsewhere)))
)
----------
Factor group 1 is the dependent variable; factor groups 5, 3, and 9 are the
independent variables which will be used by MAKECELL or MULTICEL to build
cells. Factor group 5 has no recode conditions and therefore will be used
exactly as it is coded in the token file. The recode conditions for factor
group 3 contain three predicates ('np-obj', 'pro-obj', and 'np-subj') which
must be defined within the factor definition file. Groups 2 and 8 are
interactive factor groups, whose interaction is being investigated by creating
factor group 9.
MAKECELL and MULTICEL process the recode conditions in the order in which
they appear in the condition file; the first condition that is satisfied is
used for recoding, and the rest of the conditions for that factor group are
ignored. For this reason, 'elsewhere' should be used only as the last
condition within a set of conditions for a factor group: since 'elsewhere' is
always true, any conditions listed after 'elsewhere' will be ignored. Also
note that when creating new factor groups which do not exist in the original
coding string (e.g. factor group 9 in the example above), the user should
include an 'elsewhere' condition if there is a possibility that none of the
other conditions for creating the new group will apply to some of the tokens;
otherwise, those tokens will be flagged with an error message ('no conditions
apply and no old value for group x'), and the MAKECELL or MULTICEL run will be
aborted.
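The order-dependent behaviour of recode conditions can be sketched in Python. The names recode_group, col, both, and elsewhere are invented for this illustration, and test clauses are written as Python functions for brevity. Keeping the old value when no condition applies to an existing group is inferred from the wording of the error message quoted above; it is an assumption, not a documented rule.

```python
def col(group, value):
    """Test: column `group` of the coding string contains `value`."""
    return lambda coding: coding[group - 1] == value


def both(*tests):
    """Test: all of the given tests are true ('and')."""
    return lambda coding: all(test(coding) for test in tests)


def elsewhere(coding):
    return True  # always true


def recode_group(group, conditions, coding):
    """Return the recoded value for one factor group.

    `conditions` is a list of (new value, test) pairs in file order;
    a new value of None stands for 'nil' (the token is excluded).
    """
    for new_value, test in conditions:
        if test(coding):
            return new_value  # first satisfied condition wins
    if group <= len(coding):
        return coding[group - 1]  # assumption: keep the old value
    raise ValueError(f"no conditions apply and no old value for group {group}")


# Factor group 9 from the sample condition file.
group9 = [
    ("1", both(col(2, "x"), col(8, "a"))),
    ("2", both(col(2, "x"), col(8, "b"))),
    ("3", both(col(2, "y"), col(8, "a"))),
    ("4", both(col(2, "y"), col(8, "b"))),
    ("/", elsewhere),
]
print(recode_group(9, group9, "dxabcdea"))  # 1
print(recode_group(9, group9, "dzabcdea"))  # /
```

Dropping the final 'elsewhere' pair from group9 makes the second call raise the error, which corresponds to the aborted run described above.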
For those users not familiar with LISP syntax, the following requirements
should be noted: 1) the list of conditions, i.e. the entire contents of the
file, must be enclosed within a set of parentheses; 2) each element of the
list, i.e. factor group number plus optional recode conditions, must be
enclosed in parentheses; 3) each recode condition must be enclosed in
parentheses; 4) each predicate, except for those defined within the factor
definition file, must be enclosed in parentheses.
Other than the restrictions specified above, the format of a condition
file is fairly free: it is not necessary that any particular element appear in
any particular position on a line, since parentheses completely determine the
structure of the data within the condition file. Individual elements within
the file are terminated by space, open parenthesis, close parenthesis, or end
of line.
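Because parentheses alone determine the structure, a condition file can be read by a very small parser. The Python sketch below is an illustration only; the names tokenize, read, and parse are invented, and the result is a nested Python list in which every element is still a string.

```python
def tokenize(text):
    """Split text into '(', ')' and elements ended by space, paren, or EOL."""
    tokens, word = [], ""
    for ch in text:
        if ch in "() \t\r\n":
            if word:
                tokens.append(word)
                word = ""
            if ch in "()":
                tokens.append(ch)
        else:
            word += ch
    if word:
        tokens.append(word)
    return tokens


def read(tokens, pos=0):
    """Read one item starting at tokens[pos]; return (item, next position)."""
    tok = tokens[pos]
    if tok == ")":
        raise ValueError("unexpected close parenthesis")
    if tok != "(":
        return tok, pos + 1
    items = []
    pos += 1
    while pos < len(tokens) and tokens[pos] != ")":
        item, pos = read(tokens, pos)
        items.append(item)
    if pos == len(tokens):
        raise ValueError("missing close parenthesis")
    return items, pos + 1


def parse(text):
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("empty condition file")
    item, pos = read(tokens)
    if pos != len(tokens):
        raise ValueError("extra text after the closing parenthesis")
    return item


text = """(
 (1 (d (col 1 d))
    (/ (elsewhere)))
 (5)
)"""
print(parse(text))
# [['1', ['d', ['col', '1', 'd']], ['/', ['elsewhere']]], ['5']]
```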
E. FACTOR DEFINITION FILE
The factor definition file defines test clause predicates and predicate
labels to be used in the condition file. The option of using a factor
definition file was incorporated within the set-up functions under the
assumption that a meaningful label within the condition file will be easier for
the user to recognize than a series of coded predicates.
Each element of the factor definition file consists of a label plus a test
clause. A label is a string of 1 to 78 characters; space, open parenthesis,
and close parenthesis are the only invalid characters within a label. The
format of the test clause is the same as that described above in Section II.D.;
in particular, the test clause must be enclosed in parentheses. Factor
definition file test clauses may not reference other factor definition file
labels. The maximum length of a line in the factor definition file is 78
characters.
As MAKECELL and MULTICEL process the condition file, if an unknown
predicate is encountered, the factor definition file is searched for a label
matching that predicate. If a matching label is found, the test clause from
the factor definition file is used as a replacement for the label in the
condition file. If a matching label is not found, a condition file error is
generated.
The format of the factor definition file, like that of the condition file,
is fairly free: it is not necessary that any particular element appear in any
particular position on a line, since parentheses completely determine the
structure of the data in the factor definition file. Individual elements are
terminated by space, open parenthesis, close parenthesis, or end of line.
A sample factor definition file is shown below:
np-obj   (or (col 3 n) (col 3 h))
pro-obj  (or (col 3 1) (col 3 2) (col 3 3) (col 3 w) (col 3 u) (col 3 y)
             (col 3 p) (col 3 t) (col 3 r) (col 3 x))
np-np    (and (or (col 3 n) (col 3 h)) (col 7 n))
early    (or (col 13 0) (col 13 1) (col 13 2) (col 13 3) (col 13 4)
             (col 13 5))
middle   (or (col 13 6) (col 13 7) (col 13 8))
late     (or (col 13 9))
pos      (col 6 p)
neg      (or (col 6 f) (col 6 n) (col 6 o))
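The label substitution described in this section can be sketched in Python. The function name expand and the dictionary of definitions are invented for the illustration; test clauses are nested lists, and a bare string stands for a label, since labels are the only predicates not enclosed in parentheses. Labels are compared exactly, so upper and lower case are distinct.

```python
def expand(clause, definitions):
    """Replace every label in `clause` by its test clause."""
    if isinstance(clause, str):  # a bare label such as np-obj
        if clause not in definitions:
            raise KeyError(f"undefined predicate: {clause}")
        return definitions[clause]
    name = clause[0].lower()
    if name in ("and", "or", "not"):
        return [clause[0]] + [expand(arg, definitions) for arg in clause[1:]]
    if name in ("col", "elsewhere"):
        return clause
    raise ValueError(f"unknown predicate: {clause[0]}")


# Two entries taken from the sample factor definition file.
definitions = {
    "np-obj": ["or", ["col", 3, "n"], ["col", 3, "h"]],
    "pos": ["col", 6, "p"],
}
print(expand(["and", "np-obj", "pos"], definitions))
# ['and', ['or', ['col', 3, 'n'], ['col', 3, 'h']], ['col', 6, 'p']]
```

Definitions are substituted once and are not expanded again, in keeping with the rule that test clauses in the factor definition file may not reference other labels.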
F. CELL FILE (MAKECELL)
The cell file is created by MAKECELL, and used as input to IVARB. The
format of the cell file is given below:
1. First n lines: these lines are simply copied to the output file by
IVARB. They contain information such as header, date and time of creation of
cell file; number of cells; names of token file, condition file, and factor
definition file; application value; recoding conditions; factor definitions;
applications, totals, and percent application for each factor; and knock-out
factors and singleton factor groups.
2. Next line contains "   FACTORGROUPS" (3 leading spaces) starting in
column 1. From this line on, the data in the file is used by IVARB.
3. Next line contains the number of independent factor groups. A leading
space is used if there are fewer than 10 factor groups.
4. Next i lines, where i = number of independent factor groups, list the
factors for each group, starting in column 1.
5. Next j lines, where j = number of cells, contain the following
information:
columns 1-4: number of applications, right-justified
columns 5-8: total number in cell, right-justified
next i columns, where i = number of independent factor groups: cell
contents
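The fixed-column layout of a cell line in item 5 can be illustrated in Python. The function names format_cell_line and parse_cell_line are invented for this sketch, which assumes one character per independent factor group, as in the coding strings.

```python
def format_cell_line(applications, total, factors):
    """Columns 1-4: applications; columns 5-8: total; then the cell contents."""
    return f"{applications:>4d}{total:>4d}{factors}"


def parse_cell_line(line, n_groups):
    """Return (applications, total, cell contents) from one cell line."""
    applications = int(line[0:4])
    total = int(line[4:8])
    # Do not strip the contents: a space is a legitimate value there,
    # since '/' is converted to space.
    factors = line[8:8 + n_groups]
    return applications, total, factors


line = format_cell_line(12, 40, "ax3")
print(repr(line))                # '  12  40ax3'
print(parse_cell_line(line, 3))  # (12, 40, 'ax3')
```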
G. CELL FILE (MULTICEL)
The cell file created by MULTICEL, to be used as input to MVARB, is very
similar to that created by MAKECELL. The format is given below:
1. First n lines: these lines are simply copied to the output file by
MVARB. They contain information such as header, date and time of creation of
cell file; number of cells; names of token file, condition file, and factor
definition file; application values; recoding conditions; factor definitions;
applications and percent application for each factor; and knock-out factors and
singleton factor groups.
2. Next line contains "   PARAMETERS" (3 leading spaces) starting in
column 1. From this line on, the data in the file is used by MVARB.
3. Next line contains, starting in column 1, the number of variants of the
dependent variable, followed immediately by a list of the variants starting in
column 2.
4. Next line contains the number of independent factor groups. A leading
space is used if there are fewer than 10 factor groups.
5. Next i lines, where i = number of independent factor groups, each contain
the number of factors in one group, right-justified in a 4-column field
starting in column 1, followed immediately by the list of factors in that
group, starting in column 5.
6. Next j lines, where j = number of cells, contain the following
information:
columns 1-4, 5-8, 9-12, etc.: number of applications, right-justified,
for each variant of the dependent variable
next i columns, where i = number of independent factor groups: cell
contents
7. Last line contains '-1' right-justified in the first four columns.
H. SORTING CONDITION FILE (TSORT)
A sorting condition file is optional input for TSORT (all data in the
sorting condition file may instead be input through the console). The sorting
condition file specifies the conditions on coding strings under which tokens
will be copied to a separate disk file. These conditions are specified in the
same LISP format and with the same predicates ('and', 'or', 'not', 'col', but
not 'elsewhere') as those used for recode conditions in the factor definition
file and the condition file for MAKECELL and MULTICEL (see Section II.D.).
Examples are given below:
Example 1:
(col 1 g)
The above condition specifies that all tokens coded 'g' in column 1 be
copied to a separate file.
Example 2:
(or (col 1 g) (col 13 a))
The above condition specifies that all tokens coded either 'g' in column 1
or 'a' in column 13 be copied to a separate file.
Example 3:
(not (col 2 x))
The above condition specifies that all tokens not coded 'x' in column 2 be
copied to a separate file.
Example 4:
(and (col 2 c) (or (col 1 a) (col 3 b)))
The above condition specifies that all tokens coded 'c' in column 2 and
either 'a' in column 1 or 'b' in column 3 be copied to a separate file.
Example 5:
(or (col 2 c) (and (col 1 a) (col 3 b)))
The above condition specifies that all tokens either coded 'c' in column 2
or coded 'a' in column 1 and 'b' in column 3 be copied to a separate file.
Example 6:
(and (col 3 c) (not (col 4 d)))
The above condition specifies that all tokens coded 'c' in column 3 and
not coded 'd' in column 4 be copied to a separate file.
III. PROGRAM DESCRIPTIONS
A. CHECKTOK
CHECKTOK performs three functions: 1) it compares the contents of coding
strings in the data file to specified lists of legal factors for each factor
group, and flags illegal factors; 2) it replaces each occurrence of the default
character '.' by the specified default value for the factor group; 3) it pads
short coding strings, either with the 'doesn't apply' character '/' or with the
default value for the factor group. If no errors are found in the data file,
CHECKTOK creates a new data file; the only difference between the original data
file and the new one is that the coding strings have been modified according to
2) and 3) above. It is the user's responsibility to delete the original data
file after running CHECKTOK and to rename the new data file, if desired. Only
data files created by CHECKTOK should be used as input to READTOK.
CHECKTOK first requests the following information from the user: number
of factor groups in the coding string, fill character for short coding strings,
and legal factors and default values for each factor group. This information
can be input either through the console or else through a previously created
disk file, the factor specification file; the user indicates the source of the
input by responding 'f' or 'c' to
FACTOR SPECIFICATIONS FROM FILE [= F] OR FROM CONSOLE [= C]:
If the user types 'f', the system requests FILE NAME. The user enters the name
of the file containing the factor specifications, the questions listed below
are not displayed, and the information contained in the file is read. The
restrictions on the data contained in the factor specification file -- number
of factor groups, fill character, legal factors, and default values -- are the
same as those described below for console input. For a complete description of
the factor specification file format, see Section II.B.
If the user types 'c', the questions listed below will be displayed one by
one on the screen (system output is in upper case):
NUMBER OF FACTOR GROUPS: the length of the coding strings in the data
file, an integer between 1 and 78; if the coding strings are different
lengths, the length of the longest string should be used. Note that
it is not possible to check only a portion of the coding string, for
example the first five characters of a ten-character coding string.
FILL SHORT CODING STRINGS WITH... [D = DEFAULT CHAR, / = GROUP DOESN'T
APPLY]: if the user response is 'd', short coding strings will be
padded with the default value for each factor group. If the
response is '/', short coding strings will be padded with the 'doesn't
apply' character. The user does not have the option of not padding
short coding strings: READTOK, which takes as input the data file
created by CHECKTOK, requires that all coding strings be the same
length. Since omitting a column within a long coding string is a
common user error, it is suggested that factors in one factor group
not be repeated in another factor group, or, at least, that factors in
one factor group be distinct from those in adjacent groups. If this
precaution is followed, the user will be able to determine that a
column has been omitted, since there will be flagged illegal factors
within the adjacent group.
For each factor group, CHECKTOK requires the following:
LEGAL FACTORS: the legal factors for that factor group, a string of 2 to
30 characters. Space and left parenthesis are not legal factors, and
are flagged as errors. It is unnecessary to specify the default
character '.' or the 'doesn't apply' character '/' as factors for any
factor group, since they are legal for all groups; these two
characters are flagged as errors when the user inputs the factor list.
CHECKTOK does not check for duplicate factors within the list of legal
factors.
DEFAULT VALUE: a single character, or 'nil'. If 'nil' is specified,
then that factor group has no default value; if the factor group is
coded '.' in any coding string in the data file, that coding string
will be flagged with an error message. The default value, if it is
not '/' or 'nil', must be included in the string of legal factors for
the group; therefore space, left parenthesis, and period are not legal
default values.
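The restrictions on LEGAL FACTORS and DEFAULT VALUE can be summarized in a short Python check. This is a sketch of the rules as stated here, not the CHECKTOK source; the function name check_group_spec and the wording of the messages are invented.

```python
def check_group_spec(legal, default):
    """Return a list of error messages for one factor group (empty if none)."""
    errors = []
    if not 2 <= len(legal) <= 30:
        errors.append("legal factor string must be 2 to 30 characters long")
    for ch in legal:
        # Space, left parenthesis, '.' and '/' may not be listed as factors.
        if ch in " (./":
            errors.append(f"illegal character in factor list: {ch!r}")
    if default.lower() != "nil" and default != "/":
        if len(default) != 1:
            errors.append("default value must be a single character or 'nil'")
        elif default not in legal:
            errors.append(f"default value {default!r} is not a legal factor")
    return errors


print(check_group_spec("abcdefg", "d"))  # []
print(check_group_spec("xyz", "/"))      # []
print(check_group_spec("ds", "nil"))     # []
print(check_group_spec("a.b", "q"))      # two error messages
```

As in CHECKTOK, duplicate factors within the list are not checked.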
It is recommended that a factor specification file be used rather than
console input, especially if the number of factor groups and legal factors is
large, or if more than one data file is to be checked.
If an error is detected during console input, the question is displayed on
the screen again; an error message may also be displayed. If an error is
detected while processing the factor specification file, a message is displayed
on the screen, an error message is written to file CHECKTOK.ERR, and CHECKTOK
is terminated without processing the data file(s).
When user input from console or factor specification file is complete with
no errors, CHECKTOK requests the names of the data files to be read and
written:
DATA FILE TO BE READ: name of data file to be processed, or carriage
return to exit CHECKTOK.
DATA FILE TO BE WRITTEN: name of data file to be created (with coding
string modifications as described above), or carriage return to exit
CHECKTOK. The name of the data file to be written must be different
from the name of the data file to be read.
CHECKTOK then processes the original data file, checking the contents of coding
strings against the lists of legal factors, substituting default values for the
default character, and padding short coding strings according to user
specification. CHECKTOK processes the complete data file whether or not errors
are detected in the coding strings. If errors are detected, messages are
written to file CHECKTOK.ERR, and a new data file is not created. If no errors
are detected, CHECKTOK creates a new data file with modified coding strings.
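The check of one coding string against the lists of legal factors can be sketched as follows. The function name check_coding_string and the message texts are invented, and treating a coding string longer than the number of factor groups as an error is an assumption; a group with no default value is represented by None.

```python
def check_coding_string(coding, groups):
    """Return error messages for one coding string (empty list if legal).

    `groups` holds one (legal factors, default value or None) pair
    per factor group.
    """
    errors = []
    if len(coding) > len(groups):
        # Assumption: extra columns are reported rather than ignored.
        errors.append("coding string longer than the number of factor groups")
    for index, ch in enumerate(coding[:len(groups)]):
        legal, default = groups[index]
        if ch == "/":
            continue  # 'doesn't apply' is legal in every group
        if ch == ".":
            if default is None:
                errors.append(f"group {index + 1}: '.' used but no default value")
            continue
        if ch not in legal:
            errors.append(f"group {index + 1}: illegal factor {ch!r}")
    return errors


groups = [("ds", None), ("abcdefg", "d"), ("xyz", "/"), ("1234", "3")]
print(check_coding_string("d.x2", groups))  # []
print(check_coding_string("dx2", groups))   # errors for groups 2 and 3
```

The second call shows the effect of an omitted column: because adjacent groups use distinct factors, the shifted values are flagged as illegal.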
CHECKTOK creates one new data file for each original one; it cannot
combine two or more data files. After CHECKTOK has completely processed one
data file, the names of the next data files to be read and written are
requested. To terminate this loop and exit CHECKTOK, enter carriage return.
CHECKTOK uses the same factor specification data to process each data file.
To use different factor specification data for different data files, the user
must exit CHECKTOK and then begin another CHECKTOK run.
Within file CHECKTOK.ERR, each error message contains an error number.
This number is for debugging purposes only, and should be ignored by the user.
The line number in CHECKTOK.ERR messages refers to lines with content -- blank
lines within the factor specification file and the data file are not counted.
B. READTOK
READTOK reads coding strings from one or more data files and writes them
to a token file. All coding strings in the data file(s) must be the same
length; the maximum length of a coding string is 78 characters.
READTOK first requests the user to specify TOKEN FILE TO BE WRITTEN, and
then DATA FILE TO BE READ. After it has processed the data file and written
the coding strings to the token file, it requests the name of the next data
file. READTOK continues to request data file names and process data files
until the user enters carriage return in response to DATA FILE TO BE READ; the
coding strings from all of the data files are written to the same token file.
READTOK, TSORT, and TEXTSORT are the only PASCAL programs which accept data
from more than one input file to be written to one output file: CHECKTOK
creates one new data file for each original data file, and MAKECELL and
MULTICEL accept only one token file.
To create more than one token file, the user must exit READTOK
(by entering carriage return in response to DATA FILE TO BE READ) and then run
READTOK again.
If any errors are detected within the data files (i.e., if any coding
string is longer than 78 characters, or if all of the coding strings in all of
the data files are not the same length), a message is displayed on the screen
and an error message indicating data file name, token number, line number, and
type of error is written to file READTOK.ERR. Note that the line number refers
to lines with content; blank lines are not counted.
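The two length checks can be sketched in Python. The function name check_token_lengths is invented, and taking the first acceptable coding string as the reference length is an assumption; the text says only that all coding strings must be the same length.

```python
MAX_LENGTH = 78


def check_token_lengths(coding_strings):
    """Return (token number, message) pairs for every length error."""
    errors = []
    expected = None
    for number, coding in enumerate(coding_strings, start=1):
        if len(coding) > MAX_LENGTH:
            errors.append((number, "coding string longer than 78 characters"))
            continue
        if expected is None:
            expected = len(coding)  # assumption: first string sets the length
        elif len(coding) != expected:
            errors.append((number, f"length {len(coding)}, expected {expected}"))
    return errors


print(check_token_lengths(["abqw", "adqw", "ab"]))
# [(3, 'length 2, expected 4')]
```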
The user should never exit READTOK by entering control-z or control-c in
response to DATA FILE TO BE READ; these responses prevent the orderly closing
of files, and data written to the token file will be lost.
C. MAKECELL
MAKECELL creates the cell file for input to IVARB. MAKECELL requests the
following information from the user: name of token file, name of cell file to
be created, optional header line to be written to the cell file (maximum of 78
characters long), name of condition file, name of factor definition file
(optional), and application value (factor within the factor group containing
the dependent variable which counts as an application of the variable rule).
MAKECELL recodes the tokens in the token file according to the
specifications in the condition file and factor definition file (see Sections
II.D. and II.E.); builds the cells for input to IVARB; and counts occurrences
and calculates percent application for each factor. All of this information is
written to the cell file (see Section II.F. for cell file format).
MAKECELL imposes the following limits on the input data:
Maximum number of independent factor groups = 20
Maximum number of factors per independent factor group = 30
Maximum number of factors in all independent factor groups = 49
Maximum number of cells = 1000
If format errors are detected in any of the three input files, or if
either of the first two limits above is exceeded, an error message is written
to the screen and to disk file MAKECELL.ERR, and MAKECELL is terminated without
creating a cell file. The error message written to MAKECELL.ERR indicates the
name of the file in which the error occurred, the line number (as in CHECKTOK,
blank lines are not counted), the type of error, and other information to help
the user to determine the source of the error.
If the maximum number of factors in all factor groups is exceeded, the
user is asked whether the cell file should still be created; the information
contained in the cell file, especially the number of occurrences and percent
application for each factor, may help the user to decide how to modify the
input so that this limit is not exceeded. Cell files created under these
conditions should not be used as input to IVARB or CROSSTAB, which will abort
if the cell file contains data for more than 49 factors.
If the maximum number of cells is exceeded, an error message is written to
the screen, and MAKECELL is terminated without creating a cell file.
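The way the four limits are treated can be summarized in a Python sketch. The
constants come from the list above; the function name, the message texts, and
the representation of the input (one set of factors per independent factor
group, plus a cell count computed elsewhere) are assumptions made for
illustration and do not describe the MAKECELL source.

```python
MAX_GROUPS = 20             # independent factor groups
MAX_FACTORS_PER_GROUP = 30  # factors per independent factor group
MAX_TOTAL_FACTORS = 49      # factors in all independent factor groups
MAX_CELLS = 1000


def check_makecell_limits(factors_by_group, cell_count):
    """Return (fatal, warnings), two lists of messages.

    fatal    -- conditions under which no cell file is created
    warnings -- the one condition under which the user is asked whether
                the cell file should still be created
    """
    fatal, warnings = [], []
    if len(factors_by_group) > MAX_GROUPS:
        fatal.append("too many independent factor groups")
    for number, factors in enumerate(factors_by_group, start=1):
        if len(factors) > MAX_FACTORS_PER_GROUP:
            fatal.append(f"too many factors in group {number}")
    if sum(len(factors) for factors in factors_by_group) > MAX_TOTAL_FACTORS:
        warnings.append("too many factors in all groups")
    if cell_count > MAX_CELLS:
        fatal.append("too many cells")
    return fatal, warnings
```

For example, two groups of 25 factors each pass the per-group limit but
produce a warning, since the total of 50 factors exceeds 49.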
There are three other error conditions which do not prevent the creation
of a cell file but which must be eliminated before running IVARB: knockout
factors, singleton factor groups, and factor groups containing no factors.
Knockout factors are those factors for which the percent application is either
0 or 100. Singleton factor groups (those which have only one factor occurring
in the data) and factor groups containing no factors can result either from
the original coding in the data file or from recoding specified in the
condition file. Error messages indicating these conditions are written to the
cell file on the line containing the number of occurrences and percent
application for these factors.
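As an illustration of these three conditions, the following Python sketch
computes the number of applications, the total, and the percent application
for each factor in one independent factor group, and flags knockout factors,
singleton groups, and groups with no factors. The function name, the status
strings, and the input representation (pairs of dependent-variable value and
factor) are assumptions of the sketch, not the MAKECELL implementation.

```python
from collections import Counter


def summarize_group(pairs, application_value):
    """Summarize one independent factor group.

    pairs -- iterable of (dependent_value, factor) for each token
    Returns (rows, status), where rows maps each factor to
    (applications, total, percent_application, is_knockout) and status is
    "ok", "singleton", or "no factors".
    """
    totals, applications = Counter(), Counter()
    for dependent, factor in pairs:
        totals[factor] += 1
        if dependent == application_value:
            applications[factor] += 1
    rows = {}
    for factor, total in totals.items():
        applied = applications[factor]
        percent = 100.0 * applied / total
        # A knockout factor applies in none or in all of its tokens.
        knockout = applied == 0 or applied == total
        rows[factor] = (applied, total, percent, knockout)
    if not rows:
        status = "no factors"
    elif len(rows) == 1:
        status = "singleton"
    else:
        status = "ok"
    return rows, status
```

The knockout test compares counts rather than the computed percentage, so it
does not depend on how the percentage is rounded for display.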
Within file MAKECELL.ERR, each error message contains an error number.
This number is for debugging purposes only, and should be ignored by the user.
The error messages for MAKECELL were intended to be as specific as possible to
help the user determine the source of the error; but at times these messages
may be confusing, since it is unlikely that the user analyzes the data in the
files in the same way that the program does. In particular, the line number in
the error message may point to the line in the file after the one actually
containing the invalid data.
Page 20
Program Descriptions
D. MULTICEL
MULTICEL creates the cell file for input to MVARB. All of the
documentation for MAKECELL in Section III.C. applies to MULTICEL, with the
following exceptions:
1. The user response to APPLICATION VALUES is a string of up to 9
characters, rather than a single character.
2. Error messages are written to file MULTICEL.ERR rather than to
MAKECELL.ERR.
3. MULTICEL imposes the following limits on the input data:
Maximum number of variants in the dependent factor group = 9
Maximum number of independent factor groups = 20
Maximum number of factors per independent factor group = 30
Maximum number of factors in all independent factor groups = 135
Maximum number of cells = 1000
Note that MVARB, the multinomial version of IVARB, has not yet been
implemented on the IBM PC. Until MVARB is released, MULTICEL can be used only
to calculate the number of applications and percent application of the factors
in the independent factor groups. Neither IVARB nor CROSSTAB will run on a
cell file created by MULTICEL.
See Section II.G. for a description of the cell file created by MULTICEL.
Page 21
Program Descriptions
E. COUNTUP
COUNTUP counts occurrences and calculates percentages for factors within a
specified factor group in a token file. COUNTUP requires the following
information from the user: name of token file, name of output file, and factor
group number.
COUNTUP writes the following information to the output file: number of
tokens in the token file; number of occurrences of the 'doesn't apply'
character '/' in the specified factor group plus percentage of the total token
count; total number of occurrences of non-'/' factors in the specified factor
group plus percentage of the total token count; and for each factor in the
group, number of occurrences plus percentage of the total non-'/' count.
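The counts and percentages listed above can be sketched in Python as follows.
The sketch assumes that factor group n corresponds to column n of the coding
string, and that the limit of 30 factors applies to the non-'/' factors; the
function name, the result keys, and the error texts are hypothetical.

```python
from collections import Counter

DOES_NOT_APPLY = "/"
MAX_FACTORS = 30


def count_up(coding_strings, group_number):
    """Count the factors in one factor group (1-based column number)."""
    total = len(coding_strings)
    if total == 0:
        raise ValueError("token file contains no tokens")
    counts = Counter()
    for coding in coding_strings:
        if len(coding) < group_number:
            raise ValueError("coding string too short for factor group")
        counts[coding[group_number - 1]] += 1
    not_applicable = counts.pop(DOES_NOT_APPLY, 0)
    applicable = total - not_applicable
    if len(counts) > MAX_FACTORS:
        raise ValueError("more than 30 factors in factor group")
    return {
        "tokens": total,
        # '/' and non-'/' counts are percentages of the total token count
        "not_applicable": (not_applicable, 100.0 * not_applicable / total),
        "applicable": (applicable, 100.0 * applicable / total),
        # each factor is a percentage of the non-'/' count
        "factors": {f: (n, 100.0 * n / applicable) for f, n in counts.items()},
    }
```

Note the two different denominators: the '/' and non-'/' lines are computed
against the total token count, the individual factors against the non-'/'
count.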
Three errors can occur during a COUNTUP run: a coding string within the
token file is too short to contain the specified factor group; the number of
factors within the specified group is greater than 30; or the token file
contains no tokens. If an error is detected, COUNTUP writes an error message
to the screen and to disk file COUNTUP.ERR and terminates without processing
the remainder of the token file. If no errors are detected, COUNTUP writes the
output to the specified file, and requests the number of the next factor group
to be counted. COUNTUP continues to request and process factor groups for the
specified token file until the user enters a carriage return in response to
FACTOR GROUP to exit COUNTUP.
Page 22
Program Descriptions
F. CROSSTAB
CROSSTAB calculates two-dimensional cross-tabulations of cell data.
CROSSTAB requires as input the name of a cell file created by MAKECELL, the
name of an output file, and two factor group numbers. These factor group
numbers refer to the sequential factor group numbers in the cell file, not to
the original factor groups in the coding string. Note that on the IBM PC,
CROSSTAB must be run on a computer which has an 8087 chip.
The first section of CROSSTAB output consists of the first n lines of the
cell file (see Section II.F.), plus the number of cells and a list of factors.
The format of the second section, the cross-tabulation data, is shown below:
            F(x,1)   F(x,2)   ...   F(x,n)   T(y)
            A(1,1)   A(1,2)   ...   A(1,n)   A(1)
F(y,1)      T(1,1)   T(1,2)   ...   T(1,n)   T(1)
            P(1,1)   P(1,2)   ...   P(1,n)   P(1)
            A(2,1)   A(2,2)   ...   A(2,n)   A(2)
F(y,2)      T(2,1)   T(2,2)   ...   T(2,n)   T(2)
            P(2,1)   P(2,2)   ...   P(2,n)   P(2)
  .           .        .              .       .
  .           .        .              .       .
  .           .        .              .       .
            A(m,1)   A(m,2)   ...   A(m,n)   A(m)
F(y,m)      T(m,1)   T(m,2)   ...   T(m,n)   T(m)
            P(m,1)   P(m,2)   ...   P(m,n)   P(m)
            A(1)     A(2)     ...   A(n)     A
T(x)        T(1)     T(2)     ...   T(n)     T
            P(1)     P(2)     ...   P(n)     P
where x is the first factor group specified
y is the second factor group specified
F(x,i) is factor i of group x, where i goes from 1 to the number
of factors in group x
F(y,j) is factor j of group y, where j goes from 1 to the number
of factors in group y
A(k,l), T(k,l), P(k,l) are the number of applications, the
total, and the percent application for tokens with factor k
in group y and factor l in group x
the data in the row and column labeled T(x) and T(y) are the
marginals for factor groups x and y
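The quantities A, T, and P and their marginals can be computed as in the
following Python sketch. It does not read a cell file: the input is assumed to
be a list of records already reduced to the two selected factor groups, each
record giving the factor in group x, the factor in group y, the number of
applications, and the total. The function name and the use of None to mark a
marginal are conventions of the sketch only.

```python
from collections import defaultdict


def cross_tabulate(records):
    """Cross-tabulate (x_factor, y_factor, applications, total) records.

    Returns a dict keyed by (y_factor, x_factor). A key with None in
    place of a factor is a marginal: (y, None) is the T(y) column,
    (None, x) is the T(x) row, and (None, None) is the grand total.
    Each value is (applications, total, percent_application).
    """
    table = defaultdict(lambda: [0, 0])
    for x, y, applications, total in records:
        for key in ((y, x), (y, None), (None, x), (None, None)):
            table[key][0] += applications
            table[key][1] += total
    return {
        key: (a, t, 100.0 * a / t if t else None)
        for key, (a, t) in table.items()
    }
```

Combinations of factors which do not occur in the records are simply absent
from the result; how CROSSTAB itself prints such empty positions is not
covered by this sketch.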
Page 23
Program Descriptions
G. IVARB
IVARB is the variable rule program. IVARB requires the following
information from the user: cell file name (created by MAKECELL), output file
name, non-applications multiplier (1 or 2 digits), and indication of one-level
or step up and down. If one-level is selected, a response to 'print expected
values and chi-square per cell' is also requested. On the IBM PC, error
checking is performed wherever possible on user input; however, some invalid
responses will cause IVARB to abort. In particular, a non-numeric response for
the non-applications multiplier cannot be handled by the program. Note that on
the IBM PC, IVARB must be run on a computer which has an 8087 chip.
Page 24
Program Descriptions
H. TSORT
TSORT is a PASCAL program which enables a user to 'sort' data files
according to coding strings: it copies to a separate disk file all tokens with
user-specified coding string values. The sorting conditions are specified
either through the console or else within a previously created disk file; these
conditions are specified with the same predicates ('and', 'or', 'not', 'col',
but not 'elsewhere') as those used for recode conditions in the factor
definition file and the condition file for MAKECELL and MULTICEL (see Section
II.D.).
TSORT first requests
SORTING CONDITIONS FROM FILE [=F] OR FROM CONSOLE [=C]:
If the user responds 'f', the program requests a file name and then reads and
processes the conditions contained in that file (see Section II.H. for examples
of sorting condition files). If a format error is detected in the file, an
error message is written to the screen and to disk file TSORT.ERR, and TSORT
is terminated.
If the user responds 'c', the program requests predicates and predicate
arguments from the console, guiding the user through the condition input.
Examples are shown below (system output is in upper case, user input in lower
case):
Example 1:
PREDICATE: col
COLUMN NUMBER: 1
COLUMN VALUE: g
The above condition specifies that all tokens coded 'g' in column 1 be
copied to a separate file.
Example 2:
PREDICATE: or
OR_ARGUMENT #1: col
COLUMN NUMBER: 1
COLUMN VALUE: g
OR_ARGUMENT #2: col
COLUMN NUMBER: 13
COLUMN VALUE: a
OR_ARGUMENT #3: (carriage return)
The above condition specifies that all tokens coded either 'g' in column 1
or 'a' in column 13 be copied to a separate file.
Example 3:
PREDICATE: not
NOT_ARGUMENT: col
COLUMN NUMBER: 2
COLUMN VALUE: x
The above condition specifies that all tokens not coded 'x' in column 2 be
copied to a separate file.
Example 4:
PREDICATE: and
AND_ARGUMENT #1: col
COLUMN NUMBER: 2
COLUMN VALUE: c
AND_ARGUMENT #2: or
OR_ARGUMENT #1: col
COLUMN NUMBER: 1
COLUMN VALUE: a
OR_ARGUMENT #2: col
COLUMN NUMBER: 3
COLUMN VALUE: b
OR_ARGUMENT #3: (carriage return)
AND_ARGUMENT #3: (carriage return)
The above condition specifies that all tokens coded 'c' in column 2 and
either 'a' in column 1 or 'b' in column 3 be copied to a separate file.
Example 5:
PREDICATE: or
OR_ARGUMENT #1: col
COLUMN NUMBER: 2
COLUMN VALUE: c
OR_ARGUMENT #2: and
AND_ARGUMENT #1: col
COLUMN NUMBER: 1
COLUMN VALUE: a
AND_ARGUMENT #2: col
COLUMN NUMBER: 3
COLUMN VALUE: b
AND_ARGUMENT #3: (carriage return)
OR_ARGUMENT #3: (carriage return)
The above condition specifies that all tokens either coded 'c' in column 2
or coded 'a' in column 1 and 'b' in column 3 be copied to a separate file.
Example 6:
PREDICATE: and
AND_ARGUMENT #1: col
COLUMN NUMBER: 3
COLUMN VALUE: c
AND_ARGUMENT #2: not
NOT_ARGUMENT: col
COLUMN NUMBER: 4
COLUMN VALUE: d
AND_ARGUMENT #3: (carriage return)
The above condition specifies that all tokens coded 'c' in column 3 and
not coded 'd' in column 4 be copied to a separate file.
Notice that only one top-level predicate can be input, either through the
console or within a file. As in condition files for MAKECELL and MULTICEL,
'not' takes one argument; 'and' and 'or' take up to twenty arguments.
After the sorting condition has been input and processed, the user is
asked
MATCH ON SHORT CODING STRINGS?
If the user responds 'n', then the condition is 'matched' by a coding string
only if factor groups exist in the data which satisfy the specified condition.
For example, if a data file contains coding strings only 3 characters long, then
the condition in example 6 above will not be matched by any coding string in
the data file, even if some tokens are coded 'c' in column 3.
If the user responds 'y', then the condition is 'matched' by a coding
string if the factor groups which exist in the data satisfy those parts of the
condition which reference them. For example, if a data file contains coding
strings which are 3 characters long, then the condition in example 6 above will
be matched by all coding strings in the file which have 'c' in column 3. If a
data file contains coding strings which are 2 characters long, then the
condition in example 2 above will be matched by all coding strings in the file
which contain 'g' in column 1; the condition in example 4 above will be matched
by all coding strings in the file which contain 'c' in column 2 and 'a' in
column 1; the condition in example 5 above will be matched by all coding
strings in the file which contain 'c' in column 2 or 'a' in column 1. Note,
however, that some part of the condition must be satisfied: if a data file
contains coding strings that are 5 characters long, then the condition
(or (not (col 6 x)) (col 7 y))
will not be matched by any coding string. One way to understand the behavior
of TSORT under the 'match short coding strings = y' option is to assume that
those parts of the condition which reference non-existent coding string columns
are simply deleted from the condition. If all parts of the condition except
for the words 'and', 'or', and 'not' are deleted, the result is an empty
condition which cannot be matched by any coding string.
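The 'deletion' view of the 'match short coding strings = y' option can be
expressed as a small Python sketch. It covers the 'y' option only, and it is
one reading of the description above rather than the TSORT implementation.
Conditions are written as nested tuples such as ("col", 3, "c"); this
representation and the function names are assumptions of the sketch. A result
of None stands for a part of the condition that has been deleted.

```python
def evaluate(condition, coding):
    """Return True or False, or None if this part of the condition is
    deleted because it references only non-existent columns."""
    predicate = condition[0]
    if predicate == "col":
        _, column, value = condition
        if column > len(coding):
            return None  # column does not exist in this coding string
        return coding[column - 1] == value
    if predicate == "not":
        inner = evaluate(condition[1], coding)
        return None if inner is None else not inner
    if predicate not in ("and", "or"):
        raise ValueError(f"unknown predicate: {predicate}")
    # Drop deleted arguments; an 'and' or 'or' left with none is deleted.
    results = [evaluate(argument, coding) for argument in condition[1:]]
    results = [r for r in results if r is not None]
    if not results:
        return None
    return all(results) if predicate == "and" else any(results)


def matches_short(condition, coding):
    """An empty (fully deleted) condition is never matched."""
    return evaluate(condition, coding) is True
```

With example 6 written as ("and", ("col", 3, "c"), ("not", ("col", 4, "d"))),
a 3-character coding string is matched exactly when it has 'c' in column 3,
as described above. For coding strings long enough to contain every
referenced column, the sketch reduces to ordinary evaluation of 'and', 'or',
and 'not'.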
TSORT then requests the name of the output file and the names of a
sequence of input data files; all matching tokens from all input files are
written to the same output file. To terminate the input file sequence, enter
carriage return. TSORT then asks whether the user wants another TSORT run; if
the response is 'y', the user will be requested to input another set of
conditions and input and output files; if the response is 'n', the program will
terminate.
The maximum length of input file lines is 120 characters; if a data file
line is longer than 120 characters, the user has the option of truncating the
line and continuing the search on the truncated data, or of terminating the
TSORT run. Note that if the line is truncated, the truncated line rather than
the full line is written to the output file if the coding string matches the
condition.
Page 27
Program Descriptions
I. TEXTSORT
TEXTSORT is a PASCAL program which enables a user to 'sort' data files
according to token text: it copies to a separate disk file all tokens
containing user-specified text strings.
TEXTSORT requires the following input from the user: up to 50 different
search strings, each of which is a maximum of 50 characters in length; the
name of the output file (the file to which tokens containing one or more of the
search strings are written); and the names of a sequence of input data files.
All matching tokens from all input files are written to the same output file.
To terminate the input file sequence, enter carriage return. TEXTSORT then
asks whether the user wants another TEXTSORT run; if the response is 'y', the
user will be requested to input another set of search strings and input and
output files; if the response is 'n', the program will terminate.
The list of search strings may be input from the console or from a
previously created disk file. The format of the file is so simple that it is
not described separately in Section II: each search string is written on a
separate line, starting in column 1. No blank lines are permitted in the file,
and lines are a maximum of 50 characters in length: any data after the
fiftieth character are ignored. Note that very little format checking is
performed on the search string file: if the file contains a line consisting
of 3 spaces, the program will use a string consisting of 3 spaces as
one of the search strings.
There is no restriction on the characters within the search strings:
spaces and all special characters may be included.
TEXTSORT gives the user the option of including the coding string in the
text to be searched.
An example of TEXTSORT-user dialogue is given below (system output is in
upper case, user input in lower case):
INCLUDE CODING STRING IN MATCH? n
SEARCH STRINGS FROM FILE [=F] OR FROM CONSOLE [=C]? c
SEARCH STRING #1: be
SEARCH STRING #2: am
SEARCH STRING #3: are
SEARCH STRING #4: is
SEARCH STRING #5: was
SEARCH STRING #6: were
SEARCH STRING #7: being
SEARCH STRING #8: been
SEARCH STRING #9: (carriage return)
OUTPUT FILE: be.dat
INPUT FILE #1: x1.dat
23 TOKENS WRITTEN TO BE.DAT
INPUT FILE #2: x2.dat
3 TOKENS WRITTEN TO BE.DAT
INPUT FILE #3: (carriage return)
26 TOKENS TOTAL WRITTEN TO BE.DAT
ANOTHER RUN? y
INCLUDE CODING STRING IN MATCH? ...
In the above example, the user is searching files x1.dat and x2.dat for all
forms of the copula, and writing tokens containing the copula to file be.dat.
Note that matching text strings cannot cross line boundaries in the input
files. For example, if the user wants to find all occurrences of the string
'the boy', the token
(abcxyz (I know that the
boy is here.))
will not be matched by TEXTSORT.
The maximum length of input file lines is 120 characters; if a data file
line is longer than 120 characters, the user has the option of truncating the
line and continuing the search on the truncated data, or of terminating the
TEXTSORT run. Note that if the line is truncated, the truncated line rather
than the full line is written to the output file if the token contains the
search string(s). In addition, data which has been lost through truncation
will not be matched by the search string(s).
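The two restrictions above (no matching across line boundaries, and no
matching on truncated data) can be illustrated with a short Python sketch. It
assumes that the user has chosen to truncate long lines, that the whole line
is searched (as if the coding string were included in the match), and that
the comparison is case-sensitive; none of these details, nor the function
name, is taken from the TEXTSORT source.

```python
MAX_LINE = 120  # maximum input line length stated in the text


def token_matches(token_lines, search_strings):
    """Return True if any search string occurs within a single line of
    the token, after each line has been truncated to MAX_LINE characters."""
    for line in token_lines:
        visible = line[:MAX_LINE]  # text lost to truncation is never searched
        if any(search in visible for search in search_strings):
            return True
    return False
```

Because each line is searched separately, the two-line token shown above is
not matched by the search string 'the boy', although it is matched by 'boy'.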
Page 29
Running the Programs
IV. RUNNING THE PROGRAMS
On the VAX, all nine programs (CHECKTOK, READTOK, MAKECELL, MULTICEL,
COUNTUP, CROSSTAB, IVARB, TSORT, and TEXTSORT) are in directory [pintzuk.varb].
All programs are initiated by entering
run [pintzuk.varb]program-name
in response to the system prompt. On the IBM PC, all programs are initiated by
entering the program name in response to the system prompt.
The recommended procedure is to run CHECKTOK on all data files, then
READTOK to create a token file from one or more data files, then MAKECELL or
MULTICEL to create a cell file. COUNTUP can be run any time after a token file
is created. CROSSTAB and IVARB can be run any time after a cell file is
created by MAKECELL. TSORT and TEXTSORT can be run at any time on any data
file.
Page 30
For VAX Users
V. FOR VAX USERS: A COMPARISON OF THE OLD AND NEW VARBRUL SYSTEMS
Although all of the functions performed by the VAX LISP routines are
duplicated in the PASCAL programs, there are many differences between the old
system and the new one. All files used as input to the LISP routines (data
files, condition files, factor definition files) may be used with no change in
content or format for the PASCAL programs, except for the following:
1. The 'doesn't apply' character has been changed from 'b' to '/', and
therefore condition files, factor definition files, and all coding strings in
the data files must be modified accordingly. Within the new system, 'b' can be
used as a normal factor.
2. The maximum length of a data file line is 120 characters. The maximum
length of lines in the condition file and the factor definition file is 78
characters.
3. Test clauses within the factor definition file may not reference
factor definition file labels.
4. MAKECELL and MULTICEL do not delete singleton factor groups; instead,
an error message is printed in the same position as 'KNOCKOUT'. Therefore, the
user must delete all singleton factor groups from condition files.
The correspondence between the new programs and the old ones is as
follows: CHECKTOK is a new program, performing functions not available in the
old system. The PASCAL programs READTOK and COUNTUP perform the same functions
as their LISP counterparts. The PASCAL programs MAKECELL and MULTICEL combine
all of the functions of the LISP routines 'load-conditions', 'load-defs',
'make-cells', and 'make-cellfile'. The FORTRAN programs CROSSTAB and IVARB are
the same as the old versions, except that the 'doesn't apply' character has
been changed from 'b' to '/'. The PASCAL program TSORT performs the same
functions as the LISP version, but input of the sorting condition has been
completely changed. TEXTSORT is a new program, performing functions not
available in the old system.
All output from the PASCAL programs is in the form of disk files; much of
the output from the LISP routines is in the form of LISP lists kept in memory.
For this reason (among others), it is impossible to perform some of the set-up
functions using LISP routines and others using PASCAL programs. In addition,
the versions of IVARB and CROSSTAB in [pintzuk.varb] must be used to process
the cell file created by the PASCAL program MAKECELL, since the 'doesn't apply'
character has been changed.
VAX users should also note the following changes in the new system:
1. The PASCAL programs write to disk all messages indicating errors in
file format and content. These messages are much more specific than those in
the old system.
2. Since data files are not processed as sets of LISP lists, they may
contain comment lines; and it is not necessary that open and close parentheses
match within data file elements.
3. The cell file indicates the original factor group number as well as
the new sequential factor group number. This addition has not yet been
incorporated into IVARB, although these numbers are printed at the beginning of
the IVARB output.
4. The factor definition file is listed within the cell file, and
therefore within the IVARB output.
5. The PASCAL programs READTOK, MAKECELL, and TSORT should run much
faster than the corresponding LISP programs, since PASCAL is, in general, a
much faster language than LISP.
Page 32
For IBM PC Users
VI. FOR IBM PC USERS: INPUT/OUTPUT ERROR MESSAGES
The following error message may be generated by the IBM PC operating
system while you are running any of the PASCAL programs:
I/O error nn, PC = xxxx
Program aborted
Below is a list of error numbers and their meanings:
01 File does not exist. You are trying to read from a file which does
not exist on the specified disk.
F0 Disk full. You are trying to write a file, and there is no more room
on the disk.
F1 Directory full. You are trying to create a file, and there is no more
room in the disk directory.