DESCRIPTION
ACCEPTORS
The usual notion of finite-state acceptor is generalized here to
include costs -- each arc has a cost associated with making that tran-
sition and each state has a final cost associated with accepting at
that state.
The following is the text representation of a (weighted) acceptor:
(1) Each state in the acceptor is specified by a non-negative inte-
ger. Each distinct arc symbol is specified by a non-negative inte-
ger. 0 is reserved for the epsilon string (epsilon). Each arc cost
is represented as a floating point number (in the default compila-
tion).
(2) For each arc in the acceptor, there is a line of the form:
S1 S2 A C
where S1 is the source state number, S2 is the destination state
number, A is the arc symbol number and C is the arc cost. The last
field is optional; if it is omitted, the arc is assigned cost SMROne
(see fsmcost(3)).
(3) For each final state, there is a line of the form:
S C
where S is the state number and C is the cost of accepting at that
state. The second field is optional; if it is omitted, the state is
accepted with cost SMROne.
(4) The complete printed representation of the acceptor consists of
lines of the form in (2) and (3). The initial state is, by conven-
tion, the source state on the first line.
For example, the acceptor represented by a file containing:
0 0 1 .5
0 1 2 .3
1 2 3 .6
1 2 4 .6
2
accepts strings of the (grep) form x*y[zw], where x, y, z, w are the
symbols numbered 1, 2, 3, 4, respectively. If the accepted string has
N occurrences of x, then its cost is .5 N + .9, using the (default)
tropical cost semiring (see fsmcost(3)).
TRANSDUCERS
The text representation of a (weighted) transducer is similar to that
where S1 is the source state number, S2 is the destination state
number, IA is the input symbol number, OA is the output symbol num-
ber, and C is the arc cost. The last field is optional; if it is
omitted, the arc is assigned cost SMROne.
SYMBOL NAMES
In the above formats, any text string that excludes whitespace may be
substituted for the numbers in the state, input symbol and output sym-
bol fields, provided that appropriate mappings from text strings to
numbers are supplied in symbols files. A symbols file consists of
lines with two fields separated by whitespace. The first field con-
tains a text string and the second field its corresponding number. For
a state symbols file, there must be a text string for every state num-
ber encountered. Similarly, for input and output symbol symbols files.
For example, the acceptor represented by a file containing:
BEG BEG RED .5
BEG MID GREEN .3
MID END BLUE .6
MID END BLACK .6
END
with the state symbols file:
BEG 0
MID 1
END 2
and (input) arc symbols file:
RED 1
GREEN 2
BLUE 3
BLACK 4
is identical to the previous example above.
CAVEATS
Since some FSM commands (like FSMConcat and FSMClosure) introduce
epsilon symbols into their outputs, it is recommended that a label for
0, the epsilon ID, be included in all symbols files.
Some FSM operations allocate internal arrays based on the maximum inte-
ger used as an input or output arc symbol. This design choice, chosen
for efficiency, requires the user to avoid huge integer labels (e.g.,
INT_MAX) since memory may otherwise be exhausted.
When fsmprint is given the -l option, additional information is output
that when read by fsmcompile preserves the state potentials, the FSM
class, the cost semiring, and whether it is an acceptor or transducer
(overriding any conflicting command line options).
Cyril Allauzen (allauzen@research.att.com)
Mehryar Mohri (mohri@research.att.com)
Fernando Pereira (pereira@cis.upenn.edu)
Michael Riley (riley@research.att.com)
Copyright (C) 1998-2003 AT&T Corp. All rights reserved.
Version 4.0 FSM(5)