Thrax Tutorial
Thrax compiles grammars expressed as regular expressions and context-dependent rewrite rules into weighted finite-state transducers (WFSTs).
commands
- thraxmakedep
--save_symbols=true
- thraxcompiler
--input_grammar=example.grm
--output_far=example.far
--save_symbols=true|false (default: false)
- thraxrewrite-tester
--far=example.far
--rules=TOKENIZER
- thraxrandom-generator
--far=example.far
--rule=TOKENIZER
--noutput=10
general statements
- Each statement consists of an assignment terminating with a semicolon.
- Statements that start with the export keyword are written to the final output archive (FAR).
foo = "abc";
export bar = foo | "xyz";
string input
- String FSTs are defined by text enclosed in double quotes (").
- Raw strings, such as filenames, are enclosed in single quotes (').
- In the default parse mode, each arc of the resulting FST corresponds to a single 1-byte character.
- When using symbol table parse mode, symbols must be separated by a separator, which by default is a space.
- A symbol table can be loaded with the SymbolTable built-in function.
symtab = SymbolTable['/path/to/bears.symtab'];
pb = "polar bear".symtab;
- We can create temporary symbols by enclosing the symbol name in brackets ([]) within an FST string.
- If the symbol name is a complete integer, the number is used directly as the arc label.
parse mode
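Append a dot (.) and a mode name to a string to specify the parse mode explicitly:
- byte: parse the string byte by byte. This is the default mode.
- utf8: use UTF-8 characters for FST arcs.
- a symbol table (for example symtab): look up separator-delimited symbols in that table.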
a = "haha"; # byte (default)
b = "haha".byte; # byte
c = "haha".utf8; # utf8
d = "haha".symtab; # symbol table
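As a rough illustration of what the parse modes mean for arc labels, the following Python sketch (not Thrax, and not part of its API) computes the label sequence a string would produce in each mode. It assumes byte mode yields one label per UTF-8 byte and utf8 mode one label per Unicode code point. The function names and the toy symbol table are hypothetical.

```python
def byte_labels(text):
    """Byte mode: one arc label per byte of the UTF-8 encoding."""
    return list(text.encode("utf-8"))


def utf8_labels(text):
    """utf8 mode: one arc label per Unicode code point."""
    return [ord(ch) for ch in text]


def symtab_labels(text, symtab, separator=" "):
    """Symbol table mode: split on the separator and look each symbol up."""
    return [symtab[symbol] for symbol in text.split(separator)]


if __name__ == "__main__":
    # ASCII text gives the same labels in byte and utf8 mode.
    print(byte_labels("haha"), utf8_labels("haha"))
    # A multi-byte character gives two byte labels but one utf8 label.
    print(len(byte_labels("\u00e9")), len(utf8_labels("\u00e9")))
    toy_symtab = {"polar": 1, "bear": 2}  # hypothetical table contents
    print(symtab_labels("polar bear", toy_symtab))
```

For pure ASCII input the choice between byte and utf8 makes no difference. It only matters once the grammar contains multi-byte characters.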
function
func UnionWithTriple[fst] {
  fst3 = fst fst fst;
  result = fst | fst3;
  return result;
}
export a_or_a3 = UnionWithTriple["a"];
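To make the effect of UnionWithTriple concrete, here is a small Python model. It is an illustration only, not how Thrax represents FSTs: it treats an unweighted acceptor as a finite set of strings, with concatenation as pairwise joining and union as set union. The helper names are hypothetical.

```python
def concat(lang_a, lang_b):
    """Model of 'foo bar': every string of lang_a followed by every string of lang_b."""
    return {a + b for a in lang_a for b in lang_b}


def union(lang_a, lang_b):
    """Model of 'foo | bar': strings accepted by either language."""
    return lang_a | lang_b


def union_with_triple(fst):
    """Mirror of the Thrax function: the argument, or three copies of it in a row."""
    fst3 = concat(concat(fst, fst), fst)
    return union(fst, fst3)


if __name__ == "__main__":
    print(sorted(union_with_triple({"a"})))  # ['a', 'aaa']
```

Under this model, a_or_a3 accepts exactly the strings a and aaa.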
symbols
- (): group an expression to be evaluated first.
- <>: attach a weight to the FST.
foo = "aaa" <1>;
goo = "aaa" : "bbb" <-1>;
operations
- Closure: repeats the argument FST.
fst* # zero or more times
fst+ # one or more times
fst? # zero or one time
fst{x,y} # at least x and at most y times
- Concatenation: follows the first FST with the second.
foo bar
- Difference: accepts strings accepted by the first FST but not by the second.
foo - bar
- Composition: composes the first FST with the second.
foo @ bar
- Union: accepts either of the two FSTs.
foo | bar
- Rewrite: rewrites strings matching the first to the second.
foo : bar
- Determinize:
Determinize[fst]
- RmEpsilon:
RmEpsilon[fst]
- Minimize:
Minimize[fst]
- Optimize:
Optimize[fst]
- Reverse:
Reverse[fst]
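The acceptor operations above can be sketched with finite sets of strings standing in for unweighted FSTs. This Python model is a simplification for intuition only. It covers bounded closure (fst{x,y} and fst?), difference, and union, and leaves out the unbounded fst* and fst+ as well as composition and rewrite, which need transducers. All names are hypothetical.

```python
def concat(lang_a, lang_b):
    """Model of 'foo bar'."""
    return {a + b for a in lang_a for b in lang_b}


def repeat(lang, low, high):
    """Model of fst{low,high}: between low and high copies of lang, inclusive."""
    result = set()
    power = {""}  # lang repeated zero times
    for count in range(high + 1):
        if count >= low:
            result |= power
        power = concat(power, lang)
    return result


def optional(lang):
    """Model of fst?: zero or one copy."""
    return repeat(lang, 0, 1)


if __name__ == "__main__":
    one_to_three = repeat({"a"}, 1, 3)
    print(sorted(one_to_three))            # ['a', 'aa', 'aaa']
    print(sorted(one_to_three - {"aa"}))   # difference: ['a', 'aaa']
    print(sorted(optional({"a"}) | {"b"})) # union: ['', 'a', 'b']
```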
file functions
- LoadFst and LoadFstFromFar: load an FST from a file, or extract one from a FAR.
LoadFst['/path/to/fst']
LoadFstFromFar['/path/to/far', 'fst_name']
- StringFile: loads a file consisting of a list of strings or pairs of strings.
- Compiles it (in byte mode) to an FST that represents the union of those strings. This is significantly more efficient for large lists.
StringFile['strings_file']
- If the file contains single strings, one per line, then the resulting FST will be an acceptor.
- If the file contains pairs of tab-separated strings, the result will be a transducer.
- Two optional arguments specify the parse modes for the strings to the left and right of the tab.
StringFile['strings_file', 'byte', symbols]
- SymbolTable: loads and returns the symbol table.
SymbolTable['/path/to/symtab']
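The StringFile rules about acceptors and transducers can be pictured with a short Python sketch. It is not Thrax's implementation. It only mimics the described file format: a line with one string maps that string to itself, and a line with a tab maps the left string to the right one. Skipping blank lines is a choice made for this sketch, and the function name is hypothetical.

```python
def parse_string_file(lines):
    """Return (pairs, is_acceptor) for the lines of a StringFile-style file."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption of this sketch: ignore blank lines
        if "\t" in line:
            left, right = line.split("\t", 1)  # input string, output string
        else:
            left = right = line  # single string: input equals output
        pairs.append((left, right))
    # The union behaves as an acceptor when every input equals its output.
    is_acceptor = all(left == right for left, right in pairs)
    return pairs, is_acceptor


if __name__ == "__main__":
    print(parse_string_file(["polar bear\n", "grizzly\n"]))
    print(parse_string_file(["colour\tcolor\n"]))
```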