@@ -9,7 +9,7 @@ high-performance large-scale natural language tokenization,
 based on a finite state
 transducer generated with [Foma](https://fomafst.github.io/).

-The library contains precompiled tokenizer models for
+The repository currently contains precompiled tokenizer models for

 - [german](testdata/tokenizer_de.matok)
 - [english](testdata/tokenizer_en.matok)
@@ -18,6 +18,8 @@ The focus of development is on the tokenization of
 [DeReKo](https://www.ids-mannheim.de/digspra/kl/projekte/korpora),
 the german reference corpus.

+Datok can be used as a standalone tool or as a library in Go.
+
 ## Performance

 ![Speed comparison of german tokenizers](https://raw.githubusercontent.com/KorAP/Datok/master/misc/benchmarks.svg)
@@ -54,6 +56,7 @@ The special `END OF TRANSMISSION` character (`\x04`) can be used to mark the end
 > *Caution*: When experimenting with STDIN and echo,
 > you may need to disable [history expansion](https://www.gnu.org/software/bash/manual/html_node/History-Interaction.html).

+
 ## Conversion

 ```
@@ -68,6 +71,38 @@ Flags:
  representation
 ```

+## Library
+
+```go
+package main
+
+import (
+	"github.com/KorAP/datok"
+	"os"
+	"strings"
+)
+
+func main() {
+
+	// Load transducer binary
+	dat := datok.LoadTokenizerFile("tokenizer_de.matok")
+	if dat == nil {
+		panic("Can't load tokenizer")
+	}
+
+	// Create a new TokenWriter object
+	tw := datok.NewTokenWriter(os.Stdout, datok.TOKENS|datok.SENTENCES)
+	defer tw.Flush()
+
+	// Create an io.Reader object referring to the data to tokenize
+	r := strings.NewReader("Das ist <em>interessant</em>!")
+
+	// The transduceTokenWriter accepts an io.Reader
+	// object and a TokenWriter object to transduce the input
+	dat.TransduceTokenWriter(r, tw)
+}
+```
+
 ## Conventions

 The FST generated by [Foma](https://fomafst.github.io/) must adhere to