Commit 78d270d (parent 8e80393)

Add library usage explanation

Change-Id: I4401ebd7218a3e699efb73917aa5de8baf6f17f8

1 file changed: Readme.md (36 additions, 1 deletion)
@@ -9,7 +9,7 @@ high-performance large-scale natural language tokenization,
 based on a finite state
 transducer generated with [Foma](https://fomafst.github.io/).
 
-The library contains precompiled tokenizer models for
+The repository currently contains precompiled tokenizer models for
 
 - [german](testdata/tokenizer_de.matok)
 - [english](testdata/tokenizer_en.matok)
@@ -18,6 +18,8 @@ The focus of development is on the tokenization of
 [DeReKo](https://www.ids-mannheim.de/digspra/kl/projekte/korpora),
 the german reference corpus.
 
+Datok can be used as a standalone tool or as a library in Go.
+
 ## Performance
 
 ![Speed comparison of german tokenizers](https://raw.githubusercontent.com/KorAP/Datok/master/misc/benchmarks.svg)
@@ -54,6 +56,7 @@ The special `END OF TRANSMISSION` character (`\x04`) can be used to mark the end
 > *Caution*: When experimenting with STDIN and echo,
 > you may need to disable [history expansion](https://www.gnu.org/software/bash/manual/html_node/History-Interaction.html).
 
+
 ## Conversion
 
 ```
@@ -68,6 +71,38 @@ Flags:
 	  representation
 ```
 
+## Library
+
+```go
+package main
+
+import (
+	"github.com/KorAP/datok"
+	"os"
+	"strings"
+)
+
+func main() {
+
+	// Load the transducer binary
+	dat := datok.LoadTokenizerFile("tokenizer_de.matok")
+	if dat == nil {
+		panic("Can't load tokenizer")
+	}
+
+	// Create a new TokenWriter object
+	tw := datok.NewTokenWriter(os.Stdout, datok.TOKENS|datok.SENTENCES)
+	defer tw.Flush()
+
+	// Create an io.Reader object referring to the data to tokenize
+	r := strings.NewReader("Das ist <em>interessant</em>!")
+
+	// TransduceTokenWriter accepts an io.Reader
+	// and a TokenWriter object to transduce the input
+	dat.TransduceTokenWriter(r, tw)
+}
+```
+
 ## Conventions
 
 The FST generated by [Foma](https://fomafst.github.io/) must adhere to
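Editor's note on the example added above: the `TOKENS|SENTENCES` writer is commonly assumed to emit one token per line, with a blank line marking a sentence boundary (an assumption about the output format, not stated in this diff). Under that assumption, a minimal self-contained sketch for grouping such output into sentences downstream might look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// splitSentences groups newline-separated tokens into sentences,
// treating a blank line as a sentence boundary. This mirrors an
// assumed (not confirmed) output convention of TOKENS|SENTENCES.
func splitSentences(out string) [][]string {
	var sents [][]string
	var cur []string
	for _, line := range strings.Split(out, "\n") {
		if line == "" {
			if len(cur) > 0 {
				sents = append(sents, cur)
				cur = nil
			}
			continue
		}
		cur = append(cur, line)
	}
	if len(cur) > 0 {
		sents = append(sents, cur)
	}
	return sents
}

func main() {
	// Hypothetical writer output for the sentence in the diff's example
	out := "Das\nist\ninteressant\n!\n\n"
	fmt.Println(len(splitSentences(out))) // → 1
}
```

This keeps the tokenizer's `TokenWriter` unchanged: any `io.Writer` (a buffer instead of `os.Stdout`) could capture the output and feed it to a helper like this.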
