medkit.text.segmentation.syntagma_tokenizer
Classes:
SyntagmaTokenizer – Syntagma segmentation annotator based on provided separators
- class SyntagmaTokenizer(separators, output_label='SYNTAGMA', strip_chars='.;,?! \n\r\t', uid=None)
Syntagma segmentation annotator based on provided separators
Instantiate the syntagma tokenizer
- Parameters
separators (Tuple[str, ...]) – The tuple of regular expressions corresponding to separators.
output_label (str, Optional) – The output label of the created annotations. Default: “SYNTAGMA” (cf. DefaultConfig)
strip_chars (str) – The list of characters to strip at the beginning of the returned segment. Default: ‘.;,?! \n\r\t’ (cf. DefaultConfig)
uid (str, Optional) – Identifier of the tokenizer
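The interplay between separators and strip_chars can be illustrated with a minimal, standalone sketch. It is not medkit's implementation: the helper name split_on_separators is hypothetical, and two behaviours are assumptions made for illustration only (each separator match is kept at the start of the chunk that follows it, and strip_chars is trimmed from both ends of each chunk).

```python
import re
from typing import List, Tuple


def split_on_separators(
    text: str,
    separators: Tuple[str, ...],
    strip_chars: str = ".;,?! \n\r\t",
) -> List[Tuple[int, int, str]]:
    """Hypothetical helper: split text at each separator match.

    Returns (start, end, chunk) triples whose offsets point into `text`.
    """
    # Combine all separator regular expressions into one alternation.
    pattern = re.compile("|".join(f"(?:{sep})" for sep in separators))
    # Cut at the start of the text, at the start of every match, and at the end.
    # Assumption: the separator itself stays at the start of the next chunk.
    cuts = [0] + [m.start() for m in pattern.finditer(text)] + [len(text)]
    chunks = []
    for start, end in zip(cuts, cuts[1:]):
        raw = text[start:end]
        # Assumption: trim strip_chars on both sides and shift the offsets.
        start += len(raw) - len(raw.lstrip(strip_chars))
        end -= len(raw) - len(raw.rstrip(strip_chars))
        if start < end:  # skip chunks that are empty after stripping
            chunks.append((start, end, text[start:end]))
    return chunks


text = "The patient has a fever but no cough, and denies chest pain."
for start, end, chunk in split_on_separators(text, (r"\bbut\b", r"\band\b")):
    print(start, end, repr(chunk))
```

Keeping character offsets alongside each chunk mirrors the idea that every detected syntagma stays traceable to its position in the original text.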
Methods:
load_syntagma_definition(filepath[, encoding]) – Load the syntagma definition stored in yml file
run(segments) – Return syntagmas detected in segments.
save_syntagma_definition(syntagma_seps, filepath) – Save syntagma yaml definition file
- static load_syntagma_definition(filepath, encoding=None)
Load the syntagma definition stored in yml file
- Parameters
filepath (Path) – Path to a yml file containing the syntagma separators
encoding (Optional[str]) – Encoding of the file to open
- Return type
Tuple[str, …]
- Returns
Tuple[str, …] – Tuple containing the separators
- static save_syntagma_definition(syntagma_seps, filepath, encoding=None)
Save syntagma yaml definition file
- Parameters
syntagma_seps (Tuple[str, …]) – The tuple of regular expressions corresponding to separators
filepath (Path) – The path of the file to save
encoding (Optional[str]) – The encoding of the file. Default: None