medkit.text.segmentation.syntagma_tokenizer

Classes:

SyntagmaTokenizer(separators[, ...])

Syntagma segmentation annotator based on provided separators

class SyntagmaTokenizer(separators, output_label='SYNTAGMA', strip_chars='.;,?! \n\r\t', uid=None)

Syntagma segmentation annotator based on provided separators

Instantiate the syntagma tokenizer

Parameters

separators (Tuple[str, ...]) – The tuple of regular expressions corresponding to separators.
output_label (str, Optional) – The output label of the created annotations. Default: “SYNTAGMA” (cf. DefaultConfig)
strip_chars (str) – The list of characters to strip at the beginning of the returned segment. Default: '.;,?! \n\r\t' (cf. DefaultConfig)
uid (str, Optional) – Identifier of the tokenizer
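
To make the roles of separators and strip_chars concrete, here is a minimal standard-library sketch of separator-based splitting. It is not medkit's implementation: the function name split_on_separators is hypothetical, and cutting the text just before each separator match (so the separator stays with the following piece) is an assumption made for illustration. Only the meaning of the two parameters is taken from the description above.

```python
import re
from typing import List, Tuple


def split_on_separators(
    text: str,
    separators: Tuple[str, ...],
    strip_chars: str = ".;,?! \n\r\t",
) -> List[Tuple[int, int, str]]:
    """Return (start, end, piece) triples, cutting `text` before each separator match.

    Hypothetical helper: the cutting rule is an assumption, not medkit's code.
    """
    # Combine all separator regular expressions into a single alternation.
    pattern = re.compile("|".join(f"(?:{sep})" for sep in separators))
    # Cut points: start of text, start of every separator match, end of text.
    cuts = [0] + [m.start() for m in pattern.finditer(text)] + [len(text)]
    pieces = []
    for start, end in zip(cuts, cuts[1:]):
        chunk = text[start:end]
        # Strip the configured characters from the beginning of the piece only.
        stripped = chunk.lstrip(strip_chars)
        if not stripped:
            continue  # skip pieces that are empty after stripping
        # Shift the start offset past the characters that were stripped.
        start += len(chunk) - len(stripped)
        pieces.append((start, end, stripped))
    return pieces


pieces = split_on_separators(
    "The patient has asthma, but no fever.",
    separators=(r"\bbut\b", r"\bexcept\b"),
)
for start, end, piece in pieces:
    print(start, end, repr(piece))
```

Returning character offsets alongside the text mirrors the idea that each detected syntagma remains locatable in the input segment. The word-boundary anchors (\b) in the example separators keep "but" from matching inside longer words.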

Methods:

`load_syntagma_definition`(filepath[, encoding])	Load the syntagma definition stored in a yml file
`run`(segments)	Return syntagmas detected in segments.
`save_syntagma_definition`(syntagma_seps, filepath[, encoding])	Save the syntagma definition to a yml file

run(segments)

Return syntagmas detected in segments.

Parameters: segments (List[Segment]) – List of segments in which to look for syntagmas
Return type: List[Segment]
Returns: List[Segment] – Syntagma segments found in segments

static load_syntagma_definition(filepath, encoding=None)

Load the syntagma definition stored in a yml file

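Parameters

filepath (Path) – The path of the file to load
encoding (Optional[str]) – The encoding of the file. Default: None

Return type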

Tuple[str, …]

Returns

Tuple[str, …] – Tuple containing the separators

static save_syntagma_definition(syntagma_seps, filepath, encoding=None)

Save the syntagma definition to a yml file

Parameters

syntagma_seps (Tuple[str, …]) – The tuple of regular expressions corresponding to separators
filepath (Path) – The path of the file to save
encoding (Optional[str]) – The encoding of the file. Default: None