medkit.text.segmentation#

APIs#

To access these APIs, import them as follows:

from medkit.text.segmentation import <api_to_import>

Classes:

`SectionModificationRule`(section_name, ...)
`SectionTokenizer`(section_dict[, ...])	Section segmentation annotator based on keyword rules
`SentenceTokenizer`([output_label, ...])	Sentence segmentation annotator based on end punctuation rules
`SyntagmaTokenizer`(separators[, ...])	Syntagma segmentation annotator based on provided separators

class SectionTokenizer(section_dict, output_label='SECTION', section_rules=(), strip_chars='.;,?! \n\r\t', uid=None)[source]#

Section segmentation annotator based on keyword rules

Initialize the Section Tokenizer

Parameters

section_dict (Dict[str, List[str]]) – Dictionary containing the section name as key and the list of mappings as value (cf. content of default_section_dict.yml as example)
output_label (str) – Segment label to use for annotation output. Default is SECTION.
section_rules (Iterable[SectionModificationRule]) – List of rules for modifying a section name according to its order relative to the other sections.
strip_chars (str) – The list of characters to strip at the beginning of the returned segment. Default: '.;,?! \n\r\t' (cf. DefaultConfig)
uid (str, Optional) – Identifier of the tokenizer

Methods:

`load_section_definition`(filepath[, encoding])	Load the sections definition stored in a yml file
`run`(segments)	Return sections detected in segments.
`save_section_definition`(section_dict, ...[, ...])	Save section yaml definition file

run(segments)[source]#

Return sections detected in segments.

Parameters: segments (List[Segment]) – List of segments into which to look for sections
Return type: List[Segment]
Returns: List[Segment] – Section segments found in segments
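The keyword-based approach can be illustrated with a small standard-library sketch. This is not medkit's implementation: `find_sections` is a hypothetical helper that works on plain strings instead of `Segment` objects, and it assumes each section runs from one matched keyword to the next.

```python
import re
from typing import Dict, List, Tuple


def find_sections(
    text: str, section_dict: Dict[str, List[str]]
) -> List[Tuple[str, int, int]]:
    """Return (section_name, start, end) triples, one per detected section."""
    # Locate every keyword occurrence and remember which section it maps to.
    hits = []
    for name, keywords in section_dict.items():
        for keyword in keywords:
            for match in re.finditer(re.escape(keyword), text):
                hits.append((match.start(), name))
    hits.sort()

    # Assumption: a section extends from its keyword to the next keyword
    # (or to the end of the text for the last one).
    sections = []
    for i, (start, name) in enumerate(hits):
        end = hits[i + 1][0] if i + 1 < len(hits) else len(text)
        sections.append((name, start, end))
    return sections


text = "History: asthma since childhood. Treatment: salbutamol as needed."
section_dict = {"history": ["History:"], "treatment": ["Treatment:"]}
sections = find_sections(text, section_dict)
```

Here `section_dict` has the same shape as the `section_dict` parameter described above: section names as keys, lists of equivalent keyword strings as values.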

static load_section_definition(filepath, encoding=None)[source]#

Load the sections definition stored in a yml file

Parameters

filepath (Path) – Path to a yml file containing the sections (name + mappings) and rules
encoding (Optional[str]) – Encoding of the file to open

Return type

Tuple[Dict[str, List[str]], Tuple[SectionModificationRule, …]]

Returns

Tuple[Dict[str, List[str]], Tuple[SectionModificationRule, …]] – Tuple containing: (1) the dictionary whose keys are section names and whose values are the lists of all equivalent strings; (2) the section modification rules, which rename some sections according to their order.

static save_section_definition(section_dict, section_rules, filepath, encoding=None)[source]#

Save section yaml definition file

Parameters

section_dict (Dict[str, List[str]]) – Dictionary containing the section name as key and the list of mappings as value (cf. content of default_section_dict.yml as example)
section_rules (Iterable[SectionModificationRule]) – List of rules for modifying a section name according to its order relative to the other sections.
filepath (Path) – Path to the file to save
encoding (Optional[str]) – File encoding. Default: None

class SectionModificationRule(section_name, new_section_name, other_sections, order)[source]#
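The idea of renaming a section according to its position relative to other sections might look like the following sketch. Everything here is an assumption made for illustration: `Rule` is a hypothetical stand-in that only mirrors the four constructor arguments, `order` is assumed to take the values "BEFORE" or "AFTER", and `apply_rules` is not a medkit function.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Rule:
    # Hypothetical stand-in mirroring the four constructor arguments.
    section_name: str
    new_section_name: str
    other_sections: Tuple[str, ...]
    order: str  # assumed values: "BEFORE" or "AFTER"


def apply_rules(names: List[str], rules: List[Rule]) -> List[str]:
    """Rename sections whose position matches a rule."""
    renamed = []
    for i, name in enumerate(names):
        new_name = name
        for rule in rules:
            if rule.section_name != name:
                continue
            # Assumption: "BEFORE" looks at the sections that follow,
            # "AFTER" looks at the sections that precede.
            neighbours = names[i + 1:] if rule.order == "BEFORE" else names[:i]
            if any(other in neighbours for other in rule.other_sections):
                new_name = rule.new_section_name
                break
        renamed.append(new_name)
    return renamed


rule = Rule("treatment", "discharge_treatment", ("hospital_course",), "AFTER")
result = apply_rules(["treatment", "hospital_course", "treatment"], [rule])
```

In this example only the second "treatment" section is renamed, because it is the only one that comes after "hospital_course".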

class SentenceTokenizer(output_label='SENTENCE', punct_chars=('.', ';', '?', '!'), keep_punct=False, split_on_newlines=True, uid=None)[source]#

Sentence segmentation annotator based on end punctuation rules

Instantiate the sentence tokenizer

Parameters

output_label (str, Optional) – The output label of the created annotations.
punct_chars (Tuple[str], Optional) – The set of characters corresponding to end punctuations.
keep_punct (bool, Optional) – If True, the end punctuations are kept in the detected sentence. If False, the sentence text does not include the end punctuations.
split_on_newlines (bool) – Whether to treat newline characters as sentence boundaries.
uid (str, Optional) – Identifier of the tokenizer

Methods:

`run`(segments)	Return sentences detected in segments.

run(segments)[source]#

Return sentences detected in segments.

Parameters: segments (List[Segment]) – List of segments into which to look for sentences
Return type: List[Segment]
Returns: List[Segment] – Sentence segments found in segments
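To make the effect of `punct_chars`, `keep_punct` and `split_on_newlines` concrete, here is a minimal standard-library sketch of end-punctuation splitting. It is not medkit's implementation: `split_sentences` is a hypothetical helper that returns plain strings rather than `Segment` objects, and it assumes each entry of `punct_chars` is a single character.

```python
import re
from typing import List, Tuple


def split_sentences(
    text: str,
    punct_chars: Tuple[str, ...] = (".", ";", "?", "!"),
    keep_punct: bool = False,
    split_on_newlines: bool = True,
) -> List[str]:
    """Split text on end punctuation, optionally also on newlines."""
    punct = "".join(re.escape(char) for char in punct_chars)
    stop = punct + ("\r\n" if split_on_newlines else "")
    # A sentence is a run of non-boundary characters, optionally followed
    # by one or more end-punctuation characters.
    pattern = re.compile(rf"([^{stop}]+)([{punct}]*)")

    sentences = []
    for match in pattern.finditer(text):
        body, ending = match.group(1), match.group(2)
        if not body.strip():
            continue  # skip whitespace-only runs between boundaries
        sentence = body + ending if keep_punct else body
        sentences.append(sentence.strip())
    return sentences


text = "Fever for 3 days. No cough;\nChest clear"
without_punct = split_sentences(text)
with_punct = split_sentences(text, keep_punct=True)
```

With the default arguments the end punctuation is dropped from each sentence; with `keep_punct=True` it stays attached to the sentence it closes.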

class SyntagmaTokenizer(separators, output_label='SYNTAGMA', strip_chars='.;,?! \n\r\t', uid=None)[source]#

Syntagma segmentation annotator based on provided separators

Instantiate the syntagma tokenizer

Parameters

separators (Tuple[str, ...]) – The tuple of regular expressions corresponding to separators.
output_label (str, Optional) – The output label of the created annotations. Default: “SYNTAGMA” (cf. DefaultConfig)
strip_chars (str) – The list of characters to strip at the beginning of the returned segment. Default: '.;,?! \n\r\t' (cf. DefaultConfig)
uid (str, Optional) – Identifier of the tokenizer

Methods:

`load_syntagma_definition`(filepath[, encoding])	Load the syntagma definition stored in yml file
`run`(segments)	Return syntagmas detected in segments.
`save_syntagma_definition`(syntagma_seps, filepath)	Save syntagma yaml definition file

run(segments)[source]#

Return syntagmas detected in segments.

Parameters: segments (List[Segment]) – List of segments into which to look for syntagmas
Return type: List[Segment]
Returns: List[Segment] – Syntagma segments found in segments
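Separator-based splitting can be sketched with the standard library as follows. This is an illustration under stated assumptions, not medkit's implementation: `split_syntagmas` is a hypothetical helper that works on plain strings, it drops the matched separator text, and it strips `strip_chars` from both ends of each chunk for readability (the parameter description above only mentions the beginning).

```python
import re
from typing import List, Tuple


def split_syntagmas(
    text: str,
    separators: Tuple[str, ...],
    strip_chars: str = ".;,?! \n\r\t",
) -> List[str]:
    """Split text wherever one of the separator regexes matches."""
    # Combine the separator regexes into a single alternation.
    pattern = re.compile("|".join(f"(?:{sep})" for sep in separators))

    chunks = []
    start = 0
    for match in pattern.finditer(text):
        chunks.append(text[start:match.start()])
        start = match.end()  # assumption: the separator itself is dropped
    chunks.append(text[start:])

    # Strip unwanted characters and discard chunks that end up empty.
    stripped = (chunk.strip(strip_chars) for chunk in chunks)
    return [chunk for chunk in stripped if chunk]


text = "The patient has fever but no cough, and denies chest pain."
syntagmas = split_syntagmas(text, (r"\bbut\b", r"\band\b"))
```

The `separators` tuple plays the role of the `separators` parameter above: each entry is a regular expression marking a boundary between syntagmas.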

static load_syntagma_definition(filepath, encoding=None)[source]#

Load the syntagma definition stored in yml file

Parameters

filepath (Path) – Path to a yml file containing the syntagma separators
encoding (Optional[str]) – Encoding of the file to open

Return type

Tuple[str, …]

Returns

Tuple[str, …] – Tuple containing the separators

static save_syntagma_definition(syntagma_seps, filepath, encoding=None)[source]#

Save syntagma yaml definition file

Parameters

syntagma_seps (Tuple[str, …]) – The tuple of regular expressions corresponding to separators
filepath (Path) – The path of the file to save
encoding (Optional[str]) – The encoding of the file. Default: None

Subpackages / Submodules#

`medkit.text.segmentation.rush_sentence_tokenizer`	This module needs extra-dependencies not installed as core dependencies of medkit.
`medkit.text.segmentation.section_tokenizer`
`medkit.text.segmentation.sentence_tokenizer`
`medkit.text.segmentation.syntagma_tokenizer`
`medkit.text.segmentation.tokenizer_utils`
