medkit.text.segmentation

APIs

To access these APIs, import them as follows:

from medkit.text.segmentation import <api_to_import>

Classes:

SectionModificationRule(section_name, ...)

SectionTokenizer([section_dict, ...])

Section segmentation annotator based on keyword rules

SentenceTokenizer([output_label, ...])

Sentence segmentation annotator based on end punctuation rules

SyntagmaTokenizer([separators, ...])

Syntagma segmentation annotator based on provided separators

class SectionTokenizer(section_dict=None, output_label='section', section_rules=(), strip_chars='.;,?! \n\r\t', uid=None)

Section segmentation annotator based on keyword rules

Initialize the Section Tokenizer

Parameters:
  • section_dict (dict of str to list of str, optional) – Dictionary containing the section name as key and the list of mappings as value. If None, the content of default_section_definition.yml will be used.

  • output_label (str, optional) – Segment label to use for annotation output.

  • section_rules (iterable of SectionModificationRule, optional) – List of rules for modifying a section name according to its order relative to the other sections. If section_dict is None, the content of default_section_definition.yml will be used.

  • strip_chars (str, optional) – The list of characters to strip at the beginning of the returned segment.

  • uid (str, optional) – Identifier of the tokenizer

Methods:

load_section_definition(filepath[, encoding])

Load the sections definition stored in a yml file

run(segments)

Return sections detected in segments.

save_section_definition(section_dict, ...[, ...])

Save section yaml definition file

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)

Return sections detected in segments. Each section is a segment with an attached attribute (label: <same as self.output_label>, value: <the name of the section>).

Parameters:

segments (list of Segment) – List of segments in which to look for sections

Return type:

list[Segment]

Returns:

list of Segment – Section segments found in segments
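
The keyword-rule idea behind run can be illustrated with a minimal, standard-library sketch. It is not medkit's implementation: find_sections is a hypothetical helper, the example section_dict is invented for illustration, and the choice to let a section run from its keyword up to the next keyword is an assumption.

```python
import re


def find_sections(text, section_dict):
    """Split `text` into (section_name, start, end) triples by keyword lookup."""
    # Map each keyword back to the section name it announces.
    keyword_to_name = {
        keyword: name
        for name, keywords in section_dict.items()
        for keyword in keywords
    }
    # Try longer keywords first so the longest alternative wins at a position.
    keywords = sorted(keyword_to_name, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keywords))

    matches = list(pattern.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # Assumption: a section runs from its keyword to the start of the next keyword.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((keyword_to_name[match.group()], match.start(), end))
    return sections


# Invented example dictionary: section name -> list of keywords (mappings).
section_dict = {
    "antecedents": ["Medical history:"],
    "treatment": ["Treatment:", "Medication:"],
}
text = "Medical history: asthma. Treatment: salbutamol."
sections = find_sections(text, section_dict)
for name, start, end in sections:
    print(name, repr(text[start:end]))
```

In the real tokenizer, each detected section is returned as a Segment carrying the section name as an attribute, rather than as a plain tuple.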

static load_section_definition(filepath, encoding=None)

Load the sections definition stored in a yml file

Parameters:
  • filepath (Path) – Path to a yml file containing the sections (name + mappings) and rules

  • encoding (str, optional) – Encoding of the file to open

Return type:

tuple[dict[str, list[str]], tuple[SectionModificationRule, …]]

Returns:

tuple – Tuple containing:
  • the dictionary where the key is the section name and the value is the list of all equivalent strings;
  • the list of section modification rules. These rules allow some sections to be renamed according to their order.

static save_section_definition(section_dict, section_rules, filepath, encoding=None)

Save section yaml definition file

Parameters:
  • section_dict (dict of str to list of str) – Dictionary containing the section name as key and the list of mappings as value (cf. content of default_section_dict.yml as example)

  • section_rules (iterable of SectionModificationRule) – List of rules for modifying a section name according to its order relative to the other sections.

  • filepath (Path) – Path to the file to save

  • encoding (str, optional) – File encoding

property description: OperationDescription

Contains all the operation init parameters.

Return type:

OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters:

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

class SectionModificationRule(section_name, new_section_name, other_sections, order)
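
This page gives no description for the rule class, so the following is only a sketch of how such a rule might rename a section according to its position. The Rule dataclass and apply_rules function are hypothetical stand-ins, and the "BEFORE" / "AFTER" values and their interpretation are assumptions rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    # Hypothetical stand-in mirroring the four constructor arguments above.
    section_name: str
    new_section_name: str
    other_sections: tuple
    order: str  # assumed to be "BEFORE" or "AFTER"


def apply_rules(names, rules):
    """Return a copy of `names` with sections renamed according to `rules`."""
    renamed = list(names)
    for rule in rules:
        for i, name in enumerate(names):
            if name != rule.section_name:
                continue
            # Assumption: "BEFORE" looks at the sections that follow this one,
            # "AFTER" at the sections that precede it.
            neighbours = names[i + 1:] if rule.order == "BEFORE" else names[:i]
            if any(other in neighbours for other in rule.other_sections):
                renamed[i] = rule.new_section_name
    return renamed


# Invented section names, for illustration only.
rule = Rule("treatment", "treatment_entry", ("hospital_course",), "BEFORE")
result = apply_rules(["treatment", "hospital_course", "treatment"], [rule])
print(result)
```

Under these assumptions, only the first "treatment" section is renamed, because it is the only one that appears before "hospital_course".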

class SentenceTokenizer(output_label='sentence', punct_chars=('.', ';', '?', '!'), keep_punct=False, split_on_newlines=True, attrs_to_copy=None, uid=None)

Sentence segmentation annotator based on end punctuation rules

Instantiate the sentence tokenizer

Parameters:
  • output_label (str, optional) – The output label of the created annotations.

  • punct_chars (tuple of str, optional) – The set of characters corresponding to end punctuations.

  • keep_punct (bool, optional) – If True, the end punctuations are kept in the detected sentence. If False, the sentence text does not include the end punctuations.

  • split_on_newlines (bool, default=True) – Whether to treat newline characters as sentence boundaries.

  • attrs_to_copy (list of str, optional) – Labels of the attributes that should be copied from the input segment to the derived segment. For example, useful for propagating section name.

  • uid (str, optional) – Identifier of the tokenizer
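
How punct_chars, keep_punct and split_on_newlines interact can be seen in a simplified, standard-library sketch. It is not medkit's code: split_sentences is a hypothetical function that returns character offsets instead of Segment objects, and the whitespace trimming is a choice made here.

```python
import re


def split_sentences(text, punct_chars=(".", ";", "?", "!"),
                    keep_punct=False, split_on_newlines=True):
    """Return (start, end) character offsets of the sentences found in `text`."""
    punct = "".join(re.escape(c) for c in punct_chars)
    newline = r"\r\n" if split_on_newlines else ""
    # A sentence body is a run of characters that are neither end punctuation
    # nor (optionally) newlines, followed by any trailing end punctuation.
    pattern = re.compile(rf"(?P<body>[^{punct}{newline}]+)(?P<punct>[{punct}]*)")
    spans = []
    for match in pattern.finditer(text):
        body = match.group("body")
        if not body.strip():
            continue  # skip whitespace-only pieces between sentences
        # Trim surrounding whitespace while keeping offsets into `text`.
        start = match.start("body") + len(body) - len(body.lstrip())
        end = match.start("body") + len(body.rstrip())
        if keep_punct and match.group("punct"):
            end = match.end("punct")
        spans.append((start, end))
    return spans


text = "Fever since Monday\nNo cough. Stable!"
sentences = [text[s:e] for s, e in split_sentences(text)]
with_punct = [text[s:e] for s, e in split_sentences(text, keep_punct=True)]
no_newline_split = [text[s:e] for s, e in split_sentences(text, split_on_newlines=False)]
print(sentences)
print(with_punct)
print(no_newline_split)
```

Returning offsets rather than substrings mirrors the fact that the tokenizer produces segments anchored in the original text.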

Methods:

run(segments)

Return sentences detected in segments.

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)

Return sentences detected in segments.

Parameters:

segments (list of Segment) – List of segments in which to look for sentences

Return type:

list[Segment]

Returns:

list of Segment – Sentence segments found in segments

property description: OperationDescription

Contains all the operation init parameters.

Return type:

OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters:

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

class SyntagmaTokenizer(separators=None, output_label='syntagma', strip_chars='.;,?! \n\r\t', attrs_to_copy=None, uid=None)

Syntagma segmentation annotator based on provided separators

Instantiate the syntagma tokenizer

Parameters:
  • separators (tuple of str, optional) – The tuple of regular expressions corresponding to separators. If None, the rules in “default_syntagma_definition.yml” will be used.

  • output_label (str, optional) – The output label of the created annotations.

  • strip_chars (str, optional) – The list of characters to strip at the beginning of the returned segment.

  • attrs_to_copy (list of str, optional) – Labels of the attributes that should be copied from the input segment to the derived segment. For example, useful for propagating section name.

  • uid (str, optional) – Identifier of the tokenizer
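
Separator-based splitting can be sketched with the standard library as follows. This is an illustration under assumptions, not medkit's implementation: split_syntagmas is a hypothetical function, the example separators are invented, each separator is assumed to open the next syntagma, and strip_chars is applied on both sides here.

```python
import re


def split_syntagmas(text, separators, strip_chars=".;,?! \n\r\t"):
    """Return (start, end) offsets of the pieces of `text` delimited by `separators`."""
    # Each separator is a regular expression; assumption: a new piece starts
    # where a separator matches, so the separator belongs to the piece it opens.
    pattern = re.compile("|".join(f"(?:{sep})" for sep in separators))
    cuts = [0] + [m.start() for m in pattern.finditer(text)] + [len(text)]
    spans = []
    for start, end in zip(cuts, cuts[1:]):
        piece = text[start:end]
        # Strip unwanted characters on both sides while keeping offsets into `text`.
        stripped = piece.strip(strip_chars)
        if stripped:
            start += len(piece) - len(piece.lstrip(strip_chars))
            spans.append((start, start + len(stripped)))
    return spans


text = "No fever, but persistent cough; therefore a chest X-ray was ordered."
separators = (r"\bbut\b", r"\btherefore\b")  # invented example separators
syntagmas = [text[s:e] for s, e in split_syntagmas(text, separators)]
print(syntagmas)
```

The sketch expects a non-empty tuple of separators; the real tokenizer falls back to its default definition file when separators is None.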

Methods:

load_syntagma_definition(filepath[, encoding])

Load the syntagma definition stored in a yml file

run(segments)

Return syntagmas detected in segments.

save_syntagma_definition(syntagma_seps, filepath)

Save syntagma yaml definition file

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(segments)

Return syntagmas detected in segments.

Parameters:

segments (list of Segment) – List of segments in which to look for syntagmas

Return type:

list[Segment]

Returns:

list of Segment – Syntagma segments found in segments

static load_syntagma_definition(filepath, encoding=None)

Load the syntagma definition stored in a yml file

Parameters:
  • filepath (Path) – Path to a yml file containing the syntagma separators

  • encoding (str, optional) – Encoding of the file to open

Return type:

tuple[str, …]

Returns:

tuple of str – Tuple containing the separators

property description: OperationDescription

Contains all the operation init parameters.

Return type:

OperationDescription

static save_syntagma_definition(syntagma_seps, filepath, encoding=None)

Save syntagma yaml definition file

Parameters:
  • syntagma_seps (tuple of str) – The tuple of regular expressions corresponding to separators

  • filepath (Path) – The path of the file to save

  • encoding (str, optional) – The encoding of the file. Default: None

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters:

prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

Subpackages / Submodules

medkit.text.segmentation.rush_sentence_tokenizer

This module needs extra dependencies that are not installed as core dependencies of medkit.