medkit.text.postprocessing


Classes:

`AttributeDuplicator`(attr_labels[, uid])	Annotator to copy attributes from a source segment to its nested segments.
`DocumentSplitter`(segment_label[, ...])	Split text documents using their segments as a reference.

Functions:

`compute_nested_segments`(source_segments, ...)	Return source segments aligned with their nested segments.
`filter_overlapping_entities`(entities)	Filter a list of entities and remove overlaps.

class AttributeDuplicator(attr_labels, uid=None)

Annotator to copy attributes from a source segment to its nested segments. For each attribute to be duplicated, a new attribute is created in the nested segment.

Instantiate the attribute duplicator.

Parameters:

attr_labels (list of str) – Labels of the attributes to copy
uid (str, optional) – Identifier of the annotator

Methods:

`run`(source_segments, target_segments)	Add attributes from source segments to all nested segments.
`set_prov_tracer`(prov_tracer)	Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(source_segments, target_segments)

Add attributes from source segments to all nested segments. The nested segments are chosen among the target_segments based on their spans.

Parameters:

source_segments (list of Segment) – List of segments with attributes to copy
target_segments (list of Segment) – List of candidate target segments; those nested in a source segment receive the copied attributes
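The duplication rule described above can be sketched in plain Python. This is not medkit's implementation: `ToySegment` and `duplicate_attrs` are hypothetical names introduced for illustration, each segment is assumed to cover a single contiguous character range, and "nested" is assumed to mean that the target range lies entirely within the source range.

```python
from dataclasses import dataclass, field


@dataclass
class ToySegment:
    """Hypothetical stand-in for a segment: a label, a character range, and attributes."""

    label: str
    start: int
    end: int
    attrs: dict = field(default_factory=dict)


def duplicate_attrs(source_segments, target_segments, attr_labels):
    """Copy the listed attributes from each source onto the targets nested inside it."""
    for source in source_segments:
        for target in target_segments:
            nested = source.start <= target.start and target.end <= source.end
            if not nested:
                continue
            for attr_label in attr_labels:
                if attr_label in source.attrs:
                    # A new entry is created on the target; the source is left untouched.
                    target.attrs[attr_label] = source.attrs[attr_label]


sentence = ToySegment("sentence", 0, 30, {"is_negated": True, "lang": "en"})
inside = ToySegment("disease", 10, 16)
outside = ToySegment("disease", 40, 46)

# Only "is_negated" is requested, so "lang" is not propagated.
duplicate_attrs([sentence], [inside, outside], attr_labels=["is_negated"])
```

In this sketch only `inside` receives the attribute, because `outside` does not fall within the range of the source segment.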

property description: OperationDescription

Contains all the operation init parameters.

Return type: OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

compute_nested_segments(source_segments, target_segments)

Return source segments aligned with their nested segments. Only target segments fully contained in a source segment are returned as nested segments.

Parameters:

source_segments (list of Segment) – List of source segments
target_segments (list of Segment) – List of segments to align

Return type:

list[tuple[Segment, list[Segment]]]

Returns:

list of tuple – List of aligned segments
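As a rough illustration of the return shape, the alignment can be sketched over plain `(start, end)` tuples. The helper `align_nested` is hypothetical and is not the library function; it assumes single contiguous ranges, and it keeps source segments that have no nested targets (paired with an empty list), which may differ from what medkit does.

```python
def align_nested(source_spans, target_spans):
    """Pair each source span with the target spans that lie entirely inside it."""
    aligned = []
    for src_start, src_end in source_spans:
        nested = [
            (start, end)
            for start, end in target_spans
            if src_start <= start and end <= src_end
        ]
        aligned.append(((src_start, src_end), nested))
    return aligned


sources = [(0, 20), (21, 40)]
targets = [(2, 8), (18, 25), (30, 35)]

# (18, 25) straddles the boundary between the two sources, so it is nested in neither.
pairs = align_nested(sources, targets)
```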

class DocumentSplitter(segment_label, entity_labels=None, attr_labels=None, relation_labels=None, name=None, uid=None)

Split text documents using their segments as a reference.

The resulting ‘mini-documents’ contain the entities belonging to each segment along with their attributes.

This operation can be used to create datasets from medkit text documents.

Instantiate the document splitter.

Parameters:

segment_label (str) – Label of the segments to use as references for the splitter
entity_labels (list of str, optional) – Labels of entities to be included in the mini documents. If None, all entities from the document will be included.
attr_labels (list of str, optional) – Labels of the attributes to be included into the new annotations. If None, all attributes will be included.
relation_labels (list of str, optional) – Labels of relations to be included in the mini documents. If None, all relations will be included.
name (str, optional) – Name describing the splitter (defaults to the class name).
uid (str, optional) – Identifier of the operation

Methods:

`run`(docs)	Split docs into mini documents
`set_prov_tracer`(prov_tracer)	Enable provenance tracing.

Attributes:

description

Contains all the operation init parameters.

run(docs)

Split docs into mini documents.

Parameters: docs (list of TextDocument) – List of text documents to split
Return type: list[TextDocument]
Returns: list of TextDocument – List of documents created from the selected segments
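The idea behind the split can be sketched without medkit. In the sketch below, `split_text` is a hypothetical helper, segments and entities are plain `(start, end)` character ranges, and each mini document is a dict. It assumes that entity offsets are shifted so that they are relative to the start of the new mini document; how medkit handles spans internally is not stated here.

```python
def split_text(text, segment_spans, entity_spans):
    """Cut `text` at each segment span and re-base the entities that fall inside it."""
    mini_docs = []
    for seg_start, seg_end in segment_spans:
        entities = [
            # Shift offsets so they index into the mini document's own text.
            (start - seg_start, end - seg_start)
            for start, end in entity_spans
            if seg_start <= start and end <= seg_end
        ]
        mini_docs.append({"text": text[seg_start:seg_end], "entities": entities})
    return mini_docs


text = "No fever. Patient has asthma."
segments = [(0, 9), (10, 29)]   # one range per sentence
entities = [(3, 8), (22, 28)]   # "fever" and "asthma" in the full text

mini_docs = split_text(text, segments, entities)
```

Each resulting mini document carries only the entities found inside its segment, which is what makes the output usable as a per-segment dataset.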

property description: OperationDescription

Contains all the operation init parameters.

Return type: OperationDescription

set_prov_tracer(prov_tracer)

Enable provenance tracing.

Parameters: prov_tracer (ProvTracer) – The provenance tracer used to trace the provenance.

filter_overlapping_entities(entities)

Filter a list of entities and remove overlaps. This function can be useful when creating data for named entity recognition, where each ‘word’ of the text can belong to at most one entity. When an overlap is detected, the longest entity is preferred.

Parameters: entities (list of Entity) – Entities to filter
Return type: list[Entity]
Returns: list of Entity – Filtered entities
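One way to realise the "longest entity is preferred" rule is a greedy pass over the spans, longest first. The sketch below is an assumption about the general approach, not medkit's actual algorithm: `drop_overlaps` is a hypothetical helper working on plain `(start, end)` ranges with exclusive ends, and ties between equally long spans are broken by the earliest start.

```python
def drop_overlaps(spans):
    """Keep a non-overlapping subset of (start, end) spans, preferring longer ones."""
    kept = []
    # Visit the longest spans first so they win any conflict; ties go to the earliest start.
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        overlaps = any(start < k_end and k_start < end for k_start, k_end in kept)
        if not overlaps:
            kept.append((start, end))
    # Return the survivors in reading order.
    return sorted(kept)


spans = [(0, 5), (3, 12), (10, 14), (20, 24)]

# (3, 12) is the longest span, so the two shorter spans that overlap it are discarded.
filtered = drop_overlaps(spans)
```

A greedy pass is simple and deterministic, but it does not maximise the number of entities kept; a single long entity can displace several shorter ones.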
