medkit.text.ner

APIs

To access these APIs, import them like this:

from medkit.text.ner import <api_to_import>

Classes:

`ADICAPNormAttribute`(code[, sampling_mode, ...])	Attribute describing a tissue sample using the ADICAP (Association pour le Développement de l'Informatique en Cytologie et Anatomo-Pathologie) coding.
`DateAttribute`(label[, year, month, day, ...])	Attribute representing an absolute date or time associated to a segment or entity.
`DucklingMatcher`(output_label, version[, ...])	Entity annotator using Duckling (https://github.com/facebook/duckling).
`DurationAttribute`(label[, years, months, ...])	Attribute representing a time quantity associated to a segment or entity.
`IAMSystemMatcher`(matcher[, label_provider, ...])	Entity annotator and linker based on iamsystem library
`MedkitKeyword`(label, kb_id, kb_name, ent_label)	A recommended iamsystem's IEntity implementation
`RegexpMatcher`([rules, attrs_to_copy, name, uid])	Entity annotator relying on regexp-based rules
`RegexpMatcherNormalization`(kb_name, ...)	Descriptor of normalization attributes to attach to entities created from a RegexpMatcherRule
`RegexpMatcherRule`(regexp, label[, id, ...])	Regexp-based rule to use with RegexpMatcher
`RegexpMetadata`(**kwargs)	Metadata dict added to entities matched by `RegexpMatcher`
`RelativeDateAttribute`(label, direction[, ...])	Attribute representing a relative date or time associated to a segment or entity, i.e. a date/time offset from an (unknown) reference date/time, with a direction.
`RelativeDateDirection`(value)	Direction of a `RelativeDateAttribute`
`UMLSNormAttribute`(cui, umls_version[, term, ...])	Normalization attribute linking an entity to a CUI in the UMLS knowledge base

class ADICAPNormAttribute(code, sampling_mode=None, technic=None, organ=None, pathology=None, pathology_type=None, behaviour_type=None, metadata=None, uid=None)

Attribute describing a tissue sample using the ADICAP (Association pour le Développement de l’Informatique en Cytologie et Anatomo-Pathologie) coding.

See https://smt.esante.gouv.fr/wp-json/ans/terminologies/document?terminologyId=terminologie-adicap&fileName=cgts_sem_adicap_fiche-detaillee.pdf for a complete description of the coding.

This class replicates EDS-NLP’s Adicap class, making it a medkit Attribute.

The code field fully describes the tissue sample. Additional information is derived from code and exposed in human-readable fields (sampling_mode, technic, organ, pathology, pathology_type, behaviour_type).

Variables

uid – Identifier of the attribute
label – The attribute label, always set to EntityNormAttribute.LABEL
code – ADICAP code as a string (ex: “BHGS0040”)
kb_id – Same as code
sampling_mode (Optional[str]) – Sampling mode (ex: “BIOPSIE CHIRURGICALE”)
technic (Optional[str]) – Sampling technic (ex: “HISTOLOGIE ET CYTOLOGIE PAR INCLUSION”)
organ (Optional[str]) – Organ and regions (ex: “SEIN (ÉGALEMENT UTILISÉ CHEZ L’HOMME)”)
pathology (Optional[str]) – General pathology (ex: “PATHOLOGIE GÉNÉRALE NON TUMORALE”)
pathology_type (Optional[str]) – Pathology type (ex: “ETAT SUBNORMAL - LESION MINEURE”)
behaviour_type (Optional[str]) – Behaviour type (ex: “CARACTERES GENERAUX”)
metadata – Metadata of the attribute

class UMLSNormAttribute(cui, umls_version, term=None, score=None, sem_types=None, metadata=None, uid=None)

Normalization attribute linking an entity to a CUI in the UMLS knowledge base

Variables

uid – Identifier of the attribute
label – The attribute label, always set to EntityNormAttribute.LABEL
value – Value of the attribute, built by prefixing the cui with “umls:”
kb_name – Name of the knowledge base. Always “umls”
kb_id – CUI (Concept Unique Identifier) to which the annotation should be linked
cui – Convenience alias of kb_id
kb_version – Version of the UMLS database (ex: “202AB”)
umls_version – Convenience alias of kb_version
term – Optional normalized version of the entity text
score – Optional score reflecting confidence of this link
sem_types (Optional[List[str]]) – Optional IDs of semantic types of the CUI (ex: [“T047”])
metadata – Metadata of the attribute
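
The relationship between the fields listed above (value built from the CUI, cui and umls_version as aliases) can be sketched with a small stand-in class. This is an illustration only: `UMLSNormSketch` is a hypothetical name, it does not import medkit, and the version string used is just an example value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UMLSNormSketch:
    """Hypothetical stand-in for illustration; not medkit's UMLSNormAttribute."""

    kb_id: str  # the CUI the entity is linked to
    kb_version: str  # the UMLS release the CUI comes from
    term: Optional[str] = None
    score: Optional[float] = None
    kb_name: str = "umls"  # the knowledge base name is fixed

    @property
    def cui(self) -> str:
        # Convenience alias of kb_id
        return self.kb_id

    @property
    def umls_version(self) -> str:
        # Convenience alias of kb_version
        return self.kb_version

    @property
    def value(self) -> str:
        # The value is the CUI prefixed with "umls:"
        return f"{self.kb_name}:{self.kb_id}"


norm = UMLSNormSketch(kb_id="C0011849", kb_version="2021AB", term="Diabetes Mellitus")
print(norm.value)  # umls:C0011849
```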

class DucklingMatcher(output_label, version, url='http://localhost:8000', locale='fr_FR', dims=None, attrs_to_copy=None, uid=None)

Entity annotator using Duckling (https://github.com/facebook/duckling).

This annotator can parse several types of information in multiple languages: amount of money, credit card numbers, distance, duration, email, numeral, ordinal, phone number, quantity, temperature, time, url, volume.

This annotator currently requires a running Duckling server. The easiest method is to run a Docker container:

$ docker run --rm -d -p <PORT>:8000 --name duckling rasa/duckling:<TAG>

This command will start a Duckling server listening on port <PORT>. The version of the server is identified by <TAG>

Instantiate the Duckling matcher

Parameters

version (str) – Version of the Duckling server.
output_label (str) – Label to use for attributes created by this annotator.
url (str) – URL of the server. Defaults to “http://localhost:8000”
locale (str) – Language flag of the text to parse following ISO-639-1 standard, e.g. “fr_FR”
dims (Optional[List[str]]) – List of dimensions to extract. If None, all available dimensions will be extracted.
attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc)
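
To make the url, locale and dims parameters more concrete, here is a minimal sketch of the kind of HTTP request a client can send to a Duckling server's /parse endpoint. The helper name `build_parse_request` is hypothetical, and whether DucklingMatcher builds its request exactly this way is an assumption. The request is only constructed here, not sent.

```python
import json
from typing import List, Optional
from urllib.parse import urlencode
from urllib.request import Request


def build_parse_request(
    text: str,
    url: str = "http://localhost:8000",
    locale: str = "fr_FR",
    dims: Optional[List[str]] = None,
) -> Request:
    """Build (but do not send) a form-encoded POST request for Duckling's /parse."""
    payload = {"text": text, "locale": locale}
    if dims is not None:
        # Dimensions are passed as a JSON-encoded list; omitting the field
        # leaves the choice of dimensions to the server
        payload["dims"] = json.dumps(dims)
    data = urlencode(payload).encode("utf-8")
    return Request(f"{url}/parse", data=data, method="POST")


request = build_parse_request("dans 2 semaines", dims=["time", "duration"])
print(request.full_url)  # http://localhost:8000/parse
# With a server running, urllib.request.urlopen(request) would return the JSON results.
```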

Methods:

run(segments)

Return entities for each match in segments

run(segments)

Return entities for each match in segments

Parameters: segments (List[Segment]) – List of segments into which to look for matches
Return type: List[Entity]
Returns: entities (List[Entity]) – Entities found in segments

class RegexpMatcher(rules=None, attrs_to_copy=None, name=None, uid=None)

Entity annotator relying on regexp-based rules

To detect entities, the module uses rules that may or may not be sensitive to unicode. When a rule is not unicode-sensitive, we try to convert unicode chars to the closest ASCII chars. However, some characters need to be pre-processed beforehand (e.g., n° -> number). So, if the converted text and the original text have different lengths, we fall back on the initial unicode text for detection even if the rule is not unicode-sensitive. In this case, a warning is logged recommending that the data be pre-processed.
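
The fallback logic described above can be sketched as follows. This is a minimal illustration that relies on the standard library's unicodedata module as a stand-in for the conversion step; the library medkit actually uses for the conversion may differ, and the function names are hypothetical.

```python
import unicodedata


def closest_ascii(text: str) -> str:
    """Strip diacritics and drop characters that have no ASCII equivalent."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def text_for_matching(text: str, unicode_sensitive: bool) -> str:
    """Return the text a rule would be run on."""
    if unicode_sensitive:
        return text
    ascii_text = closest_ascii(text)
    if len(ascii_text) != len(text):
        # Character offsets would no longer line up with the original text,
        # so fall back on the unicode text (this is where a warning would be logged)
        return text
    return ascii_text


print(text_for_matching("diab\u00e8te", unicode_sensitive=False))  # diabete
print(text_for_matching("n\u00b0 12", unicode_sensitive=False))  # n° 12 (unchanged)
```

In the second call the degree sign has no ASCII equivalent in this sketch, so the converted text is shorter and the original text is kept.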

Instantiate the regexp matcher

Parameters

rules (Optional[List[RegexpMatcherRule]]) – The set of rules to use when matching entities. If none provided, the rules in “regexp_matcher_default_rules.yml” will be used
attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the source segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc)
name (Optional[str]) – Name describing the matcher (defaults to the class name)
uid (str) – Identifier of the matcher

Methods:

`check_rules_sanity`(rules)	Check consistency of a set of rules
`load_rules`(path_to_rules[, encoding])	Load all rules stored in a yml file
`run`(segments)	Return entities (with optional normalization attributes) matched in segments

run(segments)

Return entities (with optional normalization attributes) matched in segments

Parameters: segments (List[Segment]) – List of segments into which to look for matches
Return type: List[Entity]
Returns: entities (List[Entity]) – Entities found in segments (with optional normalization attributes). Entities have a metadata dict with fields described in RegexpMetadata

static load_rules(path_to_rules, encoding=None)

Load all rules stored in a yml file

Parameters

path_to_rules (Path) – Path to a yml file containing a list of mappings with the same structure as RegexpMatcherRule
encoding (Optional[str]) – Encoding of the file to open

Return type

List[RegexpMatcherRule]

Returns

List[RegexpMatcherRule] – List of all the rules in path_to_rules, can be used to init a RegexpMatcher

static check_rules_sanity(rules)

Check consistency of a set of rules

class RegexpMatcherRule(regexp, label, id=None, version=None, index_extract=0, case_sensitive=False, unicode_sensitive=False, exclusion_regexp=None, normalizations=<factory>)

Regexp-based rule to use with RegexpMatcher

Variables

regexp (str) – The regexp pattern used to match entities
label (str) – The label to attribute to entities created based on this rule
id (Optional[str]) – Unique identifier of the rule to store in the metadata of the entities
version (Optional[str]) – Version string to store in the metadata of the entities
index_extract (int) – If the regexp has groups, the index of the group to use to extract the entity
case_sensitive (bool) – Whether to take case into account when running regexp and exclusion_regexp. If False, case is ignored
unicode_sensitive (bool) – If True, regexp rule matches are searched on unicode text. If False, regexp rule matches are searched on closest ASCII text when possible (cf. RegexpMatcher). In that case, regexp and exclusion_regexp shall not contain non-ASCII chars because they would never be matched.
exclusion_regexp (Optional[str]) – An optional exclusion pattern. Note that this exclusion pattern will be executed on the whole input annotation, so when relying on exclusion_regexp make sure the input annotations passed to RegexpMatcher are “local” enough (sentences or syntagmas) rather than the whole text or paragraphs
normalizations – Optional list of normalization attributes that should be attached to the entities created
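
How index_extract, case_sensitive and exclusion_regexp interact can be illustrated with plain `re` calls. This is a sketch under stated assumptions, not medkit's implementation: `apply_rule` is a hypothetical helper, and it returns character offsets instead of entities.

```python
import re
from typing import List, Optional, Tuple


def apply_rule(
    text: str,
    regexp: str,
    index_extract: int = 0,
    case_sensitive: bool = False,
    exclusion_regexp: Optional[str] = None,
) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched_text) for each match of a rule in text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # The exclusion pattern is tested against the whole input text,
    # not only against the neighbourhood of each match
    if exclusion_regexp is not None and re.search(exclusion_regexp, text, flags):
        return []
    matches = []
    for match in re.finditer(regexp, text, flags):
        # index_extract selects which group delimits the entity (0 = whole match)
        start, end = match.span(index_extract)
        matches.append((start, end, text[start:end]))
    return matches


text = "Patient has type 2 diabetes"
print(apply_rule(text, r"type (\d) diabetes"))  # [(12, 27, 'type 2 diabetes')]
print(apply_rule(text, r"type (\d) diabetes", index_extract=1))  # [(17, 18, '2')]
print(apply_rule("No sign of diabetes", r"diabetes", exclusion_regexp=r"\bno sign of\b"))  # []
```

The last call shows why input segments should be short: an exclusion pattern matching anywhere in the segment discards every match in it.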

class RegexpMatcherNormalization(kb_name, kb_version, id)

Descriptor of normalization attributes to attach to entities created from a RegexpMatcherRule

Variables

kb_name (str) – The name of the knowledge base we are referencing. Ex: “umls”
kb_version (str) – The version of the knowledge base we are referencing. Ex: “202AB”
id (Any) – The id of the entity in the knowledge base, for instance a CUI

class RegexpMetadata(**kwargs)

Metadata dict added to entities matched by RegexpMatcher

Parameters

rule_id (Union[str, int]) – Identifier of the rule used to match an entity. If the rule has no id, then the index of the rule in the list of rules is used instead.
version (Optional[str]) – Optional version of the rule used to match an entity
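
The rule_id fallback described above (use the position of the rule in the list when it has no id) amounts to the following. The helper name is hypothetical and the sketch works on bare ids rather than on rule objects.

```python
from typing import List, Optional, Union


def metadata_rule_ids(rule_ids: List[Optional[str]]) -> List[Union[str, int]]:
    """Return the rule_id stored in metadata for each rule, given the rules' own ids."""
    return [
        rule_id if rule_id is not None else index
        for index, rule_id in enumerate(rule_ids)
    ]


print(metadata_rule_ids(["diabetes", None, "asthma"]))  # ['diabetes', 1, 'asthma']
```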

class IAMSystemMatcher(matcher, label_provider=None, attrs_to_copy=None, name=None, uid=None)

Entity annotator and linker based on iamsystem library

Instantiate the operation supporting the iamsystem matcher

Parameters

matcher (Matcher) – IAM system Matcher
label_provider (Optional[Callable[[Sequence[IKeyword]], Optional[str]]]) – Callable providing the output label to set for the detected entity. As the iamsystem matcher may return several keywords for an annotation, we have to know how to provide a single entity label whatever the number of matched keywords. In medkit, normalization attributes are used for representing detected keywords.
attrs_to_copy (Optional[List[str]]) – Labels of the attributes that should be copied from the input segment to the created entity. Useful for propagating context attributes (negation, antecedent, etc).
name (Optional[str]) – Name describing the matcher (defaults to the class name)
uid (str) – Identifier of the operation
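
A label_provider is simply a function that reduces a sequence of keywords to one label. A minimal sketch of one possible policy follows. `KeywordSketch` is a hypothetical stand-in for an iamsystem keyword (it only mirrors the label and ent_label fields of MedkitKeyword), so this example does not import iamsystem or medkit.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class KeywordSketch:
    """Hypothetical stand-in for a keyword returned by the matcher."""

    label: str
    ent_label: Optional[str] = None


def first_ent_label(keywords: Sequence[KeywordSketch]) -> Optional[str]:
    """Return the entity label of the first keyword that defines one."""
    for keyword in keywords:
        if keyword.ent_label is not None:
            return keyword.ent_label
    return None


keywords = [KeywordSketch("diabetes mellitus"), KeywordSketch("diabetes", ent_label="disease")]
print(first_ent_label(keywords))  # disease
```

Other policies (most frequent label, fixed priority order between labels) fit the same signature.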

class MedkitKeyword(label, kb_id, kb_name, ent_label)

A recommended iamsystem’s IEntity implementation

Also implements SupportEntLabel, SupportKBName protocols

class DateAttribute(label, year=None, month=None, day=None, hour=None, minute=None, second=None, metadata=None, uid=None)

Attribute representing an absolute date or time associated to a segment or entity.

The date or time can be incomplete: each date/time component is optional but at least one must be provided.

Variables

uid (str) – Identifier of the attribute
label (str) – Label of the attribute
year (Optional[int]) – Year component of the date
month (Optional[int]) – Month component of the date
day (Optional[int]) – Day component of the date
hour (Optional[int]) – Hour component of the time
minute (Optional[int]) – Minute component of the time
second (Optional[int]) – Second component of the time
metadata (Dict[str, Any]) – Metadata of the attribute

Methods:

format()

Return a string representation of the date with format YYYY-MM-DD for the date part and HH:MM:SS for the time part, if present.

format()

Return a string representation of the date with format YYYY-MM-DD for the date part and HH:MM:SS for the time part, if present. Missing components are replaced with question marks

Return type: str
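
The formatting rule for incomplete dates can be sketched in a few lines. This is an illustration of the described behaviour, not medkit's code: the function name is hypothetical, and the exact layout (a space between date and time parts, the date part always being shown) is an assumption.

```python
from typing import Optional


def _pad(value: Optional[int], width: int) -> str:
    """Zero-pad a component, or return question marks when it is missing."""
    return "?" * width if value is None else f"{value:0{width}d}"


def format_partial_date(
    year: Optional[int] = None,
    month: Optional[int] = None,
    day: Optional[int] = None,
    hour: Optional[int] = None,
    minute: Optional[int] = None,
    second: Optional[int] = None,
) -> str:
    """Format as YYYY-MM-DD, followed by HH:MM:SS when a time component is present."""
    date_part = "-".join([_pad(year, 4), _pad(month, 2), _pad(day, 2)])
    if hour is None and minute is None and second is None:
        return date_part  # no time part at all
    time_part = ":".join([_pad(hour, 2), _pad(minute, 2), _pad(second, 2)])
    return f"{date_part} {time_part}"


print(format_partial_date(year=2021, month=3))  # 2021-03-??
print(format_partial_date(year=2021, month=3, day=5, hour=14))  # 2021-03-05 14:??:??
```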

class DurationAttribute(label, years=0, months=0, weeks=0, days=0, hours=0, minutes=0, seconds=0, metadata=None, uid=None)

Attribute representing a time quantity associated to a segment or entity.

Each date/time component is optional but at least one must be provided.

Variables

uid (str) – Identifier of the attribute
label (str) – Label of the attribute
years (int) – Year component of the date quantity
months (int) – Month component of the date quantity
weeks (int) – Week component of the date quantity
days (int) – Day component of the date quantity
hours (int) – Hour component of the time quantity
minutes (int) – Minute component of the time quantity
seconds (int) – Second component of the time quantity
metadata (Dict[str, Any]) – Metadata of the attribute

Methods:

format()

Return a string representation of the date/time offset.

format()

Return a string representation of the date/time offset.

Ex: “1 year 10 months 2 days”

Return type: str
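
A string such as the one in the example above can be produced by keeping only the non-zero components, from largest to smallest unit. The sketch below is a hypothetical re-implementation for illustration; how medkit handles plurals and ordering internally is an assumption.

```python
def format_duration(
    years: int = 0,
    months: int = 0,
    weeks: int = 0,
    days: int = 0,
    hours: int = 0,
    minutes: int = 0,
    seconds: int = 0,
) -> str:
    """List non-zero components from largest to smallest unit."""
    components = [
        ("year", years),
        ("month", months),
        ("week", weeks),
        ("day", days),
        ("hour", hours),
        ("minute", minutes),
        ("second", seconds),
    ]
    parts = [
        f"{quantity} {unit}" if quantity == 1 else f"{quantity} {unit}s"
        for unit, quantity in components
        if quantity != 0
    ]
    return " ".join(parts)


print(format_duration(years=1, months=10, days=2))  # 1 year 10 months 2 days
```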

class RelativeDateAttribute(label, direction, years=0, months=0, weeks=0, days=0, hours=0, minutes=0, seconds=0, metadata=None, uid=None)

Attribute representing a relative date or time associated to a segment or entity, i.e. a date/time offset from an (unknown) reference date/time, with a direction.

At least one date/time component must be non-zero.

Variables

uid (str) – Identifier of the attribute
label (str) – Label of the attribute
direction (medkit.text.ner.date_attribute.RelativeDateDirection) – Direction of the relative date. Ex: “2 years ago” corresponds to the PAST direction and “in 2 weeks” to the FUTURE direction.
years (int) – Year component of the date offset
months (int) – Month component of the date offset
weeks (int) – Week component of the date offset
days (int) – Day component of the date offset
hours (int) – Hour component of the time offset
minutes (int) – Minute component of the time offset
seconds (int) – Second component of the time offset
metadata (Dict[str, Any]) – Metadata of the attribute

Methods:

format()

Return a string representation of the date/time offset. Ex: "+ 1 year 10 months 2 days"

format()

Return a string representation of the date/time offset. Ex: “+ 1 year 10 months 2 days”

Return type: str
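
The role of the direction in the formatted string can be sketched as follows. `DirectionSketch` is a hypothetical stand-in for RelativeDateDirection with the PAST and FUTURE members mentioned above. Its member values, and the use of a "-" prefix for PAST, are assumptions (only the "+" example comes from this page).

```python
from enum import Enum


class DirectionSketch(Enum):
    """Hypothetical stand-in for RelativeDateDirection."""

    PAST = "past"
    FUTURE = "future"


def format_relative(direction: DirectionSketch, **components: int) -> str:
    """Prefix the offset with a sign that depends on the direction."""
    units = ["years", "months", "weeks", "days", "hours", "minutes", "seconds"]
    parts = []
    for unit in units:
        quantity = components.get(unit, 0)
        if quantity:
            # Drop the plural "s" when the quantity is exactly one
            parts.append(f"{quantity} {unit[:-1] if quantity == 1 else unit}")
    sign = "+" if direction is DirectionSketch.FUTURE else "-"
    return f"{sign} {' '.join(parts)}"


print(format_relative(DirectionSketch.FUTURE, years=1, months=10, days=2))
# + 1 year 10 months 2 days
print(format_relative(DirectionSketch.PAST, weeks=2))
# - 2 weeks
```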

class RelativeDateDirection(value)

Direction of a RelativeDateAttribute

Subpackages / Submodules

`medkit.text.ner.adicap_norm_attribute`
`medkit.text.ner.date_attribute`
`medkit.text.ner.duckling_matcher`
`medkit.text.ner.hf_entity_matcher`	This module needs extra-dependencies not installed as core dependencies of medkit.
`medkit.text.ner.hf_entity_matcher_trainable`	This module needs extra-dependencies not installed as core dependencies of medkit.
`medkit.text.ner.hf_tokenization_utils`
`medkit.text.ner.iamsystem_matcher`
`medkit.text.ner.quick_umls_matcher`	This module needs extra-dependencies not installed as core dependencies of medkit.
`medkit.text.ner.regexp_matcher`
`medkit.text.ner.tnm_attribute`	This package needs extra-dependencies not installed as core dependencies of medkit.
`medkit.text.ner.umls_coder_normalizer`	This module needs extra-dependencies not installed as core dependencies of medkit.
`medkit.text.ner.umls_norm_attribute`
`medkit.text.ner.umls_utils`
