Using EDS-NLP with medkit
EDS-NLP provides a set of spaCy components that are used to extract information from clinical notes written in French. Because medkit is spaCy-compatible, using EDS-NLP within medkit is supported, as we will see.
To follow this tutorial, install medkit with spaCy and EDS-NLP support:
pip install 'medkit-lib[edsnlp]'
Running an EDS-NLP spaCy pipeline on entire documents
We will need a sample text document to annotate:
from medkit.core.text import TextDocument
text = """COMPTE RENDU D'HOSPITALISATION
Monsieur Jean Dupont a été hospitalisé du 11/08/2019 au 17/08/2019 pour attaque d'asthme
ANTÉCÉDENTS
Peut-être atteint de Covid19 en aout 2020"""
doc = TextDocument(text)
and a spaCy pipeline with a few EDS-NLP components:
import spacy
nlp = spacy.blank("eds")
# General-purpose components
nlp.add_pipe("eds.normalizer")
nlp.add_pipe("eds.sentences")
# Entity extraction
nlp.add_pipe("eds.covid")
nlp.add_pipe("eds.dates")
# Context detection
nlp.add_pipe("eds.negation")
nlp.add_pipe("eds.hypothesis")
The eds.normalizer and eds.sentences components do some pre-processing,
eds.covid and eds.dates perform entity matching and create some spaCy
entities and spans, and eds.negation and eds.hypothesis attach some context
attributes to these entities and spans.
To be used within medkit, the pipeline could be wrapped into a generic
SpacyDocPipeline operation. But medkit also provides
a dedicated EDSNLPDocPipeline operation, with some additional support
for specific EDS-NLP components:
from medkit.text.spacy.edsnlp import EDSNLPDocPipeline
eds_nlp_pipeline = EDSNLPDocPipeline(nlp)
The operation is executed by applying its run() method on a list of documents:
eds_nlp_pipeline.run([doc])
Let’s look at the entities and segments that were found:
for entity in doc.anns.entities:
    print(f"{entity.label}: {entity.text!r}")
for segment in doc.anns.segments:
    print(f"{segment.label}: {segment.text!r}")
covid: 'Covid19'
dates: '11/08/2019'
dates: '17/08/2019'
dates: 'aout 2020'
Here are the attributes attached to the "covid" entity:
entity = doc.anns.get_entities(label="covid")[0]
for attr in entity.attrs:
    print(f"{attr.label}={attr.value}")
negation=False
hypothesis=True
and the attributes of the first "dates" segment:
date_seg = doc.anns.get_segments(label="dates")[0]
for attr in date_seg.attrs:
    print(f"{attr.label}={attr.value}")
date=None
negation=False
hypothesis=False
You may notice that the attributes created by the EDS-NLP components have been
slightly transformed. For instance, eds.hypothesis creates identical
"hypothesis" and "hypothesis_" attributes, as well as an optional
"hypothesis_cues" attribute. When these are converted to medkit attributes, the
redundant "hypothesis_" attribute is dropped, and "hypothesis_cues" is
integrated as additional metadata of the "hypothesis" attribute (if present).
EDSNLPDocPipeline will perform this sort of transformation for many
other EDS-NLP components.
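To make the transformation described above concrete, here is a minimal sketch in plain Python. It is not medkit's implementation: the function name fold_context_attrs, the dict-based representation, and the "cues" metadata key are all assumptions made for illustration.

```python
from typing import Any, Dict


def fold_context_attrs(spacy_attrs: Dict[str, Any], name: str) -> Dict[str, Any]:
    """Collapse `name`, `name_` and `name_cues` into one attribute-like dict.

    Hypothetical helper: it mimics the behaviour described in the text,
    not the actual code of EDSNLPDocPipeline.
    """
    attr: Dict[str, Any] = {"label": name, "value": spacy_attrs[name], "metadata": {}}
    # The trailing-underscore twin (`name_`) is redundant, so it is never copied.
    cues = spacy_attrs.get(f"{name}_cues")
    if cues:
        # The optional cues become metadata of the main attribute
        # ("cues" is an assumed key name).
        attr["metadata"]["cues"] = list(cues)
    return attr


raw = {
    "hypothesis": True,
    "hypothesis_": True,  # redundant twin (placeholder value)
    "hypothesis_cues": ["Peut-être"],
}
folded = fold_context_attrs(raw, "hypothesis")
print(folded)
```

The result holds a single "hypothesis" entry whose metadata carries the cues, which mirrors the shape of the attributes printed earlier.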
Note
The transformations performed by EDSNLPDocPipeline can be overridden
or extended with the medkit_attribute_factories init parameter. For a list of
all the default transformations, see
DEFAULT_ATTRIBUTE_FACTORIES and corresponding
functions in medkit.text.spacy.edsnlp.
Let’s now examine more closely the "date" attribute:
date_seg = doc.anns.get_segments(label="dates")[0]
date_attr = date_seg.attrs.get(label="date")[0]
date_attr
DateAttribute(label='date', value=None, metadata={}, uid='ca83df6a-1a6c-11ee-82d5-0242ac110002', year=2019, month=8, day=11, hour=None, minute=None, second=None)
This attribute is an instance of DateAttribute, a
subclass of Attribute. While its value field is None,
it has year, month, day (etc.) fields containing the different parts of the
date that was detected. A string representation can be obtained by calling its
format() method:
date_attr.format()
'2019-08-11'
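Because the date parts are exposed as separate fields, they can also be combined into a standard datetime.date object. The sketch below assumes only that the attribute has year, month and day fields, any of which may be None. The helper to_date is hypothetical, and a SimpleNamespace stand-in replaces a real DateAttribute so that the snippet runs without medkit installed.

```python
import datetime
from types import SimpleNamespace


def to_date(attr) -> datetime.date:
    """Build a datetime.date from an object exposing year, month and day.

    Hypothetical helper, not part of medkit.
    """
    if None in (attr.year, attr.month, attr.day):
        raise ValueError("incomplete date: year, month and day are all required")
    return datetime.date(attr.year, attr.month, attr.day)


# Stand-in with the same field values as the DateAttribute shown above
stand_in = SimpleNamespace(year=2019, month=8, day=11)
iso = to_date(stand_in).isoformat()
print(iso)  # 2019-08-11
```

Raising on incomplete dates is a design choice of this sketch; partially specified dates such as 'aout 2020' would need a different policy, for example defaulting the day to 1.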
One of the benefits of using EDSNLPDocPipeline instead of
SpacyDocPipeline is that some special EDS-NLP
attributes are automatically converted to a corresponding
Attribute subclass.
Here are the supported EDS-NLP attribute values and the corresponding medkit classes:
AdicapCode (created by eds.adicap): medkit.text.ner.ADICAPNormAttribute
TNM (created by eds.TNM): medkit.text.ner.tnm_attribute.TNMAttribute
AbsoluteDate (created by eds.dates): medkit.text.ner.DateAttribute
RelativeDate (created by eds.dates): medkit.text.ner.RelativeDateAttribute
Duration (created by eds.dates): medkit.text.ner.DurationAttribute
Running an EDS-NLP spaCy pipeline at the annotation level
So far, we have wrapped a spaCy pipeline and executed it on an entire document
with EDSNLPDocPipeline. But it is also possible to run the spaCy
pipeline on text annotations instead of a document with
EDSNLPPipeline. To illustrate this, let’s create a medkit pipeline
using pure medkit operations for sentence tokenization and entity matching, and
EDS-NLP spaCy components for covid entity matching:
from medkit.core import Pipeline, PipelineStep
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule
from medkit.text.segmentation import SentenceTokenizer
from medkit.text.spacy.edsnlp import EDSNLPPipeline
sentence_tokenizer = SentenceTokenizer()
matcher = RegexpMatcher(rules=[RegexpMatcherRule(regexp=r"\basthme\b", label="asthme")])
nlp = spacy.blank("eds")
nlp.add_pipe("eds.covid")
eds_nlp_pipeline = EDSNLPPipeline(nlp)
pipeline = Pipeline(
    steps=[
        PipelineStep(operation=sentence_tokenizer, input_keys=["full_text"], output_keys=["sentences"]),
        PipelineStep(operation=matcher, input_keys=["sentences"], output_keys=["entities"]),
        PipelineStep(operation=eds_nlp_pipeline, input_keys=["sentences"], output_keys=["entities"]),
    ],
    input_keys=["full_text"],
    output_keys=["entities"],
)
doc = TextDocument(text)
entities = pipeline.run([doc.raw_segment])
for entity in entities:
    print(f"{entity.label}: {entity.text!r}")
asthme: 'asthme'
covid: 'Covid19'
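Both matching steps in the pipeline above read from "sentences" and write to "entities", which is why the output contains the results of both. The following is a simplified model of that key-based routing in plain Python, not medkit's actual Pipeline code: run_steps and the three toy operations are hypothetical, and plain strings stand in for segments and entities.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A step is (operation, input_keys, output_keys). Operations here take one
# list and return one list, which is simpler than real medkit operations.
Step = Tuple[Callable[[List[str]], List[str]], Sequence[str], Sequence[str]]


def run_steps(steps: List[Step], inputs: Dict[str, List[str]], output_key: str) -> List[str]:
    """Run steps in order, routing data between them through named keys."""
    data: Dict[str, List[str]] = {key: list(items) for key, items in inputs.items()}
    for operation, input_keys, output_keys in steps:
        # Gather everything stored so far under the step's input keys
        step_input = [item for key in input_keys for item in data.get(key, [])]
        result = operation(step_input)
        # Append rather than overwrite, so steps sharing a key accumulate
        for key in output_keys:
            data.setdefault(key, []).extend(result)
    return data.get(output_key, [])


def split_lines(texts: List[str]) -> List[str]:
    return [line for text in texts for line in text.splitlines()]


def find_asthme(sentences: List[str]) -> List[str]:
    return ["asthme" for sentence in sentences if "asthme" in sentence]


def find_covid(sentences: List[str]) -> List[str]:
    return ["Covid19" for sentence in sentences if "Covid19" in sentence]


steps: List[Step] = [
    (split_lines, ["full_text"], ["sentences"]),
    (find_asthme, ["sentences"], ["entities"]),
    (find_covid, ["sentences"], ["entities"]),
]
sample = "hospitalisé pour attaque d'asthme\nPeut-être atteint de Covid19"
found = run_steps(steps, {"full_text": [sample]}, "entities")
print(found)  # ['asthme', 'Covid19']
```

In this model the order of the results follows the order of the steps, which matches the output shown above.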
For more information about advanced usage of EDSNLPDocPipeline and
EDSNLPPipeline, you may refer to the API doc of
medkit.text.spacy.edsnlp.