Annotating with a spaCy pipeline
This example shows how to combine medkit and a spaCy pipeline to annotate medkit documents.
The spaCy universe includes projects that provide custom versions of spaCy pipeline objects.
This example uses English documents, because the pipelines used here do not work with French documents. The aim is to show how to annotate with spaCy; you could use your own custom pipelines that work with French documents instead.
# You can download the file from the medkit source repository
# !wget https://raw.githubusercontent.com/medkit-lib/medkit/main/docs/data/text/1-EN-version.txt
from pathlib import Path
from medkit.core.text import TextDocument
medkit_doc = TextDocument.from_file(Path("../../data/text/1-EN-version.txt"))
print(medkit_doc.text)
SUBJECTIVE: This 23-year-old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also. She used that last summer and she began using it again two weeks ago. It does not appear to be working very well. She has used over-the-counter sprays but no prescription nasal sprays. She does have asthma but doest not require daily medication for this and does not think it is flaring up.
MEDICATIONS: Her only medication currently is Ortho Tri-Cyclen and the Allegra.
ALLERGIES: She has no known medicine allergies.
OBJECTIVE:
Vitals: Weight was 130 pounds and blood pressure 124/78.
HEENT: Her throat was mildly erythematous without exudate. Nasal mucosa was erythematous and swollen. Only clear drainage was seen. TMs were clear.
Neck: Supple without adenopathy.
Lungs: Clear.
ASSESSMENT: Allergic rhinitis.
PLAN:
- She will try Zyrtec instead of Allegra again. Another option will be to use loratadine. She does not think she has prescription coverage so that might be cheaper.
- Samples of Nasonex two sprays in each nostril given for three weeks. A prescription was written as well.
The document has a few sections describing the status of a female patient. We can start by detecting some entities. In the spaCy universe, we found spacy-stanza, a connector to the Stanza library. Stanza[1] is a library developed by the Stanford NLP research group that includes biomedical and clinical NER models for English documents.
# install spacy-stanza
!python -m pip install spacy-stanza
Annotating segments with spaCy
Let’s see how to create medkit entities with a spaCy nlp object.
Prepare the spacy-stanza nlp pipeline
Stanza provides a list of available biomedical NER packages.
Let’s download the i2b2 Stanza package, a pretrained model that detects ‘PROBLEM’, ‘TEST’ and ‘TREATMENT’ entities.
# import spacy related modules
import stanza
import spacy_stanza
# stanza saves the downloaded models to disk
# download and initialize the i2b2 pipeline
stanza.download('en', package='i2b2')
# Define the nlp object
nlp_spacy = spacy_stanza.load_pipeline('en', package='mimic', processors={'ner': 'i2b2'})
Define a medkit operation to add the entities
Medkit provides SpacyPipeline, an operation that wraps a spaCy nlp object to annotate segments.
An nlp object may create many spaCy annotations, so you can select which spaCy entities, spans and attributes are converted to medkit annotations. By default, all of them are converted.
from medkit.text.spacy import SpacyPipeline
# Defines the medkit operation
medkit_stanza_matcher = SpacyPipeline(nlp=nlp_spacy)
# Detect entities using the raw segment
entities = medkit_stanza_matcher.run([medkit_doc.raw_segment])
# Add entities to the medkit document
for ent in entities:
    medkit_doc.anns.add(ent)
print(medkit_doc.anns.get_entities()[0])
Entity(uid='5d76fec2-c44c-11ee-87b2-0242ac110002', label='PROBLEM', attrs=EntityAttributeContainer(ann_id='5d76fec2-c44c-11ee-87b2-0242ac110002', attrs=[]), metadata={}, keys=set(), spans=[Span(start=69, end=78)], text='allergies')
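The `spans` field in the printed entity holds character offsets into the document text. As a quick sanity check, the sketch below slices the first sentence of the document with the offsets shown above (69 and 78). It uses a plain string rather than medkit objects, so nothing here depends on the medkit API.

```python
# First sentence of the example document, copied from the text printed earlier.
text = (
    "SUBJECTIVE: This 23-year-old white female presents "
    "with complaint of allergies."
)

# Offsets taken from Span(start=69, end=78) in the entity output.
start, end = 69, 78

# The end offset is exclusive, as in Python slicing.
covered_text = text[start:end]
print(covered_text)
```

The slice matches the `text` field of the entity, which suggests that spans index directly into the raw document text.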
That’s all! We have detected entities using the biomedical model developed by the Stanford group.
Let’s visualize all the detected entities.
from spacy import displacy
from medkit.text.spacy.displacy_utils import medkit_doc_to_displacy
# Add some colors
options_displacy = dict(colors={'TREATMENT': "#85C1E9", "PROBLEM": "#cfe2f3"})
# Format the medkit doc to displacy
displacy_data = medkit_doc_to_displacy(medkit_doc)
displacy.render(displacy_data, style="ent", manual=True, options=options_displacy)
MEDICATIONS: Her only medication TREATMENT currently is Ortho Tri-Cyclen TREATMENT and the Allegra.
ALLERGIES: She has no known medicine allergies PROBLEM .
OBJECTIVE:
Vitals: Weight TEST was 130 pounds and blood pressure TEST 124/78.
HEENT: Her throat was mildly erythematous PROBLEM without exudate PROBLEM . Nasal mucosa was erythematous PROBLEM and swollen PROBLEM . Only clear drainage PROBLEM was seen. TMs were clear.
Neck: Supple without adenopathy PROBLEM .
Lungs: Clear.
ASSESSMENT: Allergic rhinitis PROBLEM .
PLAN:
- She will try Zyrtec TREATMENT instead of Allegra TREATMENT again. Another option will be to use loratadine TREATMENT . She does not think she has prescription coverage TREATMENT so that might be cheaper.
- Samples of Nasonex TREATMENT two sprays in each nostril given for three weeks. A prescription TREATMENT was written as well.
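With `manual=True`, displaCy renders a plain dictionary instead of a spaCy `Doc`. The sketch below builds such a dictionary by hand for one sentence of the document, assuming the standard displaCy manual format (`text`, plus `ents` entries with `start`, `end` and `label`). The helper `to_displacy_ents` is a hypothetical name used for illustration, the entity texts are read off the rendering above, and the exact structure returned by `medkit_doc_to_displacy` may carry more fields.

```python
def to_displacy_ents(text, entities):
    """Build a displaCy 'manual' dict from (entity_text, label) pairs.

    Hypothetical helper: offsets are located with str.find, so each
    entity text is assumed to occur once in `text`.
    """
    ents = []
    for entity_text, label in entities:
        start = text.find(entity_text)
        if start == -1:
            continue  # skip entity texts that do not occur in the text
        ents.append(
            {"start": start, "end": start + len(entity_text), "label": label}
        )
    # Keep entities in document order.
    ents.sort(key=lambda ent: ent["start"])
    return {"text": text, "ents": ents, "title": None}


sentence = "She will try Zyrtec instead of Allegra again."
data = to_displacy_ents(
    sentence,
    [("Allegra", "TREATMENT"), ("Zyrtec", "TREATMENT")],
)
print(data["ents"])
```

A dictionary of this shape could then be passed to `displacy.render(data, style="ent", manual=True)`.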
Annotating documents with spaCy
Here, we already have an annotated document. We will see how to use spaCy to enrich existing annotations.
Exploring the spaCy universe, we found negspaCy, a pipeline component that detects negation in spaCy entities. Using the SpacyDocPipeline operation, we can annotate the entities of the document and add the resulting attributes to them directly.
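negspaCy is based on the NegEx algorithm, which flags an entity as negated when a trigger term such as "no" or "without" occurs close to it. The sketch below is a deliberately simplified, standard-library illustration of that idea, not negspaCy's implementation: the trigger list, the `is_negated` helper and the five-token window are all assumptions made for the example.

```python
import re

# Illustrative subset of negation triggers (assumption, not the NegEx term list).
NEGATION_TRIGGERS = {"no", "not", "without", "denies"}


def is_negated(sentence, entity_text, window=5):
    """Return True if a trigger occurs within `window` tokens before the entity."""
    tokens = re.findall(r"\w+", sentence.lower())
    entity_tokens = re.findall(r"\w+", entity_text.lower())
    n = len(entity_tokens)
    for i in range(len(tokens) - n + 1):
        # Locate the first occurrence of the entity in the sentence.
        if tokens[i:i + n] == entity_tokens:
            preceding = tokens[max(0, i - window):i]
            return any(token in NEGATION_TRIGGERS for token in preceding)
    return False  # entity not found in the sentence


print(is_negated("Neck: Supple without adenopathy.", "adenopathy"))
print(is_negated("ASSESSMENT: Allergic rhinitis.", "Allergic rhinitis"))
```

A real implementation also handles triggers that follow the entity and terms that end the negation scope, which this sketch ignores.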
Prepare the negspacy nlp object
# install negspacy
!python -m pip install negspacy
# download english model from spacy
import spacy
if not spacy.util.is_package("en_core_web_sm"):
    spacy.cli.download("en_core_web_sm")
# Import spacy nlp object from negspacy
from negspacy.negation import Negex
# Load the EN spacy model
# The built-in NER is disabled because it would add generic entities
nlp_spacy_negex = spacy.load("en_core_web_sm", disable=["ner"])
# Config to detect negation in the i2b2 entities
i2b2_labels = ["PROBLEM", "TEST", "TREATMENT"]
nlp_spacy_negex.add_pipe("negex", config={"ent_types": i2b2_labels})
<negspacy.negation.Negex at 0x7fea566aa750>
Define a medkit operation to add the attributes
Medkit provides SpacyDocPipeline, an operation that wraps a spaCy nlp object to annotate documents.
The goal is to add attributes to the entities, so we select the entities of interest and do not transfer their current attributes, which are not needed to detect negation.
from medkit.text.spacy import SpacyDocPipeline
# Define the spacy wrapper
negation_detector = SpacyDocPipeline(
    nlp=nlp_spacy_negex,
    medkit_labels_anns=i2b2_labels,  # entities to annotate
    medkit_attrs=[],  # the current entity attrs are not important
)
# Run the detector
# SpacyDocPipeline adds the annotations to the document itself,
# so there is no need to add them manually as with `medkit_stanza_matcher`
negation_detector.run([medkit_doc])
Let’s see if the negation has been detected in the entities.
print(medkit_doc.anns.get_entities()[0])
Entity(uid='5d76fec2-c44c-11ee-87b2-0242ac110002', label='PROBLEM', attrs=EntityAttributeContainer(ann_id='5d76fec2-c44c-11ee-87b2-0242ac110002', attrs=[Attribute(label='negex', value=False, metadata={}, uid='682206d2-c44c-11ee-87b2-0242ac110002')]), metadata={}, keys=set(), spans=[Span(start=69, end=78)], text='allergies')
As we can see, the entity now has an attribute called negex with value=False, which means that the entity is not part of a negation.
Let’s find the negated entities:
print("The following entities are negated: \n\n")
for entity in medkit_doc.anns.get_entities():
    # Get the negex attr
    attrs = entity.attrs.get(label="negex")
    # If the attr exists and is positive, show a message.
    if len(attrs) > 0 and attrs[0].value:
        print(entity.label, entity.text, entity.spans)
The following entities are negated:
TREATMENT prescription nasal sprays [Span(start=476, end=501)]
TREATMENT daily medication [Span(start=547, end=563)]
PROBLEM flaring [Span(start=598, end=605)]
PROBLEM known medicine allergies [Span(start=714, end=738)]
PROBLEM exudate [Span(start=861, end=868)]
PROBLEM adenopathy [Span(start=984, end=994)]
TREATMENT prescription coverage [Span(start=1169, end=1190)]
TREATMENT Nasonex [Span(start=1230, end=1237)]
We can show the attribute value in displacy by adding more information to the labels.
# enrich entity labels with [NEG] suffix
def format_entity(entity):
    label = entity.label
    negation_attr = entity.attrs.get(label="negex")[0]
    if negation_attr.value:
        return label + " [NEG]"
    return label
options_displacy = dict(colors={'TREATMENT [NEG]': "#D28E98", "PROBLEM [NEG]": "#D28E98"})
# Format the medkit doc to displacy with an entity formatter
displacy_data = medkit_doc_to_displacy(medkit_doc, entity_formatter=format_entity)
displacy.render(displacy_data, style="ent", manual=True, options=options_displacy)
MEDICATIONS: Her only medication TREATMENT currently is Ortho Tri-Cyclen TREATMENT and the Allegra.
ALLERGIES: She has no known medicine allergies PROBLEM [NEG] .
OBJECTIVE:
Vitals: Weight TEST was 130 pounds and blood pressure TEST 124/78.
HEENT: Her throat was mildly erythematous PROBLEM without exudate PROBLEM [NEG] . Nasal mucosa was erythematous PROBLEM and swollen PROBLEM . Only clear drainage PROBLEM was seen. TMs were clear.
Neck: Supple without adenopathy PROBLEM [NEG] .
Lungs: Clear.
ASSESSMENT: Allergic rhinitis PROBLEM .
PLAN:
- She will try Zyrtec TREATMENT instead of Allegra TREATMENT again. Another option will be to use loratadine TREATMENT . She does not think she has prescription coverage TREATMENT [NEG] so that might be cheaper.
- Samples of Nasonex TREATMENT [NEG] two sprays in each nostril given for three weeks. A prescription TREATMENT was written as well.
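`format_entity` indexes the first `negex` attribute directly, which would raise an `IndexError` for an entity that has no such attribute (for instance, one added after the negation step). A more defensive variant is sketched below. It relies only on `entity.label` and `entity.attrs.get(label=...)` as used in this example, and the name `format_entity_safe` is hypothetical.

```python
def format_entity_safe(entity):
    """Return the entity label, suffixed with ' [NEG]' when negated.

    Entities without a 'negex' attribute keep their plain label.
    """
    negation_attrs = entity.attrs.get(label="negex")
    if negation_attrs and negation_attrs[0].value:
        return entity.label + " [NEG]"
    return entity.label
```

It can be passed to `medkit_doc_to_displacy` through `entity_formatter`, in the same way as `format_entity`.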
For more information about advanced usage of spaCy-related operations, refer to the API documentation of medkit.text.spacy.