Creating a custom text operation#

If you want to create a custom text operation from a simple user-defined function, take a look at the following examples.

Filtering annotations#

In this example, Jane wants to detect some entities (problems) in a raw text.

1. Create medkit document#

from medkit.core.text import TextDocument

text = "The patient has asthma and is using ventoline. The patient has diabetes"
doc = TextDocument(text=text)

2. Init medkit operations#

Jane would like to reuse a colleague’s file containing a list of regular expression rules for detecting entities. To do so, she has to split the text into sentences before using the RegexpMatcher component.

from medkit.text.segmentation import SentenceTokenizer

sentence_tokenizer = SentenceTokenizer()
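To give an idea of what splitting into sentences involves, here is a minimal sketch in plain Python. It is not how SentenceTokenizer is implemented: the `split_sentences` helper and its punctuation rule are assumptions made for illustration only. The point it shows is that each sentence keeps its character offsets in the original text, which is what lets entities found later be located in the document.

```python
import re

text = "The patient has asthma and is using ventoline. The patient has diabetes"


def split_sentences(text):
    """Hypothetical helper: return (start, end, sentence) tuples.

    Offsets refer to positions in the original text.
    """
    sentences = []
    # A run of non-terminal characters followed by optional end punctuation
    for m in re.finditer(r"[^.!?]+[.!?]*", text):
        chunk = m.group()
        stripped = chunk.strip()
        if not stripped:
            continue
        # Skip leading whitespace so the start offset points at the first letter
        start = m.start() + (len(chunk) - len(chunk.lstrip()))
        sentences.append((start, start + len(stripped), stripped))
    return sentences


sentences = split_sentences(text)
for sentence in sentences:
    print(sentence)
# (0, 46, 'The patient has asthma and is using ventoline.')
# (47, 71, 'The patient has diabetes')
```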

In real life, Jane should load the rules from a path using this instruction:

regexp_rules = RegexpMatcher.load_rules(path_to_rules_file)

But for this example, it is simpler for us to define this set of rules manually.

from medkit.text.ner import RegexpMatcher, RegexpMatcherRule

regexp_rules = [
    RegexpMatcherRule(regexp=r"\basthma\b", label="problem"),
    RegexpMatcherRule(regexp=r"\bventoline\b", label="treatment"),
    RegexpMatcherRule(regexp=r"\bdiabetes\b", label="problem"),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules)
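To see what these three patterns match on the example text, the sketch below applies them with Python's standard `re` module only. It does not use medkit, and the `find_matches` helper is a hypothetical name introduced for illustration; it simply shows that the word-boundary patterns yield one match each, at the same character offsets that appear in the entity spans further down this page.

```python
import re

text = "The patient has asthma and is using ventoline. The patient has diabetes"

# (pattern, label) pairs mirroring the three rules above
rules = [
    (r"\basthma\b", "problem"),
    (r"\bventoline\b", "treatment"),
    (r"\bdiabetes\b", "problem"),
]


def find_matches(text, rules):
    """Hypothetical helper: return (label, start, end, matched_text) tuples by position."""
    matches = []
    for pattern, label in rules:
        for m in re.finditer(pattern, text):
            matches.append((label, m.start(), m.end(), m.group()))
    return sorted(matches, key=lambda item: item[1])


matches = find_matches(text, rules)
for match in matches:
    print(match)
# ('problem', 16, 22, 'asthma')
# ('treatment', 36, 45, 'ventoline')
# ('problem', 63, 71, 'diabetes')
```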

3. Define filter operation#

As RegexpMatcher is based on her colleague’s file, Jane would like to add a filter operation so that only the entities labelled as problems are returned.

For that, she has to define her own filter function and use medkit tools to instantiate this custom operation.

from medkit.core.text import Entity

def keep_entities_with_label_problem(entity: Entity) -> bool:
    return entity.label == "problem"

from medkit.core.text import CustomTextOpType, create_text_operation

filter_operation = create_text_operation(function=keep_entities_with_label_problem, function_type=CustomTextOpType.FILTER)

# Same behavior as
# filter_operation = create_text_operation(
#   name="keep_entities_with_label_problem",
#   function=keep_entities_with_label_problem,
#   function_type=CustomTextOpType.FILTER)
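Conceptually, a filter operation keeps only the items for which the user-defined function returns True. The sketch below illustrates that idea in plain Python; it is not medkit's implementation. `FakeEntity` and `apply_filter` are hypothetical stand-ins introduced here so the example runs without medkit installed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FakeEntity:
    """Hypothetical stand-in for medkit's Entity, with only the fields needed here."""

    label: str
    text: str


def keep_entities_with_label_problem(entity: FakeEntity) -> bool:
    return entity.label == "problem"


def apply_filter(
    predicate: Callable[[FakeEntity], bool], items: List[FakeEntity]
) -> List[FakeEntity]:
    # Keep only the items for which the predicate returns True
    return [item for item in items if predicate(item)]


entities = [
    FakeEntity(label="problem", text="asthma"),
    FakeEntity(label="treatment", text="ventoline"),
    FakeEntity(label="problem", text="diabetes"),
]

problems = apply_filter(keep_entities_with_label_problem, entities)
print([entity.text for entity in problems])
# ['asthma', 'diabetes']
```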

4. Construct and run the pipeline#

from medkit.core import Pipeline, PipelineStep

steps = [
    PipelineStep(input_keys=["raw_text"], output_keys=["sentences"], operation=sentence_tokenizer),
    PipelineStep(input_keys=["sentences"], output_keys=["entities"], operation=regexp_matcher),
    PipelineStep(input_keys=["entities"], output_keys=["problems"], operation=filter_operation),
]

pipeline = Pipeline(
    steps=steps,
    input_keys=["raw_text"],
    output_keys=["problems"],
)

result = pipeline.run([doc.raw_segment])
result
[Entity(uid='b8f39b6e-1a6c-11ee-b684-0242ac110002', label='problem', attrs=EntityAttributeContainer(ann_id='b8f39b6e-1a6c-11ee-b684-0242ac110002', attrs=[]), metadata={'rule_id': 0, 'version': None}, keys={'problems'}, spans=[Span(start=16, end=22)], text='asthma'),
 Entity(uid='b8f39fce-1a6c-11ee-b684-0242ac110002', label='problem', attrs=EntityAttributeContainer(ann_id='b8f39fce-1a6c-11ee-b684-0242ac110002', attrs=[]), metadata={'rule_id': 2, 'version': None}, keys={'problems'}, spans=[Span(start=63, end=71)], text='diabetes')]

In this scenario, two entities with the label problem are returned.

To compare with the intermediate results produced by RegexpMatcher, we can request the intermediate entities key as the pipeline output. This returns three results.

IMPORTANT: the following code is for demonstration purposes only. All pipeline steps are still executed; we only change which output the pipeline returns.

pipeline = Pipeline(
    steps=steps,
    input_keys=["raw_text"],
    output_keys=["entities"]
)

result = pipeline.run([doc.raw_segment])
result
[Entity(uid='b8f5cbc8-1a6c-11ee-b684-0242ac110002', label='problem', attrs=EntityAttributeContainer(ann_id='b8f5cbc8-1a6c-11ee-b684-0242ac110002', attrs=[]), metadata={'rule_id': 0, 'version': None}, keys={'entities'}, spans=[Span(start=16, end=22)], text='asthma'),
 Entity(uid='b8f5cdd0-1a6c-11ee-b684-0242ac110002', label='treatment', attrs=EntityAttributeContainer(ann_id='b8f5cdd0-1a6c-11ee-b684-0242ac110002', attrs=[]), metadata={'rule_id': 1, 'version': None}, keys={'entities'}, spans=[Span(start=36, end=45)], text='ventoline'),
 Entity(uid='b8f5d000-1a6c-11ee-b684-0242ac110002', label='problem', attrs=EntityAttributeContainer(ann_id='b8f5d000-1a6c-11ee-b684-0242ac110002', attrs=[]), metadata={'rule_id': 2, 'version': None}, keys={'entities'}, spans=[Span(start=63, end=71)], text='diabetes')]
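
The key-based wiring used in both runs can be pictured with a small plain-Python sketch. It is only an illustration under simplifying assumptions, not medkit's Pipeline implementation: `Step` and `run_steps` are hypothetical names, each step has a single input and output key, and the toy functions stand in for the real operations. It shows that every step runs in order and that the requested output key only selects which stored result is returned.

```python
from typing import Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    """Hypothetical stand-in for a pipeline step with one input and one output key."""

    input_key: str
    output_key: str
    func: Callable[[list], list]


def run_steps(steps: List[Step], input_key: str, data: list, output_key: str) -> list:
    # Every step runs in order; each result is stored under its output key
    store: Dict[str, list] = {input_key: data}
    for step in steps:
        store[step.output_key] = step.func(store[step.input_key])
    # The requested key only selects which stored result is returned
    return store[output_key]


# Toy replacements for the sentence tokenizer, the matcher and the filter
TERMS = {"asthma": "problem", "ventoline": "treatment", "diabetes": "problem"}


def split_sentences(texts):
    return [s.strip() for t in texts for s in t.split(".") if s.strip()]


def find_terms(sentences):
    return [(TERMS[w], w) for s in sentences for w in s.split() if w in TERMS]


def keep_problems(entities):
    return [e for e in entities if e[0] == "problem"]


toy_steps = [
    Step("raw_text", "sentences", split_sentences),
    Step("sentences", "entities", find_terms),
    Step("entities", "problems", keep_problems),
]

text = "The patient has asthma and is using ventoline. The patient has diabetes"
problems = run_steps(toy_steps, "raw_text", [text], "problems")
entities = run_steps(toy_steps, "raw_text", [text], "entities")
print(problems)  # [('problem', 'asthma'), ('problem', 'diabetes')]
print(entities)  # [('problem', 'asthma'), ('treatment', 'ventoline'), ('problem', 'diabetes')]
```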