---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.14.4
kernelspec:
  display_name: Python 3 (ipykernel)
  language: python
  name: python3
---

# Provenance tracing

```{warning}
Provenance tracing is still under development and may be changed in the future.
```

One of the main features of medkit is the tracing of provenance information.
When used, medkit is able to tell how each annotation was created, that is to
say:
- the operation that generated it;
- the input data that was used by the operation to generate the annotation.

This is true for the whole processing chain, including intermediate steps and
annotations.

The goal is to retain enough information to later output in the
[PROV-O](https://www.w3.org/TR/prov-o/) format. More practically, it can also be
useful to know how an annotation was generated in order to know if it is
trustworthy or not.

This tutorial will teach you how to gather provenance information in medkit.
Before you read it, you should be familiar with the medkit components exposed in
the [first steps](first_steps.md) and [pipeline](pipeline.md) tutorials.

## A minimalistic provenance graph

Let's start with the simplest use case possible and take a look at provenance
for a single annotation, generated by a single operation. We are going to create
a very simple `TextDocument` containing just one sentence, and run a
`RegexpMatcher` on it that will match a single `Entity`:

```{code-cell} ipython3
from medkit.core.text import TextDocument
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule

text = "Je souffre d'asthme."
doc = TextDocument(text=text)

regexp_rule = RegexpMatcherRule(regexp=r"\basthme\b", label="problem")
regexp_matcher = RegexpMatcher(rules=[regexp_rule])
```

Before we actually call the `run()` method of our regexp matcher, we will
activate the tracing of provenance for the entities it creates. This is done by
assigning it a {class}`~medkit.core.ProvTracer` object. The `ProvTracer` is in
charge of gathering all provenance info across all the operations. Operations
need to know it because they will inform it of each annotation they create.

```{code-cell} ipython3
from medkit.core import ProvTracer

prov_tracer = ProvTracer()
regexp_matcher.set_prov_tracer(prov_tracer)
```

We may now run the regexp matcher which will, as expected, match one entity:

```{code-cell} ipython3
entities = regexp_matcher.run([doc.raw_segment])

for entity in entities:
    print(f"text={entity.text!r}, label={entity.label}")
```

Let's retrieve and inspect the provenance info concerning this entity:

```{code-cell} ipython3
def print_prov(prov):
    # data item
    print(f"data_item={prov.data_item.text!r}")
    # operation description (if available)
    op_desc = prov.op_desc
    print(f"op={op_desc.name if op_desc is not None else None}")
    # source data items
    print(f"source_items={[d.text for d in prov.source_data_items]}")
    # derived data items
    print(f"derived_items={[d.text for d in prov.derived_data_items]}", end="\n\n")

entity = entities[0]
prov = prov_tracer.get_prov(entity.uid)
print_prov(prov)
```

The `get_prov()` method of `ProvTracer` returns a simple
{class}`~medkit.core.Prov` object containing all the provenance info related to
a specific object. It has the following attributes:
 - `data_item` contains the object to which the provenance info refers. Here, it
   is our entity. Note that it doesn't have to be an `Annotation` subclass. For
   instance, it could also be an `Attribute`;
 - `op_desc` holds an {class}`~medkit.core.OperationDescription` object, that
   describes the operation that created the data item, in our case the regexp
   matcher. The `OperationDescription` will contain the name of
   the operation and the init parameters that were used;
 - `source_data_items` contains the objects that were used by the operation to
   create the new data item. Here there is only one source, the raw text
   segment, because the entity was found in this particular segment by the
   regexp matcher. But it is possible to have more than one data item in the
   sources;
 - reciprocally, `derived_data_items` contains the objects that were derived
   from the data item by further operations. In this simple example, there are
   none.
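
To make the shape of such a record concrete, here is a minimal sketch using a hypothetical `ToyProv` dataclass. It is not medkit's actual `Prov` class: the field names follow the attributes listed above, but the operation description is reduced to a plain string and the data items to plain text.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ToyProv:
    """Hypothetical stand-in for a provenance record (not medkit's Prov class)."""

    data_item: Any  # the object the record describes
    op_desc: Optional[str] = None  # what created the item, None if unknown
    source_data_items: List[Any] = field(default_factory=list)  # what it was built from
    derived_data_items: List[Any] = field(default_factory=list)  # what was built from it


# Mirror the example above: one entity derived from the raw text segment
segment_prov = ToyProv(data_item="Je souffre d'asthme.", derived_data_items=["asthme"])
entity_prov = ToyProv(
    data_item="asthme",
    op_desc="RegexpMatcher",
    source_data_items=["Je souffre d'asthme."],
)

print(entity_prov.op_desc, entity_prov.source_data_items)
print(segment_prov.op_desc, segment_prov.derived_data_items)
```

The two records point at each other: the entity lists the segment as a source, and the segment lists the entity as a derived item.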

If we are interested in all the provenance info gathered by our `ProvTracer`
instance rather than the provenance of a specific item, then we can call the
`get_provs()` method:

```{code-cell} ipython3
for prov in prov_tracer.get_provs():
    print_prov(prov)
```

We can see that we have another `Prov` object with partial provenance info about
the raw text segment: we know how it was used (the entity was derived from it)
but we don't know how it was created. This is expected: the raw segment is a
data item that was fed as input to our processing flow, not created by any
operation.
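
This distinction can be used to tell inputs apart from generated items. The sketch below assumes, as `print_prov()` does, that `op_desc` is `None` when no creating operation is known. The `split_inputs_and_generated()` helper is hypothetical, and the records are simple stand-ins rather than real `Prov` objects.

```python
from types import SimpleNamespace


def split_inputs_and_generated(provs):
    """Separate records without a known creating operation from the others."""
    inputs = [p for p in provs if p.op_desc is None]
    generated = [p for p in provs if p.op_desc is not None]
    return inputs, generated


# Hypothetical stand-ins for the two records printed above
provs = [
    SimpleNamespace(data_item="asthme", op_desc=SimpleNamespace(name="RegexpMatcher")),
    SimpleNamespace(data_item="Je souffre d'asthme.", op_desc=None),
]

inputs, generated = split_inputs_and_generated(provs)
print([p.data_item for p in inputs])
print([p.data_item for p in generated])
```

With a real tracer, the same helper could presumably be applied to the result of `prov_tracer.get_provs()`.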

Our provenance info has a graph structure, each `Prov` object representing a
node. For visualization, medkit provides a
{func}`~medkit.tools.save_prov_to_dot` helper function that generates
[graphviz](https://graphviz.org/)-compatible `.dot` files:

```{note}
[graphviz](https://graphviz.org/) is a graph visualization tool that defines a
simple text-based format for describing graphs, the `.dot` file format, and that
provides a `dot` command-line executable to generate images from such files. You
will need to install graphviz on your system to be able to run the following
code. On an Ubuntu system, `apt install graphviz` should do the trick.
```

```{code-cell} ipython3
---
mystnb:
  image:
    align: center
    scale: 75%
---
from pathlib import Path
import subprocess
from IPython.display import Image
from medkit.tools import save_prov_to_dot

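def display_dot(dot_file):
    # render the .dot file to a PNG image with the graphviz "dot" executable
    png_file = dot_file.with_suffix(".png")
    # check=True raises an error if graphviz fails instead of silently continuing
    subprocess.run(["dot", "-Tpng", str(dot_file), "-o", str(png_file)], check=True)
    return Image(filename=str(png_file))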

output_dir = Path("_out")
output_dir.mkdir(exist_ok=True)
dot_file = output_dir / "prov.dot"

save_prov_to_dot(prov_tracer, dot_file)
display_dot(dot_file)
```
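
For readers who have never seen a `.dot` file, the following sketch builds one by hand for a single edge. The `edges_to_dot()` helper is hypothetical and only illustrates the DOT language; the files written by `save_prov_to_dot()` will most likely be laid out and styled differently.

```python
def edges_to_dot(edges):
    """Build the text of a directed graph in graphviz's DOT language.

    Node names and labels are assumed not to contain double quotes.
    """
    lines = ["digraph {"]
    for source, target, label in edges:
        lines.append(f'    "{source}" -> "{target}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical edge: the entity was derived from the raw segment by the matcher
dot_text = edges_to_dot([("Je souffre d'asthme.", "asthme", "RegexpMatcher")])
print(dot_text)
```

Saving this text to a file and running `dot -Tpng` on it would render two nodes joined by a labelled arrow.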

## Provenance composition

Let's move on to a slightly more complex example: before using the
`RegexpMatcher`, we will split our document into sentences with a
`SentenceTokenizer`. We will also wrap our `SentenceTokenizer` and our
`RegexpMatcher` in a pipeline:

```{code-cell} ipython3
from medkit.text.segmentation import SentenceTokenizer
from medkit.core.pipeline import PipelineStep, Pipeline

text = "Je souffre d'asthme. Je n'ai pas de diabète."
doc = TextDocument(text=text)

sent_tokenizer = SentenceTokenizer(output_label="sentence")

steps = [
    PipelineStep(sent_tokenizer, input_keys=["full_text"], output_keys=["sentences"]),
    PipelineStep(regexp_matcher, input_keys=["sentences"], output_keys=["entities"]),
]
pipeline = Pipeline(steps=steps, input_keys=["full_text"], output_keys=["entities"])
```

A pipeline being itself an operation, it also has a `set_prov_tracer()` method,
and calling it will automatically enable provenance tracing for all the
operations in the pipeline.

```{note}
In this tutorial, we create a new `ProvTracer` instance for each example.
A provenance tracer accumulates provenance information, and we don't want the
information gathered in previous examples to show up in the next ones.
```

```{code-cell} ipython3
prov_tracer = ProvTracer()
pipeline.set_prov_tracer(prov_tracer)

entities = pipeline.run([doc.raw_segment])

for entity in entities:
    print(f"text={entity.text!r}, label={entity.label}")
```

As expected, the result is identical to the first example: we have matched one
entity. However the provenance is structured differently:

```{code-cell} ipython3
for prov in prov_tracer.get_provs():
    print_prov(prov)
```

We can see that now, the operation that created the entity is not the
`RegexpMatcher` anymore, but the `Pipeline`. It might seem surprising but it
does make sense: the pipeline is a processing operation itself, it received as
input the raw segment, and used it to create an entity. The sentences are
considered internal intermediary results and are not listed.

However, if we are interested in the details about what happened inside the
`Pipeline`, the information is still available through a sub-provenance tracer
that can be retrieved with `get_sub_prov_tracer()`:

```{code-cell} ipython3
pipeline_prov_tracer = prov_tracer.get_sub_prov_tracer(pipeline.uid)

for prov in pipeline_prov_tracer.get_provs():
    print_prov(prov)
```

Although the order of each `Prov` returned by `get_provs()` isn't the order of
creation of the annotations themselves, we can see the details of what happened
in the pipeline: 2 sentences were derived from the raw text by the
`SentenceTokenizer`, then one entity was derived from one of the sentences by
the `RegexpMatcher`.

In other words, the provenance information held by the main `ProvTracer` is
composed: it is a graph, but some parts of the graph have corresponding nested
sub-graphs that can be expanded if desired. The `save_prov_to_dot()` helper is
able to leverage this structure. By default, it will expand and display all
sub-provenance info recursively, but it has an optional `max_sub_prov_depth`
parameter that limits the depth of the sub-provenance to show:

```{code-cell} ipython3
---
mystnb:
  image:
    align: center
    scale: 75%
---
# show only outer-most provenance
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=0)
display_dot(dot_file)
```

```{code-cell} ipython3
---
mystnb:
  image:
    align: center
    scale: 85%
---
# expand next level of sub-provenance
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=1)
display_dot(dot_file)
```

Just as a pipeline can contain sub-pipelines recursively, the provenance
tracer can contain sub-provenance tracers recursively for the corresponding
sub-pipelines.

Composed provenance makes it possible to preserve exhaustive provenance
information about our data but to choose the appropriate level of detail when
inspecting it. The structure of the provenance will reflect the structure of the
processing flow: if it is built in a composed way, with pipelines containing
sub-pipelines dealing with specific sub-tasks, then the provenance information
will be composed the same way.
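
The idea of expanding nested provenance up to a given depth can be sketched with a toy model. Everything below is hypothetical (`ToyTracer`, `visible_ops()` and the operation names are not part of medkit); the sketch only mimics the behaviour described above, where a depth of 0 shows the outer-most level and no limit expands everything.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyTracer:
    """Hypothetical tracer: operation names plus optional nested tracers."""

    ops: List[str]
    sub_tracers: Dict[str, "ToyTracer"] = field(default_factory=dict)


def visible_ops(tracer: ToyTracer, max_depth: Optional[int] = None) -> List[str]:
    """List operations, replacing an operation by its sub-operations
    as long as the depth limit allows it (None means no limit)."""
    result = []
    for op in tracer.ops:
        sub = tracer.sub_tracers.get(op)
        if sub is None or max_depth == 0:
            # plain operation, or depth limit reached: keep it collapsed
            result.append(op)
        else:
            next_depth = None if max_depth is None else max_depth - 1
            result.extend(visible_ops(sub, next_depth))
    return result


inner = ToyTracer(ops=["SentenceTokenizer"])
middle = ToyTracer(ops=["SubPipeline", "RegexpMatcher"], sub_tracers={"SubPipeline": inner})
outer = ToyTracer(ops=["Pipeline"], sub_tracers={"Pipeline": middle})

print(visible_ops(outer, max_depth=0))
print(visible_ops(outer, max_depth=1))
print(visible_ops(outer))
```

Each extra level of depth replaces a pipeline node with the operations it contains, which is the same trade-off between overview and detail that composed provenance offers.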

## A more complete provenance example

To further demonstrate the potential of provenance tracing in medkit, let's
build a more complicated pipeline involving a sub-pipeline and an operation that
creates attributes:

```{code-cell} ipython3
from medkit.text.context import NegationDetector, NegationDetectorRule

# segmentation
sent_tokenizer = SentenceTokenizer(output_label="sentence")
# negation detection
neg_detector = NegationDetector(output_label="is_negated")
# entity recognition
regexp_rules = [
    RegexpMatcherRule(regexp=r"\basthme\b", label="problem"),
    RegexpMatcherRule(regexp=r"\bdiabète\b", label="problem"),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules, attrs_to_copy=["is_negated"])

# context sub pipeline handling segmentation and negation detection
sub_pipeline_steps = [
    PipelineStep(sent_tokenizer, input_keys=["full_text"], output_keys=["sentences"]),
    PipelineStep(neg_detector, input_keys=["sentences"], output_keys=[]),  # no output
]
sub_pipeline = Pipeline(
    sub_pipeline_steps,
    name="ContextPipeline",
    input_keys=["full_text"],
    output_keys=["sentences"],
)

# main pipeline
pipeline_steps = [
    PipelineStep(sub_pipeline, input_keys=["full_text"], output_keys=["sentences"]),
    PipelineStep(regexp_matcher, input_keys=["sentences"], output_keys=["entities"]),
]
pipeline = Pipeline(
    pipeline_steps,
    name="MainPipeline",
    input_keys=["full_text"],
    output_keys=["entities"],
)
```

Note that since we have 2 pipelines, we pass an optional `name` parameter to
each of them that will be used in the operation description and will help us to
distinguish them.

Running the pipeline gives us 2 entities with negation attributes:

```{code-cell} ipython3
prov_tracer = ProvTracer()
pipeline.set_prov_tracer(prov_tracer)
entities = pipeline.run([doc.raw_segment])

for entity in entities:
    is_negated = entity.attrs.get(label="is_negated")[0].value
    print(f"text={entity.text!r}, label={entity.label}, is_negated={is_negated}")
```

At the outer-most level, the provenance tells us that the main pipeline created
2 entities and 2 attributes. Intermediary data items (sentences) and operations
(`SentenceTokenizer`, `NegationDetector`, `RegexpMatcher`) are hidden.

```{code-cell} ipython3
---
mystnb:
  image:
    align: center
    scale: 85%
---
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=0)
display_dot(dot_file)
```

You can see dotted arrows showing which attribute relates to which annotation.
While this is not strictly speaking provenance information, it is displayed
nonetheless to avoid any confusion, especially in the case where attributes
created by one operation are afterwards copied to new annotations (cf.
`attrs_to_copy` as explained in the [First steps tutorial](first_steps.md#detecting-negation)).

Expanding one more level of sub-provenance gives us the following graph:

```{code-cell} ipython3
---
mystnb:
  image:
    align: center
---
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=1)
display_dot(dot_file)
```

We now see the details of the operations and data items handled in our main
pipeline: a sub-pipeline created sentence segments and negation
attributes, then the `RegexpMatcher` created entities, using the sentence
segments. The negation attributes were attached to both the sentences and the
entities derived from the sentences.


To have more details about the processing inside the context sub-pipeline, we
have to go one step deeper:

```{code-cell} ipython3
---
mystnb:
  image:
    align: center
---
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=2)
display_dot(dot_file)
```
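
Provenance can also be explored in code rather than through images. The sketch below walks from an item back to its sources, relying only on `get_prov()`, `op_desc` and `source_data_items` as used earlier in this tutorial. The `trace_back()` helper and the `StubTracer` class are hypothetical, and the stub objects stand in for real medkit data items so that the example runs on its own.

```python
from types import SimpleNamespace


def trace_back(tracer, uid, depth=0):
    """Return (depth, item, operation name) tuples, from an item back to its sources."""
    prov = tracer.get_prov(uid)
    op_name = prov.op_desc.name if prov.op_desc is not None else None
    steps = [(depth, prov.data_item, op_name)]
    for source in prov.source_data_items:
        steps.extend(trace_back(tracer, source.uid, depth + 1))
    return steps


class StubTracer:
    """Hypothetical stand-in exposing only get_prov(), for demonstration."""

    def __init__(self, provs):
        self._provs = {p.data_item.uid: p for p in provs}

    def get_prov(self, uid):
        return self._provs[uid]


segment = SimpleNamespace(uid="seg", text="Je souffre d'asthme.")
sentence = SimpleNamespace(uid="sent", text="Je souffre d'asthme")
entity = SimpleNamespace(uid="ent", text="asthme")
tokenizer_desc = SimpleNamespace(name="SentenceTokenizer")
matcher_desc = SimpleNamespace(name="RegexpMatcher")
tracer = StubTracer([
    SimpleNamespace(data_item=segment, op_desc=None, source_data_items=[]),
    SimpleNamespace(data_item=sentence, op_desc=tokenizer_desc, source_data_items=[segment]),
    SimpleNamespace(data_item=entity, op_desc=matcher_desc, source_data_items=[sentence]),
])

for depth, item, op_name in trace_back(tracer, "ent"):
    print("  " * depth + f"{item.text!r} <- {op_name}")
```

With a real `ProvTracer`, the level of detail obtained this way would depend on which tracer is queried: the outer-most one or one of its sub-provenance tracers.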

## Wrapping it up

In this tutorial, we have seen how we can use `ProvTracer` to keep information
about how annotations and attributes were generated, i.e. which operation created
them, using which data as input.

We have also seen how, when using pipelines and sub-pipelines, the provenance
information in a `ProvTracer` will be composed, the same way that our processing
graph is. This allows us to later display the level of detail that we want to
see when inspecting provenance.

Finally, we have seen how the `save_prov_to_dot()` helper function can be used
to quickly visualize the captured provenance information. For more advanced
provenance usage, you may want to look at the [provenance API
docs](api:core:provenance). The source code of `save_prov_to_dot()` can also
serve as a reference on how to use the provenance API.