# Provenance tracing

:::{warning}
Provenance tracing is still under development and may be changed in the future.
:::

One of the distinctive features of `medkit` is the tracing of provenance information.
When enabled, `medkit` can record how each annotation was created,
i.e. the operation and associated input data used to generate it.

This is true for the whole processing pipeline, including intermediate steps and annotations.
Provenance information is kept for the duration of the processing
and can later be retrieved in [PROV-O](https://www.w3.org/TR/prov-o/) format.
This is particularly useful to build a chain of trust through the creation of an annotation.

This tutorial will teach you how to gather provenance information with `medkit`.
The reader is assumed to be familiar with basic `medkit` components
introduced in the [first steps](first_steps.md) and [pipeline](pipeline.md) sections.

## A minimalistic provenance graph

Let's start with the simplest use case possible
and take a look at provenance for a single annotation, generated by a single operation.

We are going to create a very simple `TextDocument` containing just one sentence,
and run a `RegexpMatcher` to match a single `Entity`:

```{code} python
from medkit.core.text import TextDocument
from medkit.text.ner import RegexpMatcher, RegexpMatcherRule

text = "Je souffre d'asthme."
doc = TextDocument(text=text)

regexp_rule = RegexpMatcherRule(regexp=r"\basthme\b", label="problem")
regexp_matcher = RegexpMatcher(rules=[regexp_rule])
```

Before calling the `run()` method of our regexp matcher,
we will activate provenance tracing for the generated entities.
This is done by assigning it a {class}`~medkit.core.ProvTracer` object.
The `ProvTracer` is in charge of gathering provenance information across all operations.

```{code} python
from medkit.core import ProvTracer

prov_tracer = ProvTracer()
regexp_matcher.set_prov_tracer(prov_tracer)
```

Now that provenance is enabled, the regexp matcher can be applied to the input document:

```{code} python
entities = regexp_matcher.run([doc.raw_segment])

for entity in entities:
    print(f"text={entity.text!r}, label={entity.label}")
```

Let's retrieve and inspect provenance information concerning the matched entity:

```{code} python
def print_prov(prov):
    # data item
    print(f"data_item={prov.data_item.text!r}")
    # operation description (if available)
    op_desc = prov.op_desc
    print(f"op={op_desc.name if op_desc is not None else None}")
    # source data items
    print(f"source_items={[d.text for d in prov.source_data_items]}")
    # derived data items
    print(f"derived_items={[d.text for d in prov.derived_data_items]}", end="\n\n")

entity = entities[0]
prov = prov_tracer.get_prov(entity.uid)
print_prov(prov)
```

The `get_prov()` method of `ProvTracer` returns a simple {class}`~medkit.core.Prov` object
containing all the provenance information related to a specific object.
It features the following attributes:
- `data_item` contains the object to which the provenance info refers. Here, it
  is our entity. Note that it doesn't have to be an `Annotation` subclass.
  For instance, it could also be an `Attribute`;
- `op_desc` holds an {class}`~medkit.core.OperationDescription` object
  that describes the operation that created the data item (here, the regexp matcher).
  The `OperationDescription` will contain the name of the operation and the init parameters that were used;
- `source_data_items` contains the objects that were used by the operation to create the new data item.
  Here there is only one source, the raw text segment,
  because the entity was found in this particular segment by the regexp matcher.
  But it is possible to have more than one data item in the sources;
- `derived_data_items` contains the objects that were derived from the data item by further operations.
  In this simple example, there are none.

If we are interested in all the provenance information gathered by the `ProvTracer` instance,
rather than the provenance of a specific item,
then we can call the `get_provs()` method:

```{code} python
for prov in prov_tracer.get_provs():
    print_prov(prov)
```

Here, we have another `Prov` object with partial provenance information about the raw text segment:
we know how it was used (the entity was derived from it) but we don't know how it was created.
This is expected: the raw segment is a data item that was provided as input to our processing flow,
so it was not created by any operation upstream.
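
To make the shape of these records more concrete, here is a minimal, self-contained sketch that does not use `medkit` at all.
`ToyProv` and `trace_to_inputs` are hypothetical names invented for this illustration:
they only mimic the four attributes described above, with plain strings standing in for data items and operation descriptions.
The helper follows the source links backwards until it reaches items that no operation created, i.e. the inputs of the processing flow.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyProv:
    """Hypothetical stand-in for a provenance record (not medkit's Prov class)."""

    data_item: str
    op_desc: str | None = None  # None when no traced operation created the item
    source_data_items: list[str] = field(default_factory=list)
    derived_data_items: list[str] = field(default_factory=list)


def trace_to_inputs(provs: dict[str, ToyProv], item: str) -> list[str]:
    """Follow source links backwards and return the items that no operation created."""
    prov = provs[item]
    if prov.op_desc is None:
        return [item]
    inputs: list[str] = []
    for source in prov.source_data_items:
        for found in trace_to_inputs(provs, source):
            if found not in inputs:
                inputs.append(found)
    return inputs


# Same situation as in the example above: one entity derived from the raw segment
raw_text = "Je souffre d'asthme."
toy_provs = {
    "asthme": ToyProv("asthme", op_desc="RegexpMatcher", source_data_items=[raw_text]),
    raw_text: ToyProv(raw_text, derived_data_items=["asthme"]),
}

print(trace_to_inputs(toy_provs, "asthme"))
```

The raw segment is returned because its record has no operation description, which is how this sketch represents "provided as input".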

Provenance information can be represented as a graph structure,
with each `Prov` object representing a node.
For visualization purposes, `medkit` provides a {func}`~medkit.tools.save_prov_to_dot` helper function
that generates [graphviz](https://graphviz.org/)-compatible `.dot` files:

:::{note}
[graphviz](https://graphviz.org/) is a graph visualization tool that defines a simple text-based format for describing graphs,
the `.dot` file format.
It also provides a command-line executable named `dot` to generate images from such files.
You will need to install `graphviz` on your system to be able to run the following code.
:::

```{code} python
from pathlib import Path
from IPython.display import Image
from medkit.tools import save_prov_to_dot

def display_dot(dot_file: Path) -> Image:
    import subprocess
    import warnings

    png_file = dot_file.with_suffix(".png")
    try:
        subprocess.run(["dot", "-Tpng", dot_file, "-o", png_file])
    except FileNotFoundError:
        msg = (
            "The dot executable was not found, "
            "please make sure graphviz is installed."
        )
        warnings.warn(msg)
    return Image(png_file)

output_dir = Path("_out")
output_dir.mkdir(exist_ok=True)
dot_file = output_dir / "prov.dot"

save_prov_to_dot(prov_tracer, dot_file)
display_dot(dot_file)
```
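
If you are curious about what a `.dot` file contains, the following sketch builds one by hand for the graph above.
It only illustrates the file format under simplifying assumptions and is not the output of `save_prov_to_dot()`:
the `to_dot` helper and the node labels are hypothetical, and the real helper may name and lay out nodes differently.

```python
from __future__ import annotations


def to_dot(edges: list[tuple[str, str, str]]) -> str:
    """Build a DOT digraph from (source, operation, derived) triples."""
    lines = ["digraph prov {"]
    for source, op_name, derived in edges:
        # Escape double quotes so that labels remain valid DOT strings
        src, op, dst = (s.replace('"', '\\"') for s in (source, op_name, derived))
        lines.append(f'    "{src}" -> "{dst}" [label="{op}"];')
    lines.append("}")
    return "\n".join(lines)


# One edge: the entity was derived from the raw segment by the regexp matcher
dot_source = to_dot([("Je souffre d'asthme.", "RegexpMatcher", "asthme")])
print(dot_source)
```

The resulting text can be saved to a file and passed to the `dot` executable like any other `.dot` file.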

## Provenance composition

Let's move on to a slightly more complex example.
Before using the `RegexpMatcher`, we will split our document into sentences with a `SentenceTokenizer`.
We will also compose the `SentenceTokenizer` and our `RegexpMatcher` operations in a `Pipeline`.

```{code} python
from medkit.text.segmentation import SentenceTokenizer
from medkit.core.pipeline import PipelineStep, Pipeline

text = "Je souffre d'asthme. Je n'ai pas de diabète."
doc = TextDocument(text=text)

sent_tokenizer = SentenceTokenizer(output_label="sentence")

steps = [
    PipelineStep(sent_tokenizer, input_keys=["full_text"], output_keys=["sentences"]),
    PipelineStep(regexp_matcher, input_keys=["sentences"], output_keys=["entities"]),
]
pipeline = Pipeline(steps=steps, input_keys=["full_text"], output_keys=["entities"])
```

A pipeline being itself an operation, it also features a `set_prov_tracer()` method,
and calling it will automatically enable provenance tracing for all the operations in the pipeline.

:::{important}
Provenance tracers can only accumulate provenance information, not modify or delete it.
:::

```{code} python
prov_tracer = ProvTracer()
pipeline.set_prov_tracer(prov_tracer)

entities = pipeline.run([doc.raw_segment])

for entity in entities:
    print(f"text={entity.text!r}, label={entity.label}")
```

As expected, the result is identical to the first example: we have matched one entity.
However, its provenance is structured differently:

```{code} python
for prov in prov_tracer.get_provs():
    print_prov(prov)
```

Compared to the simpler case, the operation that created the entity is the `Pipeline`, instead of the `RegexpMatcher`.
It might sound a little surprising, but it does make sense: the pipeline is a processing operation itself,
it received the raw segment as input, and used it to create an entity.
The sentences are considered internal intermediary results and are not listed.

If we are interested in the details about what happened inside the `Pipeline`,
the information is still available through a sub-provenance tracer
that can be retrieved with `get_sub_prov_tracer()`:

```{code} python
pipeline_prov_tracer = prov_tracer.get_sub_prov_tracer(pipeline.uid)

for prov in pipeline_prov_tracer.get_provs():
    print_prov(prov)
```

Although the order of each `Prov` returned by `get_provs()` is not the order of creation of the annotations themselves,
we can see the details of what happened in the pipeline.
Two sentences were derived from the raw text by the `SentenceTokenizer`,
then one entity was derived from one of the sentences by the `RegexpMatcher`.

In other words, the provenance information held by the main `ProvTracer` is composed.
It is a graph, but some parts of the graph have corresponding nested sub-graphs that can be expanded if desired.
The `save_prov_to_dot()` helper is able to leverage this structure.
By default, it will expand and display all sub-provenance info recursively,
but it has an optional `max_sub_prov_depth` parameter that limits the depth of the sub-provenance to show:

```{code} python
# show only outer-most provenance
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=0)
display_dot(dot_file)
```

```{code} python
# expand next level of sub-provenance
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=1)
display_dot(dot_file)
```

Just as a pipeline can contain sub-pipelines recursively,
the provenance tracer can contain sub-provenance tracers recursively for the corresponding sub-pipelines.
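
The effect of limiting the depth can be illustrated with a small sketch that is independent of `medkit`.
The `ToyOp` class and the `visible_ops` function are hypothetical:
they model an operation that may hide nested operations, and list what remains visible for a given maximum depth, in the spirit of `max_sub_prov_depth`.
The actual traversal performed by `save_prov_to_dot()` may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyOp:
    """Hypothetical operation node; sub_ops is non-empty for pipelines."""

    name: str
    sub_ops: list[ToyOp] = field(default_factory=list)


def visible_ops(ops: list[ToyOp], max_depth: int | None = None, depth: int = 0) -> list[str]:
    """Return operation names, expanding nested operations down to max_depth.

    max_depth=None expands everything; max_depth=0 keeps only the outermost level.
    """
    names: list[str] = []
    for op in ops:
        can_expand = bool(op.sub_ops) and (max_depth is None or depth < max_depth)
        if can_expand:
            names.extend(visible_ops(op.sub_ops, max_depth, depth + 1))
        else:
            names.append(op.name)
    return names


# A pipeline containing a sub-pipeline, itself containing two operations
context = ToyOp("ContextPipeline", [ToyOp("SentenceTokenizer"), ToyOp("NegationDetector")])
toy_pipeline = ToyOp("MainPipeline", [context, ToyOp("RegexpMatcher")])

print(visible_ops([toy_pipeline], max_depth=0))  # ['MainPipeline']
print(visible_ops([toy_pipeline], max_depth=1))  # ['ContextPipeline', 'RegexpMatcher']
print(visible_ops([toy_pipeline]))  # ['SentenceTokenizer', 'NegationDetector', 'RegexpMatcher']
```

Each increment of the depth replaces a pipeline node by the operations it contains, while operations without nested content are shown as they are.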

Composed provenance makes it possible to preserve exhaustive provenance information about our data,
and choose the appropriate level of detail when inspecting it.
The provenance structure will reflect the structure of the processing flow.
If built in a composed way, with pipelines containing sub-pipelines dealing with specific sub-tasks,
then provenance information will be composed the same way.

## A more complete provenance example

To further demonstrate the potential of provenance tracing in `medkit`,
let's build a more complicated pipeline involving a sub-pipeline
and an operation that creates attributes:

```{code} python
from medkit.text.context import NegationDetector, NegationDetectorRule

# segmentation
sentence_tokenizer = SentenceTokenizer(output_label="sentence")
# negation detection
negation_detector = NegationDetector(output_label="is_negated")
# entity recognition
regexp_rules = [
    RegexpMatcherRule(regexp=r"\basthme\b", label="problem"),
    RegexpMatcherRule(regexp=r"\bdiabète\b", label="problem"),
]
regexp_matcher = RegexpMatcher(rules=regexp_rules, attrs_to_copy=["is_negated"])

# context sub pipeline handling segmentation and negation detection
sub_pipeline_steps = [
    PipelineStep(sentence_tokenizer, input_keys=["full_text"], output_keys=["sentences"]),
    PipelineStep(negation_detector, input_keys=["sentences"], output_keys=[]),  # no output
]
sub_pipeline = Pipeline(
    sub_pipeline_steps,
    name="ContextPipeline",
    input_keys=["full_text"],
    output_keys=["sentences"],
)

# main pipeline
pipeline_steps = [
    PipelineStep(sub_pipeline, input_keys=["full_text"], output_keys=["sentences"]),
    PipelineStep(regexp_matcher, input_keys=["sentences"], output_keys=["entities"]),
]
pipeline = Pipeline(
    pipeline_steps,
    name="MainPipeline",
    input_keys=["full_text"],
    output_keys=["entities"],
)
```

Since there are 2 pipelines, we pass the optional `name` parameter to each of them.
It will be used in the operation description and will help us distinguish between them.

Running the main pipeline returns 2 entities with negation attributes:

```{code} python
prov_tracer = ProvTracer()
pipeline.set_prov_tracer(prov_tracer)
entities = pipeline.run([doc.raw_segment])

for entity in entities:
    is_negated = entity.attrs.get(label="is_negated")[0].value
    print(f"text={entity.text!r}, label={entity.label}, is_negated={is_negated}")
```

At the outermost level, provenance tells us that the main pipeline created 2 entities and 2 attributes.
Intermediary data and operations (`SentenceTokenizer`, `NegationDetector`, `RegexpMatcher`) are hidden.

```{code} python
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=0)
display_dot(dot_file)
```

You can see dotted arrows showing which attribute relates to which annotation.
While this is not strictly speaking provenance information,
it is displayed nonetheless to avoid any confusion,
especially in the case where attributes created by one operation
are afterwards copied to new annotations (cf. `attrs_to_copy` as explained in the
[First steps tutorial](./first_steps.md#detecting-negation)).

Expanding one more level of provenance gives us the following graph:

```{code} python
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=1)
display_dot(dot_file)
```

Now we can see the details of the operations and data items handled in our main pipeline.
A sub-pipeline created sentence segments and negation attributes,
then the `RegexpMatcher` created entities, using the sentence segments.
The negation attributes were attached to both the sentences and the entities derived from the sentences.

To have more details about the processing inside the context sub-pipeline, we have to go one level deeper:

```{code} python
save_prov_to_dot(prov_tracer, dot_file, max_sub_prov_depth=2)
display_dot(dot_file)
```

## Wrapping it up

In this tutorial, we have seen how we can use `ProvTracer`
to keep information about how annotations and attributes were generated,
i.e. which operation created them using which data as input.

Furthermore, we have seen how, when using pipelines and sub-pipelines,
provenance information generated by a `ProvTracer` will be composed,
the same way that our processing graph is.
This allows us to later choose the level of detail that we want to see when inspecting provenance.

Finally, we have seen how the `save_prov_to_dot()` helper function can be used
to quickly visualize the captured provenance information.
For more advanced provenance usage, you may want to look at the [provenance API docs](api:core:provenance).
The source code of `save_prov_to_dot()` can also serve as a reference on how to use the provenance API.