medkit.tools.e3c_corpus
=======================

.. py:module:: medkit.tools.e3c_corpus

.. autoapi-nested-parse::

   Tools for accessing data from the E3C corpus.

   Notes
   -----
   The `E3C corpus <https://github.com/hltfbk/E3C-Corpus>`_ [Rfd9a2fbc26fb-1]_ [Rfd9a2fbc26fb-2]_ is released under a
   Creative Commons NonCommercial license (CC-BY-NC).


   References
   ----------
   .. [Rfd9a2fbc26fb-1] Magnini, B., Altuna, B., Lavelli, A., Speranza, M., & Zanoli, R. (2020).
       The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases.
       Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020.
   .. [Rfd9a2fbc26fb-2] Zanoli, R., Lavelli, A., Verdi do Amarante, D., & Toti, D. (2023).
       Assessment of the E3C corpus for the recognition of disorders in clinical texts.
       Natural Language Engineering, 1-19. doi:10.1017/S1351324923000335

   ..
       !! processed by numpydoc !!


Attributes
----------

.. autoapisummary::

   medkit.tools.e3c_corpus.SENTENCE_LABEL
   medkit.tools.e3c_corpus.CLINENTITY_LABEL


Functions
---------

.. autoapisummary::

   medkit.tools.e3c_corpus.load_document
   medkit.tools.e3c_corpus.load_data_collection
   medkit.tools.e3c_corpus.convert_data_collection_to_medkit
   medkit.tools.e3c_corpus.load_annotated_document
   medkit.tools.e3c_corpus.load_data_annotation
   medkit.tools.e3c_corpus.convert_data_annotation_to_medkit


Module Contents
---------------

.. py:data:: SENTENCE_LABEL
   :value: 'sentence'


   Label used by medkit for annotated sentences of the E3C corpus


   ..
       !! processed by numpydoc !!

.. py:data:: CLINENTITY_LABEL
   :value: 'disorder'


   Label used by medkit for annotated clinical entities of the E3C corpus


   ..
       !! processed by numpydoc !!

.. py:function:: load_document(filepath: str | pathlib.Path, encoding: str = 'utf-8') -> medkit.core.text.TextDocument

   
   Load an E3C corpus document (JSON file) as a medkit text document.

   For example, a file from the data collection folder.
   The document id is always kept in the medkit document metadata.

   :Parameters:

       **filepath** : str or Path
           The path to the json file of the E3C corpus

       **encoding** : str, default="utf-8"
           The encoding of the file. Default: 'utf-8'

   :Returns:

       TextDocument
           The corresponding medkit text document

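   A minimal usage sketch, assuming medkit is installed and a local copy of the
   corpus is available. The helper name ``describe_document`` is illustrative and
   not part of medkit; the import sits inside the function only so that the
   snippet can be defined where medkit is absent.

   .. code-block:: python

      from pathlib import Path


      def describe_document(filepath, encoding="utf-8"):
          """Load one E3C JSON file and return a small summary of it."""
          # Lazy import: requires medkit to be installed when the helper is called.
          from medkit.tools.e3c_corpus import load_document

          doc = load_document(Path(filepath), encoding=encoding)
          return {
              "uid": doc.uid,
              "n_chars": len(doc.text),
              # The E3C document id is expected among these metadata keys.
              "metadata_keys": sorted(doc.metadata),
          }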

   ..
       !! processed by numpydoc !!

.. py:function:: load_data_collection(dir_path: pathlib.Path | str, encoding: str = 'utf-8') -> Iterator[medkit.core.text.TextDocument]

   
   Load the E3C corpus data collection as medkit text documents.


   :Parameters:

       **dir_path** : str or Path
           The path to the E3C corpus data collection directory containing the json files
           (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)

       **encoding** : str, default="utf-8"
           The encoding of the files. Default: 'utf-8'

   :Returns:

       iterator of TextDocument
           An iterator over the corresponding medkit text documents

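   Because the function returns an iterator, documents can be consumed lazily.
   The sketch below, which assumes medkit is installed, previews only the first
   few documents; ``preview_collection`` and its ``limit`` parameter are
   hypothetical names introduced for illustration.

   .. code-block:: python

      from itertools import islice


      def preview_collection(dir_path, limit=3, encoding="utf-8"):
          """Return (uid, first 80 characters) for the first `limit` documents."""
          # Lazy import: requires medkit to be installed when the helper is called.
          from medkit.tools.e3c_corpus import load_data_collection

          docs = load_data_collection(dir_path, encoding=encoding)
          # islice stops pulling from the iterator after `limit` documents.
          return [(doc.uid, doc.text[:80]) for doc in islice(docs, limit)]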

   ..
       !! processed by numpydoc !!

.. py:function:: convert_data_collection_to_medkit(dir_path: pathlib.Path | str, output_file: str | pathlib.Path, encoding: str | None = 'utf-8')

   
   Convert the E3C corpus data collection to a medkit jsonl file.


   :Parameters:

       **dir_path** : str or Path
           The path to the E3C corpus data collection directory containing the json files
           (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)

       **output_file** : str or Path
           The medkit jsonl output file which will contain medkit text documents

       **encoding** : str, default="utf-8"
           The encoding of the files. Default: 'utf-8'

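   A possible round trip, sketched under the assumption that medkit is installed
   and that the resulting jsonl file can be read back with
   ``medkit.io.medkit_json.load_text_documents``. The helper name
   ``convert_and_count`` is illustrative only.

   .. code-block:: python

      def convert_and_count(dir_path, output_file, encoding="utf-8"):
          """Convert a data collection folder, then count the documents written."""
          # Lazy imports: require medkit to be installed when the helper is called.
          from medkit.io import medkit_json
          from medkit.tools.e3c_corpus import convert_data_collection_to_medkit

          convert_data_collection_to_medkit(
              dir_path=dir_path, output_file=output_file, encoding=encoding
          )
          # Assumption: the jsonl output is readable by medkit's own JSON loader.
          return sum(1 for _ in medkit_json.load_text_documents(output_file))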

   ..
       !! processed by numpydoc !!

.. py:function:: load_annotated_document(filepath: str | pathlib.Path, encoding: str = 'utf-8', keep_sentences=False) -> medkit.core.text.TextDocument

   
   Load an E3C corpus annotated document (XML file) as a medkit text document.

   For example, a file from the data annotation folder.
   Each annotation id is always kept in the metadata of the corresponding medkit element.

   For the time being, only 'CLINENTITY' annotations are supported.
   'SENTENCE' annotations can also be loaded by setting ``keep_sentences`` to True.

   :Parameters:

       **filepath** : str or Path
           The path to the xml file of the E3C corpus

       **encoding** : str, default="utf-8"
           The encoding of the file. Default: 'utf-8'

       **keep_sentences** : bool, default=False
           Whether to load sentences into medkit documents.

   :Returns:

       TextDocument
           The corresponding medkit text document

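   The clinical entities of a loaded document can then be retrieved by label.
   This is a sketch that assumes medkit is installed and that the entities are
   reachable through ``doc.anns.get(label=...)``; ``list_disorders`` is a
   hypothetical helper name.

   .. code-block:: python

      def list_disorders(filepath, encoding="utf-8"):
          """Return the text of each clinical entity found in one annotated file."""
          # Lazy import: requires medkit to be installed when the helper is called.
          from medkit.tools.e3c_corpus import CLINENTITY_LABEL, load_annotated_document

          doc = load_annotated_document(filepath, encoding=encoding, keep_sentences=False)
          # CLINENTITY annotations carry the label stored in CLINENTITY_LABEL.
          return [ent.text for ent in doc.anns.get(label=CLINENTITY_LABEL)]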

   ..
       !! processed by numpydoc !!

.. py:function:: load_data_annotation(dir_path: pathlib.Path | str, encoding: str = 'utf-8', keep_sentences: bool = False) -> Iterator[medkit.core.text.TextDocument]

   
   Load the E3C corpus data annotation as medkit text documents.


   :Parameters:

       **dir_path** : str or Path
           The path to the E3C corpus data annotation directory containing the xml files
           (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)

       **encoding** : str, default="utf-8"
           The encoding of the files. Default: 'utf-8'

       **keep_sentences** : bool, default=False
           Whether to load sentences into medkit documents.

   :Returns:

       iterator of TextDocument
           An iterator over the corresponding medkit text documents

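   As an illustration of ``keep_sentences``, the sketch below tallies annotation
   labels over a whole annotation folder. It assumes medkit is installed and that
   ``doc.anns.get()`` without arguments returns every annotation of a document;
   ``count_labels`` is a hypothetical helper name.

   .. code-block:: python

      from collections import Counter


      def count_labels(dir_path, keep_sentences=True):
          """Count annotations per label across all documents in a folder."""
          # Lazy import: requires medkit to be installed when the helper is called.
          from medkit.tools.e3c_corpus import load_data_annotation

          counts = Counter()
          for doc in load_data_annotation(dir_path, keep_sentences=keep_sentences):
              counts.update(ann.label for ann in doc.anns.get())
          return counts

   With ``keep_sentences=False`` the resulting counter would be expected to
   contain only the clinical entity label.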

   ..
       !! processed by numpydoc !!

.. py:function:: convert_data_annotation_to_medkit(dir_path: pathlib.Path | str, output_file: str | pathlib.Path, encoding: str | None = 'utf-8', keep_sentences: bool = False)

   
   Convert the E3C corpus data annotation to a medkit jsonl file.


   :Parameters:

       **dir_path** : str or Path
           The path to the E3C corpus data annotation directory containing the xml files
           (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)

       **output_file** : str or Path
           The medkit jsonl output file which will contain medkit text documents

       **encoding** : str, default="utf-8"
           The encoding of the files. Default: 'utf-8'

       **keep_sentences** : bool, default=False
           Whether to load sentences into medkit documents.


   ..
       !! processed by numpydoc !!