medkit.io.doccano
=================

.. py:module:: medkit.io.doccano


Classes
-------

.. autoapisummary::

   medkit.io.doccano.DoccanoTask
   medkit.io.doccano.DoccanoClientConfig
   medkit.io.doccano.DoccanoInputConverter
   medkit.io.doccano.DoccanoOutputConverter


Module Contents
---------------

.. py:class:: DoccanoTask(*args, **kwds)

   Bases: :py:obj:`enum.Enum`


   Supported doccano tasks.


   :Attributes:

       **TEXT_CLASSIFICATION**
           Documents with a category

       **RELATION_EXTRACTION**
           Documents with entities and relations (including IDs)

       **SEQUENCE_LABELING**
           Documents with entities in tuples


   ..
       !! processed by numpydoc !!

   .. py:attribute:: TEXT_CLASSIFICATION
      :value: 'text_classification'


   .. py:attribute:: RELATION_EXTRACTION
      :value: 'relation_extraction'


   .. py:attribute:: SEQUENCE_LABELING
      :value: 'sequence_labeling'
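
   To make the three task formats concrete, the sketch below builds one example
   line per task value using only the standard library. The field names
   (``text``, ``label``, ``entities``, ``relations``, ``start_offset``,
   ``end_offset``, ``from_id``, ``to_id``, ``type``) are assumptions based on
   doccano's usual JSONL export, not something this module documents; the
   sentences and labels are invented.

   .. code-block:: python

      import json

      # Hypothetical example lines, keyed by task value (assumed export shapes).
      EXAMPLE_LINES = {
          "text_classification": {"text": "Patient has a fever.", "label": ["symptom"]},
          "sequence_labeling": {"text": "Patient has a fever.", "label": [[14, 19, "problem"]]},
          "relation_extraction": {
              "text": "Aspirin relieves fever.",
              "entities": [
                  {"id": 0, "start_offset": 0, "end_offset": 7, "label": "drug"},
                  {"id": 1, "start_offset": 17, "end_offset": 22, "label": "problem"},
              ],
              "relations": [{"id": 0, "from_id": 0, "to_id": 1, "type": "treats"}],
          },
      }


      def to_jsonl(lines):
          """Serialize dictionaries as JSONL: one JSON object per line."""
          return "\n".join(json.dumps(line, ensure_ascii=False) for line in lines)


      # Entities of a sequence labeling line are (start, end, label) triples.
      seq_line = EXAMPLE_LINES["sequence_labeling"]
      start, end, label = seq_line["label"][0]
      covered_text = seq_line["text"][start:end]
      jsonl = to_jsonl(EXAMPLE_LINES.values())

   The relation extraction line is the only one where entities carry IDs, which
   is what lets relations refer to them.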


.. py:class:: DoccanoClientConfig

   
   Doccano client configuration.

   The default values match those used by doccano.


   :Attributes:

       **column_text** : str, default="text"
           Name or key representing the text

       **column_label** : str, default="label"
           Name or key representing the label


   ..
       !! processed by numpydoc !!

   .. py:attribute:: column_text
      :type:  str
      :value: 'text'


   .. py:attribute:: column_label
      :type:  str
      :value: 'label'
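
   The two attributes can be read as the keys under which text and label are
   stored in each line. The sketch below illustrates that idea with a
   hypothetical stand-in dataclass (``ClientConfigSketch``) and a hypothetical
   helper (``read_text_and_label``); neither is part of medkit, and the renamed
   columns are invented.

   .. code-block:: python

      from dataclasses import dataclass


      @dataclass
      class ClientConfigSketch:
          """Hypothetical stand-in mirroring the two documented attributes."""

          column_text: str = "text"
          column_label: str = "label"


      def read_text_and_label(doc_line, config):
          """Look up the text and label of a parsed line under the configured keys."""
          return doc_line[config.column_text], doc_line[config.column_label]


      # A project whose import settings renamed both columns (invented names).
      custom = ClientConfigSketch(column_text="content", column_label="tags")
      line = {"content": "Patient has a fever.", "tags": ["symptom"]}
      text, label = read_text_and_label(line, custom)

   With the default configuration, the same helper would read the ``text`` and
   ``label`` keys instead.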


.. py:class:: DoccanoInputConverter(task: DoccanoTask, client_config: DoccanoClientConfig | None = None, attr_label: str = 'doccano_category', uid: str | None = None)

   
   Convert doccano files (.JSONL) containing annotations for a given task.

   For each line, a :class:`~.core.text.TextDocument` will be created.
   The doccano files can be loaded from a directory of zip files, from a single zip file, or from a JSONL file.

   The converter supports custom configuration to define the parameters used by doccano
   when importing the data (cf. :class:`~.io.doccano.DoccanoClientConfig`).

   .. warning::
       If the option *Count grapheme clusters as one character* was selected
       when creating the doccano project, the converted documents are
       likely to have alignment problems; the converter does not support this option.
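
   The sketch below illustrates why such offsets can misalign, assuming the
   option makes doccano count offsets in grapheme clusters while Python strings
   are indexed by code point. The helper ``approx_grapheme_to_codepoint`` is
   hypothetical and only approximates clusters as a base character plus its
   combining marks; the converter does not perform this realignment.

   .. code-block:: python

      import unicodedata


      def approx_grapheme_to_codepoint(text, grapheme_index):
          """Map an offset counted in approximate clusters to a code point offset."""
          seen = 0
          for i, char in enumerate(text):
              # Combining marks extend the previous cluster; skip them.
              if unicodedata.combining(char) == 0:
                  if seen == grapheme_index:
                      return i
                  seen += 1
          return len(text)


      text = "cafe\u0301 noir"  # "é" written as "e" + combining acute accent
      start, end = 5, 9  # span of "noir" when each cluster counts as one character

      # Using cluster offsets directly on a Python string shifts the span.
      misaligned = text[start:end]
      realigned = text[
          approx_grapheme_to_codepoint(text, start):approx_grapheme_to_codepoint(text, end)
      ]

   Here ``misaligned`` starts one character too early because the accented
   letter occupies two code points.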

   :Parameters:

       **task** : DoccanoTask
           The doccano task for the input converter

       **client_config** : DoccanoClientConfig, optional
           Optional client configuration defining the default values used in the doccano interface.
           This config can change, for example, the name of the text field or of the label field.

       **attr_label** : str, default="doccano_category"
           The label to use for the medkit attribute that represents the doccano category.
           This is related to :class:`~.io.DoccanoTask.TEXT_CLASSIFICATION` projects.

       **uid** : str, optional
           Identifier of the converter.

   :Attributes:

       **description** : OperationDescription
           Description for the operation.


   ..
       !! processed by numpydoc !!

   .. py:attribute:: uid
      :value: None


   .. py:attribute:: client_config
      :value: None


   .. py:attribute:: task


   .. py:attribute:: attr_label
      :value: 'doccano_category'


   .. py:attribute:: _prov_tracer
      :type:  medkit.core.ProvTracer | None
      :value: None


   .. py:method:: set_prov_tracer(prov_tracer: medkit.core.ProvTracer)

      
      Enable provenance tracing.


      :Parameters:

          **prov_tracer** : ProvTracer
              The provenance tracer used to trace the provenance.


      ..
          !! processed by numpydoc !!


   .. py:property:: description
      :type: medkit.core.OperationDescription


      Contains all the input converter init parameters.


      ..
          !! processed by numpydoc !!


   .. py:method:: load_from_directory_zip(dir_path: str | pathlib.Path) -> list[medkit.core.text.TextDocument]

      
      Load text documents from a directory of zip files.

      The zip files should contain JSONL files coming from doccano.

      :Parameters:

          **dir_path** : str or Path
              The path to the directory containing zip files.


      :Returns:

          list of TextDocument
              A list of TextDocuments
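
      As a rough picture of the directory layout this method expects, the sketch
      below collects JSONL lines from every zip file in a directory with the
      standard library. It is not the converter's implementation: the helper
      ``read_jsonl_from_zips``, the file names and the sample line are all
      hypothetical, and it returns plain dictionaries rather than
      ``TextDocument`` objects.

      .. code-block:: python

         import json
         import tempfile
         import zipfile
         from pathlib import Path


         def read_jsonl_from_zips(dir_path):
             """Collect parsed JSONL lines from every zip file found in a directory."""
             doc_lines = []
             for zip_path in sorted(Path(dir_path).glob("*.zip")):
                 with zipfile.ZipFile(zip_path) as archive:
                     for name in archive.namelist():
                         if not name.endswith(".jsonl"):
                             continue
                         content = archive.read(name).decode("utf-8")
                         doc_lines.extend(
                             json.loads(line) for line in content.split("\n") if line.strip()
                         )
             return doc_lines


         # Build a throwaway directory with one zip file to exercise the helper.
         with tempfile.TemporaryDirectory() as tmp_dir:
             with zipfile.ZipFile(Path(tmp_dir) / "export.zip", "w") as archive:
                 archive.writestr(
                     "all.jsonl", '{"text": "Patient has a fever.", "label": ["symptom"]}\n'
                 )
             doc_lines = read_jsonl_from_zips(tmp_dir)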


      ..
          !! processed by numpydoc !!


   .. py:method:: load_from_zip(input_file: str | pathlib.Path) -> list[medkit.core.text.TextDocument]

      
      Load text documents from a zip file.


      :Parameters:

          **input_file** : str or Path
              The path to the zip file containing a doccano JSONL file


      :Returns:

          list of TextDocument
              A list of TextDocuments


      ..
          !! processed by numpydoc !!


   .. py:method:: load_from_file(input_file: str | pathlib.Path) -> list[medkit.core.text.TextDocument]

      
      Load text documents from a JSONL file.


      :Parameters:

          **input_file** : str or Path
              The path to the JSONL file containing doccano annotations


      :Returns:

          list of TextDocument
              A list of TextDocuments
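
      A minimal usage sketch follows. It writes a one-line JSONL file to a
      temporary directory and shows, in comments, how the converter could be
      called on it based on the signatures documented on this page. The
      ``[start, end, label]`` layout of the sample line is an assumption about
      doccano's sequence labeling export, and the uncommented parsing is only a
      stand-in that yields dictionaries, not ``TextDocument`` objects.

      .. code-block:: python

         import json
         import tempfile
         from pathlib import Path

         # One sequence labeling line; the [start, end, label] layout is an assumption.
         doc_line = {"text": "Patient has a fever.", "label": [[14, 19, "problem"]]}

         with tempfile.TemporaryDirectory() as tmp_dir:
             input_file = Path(tmp_dir) / "annotations.jsonl"
             input_file.write_text(json.dumps(doc_line) + "\n", encoding="utf-8")

             # With medkit installed, the file could then be loaded with:
             #   from medkit.io.doccano import DoccanoInputConverter, DoccanoTask
             #   converter = DoccanoInputConverter(task=DoccanoTask.SEQUENCE_LABELING)
             #   docs = converter.load_from_file(input_file)

             # Stand-in for the conversion: one parsed dictionary per non-empty line.
             with input_file.open(encoding="utf-8") as fp:
                 parsed = [json.loads(line) for line in fp if line.strip()]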


      ..
          !! processed by numpydoc !!


   .. py:method:: _check_crlf_character(documents: list[medkit.core.text.TextDocument])

      
      Check whether any of the converted documents contains a CRLF sequence.

      This character is the only indicator available to warn if there are alignment
      problems in the documents.
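
      A CRLF sequence (``\r\n``) is two code points but a single grapheme
      cluster, which is why it can reveal offset problems. The sketch below
      shows one way such a check could look on plain strings; the helper
      ``warn_if_crlf`` and its message are hypothetical and do not reproduce
      the converter's actual behavior.

      .. code-block:: python

         import warnings


         def warn_if_crlf(texts):
             """Emit a warning and return True if any text contains a CRLF sequence."""
             found = any("\r\n" in text for text in texts)
             if found:
                 warnings.warn(
                     "CRLF found: entity offsets may be misaligned if the project "
                     "counted grapheme clusters as one character.",
                     stacklevel=2,
                 )
             return found


         # Record warnings so the outcome can be inspected programmatically.
         with warnings.catch_warnings(record=True) as caught:
             warnings.simplefilter("always")
             has_crlf = warn_if_crlf(["first line\r\nsecond line", "no line break"])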


      ..
          !! processed by numpydoc !!


   .. py:method:: _parse_doc_line(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument

      
      Parse a doc_line into a TextDocument depending on the task.


      :Parameters:

          **doc_line** : dict of str to Any
              A dictionary representing an annotation from doccano


      :Returns:

          TextDocument
              A document with parsed annotations.


      ..
          !! processed by numpydoc !!


   .. py:method:: _parse_doc_line_relation_extraction(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument

      
      Parse a dictionary and return a TextDocument with entities and relations.


      :Parameters:

          **doc_line** : dict of str to Any
              Dictionary with doccano annotation


      :Returns:

          TextDocument
              The document with annotations
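
      The role of the IDs can be sketched as follows: entities are indexed by
      ID first, then each relation is resolved through those IDs. The classes
      ``EntitySketch`` and ``RelationSketch`` are hypothetical stand-ins for
      medkit annotations, and the field names of the input dictionary are
      assumptions about doccano's relation extraction export.

      .. code-block:: python

         from dataclasses import dataclass


         @dataclass
         class EntitySketch:
             """Hypothetical stand-in for an entity: label, span and covered text."""

             label: str
             start: int
             end: int
             text: str


         @dataclass
         class RelationSketch:
             """Hypothetical stand-in for a relation between two entities."""

             label: str
             source: EntitySketch
             target: EntitySketch


         def parse_relation_extraction(doc_line):
             """Index entities by ID, then resolve each relation through those IDs."""
             text = doc_line["text"]
             entities_by_id = {}
             for ent in doc_line["entities"]:
                 start, end = ent["start_offset"], ent["end_offset"]
                 entities_by_id[ent["id"]] = EntitySketch(ent["label"], start, end, text[start:end])
             relations = [
                 RelationSketch(rel["type"], entities_by_id[rel["from_id"]], entities_by_id[rel["to_id"]])
                 for rel in doc_line["relations"]
             ]
             return list(entities_by_id.values()), relations


         doc_line = {
             "text": "Aspirin relieves fever.",
             "entities": [
                 {"id": 0, "start_offset": 0, "end_offset": 7, "label": "drug"},
                 {"id": 1, "start_offset": 17, "end_offset": 22, "label": "problem"},
             ],
             "relations": [{"id": 0, "from_id": 0, "to_id": 1, "type": "treats"}],
         }
         entities, relations = parse_relation_extraction(doc_line)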


      ..
          !! processed by numpydoc !!


   .. py:method:: _parse_doc_line_seq_labeling(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument

      
      Parse a dictionary and return a TextDocument with entities.


      :Parameters:

          **doc_line** : dict of str to Any
              Dictionary with doccano annotation.


      :Returns:

          TextDocument
              The document with annotations


      ..
          !! processed by numpydoc !!


   .. py:method:: _parse_doc_line_text_classification(doc_line: dict[str, Any]) -> medkit.core.text.TextDocument

      
      Parse a dictionary and return a TextDocument with an attribute.


      :Parameters:

          **doc_line** : dict of str to Any
              Dictionary with doccano annotation.


      :Returns:

          TextDocument
              The document with its category


      ..
          !! processed by numpydoc !!


.. py:class:: DoccanoOutputConverter(task: DoccanoTask, anns_labels: list[str] | None = None, attr_label: str | None = None, ignore_segments: bool = True, include_metadata: bool | None = True, uid: str | None = None)

   
   Convert medkit files to doccano files (.JSONL) for a given task.

   For each :class:`~medkit.core.text.TextDocument`, one JSON line is created.

   :Parameters:

       **task** : DoccanoTask
           The doccano task for the output converter

       **anns_labels** : list of str, optional
           Labels of medkit annotations to convert into doccano annotations.
           If `None` (default), all entities or relations will be converted.
           Useful for :class:`~.io.DoccanoTask.SEQUENCE_LABELING` or
           :class:`~.io.DoccanoTask.RELATION_EXTRACTION` converters.

       **attr_label** : str, optional
           The label of the medkit attribute that represents the text category.
           Useful for :class:`~.io.DoccanoTask.TEXT_CLASSIFICATION` converters.

       **ignore_segments** : bool, default=True
           If `True`, medkit segments will be ignored and only entities will be
           converted to doccano entities. If `False`, medkit segments will
           be converted to doccano entities as well.
           Useful for :class:`~.io.DoccanoTask.SEQUENCE_LABELING` or
           :class:`~.io.DoccanoTask.RELATION_EXTRACTION` converters.

       **include_metadata** : bool, default=True
           Whether to include medkit metadata in the converted documents

       **uid** : str, optional
           Identifier of the converter.

   :Attributes:

       **description** : OperationDescription
           Description for the operation.


   ..
       !! processed by numpydoc !!

   .. py:attribute:: uid
      :value: None


   .. py:attribute:: task


   .. py:attribute:: anns_labels
      :value: None


   .. py:attribute:: attr_label
      :value: None


   .. py:attribute:: ignore_segments
      :value: True


   .. py:attribute:: include_metadata
      :value: True


   .. py:property:: description
      :type: medkit.core.OperationDescription


   .. py:method:: save(docs: list[medkit.core.text.TextDocument], output_file: str | pathlib.Path)

      
      Convert and save a list of TextDocuments into a doccano file (.JSONL).


      :Parameters:

          **docs** : list of TextDocument
              List of medkit doc objects to convert

          **output_file** : str or Path
              Path to the JSONL file in which to save the converted documents
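
      The output layout, one JSON object per line, can be sketched with the
      standard library as below. The helper ``write_jsonl`` and the sample
      lines are hypothetical; the commented call at the end follows the
      signatures documented on this page, with ``"category"`` as an invented
      attribute label and ``docs`` as a placeholder list of documents.

      .. code-block:: python

         import json
         import tempfile
         from pathlib import Path


         def write_jsonl(doc_lines, output_file):
             """Write one JSON object per line, keeping non-ASCII characters readable."""
             with Path(output_file).open("w", encoding="utf-8") as fp:
                 for doc_line in doc_lines:
                     fp.write(json.dumps(doc_line, ensure_ascii=False) + "\n")


         doc_lines = [
             {"text": "Fièvre depuis hier.", "label": ["symptom"]},
             {"text": "No complaint.", "label": ["none"]},
         ]

         with tempfile.TemporaryDirectory() as tmp_dir:
             output_file = Path(tmp_dir) / "converted.jsonl"
             write_jsonl(doc_lines, output_file)
             written = output_file.read_text(encoding="utf-8").splitlines()

         # With medkit installed, the equivalent call on real documents would be:
         #   converter = DoccanoOutputConverter(
         #       task=DoccanoTask.TEXT_CLASSIFICATION, attr_label="category"
         #   )
         #   converter.save(docs, output_file)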


      ..
          !! processed by numpydoc !!


   .. py:method:: _convert_doc_by_task(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]

      
      Convert a TextDocument into a dictionary depending on the task.


      :Parameters:

          **medkit_doc** : TextDocument
              Document to convert


      :Returns:

          dict of str to Any
              Dictionary with doccano annotation


      ..
          !! processed by numpydoc !!


   .. py:method:: _convert_doc_relation_extraction(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]

      
      Convert a TextDocument to a doc_line compatible with the doccano relation extraction task.


      :Parameters:

          **medkit_doc** : TextDocument
              Document to convert; it may contain entities and relations.


      :Returns:

          dict of str to Any
              Dictionary with doccano annotation. It may contain text, entities and relations.


      ..
          !! processed by numpydoc !!


   .. py:method:: _convert_doc_seq_labeling(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]

      
      Convert a TextDocument to a doc_line compatible with the doccano sequence labeling task.


      :Parameters:

          **medkit_doc** : TextDocument
              Document to convert; it may contain entities.


      :Returns:

          dict of str to Any
              Dictionary with doccano annotation. It may contain
              text and its label (a list of tuples representing entities).
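
      The shape of such a line, and the effect of filtering by annotation
      label, can be sketched on plain tuples as follows. The helper
      ``to_sequence_labeling_line`` is hypothetical, the entities are
      simplified to ``(start, end, label)`` tuples instead of medkit objects,
      and the ``text``/``label`` keys are assumptions about doccano's format.

      .. code-block:: python

         def to_sequence_labeling_line(text, entities, anns_labels=None):
             """Turn (start, end, label) entities into a line of [start, end, label] lists."""
             kept = [
                 [start, end, label]
                 for start, end, label in sorted(entities)
                 # Keep every entity when no label filter is given.
                 if anns_labels is None or label in anns_labels
             ]
             return {"text": text, "label": kept}


         text = "Aspirin relieves fever."
         entities = [(17, 22, "problem"), (0, 7, "drug")]

         all_entities = to_sequence_labeling_line(text, entities)
         only_drugs = to_sequence_labeling_line(text, entities, anns_labels=["drug"])

      Sorting by start offset is a choice made here for readability; this page
      does not state whether the converter orders entities.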


      ..
          !! processed by numpydoc !!


   .. py:method:: _convert_doc_text_classification(medkit_doc: medkit.core.text.TextDocument) -> dict[str, Any]

      
      Convert a TextDocument to a doc_line compatible with the doccano text classification task.


      :Parameters:

          **medkit_doc** : TextDocument
              Document to convert; it should contain at least one attribute to convert.


      :Returns:

          dict of str to Any
              Dictionary with doccano annotation. It may contain
              text and its label (a category (str)).


      ..
          !! processed by numpydoc !!