medkit.tools.e3c_corpus
This module provides utilities for accessing data from the E3C corpus.
Version: 2.0.0
License: The E3C corpus is released under the Creative Commons Attribution-NonCommercial license (CC BY-NC).
Github: hltfbk/E3C-Corpus
Reference
B. Magnini, B. Altuna, A. Lavelli, M. Speranza, and R. Zanoli. 2020. The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases. In Proceedings of the Seventh Italian Conference on Computational Linguistics, Bologna, Italy, December. Associazione Italiana di Linguistica Computazionale.
Functions:
convert_data_annotation_to_medkit | Convert E3C corpus data annotation to medkit jsonl file.
convert_data_collection_to_medkit | Convert E3C corpus data collection to medkit jsonl file.
load_annotated_document | Load an E3C corpus annotated document (xml document) as medkit text document.
load_data_annotation | Load the E3C corpus data annotation as medkit text documents.
load_data_collection | Load the E3C corpus data collection as medkit text documents.
load_document | Load an E3C corpus document (json document) as medkit text document.
Data:
CLINENTITY_LABEL | Label used by medkit for annotated clinical entities of E3C corpus
SENTENCE_LABEL | Label used by medkit for annotated sentences of E3C corpus
- load_document(filepath, encoding='utf-8')
Load an E3C corpus document (json document) as a medkit text document, for example one from the data collection folder. The document id is always kept in the medkit document metadata.
- Parameters:
filepath (str or Path) – The path to the json file of the E3C corpus
encoding (str, default="utf-8") – The encoding of the file. Default: ‘utf-8’
- Return type:
TextDocument
- Returns:
TextDocument – The corresponding medkit text document
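A minimal usage sketch, assuming medkit is installed and that the returned TextDocument exposes `text` and `metadata` attributes. The helper `describe_document` and its `loader` parameter are hypothetical names introduced for illustration; the parameter only exists so the sketch can be exercised without the corpus on disk.

```python
from pathlib import Path


def describe_document(filepath, loader=None):
    """Load one E3C json document and return a small summary of it."""
    if loader is None:
        # Imported lazily so the sketch can be read without medkit installed.
        from medkit.tools.e3c_corpus import load_document as loader
    doc = loader(Path(filepath), encoding="utf-8")
    return {
        # Length of the raw clinical-case text.
        "n_chars": len(doc.text),
        # Names of the metadata entries (the document id should be among them).
        "metadata_keys": sorted(doc.metadata),
    }
```

Calling `describe_document(path)` on one json file of the data collection folder should return the text length and the names of the metadata entries kept by medkit.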
- load_data_collection(dir_path, encoding='utf-8')
Load the E3C corpus data collection as medkit text documents.
- Parameters:
dir_path (str or Path) – The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)
encoding (str, default="utf-8") – The encoding of the files. Default: ‘utf-8’
- Return type:
Iterator[TextDocument]
- Returns:
iterator of TextDocument – An iterator on corresponding medkit text documents
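Because the function returns an iterator, documents can be consumed one at a time instead of being held in memory all at once. The sketch below previews the first few documents; `preview_collection` and its `loader` parameter are hypothetical names, and the `text` attribute of TextDocument is assumed.

```python
from itertools import islice


def preview_collection(dir_path, limit=3, loader=None):
    """Return the first 80 characters of the first `limit` documents."""
    if loader is None:
        # Imported lazily so the sketch can be read without medkit installed.
        from medkit.tools.e3c_corpus import load_data_collection as loader
    docs = loader(dir_path)  # an iterator over medkit text documents
    # islice stops pulling from the iterator once `limit` items were taken.
    return [doc.text[:80] for doc in islice(docs, limit)]
```

To keep every document instead, wrap the call in `list(...)`, at the cost of loading the whole directory.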
- convert_data_collection_to_medkit(dir_path, output_file, encoding='utf-8')
Convert E3C corpus data collection to medkit jsonl file.
- Parameters:
dir_path (str or Path) – The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)
output_file (str or Path) – The medkit jsonl output file which will contain medkit text documents
encoding (str, default="utf-8") – The encoding of the files. Default: ‘utf-8’
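Since the output is a jsonl file, each line is expected to hold one serialized document. The following standard-library sketch reads such a file back for a quick inspection; `read_jsonl` is a hypothetical helper, not part of medkit, and nothing is assumed about the keys medkit writes in each record.

```python
import json


def read_jsonl(path, encoding="utf-8"):
    """Yield one decoded JSON object per non-empty line of a jsonl file."""
    with open(path, encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                yield json.loads(line)
```

For instance, after converting a directory to `e3c.jsonl`, `sum(1 for _ in read_jsonl("e3c.jsonl"))` should match the number of converted documents.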
- load_annotated_document(filepath, encoding='utf-8', keep_sentences=False)
Load an E3C corpus annotated document (xml document) as a medkit text document, for example one from the data annotation folder. Each annotation id is always kept in the corresponding medkit element metadata.
Currently, only ‘CLINENTITY’ annotations are supported. ‘SENTENCE’ annotations can also be loaded (see keep_sentences).
- Parameters:
filepath (str or Path) – The path to the xml file of the E3C corpus
encoding (str, default="utf-8") – The encoding of the file. Default: ‘utf-8’
keep_sentences (bool, default=False) – Whether to load sentences into medkit documents.
- Return type:
TextDocument
- Returns:
TextDocument – The corresponding medkit text document
- load_data_annotation(dir_path, encoding='utf-8', keep_sentences=False)
Load the E3C corpus data annotation as medkit text documents.
- Parameters:
dir_path (str or Path) – The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)
encoding (str, default="utf-8") – The encoding of the files. Default: ‘utf-8’
keep_sentences (bool, default=False) – Whether to load sentences into medkit documents.
- Return type:
Iterator[TextDocument]
- Returns:
iterator of TextDocument – An iterator on corresponding medkit text documents
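As an illustration of how the loaded annotations might be inspected, the sketch below counts clinical entities and sentences across a directory. It assumes that each TextDocument exposes its annotations through `doc.anns.get(label=...)`; `count_labels` and its `loader` parameter are hypothetical names, and the two constants mirror the values documented for this module.

```python
from collections import Counter

# Mirror the module constants CLINENTITY_LABEL and SENTENCE_LABEL.
CLINENTITY_LABEL = "disorder"
SENTENCE_LABEL = "sentence"


def count_labels(dir_path, keep_sentences=True, loader=None):
    """Count clinical-entity and sentence annotations over a directory."""
    if loader is None:
        # Imported lazily so the sketch can be read without medkit installed.
        from medkit.tools.e3c_corpus import load_data_annotation as loader
    counts = Counter()
    for doc in loader(dir_path, keep_sentences=keep_sentences):
        for label in (CLINENTITY_LABEL, SENTENCE_LABEL):
            # Assumed API: anns.get(label=...) returns a list of annotations.
            counts[label] += len(doc.anns.get(label=label))
    return counts
```

With `keep_sentences=False`, the count for the sentence label would be expected to stay at zero.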
- convert_data_annotation_to_medkit(dir_path, output_file, encoding='utf-8', keep_sentences=False)
Convert E3C corpus data annotation to medkit jsonl file.
- Parameters:
dir_path (str or Path) – The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)
output_file (str or Path) – The medkit jsonl output file which will contain medkit text documents
encoding (str, default="utf-8") – The encoding of the files. Default: ‘utf-8’
keep_sentences (bool, default=False) – Whether to load sentences into medkit documents.
- SENTENCE_LABEL = 'sentence'
Label used by medkit for annotated sentences of E3C corpus
- CLINENTITY_LABEL = 'disorder'
Label used by medkit for annotated clinical entities of E3C corpus