medkit.tools.e3c_corpus#
Tools for accessing data from the E3C corpus.
Notes#
The E3C corpus [1] [2] is released under a Creative Commons NonCommercial license (CC-BY-NC).
References#
[1] Magnini, B., Altuna, B., Lavelli, A., Speranza, M., & Zanoli, R. (2020). The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases. Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020.
[2] Zanoli, R., Lavelli, A., Verdi do Amarante, D., & Toti, D. (2023). Assessment of the E3C corpus for the recognition of disorders in clinical texts. Natural Language Engineering, 1-19. doi:10.1017/S1351324923000335
Attributes#
SENTENCE_LABEL | Label used by medkit for annotated sentences of the E3C corpus
CLINENTITY_LABEL | Label used by medkit for annotated clinical entities of the E3C corpus
Functions#
load_document | Load an E3C corpus document (json document) as a medkit text document.
load_data_collection | Load the E3C corpus data collection as medkit text documents.
convert_data_collection_to_medkit | Convert the E3C corpus data collection to a medkit jsonl file.
load_annotated_document | Load an E3C corpus annotated document (xml document) as a medkit text document.
load_data_annotation | Load the E3C corpus data annotation as medkit text documents.
convert_data_annotation_to_medkit | Convert the E3C corpus data annotation to a medkit jsonl file.
Module Contents#
- medkit.tools.e3c_corpus.SENTENCE_LABEL = 'sentence'#
Label used by medkit for annotated sentences of the E3C corpus
- medkit.tools.e3c_corpus.CLINENTITY_LABEL = 'disorder'#
Label used by medkit for annotated clinical entities of the E3C corpus
- medkit.tools.e3c_corpus.load_document(filepath: str | pathlib.Path, encoding: str = 'utf-8') -> medkit.core.text.TextDocument #
Load an E3C corpus document (json document) as a medkit text document.
The file is typically one of those found in the data collection folder. The document id is always kept in the medkit document metadata.
- Parameters:
- filepath : str or Path
The path to the json file of the E3C corpus
- encoding : str, default="utf-8"
The encoding of the file. Default: 'utf-8'
- Returns:
- TextDocument
The corresponding medkit text document
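A minimal usage sketch for `load_document`, assuming medkit is installed and that the returned `TextDocument` exposes `text` and `metadata` attributes. The helper name `preview_document` and its `loader` parameter are hypothetical, added only so the sketch can be exercised without the corpus. The metadata key under which the document id is stored is not stated here, so the whole mapping is returned rather than a guessed key.

```python
from pathlib import Path


def preview_document(filepath, loader=None, width=80):
    """Return (metadata, text preview) for one E3C json document.

    `loader` is a hypothetical hook for testing; by default the
    documented medkit function is used.
    """
    if loader is None:
        # Imported lazily so that medkit is only required when actually used
        from medkit.tools.e3c_corpus import load_document as loader
    doc = loader(Path(filepath), encoding="utf-8")
    # Copy the metadata so callers cannot mutate the document by accident
    return dict(doc.metadata), doc.text[:width]
```

Calling `preview_document(path)` on a file from the data collection folder would return the metadata mapping together with the first 80 characters of the clinical case.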
- medkit.tools.e3c_corpus.load_data_collection(dir_path: pathlib.Path | str, encoding: str = 'utf-8') -> Iterator[medkit.core.text.TextDocument] #
Load the E3C corpus data collection as medkit text documents.
- Parameters:
- dir_path : str or Path
The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)
- encoding : str, default="utf-8"
The encoding of the files. Default: 'utf-8'
- Returns:
- iterator of TextDocument
An iterator over the corresponding medkit text documents
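Since `load_data_collection` returns an iterator, a few documents can be inspected without materializing the whole collection. The sketch below assumes medkit is installed; the helper name `first_documents` and its `loader` parameter are hypothetical. Whether files are actually read on demand depends on how the iterator is implemented, which is not stated here.

```python
from itertools import islice
from pathlib import Path


def first_documents(dir_path, n, loader=None):
    """Return at most `n` documents from an E3C data collection directory.

    `loader` is a hypothetical hook for testing; by default the
    documented medkit function is used.
    """
    if loader is None:
        # Imported lazily so that medkit is only required when actually used
        from medkit.tools.e3c_corpus import load_data_collection as loader
    # islice stops pulling from the iterator once n items were produced
    return list(islice(loader(Path(dir_path)), n))
```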
- medkit.tools.e3c_corpus.convert_data_collection_to_medkit(dir_path: pathlib.Path | str, output_file: str | pathlib.Path, encoding: str | None = 'utf-8')#
Convert E3C corpus data collection to medkit jsonl file.
- Parameters:
- dir_path : str or Path
The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)
- output_file : str or Path
The medkit jsonl output file which will contain medkit text documents
- encoding : str, default="utf-8"
The encoding of the files. Default: 'utf-8'
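The output is a jsonl file, that is, one JSON value per line. As a quick sanity check after a conversion, the file can be read back with the standard library alone. This is only a sketch: `read_jsonl` is a hypothetical helper, and it makes no assumption about the keys medkit writes for each document.

```python
import json
from pathlib import Path


def read_jsonl(path, encoding="utf-8"):
    """Parse a JSON Lines file into a list of Python objects."""
    records = []
    with Path(path).open(encoding=encoding) as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

After `convert_data_collection_to_medkit(dir_path, output_file)`, `len(read_jsonl(output_file))` would be expected to match the number of documents loaded from `dir_path`.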
- medkit.tools.e3c_corpus.load_annotated_document(filepath: str | pathlib.Path, encoding: str = 'utf-8', keep_sentences=False) -> medkit.core.text.TextDocument #
Load an E3C corpus annotated document (xml document) as a medkit text document.
The file is typically one of those found in the data annotation folder. Each annotation id is always kept in the metadata of the corresponding medkit element.
Currently, only 'CLINENTITY' annotations are supported. 'SENTENCE' annotations can also be loaded by setting keep_sentences to True.
- Parameters:
- filepath : str or Path
The path to the xml file of the E3C corpus
- encoding : str, default="utf-8"
The encoding of the file. Default: 'utf-8'
- keep_sentences : bool, default=False
Whether to load sentences into medkit documents.
- Returns:
- TextDocument
The corresponding medkit text document
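To see what `keep_sentences` changes, the annotations of a loaded document can be counted by label. The sketch below assumes that `doc.anns` is iterable and that each annotation has a `label` attribute; `count_labels` is a hypothetical helper, not part of this module.

```python
from collections import Counter


def count_labels(doc):
    """Count the annotations of a medkit text document by label."""
    return Counter(ann.label for ann in doc.anns)
```

With `doc = load_annotated_document(path, keep_sentences=True)`, the counter would be expected to contain both `CLINENTITY_LABEL` ('disorder') and `SENTENCE_LABEL` ('sentence'); with the default `keep_sentences=False`, only 'disorder' should appear.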
- medkit.tools.e3c_corpus.load_data_annotation(dir_path: pathlib.Path | str, encoding: str = 'utf-8', keep_sentences: bool = False) -> Iterator[medkit.core.text.TextDocument] #
Load the E3C corpus data annotation as medkit text documents.
- Parameters:
- dir_path : str or Path
The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)
- encoding : str, default="utf-8"
The encoding of the files. Default: 'utf-8'
- keep_sentences : bool, default=False
Whether to load sentences into medkit documents.
- Returns:
- iterator of TextDocument
An iterator over the corresponding medkit text documents
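The returned iterator can feed a corpus-level summary directly. The sketch below tallies the most frequent clinical entity mentions, assuming each annotation exposes `label` and `text` attributes; `most_common_mentions` is a hypothetical helper, and lowercasing the mentions is a design choice of this example, not something the corpus prescribes.

```python
from collections import Counter


def most_common_mentions(docs, label="disorder", top=10):
    """Return the `top` most frequent mention texts carrying `label`.

    `docs` is any iterable of medkit text documents, for instance the
    iterator returned by load_data_annotation.
    """
    counts = Counter(
        ann.text.lower()  # merge case variants such as "Fever" and "fever"
        for doc in docs
        for ann in doc.anns
        if ann.label == label
    )
    return counts.most_common(top)
```

A typical call would be `most_common_mentions(load_data_annotation(dir_path))`, which consumes the iterator once.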
- medkit.tools.e3c_corpus.convert_data_annotation_to_medkit(dir_path: pathlib.Path | str, output_file: str | pathlib.Path, encoding: str | None = 'utf-8', keep_sentences: bool = False)#
Convert E3C corpus data annotation to medkit jsonl file.
- Parameters:
- dir_path : str or Path
The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)
- output_file : str or Path
The medkit jsonl output file which will contain medkit text documents
- encoding : str, default="utf-8"
The encoding of the files. Default: 'utf-8'
- keep_sentences : bool, default=False
Whether to load sentences into medkit documents.