medkit.tools.e3c_corpus#

Tools for accessing data from the E3C corpus.

Notes#

The E3C corpus [1] [2] is released under a Creative Commons NonCommercial license (CC-BY-NC).

References#

[1]

Magnini, B., Altuna, B., Lavelli, A., Speranza, M., & Zanoli, R. (2020). The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases. Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020.

[2]

Zanoli, R., Lavelli, A., Verdi do Amarante, D., & Toti, D. (2023). Assessment of the E3C corpus for the recognition of disorders in clinical texts. Natural Language Engineering, 1-19. doi:10.1017/S1351324923000335

Attributes#

`SENTENCE_LABEL`	Label used by medkit for annotated sentences of the E3C corpus
`CLINENTITY_LABEL`	Label used by medkit for annotated clinical entities of the E3C corpus

Functions#

`load_document`(→ medkit.core.text.TextDocument)	Load an E3C corpus document (json document) as a medkit text document.
`load_data_collection`(...)	Load the E3C corpus data collection as medkit text documents.
`convert_data_collection_to_medkit`(dir_path, output_file)	Convert the E3C corpus data collection to a medkit jsonl file.
`load_annotated_document`(→ medkit.core.text.TextDocument)	Load an E3C corpus annotated document (xml document) as a medkit text document.
`load_data_annotation`(...)	Load the E3C corpus data annotation as medkit text documents.
`convert_data_annotation_to_medkit`(dir_path, output_file)	Convert the E3C corpus data annotation to a medkit jsonl file.

Module Contents#

medkit.tools.e3c_corpus.SENTENCE_LABEL = 'sentence'#

Label used by medkit for annotated sentences of the E3C corpus

medkit.tools.e3c_corpus.CLINENTITY_LABEL = 'disorder'#

Label used by medkit for annotated clinical entities of the E3C corpus

medkit.tools.e3c_corpus.load_document(filepath: str | pathlib.Path, encoding: str = 'utf-8') → medkit.core.text.TextDocument#

Load an E3C corpus document (json document) as a medkit text document.

For example, a file from the data collection folder. The document id is always kept in the medkit document metadata.

Parameters:

filepath (str or Path): The path to the json file of the E3C corpus
encoding (str, default="utf-8"): The encoding of the file.

Returns:

TextDocument: The corresponding medkit text document
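
A minimal usage sketch, assuming medkit is installed and a local copy of the corpus is available. The helper name `preview_e3c_document` and its `max_chars` parameter are illustrative and not part of medkit. The metadata key that holds the document id is not specified here, so the whole metadata dict is returned.

```python
from pathlib import Path


def preview_e3c_document(filepath, max_chars=200):
    """Return the metadata and the first characters of one E3C json document.

    Hypothetical helper, not part of medkit.
    """
    # Imported lazily so the helper can be defined even where medkit is absent.
    from medkit.tools.e3c_corpus import load_document

    doc = load_document(Path(filepath), encoding="utf-8")
    # The E3C document id is kept in doc.metadata; copy it into a plain dict.
    return dict(doc.metadata), doc.text[:max_chars]
```

Calling the helper on a json file from a data collection folder gives a quick check that the text and the document id were loaded as expected.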

medkit.tools.e3c_corpus.load_data_collection(dir_path: pathlib.Path | str, encoding: str = 'utf-8') → Iterator[medkit.core.text.TextDocument]#

Load the E3C corpus data collection as medkit text documents.

Parameters:

dir_path (str or Path): The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)
encoding (str, default="utf-8"): The encoding of the files.

Returns:

iterator of TextDocument: An iterator over the corresponding medkit text documents
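
Because the function returns an iterator, documents can be consumed one at a time instead of being collected into a list up front. The sketch below takes only the first few documents. `first_documents` and its `loader` parameter are hypothetical names introduced for illustration; `loader` is assumed to be any callable that maps a directory path to an iterator of documents.

```python
from itertools import islice


def first_documents(dir_path, limit=5, loader=None):
    """Return at most `limit` documents from an E3C directory.

    Hypothetical helper, not part of medkit.
    """
    if loader is None:
        # Default to the data collection loader; imported lazily.
        from medkit.tools.e3c_corpus import load_data_collection

        loader = load_data_collection
    # islice stops requesting items from the iterator after `limit` documents.
    return list(islice(loader(dir_path), limit))
```

Passing `loader=load_data_annotation` would apply the same pattern to the annotated xml files.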

medkit.tools.e3c_corpus.convert_data_collection_to_medkit(dir_path: pathlib.Path | str, output_file: str | pathlib.Path, encoding: str | None = 'utf-8')#

Convert the E3C corpus data collection to a medkit jsonl file.

Parameters:

dir_path (str or Path): The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1)
output_file (str or Path): The medkit jsonl output file which will contain medkit text documents
encoding (str, default="utf-8"): The encoding of the files.
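
A possible way to run the conversion and sanity-check the result is sketched below. It assumes the output follows the usual JSON Lines layout of one serialized document per line. Both `count_jsonl_records` and `convert_collection` are hypothetical helpers, not part of medkit.

```python
from pathlib import Path


def count_jsonl_records(jsonl_path, encoding="utf-8"):
    """Count the non-empty lines of a jsonl file (assumed: one document per line)."""
    with Path(jsonl_path).open(encoding=encoding) as fp:
        return sum(1 for line in fp if line.strip())


def convert_collection(dir_path, output_file):
    """Convert a data collection directory and return the number of lines written.

    Hypothetical wrapper, not part of medkit.
    """
    # Imported lazily so the counting helper stays usable without medkit.
    from medkit.tools.e3c_corpus import convert_data_collection_to_medkit

    convert_data_collection_to_medkit(dir_path=dir_path, output_file=output_file)
    return count_jsonl_records(output_file)
```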

medkit.tools.e3c_corpus.load_annotated_document(filepath: str | pathlib.Path, encoding: str = 'utf-8', keep_sentences=False) → medkit.core.text.TextDocument#

Load an E3C corpus annotated document (xml document) as a medkit text document.

For example, a file from the data annotation folder. Each annotation id is always kept in the metadata of the corresponding medkit element.

Currently, only 'CLINENTITY' annotations are supported. 'SENTENCE' annotations can also be loaded by setting keep_sentences to True.

Parameters:

filepath (str or Path): The path to the xml file of the E3C corpus
encoding (str, default="utf-8"): The encoding of the file.
keep_sentences (bool, default=False): Whether to load sentences into medkit documents.

Returns:

TextDocument: The corresponding medkit text document
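
The loaded annotations can be retrieved by label using the two module attributes. The sketch below assumes that the document's annotation container exposes `doc.anns.get(label=...)` and that each annotation has a `text` attribute, as in medkit's text document API. The helper names are hypothetical.

```python
def annotation_texts(doc, label):
    """Return the text of every annotation of `doc` carrying `label`."""
    # Assumption: doc.anns.get(label=...) returns the annotations with that label.
    return [ann.text for ann in doc.anns.get(label=label)]


def load_disorders_and_sentences(filepath):
    """Load one annotated E3C xml file; return (disorder texts, sentence texts).

    Hypothetical helper, not part of medkit.
    """
    # Imported lazily so annotation_texts stays usable without medkit.
    from medkit.tools.e3c_corpus import (
        CLINENTITY_LABEL,
        SENTENCE_LABEL,
        load_annotated_document,
    )

    # keep_sentences=True also loads the 'SENTENCE' annotations.
    doc = load_annotated_document(filepath, keep_sentences=True)
    return (
        annotation_texts(doc, CLINENTITY_LABEL),
        annotation_texts(doc, SENTENCE_LABEL),
    )
```

With the default `keep_sentences=False`, the second list would be expected to stay empty, since only clinical entities are loaded.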

medkit.tools.e3c_corpus.load_data_annotation(dir_path: pathlib.Path | str, encoding: str = 'utf-8', keep_sentences: bool = False) → Iterator[medkit.core.text.TextDocument]#

Load the E3C corpus data annotation as medkit text documents.

Parameters:

dir_path (str or Path): The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)
encoding (str, default="utf-8"): The encoding of the files.
keep_sentences (bool, default=False): Whether to load sentences into medkit documents.

Returns:

iterator of TextDocument: An iterator over the corresponding medkit text documents
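
As an illustration of working with the whole annotated set, the sketch below tallies the most frequent clinical entity mentions across a directory. It relies on the same assumptions about `doc.anns.get(label=...)` and `ann.text` as medkit's text document API. `count_mentions` and `most_common_disorders` are hypothetical helpers, and lowercasing the mentions is a choice made here, not something medkit does.

```python
from collections import Counter


def count_mentions(docs, label):
    """Count how often each annotated text with `label` occurs across `docs`."""
    counts = Counter()
    for doc in docs:
        # Lowercase so that "Fever" and "fever" are counted together.
        counts.update(ann.text.lower() for ann in doc.anns.get(label=label))
    return counts


def most_common_disorders(dir_path, top=10):
    """Return the `top` most frequent clinical entity mentions in a directory.

    Hypothetical helper, not part of medkit.
    """
    # Imported lazily so count_mentions stays usable without medkit.
    from medkit.tools.e3c_corpus import CLINENTITY_LABEL, load_data_annotation

    docs = load_data_annotation(dir_path)
    return count_mentions(docs, CLINENTITY_LABEL).most_common(top)
```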

medkit.tools.e3c_corpus.convert_data_annotation_to_medkit(dir_path: pathlib.Path | str, output_file: str | pathlib.Path, encoding: str | None = 'utf-8', keep_sentences: bool = False)#

Convert the E3C corpus data annotation to a medkit jsonl file.

Parameters:

dir_path (str or Path): The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1)
output_file (str or Path): The medkit jsonl output file which will contain medkit text documents
encoding (str, default="utf-8"): The encoding of the files.
keep_sentences (bool, default=False): Whether to load sentences into medkit documents.
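
Once converted, the jsonl file can be read back without parsing the xml again. The sketch below assumes that `medkit.io.medkit_json.load_text_documents` is available for reading the file; check this against the installed medkit version. `convert_and_reload_annotations` is a hypothetical helper, not part of medkit.

```python
def convert_and_reload_annotations(dir_path, output_file):
    """Convert annotated E3C xml files to jsonl, then reload them as documents.

    Hypothetical helper, not part of medkit.
    """
    # Imported lazily; medkit_json.load_text_documents is an assumption about
    # medkit's I/O module and should be verified against the installed version.
    from medkit.io import medkit_json
    from medkit.tools.e3c_corpus import convert_data_annotation_to_medkit

    convert_data_annotation_to_medkit(
        dir_path=dir_path,
        output_file=output_file,
        keep_sentences=True,  # also store the 'SENTENCE' annotations
    )
    return list(medkit_json.load_text_documents(output_file))
```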