medkit.tools.e3c_corpus

Tools for accessing data from the E3C corpus.

Notes

The E3C corpus [1] [2] is released under a Creative Commons NonCommercial license (CC-BY-NC).

References

[1] Magnini, B., Altuna, B., Lavelli, A., Speranza, M., & Zanoli, R. (2020). The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases. Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020.

[2] Zanoli, R., Lavelli, A., Verdi do Amarante, D., & Toti, D. (2023). Assessment of the E3C corpus for the recognition of disorders in clinical texts. Natural Language Engineering, 1-19. doi:10.1017/S1351324923000335

Attributes

SENTENCE_LABEL

Label used by medkit for annotated sentences of the E3C corpus.

CLINENTITY_LABEL

Label used by medkit for annotated clinical entities of the E3C corpus.

Functions

load_document(→ medkit.core.text.TextDocument)

Load an E3C corpus document (json document) as a medkit text document.

load_data_collection(...)

Load the E3C corpus data collection as medkit text documents.

convert_data_collection_to_medkit(dir_path, output_file)

Convert the E3C corpus data collection to a medkit jsonl file.

load_annotated_document(→ medkit.core.text.TextDocument)

Load an E3C corpus annotated document (xml document) as a medkit text document.

load_data_annotation(...)

Load the E3C corpus data annotation as medkit text documents.

convert_data_annotation_to_medkit(dir_path, output_file)

Convert the E3C corpus data annotation to a medkit jsonl file.

Module Contents

medkit.tools.e3c_corpus.SENTENCE_LABEL = 'sentence'

Label used by medkit for annotated sentences of the E3C corpus.

medkit.tools.e3c_corpus.CLINENTITY_LABEL = 'disorder'

Label used by medkit for annotated clinical entities of the E3C corpus.

medkit.tools.e3c_corpus.load_document(filepath: str | pathlib.Path, encoding: str = 'utf-8') → medkit.core.text.TextDocument

Load an E3C corpus document (json document) as a medkit text document.

For example, a file from the data collection folder. The document id is always kept in the medkit document's metadata.

Parameters:
filepath : str or Path

The path to the json file of the E3C corpus.

encoding : str, default="utf-8"

The encoding of the file.

Returns:
TextDocument

The corresponding medkit text document.

medkit.tools.e3c_corpus.load_data_collection(dir_path: pathlib.Path | str, encoding: str = 'utf-8') → Iterator[medkit.core.text.TextDocument]

Load the E3C corpus data collection as medkit text documents.

Parameters:
dir_path : str or Path

The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1).

encoding : str, default="utf-8"

The encoding of the files.

Returns:
iterator of TextDocument

An iterator over the corresponding medkit text documents.
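Because the function returns an iterator, documents can be consumed lazily. The following is a minimal sketch of previewing only the first few documents of a directory. The helper names `take_first` and `preview_collection` are hypothetical, and the path in the usage comment is the example directory quoted above, not a guaranteed location.

```python
from itertools import islice


def take_first(docs, n):
    """Return at most n items from an iterator, leaving the rest unconsumed."""
    return list(islice(docs, n))


def preview_collection(dir_path, n=3):
    """Load only the first n documents of a data collection directory."""
    # Imported lazily so that take_first stays usable without medkit installed.
    from medkit.tools.e3c_corpus import load_data_collection

    docs = load_data_collection(dir_path, encoding="utf-8")
    return take_first(docs, n)


# Hypothetical usage (assumes the corpus was unpacked under /tmp):
# for doc in preview_collection("/tmp/E3C-Corpus-2.0.0/data_collection/French/layer1"):
#     print(doc.metadata, doc.text[:80])
```

Stopping early this way avoids loading every json file when only a quick look at the data is needed.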

medkit.tools.e3c_corpus.convert_data_collection_to_medkit(dir_path: pathlib.Path | str, output_file: str | pathlib.Path, encoding: str | None = 'utf-8')

Convert the E3C corpus data collection to a medkit jsonl file.

Parameters:
dir_path : str or Path

The path to the E3C corpus data collection directory containing the json files (e.g., /tmp/E3C-Corpus-2.0.0/data_collection/French/layer1).

output_file : str or Path

The medkit jsonl output file which will contain the medkit text documents.

encoding : str, default="utf-8"

The encoding of the files.
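As a rough illustration, a conversion can be followed by a sanity check on the output file. The sketch below assumes the output follows the usual jsonl convention of one JSON object per line and can be read as UTF-8. Both `count_jsonl_records` and `convert_and_check` are hypothetical helpers, not part of medkit.

```python
import json
from pathlib import Path


def count_jsonl_records(jsonl_path, encoding="utf-8"):
    """Count the non-empty lines of a jsonl file, checking that each parses as JSON."""
    count = 0
    with Path(jsonl_path).open(encoding=encoding) as fp:
        for line in fp:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed record
                count += 1
    return count


def convert_and_check(dir_path, output_file):
    """Convert a data collection directory, then report how many records were written."""
    # Imported lazily so that count_jsonl_records works without medkit installed.
    from medkit.tools.e3c_corpus import convert_data_collection_to_medkit

    convert_data_collection_to_medkit(dir_path, output_file, encoding="utf-8")
    return count_jsonl_records(output_file)
```

Comparing the returned count with the number of json files in the source directory is a cheap way to spot a partial conversion.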

medkit.tools.e3c_corpus.load_annotated_document(filepath: str | pathlib.Path, encoding: str = 'utf-8', keep_sentences=False) → medkit.core.text.TextDocument

Load an E3C corpus annotated document (xml document) as a medkit text document.

For example, a file from the data annotation folder. Each annotation id is always kept in the metadata of the corresponding medkit element.

Currently, only 'CLINENTITY' annotations are supported. 'SENTENCE' annotations can also be loaded by setting keep_sentences to True.

Parameters:
filepath : str or Path

The path to the xml file of the E3C corpus.

encoding : str, default="utf-8"

The encoding of the file.

keep_sentences : bool, default=False

Whether to load sentences into medkit documents.

Returns:
TextDocument

The corresponding medkit text document.
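A minimal sketch of reading the clinical entities back from a loaded document follows. It assumes the annotations are reachable through `doc.anns.get(label=...)` and expose `text` and `spans`, as medkit text annotations generally do. `disorder_mentions` and `load_disorders` are hypothetical helper names.

```python
def disorder_mentions(doc, label="disorder"):
    """Return (text, spans) pairs for the annotations carrying the given label."""
    return [(ann.text, ann.spans) for ann in doc.anns.get(label=label)]


def load_disorders(filepath):
    """Load one annotated xml file and list its clinical entity mentions."""
    # Imported lazily so that disorder_mentions can be used without medkit installed.
    from medkit.tools.e3c_corpus import CLINENTITY_LABEL, load_annotated_document

    doc = load_annotated_document(filepath, encoding="utf-8", keep_sentences=False)
    return disorder_mentions(doc, label=CLINENTITY_LABEL)
```

Passing `CLINENTITY_LABEL` instead of hard-coding `'disorder'` keeps the lookup in line with the label the loader actually assigns.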

medkit.tools.e3c_corpus.load_data_annotation(dir_path: pathlib.Path | str, encoding: str = 'utf-8', keep_sentences: bool = False) → Iterator[medkit.core.text.TextDocument]

Load the E3C corpus data annotation as medkit text documents.

Parameters:
dir_path : str or Path

The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1).

encoding : str, default="utf-8"

The encoding of the files.

keep_sentences : bool, default=False

Whether to load sentences into medkit documents.

Returns:
iterator of TextDocument

An iterator over the corresponding medkit text documents.
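One possible use of the iterator is a frequency count of annotated mentions across a whole directory. This is only a sketch: `count_mentions` and `most_common_disorders` are hypothetical helpers, and the `doc.anns.get(label=...)` lookup is assumed from medkit's text document API.

```python
from collections import Counter


def count_mentions(docs, label="disorder"):
    """Count lowercased surface forms of the annotations carrying the given label."""
    counts = Counter()
    for doc in docs:
        counts.update(ann.text.lower() for ann in doc.anns.get(label=label))
    return counts


def most_common_disorders(dir_path, top=10):
    """Return the most frequent clinical entity mentions of an annotation directory."""
    # Imported lazily so that count_mentions can be used without medkit installed.
    from medkit.tools.e3c_corpus import CLINENTITY_LABEL, load_data_annotation

    docs = load_data_annotation(dir_path, encoding="utf-8", keep_sentences=False)
    return count_mentions(docs, label=CLINENTITY_LABEL).most_common(top)
```

The documents are processed one at a time, so the whole corpus never has to be held in memory.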

medkit.tools.e3c_corpus.convert_data_annotation_to_medkit(dir_path: pathlib.Path | str, output_file: str | pathlib.Path, encoding: str | None = 'utf-8', keep_sentences: bool = False)

Convert the E3C corpus data annotation to a medkit jsonl file.

Parameters:
dir_path : str or Path

The path to the E3C corpus data annotation directory containing the xml files (e.g., /tmp/E3C-Corpus-2.0.0/data_annotation/French/layer1).

output_file : str or Path

The medkit jsonl output file which will contain the medkit text documents.

encoding : str, default="utf-8"

The encoding of the files.

keep_sentences : bool, default=False

Whether to load sentences into medkit documents.