Elan¶
- class pympi.Elan.Eaf(file_path=None, author='pympi')¶
Read and write Elan’s Eaf files.
Note
All times are in milliseconds and can’t have decimals.
Variables: - adocument (dict) – Annotation document TAG entries.
- licenses (list) – Licenses included in the file of the form: (name, url).
- header (dict) – XML header.
- media_descriptors (list) – Linked files, where every file is of the form: {attrib}.
- properties (list) – Properties, where every property is of the form: (key, value).
- linked_file_descriptors (list) – Secondary linked files, where every linked file is of the form: {attrib}.
- timeslots (dict) – Timeslot data of the form: {id -> time(ms)}.
- tiers (dict) –
Tiers, where every tier is of the form: {tier_name -> (aligned_annotations, reference_annotations, attributes, ordinal)},
aligned_annotations of the form: [{id -> (begin_ts, end_ts, value, svg_ref)}],
reference annotations of the form: [{id -> (reference, value, previous, svg_ref)}].
- linguistic_types (list) – Linguistic types, where every type is of the form: {id -> attrib}.
- locales (dict) – Locales, of the form: {langcode -> (countrycode, variant)}.
- languages (dict) – Languages, of the form: {langid -> (langdef, langlabel)}.
- constraints (dict) – Constraints, every constraint is of the form: {stereotype -> description}.
- controlled_vocabularies (dict) –
Controlled vocabulary, where every controlled vocabulary is of the form: {id -> (descriptions, entries, ext_ref)},
descriptions of the form: [(value, lang_ref, description)],
entries of the form: {id -> (values, ext_ref)},
values of the form: [(lang_ref, description, text)].
- external_refs (list) – External references of the form: {id -> (type, value)}.
- lexicon_refs (list) – Lexicon references, where every reference is of the form: {id -> {attribs}}.
- annotations (dict) – Dictionary of annotations of the form: {id -> tier}, this is only used internally.
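The layout of timeslots and tiers described above can be illustrated with plain dictionaries. This is a minimal sketch, not pympi code: the sample ids, the attribute key and the helper resolve_aligned are hypothetical, and it assumes aligned_annotations is a dict keyed by annotation id.

```python
# Hypothetical sample data shaped like the timeslots and tiers variables.
timeslots = {"ts1": 0, "ts2": 1500, "ts3": 2000, "ts4": 3200}
tiers = {
    "speakerA": (
        # aligned_annotations: id -> (begin_ts, end_ts, value, svg_ref)
        {"a1": ("ts1", "ts2", "hello", None),
         "a2": ("ts3", "ts4", "world", None)},
        {},                       # reference_annotations (none here)
        {"TIER_ID": "speakerA"},  # attributes (assumed key name)
        1,                        # ordinal
    )
}


def resolve_aligned(tiers, timeslots, tier_name):
    """Return sorted (begin_ms, end_ms, value) tuples for one tier."""
    aligned = tiers[tier_name][0]
    return sorted(
        (timeslots[begin], timeslots[end], value)
        for begin, end, value, _svg in aligned.values()
    )


print(resolve_aligned(tiers, timeslots, "speakerA"))
# prints [(0, 1500, 'hello'), (2000, 3200, 'world')]
```

Annotations store timeslot ids rather than times, so the times have to be looked up in timeslots before they can be compared or sorted.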
- __init__(file_path=None, author='pympi')¶
Construct either a new Eaf file or read one from a file/stream.
Parameters: - file_path (str) – Path to read from, or - for stdin. If None, an empty Eaf file will be created.
- author (str) – Author of the file.
- add_annotation(id_tier, start, end, value='', svg_ref=None)¶
Add an annotation.
Parameters: - id_tier (str) – Name of the tier.
- start (int) – Start time of the annotation.
- end (int) – End time of the annotation.
- value (str) – Value of the annotation.
- svg_ref (str) – Svg reference.
Raises: - KeyError – If the tier is non existent.
- ValueError – If one of the values is negative, if start is bigger than end, or if the tier already contains ref annotations.
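The two documented exceptions can be handled separately by the caller. The helper below is a hypothetical sketch: add_checked is not part of pympi, and eaf is assumed to be a pympi.Elan.Eaf instance (or any object with a compatible add_annotation method).

```python
def add_checked(eaf, id_tier, start, end, value=""):
    """Add an annotation; return an error message instead of raising."""
    try:
        eaf.add_annotation(id_tier, start, end, value)
    except KeyError:
        # Documented case: the tier does not exist.
        return f"unknown tier: {id_tier}"
    except ValueError as err:
        # Documented case: negative time, start after end,
        # or a tier that already holds ref annotations.
        return f"rejected: {err}"
    return None
```

Returning a message rather than raising is only one possible design; it suits batch imports where a single bad annotation should not abort the run.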
- add_controlled_vocabulary(cv_id, ext_ref=None)¶
Add a controlled vocabulary. This will initialize the controlled vocabulary without entries.
Parameters: - cv_id (str) – Name of the controlled vocabulary.
- ext_ref (str) – External reference.
- add_cv_description(cv_id, lang_ref, description=None)¶
Add a description to a controlled vocabulary.
Parameters: - cv_id (str) – Name of the controlled vocabulary to add the description.
- lang_ref (str) – Language reference.
- description (str) – Description, this can be None.
Throws KeyError: If there is no controlled vocabulary with that id.
Throws ValueError: If the language provided doesn’t exist.
- add_cv_entry(cv_id, cve_id, values, ext_ref=None)¶
Add an entry to a controlled vocabulary.
Parameters: - cv_id (str) – Name of the controlled vocabulary to add an entry.
- cve_id (str) – Name of the entry.
- values (list) – List of values of the form: (value, lang_ref, description) where description can be None.
- ext_ref (str) – External reference.
Throws KeyError: If there is no controlled vocabulary with that id.
Throws ValueError: If a language in one of the entries doesn’t exist.
- add_external_ref(eid, etype, value)¶
Add an external reference.
Parameters: - eid (str) – Name of the external reference.
- etype (str) – Type of the external reference, has to be in ['iso12620', 'ecv', 'cve_id', 'lexen_id', 'resource_url'].
- value (str) – Value of the external reference.
Throws KeyError: If etype is not in the list of possible types.
- add_language(lang_id, lang_def=None, lang_label=None)¶
Add a language.
Parameters: - lang_id (str) – ID of the language.
- lang_def (str) – Definition of the language (preferably ISO-639-3).
- lang_label (str) – Label of the language.
- add_lexicon_ref(lrid, name, lrtype, url, lexicon_id, lexicon_name, datcat_id=None, datcat_name=None)¶
Add lexicon reference.
Parameters: - lrid (str) – Lexicon reference internal ID.
- name (str) – Lexicon reference display name.
- lrtype (str) – Lexicon reference service type.
- url (str) – Lexicon reference service location.
- lexicon_id (str) – Lexicon reference service id.
- lexicon_name (str) – Lexicon reference service name.
- datcat_id (str) – Lexicon reference identifier of data category.
- datcat_name (str) – Lexicon reference name of data category.
- add_license(name, url)¶
Add a license.
Parameters: - name (str) – Name of the license.
- url (str) – URL of the license.
- add_linguistic_type(lingtype, constraints=None, timealignable=True, graphicreferences=False, extref=None, param_dict=None)¶
Add a linguistic type.
Parameters: - lingtype (str) – Name of the linguistic type.
- constraints (str) – Constraint name.
- timealignable (bool) – Flag for time alignable.
- graphicreferences (bool) – Flag for graphic references.
- extref (str) – External reference.
- param_dict (dict) – TAG attributes, when this is not None it will ignore all other options. Please only use dictionaries coming from get_parameters_for_linguistic_type().
Raises KeyError: If a constraint is not defined.
- add_linked_file(file_path, relpath=None, mimetype=None, time_origin=None, ex_from=None)¶
Add a linked file.
Parameters: - file_path (str) – Path of the file.
- relpath (str) – Relative path of the file.
- mimetype (str) – Mimetype of the file, if None it tries to guess it according to the file extension which currently only works for wav, mpg, mpeg and xml.
- time_origin (int) – Time origin for the media file.
- ex_from (str) – Extracted from field.
Raises KeyError: If the mimetype had to be guessed and the file has a non-standard extension or an unknown mimetype.
- add_locale(language_code, country_code=None, variant=None)¶
Add a locale.
Parameters: - language_code (str) – The language code of the locale.
- country_code (str) – The country code of the locale.
- variant (str) – The variant of the locale.
- add_property(key, value)¶
Add a property.
Parameters: - key (str) – Key of the property.
- value (str) – Value of the property.
- add_ref_annotation(id_tier, tier2, time, value='', prev=None, svg=None)¶
Add a reference annotation.
Parameters: - id_tier (str) – Name of the tier.
- tier2 (str) – Tier of the referenced annotation.
- time (int) – Time of the referenced annotation.
- value (str) – Value of the annotation.
- prev (str) – Id of the previous annotation.
- svg (str) – Svg reference.
Raises: - KeyError – If the tier is non existent.
- ValueError – If the tier already contains normal annotations or if there is no annotation in the referenced tier at the given time.
- add_secondary_linked_file(file_path, relpath=None, mimetype=None, time_origin=None, assoc_with=None)¶
Add a secondary linked file.
Parameters: - file_path (str) – Path of the file.
- relpath (str) – Relative path of the file.
- mimetype (str) – Mimetype of the file, if None it tries to guess it according to the file extension which currently only works for wav, mpg, mpeg and xml.
- time_origin (int) – Time origin for the media file.
- assoc_with (str) – Associated with field.
Raises KeyError: If the mimetype had to be guessed and the file has a non-standard extension or an unknown mimetype.
- add_tier(tier_id, ling='default-lt', parent=None, locale=None, part=None, ann=None, language=None, tier_dict=None)¶
Add a tier. When no linguistic type is given and the default linguistic type is unavailable then the assigned linguistic type will be the first in the list.
Parameters: - tier_id (str) – Name of the tier.
- ling (str) – Linguistic type, if the type is not available it will warn and pick the first available type.
- parent (str) – Parent tier name.
- locale (str) – Locale, if the locale is not present this option is ignored and the locale will not be set.
- part (str) – Participant.
- ann (str) – Annotator.
- language (str) – Language, if the language is not present this option is ignored and the language will not be set.
- tier_dict (dict) – TAG attributes, when this is not None it will ignore all other options. Please only use dictionaries coming from get_parameters_for_tier().
Raises ValueError: If the tier_id is empty.
- child_tiers_for(id_tier)¶
- clean_time_slots()¶
Clean up all unused timeslots.
Warning
This can and will take time for larger tiers.
When you want to do a lot of operations on a lot of tiers please unset the flags for cleaning in the functions so that the cleaning is only performed afterwards.
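What cleaning unused timeslots means can be sketched on plain dictionaries. This is an illustration under stated assumptions, not pympi's implementation: the helper name is hypothetical, tiers and timeslots follow the layout documented for the class variables, and only aligned annotations are assumed to reference timeslots.

```python
def unused_timeslots_removed(tiers, timeslots):
    """Return a copy of timeslots keeping only ids used by aligned annotations."""
    used = set()
    for aligned, _refs, _attrs, _ordinal in tiers.values():
        for begin_ts, end_ts, _value, _svg in aligned.values():
            used.update((begin_ts, end_ts))
    return {ts: ms for ts, ms in timeslots.items() if ts in used}
```

Every call scans every annotation in every tier, which is why running the cleanup once after a batch of removals is cheaper than running it after each removal.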
- copy_tier(eaf_obj, tier_name)¶
Copies a tier to another pympi.Elan.Eaf object.
Parameters: - eaf_obj (pympi.Elan.Eaf) – Target Eaf object.
- tier_name (str) – Name of the tier.
Raises KeyError: If the tier doesn’t exist.
- create_gaps_and_overlaps_tier(tier1, tier2, tier_name=None, maxlen=-1, fast=False)¶
Create a tier with the gaps and overlaps of the annotations. For types see get_gaps_and_overlaps().
Parameters: - tier1 (str) – Name of the first tier.
- tier2 (str) – Name of the second tier.
- tier_name (str) – Name of the new tier, if None the name will be generated.
- maxlen (int) – Maximum length of gaps (skip longer ones), if -1 no maximum will be used.
- fast (bool) – Flag for using the fast method.
Returns: List of gaps and overlaps of the form: [(type, start, end)].
Raises: - KeyError – If a tier is non existent.
- IndexError – If no annotations are available in the tiers.
- extract(start, end)¶
Extracts the selected time frame as a new object.
Parameters: - start (int) – Start time.
- end (int) – End time.
Returns: pympi.Elan.Eaf object containing the extracted frame.
- filter_annotations(tier, tier_name=None, filtin=None, filtex=None, regex=False, safe=False)¶
Filter annotations in a tier using an exclusive and/or inclusive filter.
Parameters: - tier (str) – Name of the tier.
- tier_name (str) – Name of the output tier, when None the name will be generated.
- filtin (list) – List of strings to be included, if None all annotations are included.
- filtex (list) – List of strings to be excluded, if None no strings are excluded.
- regex (bool) – If this flag is set, the filters are seen as regex matches.
- safe (bool) – Ignore zero length annotations (when working with possibly malformed data).
Returns: Name of the created tier.
Raises KeyError: If the tier is non existent.
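How the inclusive and exclusive filters combine can be sketched as a predicate on a single annotation value. This is a hypothetical helper, not pympi's code: the name keep is invented, plain filters are assumed to compare for equality, and regex filters are assumed to use re.match (anchored at the start of the value).

```python
import re


def keep(value, filtin=None, filtex=None, regex=False):
    """Return True if an annotation value passes both filters."""
    def matches(filters):
        if regex:
            return any(re.match(pattern, value) for pattern in filters)
        return value in filters

    # None means "no inclusive filter": everything is a candidate.
    if filtin is not None and not matches(filtin):
        return False
    # None means "no exclusive filter": nothing is thrown out.
    if filtex is not None and matches(filtex):
        return False
    return True
```

When a value matches both filters, this sketch lets the exclusive filter win.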
- generate_annotation_id()¶
Generate the next annotation id, this function is mainly used internally.
- generate_ts_id(time=None)¶
Generate the next timeslot id, this function is mainly used internally.
Parameters: time (int) – Initial time to assign to the timeslot.
Raises ValueError: If the time is negative.
- get_annotation_data_at_time(id_tier, time)¶
Give the annotations at the given time. When the tier contains reference annotations, those are returned instead; see get_ref_annotation_at_time() for the format.
Parameters: - id_tier (str) – Name of the tier.
- time (int) – Time of the annotation.
Returns: List of annotations at that time.
Raises KeyError: If the tier is non existent.
- get_annotation_data_between_times(id_tier, start, end)¶
Gives the annotations within the times. When the tier contains reference annotations, those are returned instead; see get_ref_annotation_data_between_times() for the format.
Parameters: - id_tier (str) – Name of the tier.
- start (int) – Start time of the annotation.
- end (int) – End time of the annotation.
Returns: List of annotations within that time.
Raises KeyError: If the tier is non existent.
- get_annotation_data_for_tier(id_tier)¶
Gives a list of annotations of the form: (begin, end, value). When the tier contains reference annotations, those are returned instead; see get_ref_annotation_data_for_tier() for the format.
Parameters: id_tier (str) – Name of the tier.
Raises KeyError: If the tier is non existent.
- get_child_tiers_for(id_tier)¶
Give all child tiers for a tier.
Parameters: id_tier (str) – Name of the tier.
Returns: List of all children.
Raises KeyError: If the tier is non existent.
- get_controlled_vocabulary_names()¶
Gives all the controlled vocabulary names.
- get_cv_descriptions(cv_id)¶
Gives all the controlled vocabulary descriptions.
Parameters: cv_id (str) – Name of the controlled vocabulary.
Throws KeyError: If there is no controlled vocabulary with that id.
- get_cv_entries(cv_id)¶
Gives all the controlled vocabulary entry names.
Parameters: cv_id (str) – Name of the controlled vocabulary.
Throws KeyError: If there is no controlled vocabulary with that id.
- get_external_ref(eid)¶
Give the external reference matching the id.
Parameters: eid (str) – Name of the external reference.
Throws KeyError: If there is no external reference with that id.
- get_external_ref_names()¶
Gives all the external reference names.
- get_full_time_interval()¶
Give the full time interval of the file. Note that the real interval can be longer because the sound file attached can be longer.
Returns: Tuple of the form: (min_time, max_time).
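Conceptually the interval is the smallest and largest time stored in the timeslots. The sketch below is a hypothetical stand-alone helper, not pympi's implementation; returning (0, 0) for a file without timed timeslots and skipping timeslots whose time is None are both assumptions.

```python
def full_time_interval(timeslots):
    """Return (min_time, max_time) over a {id -> time(ms)} mapping."""
    times = [ms for ms in timeslots.values() if ms is not None]
    if not times:
        return (0, 0)  # assumed fallback for a file without timed slots
    return (min(times), max(times))
```

Because only timeslots are inspected, linked media that runs past the last annotation does not extend the result.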
- get_gaps_and_overlaps(tier1, tier2, maxlen=-1)¶
Give gaps and overlaps. The return types are shown in the table below. The string will be of the format: id_tiername_tiername.
Note
There is also a faster method: get_gaps_and_overlaps2()
For example when a gap occurs between tier1 and tier2 and they are called speakerA and speakerB the annotation value of that gap will be G12_speakerA_speakerB.
The gaps and overlaps are calculated using Heldner and Edlund’s method found in:
Heldner, M., & Edlund, J. (2010). Pauses, gaps and overlaps in conversations. Journal of Phonetics, 38(4), 555–568. doi:10.1016/j.wocn.2010.08.002
id   Description
O12  Overlap from tier1 to tier2
O21  Overlap from tier2 to tier1
G12  Between speaker gap from tier1 to tier2
G21  Between speaker gap from tier2 to tier1
W12  Within speaker overlap from tier2 in tier1
W21  Within speaker overlap from tier1 in tier2
P1   Pause for tier1
P2   Pause for tier2
Parameters: - tier1 (str) – Name of the first tier.
- tier2 (str) – Name of the second tier.
- maxlen (int) – Maximum length of gaps (skip longer ones), if -1 no maximum will be used.
Yields: Tuples of the form: (start, end, type).
Raises: - KeyError – If a tier is non existent.
- IndexError – If no annotations are available in the tiers.
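The label format and the maxlen parameter can be sketched as follows. This does not implement Heldner and Edlund's method: it takes already classified (start, end, id) events as input, the helper name is hypothetical, and applying maxlen to every event rather than to gaps only is an assumption.

```python
def label_events(events, tier1, tier2, maxlen=-1):
    """Yield (start, end, type) with type formatted as id_tiername_tiername."""
    for start, end, type_id in events:
        # -1 disables the length limit; otherwise skip longer events.
        if maxlen != -1 and end - start > maxlen:
            continue
        yield (start, end, f"{type_id}_{tier1}_{tier2}")


events = [(0, 50, "G12"), (100, 900, "P1"), (950, 1000, "O21")]
for event in label_events(events, "speakerA", "speakerB", maxlen=100):
    print(event)
```

With maxlen=100 the 800 ms pause is skipped and the two remaining events are labelled G12_speakerA_speakerB and O21_speakerA_speakerB.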
- get_gaps_and_overlaps2(tier1, tier2, maxlen=-1)¶
Faster variant of get_gaps_and_overlaps(). Faster in this case means almost 100 times faster.
Parameters: - tier1 (str) – Name of the first tier.
- tier2 (str) – Name of the second tier.
- maxlen (int) – Maximum length of gaps (skip longer ones), if -1 no maximum will be used.
Yields: Tuples of the form: (start, end, type).
Raises KeyError: If a tier is non existent.
- get_languages()¶
Gives all the languages in the format: {lang_id -> (lang_def, lang_label)}
- get_lexicon_ref(reid)¶
Gives the lexicon reference.
Parameters: reid (str) – Lexicon reference id.
Throws KeyError: If there is no lexicon reference matching the id.
- get_lexicon_ref_names()¶
Gives all the lexicon reference names.
- get_licenses()¶
Gives all the licenses in the format: [(name, url)]
- get_linguistic_type_names()¶
Give a list of available linguistic types.
Returns: List of linguistic type names.
- get_linked_files()¶
Give all linked files.
- get_locales()¶
Gives all the locales in the format: {language_code -> (country_code, variant)}
- get_parameters_for_linguistic_type(lingtype)¶
Give the parameter dictionary, this is usable in add_linguistic_type().
Parameters: lingtype (str) – Name of the linguistic type.
Raises KeyError: If the linguistic type doesn’t exist.
- get_parameters_for_tier(id_tier)¶
Give the parameter dictionary, this is useable in add_tier().
Parameters: id_tier (str) – Name of the tier.
Returns: Dictionary of parameters.
Raises KeyError: If the tier is non existent.
- get_properties()¶
Gives all the properties in the format: [(key, value)]
- get_ref_annotation_at_time(tier, time)¶
Give the ref annotations at the given time of the form [(start, end, value, refvalue)]
Parameters: - tier (str) – Name of the tier.
- time (int) – Time of the annotation of the parent.
Returns: List of annotations at that time.
Raises KeyError: If the tier is non existent.
- get_ref_annotation_data_between_times(id_tier, start, end)¶
Give the ref annotations between times of the form [(start, end, value, refvalue)]
Parameters: - id_tier (str) – Name of the tier.
- start (int) – Start time of the annotation of the parent.
- end (int) – End time of the annotation of the parent.
Returns: List of annotations between those times.
Raises KeyError: If the tier is non existent.
- get_ref_annotation_data_for_tier(id_tier)¶
Give a list of all reference annotations of the form: [(start, end, value, refvalue)]
Parameters: id_tier (str) – Name of the tier.
Returns: Reference annotations within that tier.
Raises KeyError: If the tier is non existent.
- get_secondary_linked_files()¶
Give all secondary linked files.
- get_tier_ids_for_linguistic_type(ling_type, parent=None)¶
Give a list of all tiers matching a linguistic type.
Parameters: - ling_type (str) – Name of the linguistic type.
- parent (str) – Only match tiers from this parent, when None this option will be ignored.
Returns: List of tier names.
Raises KeyError: If a tier or linguistic type is non existent.
- get_tier_names()¶
List all the tier names.
Returns: List of all tier names
- insert_annotation(id_tier, start, end, value='', svg_ref=None)¶
Deprecated since version 1.2.
Use add_annotation() instead.
- insert_ref_annotation(id_tier, tier2, time, value='', prev=None, svg=None)¶
Deprecated since version 1.2.
Use add_ref_annotation() instead.
- merge_tiers(tiers, tiernew=None, gapt=0, sep='_', safe=False)¶
Merge tiers into a new tier; when the gap between annotations is lower than the threshold, glue the annotations together.
Parameters: - tiers (list) – List of tier names.
- tiernew (str) – Name for the new tier, if None the name will be generated.
- gapt (int) – Threshold for the gaps, if this is set to 10 it means that all gaps below 10 are ignored.
- sep (str) – Separator for the merged annotations.
- safe (bool) – Ignore zero length annotations (when working with possibly malformed data).
Returns: Name of the created tier.
Raises KeyError: If a tier is non existent.
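The effect of gapt and sep can be sketched on plain (start, end, value) tuples. This is an illustration under stated assumptions, not pympi's implementation: the helper name is hypothetical, and an annotation is glued to the previous one when the gap between them is strictly below gapt (overlapping annotations have a negative gap, so they are always glued).

```python
def merge_annotations(annotation_lists, gapt=0, sep="_"):
    """Merge several tiers' (start, end, value) lists into one list."""
    events = sorted(ann for anns in annotation_lists for ann in anns)
    merged = []
    for start, end, value in events:
        if merged and start - merged[-1][1] < gapt:
            # Gap below the threshold: extend the previous annotation.
            previous = merged[-1]
            previous[1] = max(previous[1], end)
            previous[2] = previous[2] + sep + value
        else:
            merged.append([start, end, value])
    return [tuple(item) for item in merged]


tier_a = [(0, 100, "a"), (500, 600, "c")]
tier_b = [(105, 200, "b")]
print(merge_annotations([tier_a, tier_b], gapt=10))
# prints [(0, 200, 'a_b'), (500, 600, 'c')]
```

The 5 ms gap between "a" and "b" is below the threshold of 10, so the two are glued; the 300 ms gap before "c" is not.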
- remove_all_annotations_from_tier(id_tier, clean=True)¶
Remove all annotations from a tier.
Parameters: - id_tier (str) – Name of the tier.
- clean (bool) – Flag to clean the timeslots afterwards.
Raises KeyError: If the tier is non existent.
- remove_annotation(id_tier, time, clean=True)¶
Remove an annotation in a tier, if you need speed the best thing is to clean the timeslots after the last removal. When the tier contains reference annotations remove_ref_annotation() will be executed instead.
Parameters: - id_tier (str) – Name of the tier.
- time (int) – Timepoint within the annotation.
- clean (bool) – Flag to clean the timeslots afterwards.
Raises KeyError: If the tier is non existent.
Returns: Number of removed annotations.
- remove_controlled_vocabulary(cv_id)¶
Remove a controlled vocabulary.
Parameters: cv_id (str) – Name of the controlled vocabulary.
Throws KeyError: If there is no controlled vocabulary with that name.
- remove_cv_description(cv_id, lang_ref)¶
Remove a controlled vocabulary description.
Parameters: - cv_id (str) – Name of the controlled vocabulary.
- lang_ref (str) – Language reference of the description to remove.
Throws KeyError: If there is no controlled vocabulary with that name.
- remove_cv_entry(cv_id, cve_id)¶
Remove a controlled vocabulary entry.
Parameters: - cv_id (str) – Name of the controlled vocabulary.
- cve_id (str) – Name of the entry.
Throws KeyError: If there is no entry or controlled vocabulary with that name.
- remove_external_ref(eid)¶
Remove an external reference.
Parameters: eid (str) – Name of the external reference.
Throws KeyError: If there is no external reference with that id.
- remove_language(lang_id)¶
Remove the language matching the id.
Parameters: lang_id (str) – Language id of the language.
Throws KeyError: If there is no language matching the language id.
- remove_lexicon_ref(reid)¶
Remove a lexicon reference matching the id.
Parameters: reid (str) – Lexicon reference id.
Throws KeyError: If there is no lexicon reference matching the id.
- remove_license(name=None, url=None)¶
Remove all licenses matching both name and url.
Parameters: - name (str) – Name of the license.
- url (str) – URL of the license.
- remove_linguistic_type(ling_type)¶
Remove a linguistic type.
Parameters: ling_type (str) – Name of the linguistic type.
Raises KeyError: When the linguistic type doesn’t exist.
- remove_linked_files(file_path=None, relpath=None, mimetype=None, time_origin=None, ex_from=None)¶
Remove all linked files that match all the criteria; criteria that are None are ignored.
Parameters: - file_path (str) – Path of the file.
- relpath (str) – Relative filepath.
- mimetype (str) – Mimetype of the file.
- time_origin (int) – Time origin.
- ex_from (str) – Extracted from.
- remove_locale(language_code)¶
Remove the locale matching the language code.
Parameters: language_code (str) – Language code of the locale.
Throws KeyError: If there is no locale matching the language code.
- remove_property(key=None, value=None)¶
Remove all properties matching both key and value.
Parameters: - key (str) – Key of the property.
- value (str) – Value of the property.
- remove_ref_annotation(id_tier, time)¶
Remove a reference annotation.
Parameters: - id_tier (str) – Name of tier.
- time (int) – Time of the referenced annotation.
Raises KeyError: If the tier is non existent.
Returns: Number of removed annotations.
- remove_secondary_linked_files(file_path=None, relpath=None, mimetype=None, time_origin=None, assoc_with=None)¶
Remove all secondary linked files that match all the criteria; criteria that are None are ignored.
Parameters: - file_path (str) – Path of the file.
- relpath (str) – Relative filepath.
- mimetype (str) – Mimetype of the file.
- time_origin (int) – Time origin.
- assoc_with (str) – Associated with field.
- remove_tier(id_tier, clean=True)¶
Remove a tier.
Parameters: - id_tier (str) – Name of the tier.
- clean (bool) – Flag to also clean the timeslots.
Raises KeyError: If tier is non existent.
- remove_tiers(tiers)¶
Remove multiple tiers. Note that this is a lot faster than removing them individually because of the delayed cleaning of timeslots.
Parameters: tiers (list) – Names of the tiers to remove.
Raises KeyError: If a tier is non existent.
- rename_tier(id_from, id_to)¶
Rename a tier. Note that this also updates the child tiers that have the tier as a parent.
Parameters: - id_from (str) – Original name of the tier.
- id_to (str) – Target name of the tier.
Throws KeyError: If the tier doesn’t exist.
- shift_annotations(time)¶
Shift all annotations in time. When a left shift is applied, annotations at the beginning can be squashed or discarded.
Parameters: time (int) – Time shift width, negative numbers make a left shift.
Returns: Tuple of a list of squashed annotations and a list of removed annotations in the format: (tiername, start, end, value).
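Squashing and discarding on a left shift can be sketched on plain tuples. This is a hypothetical illustration, not pympi's implementation: the helper name is invented, an annotation is assumed to be discarded when it would end at or before 0, squashed when only its start would fall below 0, and the original (unshifted) tuples are assumed to be what gets reported.

```python
def shift(annotations, time):
    """Shift (tiername, start, end, value) tuples by time milliseconds."""
    shifted, squashed, removed = [], [], []
    for tier, start, end, value in annotations:
        new_start, new_end = start + time, end + time
        if time < 0 and new_end <= 0:
            # Entirely before the new origin: discard.
            removed.append((tier, start, end, value))
            continue
        if new_start < 0:
            # Partly before the new origin: clip the start to 0.
            squashed.append((tier, start, end, value))
            new_start = 0
        shifted.append((tier, new_start, new_end, value))
    return shifted, squashed, removed
```

For a shift of -200, an annotation spanning 0 to 100 is discarded, one spanning 150 to 400 is squashed to 0 to 200, and one spanning 500 to 700 simply moves to 300 to 500.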
- to_file(file_path, pretty=True)¶
Write the object to a file, if the file already exists a backup will be created with the .bak suffix.
Parameters: - file_path (str) – Filepath to write to.
- pretty (bool) – Flag for pretty XML printing. Only unset this if you are afraid of wasting bytes, since the output will then contain no unnecessary whitespace.
- to_textgrid(filtin=[], filtex=[], regex=False)¶
Convert the object to a pympi.Praat.TextGrid object.
Parameters: - filtin (list) – Include only tiers in this list, if empty all tiers are included.
- filtex (list) – Exclude all tiers in this list.
- regex (bool) – If this flag is set the filters are seen as regexes.
Returns: pympi.Praat.TextGrid representation.
Raises ImportError: If the pympi.Praat module can’t be loaded.
- pympi.Elan.eaf_from_chat(file_path, codec='ascii', extension='wav')¶
Reads a .cha file and converts it to an elan object. The function tries to mimic the CHAT2ELAN program that comes with the CLAN package as closely as possible. This function however converts to the latest ELAN file format since the library is designed for it. All CHAT headers will be added as Properties in the object and the headers that have a similar field in an Eaf file will be added there too. The file description of chat files can be found here.
Parameters: - file_path (str) – The file path of the .cha file.
- codec (str) – The codec, if the @UTF8 header is present it will choose utf-8, default is ascii. Older CHAT files don’t have their encoding embedded in a header so you will probably need to choose some obscure ISO charset then.
- extension (str) – The extension of the media file.
Throws StopIteration: If the file doesn’t contain an @End header, thus inferring the file is broken.
- pympi.Elan.indent(el, level=0)¶
Function to pretty print the xml, meaning adding tabs and newlines.
Parameters: - el (ElementTree.Element) – Current element.
- level (int) – Current level.
- pympi.Elan.parse_eaf(file_path, eaf_obj)¶
Parse an EAF file.
Parameters: - file_path (str) – Path to read from, - for stdin.
- eaf_obj (pympi.Elan.Eaf) – Existing EAF object to put the data in.
Returns: EAF object.
- pympi.Elan.to_eaf(file_path, eaf_obj, pretty=True)¶
Write an Eaf object to file.
Parameters: - file_path (str) – Filepath to write to, - for stdout.
- eaf_obj (pympi.Elan.Eaf) – Object to write.
- pretty (bool) – Flag to set pretty printing.