Elan

class pympi.Elan.Eaf(file_path=None, author='pympi')

Read and write Elan’s Eaf files.

Note

All times are in milliseconds and can’t have decimals.
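Since times are whole milliseconds, values measured in seconds need converting before they are passed to methods such as insert_annotation(). A minimal sketch of that conversion; the helper name to_ms is hypothetical and not part of pympi:

```python
def to_ms(seconds):
    """Convert a time in seconds to whole milliseconds (hypothetical helper)."""
    # Round first so that 2345.6 ms becomes 2346 rather than being truncated.
    return int(round(seconds * 1000))


print(to_ms(1.5))
print(to_ms(2.3456))
```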

Variables:
  • annotation_document (dict) – Annotation document TAG entries.
  • licences (dict) – Licences included in the file.
  • header (dict) – XML header.
  • media_descriptors (list) – Linked files, where every file is of the form: {attrib}.
  • properties (list) – Properties, where every property is of the form: (value, {attrib}).
  • linked_file_descriptors (list) – Secondary linked files, where every linked file is of the form: {attrib}.
  • timeslots (dict) – Timeslot data of the form: {id -> time(ms)}.
  • tiers (dict) –

    Tiers, where every tier is of the form: {tier_name -> (aligned_annotations, reference_annotations, attributes, ordinal)},

    aligned_annotations of the form: [{id -> (begin_ts, end_ts, value, svg_ref)}],

    reference_annotations of the form: [{id -> (reference, value, previous, svg_ref)}].

  • linguistic_types (list) – Linguistic types, where every type is of the form: {id -> attrib}.
  • locales (list) – Locales, every locale is of the form: {attrib}.
  • constraints (dict) – Constraints, every constraint is of the form: {stereotype -> description}.
  • controlled_vocabularies (dict) –

    Controlled vocabulary, where every controlled vocabulary is of the form: {id -> (descriptions, entries, ext_ref)},

    descriptions of the form: [(lang_ref, text)],

    entries of the form: {id -> (values, ext_ref)},

    values of the form: [(lang_ref, description, text)].

  • external_refs (list) – External references, where every reference is of the form [id, type, value].
  • lexicon_refs (list) – Lexicon references, where every reference is of the form: [{attribs}].
  • annotations (dict) – Dictionary of annotations of the form: {id -> tier}, this is only used internally.
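To make the nesting of timeslots and tiers concrete, here is a sketch that uses plain dictionaries shaped like the forms listed above, without importing pympi. The sample values, the attribute keys and the helper aligned_data are made up for illustration, and aligned_annotations is assumed to be a dictionary keyed by annotation id:

```python
# Plain dictionaries shaped like the documented attributes (values are made up).
timeslots = {'ts1': 0, 'ts2': 1200, 'ts3': 1500, 'ts4': 2700}
tiers = {
    'speakerA': (
        {'a1': ('ts1', 'ts2', 'hello', None),
         'a2': ('ts3', 'ts4', 'world', None)},  # aligned_annotations
        {},                                     # reference_annotations
        {'TIER_ID': 'speakerA'},                # attributes (illustrative key)
        0,                                      # ordinal
    ),
}


def aligned_data(tiers, timeslots, tier_name):
    """Resolve timeslot ids to times (hypothetical helper, not pympi's code)."""
    aligned = tiers[tier_name][0]  # raises KeyError for an unknown tier
    return sorted((timeslots[begin], timeslots[end], value)
                  for begin, end, value, _svg in aligned.values())


print(aligned_data(tiers, timeslots, 'speakerA'))
```

Annotations store timeslot ids rather than times, so every lookup of a begin or end time goes through the timeslots dictionary.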
__init__(file_path=None, author='pympi')

Construct either a new Eaf file or read one from a file/stream.

Parameters:
  • file_path (str) – Path to read from, - for stdin. If None an empty Eaf file will be created.
  • author (str) – Author of the file.
add_linguistic_type(lingtype, constraints=None, timealignable=True, graphicreferences=False, extref=None, param_dict=None)

Add a linguistic type.

Parameters:
  • lingtype (str) – Name of the linguistic type.
  • constraints (list) – Constraint names.
  • timealignable (bool) – Flag for time alignable.
  • graphicreferences (bool) – Flag for graphic references.
  • extref (str) – External reference.
  • param_dict (dict) – TAG attributes, when this is not None it will ignore all other options. Please only use dictionaries returned by get_parameters_for_linguistic_type().
Raises KeyError:
 

If a constraint is not defined

add_linked_file(file_path, relpath=None, mimetype=None, time_origin=None, ex_from=None)

Add a linked file.

Parameters:
  • file_path (str) – Path of the file.
  • relpath (str) – Relative path of the file.
  • mimetype (str) – Mimetype of the file. If None, it is guessed from the file extension, which currently only works for wav, mpg, mpeg and xml.
  • time_origin (int) – Time origin for the media file.
  • ex_from (str) – Extracted from field.
Raises KeyError:
 

If the mimetype had to be guessed and the file has a non-standard extension or an unknown mimetype.
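The guessing behaviour can be pictured as a lookup in a small extension table. The sketch below is not pympi's code: the table name MIMES, the helper guess_mimetype and the exact mimetype strings are assumptions chosen for illustration; only the four supported extensions and the KeyError come from the description above.

```python
import os

# Illustrative extension table; the exact mimetype strings are an assumption.
MIMES = {
    'wav': 'audio/x-wav',
    'mpg': 'video/mpeg',
    'mpeg': 'video/mpeg',
    'xml': 'text/xml',
}


def guess_mimetype(file_path):
    """Look up a mimetype by file extension (hypothetical helper)."""
    ext = os.path.splitext(file_path)[1].lstrip('.').lower()
    return MIMES[ext]  # KeyError for any extension outside the table


print(guess_mimetype('recordings/session1.wav'))
```

Passing an explicit mimetype avoids the lookup, and therefore the KeyError, for files such as mp4 that are outside the table.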

add_secondary_linked_file(file_path, relpath=None, mimetype=None, time_origin=None, assoc_with=None)

Add a secondary linked file.

Parameters:
  • file_path (str) – Path of the file.
  • relpath (str) – Relative path of the file.
  • mimetype (str) – Mimetype of the file. If None, it is guessed from the file extension, which currently only works for wav, mpg, mpeg and xml.
  • time_origin (int) – Time origin for the media file.
  • assoc_with (str) – Associated with field.
Raises KeyError:
 

If the mimetype had to be guessed and the file has a non-standard extension or an unknown mimetype.

add_tier(tier_id, ling='default-lt', parent=None, locale=None, part=None, ann=None, tier_dict=None)

Add a tier. When no linguistic type is given and the default linguistic type is unavailable then the assigned linguistic type will be the first in the list.

Parameters:
  • tier_id (str) – Name of the tier.
  • ling (str) – Linguistic type, if the type is not available it will warn and pick the first available type.
  • parent (str) – Parent tier name.
  • locale (str) – Locale.
  • part (str) – Participant.
  • ann (str) – Annotator.
  • tier_dict (dict) – TAG attributes, when this is not None it will ignore all other options. Please only use dictionaries returned by get_parameters_for_tier().
Raises ValueError:
 

If the tier_id is empty

child_tiers_for(id_tier)

Give all child tiers for a tier.

Parameters:id_tier (str) – Name of the tier.
Returns:List of all children
Raises KeyError:
 If the tier is non existent.
clean_time_slots()

Clean up all unused timeslots.

Warning

This can and will take time for larger tiers. When you want to do a lot of operations on a lot of tiers please unset the flags for cleaning in the functions so that the cleaning is only performed afterwards.
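Why cleaning is slow becomes clearer with a sketch of the idea: every annotation in every tier has to be visited to learn which timeslots are still referenced. The code below works on plain dictionaries shaped like the documented timeslots and tiers attributes; the function names are hypothetical and this is not pympi's implementation.

```python
def unused_timeslots(timeslots, tiers):
    """Return the ids of timeslots no aligned annotation refers to."""
    used = set()
    for aligned, _refs, _attrs, _ordinal in tiers.values():
        for begin_ts, end_ts, _value, _svg in aligned.values():
            used.update((begin_ts, end_ts))
    return set(timeslots) - used


def clean(timeslots, tiers):
    """Delete unreferenced timeslots in place (hypothetical helper)."""
    for ts_id in unused_timeslots(timeslots, tiers):
        del timeslots[ts_id]


slots = {'ts1': 0, 'ts2': 500, 'ts3': 900}
example_tiers = {'t': ({'a1': ('ts1', 'ts2', 'hi', None)}, {}, {}, 0)}
clean(slots, example_tiers)
print(slots)
```

Running one such pass after a batch of removals costs a single scan, whereas cleaning after every removal repeats the scan each time.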

copy_tier(eaf_obj, tier_name)

Copies a tier to another pympi.Elan.Eaf object.

Parameters:
  • eaf_obj (pympi.Elan.Eaf) – Target Eaf object.
  • tier_name (str) – Name of the tier.
Raises KeyError:
 

If the tier doesn’t exist.

create_controlled_vocabulary(cv_id, descriptions, entries, ext_ref=None)

Create a controlled vocabulary.

Warning

This is a very raw implementation and you should check the Eaf file format specification for the entries.

Parameters:
  • cv_id (str) – Name of the controlled vocabulary.
  • descriptions (list) – List of descriptions.
  • entries (dict) – Entries dictionary.
  • ext_ref (str) – External reference.
create_gaps_and_overlaps_tier(tier1, tier2, tier_name=None, maxlen=-1, fast=False)

Create a tier with the gaps and overlaps of the annotations. For types see get_gaps_and_overlaps()

Parameters:
  • tier1 (str) – Name of the first tier.
  • tier2 (str) – Name of the second tier.
  • tier_name (str) – Name of the new tier, if None the name will be generated.
  • maxlen (int) – Maximum length of gaps (skip longer ones), if -1 no maximum will be used.
  • fast (bool) – Flag for using the fast method.
Returns:

List of gaps and overlaps of the form: [(type, start, end)].

Raises:
  • KeyError – If a tier is non existent.
  • IndexError – If no annotations are available in the tiers.
extract(start, end)

Extracts the selected time frame as a new object.

Parameters:
  • start (int) – Start time.
  • end (int) – End time.
Returns:

pympi.Elan.Eaf object containing the extracted frame.

filter_annotations(tier, tier_name=None, filtin=None, filtex=None, regex=False, safe=False)

Filter annotations in a tier using an exclusive and/or inclusive filter.

Parameters:
  • tier (str) – Name of the tier.
  • tier_name (str) – Name of the output tier, when None the name will be generated.
  • filtin (list) – List of strings to be included, if None all annotations are included.
  • filtex (list) – List of strings to be excluded, if None no strings are excluded.
  • regex (bool) – If this flag is set, the filters are seen as regex matches.
  • safe (bool) – Ignore zero length annotations (when working with possibly malformed data).
Raises KeyError:
 

If the tier is non existent.
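The interaction of the inclusive filter, the exclusive filter and the regex flag can be sketched on a plain list of (begin, end, value) tuples. The helper names keep and filter_values are hypothetical, and the use of re.match (anchored at the start of the value) for the regex case is an assumption rather than documented behaviour:

```python
import re


def keep(value, filtin=None, filtex=None, regex=False):
    """Decide whether one annotation value passes both filters."""
    if regex:
        def matches(pattern):
            return re.match(pattern, value) is not None
    else:
        def matches(pattern):
            return pattern == value
    if filtin and not any(matches(p) for p in filtin):
        return False  # an inclusive filter is set and nothing matched
    if filtex and any(matches(p) for p in filtex):
        return False  # explicitly excluded
    return True


def filter_values(annotations, filtin=None, filtex=None, regex=False, safe=False):
    """Filter (begin, end, value) tuples; safe drops zero length annotations."""
    return [(b, e, v) for b, e, v in annotations
            if keep(v, filtin, filtex, regex) and not (safe and b == e)]


data = [(0, 10, 'yes'), (10, 20, 'no'), (20, 20, 'yes'), (30, 40, 'yeah')]
print(filter_values(data, filtin=['ye.*'], regex=True, safe=True))
```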

generate_annotation_id()

Generate the next annotation id, this function is mainly used internally.

generate_ts_id(time=None)

Generate the next timeslot id, this function is mainly used internally.

Parameters:time (int) – Initial time to assign to the timeslot.
Raises ValueError:
 If the time is negative.
get_annotation_data_at_time(id_tier, time)

Give the annotations at the given time.

Parameters:
  • id_tier (str) – Name of the tier.
  • time (int) – Time of the annotation.
Returns:

List of annotations at that time.

Raises KeyError:
 

If the tier is non existent.

get_annotation_data_between_times(id_tier, start, end)

Gives the annotations within the times.

Parameters:
  • id_tier (str) – Name of the tier.
  • start (int) – Start time of the annotation.
  • end (int) – End time of the annotation.
Returns:

List of annotations within that time.

Raises KeyError:
 

If the tier is non existent.
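The two time-based lookups can be sketched as simple interval tests over (begin, end, value) tuples. Whether the boundaries are inclusive, and whether "within the times" means overlapping the window rather than lying fully inside it, is not stated above; the sketch assumes inclusive boundaries and overlap, and the function names are hypothetical.

```python
def at_time(annotations, time):
    """Annotations whose interval contains the given time (inclusive)."""
    return [a for a in annotations if a[0] <= time <= a[1]]


def between_times(annotations, start, end):
    """Annotations that overlap the window [start, end] (assumed semantics)."""
    return [a for a in annotations if a[1] >= start and a[0] <= end]


data = [(0, 1000, 'a'), (1000, 2000, 'b'), (3000, 4000, 'c')]
print(at_time(data, 1500))
print(between_times(data, 500, 1500))
```

With inclusive boundaries a time that falls exactly on a shared border, such as 1000 in this data, returns both neighbouring annotations, which is why the result is a list.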

get_annotation_data_for_tier(id_tier)

Gives a list of annotations of the form: (begin, end, value)

Parameters:id_tier (str) – Name of the tier.
Raises KeyError:
 If the tier is non existent.
get_full_time_interval()

Give the full time interval of the file.

Returns:Tuple of the form: (min_time, max_time).
get_gaps_and_overlaps(tier1, tier2, maxlen=-1)

Give gaps and overlaps. The return types are shown in the table below. The string will be of the format: id_tiername_tiername.

Note

There is also a faster method: get_gaps_and_overlaps2()

For example when a gap occurs between tier1 and tier2 and they are called speakerA and speakerB the annotation value of that gap will be G12_speakerA_speakerB.

The gaps and overlaps are calculated using Heldner and Edlund’s method found in:
Heldner, M., & Edlund, J. (2010). Pauses, gaps and overlaps in conversations. Journal of Phonetics, 38(4), 555–568. doi:10.1016/j.wocn.2010.08.002
id   Description
O12  Overlap from tier1 to tier2
O21  Overlap from tier2 to tier1
G12  Between speaker gap from tier1 to tier2
G21  Between speaker gap from tier2 to tier1
W12  Within speaker overlap from tier2 in tier1
W21  Within speaker overlap from tier1 in tier2
P1   Pause for tier1
P2   Pause for tier2
Parameters:
  • tier1 (str) – Name of the first tier.
  • tier2 (str) – Name of the second tier.
  • maxlen (int) – Maximum length of gaps (skip longer ones), if -1 no maximum will be used.
Yields:

Tuples of the form (start, end, type).

Raises:
  • KeyError – If a tier is non existent.
  • IndexError – If no annotations are available in the tiers.
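To illustrate the G and P rows of the table, here is a heavily simplified sketch that only classifies silent stretches: a silence followed by the same speaker counts as a pause, and a silence followed by the other speaker counts as a gap. It ignores overlaps entirely, is not pympi's implementation of the cited method, and the function name silences is hypothetical. Labels follow the id_tiername_tiername format described above.

```python
def silences(tier1, tier2, name1, name2, maxlen=-1):
    """Classify silent stretches between two lists of (start, end) intervals."""
    # Tag every interval with its speaker number and order them by start time.
    events = sorted([(s, e, 1) for s, e in tier1] +
                    [(s, e, 2) for s, e in tier2])
    result = []
    # Like the documented method, an IndexError is raised without annotations.
    cur_end, cur_spk = events[0][1], events[0][2]
    for start, end, spk in events[1:]:
        if start > cur_end:  # nobody speaks between cur_end and start
            if maxlen == -1 or start - cur_end <= maxlen:
                if spk == cur_spk:
                    kind = 'P%d' % spk
                else:
                    kind = 'G%d%d' % (cur_spk, spk)
                label = '%s_%s_%s' % (kind, name1, name2)
                result.append((cur_end, start, label))
        if end > cur_end:  # track whoever stops speaking last
            cur_end, cur_spk = end, spk
    return result


a = [(0, 1000), (1500, 2000)]
b = [(2300, 3000)]
print(silences(a, b, 'speakerA', 'speakerB'))
```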
get_gaps_and_overlaps2(tier1, tier2, maxlen=-1)

Faster variant of get_gaps_and_overlaps().

Parameters:
  • tier1 (str) – Name of the first tier.
  • tier2 (str) – Name of the second tier.
  • maxlen (int) – Maximum length of gaps (skip longer ones), if -1 no maximum will be used.
Yields:

Tuples of the form (start, end, type).

Raises KeyError:
 

If a tier is non existent.

get_linguistic_type_names()

Give a list of available linguistic types.

Returns:List of linguistic type names.
get_linked_files()

Give all linked files.

get_parameters_for_linguistic_type(lingtype)

Give the parameter dictionary, this is usable in add_linguistic_type().

Parameters:lingtype (str) – Name of the linguistic type.
Raises KeyError:
 If the linguistic type doesn’t exist.
get_parameters_for_tier(id_tier)

Give the parameter dictionary, this is usable in add_tier().

Parameters:id_tier (str) – Name of the tier.
Returns:Dictionary of parameters.
Raises KeyError:
 If the tier is non existent.
get_ref_annotation_at_time(tier, time)

Give the ref annotations at the given time.

Parameters:
  • tier (str) – Name of the tier.
  • time (int) – Time of the annotation of the parent.
Returns:

List of annotations at that time.

Raises KeyError:
 

If the tier is non existent.

get_ref_annotation_data_for_tier(id_tier)

Give a list of all reference annotations of the form: [(start, end, value, refvalue)]

Parameters:id_tier (str) – Name of the tier.
Raises KeyError:
 If the tier is non existent.
Yields:Reference annotations within that tier.
get_secondary_linked_files()

Give all secondary linked files.

get_tier_ids_for_linguistic_type(ling_type, parent=None)

Give a list of all tiers matching a linguistic type.

Parameters:
  • ling_type (str) – Name of the linguistic type.
  • parent (str) – Only match tiers from this parent, when None this option will be ignored.
Returns:

List of tier names.

Raises KeyError:
 

If a tier or linguistic type is non existent.

get_tier_names()

List all the tier names.

Returns:List of all tier names
insert_annotation(id_tier, start, end, value='', svg_ref=None)

Insert an annotation.

Parameters:
  • id_tier (str) – Name of the tier.
  • start (int) – Start time of the annotation.
  • end (int) – End time of the annotation.
  • value (str) – Value of the annotation.
  • svg_ref (str) – Svg reference.
Raises:
  • KeyError – If the tier is non existent.
  • ValueError – If one of the values is negative or start is bigger than end or if the tier already contains ref annotations.
insert_ref_annotation(id_tier, tier2, time, value, prev=None, svg=None)

Insert a reference annotation.

Parameters:
  • id_tier (str) – Name of the tier.
  • tier2 (str) – Tier of the referenced annotation.
  • time (int) – Time of the referenced annotation.
  • value (str) – Value of the annotation.
  • prev (str) – Id of the previous annotation.
  • svg (str) – Svg reference.
Raises:
  • KeyError – If the tier is non existent.
  • ValueError – If the tier already contains normal annotations or if there is no annotation in the referenced tier at the given time.
merge_tiers(tiers, tiernew=None, gapt=0, sep='_', safe=False)

Merge tiers into a new tier and, when the gap is lower than the threshold, glue the annotations together.

Parameters:
  • tiers (list) – List of tier names.
  • tiernew (str) – Name for the new tier, if None the name will be generated.
  • gapt (int) – Threshold for the gaps; if this is set to 10, all gaps below 10 are ignored.
  • sep (str) – Separator for the merged annotations.
  • safe (bool) – Ignore zero length annotations (when working with possibly malformed data).
Raises KeyError:
 

If a tier is non existent.
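The gluing step can be sketched on a flat list of (start, end, value) tuples gathered from all tiers to merge. The function name merge is hypothetical and this is not pympi's code; the strict "below the threshold" comparison is taken from the gapt description, and joining the values with sep is an assumption about how glued annotations are labelled.

```python
def merge(annotations, gapt=0, sep='_'):
    """Glue annotations whose separating gap is below gapt milliseconds."""
    merged = []
    for start, end, value in sorted(annotations):
        if merged and start - merged[-1][1] < gapt:
            prev_start, prev_end, prev_value = merged[-1]
            # Extend the previous annotation and join the two values.
            merged[-1] = (prev_start, max(prev_end, end),
                          prev_value + sep + value)
        else:
            merged.append((start, end, value))
    return merged


data = [(0, 100, 'a'), (105, 200, 'b'), (500, 600, 'c')]
print(merge(data, gapt=10))
```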

remove_all_annotations_from_tier(id_tier, clean=True)

Remove all annotations from a tier.

Parameters:
  • id_tier (str) – Name of the tier.
  • clean (bool) – Flag to clean the timeslots afterwards.
Raises KeyError:
 If the tier is non existent.
remove_annotation(id_tier, time, clean=True)

Remove an annotation in a tier. If you need speed, the best approach is to clean the timeslots only once, after the last removal.

Parameters:
  • id_tier (str) – Name of the tier.
  • time (int) – Timepoint within the annotation.
  • clean (bool) – Flag to clean the timeslots afterwards.
Raises KeyError:
 

If the tier is non existent.

Returns:

Number of removed annotations.

remove_controlled_vocabulary(cv)

Remove a controlled vocabulary.

Parameters:cv (str) – Controlled vocabulary id.
Raises KeyError:
 If the controlled vocabulary is non existent.
remove_linguistic_type(ling_type)

Remove a linguistic type.

Parameters:ling_type (str) – Name of the linguistic type.
Raises KeyError:
 When the linguistic type doesn’t exist.
remove_linked_files(file_path=None, relpath=None, mimetype=None, time_origin=None, ex_from=None)

Remove all linked files that match all the criteria; criteria that are None are ignored.

Parameters:
  • file_path (str) – Path of the file.
  • relpath (str) – Relative filepath.
  • mimetype (str) – Mimetype of the file.
  • time_origin (int) – Time origin.
  • ex_from (str) – Extracted from.
remove_secondary_linked_files(file_path=None, relpath=None, mimetype=None, time_origin=None, assoc_with=None)

Remove all secondary linked files that match all the criteria; criteria that are None are ignored.

Parameters:
  • file_path (str) – Path of the file.
  • relpath (str) – Relative filepath.
  • mimetype (str) – Mimetype of the file.
  • time_origin (int) – Time origin.
  • assoc_with (str) – Associated with field.
remove_tier(id_tier, clean=True)

Remove a tier.

Parameters:
  • id_tier (str) – Name of the tier.
  • clean (bool) – Flag to also clean the timeslots.
Raises KeyError:
 

If tier is non existent.

remove_tiers(tiers)

Remove multiple tiers. Note that this is a lot faster than removing them individually because of the delayed cleaning of timeslots.

Parameters:tiers (list) – Names of the tiers to remove.
Raises KeyError:
 If a tier is non existent.
shift_annotations(time)

Shift all annotations in time. When a left shift is applied, annotations at the beginning can be squashed or discarded.

Parameters:time (int) – Time shift width, negative numbers make a left shift.
Returns:Tuple of a list of squashed annotations and a list of removed annotations in the format: (tiername, start, end, value).
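What squashing and discarding mean for a left shift can be sketched on plain (tiername, start, end, value) tuples. The function name shift and the exact boundary rules are assumptions: here an annotation that would end at or before 0 is removed, and one that would start before 0 is squashed so that it starts at 0.

```python
def shift(annotations, time):
    """Shift annotations; return (kept, squashed, removed) lists."""
    kept, squashed, removed = [], [], []
    for tier, start, end, value in annotations:
        if end + time <= 0:
            # Nothing of the annotation would remain after the shift.
            removed.append((tier, start, end, value))
        elif start + time < 0:
            # Part of it survives: report the original, keep a clipped copy.
            squashed.append((tier, start, end, value))
            kept.append((tier, 0, end + time, value))
        else:
            kept.append((tier, start + time, end + time, value))
    return kept, squashed, removed


data = [('t', 0, 100, 'a'), ('t', 150, 400, 'b'), ('t', 500, 600, 'c')]
kept, squashed, removed = shift(data, -200)
print(kept)
```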
to_file(file_path, pretty=True)

Write the object to a file. If the file already exists, a backup will be created with the .bak suffix.

Parameters:
  • file_path (str) – Filepath to write to.
  • pretty (bool) – Flag for pretty XML printing (only unset this if you are afraid of wasting bytes, because it then won’t print unnecessary whitespace).
to_textgrid(filtin=[], filtex=[], regex=False)

Convert the object to a pympi.Praat.TextGrid object.

Parameters:
  • filtin (list) – Include only tiers in this list, if empty all tiers are included.
  • filtex (list) – Exclude all tiers in this list.
  • regex (bool) – If this flag is set the filters are seen as regexes.
Returns:

pympi.Praat.TextGrid representation.

Raises ImportError:
 

If the pympi.Praat module can’t be loaded.

pympi.Elan.indent(el, level=0)

Function to pretty print the xml, meaning adding tabs and newlines.

Parameters:
  • el (ElementTree.Element) – Current element.
  • level (int) – Current level.
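A common way to write such a function is to set the text and tail whitespace of each element recursively. The sketch below uses only the standard library and the same (el, level) signature; it illustrates the technique and may differ from what pympi.Elan.indent does internally.

```python
import xml.etree.ElementTree as ET


def indent(el, level=0):
    """Add newlines and tabs in place so the serialized XML is readable."""
    pad = '\n' + level * '\t'
    if len(el):
        if not el.text or not el.text.strip():
            el.text = pad + '\t'  # whitespace before the first child
        if not el.tail or not el.tail.strip():
            el.tail = pad
        for child in el:
            indent(child, level + 1)
        # The last child is followed by the parent's closing tag, so its
        # tail must drop back to the parent's indentation level.
        if not child.tail or not child.tail.strip():
            child.tail = pad
    elif level and (not el.tail or not el.tail.strip()):
        el.tail = pad


root = ET.fromstring('<a><b/><c/></a>')
indent(root)
print(ET.tostring(root, encoding='unicode'))
```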
pympi.Elan.parse_eaf(file_path, eaf_obj)

Parse an EAF file.

Parameters:
  • file_path (str) – Path to read from, - for stdin.
  • eaf_obj (pympi.Elan.Eaf) – Existing EAF object to put the data in.
Returns:

EAF object.

pympi.Elan.to_eaf(file_path, eaf_obj, pretty=True)

Write an Eaf object to file.

Parameters:
  • file_path (str) – Filepath to write to, - for stdout.
  • eaf_obj (pympi.Elan.Eaf) – Object to write.
  • pretty (bool) – Flag to set pretty printing.