wivu2laf 1.0.1

Submodules

wivu2laf.wivu2laf module

wivu2laf.config module

wivu2laf.laf module

class Laf(cfg, wv, val)[source]

Knows the LAF data format.

All LAF knowledge is stored in template files together with sections in the main configuration file. The LAF class finds those templates, sets up the result files, and fills them.

Note:

Templates

template[key] = text
where key is an entry in the laf_templates section of the main config file.
Note:

Files and Filetypes

annotation_files[part][subpart] = (ftype, medium, location, requires, annotations, is_region)

The order is important, so we generate a list too:

file_order
list of ftypes according to the file_types section in the main config file, expanded, in the order encountered

where

ftype

comes from the file_types section in the main config file. It has the shape of a LAF file identifier, but with wildcards.

f.xxxxxx
not an annotation file, but primary data or a header file
f_part.subpart
annotation file for part, subpart
for each ftype

there is an infostring consisting of fields

location
file name of corresponding file, modulo a common prefix
medium
file type (text or xml)
annotations
space separated annotation labels occurring in this part, subpart
requires
space separated list of ftypes of required files
is_region
reveals whether the file only contains regions or not. A pure region file needs a different template.
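
As a rough illustration of the two structures described above, the following sketch fills annotation_files and file_order from a few hand-written entries. The ftype names, locations and labels are invented for the example, and the FileInfo named tuple is an assumption made for readability; the documentation only says that a tuple with these six fields is stored.

```python
from collections import defaultdict, namedtuple

# Hypothetical record type; the field order follows the tuple described above.
FileInfo = namedtuple(
    "FileInfo", "ftype medium location requires annotations is_region"
)


def build_file_tables(entries):
    """Return (annotation_files, file_order) from (part, subpart, FileInfo) triples."""
    annotation_files = defaultdict(dict)
    file_order = []
    for part, subpart, info in entries:
        annotation_files[part][subpart] = info
        # Keep the ftypes in the order in which they were encountered.
        file_order.append(info.ftype)
    return annotation_files, file_order


# Invented sample data, not taken from the real configuration file.
entries = [
    ("monad", "", FileInfo("f_monad", "xml", "monad.xml", "f.primary", "", True)),
    ("monad", "word",
     FileInfo("f_monad.word", "xml", "monad.word.xml", "f_monad", "ft", False)),
]
annotation_files, file_order = build_file_tables(entries)
```

Keeping file_order next to the nested dictionary preserves the sequence of the configuration file without relying on dictionary ordering.
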
Note:

Header Generation

All header files are generated here:

  • the feature declaration file
  • the header for the resource as a whole
  • the header for the primary data file

The headers of the annotation files are included in those files. Those headers contain statistics: counts of the number of annotations with a given label. We know those numbers only after generation, because the statistics are collected during further processing.

When the annotation files are generated, we use placeholders for the statistics. In a post-generation stage we read/write the annotation files and replace the placeholders by the true numbers. The files are rewritten in situ, so we must take care that the placeholders have enough space around them to hold the final numbers.
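
The in-place replacement can be pictured with the sketch below. It is not the package's code: the marker text, the reserved width and the sample count are all assumptions. The point it shows is that a placeholder padded to a fixed width can be overwritten without changing the length of the file.

```python
import os
import tempfile

WIDTH = 12  # assumed width reserved for each count
PLACEHOLDER = b"@COUNT@".ljust(WIDTH)  # hypothetical marker, padded with spaces


def fill_placeholder(path, value):
    """Overwrite the first placeholder in the file with value, keeping the file length."""
    text = str(value).encode("ascii")
    if len(text) > WIDTH:
        raise ValueError("count does not fit in the reserved space")
    with open(path, "r+b") as handle:
        data = handle.read()
        offset = data.find(PLACEHOLDER)
        if offset < 0:
            return False
        handle.seek(offset)
        # Pad the number to the same width, so nothing after it moves.
        handle.write(text.ljust(WIDTH))
    return True


# Demonstration on a temporary file with an invented header line.
fd, demo_path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as handle:
    handle.write(b'<label name="ft" occurs="' + PLACEHOLDER + b'"/>\n')
size_before = os.path.getsize(demo_path)
filled = fill_placeholder(demo_path, 426555)  # invented count
size_after = os.path.getsize(demo_path)
with open(demo_path, "rb") as handle:
    result = handle.read()
os.remove(demo_path)
```
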

Note:

Processing

This class provides methods to initialize and finalize the generation of primary data files and annotation files. There are methods to open/close all files that are relevant to the part that is being processed. (Part being: ‘monad’, ‘section’, ‘lingo’).

Note:

Statistics

Counts are collected in a stats dictionary.

  • stats[statistic_name] = statistic_value
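
A minimal sketch of such a counter, assuming that missing statistics default to 0 (the documentation only shows that stats is a defaultdict with a lambda default). The helper count_labels and the label names are hypothetical.

```python
from collections import defaultdict

# Assumption: a statistic that has not been seen yet counts as 0.
stats = defaultdict(lambda: 0)


def count_labels(labels):
    """Increase the counter of every annotation label in labels."""
    for label in labels:
        stats[label] += 1


count_labels(["ft", "db", "ft"])  # invented label names
```
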

Initialization is:

  • setting up the list of annotation files.
  • reading and storing all templates
annotation_files = defaultdict(<function <lambda> at 0x39cae60>, {})
cfg = None
file_handles = {}
file_order = []
finish_annot(part)[source]

Closes all annotation files belonging to a part.

When needed, it fills in required statistics, such as the number of times an annotation label is used. Uses templates:

  • annotation_ftr
finish_primary()[source]

Closes the primary data file

gstats = defaultdict(<function <lambda> at 0x39caf50>, {})
makefeatureheader()[source]

Creates a feature declaration file for all features and their values.

Uses the templates:

feature_basic, feature, feature_val1, feature_val, feature_decl
makeheaders()[source]

Creates the headers that occupy separate files.

The resource header is the header file for the resource as a whole. The primary header is a header file for the primary data. The feature header is an xml document that contains feature declarations.

makeprimaryheader()[source]

Create the primary header.

Uses the templates:

  • annotation_item
  • primary_hdr
makeresourceheader()[source]

Creates the resource header

Uses the templates:

  • annotation_decl
  • resource_hdr
primary_handle = None
report()[source]

Report the general statistics

start_annot(part)[source]

Creates the annotation headers of the annotation files belonging to a part.

Opens a file for writing, dumps the header to it, and leaves the file open for further writing by other parts of the program.

Uses templates:

  • annotation_label
  • region_hdr
  • annotation_hdr
start_primary()[source]

Opens a file for the primary header and leaves it open for other parts of the program to write to

stats = defaultdict(<function <lambda> at 0x39caed8>, {})
template = {}
wv = None

wivu2laf.mylib module

class Timestamp[source]
elapsed()[source]
progress(msg)[source]
timestamp = None
camel(text)[source]

Render text in camelcase, removing spaces, and let the first word start with lower case
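
One possible reading of this description, as a sketch rather than the module's actual implementation. In particular, how capitals inside the later words are treated is an assumption.

```python
def camel(text):
    """Join the words of text in camel case; the first word starts in lower case."""
    words = text.split()
    if not words:
        return ""
    first = words[0].lower()
    # Capitalize only the first letter of each following word.
    rest = [word[:1].upper() + word[1:] for word in words[1:]]
    return first + "".join(rest)
```

Under these assumptions, camel("part of speech") gives "partOfSpeech".
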

fillup(size, val, lst)[source]

Fill lst up with dummy elements val until it has size
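
A sketch of the behaviour described, assuming that the list is extended in place and also returned; the documentation does not say which of the two the real function does.

```python
def fillup(size, val, lst):
    """Append val to lst until it holds at least size elements."""
    missing = size - len(lst)
    if missing > 0:
        lst.extend([val] * missing)
    return lst
```
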

pretty(data)[source]
run(cmd)[source]
runx(cmd)[source]
today()[source]

wivu2laf.transform module

wivu2laf.validate module

class Validate(cfg)[source]

Validates all generated files, knows the schemas involved.

The main program generates a bunch of XML files, according to various schemas. They can be sent to this object, with or without a schema specification. All files with a schema specification will be validated.

The base locations of the schemas and of the generated files will be retrieved from the main configuration. All schemas will be copied from source to destination.

generated_files = list of [absolute_path, schema in destination, validation result]

Initialization is: get from config the schema locations and copy them all over

add(xml, xsd)[source]

Add an item to the generated files list. If xsd is given, the file will eventually be validated.

The validation result will be stored in a member of the item, which is initially None. If validation takes place, None will be replaced by True or False, depending on whether the xml is valid with respect to the xsd.
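
The bookkeeping can be sketched as follows. The class name ValidationLog, the default value for xsd, the paths and the pluggable check function are all assumptions made for the example; the real class validates against XML schemas.

```python
class ValidationLog:
    """Hypothetical stand-in for the bookkeeping described above."""

    def __init__(self):
        self.generated_files = []

    def add(self, xml, xsd=None):
        # The third member is the validation result: None until validated.
        self.generated_files.append([xml, xsd, None])

    def validate(self, check):
        """Apply check(xml, xsd) to every item that has a schema."""
        for item in self.generated_files:
            xml, xsd, _ = item
            if xsd is not None:
                item[2] = bool(check(xml, xsd))


log = ValidationLog()
log.add("/tmp/out/header.xml", "/tmp/out/header.xsd")  # invented paths
log.add("/tmp/out/primary.txt")
log.validate(lambda xml, xsd: xml.endswith(".xml"))  # dummy checker
```

Items without a schema keep None as their result, which lets a report distinguish "not validated" from "invalid".
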

cfg = None
generated_files = []
report()[source]

Print a list of all generated files and indicate validation outcomes

validate()[source]

Validate all eligible files, but only if the validation flag is on

wivu2laf.wivu module

class Wivu(cfg)[source]

Knows the WIVU data format.

All WIVU knowledge is stored in a file that describes objects, features and values. These are many items, and we divide them into parts and subparts. We have parts for monads, sections and linguistic objects. When we generate LAF files, they may become unwieldy in size. That is why we also divide parts into subparts. Parts correspond to sets of objects and their features. Subparts correspond to subsets of objects and/or subsets of features. N.B. It is “either or”: either

  • a part consists of only one object type, and the subparts divide the features of that object type

or

  • a part consists of multiple object types, and the subparts divide the object types of that part. If an object type belongs to a subpart, all its features belong to that subpart too.

In our case, the part ‘monad’ has a single object type, and its features are divided over subparts. The part ‘lingo’ has object types sentence, sentence_atom, clause, clause_atom, phrase, phrase_atom, subphrase, word. Its subparts are a partition of these object types into several subsets. The part ‘section’ does not have subparts. Note that an object type may occur in multiple parts: consider ‘word’. However, ‘word’ in part ‘monad’ has all non-relational word features, but ‘word’ in part ‘lingo’ has only relational features, i.e. features that relate words to other objects.
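
The “either or” rule can be checked mechanically. The sketch below is an illustration only: the function name, the input layout (subpart to object types to feature names) and the sample feature names are assumptions.

```python
def division_kind(subparts):
    """Classify how a part is divided.

    subparts maps subpart -> {object_type: set of feature names}.
    Returns "by_feature" when the part has a single object type, and
    "by_object" when each of several object types lives in one subpart.
    """
    object_types = set()
    for objects in subparts.values():
        object_types.update(objects)
    if len(object_types) == 1:
        return "by_feature"
    seen = {}
    for subpart, objects in subparts.items():
        for object_type in objects:
            if object_type in seen:
                raise ValueError(
                    "%s occurs in subparts %s and %s"
                    % (object_type, seen[object_type], subpart)
                )
            seen[object_type] = subpart
    return "by_object"


# Invented sample data in the spirit of the parts described above.
monad = {"lex": {"word": {"lexeme"}}, "morph": {"word": {"gender"}}}
lingo = {
    "sentence": {"sentence": {"mother"}, "sentence_atom": {"mother"}},
    "phrase": {"phrase": {"mother"}},
}
```
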

The Wivu object stores the complete information found in the Wivu config file in a bunch of data structures, and defines accessor functions for it.

The feature information is stored in the following dictionaries:

(Ia) part_info[part][subpart][object_type][feature_name] = None

Stores the organization of individual objects and their features in parts and subparts. NB: object_types may occur in multiple parts.

(Ib) part_object[part][object_type] = None

Stores the set of object types of parts

(Ic) part_feature[part][object_type][feature_name] = None

Stores the set of features of parts, per object type

(Id) object_subpart[part][object_type] = subpart

Stores the subpart in which each object type occurs, per part

(II) object_info[object_type] = [attributes]

Stores the information on objects, except their features and values.

(III) feature_info[object_type][feature_name] = [attributes]

Stores the information on features, except their values.

(IV) value_info[object_type][feature_name][feature_value] = [attributes]

Stores the feature value information

(V) reference_feature[feature_name] = True | False

Stores the names of features that reference other objects. The feature ‘self’ is an example, but we skip this feature: ‘self’ gets the value False, while other reference features, such as mother and parents, get True.

(VI) annotation_files[part][subpart] = (ftype, medium, location, requires, annotations, is_region)

Stores information on the files that are generated as the resulting LAF resource

The files are organized by part and subpart. Header files and primary data files are in part ‘’. Other files may or may not contain annotations. If not, they only contain regions. Then is_region is True.

ftype
the file identifier to be used in header files
medium
text or xml
location
the last part of the file name. All file names can be obtained by appending location after the absolute path followed by a common prefix.
requires
the identifier of a file that is required by the current file
annotations
the annotation labels to be declared for this file
The feature information file contains lines with tab-delimited fields. Only the starred fields are used; the last column gives the position of each used field once the unused ones are dropped:

field  name           used  position
0      object_type    *     0
1      feature_name   *     1
2      defined_on     *     2
3      wivu_type      *     3
4      feature_value  *     4
5      isocat_key     *     5
6      isocat_id
7      isocat_name    *     6
8      isocat_type
9      isocat_def
10     note
11     part           *     7
12     subpart        *     8
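
Selecting the starred fields from one line of such a file could look like the sketch below. The function name parse_line, the padding of short lines and the sample values are assumptions; the column indices and names are the ones listed above.

```python
# Indices of the starred (used) columns, in file order.
USED = (0, 1, 2, 3, 4, 5, 7, 11, 12)
NAMES = (
    "object_type", "feature_name", "defined_on", "wivu_type", "feature_value",
    "isocat_key", "isocat_name", "part", "subpart",
)


def parse_line(line):
    """Return a dict with the used fields of one tab-delimited line."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: short lines are padded with empty fields up to 13 columns.
    fields += [""] * (13 - len(fields))
    return dict(zip(NAMES, (fields[i] for i in USED)))


# Invented sample line with all 13 columns.
sample = "\t".join([
    "word", "gender", "", "enum", "f", "key1", "id1", "feminine",
    "type1", "def1", "note1", "monad", "morph",
])
row = parse_line(sample)
```
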

Initialization is: reading the excel sheet with feature information.

The sheet should be in the form of a tab-delimited text file.

There are columns with:
WIVU information:
object_type, feature_name, also_defined_on, type, value.
ISOcat information
key, id, name, type, definition, note
LAF sectioning
part, subpart

See the list of columns above.

So the file gives essential information to map objects/features/values to ISOcat data categories. It indicates how the LAF output can be chunked in parts and subparts.
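
The dictionaries part_info, part_object, part_feature and object_subpart described earlier can be pictured as being filled row by row. The sketch below assumes rows of (part, subpart, object_type, feature_name) with invented values; it illustrates the shapes only and is not the class's initialization code.

```python
from collections import defaultdict


def build_indexes(rows):
    """Fill the four part-related dictionaries from (part, subpart, object_type, feature_name) rows."""
    part_info = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))
    part_object = defaultdict(dict)
    part_feature = defaultdict(lambda: defaultdict(dict))
    object_subpart = defaultdict(dict)
    for part, subpart, object_type, feature_name in rows:
        # The dictionaries serve as sets, so the stored value is None.
        part_info[part][subpart][object_type][feature_name] = None
        part_object[part][object_type] = None
        part_feature[part][object_type][feature_name] = None
        # If an object type is spread over subparts, the last one seen wins here.
        object_subpart[part][object_type] = subpart
    return part_info, part_object, part_feature, object_subpart


rows = [
    ("lingo", "phrase", "phrase", "mother"),
    ("lingo", "sentence", "sentence", "mother"),
    ("monad", "morph", "word", "gender"),
]
part_info, part_object, part_feature, object_subpart = build_indexes(rows)
```
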

cfg = None
check_raw_files(part)[source]

Generate the file with raw Emdros output by executing a generated MQL query. This query has been generated during initialization. The file is only generated when a command line flag requests it.

feature_atts(object_type, feature_name)[source]

Returns a tuple of feature attributes, corresponding with the columns in the feature excel sheet.

The Wivu columns (object type, feature name) are missing, since they are given as arguments. The LAF columns are not included. The attributes returned are:

defined_on, wivu_type, isocat_key, isocat_name
feature_info = {}
feature_list(object_type)[source]

Answers: which features belong to an object type?

feature_list_part(part, object_type)[source]

Answers: which features belong to an object type within a part, excluding the features to be skipped?

feature_list_subpart(part, subpart, object_type)[source]

Answers: which features belong to an object type within a part and subpart, excluding the features to be skipped?

is_ref_skip(feature_name)[source]

Tests if the feature_name is a reference feature that should be skipped

list_ref_noskip()[source]

List the reference features that should not be skipped
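
Assuming that a value of False in reference_feature marks a reference feature to be skipped (as the description of ‘self’ suggests), the two tests above could be sketched as plain functions over that dictionary. The sorted output order is an assumption.

```python
# Sample content following the description of reference_feature.
reference_feature = {"self": False, "mother": True, "parents": True}


def is_ref_skip(feature_name):
    """True if feature_name is a reference feature that should be skipped."""
    return reference_feature.get(feature_name) is False


def list_ref_noskip():
    """Names of the reference features that should not be skipped."""
    return sorted(name for name, keep in reference_feature.items() if keep)
```
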

make_query_file(part)[source]

Generate an emdros query file to extract the raw data for part from the emdros database.

object_atts(object_type)[source]

Returns a tuple of object attributes, corresponding with the columns in the feature excel sheet.

The Wivu column (object type) is missing, since it is given as an argument. The LAF columns are not included. The attributes returned are:

isocat_key, isocat_name
object_info = {}
object_list(part, subpart)[source]

Answers: which objects are there in a subpart of a part?

object_list_all()[source]

Answers: which object types are there?

object_list_part(part)[source]

Answers: which objects are there in all subparts of a part?

object_subpart = {}
part_feature = defaultdict(<function <lambda> at 0x3ffd9b0>, {})
part_info = {}
part_list()[source]

Answers: which parts are there?

part_object = defaultdict(<function <lambda> at 0x3ffda28>, {})
raw_file(part)[source]

Give the name of the file with raw emdros output for part

reference_feature = {}
subpart_list(part)[source]

Answers: which subparts are there in a part?

the_subpart(part, object_type)[source]

Answers: which subpart of part contains this object type?

value_atts(object_type, feature_name, feature_value)[source]

Returns a tuple of value attributes, corresponding with the columns in the feature excel sheet.

The Wivu columns (object type, feature name, feature_value) are missing, since they are given as arguments. The LAF columns are not included. The attributes returned are:

wivu_type, isocat_key, isocat_name
value_info = {}
value_list(object_type, feature_name)[source]

Answers: which values belong to a feature of an object type?

Module contents