Overview


Usage

On the command line, go to the parent directory and call the script convert.py:

python convert.py [--raw] [--validate] [--fdecls-only] [Part]*

where each Part is one of: monad, section, lingo, all, none.

Transforms the WIVU database into a LAF resource. Because of file sizes, not all annotations are stored in one file. There are several parts of annotations: monads (words), sections (books, chapters, verses, etc.), and lingo (sentences, phrases, etc.).

If --raw is given, a fresh export from the EMDROS database is made. For each part there is a separate export.

If --validate is given, generated XML files will be validated against their schemas.

If --fdecls-only is given, only the feature declaration file is generated.
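
The command line described above can be sketched with Python's argparse module. This is an illustrative reconstruction, not the actual code of convert.py: the function names parse_args and selected_parts are hypothetical, and the reading of "all" (every part) and "none" (no part) is an assumption.

```python
import argparse

# The part names accepted on the command line, as listed in the usage text.
REAL_PARTS = ["monad", "section", "lingo"]
PART_NAMES = REAL_PARTS + ["all", "none"]


def parse_args(argv):
    """Parse the options and the optional list of parts (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="convert.py")
    parser.add_argument("--raw", action="store_true",
                        help="make a fresh export from the EMDROS database")
    parser.add_argument("--validate", action="store_true",
                        help="validate generated XML files against their schemas")
    parser.add_argument("--fdecls-only", action="store_true", dest="fdecls_only",
                        help="generate only the feature declaration file")
    parser.add_argument("parts", nargs="*", metavar="Part",
                        help="one of: " + ", ".join(PART_NAMES))
    args = parser.parse_args(argv)
    unknown = [p for p in args.parts if p not in PART_NAMES]
    if unknown:
        parser.error("unknown part(s): " + ", ".join(unknown))
    return args


def selected_parts(parts):
    """Expand 'all' and drop 'none' (assumed meaning of these two values)."""
    if "all" in parts:
        return list(REAL_PARTS)
    return [p for p in REAL_PARTS if p in parts]
```

For example, `parse_args(["--raw", "monad", "lingo"])` yields an object with `raw` set to True and `parts` equal to `["monad", "lingo"]`.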

The conversion is driven by a feature specification file. This file contains all information about objects, features and values that the program needs. Both the division into parts and the mapping to ISOcat are given in this file.

Input

The main input for the program is an EMDROS database, from which data will be exported by means of MQL queries. For every part (monad, section, lingo) an MQL query file is generated, and this query is run against the database. The result is a plain text file (Unicode, UTF-8) per part.

Output

This is what the program generates:

The main output consists of annotation files plus a primary data file, accompanied by descriptive headers. The primary data file is a plain text file (Unicode, UTF-8) containing the complete vocalized text of the Hebrew Bible according to the Biblia Hebraica Stuttgartensia. The text is chunked into books, chapters and verses by means of newlines only; no section indications occur in the primary text. This file is obtained from a few text-carrying features present in the database.
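
As a rough illustration of how a primary data file and the character regions of its words relate, the sketch below assembles a text from words grouped by verse and records each word's start and end offset. It is a simplification under stated assumptions: the function name build_primary_text is hypothetical, words are joined by single spaces, and each verse ends with a newline. The real program derives the text from text-carrying features in the database and may separate words differently.

```python
def build_primary_text(verses):
    """Join words into a plain text and record (start, end) offsets per word.

    `verses` is a list of verses, each a list of word strings.
    Offsets count characters (code points) in the resulting text.
    """
    pieces = []
    regions = []
    pos = 0
    for words in verses:
        for i, word in enumerate(words):
            if i > 0:
                # Assumed separator between words within a verse.
                pieces.append(" ")
                pos += 1
            regions.append((pos, pos + len(word)))
            pieces.append(word)
            pos += len(word)
        # Verses are separated by newlines only; no verse labels in the text.
        pieces.append("\n")
        pos += 1
    return "".join(pieces), regions
```

Because the primary text carries no section indications, offsets like these are what allow annotations to point back at words, verses and larger units.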

Annotation files are XML files that describe regions in the primary data, and properties of those regions. Annotations are the translation of the WIVU objects and features. Annotation files start with header information.
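
To make the idea of regions and their properties concrete, here is a minimal sketch that writes one region, one node linked to it, and one annotation with a feature. The element and attribute names (region, node, link, a, fs, f) are modeled on the GrAF serialization of LAF and are an assumption here; the files generated by convert.py may be structured differently, and namespaces and headers are omitted. The helper annotate_word and the feature name part_of_speech are hypothetical.

```python
import xml.etree.ElementTree as ET

# Clark notation for the xml:id attribute; ElementTree writes it as xml:id.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def annotate_word(graph, index, start, end, features):
    """Add a region, a node linked to it, and an annotation with features."""
    ET.SubElement(graph, "region",
                  {XML_ID: f"r{index}", "anchors": f"{start} {end}"})
    node = ET.SubElement(graph, "node", {XML_ID: f"n{index}"})
    ET.SubElement(node, "link", {"targets": f"r{index}"})
    annot = ET.SubElement(graph, "a",
                          {XML_ID: f"a{index}", "label": "word", "ref": f"n{index}"})
    fs = ET.SubElement(annot, "fs")
    for name, value in features.items():
        ET.SubElement(fs, "f", {"name": name, "value": value})
    return annot


graph = ET.Element("graph")
annotate_word(graph, 1, 0, 2, {"part_of_speech": "noun"})
xml_text = ET.tostring(graph, encoding="unicode")
```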

There are several header files, one for the LAF resource as a whole, one for the primary data file, and one for linking the object types and features to descriptions in the ISOcat registry.

All generated XML files will be validated against their schemas by means of xmllint.
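
A validation step of this kind could be driven from Python roughly as follows. This is a sketch, not the program's own code: the helper names are hypothetical, and it assumes the schemas are W3C XML Schema files, which xmllint checks with its --schema option (for RELAX NG schemas, xmllint provides --relaxng instead).

```python
import subprocess


def xmllint_command(xml_path, schema_path):
    """Build the xmllint call; --noout suppresses echoing the document."""
    return ["xmllint", "--noout", "--schema", schema_path, xml_path]


def validate(xml_path, schema_path):
    """Run xmllint and return (is_valid, messages).

    xmllint exits with status 0 when the file validates; diagnostics
    are written to standard error.
    """
    result = subprocess.run(
        xmllint_command(xml_path, schema_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr
```

The file paths passed to validate would come from the main configuration file; none are assumed here.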

Definitions

The conversion process is defined by a substantial amount of information outside the program. This information comes in the form of a main configuration file, a feature definition file, a bunch of templates, and several XML schemas.

The main configuration file specifies file locations, the version of the Hebrew database, and the location of the ISOcat registry. The feature definition file is a big list of object types and their associated features with their enumerated values, plus the ISOcat correspondences of it all. It also divides the LAF materials to be generated into a monad, a section and a lingo part, with one more layer of subdivision within each part, in order to keep the resulting XML files manageable.

Project

SHEBANQ, funded by CLARIN-NL, 2013-05-01 till 2014-05-01

Author

Dirk Roorda, Data Archiving and Networked Services, dirk.roorda@dans.knaw.nl

Everyone has every right to do with this program as he or she pleases.