What are some guidelines for preparing batch ingests?

Escape XML control characters

If your MODS metadata contains XML control characters (such as <, > and &), they must be escaped, i.e. converted into alternatives that the software can process without errors. For example, if your <abstract> element contains a line like the following:

"When exposure to helicopter activity is <1h per month…"

That needs to be changed to:

"When exposure to helicopter activity is &lt;1h per month…"

You may find that many characters need escaping. Look especially for ampersands (which become &amp;) and, inside attribute values, quotation marks (which become &quot;).
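If you prepare your metadata with a script, the escaping can be automated. The following is a minimal Python sketch using the standard library's xml.sax.saxutils.escape; the helper names escape_mods_text and escape_mods_attribute are hypothetical and not part of any Arca or Islandora tool.

```python
from xml.sax.saxutils import escape


def escape_mods_text(text):
    """Escape &, < and > so the text is safe inside an XML element."""
    return escape(text)


def escape_mods_attribute(text):
    """Also escape quotation marks and apostrophes, for attribute values."""
    return escape(text, {'"': "&quot;", "'": "&apos;"})


line = "When exposure to helicopter activity is <1h per month"
print(escape_mods_text(line))
# prints: When exposure to helicopter activity is &lt;1h per month
```

Run this only on raw text. Text that has already been escaped would be escaped a second time (for example, &lt; would become &amp;lt;).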

Zip file structure

For a standard batch ingest, all of your objects must meet the following criteria:

  • The XML file and the object file have the same filename (except for the extension)
  • They are all in the Zip file’s root (no directories allowed)
  • They are all going into the same collection and have the same content model (i.e. if you have both Thesis objects and Citation objects, put them into two different Zip files)
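The first two criteria can be checked before you upload. The following is a rough Python sketch, not an official validator: the function names are hypothetical, and it assumes that each object consists of exactly one .xml file plus one file with any other extension. It cannot check the third criterion, since collection and content model are not visible in the filenames.

```python
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def check_names(names):
    """Return a list of problems found in a list of Zip entry names."""
    problems = []
    suffixes_by_stem = defaultdict(list)
    for name in names:
        # Any slash means the entry is a directory or sits inside one.
        if "/" in name:
            problems.append(f"not in Zip root: {name}")
            continue
        path = PurePosixPath(name)
        suffixes_by_stem[path.stem].append(path.suffix.lower())
    for stem, suffixes in sorted(suffixes_by_stem.items()):
        if ".xml" not in suffixes:
            problems.append(f"no XML file for: {stem}")
        if all(suffix == ".xml" for suffix in suffixes):
            problems.append(f"no object file for: {stem}")
    return problems


def check_standard_batch(zip_path):
    """Open a Zip file and check the names of its entries."""
    with zipfile.ZipFile(zip_path) as archive:
        return check_names(archive.namelist())
```

Calling check_standard_batch("theses.zip") (a hypothetical filename) returns an empty list when every object has a matching pair of files in the Zip file's root.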

For a Newspaper batch, follow the guidelines here: https://github.com/Islandora/islandora_newspaper_batch

  • Multiple issues can go in one Zip file, but they must all be for the same newspaper
  • Each issue has a separate directory in the root of the Zip file
  • Each page has its own subdirectory, containing the page’s TIF file
  • Page directories are named for their page number: 1, 2, 3, etc.
  • Page TIF files are all given the OBJ.tif filename
  • Each issue has a MODS XML file, in the issue directory
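The newspaper layout above can be checked in a similar way. This Python sketch is an illustration only and does not replace the linked guidelines: the function name is hypothetical, it accepts any .xml file in the issue directory as the MODS file (an assumption, since the required filename is not stated here), and it ignores any other files it finds.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def check_newspaper_names(names):
    """Return a list of problems found in a list of Zip entry names."""
    issues = defaultdict(lambda: {"xml": False, "pages": {}})
    for name in names:
        if name.endswith("/"):
            continue  # skip explicit directory entries
        parts = PurePosixPath(name).parts
        if len(parts) < 2:
            continue  # files in the Zip root are not examined here
        issue = issues[parts[0]]
        if len(parts) == 2 and parts[1].lower().endswith(".xml"):
            issue["xml"] = True
        elif len(parts) == 3:
            # Remember each page directory and whether it holds OBJ.tif.
            has_tif = issue["pages"].get(parts[1], False)
            issue["pages"][parts[1]] = has_tif or parts[2] == "OBJ.tif"
    problems = []
    for issue_name, found in sorted(issues.items()):
        if not found["xml"]:
            problems.append(f"{issue_name}: no MODS XML file")
        for page, has_tif in sorted(found["pages"].items()):
            if not page.isdigit():
                problems.append(f"{issue_name}/{page}: page directory is not a number")
            elif not has_tif:
                problems.append(f"{issue_name}/{page}: no OBJ.tif")
    return problems
```

The names can be read from a Zip file with zipfile.ZipFile(path).namelist() from the Python standard library. An empty result means every issue directory has an XML file and every numbered page directory has an OBJ.tif.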

Ingest via the UI or command line

Very large batches can be difficult to ingest through the user interface. If you have a multi-gigabyte set, contact the Arca Office and we’ll ingest it for you via the command line. We can also help identify problems with your metadata before they end up in the repository.
