WWO Browse Processing

From Digital Scholarship Group

Preparing texts for WWO browsing display (do this on golf)

Overview

  • grab the files from /tb/distribution/ (or, nowadays, /tb/improvement/)
  • make sure there are no $$$$$, %%%%%, or <note type="temp">
  • run them through main processing pipe (mainWWOScript.bash)
  • run a script on them (shrtcites.perl) to add a new <title> element that Philologic will use
  • create the metadata HTML chunks
  • create the texts HTML chunks
  • update relevant lists accordingly

Note [2009-10-23] --- the following files are in improvement/ but should NOT be transferred to the WWO website:

  1. cowley.preface
  2. cowley.dramas1
  3. haywood.spectator (w/o number)

Detailed instructions - actual commands in green

Note: new files usually go into WWO in small batches rather than one at a time, so most of the commands will be for dealing with multiple files at once. HOWEVER, dealing with all the files from /improvement/ can be a very slow job in the early stages of the processing pipeline, so the initial couple of commands will be specified for dealing with one file at a time; when you get to the shrtcites step you can treat them as a batch. The commands for dealing with the complete set of /improvement/ texts as a group will also be given.

start in the WWO home dir

cd /var/www/htdocs/WWO/

check for leftover template strings and temporary notes. That is, search for the strings $$$$$ and %%%%% and for <note> elements of type=temp in the files you plan to process.

saxon -l -xsl:xslt/pre-publication-tests.xslt -s:/opt/local/projects/wwp/tb/improvement/foo.bar.xml

When processing all files...: ./scripts/pre-publication-tests.bash operates on an entire directory at once (default is improvement/)
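If you just want a quick manual check before running the XSLT, the same three patterns can be hunted with grep alone. This is a minimal sketch, not the canonical check; the file created here is a throwaway stand-in for a real text from improvement/.

```shell
# Quick-and-dirty version of the pre-publication check using grep.
# Build a throwaway stand-in file (real texts live in improvement/).
f=$(mktemp)
printf '<note type="temp">fix me</note>\n$$$$$ leftover template string\n' > "$f"

# Report any leftover template strings or temporary notes, with line numbers.
grep -n -E '\$\$\$\$\$|%%%%%|<note type="temp">' "$f"

rm -f "$f"
```

The canonical check is still pre-publication-tests.xslt; a plain grep will miss, e.g., a <note> whose type="temp" attribute happens to be split across a line break.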

then clear out any old stuff

rm -f temp1/* temp2/*

now get the texts you want from /improvement/ or /under_construction/ one at a time, nuking the DOCTYPE declaration and resolving entity and numeric character references as you go -- note that we’re putting the files into temp2/ to start with

xmllint --dropdtd --noent --c14n /opt/local/projects/wwp/tb/improvement/foo.bar.xml > temp2/foo.bar.xml

When processing all files...: for f in /opt/local/projects/wwp/tb/improvement/*.xml; do xmllint --dropdtd --noent --c14n $f > temp2/`basename $f`; done

now tweak our P5 files into a P4-like state, moving them from temp2/ to temp1/

time saxon -l -s:temp2/ -o:temp1/ -xsl:/var/www/htdocs/WWP/utils/stylesheets/retrofitP5_to_publishable.xslt

run the main processing script on each file, one at a time, putting the result files in /temp2/

Note: be sure to use the fully-specified paths!

scripts/mainWWOScript.bash /var/www/htdocs/WWO/temp1/foo.bar.xml /var/www/htdocs/WWO/temp2/foo.bar.xml

When processing all files...:

Do it in emacs so you can scroll back to see error messages easily. Note that all of the indented lines are a continuation of the previous line (i.e., from "for" to "done" is all one big, long line :-)

emacs &
M-x shell
cd /var/www/htdocs/WWO/temp1/
time for f in *.xml ; do \
    echo -e "---------$f:\n" ; \
    time ../scripts/mainWWOScript.bash \
        /var/www/htdocs/WWO/temp1/$f \
        /var/www/htdocs/WWO/temp2/$f ; \
    done

Warning! Watch out for these possible sources of error from mainWWOScript.bash:

  1. files that include entity refs to other files will need to have those references removed, otherwise they won't make it through the processing. For an example of how I have worked around the problem previously, compare xml/texts/cavendish.62a-struct.xml with /opt/local/projects/wwp/tb/distribution/cavendish.62a-struct.xml
  2. a file may contain an unusual character entity reference that hasn't yet been declared in /WWO/entities/hexvals.ent

Interjection from Syd: About the above mentioned problems...

  1. File entity references should be very easy to fix.
  2. Characters we don't know about may be fixable, but it probably isn't worth the effort.

Also, the file judson.account needs quite a bit of hand-tweaking. It has multiple bestowments on a single keyword. (I'm not even sure we allow that.) One way to fix this is to hand-propagate them to the <p> elements, as it is easy to defend that those renditions are not caused by the quotation.
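To spot the first two problem classes before running the pipeline, a crude grep sketch like the following can help. This is a hypothetical illustration: the entity name "struct" and the patterns are invented, so adjust them for the actual declarations in your files.

```shell
# Hypothetical sketch: flag external (SYSTEM) entity declarations and
# any entity references, so you can inspect them before processing.
# The file built here is a throwaway stand-in for a real text.
f=$(mktemp)
cat > "$f" <<'EOF'
<!DOCTYPE TEI [
<!ENTITY struct SYSTEM "cavendish.62a-struct.xml">
]>
<TEI>&struct;</TEI>
EOF

# Print matching lines with line numbers for inspection.
grep -n -E '<!ENTITY[^>]*SYSTEM|&[A-Za-z][A-Za-z0-9._-]*;' "$f"

rm -f "$f"
```

Note that this flags every entity reference, including ordinary character entities declared in hexvals.ent, so treat the output as a list of things to eyeball rather than a list of errors.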

when you have processed all the files you want to with mainWWOscript.bash, delete everything from /temp1/ and move the result files from /temp2/ into /temp1/

cd /var/www/htdocs/WWO/

rm -f temp1/*

mv temp2/* temp1/

all files in WWO have an extra <title> element in the header, called <title type="cite"> which is very important for the Philologic displays. This extra element is inserted by a Perl script called shrtcites.perl which reads data from a file called shrtcites.txt. So for each NEW file you are processing you must create an appropriate entry in /var/www/htdocs/WWO/scripts/shrtcites.txt

An entry for a text has two lines: the first is the XML filename; the second is the element that will be inserted into the file of that name; e.g.:

foo.bar.xml

<title type="cite">Foo:Something</title>

where "Foo" is obviously the author's name and "Something" is a meaningful word from the title (in many cases, but not all, "Something" is the same as "bar" from the filename).

Warning!: Be sure you have entered all the necessary data in shrtcites.txt before continuing
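One way to double-check that before running the script is a small shell loop that compares temp1/ against shrtcites.txt. This is a hedged sketch: it builds a throwaway demo tree rather than touching the real files, and "baz.qux.xml" is an invented filename; point the paths at the real /var/www/htdocs/WWO/ locations when running it for real.

```shell
# Sketch: report any XML file in temp1/ that has no filename line in
# shrtcites.txt.  Uses a throwaway demo tree for illustration.
mkdir -p demo/temp1
printf 'foo.bar.xml\n<title type="cite">Foo:Something</title>\n' > demo/shrtcites.txt
touch demo/temp1/foo.bar.xml demo/temp1/baz.qux.xml

for f in demo/temp1/*.xml ; do
    # -x requires the whole line to match the filename
    grep -q -x "$(basename "$f")" demo/shrtcites.txt \
        || echo "MISSING from shrtcites.txt: $(basename "$f")"
done

rm -rf demo
```

In the demo above, only baz.qux.xml is reported, since foo.bar.xml has an entry.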

scripts/shrtcites.perl

the shrtcites.perl script changes the files in place, so you now have in temp1/ new, processed, well-formed XML text file(s) that will form the basis of both the HTML display files and the Philologic search files. Thus the next important step is to save your file(s)

cp temp1/foo.bar.xml xml/texts/

When processing all files...:

remove the old set, copy the new set over in its place

rm xml/texts/*.xml
cp temp1/*.xml xml/texts/

now create the metadata chunk of HTML by running extract-metadata.xsl on the file(s) in temp1/:

xsltproc xslt/extract-metadata.xsl temp1/foo.bar.xml > temp2/foo.bar.html

Warning!: The above process works when teiHeader/fileDesc/sourceDesc/biblStruct//imprint//date has a when=, not if it uses from= and to= as, e.g., lennox.museum.xml does. That one probably would have to be tweaked by hand.

When processing all files...:
cd temp1/
for f in *.xml ; do echo "---------$f:" ; xsltproc ../xslt/extract-metadata.xsl $f > ../temp2/`basename $f .xml`.html ; done
cd ../

and move to the right place, overwriting any existing version

cd .. ; mv temp2/foo.bar.html html/metadata/

When processing all files...:
cd ..
rm -f html/metadata/*.html
mv temp2/*.html html/metadata/

similarly, create the text chunks of HTML by running transform-content.xsl:

/opt/local/bin/saxon.bash -xsl:xslt/transform-content.xsl -s:temp1/foo.bar.xml -o:temp2/foo.bar.html

When processing all files...: /opt/local/bin/saxon.bash -xsl:xslt/transform-content.xsl -s:temp1/ -o:temp2/

Note that the output is already named .html (I think this is because of the <xsl:output method="html"/>, but I'm not sure.)

and move to the right place, overwriting any existing version

mv temp2/foo.bar.html html/texts/

When processing all files...:
rm -f html/texts/*.html
mv temp2/*.html html/texts/

Finally, any new texts need to be added to the lists that let the user see what is available. The lists are in /html/lists/ and the ones that need to be updated are:

  • wTextsByAuthor.html
  • wTextsByDate.html
  • wTextsByTitle.html

To generate new lists, issue

/var/www/htdocs/WWO/scripts/generateLists.bash

It doesn't matter from which directory you issue this, as the input and output paths, not to mention the path to Saxon, are hard-coded in the XSLT program and shellscript driver.

That's it, all the texts are ready for browse display on golf. You should now check the text and metadata displays to see that they are OK. If you are satisfied, copy the contents of these directories over to their equivalents on papa:

  • xml/texts/
  • html/metadata/
  • html/texts/
  • html/lists/
  • wwo.css
  • wwo_print.css

This can be accomplished by running /opt/local/bin/wwo_sync and then scp-ing the CSS files separately (remember to make them r/w on papa first). I am not at all sure why we don't sync the whole WWO/ directory tree.

More To Do...

It turns out that there are also lists of texts in the WWP/ branch of our website (not the WWO/ branch) that need to be updated.

I have now automated this process, so it's pretty easy to do. It's worth noting that the work done by this process and by generateLists.bash is pretty similar, although the methods are quite different. They should probably be combined somehow someday.

Note that since the WWP/ branch is under Subversion control, this can be done on golf, tested, and then synced with papa if and when it works.

cd /var/www/htdocs/WWP/wwo/texts/
make all

For information on what is going on and gotchas to avoid, issue make (with no target) and read the file README. Note also that the title-level URLs do not have an analog on golf, so the page you generate on golf will point to papa and thus get redirected to textbase.