Corpus from Gutenberg


Making a plain text corpus from Project Gutenberg

  • Make a new directory to hold the corpus (here /tmp/GutCorp).
  • Make a new directory for temporary data, and cd into that directory.
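 For example (the temporary directory name /tmp/GutTmp is just a placeholder; use whatever scratch space you like):
 $ mkdir /tmp/GutCorp /tmp/GutTmp
 $ cd /tmp/GutTmp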
  • Download (this may take a while, likely measured in weeks; you do not have to wait for this command to finish before performing the further steps below, you just won’t get files that have not yet been downloaded):
 $ wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"
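 While the mirror runs, a quick way to see how many archives have arrived so far:
 $ find . -name '*.zip' | wc -l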
  • Copy only the .zip files from temporary to corpus dir
 $ find . -name '*.zip' -exec cp '{}' /tmp/GutCorp/ \;
  • Go to the corpus dir: cd /tmp/GutCorp
  • Unzip 'em all, erasing those that are successfully unzipped:
 $ for f in *.zip ; do if unzip "$f" ; then rm "$f" ; fi ; done
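 Any .zip file that survives the loop failed to unzip and can be inspected by hand; an error from ls here just means there were none:
 $ ls *.zip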
  • Move the text files that are in folders up to corpus dir:
 $ mv */*.txt */*.TXT ./
  • Remove the now-empty subdirectories (expect to get a message for each subdirectory saying whether or not it was deleted):
 $ find . -type d -exec rmdir -v '{}' +
  • Rename the .TXT files to .txt:
 $ for f in *.TXT ; do mv "$f" "$(basename "$f" .TXT).txt" ; done
  • Remove ASCII versions of files that also have an ISO-8859-1 encoded version (you will get an error message for each ISO-8859-1 file that has no plain-ASCII counterpart):
 $ for f in *-8.txt ; do rm "${f%-8.txt}.txt" ; done
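 As a sanity check, the following should now print nothing (it lists any ISO-8859-1 file whose plain-ASCII counterpart still exists):
 $ for f in *-8.txt ; do test -e "${f%-8.txt}.txt" && echo "$f" ; done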

Filtering said corpus

Divide into English and non-English

  • First, I used ls -1 . | egrep -v '\.txt$' to find non-text files.
  • There were a few that were folders, so I moved their contents up by hand and deleted the now empty folder.
  • Then, I removed obviously non-text files. Something like rm *.htm *.jpg *.png *.tif *.mus *.mid should do the trick.
  • I now have a (temporary) folder with nothing but ~51,066 plain text files.
  • Created a new sub-directory en/ to hold the English-language files. At this point I don’t know what other languages there will be (if any), so I have not created directories for other languages yet.
  • Moved those that declare themselves to be in English into en/ (took ~6¼ minutes):
 $ time for f in * ; do echo "---------$f:" ; if egrep -il '^language\s*:\s*english' "$f" ; then mv "$f" ./en/ && echo "moved" ; else echo "foreign" ; fi ; done
  • Now I have 486 files left in my temp Gutenberg directory, and 50,580 in en/.
  • Make a new subdirectory, foreign/.
  • Of those 486:
    • egrep -ih -m 1 '^language.?:' *.txt | rank shows me that 92 files have what might be language metadata
    • of those 92, 2 are clearly not language metadata, and those two have “language”, not “Language”. Furthermore, there are no spaces or anything else before the colon. So, revising the search a bit …
    • egrep -h -m 1 '^Language:' *.txt | rank shows me the languages pretty well.
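    (rank is not a standard utility; presumably it is a local helper that tallies identical lines and sorts by count, i.e. something like this shell function:)
$ rank () { sort | uniq -c | sort -rn ; }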
    • Move those that obviously have a large quantity of non-English text to foreign/:
$ mv `egrep -Hl -m 1 '^Language: (Esperanto|Latin|German|French|Spanish|Greek|Italian|Kamilaroi|Cebuano|Chinese|.English an|Finnish|Welsh)' *.txt` ./foreign/
    • Looked at the one with language “5468” by hand; it’s in English.
    • So move all the remaining files with language metadata into en/:

$ mv `egrep -Hl -m 1 '^Language:' *.txt` en/
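After this, nothing left behind should carry language metadata at all; this check should print nothing:

$ egrep -l '^Language:' *.txt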

  • I now have only 394 texts remaining that have not been sorted into en/ or foreign/ (well, 395 really; but 1 of them is the notes file I am keeping for this wiki page. :-)
  • To find the obvious cases quickly, I tried listing, for each file, all of the “words” that occurred 99 or more times in the file:
$ for f in *.txt ; do echo "---------$f:" ; egrep -v '^\*\*' "$f" | perl -pe 's/[.,;:?]//g; tr/A-Z/a-z/; s,\s+,\n,g;' | rank | egrep -v '^      ?[0-9]?[0-9] '; done
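The final egrep works by matching rank’s leading-space padding to drop counts under 99; an equivalent that filters the count numerically, using only standard tools, might look like this (it splits on every non-letter, so it is slightly looser about what counts as a word):
$ for f in *.txt ; do echo "---------$f:" ; egrep -v '^\*\*' "$f" | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sort | uniq -c | sort -rn | awk '$1 >= 99' ; done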
  • I then just went through that bloody list by hand, repeatedly searching for “---------” and scanning up to the first half-screen (~16) of most common words. If I recognized even a few as English, on to the next. Took < ½ hour. I found:
    • None that were not in English, however
    • Quite a few (perhaps a dozen) are tables of numbers, and thus pretty uninteresting. I left them, anyway, as there is some prose.
    • 11 files (660–670) are Webster’s dictionary. Will filter these to a separate directory.
    • 6 files (3201–3206) look like they are more indices into files that I don't have than actual texts themselves. And even the documents I don't have are not that interesting — instructions to a hyphenation program. Will filter these to a separate directory.
  • Created an ignore/ directory and moved the 17 files mentioned above into it.
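 Assuming the files follow the usual bare-number naming (NNN.txt), the move was something like:
 $ mkdir ignore
 $ mv {660..670}.txt {3201..3206}.txt ignore/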