Difference between revisions of "TextAnalysis"
(→Python) |
(68 intermediate revisions by the same user not shown) | |||
Line 1: | Line 1: | ||
=Resources for Exploring Text Analysis= | =Resources for Exploring Text Analysis= | ||
*[http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/ "Where to Start," courtesy of Ted Underwood] | *[http://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/ "Where to Start," courtesy of Ted Underwood] | ||
+ | *[http://toolingup.stanford.edu/?page_id=981 Stanford's Introduction, from ''Tooling Up for Digital Humanities''] | ||
+ | *[http://people.cs.umass.edu/~wallach/workshops/nips2011css/papers/OConnor.pdf Brendan O’Connor, et al., "Computational Text Analysis for Social Science"] | ||
+ | * Paul Baker, ''Using Corpora in Discourse Analysis'' (soon to be available in the NEU Library stacks); covers Corpus Building, Frequency and Dispersion, Concordance, and Collocation | ||
+ | ** seeking suggestions for a web-based resource | ||
+ | ==Python== | ||
+ | * [http://fbkarsdorp.github.io/python-course/ Folgert Karsdorp, ''Python Programming for the Humanities''] | ||
+ | * [http://www.pythonlearn.com/book.php Charles Severance, ''Python for Informatics''], an applied but comprehensive introductory Python text with sections on text parsing | ||
+ | * [https://www.python.org/downloads/ Download and install Python] | ||
+ | * [https://www.jetbrains.com/pycharm/ Download and install PyCharm], an Integrated Development Environment (IDE) for Python | ||
+ | * [https://ipython.org/ Download and install IPython], an interactive shell for Python | ||
==R== | ==R== | ||
− | * Matthew Jockers, Text Analysis With R for Students of Literature (PDF [http://onesearch.northeastern.edu/NU:NEU_ALMA51213317900001401&tabs=viewOnlineTab available for download] via the NEU Library) | + | * Matthew Jockers, ''Text Analysis With R for Students of Literature'' (PDF [http://onesearch.northeastern.edu/NU:NEU_ALMA51213317900001401&tabs=viewOnlineTab available for download] via the NEU Library) |
* [http://cran.at.r-project.org Download and install R] | * [http://cran.at.r-project.org Download and install R] | ||
− | * [http://www.rstudio.com Download and install RStudio] | + | * [http://www.rstudio.com Download and install RStudio], an Integrated Development Environment (IDE) for R |
− | * [http://www.rseek.org RSeek] | + | * [http://www.rseek.org RSeek], a search tool for finding resources on R |
* [http://www.cyclismo.org/tutorial/R/types.html Simple data types in R] | * [http://www.cyclismo.org/tutorial/R/types.html Simple data types in R] | ||
==Topic Modeling== | ==Topic Modeling== | ||
− | *[http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/ Megan R. Brett's "Basic Introduction"] | + | *[http://journalofdigitalhumanities.org/2-1/pacing-scholarly-conversations/ JDH's Special Issue] on Topic Modeling (2012) |
− | *[http://www.scottbot.net/HIAL/?p=19113 Scott Weingart's "Guided Tour"] | + | **[http://journalofdigitalhumanities.org/2-1/topic-modeling-a-basic-introduction-by-megan-r-brett/ Megan R. Brett's "Basic Introduction"] (conceptual) |
− | *[http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/ Ben Schmidt's | + | *[http://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/ Ted Underwood, "Topic modeling made just simple enough"] |
− | *[http://mallet.cs.umass.edu/topics.php MALLET] | + | *[http://www.scottbot.net/HIAL/?p=19113 Scott Weingart's "Guided Tour"] (comprehensive, lots of links) |
− | * GUI Tools that use MALLET | + | *[http://journalofdigitalhumanities.org/2-1/words-alone-by-benjamin-m-schmidt/ Ben Schmidt's article about Latent Dirichlet allocation's (LDA's) limitations] (also from the JDH special issue) |
− | ** [https://code.google.com/p/topic-modeling-tool/ Google's Topic Modeling Tool] | + | |
− | ** [http://vep.cs.wisc.edu/serendip/ Serendip] | + | |
− | * [http://nlp.stanford.edu/software/tmt/tmt-0.4/ Stanford Topic Modeling Toolbox] | + | ===Tools=== |
+ | *[http://mallet.cs.umass.edu/topics.php MALLET], an open-source and Java-based Latent Dirichlet allocation (LDA) package | ||
+ | ** [http://programminghistorian.org/lessons/topic-modeling-and-mallet Shawn Graham, Scott Weingart, and Ian Milligan's tutorial] for setting up a command line environment for using MALLET | ||
+ | ** [https://github.com/bmschmidt/RMallet Ben Schmidt's R package] wrapping MALLET | ||
+ | ** GUI Tools that use MALLET | ||
+ | *** [https://code.google.com/p/topic-modeling-tool/ Google's Topic Modeling Tool] | ||
+ | *** [http://vep.cs.wisc.edu/serendip/ Serendip] | ||
+ | * [http://nlp.stanford.edu/software/tmt/tmt-0.4/ Stanford Topic Modeling Toolbox] (an alternative to MALLET) | ||
==word2vec== | ==word2vec== | ||
* [http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html Ben Schmidt's Blog Post on Vector Space Models] | * [http://bookworm.benschmidt.org/posts/2015-10-25-Word-Embeddings.html Ben Schmidt's Blog Post on Vector Space Models] | ||
− | ** which links to his [https://github.com/bmschmidt/wordVectors R | + | ** which links to his [https://github.com/bmschmidt/wordVectors R package wrapping word2vec] (word2vec is written in C) |
==Miscellaneous text analysis tools== | ==Miscellaneous text analysis tools== | ||
− | * [http://voyant-tools.org/ Voyant Tools] | + | * [http://voyant-tools.org/ Voyant Tools], a simple web-based analysis and visualization tool |
+ | * [http://lexos.wheatoncollege.edu/upload Lexos], a tool for scrubbing, chunking, and tokenizing text, as well as performing modest analysis and visualizing clusters | ||
+ | ** [http://scottkleinman.net/blog/2014/07/25/how-to-create-topic-clouds-with-lexos/ Scott Kleinman's blog post] on "How to Create Topic Clouds with Lexos" | ||
* [http://www.laurenceanthony.net/software/antconc/ Laurence Anthony's AntConc], a GUI concordancing and text analysis toolkit | * [http://www.laurenceanthony.net/software/antconc/ Laurence Anthony's AntConc], a GUI concordancing and text analysis toolkit | ||
− | * David McClure's TextPlot | + | * [https://sites.google.com/site/casualconc/ CasualConc], a Mac OS X-native toolkit (AntConc's Mac version is ported from the PC and has some bugs) | ||
− | ** [http://dclure.org/essays/mental-maps-of-texts/ Blog post explaining concept] | + | * David McClure's TextPlot, a Python package that produces a force-directed network of the words in a text, the nodes of which are clustered using estimated kernel densities | ||
− | ** [http://dclure.org/tutorials/textplot-refresh/ Blog post | + | ** [http://dclure.org/essays/mental-maps-of-texts/ Blog post explaining concept of TextPlot] |
− | ** [http://dclure.org/logs/tuning-textplot/ Blog post | + | ** [http://dclure.org/tutorials/textplot-refresh/ Blog post on downloading and setting up TextPlot] |
+ | ** [http://dclure.org/logs/tuning-textplot/ Blog post explicating TextPlot's parameters] | ||
+ | * [http://bookworm.culturomics.org/ Bookworm], a customizable corpus trend visualization tool | ||
+ | * [https://www.jasondavies.com/wordtree/ Word Tree], a tool that creates [http://betterevaluation.org/evaluation-options/wordtree word trees] from a block of text | ||
=Corpus building= | =Corpus building= | ||
+ | *Amanda Rust's [http://subjectguides.lib.neu.edu/textdatamining Subject Guide] on "Text and Data Mining Library Databases" (Northeastern University Libraries) | ||
− | ==Some places to get | + | ==Some places to get texts== |
===Plain text=== | ===Plain text=== | ||
*[https://www.gutenberg.org/ Project Gutenberg] | *[https://www.gutenberg.org/ Project Gutenberg] | ||
− | *[http://eebo.chadwyck.com/home Early English Books Online (EEBO)] | + | *[http://eebo.chadwyck.com/home Early English Books Online (EEBO)] (some texts TEI-encoded) |
*[http://omekasites.northeastern.edu/ECDA/ Early Caribbean Digital Archive (ECDA)] | *[http://omekasites.northeastern.edu/ECDA/ Early Caribbean Digital Archive (ECDA)] | ||
− | ===Encoded=== | + | ===TEI-Encoded=== |
*[http://www.wwp.northeastern.edu/wwo/ Women Writers Online] | *[http://www.wwp.northeastern.edu/wwo/ Women Writers Online] | ||
*[http://www.textcreationpartnership.org/tcp-ecco/ Eighteenth Century Collections Online (ECCO-TCP)] | *[http://www.textcreationpartnership.org/tcp-ecco/ Eighteenth Century Collections Online (ECCO-TCP)] | ||
− | *[http://docsouth.unc.edu/ | + | *[http://docsouth.unc.edu/ Documenting the American South] |
Latest revision as of 04:14, 16 March 2016
Resources for Exploring Text Analysis
- "Where to Start," courtesy of Ted Underwood
- Stanford's Introduction, from Tooling Up for Digital Humanities
- Brendan O’Connor, et al., "Computational Text Analysis for Social Science"
- Paul Baker, Using Corpora in Discourse Analysis (soon to be available in the NEU Library stacks); covers Corpus Building, Frequency and Dispersion, Concordance, and Collocation
- seeking suggestions for a web-based resource
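Until a web-based resource turns up, two of the techniques Baker's book covers, frequency and concordance, can be illustrated with a minimal Python sketch. This is not taken from the book: the function names `tokenize` and `concordance`, the sample sentence, and the deliberately crude tokenizer are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Hypothetical sample text, used only to exercise the functions.
sample = "The whale surfaced. The crew watched the whale dive again."
tokens = tokenize(sample)

freq = Counter(tokens)                         # raw word frequencies
kwic = concordance(tokens, "whale", width=2)   # each occurrence of "whale" in context

for line in kwic:
    print(line)
```

Dedicated concordancers such as AntConc add sorting, dispersion plots, and collocation statistics on top of this basic idea.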
Python
- Folgert Karsdorp, Python Programming for the Humanities
- Charles Severance, Python for Informatics, an applied but comprehensive introductory Python text with sections on text parsing
- Download and install Python
- Download and install PyCharm, an Integrated Development Environment (IDE) for Python
- Download and install IPython, an interactive shell for Python
R
- Matthew Jockers, Text Analysis With R for Students of Literature (PDF available for download via the NEU Library)
- Download and install R
- Download and install RStudio, an Integrated Development Environment (IDE) for R
- RSeek, a search tool for finding resources on R
- Simple data types in R
Topic Modeling
- JDH's Special Issue on Topic Modeling (2012)
- Megan R. Brett's "Basic Introduction" (conceptual)
- Ted Underwood, "Topic modeling made just simple enough"
- Scott Weingart's "Guided Tour" (comprehensive, lots of links)
- Ben Schmidt's article about the limitations of Latent Dirichlet allocation (LDA) (also from the JDH special issue)
Tools
- MALLET, an open-source and Java-based Latent Dirichlet allocation (LDA) package
- Shawn Graham, Scott Weingart, and Ian Milligan's tutorial for setting up a command line environment for using MALLET
- Ben Schmidt's R package wrapping MALLET
- GUI Tools that use MALLET: Google's Topic Modeling Tool and Serendip
- Stanford Topic Modeling Toolbox (an alternative to MALLET)
word2vec
- Ben Schmidt's Blog Post on Vector Space Models
- which links to his R package wrapping word2vec (word2vec is written in C)
Miscellaneous text analysis tools
- Voyant Tools, a simple web-based analysis and visualization tool
- Lexos, a tool for scrubbing, chunking, and tokenizing text, as well as performing modest analysis and visualizing clusters
- Scott Kleinman's blog post on "How to Create Topic Clouds with Lexos"
- Laurence Anthony's AntConc, a GUI concordancing and text analysis toolkit
- CasualConc, a Mac OS X-native toolkit (AntConc's Mac version is ported from the PC and has some bugs)
- David McClure's TextPlot, a Python package that produces a force-directed network of the words in a text, the nodes of which are clustered using estimated kernel densities
- Bookworm, a customizable corpus trend visualization tool
- Word Tree, a tool that creates word trees from a block of text
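Several of the tools above perform the same preparatory steps before any analysis: scrubbing, tokenizing, and chunking. The sketch below shows one plausible reading of those three steps in plain Python. It is not how Lexos or any other listed tool implements them; the function names, the scrubbing rules (drop everything except ASCII letters), and the chunk size are assumptions chosen to keep the example short.

```python
import re


def scrub(text):
    """Lowercase, replace anything that is not a letter or whitespace with a space, collapse whitespace."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split scrubbed text into word tokens on whitespace."""
    return text.split()


def chunk(tokens, size):
    """Cut the token list into consecutive chunks of `size` tokens; the last chunk may be shorter."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Hypothetical input, used only to show the pipeline end to end.
raw = "Call me Ishmael. Some years ago, never mind how long precisely..."
tokens = tokenize(scrub(raw))
chunks = chunk(tokens, 4)

for number, piece in enumerate(chunks, start=1):
    print(number, " ".join(piece))
```

Chunking by a fixed token count is only one option; splitting on chapter or paragraph boundaries is often more meaningful for literary texts.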
Corpus building
- Amanda Rust's Subject Guide on "Text and Data Mining Library Databases" (Northeastern University Libraries)
Some places to get texts
Plain text
- Project Gutenberg
- Early English Books Online (EEBO) (some texts TEI-encoded)
- Early Caribbean Digital Archive (ECDA)