
Clifford Stoll: "Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom."
With Distant Reading, we can use certain tools to examine and compare unknown texts from a literary studies perspective, for example with regard to plot arc, characterization and sentiment.
With Text Mining, we can process large volumes of unstructured text in such a way that we obtain structured data from them.
With Information Retrieval, we can use analysis tools to answer users' individual questions about texts.
Individual functions of distant reading, text mining and information retrieval overlap; they differ primarily in their methodology. In this unit, we will get to know the basic methods of text mining and information retrieval and learn to differentiate between them.

Distant Reading with Voyant Tools

The web-based Voyant Tools make for an easy start into the methods of Distant Reading. Basic functions include creating word clouds, finding the frequencies of terms in document segments, and searching for contexts and phrases. Voyant Tools also offers a comprehensive guide that explains not only how to use the various tools but also what happens computationally in the background.
Task I
1. Download the complete canon of Sherlock Holmes
2. Load the canon into Voyant Tools.
3. Search for the terms "Sherlock" or "Holmes" and "Watson" with the Trends tool. You may search groups of terms by using the pipe symbol (|). This means that to compare the term "Watson" with the terms "Sherlock" and "Holmes", you would search for "Watson, Sherlock|Holmes". The comma separates terms or groups of terms.
4. Compare the occurrence of the entities Sherlock Holmes and John Watson. You may use different versions of the names and different tools of the Voyant canvas such as Cirrus, Phrases and Contexts. These are tools you already see in your standard canvas - but you may also use other tools, such as Topics, which you can activate by clicking on the window symbol in the upper right corner.
5. Based on your findings, can you infer anything about the role of Sherlock Holmes and/or Dr. Watson in the corpus? Write about it in your post.

Example for Text Mining

Suppose we want to determine which places are mentioned in a text - in this case "The Valley of Fear" by Arthur Conan Doyle - and display them on a map. Named Entity Recognition (NER) is used to find out which locations occur. We can use any natural language processing tool for this task; to keep the code short, we use an example with spaCy and Python here. In spaCy, named entities receive a label with information about the entity type, for example "PERSON" for persons, "ORG" for organizations and "GPE" for geopolitical entities.
In section eight you will learn how to apply this for yourself - for now just try to understand what approach we are taking here.
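A minimal sketch of this step, assuming spaCy's small English model is installed and the novel is available as a plain-text file (the file name is a placeholder):

    import spacy

    # Load the small English model (install it first with:
    # python -m spacy download en_core_web_sm)
    nlp = spacy.load("en_core_web_sm")

    # Read the novel from a local plain-text file (placeholder file name)
    with open("valley_of_fear.txt", encoding="utf-8") as f:
        text = f.read()

    doc = nlp(text)

    # Print every named entity labelled as a geopolitical entity (GPE)
    for ent in doc.ents:
        if ent.label_ == "GPE":
            print(ent.text)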
In the output, we see the names of the entities found.
We can use a library like Geopy to get a suggestion for the latitude and longitude of the location found:
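A minimal sketch of this step with Geopy's Nominatim geocoder; the user_agent string and the place name queried here are only illustrative stand-ins for one of the names found above:

    from geopy.geocoders import Nominatim

    # Nominatim is a free geocoding service; it requires a descriptive user_agent
    geolocator = Nominatim(user_agent="valley_of_fear_example")  # illustrative name

    # "query" stands for one of the place names found by the NER step;
    # "Sussex" is used here only as an illustration
    query = "Sussex"
    location = geolocator.geocode(query)
    if location is not None:
        print(f"{location.latitude},{location.longitude}")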
This produces a suggested latitude and longitude - in this case: 38.262336, -85.600322.
From this data we can create a CSV file: a plain-text file in which the first line contains the column names of a table and the values within each line are separated by a so-called delimiter - in this case the comma character.
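A minimal sketch of writing such a file with Python's csv module; the file name and the example rows are placeholders for the values collected in the previous steps:

    import csv

    # Placeholder rows standing in for the (place, latitude, longitude) values
    # collected above; the coordinates are rough and only for illustration
    rows = [
        ("London", 51.5074, -0.1278),
        ("Sussex", 50.91, -0.46),
    ]

    with open("places.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)  # the comma is the default delimiter
        writer.writerow(["place", "latitude", "longitude"])  # first line: column names
        writer.writerows(rows)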
In this way, we have obtained structured data from unstructured text, which we can now display on a map. You will learn how to do this in section eleven of this course.
Text mining therefore comprises a variety of methods for extracting structured data from texts. This can be done using Named Entity Recognition, Sentiment Analysis, Emotion Analysis, Topic Modeling, TF-IDF (Term Frequency-Inverse Document Frequency) and many other approaches. All of these methods can be used in text mining as well as in information retrieval; the crucial difference is whether or not there is a user with a specific information need.

Example for Information Retrieval

Information retrieval is not just about data and information, but also about a user who has a need for information. Let's assume we want to know which places Sherlock Holmes, Dr. Watson and Professor Moriarty visit in the novel "The Valley of Fear". As in the last example, we search for geopolitical entities. But now we go a bit further and try to find out which NEs with a PERSON label that identifies them as Sherlock Holmes, Dr. Watson or Prof. Moriarty are linked to NEs with a GPE label.
The following method is simplified and therefore prone to error. We use a for-loop to examine all tokens of the text one after the other and only execute the following steps for the tokens with the label 'GPE'. For georeferencing, we save the text of the token in the query variable.
Then we handle a possible exception: for certain entities, no geodata can be assigned, and in that case an IndexError would occur. Therefore we use 'try'. The program tries to execute the indented code after 'try'; if an IndexError occurs, the program jumps to the branch 'except IndexError' and executes the line 'print("Unknown: " + token.text)'.
Now the text of the token is passed on to the previously imported geocoder.
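A minimal, self-contained sketch of this loop, assuming spaCy and Geopy's Nominatim geocoder (the file name and user_agent are placeholders). Depending on how the geocoder's result is accessed, a missing result may surface as the IndexError described above or as a TypeError, so the sketch catches both:

    import spacy
    from geopy.geocoders import Nominatim

    nlp = spacy.load("en_core_web_sm")
    geolocator = Nominatim(user_agent="valley_of_fear_ir")  # illustrative name

    with open("valley_of_fear.txt", encoding="utf-8") as f:  # placeholder file name
        doc = nlp(f.read())

    for token in doc:
        if token.ent_type_ != "GPE":  # only handle tokens that are part of a GPE entity
            continue
        query = token.text  # text of the token to be georeferenced
        try:
            # exactly_one=False returns a list of candidates;
            # taking the first one fails if nothing was found
            location = geolocator.geocode(query, exactly_one=False)[0]
            print(query, location.latitude, location.longitude)
        except (IndexError, TypeError):
            print("Unknown: " + token.text)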
To link persons in the text with places, three approaches are chosen in this code (see the sketch after the following list).
  1. All persons that are subordinate to the geo-entity in the dependency analysis are assigned to the place.
  2. Whenever an NE with the label 'PERSON' is directly superordinate to the geo-entity in the dependency analysis, the NE is assigned to the location.
  3. If neither is the case, the document is searched leftwards from the token for the first NE with the label 'PERSON'.
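A sketch of these three heuristics using spaCy's dependency tree; the helper function and its name are hypothetical and assume 'doc' is the parsed novel from the previous step:

    def find_person_for_place(geo_token, doc):
        # Hypothetical helper illustrating the three heuristics described above
        # 1. A PERSON among the dependents (children) of the geo-entity
        for child in geo_token.children:
            if child.ent_type_ == "PERSON":
                return child.text
        # 2. A PERSON as the direct head of the geo-entity
        if geo_token.head.ent_type_ == "PERSON":
            return geo_token.head.text
        # 3. Otherwise, search leftwards for the closest PERSON token
        for i in range(geo_token.i - 1, -1, -1):
            if doc[i].ent_type_ == "PERSON":
                return doc[i].text
        return None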
Finally, we filter the data according to whether the person found is Watson, Holmes or Moriarty (a small filtering sketch follows the list below); we can then also visualize this data on a map. However, due to the simplicity of the approach, there are of course several problems that need to be taken into account:
  1. Holmes reports in the text about his research at various locations; this does not mean that he has visited these places himself.
  2. This approach does not deal with so-called coreference resolution, i.e. the assignment of pronouns to specific entities.
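A minimal filtering sketch, assuming the previous steps collected their results in a list of (person, place, latitude, longitude) tuples; the variable names and sample rows are purely illustrative:

    # "linked_places" stands for the results collected above;
    # the rows below are placeholder examples
    linked_places = [
        ("Sherlock Holmes", "London", 51.5074, -0.1278),
        ("Cecil Barker", "Birlstone", None, None),
    ]

    targets = ("Holmes", "Watson", "Moriarty")
    filtered = [row for row in linked_places
                if any(name in row[0] for name in targets)]
    print(filtered)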
Task II
Try to find one example each for distant reading, text mining and information retrieval.