Sentiment Analysis & Named Entity Recognition
Named Entity Recognition and Sentiment Analysis are two different methods of computational linguistics. Both methods are frequently used in the digital humanities, and since there is a lot of overlap in the pre-processing of the text that both methods require, we will treat them together in this unit. At the end of this section you will be given a piece of code that you should then be able to read, in the sense that you can understand in theory what it does. That means you will gain an understanding not only of Sentiment Analysis and NER but also of interpreting code.
We will again use Sherlock Holmes quotes as examples. To give you an idea of what natural language processing can look like in practice, we will go through the individual processing steps using different Python libraries. You will find a detailed explanation of the code at the end of each page. The aim of the unit is to gain a basic understanding of the results produced by the code.
spaCy
The first Python library we use here is spaCy. This is an open-source library that is considered relatively easy to use. With spaCy you can analyze 25 different languages and combine libraries such as TensorFlow, PyTorch or MXNet with spaCy's own machine learning library Thinc. At https://spacy.io you will find the different models as well as clear and interactive tutorials for both beginners and advanced users.
If you use software, you must also cite it. For spaCy you could cite the first publication...
Matthew Honnibal (2015) Introducing spaCy. Explosion.ai. https://explosion.ai/blog/introducing-spacy
...or the release notes of the last major update...
Explosion.ai (2023) "Fixes for APIs and requirements" GitHub https://github.com/explosion/spaCy/releases/tag/v3.7.2 .
...or a specific model...
Explosion.ai (2023) de_core_news_lg-3.7.0. GitHub. https://github.com/explosion/spacy-models/releases/tag/de_core_news_lg-3.7.0
...depending on whether you want to talk or write about the library itself, recent improvements or a specific model.
Tokenisation
    import spacy

    nlp = spacy.load("en_core_web_lg")

    text = """"Excellent!" I cried. "Elementary," said he."""

    doc = nlp(text)

    for token in doc:
        print(token.text)
The output is the following list:
1 | " |
spaCy has successfully recognized the different parts of the sentence.
    import spacy
In this line, you use the "import" command to load a library, i.e. a collection of Python scripts, into your own script. To import a library, it must be available on your system. This means, for example, that you must have installed it in your terminal with the command "pip install spacy" or via the Anaconda Navigator so that you can access it in your own script.
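As a small illustration of the two common variants of the import statement, here is a sketch using Python's built-in math module (which is not part of the spaCy example, just for demonstration):

    # Import the whole module; functions are accessed via the module name.
    import math
    print(math.sqrt(16))

    # Import only a single function; it can then be used directly.
    from math import sqrt
    print(sqrt(16))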
    nlp = spacy.load("en_core_web_lg")
In this line, you assign a value to the variable "nlp" - in this case, the loaded model "en_core_web_lg". You access the various functions of the spaCy library by first specifying the library with "spacy" and using the "." to signal that you are accessing a specific function, in this case "load". For this function, you use the name of the model to specify which model you would like to load into your script. On the page https://spacy.io/models/en you will find detailed information on the individual models that are available for the English language. The individual models must also be downloaded first. You can download the model in your terminal with the command "python -m spacy download en_core_web_lg".
    text = """"Excellent!" I cried. "Elementary," said he."""
You also define a variable in this line. In "text", you save the character string "Excellent!" I cried. "Elementary," said he. In programming, a sequence of characters is called a string. Theoretically, you can also mark a string with just one quotation mark at the beginning and end of the string. However, as our quote itself contains quotation marks, it makes sense to use three quotation marks in succession. This notation is designed for such cases and for the event that you want to write multi-line strings.
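Here is a small sketch of the difference between the two notations (the example strings are made up):

    # A simple string with one quotation mark at each end:
    greeting = "Good evening, Watson"

    # Triple quotes allow quotation marks inside the string
    # as well as line breaks:
    quote = """He said "Elementary"
    and smiled."""
    print(quote)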
    doc = nlp(text)
In this line, the string is processed using the previously loaded model. The result stored in the variable is a "Doc" object, a special object type of the spaCy library.
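The resulting "doc" behaves like a sequence of tokens, which is why we can loop over it in the next line. A minimal sketch of what you can do with it:

    # The number of tokens in the processed text:
    print(len(doc))

    # Individual tokens can be accessed by index:
    print(doc[0].text)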
    for token in doc:
A "for-loop" is started in the first line. A loop always consists of the initialisation of the loop and indented lines of code under this initialisation. These indented lines of code should be executed several times in succession. In a "for-loop", the indented part is always executed as often as the second part of the initialisation is long. This works with all iterable objects - i.e. lists, dictionaries, strings etc. Here are two examples:
    for number in [1, 2, 3, 4]:
        print("Hello there")
If you execute this code, you get the following output:
    Hello there
    Hello there
    Hello there
    Hello there
    for character in "Watson":
        print("Sherlock")
If you execute this code, you get the following output:
    Sherlock
    Sherlock
    Sherlock
    Sherlock
    Sherlock
    Sherlock
At the same time, a variable is also assigned in the for-loop. In our examples, this is token, number and character. The content of this variable changes with each run of the loop. The first time the loop is run, it holds the first element of the iterable object. So in the first example it is the number 1 and in the second example the character W. The same examples follow, but this time we also output the content of the respective variable.
    for number in [1, 2, 3, 4]:
        print(number)
Output:
    1
    2
    3
    4
    for character in "Watson":
        print(character)
Output:
    W
    a
    t
    s
    o
    n
    print(token.text)
In this line we use a standard Python function - the print function. With this function, you can output either values directly or the content of a variable. Here it outputs the "text" attribute of each token, i.e. the plain character string of that token. This function is very practical for checking the content of variables at different points in the code.
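Later in this unit we will also use so-called f-strings to format output. Here is a small sketch of both variants of print (the variable is made up for illustration):

    detective = "Holmes"

    # Print several values separated by spaces:
    print("The detective is called", detective)

    # An f-string embeds variables directly in the text:
    print(f"The detective is called {detective}")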
Part-of-Speech Tagging
Natural Language Toolkit
The next task is part-of-speech tagging. We use nltk for this. The abbreviation stands for Natural Language Toolkit. This library is also open-source software that was initially developed by Steven Bird and Edward Loper at the University of Pennsylvania.
A wiki for the library is available on GitHub (https://github.com/nltk/nltk/wiki).
How to cite: nltk (2023) 3.8.1. GitHub. https://github.com/nltk/nltk/releases/tag/3.8.1
    import nltk

    # The tokenizer and the POS tagger must be downloaded once.
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    text = "There is nothing more deceptive than an obvious fact."
    tokens = nltk.word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    print(pos_tags)
Output:
    [('There', 'EX'),
     ('is', 'VBZ'),
     ('nothing', 'NN'),
     ('more', 'RBR'),
     ('deceptive', 'JJ'),
     ('than', 'IN'),
     ('an', 'DT'),
     ('obvious', 'JJ'),
     ('fact', 'NN'),
     ('.', '.')]
We can look up what these tags mean using the help function. Only the nltk.help.upenn_tagset function is important in the following code. All other values in the print function only serve to make the output clearer. You don't need to know how this works yet, but take a look at the output. It is much more detailed than that of spacy.explain.
    for tag in pos_tags:
        print(f"Word: {tag[0]}, Tag: {tag[1]}")
        nltk.help.upenn_tagset(tag[1])
This gives us the explanations of the POS tags:
    Word: There, Tag: EX
    EX: existential there
        there
    ...
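For comparison, spaCy's much shorter explanations can be looked up with the spacy.explain function, which returns a one-line description string:

    import spacy

    # Returns a short description of a tag or label.
    print(spacy.explain("EX"))

For "EX", this returns simply "existential there".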
Semantic Analysis
Stanford CoreNLP
For the semantic analysis, the dependency parsing, we use Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/). CoreNLP is made available under the GNU General Public License v3. The project was originally developed for internal use at Stanford University. For easy use, CoreNLP can be accessed with the Python library stanza. A citation method for the software is suggested on the official website.
How to cite: Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
    import stanza

    # The English models must be downloaded once.
    stanza.download("en")

    # Dependency parsing requires the tokenize, pos and lemma processors.
    nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

    text = ("How often have I said to you that when you have eliminated "
            "the impossible, whatever remains, however improbable, must be the truth?")
    doc = nlp(text)

    for sentence in doc.sentences:
        for word in sentence.words:
            head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
            print(f"{word.text} --{word.deprel}--> {head}")
The output is the following list:
    How --advmod--> often
    often --advmod--> said
    have --aux--> said
    I --nsubj--> said
    said --root--> ROOT
    ...
Note the fundamental difference from tokenization and POS tagging: dependency parsing does not work at the level of individual tokens but adds relations between tokens.
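Because every relation points from a token to its head, the relations of a sentence form a tree. As a small sketch building on the "doc" object from the code above, you can, for example, pick out the root of each sentence - the word whose head index is 0:

    for sentence in doc.sentences:
        for word in sentence.words:
            # The root of the sentence has no head (head index 0).
            if word.head == 0:
                print("Root:", word.text)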
Named Entity Recognition
TextBlob is a library that is based on nltk but is easier to use. It has only limited functions and is therefore intended more as a demonstration and teaching library. We use it here for Named Entity Recognition. I would also like to introduce you to the concept of comments in code at this point. You can use a hashtag (#) at any point in the code to insert a comment directly into the code. The commented text is skipped during execution and thus does not interfere with the functionality of your code. With comments you ensure that a certain basic documentation is available every time the code is used. This time you will find the explanations as comments directly in the code.
    # Loads the textblob library into your script. TextBlob must be installed.
    from textblob import TextBlob

    # A Sherlock Holmes quote as the example text.
    text = "My name is Sherlock Holmes. It is my business to know what other people do not know."

    # Creates a TextBlob object from the string.
    blob = TextBlob(text)

    # TextBlob has no dedicated NER function; its noun phrase extraction
    # serves here as a simple stand-in, since many named entities,
    # such as "sherlock holmes", appear among the noun phrases.
    print(blob.noun_phrases)
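For comparison: spaCy, which you already know from the tokenisation step, offers dedicated named entity recognition. A minimal sketch with a made-up example sentence:

    import spacy

    nlp = spacy.load("en_core_web_lg")
    doc = nlp("Sherlock Holmes and Dr. Watson live at 221B Baker Street in London.")

    # doc.ents contains the recognized entities with their labels,
    # e.g. PERSON for persons or GPE for geopolitical entities.
    for ent in doc.ents:
        print(ent.text, ent.label_)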
Sentiment Analysis
Link: https://fortext.net/routinen/methoden/sentimentanalyse
The Python library "transformers" is a powerful open-source library developed by Hugging Face. It provides access to a wide range of pre-trained models for natural language processing (NLP) and machine learning. The library enables developers to efficiently use models for tasks such as text classification, machine translation, question-answering systems and more.
Official website: https://huggingface.co
GitHub project: https://github.com/huggingface/transformers
On the GitHub project you will also find the citation reference in BibTeX format. This allows you to import the citation into your reference management program and use it later in different citation styles depending on the occasion.
How to cite: Wolf, T. et al. (2020) ‘Transformers: State-of-the-Art Natural Language Processing’, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Online: Association for Computational Linguistics, pp. 38–45. Available from: https://www.aclweb.org/anthology/2020.emnlp-demos.6
    # With "from x import y" you can load only specific functions of a large library into your script.
    from transformers import pipeline

    # Loads a pre-trained model for sentiment analysis.
    sentiment_pipeline = pipeline("sentiment-analysis")

    sentence = "I abhor the dull routine of existence."

    # The pipeline returns a list with one result per input.
    result = sentiment_pipeline(sentence)[0]

    print(f"Sentence: {sentence}")
    print(f"Sentiment: {result['label']}, Confidence: {result['score']:.4f}")
The transformers model not only shows the sentiment of the sentence but also its "confidence" in the result (the exact score depends on the model used).
    Sentence: I abhor the dull routine of existence.
    Sentiment: NEGATIVE, Confidence: 0.9996
Here is another example:
    Sentence: A good detective knows that every task, every interaction, no matter how seemingly banal, has the potential to contain multitudes.
    Sentiment: POSITIVE, Confidence: 0.9921
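Since each result is an ordinary dictionary, the confidence can also be used programmatically, for example to flag uncertain classifications for manual review. A small sketch building on the "result" variable from the code above (the threshold of 0.75 is an arbitrary value chosen for illustration):

    # Flag predictions whose confidence falls below a chosen threshold.
    if result["score"] < 0.75:
        print("Low confidence - check this sentence manually.")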
Task
Sentiment Analysis & NER Mini Chat
Go through the provided code line by line and try to explain in your own words what happens in the individual lines. Add these explanations to the code as comments. You may also pass parts of the code to an AI assistant - but please note down for which lines you did so.
    import spacy
    from transformers import pipeline

    nlp = spacy.load("en_core_web_lg")

    text = "My name is Sherlock Holmes. It is my business to know what other people do not know."

    doc = nlp(text)

    for ent in doc.ents:
        print(ent.text, ent.label_)

    sentiment_pipeline = pipeline("sentiment-analysis")

    for sentence in doc.sents:
        result = sentiment_pipeline(sentence.text)[0]
        print(f"Sentence: {sentence.text}")
        print(f"Sentiment: {result['label']}, Confidence: {result['score']:.4f}")