Sociography: How to analyse societies digitally

Reiter

Sentiment Analysis & Named Entity Recognition

Named Entity Recognition and Emotion Analysis are two different methods of computational linguistics. Both methods are frequently used in the digital humanities and since there are many overlaps in the pre-processing of the text that is used for both methods, we will treat them together in this unit. At the end of this section you will be given a code that you should then be able to read in the sense that you can understand what is does in theory. That means you will gain understanding not only on Sentiment Analysis and NER but also in interpreting code.

We will again use Sherlock Holmes quotes as examples. To give you an idea of what machine language processing can look like in practice, we will go through the individual processing steps using different python libraries. You will find a detailed explanation of the code at the end of each page. The aim of the unit is to gain a basic understanding of the results produced by the code.

spaCy

The first python library we use here is spaCy. This is an open-source library that is considered relatively easy to implement. With spaCy you can analyze 25 different languages and use other libraries such as TensorFlow, Pytorch or MXNet together with spaCy's own library Thinc. At https://spacy.io you will find the different models as well as clear and interactive tutorials for both beginners and advanced users.

If you use software, you must also cite it. For spaCy you could cite the first publication...

Metthew Honnibal (2015) Introducing spaCy. Explosion.ai. https://explosion.ai/blog/introducing-spacy .

...or the release notes of the last major update...

Explosion.ai (2023) "Fixes for APIs and requirements" GitHub https://github.com/explosion/spaCy/releases/tag/v3.7.2 .

...or a specific model...

Explosion.ai (2023) en_core_news_lg-3.7.0 GitHub https://github.com/explosion/spacy-models/releases/tag/de_core_news_lg-3.7.0 .

...depending on whether you want to talk or write about the library itself, current improvements or just a specific model.

Tokenisation

1 2 3 4 5 6 7 8 9 10	import spacy nlp = spacy.load("en_core_web_lg") text = """"Excellent!" I cried. "Elementary," said he.""" doc = nlp(text) for token in doc: print(token.text)
Download

The output is the following list:

"
Excellent
!
"
I
cried
.
"
Elementary
,
"
said
he
.

spaCy has successfully recognized the different parts of the sentence.

1	import spacy

In this line, use the "import" command to import a library, i.e. a collection of Python scripts, into your own script. To import a library, it must be available on your system. This means, for example, that you must have installed it in your terminal with the command "pip install spacy" or via the Anaconda navigator so that you can access it in your own script.

1	nlp = spacy.load("en_core_web_lg")

In this line, you assign a value to a variable "nlp". In this case, it is the model "en_core_web_lg". You access the various functions of the library spacy by first specifying the library with "spacy" and using the "." to signal that you are accessing a specific function, in this case "load". For this function, use the name of the model to specify which model you would like to load into your script. On the page https://spacy.io/models/en you will find detailed information on the individual models that are available for the English language. The individual models must also be downloaded first. You can download the model in your terminal with the command "python -m spacy download en_core_web_lg".

1	text = """"Excellent! I cried. "Elementary," said he."""

You also define a variable in this line. In "text", save the character string - "Excellent! I cried. "Elementary," said he. - In scripts, a sequence of characters is called a string. Theoretically, you can also mark a string with just a quotation mark at the beginning and end of the string. However, as our quote itself also contains quotation marks, it makes sense to use three quotation marks in succession. This function is designed for such cases and in the event that you want to write multi-line strings.

1	doc = nlp(text)

In this line, the string is processed using the previously defined model. The variable is of type 'doc', a special object type of the spacy library.

1 2	for token in doc: print(token.text)

A "for-loop" is started in the first line. A loop always consists of the initialisation of the loop and indented lines of code under this initialisation. These indented lines of code should be executed several times in succession. In a "for-loop", the indented part is always executed as often as the second part of the initialisation is long. This works with all iterable objects - i.e. lists, dictionaries, strings etc. Here are two examples:

1
2
3

for number in [1,2,3,4]:
	print('Hello there')
print('General Kenobi')

If you execute this code, you get the following output:

Hello there
Hello there
Hello there
Hello there
General Kenobi

1 2	for character in "Watson": print("Sherlock")

If you execute this code, you get the following output:

Sherlock
Sherlock
Sherlock
Sherlock
Sherlock
Sherlock

At the same time, a variable is also assigned in the for-loop. In our example, this is token, number and character. The content of this variable changes with each run of the loop. The first time the loop is run, this is the first part of the iterable object. So in the first example l is the number - 1 - and in the second example the character - W -. The same examples follow, but this time we also output the content of the respective variable.

for number in [1,2,3,4]:
	print(number)
	print('Hello there')
print('General Kenobi')

Output

1
Hello there
2
Hello there
3
Hello there
4
Hello there
General Kenobi

1
2
3

for character in "Watson":
 print(character)
 print("Sherlock")

Output:

W
Sherlock
a
Sherlock
t
Sherlock
s
Sherlock
o
Sherlock
n
Sherlock

1	print(token.text)

In this line we use a standard python function - the print function. With this function, you can either output values directly or the content of a variable. This function is very practical for checking the content of variables at different points in the code.

Part-of-Speech Tagging

Natural Language Toolkit

The next task is part-of-speech tagging. We use nltk for this. The abbreviation stands for Natural Language Toolkit. This library is also open-source software that was initially developed by Steven Bird and Edward Loper at the University of Pennsylvania.

A wiki for the library is available on GitHub (https://github.com/nltk/nltk/wiki). nltk (2023) 3.8.1 https://github.com/nltk/nltk/releases/tag/3.8.1

import nltk
 
sentence = "There is nothing more deceptive than an obvious fact."
 
tokens = nltk.word_tokenize(sentence)
 
tagged = nltk.pos_tag(tokens)
 
print(tagged)

Output:

[('There', 'EX'),
 ('is', 'VBZ'),
 ('nothing', 'NN'),
 ('more', 'RBR'),
 ('deceptive', 'JJ'),
 ('than', 'IN'),
 ('an', 'DT'),
 ('obvious', 'JJ'),
 ('fact', 'NN'),
 ('.', '.')]

We can call up what these tags mean using the help function. Only the nltk.help.upenn_tagset function is important in the following code. All other values in the print function only serve to make the output clearer. You don't need to know how this works yet, but take a look at the output. This is much more detailed than for spaCy.explain.

for tag in pos_tags:
    print(f"Word: {tag[0]}, Tag: {tag[1]}")
    print(nltk.help.upenn_tagset(tag[1]))
    print("------------------------")

This gives us the explanations of the POS tags:

Word: There, Tag: EX
EX: existential there
    there
None
------------------------
Word: is, Tag: VBZ
VBZ: verb, present tense, 3rd person singular
    bases reconstructs marks mixes displeases seals carps weaves snatches
    slumps stretches authorizes smolders pictures emerges stockpiles
    seduces fizzes uses bolsters slaps speaks pleads ...
None
------------------------
Word: nothing, Tag: NN
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
None
------------------------
Word: more, Tag: RBR
RBR: adverb, comparative
    further gloomier grander graver greater grimmer harder harsher
    healthier heavier higher however larger later leaner lengthier less-
    perfectly lesser lonelier longer louder lower more ...
None
------------------------
Word: deceptive, Tag: JJ
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
None
------------------------
Word: than, Tag: IN
IN: preposition or conjunction, subordinating
    astride among uppon whether out inside pro despite on by throughout
    below within for towards near behind atop around if like until below
    next into if beside ...
None
------------------------
Word: an, Tag: DT
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
None
------------------------
Word: obvious, Tag: JJ
JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
None
------------------------
Word: fact, Tag: NN
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...
None
------------------------
Word: ., Tag: .
.: sentence terminator
    . ! ?
None
------------------------

Semantische Analyse

Stanford CoreNLP

For the semantic analysis, the dependency parsing, we use Stanford CoreNLP (https://stanfordnlp.github.io/CoreNLP/).CoreNLP is made available under the GNU General Public License v3. The project was originally developed for internal use at Stanford University.For easy use, CoreNLP can be imported with the python library stanza. A citation method for the software is suggested on the official website.

How to cite: Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.

import stanza
 
stanza.download('en')
 
nlp = stanza.Pipeline('en', processors='tokenize,pos,lemma,depparse')
 
quote = "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?"
 
doc = nlp(quote)
 
for sentence in doc.sentences:
    for edge in sentence.dependencies:
        print(f"{edge[2].text} --{edge[1]}--> {edge[0].text}")

The output is the following list

How --advmod--> often
often --advmod--> said
have --aux--> said
I --nsubj--> said
said --root--> ROOT
to --case--> you
you --obl--> said
that --mark--> truth
when --advmod--> eliminated
you --nsubj--> eliminated
have --aux--> eliminated
eliminated --advcl--> truth
the --det--> impossible
impossible --obj--> eliminated
, --punct--> eliminated
whatever --nsubj--> truth
remains --acl:relcl--> whatever
, --punct--> improbable
however --advmod--> improbable
improbable --xcomp--> remains
, --punct--> truth
must --aux--> truth
be --cop--> truth
the --det--> truth
truth --ccomp--> said
? --punct--> said

There is a fundamental difference between tokenization and POS tagging. Dependency parsing does not work at token level but adds relations between tokens.

Named Entity Recognition

Textblob is a library that is based on nltk but is easier to use. It has only limited functions and is therefore intended more as a demonstration and teaching library. We use it here for Named Entity Recognition. I would also like to introduce you to the concept of comments in code at this point. You can use a hashtag (#) at any point in the code to insert a comment directly into the code, this code is skipped during execution and thus does not interfere with the functionality of your code. With this comment you ensure that a certain basic documentation is available every time the code is used. This time you will find the explanations as a comment directly in the code.

# Loads the textblob library into your script. Textblob must be installed.
import textblob
 
# Creates a string as the content of the text variable.
text = "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?"
 
# Creates a TextBlob object from the text variable and stores this object in the blob variable.
blob = TextBlob(text)
 
# Executes Named Entity Recognition on the blob variable and stores the result in the variable entities.
entities = blob.noun_phrases
 
# Outputs the variable enitites
print(entities)

Sentiment Analysis

Link (https://fortext.net/routinen/methoden/sentimentanalyse))

The Python library "transformers" is a powerful open source library developed by Hugging Face. It provides access to a wide range of pre-trained models for natural language processing (NLP) and machine learning. The library enables developers to efficiently use models for tasks such as text classification, machine translation, question-answering systems and more.

Official website: https://huggingface.co

GitHub project: https://github.com/huggingface/transformers

On the GitHub project you will also find the citation reference in BibText format. This allows you to import the citation into your reference management program and use it later in different citation styles depending on the occasion.

How to cite: Wolf, T. et al. (2020) ‘Transformers: State-of-the-Art Natural Language Processing’, in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. October 2020 Online: Association for Computational Linguistics. pp. 38–45. [online]. Available from: https://www.aclweb.org/anthology/2020.emnlp-demos.6.

# Mit from x import y, kann man von großen Bibliotheken nur bestimmte Funktionen in das skript laden.
from transformers import pipeline
 
# Gibt das exakte Modell für Sentiment-Analyse an. Sie können modelle auf https://huggingface.co suchen und von dort importieren.
model_name = "lxyuan/distilbert-base-multilingual-cased-sentiments-student"
 
# Lädt das Sentiment-Analyse-Modell
sentiment_analyzer = pipeline("sentiment-analysis", model=model_name)
 
# Legt den zu analisierden Satz in der Variable sentence ab.
sentence = "I abhor the dull routine of existence."
 
# Führt die Sentiment-Analyse aus und speichert das Ergebnis in der result Variable
result = sentiment_analyzer(sentence)
 
# Gibt die Ergebnisse in übersichtlicher Form aus.
print(f"Satz: {sentence}")
print(f"Sentiment: {result[0]['label']}")
print(f"Vertrauen: {result[0]['score']}")

The Transformers Model not only shows the sentiment of the sentence but also the „confidence“ in the result.

1
2
3

Sentence: I abhor the dull routine of existence.
Sentiment: negative
Vertrauen: 0.5697616934776306

Here is another example:

1
2
3

Sentence: A good detective knows that every task, every interaction, no matter how seemingly banal, has the potential to contain multitudes.
Sentiment: positive
Vertrauen: 0.8050756454467773

Task

Sentiment Analyse & NER Mini Chat

Go through the provided code line by line and try to explain in your own words what happens in the individual lines. Add these explanations to the code as comments. You can also pass on parts of the code to any AI-Assistent - but please write down for which lines this is the case.

import spacy
def sentiment_analysis(sentiment_score):
   return "happy" if sentiment_score > 0 else "sad"
def analyze_sentence(sentence):
 
   nlp = spacy.load("de_core_news_sm")
 
   doc = nlp(sentence)
   
   persons = [ent.text for ent in doc.ents if ent.label_ == "PER"]
   if not persons:
       print("I can?t find anyone.")
   else:
       sentiment_score = doc.sentiment
       person = persons[0]
       sentiment = sentiment_analysis(sentiment_score)
       print(f"{person} ist {sentiment}.")
def main():
 
   user_input = input("Give me a sentence to analyse: ")
   
   analyze_sentence(user_input)
   
if __name__ == "__main__":
   main()

Download