Holy Book Similarity Analysis¶
CS410 Text Information Systems, Final Project, Fall 2018¶
Click here to start the Docker container on mybinder
For browsing (rather than running the interactive notebook on mybinder), the main notebook is holybooks.ipynb and the PDF equivalent is holybooks.pdf
Authors¶
Graham Chester - grahamc2: Division of Labour, Sourcing/Cleaning book data files, text processing, LDA, t-SNE coding, documentation, voiceover
John Moran - jfmoran2: Division of Labour, NLTK Tutorial, voiceover, mybinder setup, repo management.
Functionality Overview¶
This Jupyter notebook tool was developed to enable the visualisation of similarities and differences between the major texts of many of the world's largest religions. It also includes non-religious works from approximately similar periods (of translation) as a benchmark. While the tool is focused on religious works, it is general enough to be used to compare and visualise books from any genre.
The tool, as supplied, consists of a Jupyter notebook and the raw texts downloaded from the websites below. It requires the installation of a fairly standard Python data science stack as described in the installation section below.
Data Sources¶
All books were sourced from open data sites in accordance with their licensing terms, as follows:
- Christianity: 2.2 billion followers, King James Bible
- Islam: 1.6 billion followers, Quran (Yusuf Ali version)
- Hinduism: 1 billion followers, Bhagavad Gita
- Buddhism: 380 million followers, Tipitaka
- Mormonism: 15 million followers, Book of Mormon
- Judaism: 14 million followers, Torah
- Shakespeare collection
- Jane Austen collection
Installation¶
Option 1: Cloud¶
There are two options for installation. The first and simplest isn't really installation at all: the Jupyter notebook can be started directly from mybinder. It will take several minutes to start as it copies files across from this GitHub repo, then builds and starts a Docker container with the required Python libraries, but in the meantime you can browse the notebook itself.
Option 2: Local Machine¶
This Jupyter notebook is built on a reasonably standard Python data science stack. Perhaps the easiest way to install the prerequisites, if they are not already on your Windows, Mac or Linux machine, is to download and install Anaconda (Python version 3) from here and then run "conda install nltk", or refer to the official NLTK website.
If you have an existing Python 3.5 or above installation and don't wish to install Anaconda, you can do the following, but you may need to be careful with versions:
pip install numpy scipy matplotlib pandas scikit-learn jupyter nltk
You will then need to clone or download the GitHub repo from here. This contains the Jupyter notebook, and the raw religious texts in a directory called 'books-raw'.
At a terminal/command-line window, type 'jupyter notebook' in the directory that contains the notebook and the books-raw directory. This will start the notebook server on port 8888 on your local machine and open a browser window. If you have any problems, check this quickstart guide
Usage¶
Clicking on holybooks.ipynb in the Jupyter notebook file browser window will open this notebook, and putting your cursor in a cell and clicking ">| Run" will run that cell. The processing steps in this notebook are:
1) Load libraries, initialize display options, download stopwords
2) Define utility functions for displaying similarity map, displaying topic words, and filtering text.
3) Text Cleaning: Sections to clean each book by reading the raw text from file(s) in the "books-raw" directory and creating a cleaned file in the 'books' directory. In the case of the old and new testament, a clean file is created for each chapter.
4) Text Processing: Stopword removal, stemming and lemmatization
5) Word Count Vectorisation using TFIDF
6) Topic Discovery using LDA
7) t-SNE based similarity mapping
8) MDS based similarity mapping
Start of Processing¶
Load libraries and initialise options and directory¶
import re
import glob
import lxml.html
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
%matplotlib inline
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE, MDS
# set Jupyter to display ALL output from a cell (not just last output)
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# set pandas and numpy options to make print format nicer
pd.set_option("display.width",100)
pd.set_option("display.max_columns",1000)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.max_rows', 500)
np.set_printoptions(linewidth=120, threshold=5000, edgeitems=10, suppress=True)
seed = 42
# create clean books directory if it doesn't already exist
if not os.path.exists('books'):
    os.makedirs('books')
nltk.download('stopwords')
nltk.download('wordnet')
Utility Functions¶
# plot similarity map and save as png file at good resolution
def plotmap(topics):
    fig, ax = plt.subplots(figsize=(13,12))   # 12,11 or 8,7
    _ = ax.scatter(topics[:,0], topics[:,1], color='black', s=20)
    _ = plt.title('Holy Book Similarity Map', fontsize=20)
    # _ = plt.xlim(-40,60); _ = plt.ylim(-65,60)
    _ = ax.grid()
    colourmap = {'OldTestmnt': 'dodgerblue', 'NewTestmnt': 'slateblue', 'Torah': 'blue', 'Tipitaka': 'palevioletred',
                 'Quran': 'forestgreen', 'SiriGuruGranth': 'teal', 'BhagavadGita': 'indianred', 'BookofMormon': 'gold',
                 'Shakespeare': 'darkviolet', 'JaneAusten': 'magenta'}
    colours = [colourmap[book[:book.index("-")]] for book in book_names]
    for i, txt in enumerate(book_names):
        _ = ax.annotate(txt, (topics[i,0], topics[i,1]), size=9, ha='center', rotation=0, color=colours[i])
    patchList = []
    for key in colourmap:
        data_key = mpatches.Patch(color=colourmap[key], label=key)
        patchList.append(data_key)
    _ = plt.legend(handles=patchList, fontsize=8)
    _ = plt.tight_layout()
    _ = plt.savefig('holybooksplot', dpi=200)
    _ = plt.show()

# import mpld3
# from mpld3 import plugins
# # mpld3.display(fig)
# mpld3.save_html(fig, 'index.html')
def display_topics(model, feature_names, num_top_words):   # display words in topic
    for topic_idx, topic in enumerate(model.components_[:-1,:]):
        print("Topic%2d:" % (topic_idx), end='')
        print(",".join([feature_names[i] for i in topic.argsort()[:-num_top_words - 1:-1]]))
# common filtering function used by all of the book-processing sections below
def stripper(instring):
    outstring = ' '.join(instring.split()).lower()
    outstring = ' '.join(re.findall("[a-zA-Z]+", outstring))
    return outstring
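As a quick sanity check (illustration only, not part of the pipeline), stripper collapses whitespace, lower-cases the text, and keeps only purely alphabetic tokens:
# illustration only: stripper drops verse numbers and punctuation
print(stripper('In the beginning; 1:1  God created...'))   # -> 'in the beginning god created'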
# calculate euclidean distance between two points
from math import hypot
def euclidean_distance(p1, p2):
    x1, y1 = p1
    x2, y2 = p2
    return round(hypot(x2 - x1, y2 - y1), 4)
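As a quick check (illustration only), the classic 3-4-5 right triangle gives the expected distance:
# illustration only: 3-4-5 right triangle
print(euclidean_distance((0, 0), (3, 4)))   # 5.0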
# create dataframe of distances between books
def calc_distances(topics):
    distances = []
    book1 = []
    book2 = []
    for i, vector in enumerate(topics):
        for j in range(i+1, len(topics)):
            book1.append(book_names[i])
            book2.append(book_names[j])
            distances.append(euclidean_distance(topics[i], topics[j]))
    distances_df = pd.DataFrame({'book1': book1, 'book2': book2, 'distance': distances})
    distances_df = distances_df.sort_values('distance')
    return distances_df
Process Old Testament¶
filename = 'OldTestmnt'
book = pd.read_csv('books-raw/'+filename+'-raw.txt', sep='~', header=None) # force all into one column
book['chapter'] = book[0].str.extract('([0-9]?[A-Za-z]*)', expand=False)
book['content'] = book[0].str.split(' ',1).str[1]
book.shape
for chapter in book.chapter.unique():
    content = book[book.chapter==chapter].content.to_string(index=False)
    content = stripper(content)
    with open('books/'+filename+'-'+chapter+'.txt', 'w') as text_file:
        _ = text_file.write(content)
    print(chapter, len(content), ', ', end='')
Process New Testament¶
filename = 'NewTestmnt'
book = pd.read_csv('books-raw/'+filename+'-raw.txt', sep='~', header=None) # force all into one column
book['chapter'] = book[0].str.extract('([0-9]?[A-Za-z]*)', expand=False)
book['content'] = book[0].str.split(' ',1).str[1]
book.shape
for chapter in book.chapter.unique():
    content = book[book.chapter==chapter].content.to_string(index=False)
    content = stripper(content)
    with open('books/'+filename+'-'+chapter+'.txt', 'w') as text_file:
        _ = text_file.write(content)
    print(chapter, len(content), ', ', end='')
Process Bhagavad Gita¶
filename = 'BhagavadGita'
book = pd.read_csv('books-raw/'+filename+'-raw.txt', sep='~', header=None) # force all into one column
book = book[~book[0].str.startswith('This free PDF')]
book = book[book[0].str.len() > 3]
book.shape
content = book[0].to_string(index=False)
content = stripper(content)
with open('books/'+filename+'-None.txt', 'w') as text_file:
    _ = text_file.write(content)
print(filename, len(content))
Process Quran¶
filename = 'Quran'
book = pd.read_csv('books-raw/'+filename+'-raw.txt', sep='~', header=None) # force all into one column
book[0] = book[0].str.split('|',2).str[2]
book.shape
content = book[0].to_string(index=False)
content = stripper(content)
with open('books/'+filename+'-None.txt', 'w') as text_file:
    _ = text_file.write(content)
print(filename, len(content))
Process Shakespeare¶
files = glob.glob("books-raw/Shakespeare*.txt")
for filename in files:
    book = pd.read_csv(filename, sep='~', header=None)   # force all into one column
    content = book[0].to_string(index=False)
    content = stripper(content)
    with open('books/'+filename[10:-8]+'.txt', 'w') as text_file:
        _ = text_file.write(content)
    print(filename, len(content))
Process Jane Austen¶
files = glob.glob("books-raw/JaneAusten*.txt")
for filename in files:
    book = pd.read_csv(filename, sep='~', header=None)   # force all into one column
    content = book[0].to_string(index=False)
    content = stripper(content)
    with open('books/'+filename[10:-8]+'.txt', 'w') as text_file:
        _ = text_file.write(content)
    print(filename, len(content))
Process Siri Guru Granth¶
filename = 'SiriGuruGranth'
book = pd.read_csv('books-raw/'+filename+'-raw.txt', sep='~', header=None) # force all into one column
book.shape
content = book[0].str.replace('[^\x00-\x7F]','').str.replace('(O+ \d+ O+)','').to_string(index=False)
content = stripper(content)
with open('books/'+filename+'-None.txt', 'w') as text_file:
    _ = text_file.write(content)
print(filename, len(content))
Process Torah¶
files = glob.glob("books-raw/Torah*.txt")
for filename in files:
    book = pd.read_csv(filename, sep='~', header=None)   # force all into one column
    content = book[0].str.replace('(\d+,\d+)','').str.replace('{P}','').to_string(index=False)
    content = stripper(content)
    with open('books/'+filename[10:-8]+'.txt', 'w') as text_file:
        _ = text_file.write(content)
    print(filename, len(content))
Process Tipitaka¶
files = glob.glob("books-raw/tipitaka/*.html")
content = ''
for filename in files:
    with open(filename, 'r') as infile:
        raw_text = infile.read()                    # read entire data file into a string
    root = lxml.html.document_fromstring(raw_text)
    for html_class in ['chapter', 'freeverse']:
        parent = root.find_class(html_class)
        if len(parent) > 0:
            parent = parent[0].getchildren()
            for child in parent:
                content += '' if child.text is None else child.text
content = stripper(content)
with open('books/Tipitaka-None.txt', 'w') as text_file:
    _ = text_file.write(content)
print('books/Tipitaka-None.txt', len(content))
Process Book of Mormon¶
filename = 'BookofMormon'
book = pd.read_csv('books-raw/'+filename+'-raw.txt', sep='~', header=None) # force all into one column
book.shape
content = book[0].to_string(index=False)
content = stripper(content)
with open('books/'+filename+'-None.txt', 'w') as text_file:
    _ = text_file.write(content)
print(filename, len(content))
Prepare Text for Analysis¶
Create a list of all stopwords by combining NLTK's English stopwords with a number of Old English words, then convert it to a set for fast lookup.
all_stopwords = ['one','thy','thou','thee','shall','unto','hath','let','ye','shalt','hast','thee','upon','said',
'thine','come','speak','without','us','also','ay','without','name','tis','st',
'saith','thus','thereof','put','may']
all_stopwords.extend(stopwords.words('english'))
all_stopwords = set(all_stopwords)
Create a list "all_books" where each element is the cleaned text of a book with stopwords removed, and create a list of book names where each element is the name of the book.
all_books = []
# book_files = glob.glob("books/*.txt")
book_files = ['books/OldTestmnt-Hos.txt', 'books/JaneAusten-Emma.txt', 'books/OldTestmnt-Ge.txt', 'books/NewTestmnt-1Cor.txt', 'books/OldTestmnt-2Ki.txt', 'books/OldTestmnt-Num.txt', 'books/NewTestmnt-2Pet.txt', 'books/OldTestmnt-Mic.txt', 'books/OldTestmnt-2Sm.txt', 'books/Shakespeare-MidsumNightsDream.txt', 'books/Torah-Genesis.txt', 'books/OldTestmnt-Ezra.txt', 'books/SiriGuruGranth-None.txt', 'books/OldTestmnt-Lam.txt', 'books/BookofMormon-None.txt', 'books/Torah-Leviticus.txt', 'books/OldTestmnt-Nahum.txt', 'books/BhagavadGita-None.txt', 'books/NewTestmnt-Jas.txt', 'books/NewTestmnt-Phi.txt', 'books/NewTestmnt-Jude.txt', 'books/OldTestmnt-Est.txt', 'books/OldTestmnt-2Chr.txt', 'books/OldTestmnt-Amos.txt', 'books/OldTestmnt-Job.txt', 'books/OldTestmnt-1Ki.txt', 'books/Shakespeare-HenryV.txt', 'books/JaneAusten-LadySusan.txt', 'books/NewTestmnt-Phmn.txt', 'books/Tipitaka-None.txt', 'books/NewTestmnt-Gal.txt', 'books/OldTestmnt-Lev.txt', 'books/Shakespeare-RomeoAndJuliet.txt', 'books/Quran-None.txt', 'books/OldTestmnt-Jer.txt', 'books/OldTestmnt-1Sm.txt', 'books/NewTestmnt-Mark.txt', 'books/OldTestmnt-Obad.txt', 'books/NewTestmnt-1Tim.txt', 'books/OldTestmnt-Eccl.txt', 'books/Shakespeare-TamingOfShrew.txt', 'books/OldTestmnt-Joel.txt', 'books/NewTestmnt-Mat.txt', 'books/JaneAusten-NorthangerAbbey.txt', 'books/NewTestmnt-Rom.txt', 'books/Shakespeare-Othello.txt', 'books/OldTestmnt-Mal.txt', 'books/NewTestmnt-Luke.txt', 'books/NewTestmnt-Heb.txt', 'books/OldTestmnt-Neh.txt', 'books/OldTestmnt-1Chr.txt', 'books/OldTestmnt-Ruth.txt', 'books/OldTestmnt-Deu.txt', 'books/NewTestmnt-3Jn.txt', 'books/NewTestmnt-1Pet.txt', 'books/NewTestmnt-John.txt', 'books/NewTestmnt-2Th.txt', 'books/Torah-Deuteronomy.txt', 'books/OldTestmnt-Eze.txt', 'books/NewTestmnt-2Cor.txt', 'books/Torah-Exodus.txt', 'books/OldTestmnt-Josh.txt', 'books/NewTestmnt-Col.txt', 'books/OldTestmnt-Exo.txt', 'books/Torah-Numbers.txt', 'books/NewTestmnt-Titus.txt', 'books/NewTestmnt-1Jn.txt', 'books/JaneAusten-MansfieldPark.txt', 'books/OldTestmnt-Zep.txt', 'books/OldTestmnt-Psa.txt', 'books/Shakespeare-TheTempest.txt', 'books/NewTestmnt-Rev.txt', 'books/NewTestmnt-2Tim.txt', 'books/OldTestmnt-SSol.txt', 'books/OldTestmnt-Hag.txt', 'books/OldTestmnt-Isa.txt', 'books/JaneAusten-PridePrejudice.txt', 'books/OldTestmnt-Jdgs.txt', 'books/Shakespeare-Hamlet.txt', 'books/JaneAusten-LoveFriendship.txt', 'books/NewTestmnt-1Th.txt', 'books/Shakespeare-Macbeth.txt', 'books/OldTestmnt-Dan.txt', 'books/JaneAusten-SenseSensibility.txt', 'books/OldTestmnt-Jonah.txt', 'books/NewTestmnt-Acts.txt', 'books/NewTestmnt-2Jn.txt', 'books/NewTestmnt-Eph.txt', 'books/OldTestmnt-Prv.txt', 'books/OldTestmnt-Zec.txt', 'books/OldTestmnt-Hab.txt']
for book_file in book_files:
    with open(book_file, 'r') as text_file:
        book_text = text_file.read()
    tokens = [word for word in book_text.split() if word not in all_stopwords]
    book_text = ' '.join(tokens)
    all_books.append(book_text)
book_names = [book[6:-4] for book in book_files]
print(book_names)
Pre-process Word Tokens (optional)¶
The following section applies either stemming, lemmatization, or no pre-processing. Pre-processing was found to have no significant benefit for topic identification. To choose a normaliser, set exactly one of the following variables (but not both) to True; set both to False to skip this step.
%%time
stem = False
lemmatize = True
if stem:
    stemmer = nltk.stem.PorterStemmer()
    for i, book_text in enumerate(all_books):
        tokens = [stemmer.stem(word) for word in book_text.split()]
        all_books[i] = ' '.join(tokens)

if lemmatize:
    lemma = nltk.wordnet.WordNetLemmatizer()
    for i, book_text in enumerate(all_books):
        tokens = [lemma.lemmatize(word) for word in book_text.split()]
        all_books[i] = ' '.join(tokens)
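To see what each normaliser actually does (an optional check, not part of the pipeline), the two can be compared on a few sample tokens:
# optional: compare the stemmer and the lemmatizer on a few sample tokens
sample = ['kingdoms', 'running', 'believed']
print([nltk.stem.PorterStemmer().stem(word) for word in sample])
print([nltk.wordnet.WordNetLemmatizer().lemmatize(word) for word in sample])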
Word Count Vectorization¶
The following section applies Term Frequency Inverse Document Frequency (TFIDF) word count vectorization to all_books. It also filters stopwords, uses both unigrams and bigrams, and omits words that occur in fewer than 5% of books or in more than 90% of books.
It creates a numpy array 'wc_vectors' with one row for each book and one column for each word in the vocabulary, the dimensions of which are output below. It also creates a list of the feature names for each of these columns, of which a sample is shown. Note that TFIDF vectorization was found to be more stable than simple word count vectorization (included for reference but commented out).
%%time
# vect = CountVectorizer(stop_words=all_stopwords, ngram_range=(1,2), min_df=0.05, max_df=0.9)
vect = TfidfVectorizer(stop_words=all_stopwords, ngram_range=(1,2), min_df=0.05, max_df=0.9)
wc_vectors = vect.fit_transform(all_books)
feature_names = vect.get_feature_names()
wc_vectors.shape
print('Sample feature names:',feature_names[0:20])
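To make the min_df/max_df thresholds concrete, here is a tiny self-contained illustration on an invented toy corpus (not part of the analysis): a word appearing in more than 90% of the documents is discarded as an effective corpus-specific stopword.
# toy illustration of min_df/max_df filtering (invented corpus, not part of the analysis)
toy_corpus = ['god created the heaven', 'the quality of mercy', 'the tale of two cities']
toy_vect = TfidfVectorizer(min_df=0.05, max_df=0.9)
_ = toy_vect.fit(toy_corpus)
print(toy_vect.get_feature_names())   # 'the' occurs in every document (>90%) so it is dropped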
Topic Discovery¶
Latent Dirichlet Allocation (LDA) is used to identify the specified number of topics from the TFIDF vectors supplied. It returns a numpy array with one row for each book, and one column for each topic weight, with dimensions shown below. The most heavily weighted few words are shown for each topic in the model.
%%time
lda_model = LatentDirichletAllocation(n_components=28, learning_method = 'batch', random_state=seed)
X_topics = lda_model.fit_transform(wc_vectors)
X_topics.shape
display_topics(lda_model, feature_names, 8)
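As an optional extra (a small sketch using the arrays defined above, not part of the original analysis), the dominant topic for each book can be read straight off the LDA output:
# optional: print the most heavily weighted topic for each book
for name, weights in zip(book_names, X_topics):
    print('%-25s topic %2d' % (name, weights.argmax()))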
Apply t-SNE to LDA topics to map to 2-D and plot similarity map¶
t-Distributed Stochastic Neighbor Embedding (t-SNE) projects the multidimensional topic model generated by LDA onto two dimensions to enable plotting. It has two main tunable parameters: perplexity, which loosely balances the local and global aspects of the data, and early_exaggeration, which controls the tightness of clustering. Tuning t-SNE to get the clearest separation requires some experimentation with these parameters. It provides a better visual separation of the data points than MDS (below), but is more sensitive to small changes in the input data.
Further information is available at https://distill.pub/2016/misread-tsne/
tsne = TSNE(n_components=2, perplexity=27, early_exaggeration=12, init='pca',verbose=0, random_state=seed)
tsne_topics = tsne.fit_transform(X_topics)
plotmap(tsne_topics)
distances_df = calc_distances(tsne_topics)
topmost = 12
print('\nTop {} most related books'.format(topmost))
distances_df[:topmost]
print('\nTop {} least related books'.format(topmost))
distances_df[-topmost:]
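Since the best perplexity depends on the data, a simple way to experiment (a sketch with arbitrarily chosen values, not part of the original analysis) is to redraw the map for several settings and compare:
# sketch: redraw the similarity map for a few perplexity values (chosen arbitrarily)
# note: plotmap() saves over holybooksplot.png on each call
for perp in [10, 20, 27, 40]:
    sweep = TSNE(n_components=2, perplexity=perp, early_exaggeration=12, init='pca', random_state=seed)
    plotmap(sweep.fit_transform(X_topics))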
Apply MDS to LDA topics to map to 2-D and plot similarity map¶
Multidimensional Scaling (MDS) is a conceptually simpler approach than t-SNE for projecting the multidimensional topics onto two-dimensional space. It results in more visual clutter (as the points are not 'forced' apart as in the t-SNE method) and lower clustering accuracy, i.e. the most and least related books are not as intuitive.
The t-SNE approach is therefore preferred; this section is included for completeness and reference.
mds = MDS(n_components=2, metric=True, verbose=0, random_state=seed)
mds_topics = mds.fit_transform(X_topics)
plotmap(mds_topics)
distances_df = calc_distances(mds_topics)
topmost = 12
print('\nTop {} most related books'.format(topmost))
distances_df[:topmost]
print('\nTop {} least related books'.format(topmost))
distances_df[-topmost:]