Corpus Generator & NLP Analysis


Case Studies.

See KnoGlo’s corpus generator and NLP analysis tool in action, and what the results mean to us.

Example 1. See how the Internet has evolved in the past 50 years.

Keywords: Internet, ARPANET, and other related terms. Years: 1969 to 2019, at 5-year intervals.


Generating Corpus

The corpus contains 55 text documents. Each document holds up to 50 titles and abstracts about one selected topic, all published in one specific year.

For this research, I picked several keywords and a time range of 1969-2019, sampled at 5-year intervals. The selected keywords are Internet, ARPANET, computer network, computer, and World Wide Web.
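The document count follows from the setup above: five keywords times eleven sampled years gives 55 files. The sketch below only enumerates the keyword and year pairs. It does not call the Springer API, and the file naming scheme is a hypothetical one chosen for illustration, not necessarily what the corpus generator uses.

```python
from itertools import product

# Keywords and year range as stated in the text above.
KEYWORDS = ["Internet", "ARPANET", "computer network", "computer", "World Wide Web"]
YEARS = list(range(1969, 2020, 5))  # 1969, 1974, ..., 2019


def corpus_plan(keywords, years):
    """Return one (keyword, year, filename) entry per planned corpus document."""
    plan = []
    for keyword, year in product(keywords, years):
        # Hypothetical naming scheme, for illustration only.
        filename = f"{keyword.lower().replace(' ', '_')}_{year}.txt"
        plan.append((keyword, year, filename))
    return plan


plan = corpus_plan(KEYWORDS, YEARS)
print(len(YEARS), "sampled years,", len(plan), "documents")
```

Each entry in the plan would correspond to one text file of up to 50 titles and abstracts.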

While building the corpus, I found something interesting: there were NO results for “World Wide Web” until 1994.

NLP Topic Modeling

I tried several different numbers of topics. With more topics, the word clouds appeared to surface more of the top n words, but some of those words kept repeating across topics. In this case, 4 topics returned a fairly satisfying result.
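One way to put a number on the repetition between topics is to measure how much the top-word lists overlap. The sketch below is an assumption about how that could be checked, not the method used in the notebook: it computes the mean pairwise Jaccard similarity of the top words, and the example word lists are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two word lists: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_topic_overlap(topics):
    """Average Jaccard similarity over all pairs of topics' top-word lists."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up top words, for illustration only (not results from the notebook).
example_topics = [
    ["network", "protocol", "packet"],
    ["network", "web", "browser"],
    ["gene", "protein", "cell"],
]
print(round(mean_topic_overlap(example_topics), 3))
```

A lower value means the topics share fewer top words. Comparing this value across runs with different topic counts is one possible way to support the choice by eye.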

View the complete topic modeling result with 4 topics by running this IPYNB file in Jupyter Notebook.

This file has the output results for Example 1. Do not re-run the code, or the old results will be overwritten.

Note: after changing the number of topics, be sure to change the rows and columns of the plots for the distribution of word counts by dominant topic, the word cloud, and the word count & importance of topic keywords to match your number of topics (e.g., 9 topics with a 3x3 visualization).
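The rows and columns can also be derived from the number of topics instead of being edited by hand. This is a minimal sketch of that idea. The function name `grid_shape` is hypothetical and does not come from the notebook.

```python
import math


def grid_shape(n_topics):
    """Return (rows, cols) for a near-square grid with at least n_topics cells."""
    if n_topics < 1:
        raise ValueError("n_topics must be at least 1")
    # Smallest integer cols such that cols * cols >= n_topics.
    cols = math.isqrt(n_topics - 1) + 1
    # Ceiling division: enough rows to hold every topic.
    rows = -(-n_topics // cols)
    return rows, cols


for n in (4, 5, 9):
    print(n, "topics ->", grid_shape(n))
```

The resulting pair could then be passed to a plotting call such as matplotlib's `plt.subplots(rows, cols)`. When the number of topics does not fill the grid, the leftover axes would need to be hidden.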

Compare with Statistical Metadata Results

We can compare the topic modeling results from our corpus with the original subject and keyword tags from Springer. The topic modeling results might not contain everything that is included in Springer’s tags, since my free API plan only allows me to get 50 titles and abstracts for each text file. Even so, some topics line up with the tags. For example, the “Medicine & Health” subject tag corresponds to top n words such as protein, gene, and cell in one of the topics from the topic modeling result.

Example 2. Medicine in Japan.

Keyword: medicine. Country: Japan. Years: 1989 to 2019, at 5-year intervals.


Coming soon. Please check back later.