This first quarter of the 2020-2021 academic year, I am supervising six final-year projects in the Data Science Master's program at the UOC, ranging from the development of a COVID-19 FAQ-based Q/A system to building knowledge graphs and performing end-to-end Natural Language Generation. Below is a list of resources and references I offer to get the students started.

Introduction to Natural Language Processing

A code-first introduction to Natural Language Processing, Fastai Course given by Rachel Thomas. Highly recommended.
Real-World Natural Language Processing. Manning edition. First chapter of the book.
Natural Language Processing in Action: Understanding, analyzing, and generating text with Python. Manning edition. First chapter of the book.

Wikipedia and DBpedia

Wikipedia mining: Wikipedia as a corpus for knowledge extraction. K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. In Annual Wikipedia Conference (Wikimania), 2008.
To download a small set of Wikipedia articles in XML format, paste their URIs (e.g., “Rafael_Nadal”) into the “Add pages manually” box on Wikipedia's Special:Export page. To get the URIs from the article names, you can query a SPARQL endpoint.
DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia, Jens Lehmann et al., Semantic Web Journal 2012
Wikipedia Data Science: Working with the World’s Largest Encyclopedia. How to programmatically download and parse Wikipedia.
How to set up DBpedia and Geonames on OpenLink Virtuoso.
SPARQL and DBpedia: article 1 and article 2.
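
As a rough illustration of the download tip above, here is a minimal Python sketch that builds the two requests involved: a SPARQL query that looks up a DBpedia resource by its label, and the matching Wikipedia Special:Export URL. The helper names are my own, and the sketch assumes the public DBpedia endpoint at https://dbpedia.org/sparql and the English Wikipedia. It only builds strings; sending the requests is left to whatever HTTP client you prefer.

```python
from urllib.parse import quote, urlencode

# Assumptions: public DBpedia endpoint and English Wikipedia export page.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"
EXPORT_BASE = "https://en.wikipedia.org/wiki/Special:Export"


def build_label_query(name: str, lang: str = "en") -> str:
    """Return a SPARQL query that finds resources whose rdfs:label is `name`."""
    # Escape backslashes and double quotes so the name is a valid literal.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?resource WHERE {\n"
        f'  ?resource rdfs:label "{escaped}"@{lang} .\n'
        "}\n"
        "LIMIT 10"
    )


def sparql_request_url(query: str) -> str:
    """Return a GET URL for the endpoint, asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return f"{DBPEDIA_ENDPOINT}?{urlencode(params)}"


def resource_to_title(resource_uri: str) -> str:
    """Extract the page identifier (last path segment) from a resource URI."""
    return resource_uri.rsplit("/", 1)[-1]


def export_url(title: str) -> str:
    """Return the Special:Export URL that serves one article as XML."""
    return f"{EXPORT_BASE}/{quote(title)}"


if __name__ == "__main__":
    query = build_label_query("Rafael Nadal")
    print(sparql_request_url(query))
    # Suppose the endpoint answered with this resource URI:
    title = resource_to_title("http://dbpedia.org/resource/Rafael_Nadal")
    print(export_url(title))
```

Matching on the exact label is the simplest option, but it misses redirects and ambiguous names, so treat it as a starting point only.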

Building Knowledge Graphs from Texts

Mining Knowledge Graphs from texts, 2018 Tutorial
Knowledge Graph — A Powerful Data Science Technique to Mine Information from Text
Building a Large-scale, Accurate and Fresh Knowledge Graph, 2018 Tutorial: you will find definitions of what a knowledge graph is, with references, techniques, etc.
Bring Order to Chaos: A Graph-Based Journey from Textual Data to Wisdom, Blog Entry, 2018.
Concept Wikification for COVID-19. Automatically recognizes mentions of concepts related to COVID-19 in text and resolves them into Wikipedia titles.
Building a KG from a new dataset.

Information Extraction and Text Mining tasks

Entity linking, a primary NLP task for Information Extraction.
Exploiting semantic similarity for named entity disambiguation in knowledge graphs. G. Zhu and C. Iglesias. Expert Syst. Appl. (2018).
Anaphora and coreference resolution: A review. Sukthanker, Rhea, et al. Information Fusion 59 (2020): 139-162
Relation Extraction: A Survey, CoRR 2017. Sachin Pawar, Girish K. Palshikar and Pushpak Bhattacharyya.
How to train your dragon… I mean your entity linker, with spaCy.
A survey on Open Information Extraction, 2018.

Language Models and BERT

Sentence similarity (including question similarity) with BERT as a service.
SBERT, aka sentence embeddings.
Sentence similarity using SBERT vs BERT.
Jay Alammar’s blog: A Visual Guide to Using BERT for the First Time. Check out his other visual blog posts on word embeddings, seq2seq, etc.
BERT Word Embeddings Tutorial, Chris McCormick, 2019.
BERT fine-tuning tutorial with Pytorch, Chris McCormick, 2020.
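
The sentence-similarity resources above all end with the same step: once each sentence has an embedding vector, candidates are compared to the query by cosine similarity. Here is a minimal sketch of that step in plain Python. The function names are mine, and the vectors in the usage example are made-up two-dimensional placeholders, not real BERT or SBERT outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def most_similar(query_vec, candidate_vecs):
    """Return (index, score) of the candidate closest to the query."""
    scores = [cosine_similarity(query_vec, vec) for vec in candidate_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


if __name__ == "__main__":
    # Placeholder embeddings: one user question and three FAQ questions.
    user_question = [1.0, 0.2]
    faq_questions = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    index, score = most_similar(user_question, faq_questions)
    print(index, round(score, 3))
```

In a FAQ-based Q/A system, the returned index would point to the stored question whose answer is shown to the user.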

Catalan Language Processing

StanfordNLP - Catalan (Ancora).
Stanza: a wrapper that lets you use StanfordNLP models with spaCy.
FreeLing.
caWaC - Catalan Web Corpus.
Catalan model for spaCy.
CalBert: Catalan Albert.

Question Answering (Q/A) systems in general

Automatic Question Answering: Problem Solved?, by Lluís Màrquez, 2018 talk. Good intro to the general problem, with a retrospective, achievements and challenges.
Question Answering, Chapter 25 of Jurafsky and Martin’s book, “Speech and Language Processing”, 2019 edition.
Tutorial about Open Domain Q/A, 2020
Reading Wikipedia to Answer Open-Domain Questions, 2017, and implementation in PyTorch, DrQA
List of Q/A Datasets

Q/A From Knowledge Graphs

Q/A about COVID-19

EMNLP'2020 workshop on NLP and COVID.
ACL'2020 workshop on NLP and COVID.
COVID-Q/A Dataset
How we created an open source covid 19 chatbot, Medium Article
Testing Bert-based question answering on Coronavirus articles, Medium Article
What are people asking about COVID 19: a new Question Classification dataset
COVID-Q Dataset
Implement Question Answering System on Corona : Approach
Real time disaster and monitoring tool using Wikipedia. Thomas Steiner & Ruben Verborgh. 2015 AAAI Spring Symposium Series.

Open Q/A with SQuAD

SQuAD: The Stanford Question Answering Dataset.
Question Answering with SQuAD Kaggle starter pack
Question Answering on SQuAD, university project assignment handout
Kaggle TensorFlow open Q/A Competition
Applying BERT to Question Answering with SQuAD

Natural Language Generation from Wikipedia Triples and texts

General

NLP with Deep Learning | Winter 2019 | Lecture 15 – Natural Language Generation: video and slides.
ACL Conference 2019 Tutorial on Storytelling from structured data and knowledge graphs (especially the introduction).
Seq2seq NLG: The Good, the Bad and the Boring.
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation, by Gatt & Krahmer, 2017.

End-to-end Neural NLG

Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples, 2018. Code and Article.
Neural Text Generation from Structured Data with Application to the Biography Domain, 2016. Dataset and Article.
Automatic Generation of Company Descriptions, 2018. Dataset and Article.

Datasets

T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples, 2018
Wikidata: Automatic Alignment of Wikipedia articles with triples.
WebNLG: a dataset of triples and Wikipedia texts. Dataset and Article.

Evaluation

Best practices for the human evaluation of automatically generated text, INLG Conference 2019.
Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge, 2020. This paper cites many end-to-end systems that used the challenge dataset.