This first quarter of the 2020-2021 academic year, I am supervising six final-year projects in the Data Science Master's program at UOC, ranging from the development of a COVID-19 FAQ-based Q/A system to building knowledge graphs and performing end-to-end Natural Language Generation. Below is a list of resources and references I offer to get the students started.
Introduction to Natural Language Processing
- A code-first introduction to Natural Language Processing, Fastai Course given by Rachel Thomas. Highly recommended.
- Real-World Natural Language Processing. Manning edition. First chapter of the book.
- Natural Language Processing in Action: Understanding, analyzing, and generating text with Python. Manning edition. First chapter of the book.
Wikipedia and DBPedia
- Wikipedia mining: Wikipedia as a corpus for knowledge extraction. K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. In Annual Wikipedia Conference (Wikimania), 2008.
- To download a small set of Wikipedia articles in XML format, paste the article identifiers (the last segment of each URI, e.g., “Rafael_Nadal”) into “Add pages manually” on Wikipedia's Special:Export page. To get the URIs from the entity names, you can query a SPARQL endpoint.
- DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia, Jens Lehmann et al., Semantic Web Journal 2012
- Wikipedia Data Science: Working with the World’s Largest Encyclopedia. How to programmatically download and parse Wikipedia.
- How to setup DBPedia and Geonames on OpenLink Virtuoso.
- SPARQL and DBPedia: article 1 and article 2.
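The name-to-URI lookup and the XML download mentioned in the list above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the public DBpedia endpoint at https://dbpedia.org/sparql is reachable and that an exact match on `rdfs:label` is good enough for your entities. The function names (`build_label_query`, `lookup_resources`, `export_url`) are hypothetical helpers introduced here for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public DBpedia SPARQL endpoint; swap in your own Virtuoso instance if you host one.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"
# Wikipedia's Special:Export page returns the XML of a single article when given its title.
WIKIPEDIA_EXPORT = "https://en.wikipedia.org/wiki/Special:Export/"


def build_label_query(name, lang="en"):
    """Build a SPARQL query that finds resources whose label matches `name` exactly."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT DISTINCT ?resource WHERE {\n"
        f'  ?resource rdfs:label "{escaped}"@{lang} .\n'
        "} LIMIT 10"
    )


def lookup_resources(name, lang="en"):
    """Send the query to the endpoint and return the matching resource URIs (needs network access)."""
    params = urllib.parse.urlencode(
        {
            "query": build_label_query(name, lang),
            "format": "application/sparql-results+json",
        }
    )
    with urllib.request.urlopen(f"{DBPEDIA_ENDPOINT}?{params}", timeout=30) as response:
        data = json.load(response)
    return [binding["resource"]["value"] for binding in data["results"]["bindings"]]


def export_url(resource_uri):
    """Turn a resource URI into the Special:Export URL of the article with the same identifier."""
    title = resource_uri.rsplit("/", 1)[-1]
    return WIKIPEDIA_EXPORT + title


if __name__ == "__main__":
    for uri in lookup_resources("Rafael Nadal"):
        print(uri, "->", export_url(uri))
```

An exact label match may also return non-article resources (categories, for instance), so in practice you would likely filter the results before downloading.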
Building Knowledge Graphs from Texts
- Mining Knowledge Graphs from texts, 2018 Tutorial
- Knowledge Graph — A Powerful Data Science Technique to Mine Information from Text
- Building a Large-scale, Accurate and Fresh Knowledge Graph, 2018 Tutorial: you will find definitions of what a knowledge graph is, with references, techniques, etc.
- Bring Order to Chaos: A Graph-Based Journey from Textual Data to Wisdom, Blog Entry, 2018.
- Concept Wikification for COVID-19. Automatically recognizes mentions of concepts related to COVID-19 in text and resolves them into Wikipedia titles.
- Building a KG from a new dataset.
Information Extraction and Text Mining tasks
- Entity linking, a primary NLP task for Information Extraction.
- Exploiting semantic similarity for named entity disambiguation in knowledge graphs. G. Zhu, and C. Iglesias. Expert Syst. Appl. (2018)
- Anaphora and coreference resolution: A review. Sukthanker, Rhea, et al. Information Fusion 59 (2020): 139-162
- Relation Extraction: A Survey, CoRR 2017. Sachin Pawar, Girish K. Palshikar and Pushpak Bhattacharyya.
- How to train your dragon…I mean your entity linker, with Spacy.
- A survey on Open Information Extraction, 2018.
Language Models and Bert
- Sentence similarity (including question similarity) with Bert as a service
- SBert, aka Sentence embeddings.
- Sentence similarity using SBert vs Bert.
- Jay Alammar’s Blog, A visual guide to using Bert for the first time. Check out his other visual blog posts: word embeddings, seq2seq, etc.
- BERT Word Embeddings Tutorial, Chris McCormick, 2019.
- BERT fine-tuning tutorial with Pytorch, Chris McCormick, 2020.
Catalan Language Processing
- StanfordNLP - Catalan (Ancora).
- Stanza: a wrapper for using StanfordNLP models with Spacy.
- FreeLing.
- caWaC - Catalan Web Corpus.
- Catalan model for Spacy.
- CalBert: Catalan Albert.
Question Answering (Q/A) systems in general
- Automatic Question Answering: Problem Solved?, by Lluis Màrquez, 2018 talk. Good introduction to the general problem, with a retrospective, achievements and challenges.
- Question Answering, Chapter 25 of Jurafsky and Martin’s book, “Speech and Language Processing”, 2019 edition.
- Tutorial about Open Domain Q/A, 2020
- Reading Wikipedia to Answer Open-Domain Questions, 2017, and implementation in PyTorch, DrQA
- List of Q/A Datasets
Q/A From Knowledge Graphs
- Q/A from knowledge base, 2018
- Introduction to Question Answering over Knowledge Graphs, Blog Post, 2019
- Core techniques of question answering systems over knowledge bases: a survey, 2017
- QALD: Question Answering over Linked Data
- Hybrid Question Answering System based on Natural Language Processing and SPARQL Query
- Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings, 2020.
Q/A about COVID-19
- EMNLP'2020 workshop on NLP and COVID.
- ACL'2020 workshop on NLP and COVID.
- COVID-Q/A Dataset
- How we created an open source covid 19 chatbot, Medium Article
- Testing Bert-based question answering on Coronavirus articles, Medium Article
- What are people asking about COVID 19: a new Question Classification dataset
- COVID-Q Dataset
- Implement Question Answering System on Corona : Approach
- Real time disaster and monitoring tool using Wikipedia. Thomas Steiner & Ruben Verborgh. 2015 AAAI Spring Symposium Series.
Open Q/A with SQuAD
- SQuAD: The Stanford Question Answering Dataset.
- Question Answering with SQuAD Kaggle starter pack
- Question Answering on SQuAD, university project assignment handout
- Kaggle Tensorflow open Q/A Competition
- Applying Bert to Question Answering with SQuAD
Natural Language Generation from Wikipedia Triples and texts
General
- NLP with Deep Learning | Winter 2019 | Lecture 15 – Natural Language Generation: video and slides.
- ACL Conference 2019 Tutorial on Storytelling from structured data and knowledge graphs (especially the introduction).
- Seq2seq NLG: The Good, the Bad and the Boring.
- Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation, by Gatt & Krahmer, 2017.
End-to-end Neural NLG
- Neural Wikipedian: Generating Textual Summaries from Knowledge Base Triples, 2018. Code and Article.
- Neural Text Generation from Structured Data with Application to the Biography Domain, 2016. Dataset and Article.
- Automatic Generation of Company Descriptions, 2018. Dataset and Article.
Datasets
- T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples, 2018
- Wikidata: Automatic Alignment of Wikipedia articles with triples.
- WebNLG: a dataset of triples and Wikipedia texts. Dataset and Article.
Evaluation
- Best practices for the human evaluation of automatically generated text, INLG Conference 2019.
- Evaluating the State-of-the-Art of End-to-End Natural Language Generation: The E2E NLG Challenge, 2020. The paper cites many end-to-end systems that used the challenge dataset.