This first quarter of the 2021-2022 academic year, I am supervising five final-year projects in the Data Science Master's program at the UOC. Three are in “general” domains: 1) argument mining, 2) mining of encyclopedic knowledge from Wikipedia, and 3) text anonymization. Two are in applied domains: 4) use of NLP in an online medical consultation application, and 5) detection of recurrent defects in aircraft safety reports. Below is a selection of the resources and references I give to the students to get them started (sorry, the bibliographical citations are a bit sloppy).

1. Argument Mining

Argument mining resources page

1.1. Tutorials and surveys

Argument Mining: A Survey, Lawrence and Reed, 2019. Computational Linguistics.
[An insider’s guide to AI: Argument Mining](https://www.contactengine.com/insights/an-insiders-guide-to-ai-argument-mining/), interview with Chris Reed (video)
Advances in Debating Technologies: Building AI That Can Debate Humans (video based on IBM Debater Project, tutorial). ACL 2020
Five Years of Argument Mining: a Data-driven Analysis, Cabrio and Villata, IJCAI 2018.
Tutorial on Advances in Argument Mining, ACL 2019.

1.2. Papers

Stance Classification of Context-Dependent Claims, Bar-Haim et al, 2017.
Towards an Argument Mining Pipeline Transforming Texts to Argument Graphs, Lenz et al, 2020.

1.3. Datasets + tools/APIs

IBM Debater Datasets
BBC Moral Maze: here is the program's page and here are the program's transcripts. This is an interesting dataset, although to the best of my knowledge it has not been annotated.
TARGER: open source argument miner. See the tool's page, the github page and the paper: TARGER: Neural Argument Mining at Your Fingertips, 2019.
ArguminSci: A Tool for Analyzing Argumentation and Rhetorical Aspects in Scientific Writing. Lauscher et al, and the github page.

2. Mining of Encyclopedic Knowledge from Wikipedia

The idea of this project is to extract encyclopedic knowledge from Wikipedia in order to expand a Wikipedia-based knowledge graph. Examples of encyclopedic knowledge include categorical information (e.g., “crew neckline” is a type of neckline but has no article of its own).
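
To make the “crew neckline” example concrete, here is a minimal sketch of one naive way to propose such categorical relations. It is only an illustration, not the method used in the project or in any of the papers below: it assumes a hypothetical list of candidate terms and a hypothetical list of existing article titles, and it applies a simple head-noun heuristic (the last word of an English noun compound is usually its category).

```python
def candidate_hypernyms(terms, article_titles):
    """Propose (term, parent) pairs using a naive head-noun heuristic.

    Both arguments are hypothetical inputs for illustration: `terms` are
    candidate phrases found in text, `article_titles` are titles of
    articles that already exist.
    """
    titles = {title.lower() for title in article_titles}
    pairs = []
    for term in terms:
        normalized = term.lower()
        words = normalized.split()
        # Skip single words and terms that already have their own article.
        if len(words) < 2 or normalized in titles:
            continue
        head = words[-1]  # in English noun compounds the head is usually last
        if head in titles:
            pairs.append((normalized, head))
    return pairs


if __name__ == "__main__":
    # Toy data, invented for this example.
    titles = ["Neckline", "Sleeve"]
    terms = ["crew neckline", "boat neckline", "Neckline", "raglan sleeve", "hemline"]
    for term, parent in candidate_hypernyms(terms, titles):
        print(f"{term} -> is a type of -> {parent}")
```

A heuristic like this over-generates (not every compound names a subtype of its last word), so in practice its output would only be a list of candidates to validate with the techniques described in the references below.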

An open-source toolkit for mining Wikipedia. David Milne, Ian H. Witten, 2013.
Extracting RDF Relations from Wikipedia’s Tables. Emir Muñoz. 2014.
Wikipedia Mining: Wikipedia as a Corpus for Knowledge Extraction. Kotaro Nakayama et al, 2008
Language Models are Open Knowledge Graphs. 2020, Medium article.
T-REx: A Large Scale Alignment of Natural Language with Knowledge Base Triples. Elsahar et al, 2018. In this work, the authors produce a large dataset of Wikipedia abstracts aligned with triples. The goal is to produce an aligned dataset for Natural Language Generation. The dataset is available here.
Extracting Named Entities and Synonyms from Wikipedia for use in News Search. Christian Bøhn. Master project. 2010. A bit old but very relevant.
Entity Extraction from Wikipedia List Pages. Nicolas Heist and Heiko Paulheim. ESWC 2020.
Building up Ontologies with Property Axioms from Wikipedia, Kawakami et al. 2018
Expanding basic ontology from Wikipedia, Castro et al.
Mining Domain-Specific Thesauri from Wikipedia: A Case Study. Milne et al. 2006 (rather old, but seems very relevant)
Automatic Creation of a Domain Specific Thesaurus Using Siamese Networks, Dhaliwal et al. 2021. Related to Milne and Witten's article above, but more recent.
Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases, Weikum et al. 2021. (book)
This project also belongs to the fields of “Ontology Learning” and “Automatic Taxonomy Construction”: see for example “Codifying Collaborative Knowledge: Using Wikipedia as a Basis for Automated Ontology Learning”, Guo et al. (2009)
Keyword Extraction (RAKE and KEA).

3. Text Anonymization

Anonymisation Models for Text Data: State of the art, Challenges and Future Directions, Lison et al. ACL 2021.
Microsoft's Presidio.
John Snow Labs approach to de-identification for healthcare using Spark NLP.
How to Build and Deploy a Document Anonymizer with Streamlit and SpaCy, 2021.
A Python library to de-identify medical records with state-of-the-art NLP methods, called deidentify: github page and blog post.
Blog post on data (not text) anonymization: part 1 and part 2.
Datasets: texts from the European Parliament, which contain names of places and people, although these are public texts.
Medical texts that need de-identification: in this portal you can access some datasets (you probably need to register, though).

4. Gathering Insights from Medical Conversations

Probing Patient Messages Enhanced by Natural Language Processing: A Top-Down Message Corpus Analysis, Mastorakos et al. 2021
Classifying patient portal messages using Convolutional Neural Networks, Sulieman et al, 2017.
A comparison of rule-based and machine learning approaches for classifying patient portal messages, Cronin et al, 2017. Available online from the UOC Library.
Using natural language processing and machine learning to classify health literacy from secure messages: The ECLIPPSE study, 2018
Categorising patient concerns using natural language processing techniques, 2021
Use of suggested messages for (human) chat operators: A Framework to Assist Chat Operators of Mental Healthcare Service, 2020. Madeira et al.
Classifying Electronic Consults for Triage Status and Question Type. Ding et al, 2020.
Summarizing Medical Conversations via Identifying Important Utterances. Song et al, COLING 2020.

5. Identification of Recurrent Defects in Aircraft Maintenance Reports

Ontologies/Taxonomies

Datasets

MaintNet: the resource page and the 2020 paper by Akhbardeh et al., MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources.

NLP for aircraft maintenance

Natural Language Processing for aviation safety reports: from classification to interactive analysis, Computers in Industry 78, 2015
Natural language processing of incident and accident reports: application to risk management in civil aviation, Nikola Tulechki, 2015
FEATURE-TAK - Framework for Extraction, Analysis, and Transformation of Unstructured Textual Aircraft Knowledge, Reuss et al. 2016
Applying NLP tools to occurrence reports (presentation), Groff (National Transportation Safety Board), 2018.
Natural Language Processing Based Method for Clustering and Analysis of Aviation Safety Narratives, Rose et al., 2020.
Detection of Recurring Defects in Airline Incident Reports, Hermelo et al. 2020.
Sequential Pattern Mining Algorithm Based on Text Data: Taking the Fault Text Records as an Example, Yuan et al 2018
Text Mining-based Research on Aircraft Faults Classification and Retrieval Model, Yu et al, 2020.
