Building a Natural Language Question & Answer Search Engine

ODSC - Open Data Science
4 min read · Oct 29, 2019


Didn’t have time to read the book for the big quiz? Why not build a system to answer the questions for you?

Using the architecture pictured below, we can build a framework that accepts a natural language question as a query and answers it using a corpus of documents as a knowledge base. To build this architecture, we can piece together open-source work from the fields of information retrieval and deep learning.


This post is written at a very high level. To learn more about the process, you can come to my training session at ODSC West 2019. In the session, we’ll go into the details of how each component works, walk through Python code samples for building each component, and you’ll build your own version of the system using your choice of dataset.

Step 1

The first step in our architecture is to accept a query in the form of a question and return relevant documents that might contain an answer. This can be accomplished in pure Python with the Whoosh library. Whoosh is an information retrieval library that behaves similarly to Elasticsearch. By default, Whoosh uses the Okapi BM25F ranking function to find the documents most similar to the input query. BM25 can be thought of as a relative of TF-IDF with two refinements built in: term frequency saturates, so repeating a word many times yields diminishing returns, and scores are normalized by document length. Using this open-source library, we can retrieve relevant documents in an instant and move on to asking them the big questions.
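To make the ranking idea concrete, here is a minimal sketch of plain BM25 scoring using only the standard library. It is an illustration of the formula, not Whoosh's implementation: the whitespace tokenizer, the function names, the sample documents, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
scores = bm25_scores("cat mat", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(docs[best])
```

In a real system a library such as Whoosh would also handle indexing, stemming, and per-field weighting, none of which this sketch attempts.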

Step 2

Once we have some relevant documents, we can use some more open-source magic to begin asking them questions. To do this, we’ll take advantage of a pre-trained deep learning model distributed by the DeepPavlov Python package. This model was trained on a dataset known as SQuAD (Stanford Question Answering Dataset). The model takes a Wikipedia article and a question as input and outputs the best answer to the question found in the article. The particular model trained and distributed by DeepPavlov uses an architecture that includes BERT (Bidirectional Encoder Representations from Transformers). Models trained with BERT, and its successor ALBERT (A Lite BERT), are no strangers to the top spots of leaderboards such as SQuAD’s.
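It can help to see the shape of the data such a model works with. In SQuAD-style extractive question answering, the answer is a span of the context, identified by its text and its starting character offset. The sketch below builds one such record and recovers the span from it. The helper names and the example sentence are hypothetical, written for illustration rather than taken from the dataset or from DeepPavlov.

```python
def make_record(context, question, answer_text):
    # Locate the answer inside the context; extractive QA assumes it is there.
    start = context.find(answer_text)
    if start == -1:
        raise ValueError("answer must be a span of the context")
    return {
        "context": context,
        "question": question,
        "answers": [{"text": answer_text, "answer_start": start}],
    }


def extract_span(record):
    # Recover the answer purely from the offset and length of the span.
    answer = record["answers"][0]
    start = answer["answer_start"]
    return record["context"][start:start + len(answer["text"])]


record = make_record(
    context="Robert Langdon is a professor of religious symbology.",
    question="What subject is Robert Langdon a professor of?",
    answer_text="religious symbology",
)
print(extract_span(record))
```

A trained reader predicts the span's start and end positions instead of looking them up, but its output can be read the same way: a piece of the input document plus where it was found.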

Step 3

Our final step is to tie together everything we’ve done so far. We now have an information retrieval component that accepts a query and returns documents, and a Q&A component that accepts a query and a document and returns the best answer. To tie the two together, we iterate over the documents returned from Step 1 and ask each of them the question using Step 2. Once we have a list of answers, we can rank them and return the best one, or display a list of possible answers to our user and let their domain expertise decide the best one.
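The loop described above can be sketched as follows. The retriever and reader are passed in as functions so that real components (for example a Whoosh search and a DeepPavlov model) could be swapped in later. Everything named here is hypothetical, and the two toy components are simple word-overlap placeholders, not the actual models: the toy reader returns the whole document as its "answer".

```python
from typing import Callable, List, NamedTuple, Tuple


class Answer(NamedTuple):
    text: str
    score: float
    doc_id: int


def answer_question(
    question: str,
    retrieve: Callable[[str], List[Tuple[int, str]]],
    read: Callable[[str, str], Tuple[str, float]],
    top_k: int = 3,
) -> List[Answer]:
    """Ask the reader about each retrieved document, then rank the answers."""
    candidates = []
    for doc_id, doc_text in retrieve(question)[:top_k]:
        text, score = read(question, doc_text)
        if text:
            candidates.append(Answer(text, score, doc_id))
    # Highest-scoring answer first.
    return sorted(candidates, key=lambda a: a.score, reverse=True)


# Toy stand-ins (assumptions) so the sketch runs without external libraries.
CORPUS = [
    "Robert Langdon is a professor of religious symbology.",
    "The Louvre is a museum in Paris.",
]


def words_of(text):
    return set(text.lower().strip("?.").split())


def toy_retrieve(question):
    # Return every document sharing at least one word with the question.
    q = words_of(question)
    return [(i, d) for i, d in enumerate(CORPUS) if q & words_of(d)]


def toy_read(question, doc_text):
    # Placeholder reader: score is the fraction of question words in the doc.
    q = words_of(question)
    return doc_text, len(q & words_of(doc_text)) / len(q)


answers = answer_question(
    "What is Robert Langdon a professor of?", toy_retrieve, toy_read
)
print(answers[0].text)
```

Returning the full ranked list rather than only the top item keeps both options from the text open: show the best answer, or show several and let the user choose.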

With this system in place, we can have some fun with it and view some example input & output. Below are some real examples from two instances of the system implemented on two different datasets.


Example from the system built on a corpus of Nasdaq articles:

Question: “Who is Elon Musk?”

Top Answer: “eccentric CEO”

Example from the system built on the text of The Da Vinci Code by Dan Brown:

Question: “What subject is Robert Langdon a professor of?”

Top Answer: “RELIGIOUS SYMBOLOGY”
