Create a Natural Language Question Answering system with IBM Watson

Recently I started a new project to build QA system for Mindvalley. After a short discussion with my colleagues we decided to use the 2011 jeopardy winner (Watson).
As you may know Watson is a question answering computer system capable of answering questions posed in natural language, it developed in IBM’s DeepQA project by a research team led by principal investigator David Ferrucci.
IBM also released it’s cloud platform IBM Bluemix in 2014. It supports several programming languages and services as well as integrated DevOps to build, run, deploy and manage applications on the cloud. And of-course Watson API’s are accessible in Bluemix platform.

Before you begin

Before you begin, you need the following pieces

Overview of Question Answer System

The final goal is simple; design and develop an application be able to:

But unfortunately, it is not as easy as advertised by IBM. In past IBM offered IBM Watson Question and Answer (QA) but in summer of 2015 this service replaced by four new services to support customize and embed Q&A applications.
The four replacement services recommended for different types of question and answer capabilities are:

Natural Language Classifier – allows you to Interpret and classify natural language with confidence.

Dialog – allows you to script a conversation and help walk a user through a process

Retrieve and Rank – allows information retrieval with a machine learning model

Document Conversion – takes documents and ‘chunks them up” into smaller answer units to return as passages

based on your input data and requirements, you can choose one or combinations of these services to build a comprehensive and robust system.

My Approach

I decide to use combination of Document Conversion and Retrieve and Rank Services. Document conversion service used to transform documents into Answer Units, and use those as input to train the Retrieve and Rank Service.

Bluemix platform preparation

Before we start you need to prepare and setup your server and services. I decided to deploy my code in Bluemix as well, however you can just create your API in bluemix and deploy your code wherever you like (AWS :))

Login into you Bluemix account, in dashboard click on Service and APIs you need to create two services Document Conversion and Retrieve and Rank and copy and paste the credentials. (it is pretty easy I am not going to explain it)

Document Preparation

The first and most important step is document preparation. The more disciplined you are in your handling of data, the more consistent and better results you are like likely to achieve. Watson Document Conversion service helps you in this process so you don’t need to do it manually but we need to prepare our document for DC service. You may ask how you can prepare a document? the answer is by providing related question as header of each paragraph. DC service is sensitive to headers, it transform document to json format which contains list of answer_units dictionary. look at following example;

 
 {
   "source_document_id":"",
   "timestamp":"2016-04-15T08:42:37.055Z",
   "media_type_detected":"application/wordprocessingml.document",
   "metadata":[{"name":"Content-Type","content":"text/html; charset=UTF-8"},
               {"name":"author","content":"Eva Luo"},
               {"name":"publicationdate","content":"2015-08-20"}],
   "answer_units":[
         {
            "id":"ff4da46f-b07a-4c12-aa2d-d5b0d3414c17",
            "type":"h3",
            "parent_id":"",
            "title":"What is Watson?",
            "direction":"ltr",
            "content":[
               {
                "media_type":"text/plain",
                "text":"Watson is an artificially intelligent computer ..."
               }
               ]}],
   "warnings":[]
   } 

Now we can use list of answer units for training Retrieve and Rank system.

Setup Retrieve and Rank System

This service helps users find the most relevant information for their query by using a combination of search and machine learning algorithms to detect “signals” in the data. Built on top of Apache Solr, developers load their data into the service, train a machine learning model based on known relevant results, then leverage this model to provide improved results to their end users based on their question or query.

Lets makes our hands dirty. if you have worked with Solr before, you may know we need our cluster and collection to index the data, so the let’s setup them.I used Node.js for this project before you start you need to install Node client library to use the Watson Developer Cloud services, a collection of REST APIs and SDKs that use cognitive computing to solve complex problems. npm install watson-developer-cloud --save

the next and last pre step is create conf file;

    var watson = require('watson-developer-cloud');
    
    var retrieve_and_rank = watson.retrieve_and_rank({
        username: '{username}',
        password: '{password}',
        version: 'v1'
    });
    
    var cluster_name = 'your_cluster_name'
    var cluster_size = 'your cluster size'
    
    module.exports.retrieve_and_rank = retrieve_and_rank
    module.exports.cluster_name = cluster_name
    module.exports.cluster_size = cluster_size

1. Create Cluster
To use the Retrieve and Rank service, you must create a Solr cluster. A Solr cluster manages your search collections, which you will create later.

    conf.retrieve_and_rank.createCluster({
        cluster_size: conf.cluster_size,
        cluster_name: conf.cluster_name
    },
    function (err, response) {
        if (err){
            console.log('error:', err);
            res.send(err)
        }
        else{
            res.send(JSON.stringify(response, null, 2));
        }
    });

2. Upload Solr Configuration
The response includes cluster_id and the cluster availability status. The cluster must be ready before you can use it. it takes about few minutes to become ready but you can check the status through API.

After cluster become ready and before you can create collections you need to configure your cluster. The Solr configuration identifies how to index the documents so that you can search the important fields. you can download the sample solr config from here I may write separate post about solar config latter. you can upload one or many configuration to your retrieve and rank service.

    var params = {
        cluster_id: req.query.id,
        config_name: 'mindvalley_config',
        config_zip_path: './public/data/solr_config.zip'
    };

    conf.retrieve_and_rank.uploadConfig(params,
        function (err, response) {
            if (err){
                console.log('error:', err);
                res.send(err)
            }
            else{
                console.log(JSON.stringify(response, null, 2));
                res.send(JSON.stringify(response, null, 2));
            }
        });

3. Create Collection
A Solr collection is a logical index of the data in your documents. A collection is a way to keep data separate in the cloud. you need to assign one of the uploaded configuration to your collection
    var params = {
        cluster_id: '{cluster id you have from step 1}',
        config_name: '{config name which is uploaded in step 2}',
        collection_name: '{collection name}',
        wt: 'json'
    };

    conf.retrieve_and_rank.createCollection(params,
        function (err, response) {
            if (err){
                console.log('error:', err);
                res.send(err);
            }
            else{
                console.log(JSON.stringify(response, null, 2));
                res.send(JSON.stringify(response, null, 2));
            }
        });
4. Index (Add) Documents
Now everything is ready to index (add) the question and answers generated by Document Conversion service into the collection.

    jsonfile.readFile('./public/data/data.json', function (err, obj) {
    obj = obj.answer_units
    for (i in obj) {

        var doc =
        {
            "id": obj[i].id,
            "body": obj[i].content[0].text,
            "title": obj[i].title
        }

        solrClient.add(doc, function (err, response) {
            if (err) {
                console.log('Error indexing document: ', err);
            }
            else {
                console.log('Indexed a document.');
                solrClient.commit(function (err) {
                    if (err) {
                        console.log('Error committing change: ' + err);
                    }
                    else {
                        console.log('Successfully committed changes.');
                    }
                });
            }
        });
        }
    })
5. Create Ground Truth
Ground truth is the collection of questions that are matched to answers. For the Retrieve and Rank service, you label the answers with their relevance to the question. The relevance label helps the ranker determine which features are the most useful. The questions, answers, and relevance labels are combined to create training data. we need to upload the training data to create and train a ranker.
The quality of training set is very important. Your training data need to have following quality criteria to be effective in improving the relevance of answers that are returned from the Retrieve and Rank service:

The relevance file that we need to use in next step it suppose to be in the following CSV format:

  "{question}","{answer_id1}","{relevance_label1}","{answer_id2}","{relevance_label2}","{...}"

6 Build Training Data & Ranker
Good news is you are almost down :). IBM Watson team provide the python scripts which create a training set based on your ground truth csv. The train.py Python script file takes the following format as input:

train.py -u {username:password} -i {relevance_file} -c {cluster_id} -x {collection_name}\
[-r {solr_rows_per_query}] [-n {ranker_name}]

The script creates a file in your working directory called trainingdata.txt. That file is used to create the ranker. When the script is finished, the script window displays a new ranker ID and its status. We’ll need it when we rerank results at runtime.

7 Time for ask question

Let’s test the system. you can search your collection with or without ranker. searching without ranker is useful to create a ground truth.

8 Help me to learn more
you can collect the user satisfaction level, regarding to the system answers in order to train your system and improve the results. you can update your ground truth based on user input and repeat step 6.