DeepPavlov Customizing Output

Hello, I am currently trying to modify the output of the FAQ model provided by the DeepPavlov framework. I have built it based on the given MIPT FAQ pipeline configuration and am getting good results. The one problem is that I would like to alter how the final answer is returned.

Currently, if a question is asked, the best-chosen answer is given, along with a list of probabilities (similarities) against all the questions in the data as well. I am assuming this is the result of the "out" value in the pipeline config: "out": ["y_pred_answers", "y_pred_proba"]. I have confirmed this by altering it and seeing that the outcome is affected. What I want to do is customize this output by adding my own components to the chainer, where the final output is formatted to my preference. Currently I am using riseapi to make HTTP requests, and it returns values as stated above. I am hoping to change this return value (e.g. the answer and only one probability).

However, this suddenly made me wonder if this is the right way to do it. As far as I understand, the pipeline chainer is there for the DeepPavlov framework to train the model; in other words, the framework follows the chainer to transform the data and train on it (e.g. converting the data to tokens/lemmas). So the chainer really is an outline of the steps the data must follow to create a trained model, not an outline of how we reach an answer once a question is given.

To summarize: is it correct to put a final-output-formatting component into the chainer, which is specifically for training the model? I am afraid this will cause issues I am not yet aware of, because it could directly affect how the model is trained instead of only formatting the final answer. Or is there a simpler way to format the return value? Is it safe to assume that when we pose a question to the chatbot model, it follows the same steps listed in the pipeline, even when it is answering queries rather than being trained?

Thank you so much for your help!

Dear @brandonra97, thank you very much for your interest!

It's OK to put a final-output-formatting component into the chainer. The chainer is not only about training; it is about formatting the output as well. For example, proba2labels and answers_vocab do nothing but format the output.

The specific output is closely related to the infer_method of the sklearn_component component. For example, the default predict_proba outputs a distribution over the possible answers, and the downstream components then rely on the fact that their input is a distribution.
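For illustration, the relevant part of the config looks roughly like this (a sketch based on the FAQ configs; the exact variable names and paths in your pipeline may differ):

{
  "class_name": "sklearn_component",
  "model_class": "sklearn.linear_model:LogisticRegression",
  "infer_method": "predict_proba",
  "in": ["q_vect"],
  "fit_on": ["q_vect", "y_ids"],
  "out": ["y_pred_proba"],
  "save_path": "faq/model.pkl",
  "load_path": "faq/model.pkl"
}

Changing infer_method to predict would make the component output labels directly instead of a distribution, but then the downstream components that expect a distribution would have to be adjusted as well.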

The easiest way to get the desired output is to change the out section as follows:

 "out": [
      "y_pred_answers",
      "y_pred_proba",
      "y_pred_ids"
   ]

where y_pred_answers is the best answer, y_pred_proba is the distribution over the answers, and y_pred_ids is the index of the best answer.
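For example, querying the model over riseapi would then look roughly like this (a sketch; I am assuming the default /model endpoint and port, and that the pipeline's input variable is named q, as in the FAQ configs):

import requests

# send a batch with a single question to the running riseapi server
resp = requests.post("http://0.0.0.0:5000/model", json={"q": ["hi"]})
print(resp.json())
# roughly: [["best answer"], [[0.0123, ..., 0.9185]], [3]]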

Please let me know if you need further assistance.

Hello, I see! I am assuming that queries to the model also follow the pipeline components. Thank you so much for your help. I have a few extra questions, if that's okay :slight_smile:

The exact functionality I am hoping to produce is a JSON-formatted string. Currently it prints:
[["answer from the question."], [[0.91241, 0.1581295, 0.2132, 0.25185, 0.89716, ...]]]

But I want to change it to:
{"answer": "answer to the request", "metadata": [{"question": "the question in the database that the model matched with the request", "points": "the value of similarity of the question to the request"}]}

For instance, there may be an entry in the data where the key is "hello" and the value is "Nice to meet you". If a request string of "hi" comes in and the FAQ model matches it to the "hello" key, the ideal return value from the API would be:
{"answer": "Nice to meet you.", "metadata": [{"question": "hello", "points": "0.9185"}]}
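Conceptually, the formatting step I have in mind is just this (illustrative; I am assuming I can also get at the list of questions from the data file so the id can be mapped back to the matched question):

import json

def format_result(answer, probas, idx, questions):
    # questions: the questions from the data file, in the same order as probas
    return json.dumps({
        "answer": answer,
        "metadata": [{"question": questions[idx],
                      "points": str(probas[idx])}]
    })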

To make this work, as I mentioned, I am going to add a custom component class that I have created myself into the pipeline chainer.

Would this work perfectly fine?

Thank you so much!

This would work perfectly fine. Don't forget that you process the data in batches; that's why everything is a list. You can try modifying any of the existing components to alter the output.

Of course, you can post a link to the snippet if you need help. Moreover, when it's done, please consider creating a pull request; I think this would be a nice feature.

Thank you so much for your reply :slight_smile: It would be an honor to make a contribution to this amazing framework!

However, I am having slight problems while implementing this component. After reading the documentation, I am creating a custom component as a class that inherits the Component class and implements the __call__() method. Inside __call__(), I intend to apply the business logic for the final result formatting.
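In other words, the class has roughly this shape (a simplified skeleton; the real formatting logic is longer):

from deeppavlov.core.models.component import Component

class FormatResult(Component):
    def __call__(self, answers_batch, probas_batch, ids_batch):
        # the chainer passes one batch per "in" variable
        results = []
        for answer, probas, idx in zip(answers_batch, probas_batch, ids_batch):
            results.append({"answer": answer,
                            "metadata": [{"points": str(probas[int(idx)])}]})
        return results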

The problem I am getting is this: I incorporated the component into the configuration by adding:
{
  "in": ["y_pred_answers", "y_pred_proba", "y_pred_ids"],
  "out": ["final_result"],
  "class_name": "result_filter:FormatResult"
}
But when I run a simple test, I get a parameter error stating that there is no "in" parameter. I tried to circumvent this by declaring the parameters in the __init__() constructor and then specifying them in the configuration file; however, even after this, "out" is still read as a parameter and causes problems. Is there a step I have missed that leads to this problem?

File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/deeppavlov/core/common/params.py", line 106, in from_params
    component = obj(**dict(config_params, **kwargs))
TypeError: __init__() missing 1 required positional argument: 'out'

This is the exact console error output that is generated.

Thank you so much for your help.

I have actually fixed this issue!

I now have a new problem: it seems you can only pass one input parameter to a component inside the pipeline. As soon as two parameters are given, there is an error that, as far as I can tell, is a framework rule and cannot be overridden. I tried to supply the values by passing them as extra parameters in the pipeline element, but that caused the variables to be treated as string literals instead of the values they should be holding (e.g. y_pred_ids is passed in as just the string "y_pred_ids").

This is a problem because I need y_pred_ids to know which entry of y_pred_proba to look up and print.
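To illustrate, what I need inside the component is essentially this (illustrative, for the first sample of a batch):

# pick the probability of the predicted answer out of the full distribution
best_proba = y_pred_proba[0][int(y_pred_ids[0])]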

How would I fix this?

Thank you so much!

Hello, I have managed to solve all the issues and have created the custom component. It is doing what it is supposed to, except for one odd behavior.

Following the pipeline configuration, when the FAQ model receives a query, it computes a similarity score against each question in the data file, then selects the highest score and returns its corresponding answer. However, the model behaves oddly in this respect.

First of all, the y_pred_id, which holds the index of the highest score, does not align with the data file. To clarify, if y_pred_id is 6, it does not correspond to the 6th row of the data but to the 2nd. There is a weird offset of 4 between the y_pred_ids/y_pred_probas and the data rows. It also does not always follow this rule, sometimes giving a different offset, which makes it impossible to handle systematically. I tried reading the documentation and source code but still have no idea why this is happening.

Moreover, since the offset is 4, what does it mean when y_pred_id holds an index from 0 to 3? It seems to me that the FAQ model has some built-in default behavior where, if the user's utterance is close to gibberish, it returns ids from 0 to 3 (or something like this).

Another interesting thing to note is that y_pred_id is of type numpy.int64, and when it is converted to a native Python type using the .item() method, if the index was from 0 to 3, it returns -1. Why?

These characteristics are not mentioned in the documentation at all, so I am very confused…

Thank you so much for your help!

Also, please tell me where and how I should submit the pull request!

Hey @brandonra97, sorry for the late response.

It seems you can find an answer to your question in a different thread: DeepPavlov Faq Model Returning Wrong Index.

Regarding the pull request, please follow the contribution guidelines: http://docs.deeppavlov.ai/en/master/devguides/contribution_guide.html

Best Regards.