Links

Ask boolean question to T5

Tags: #huggingface #ml #sales #ai #text
Author: Jeremy Ravenel​

T5-base finetuned on BoolQ (superglue task)

This notebook is for demonstrating the training and use of the text-to-text-transfer-transformer (better known as T5) on boolean questions (BoolQ). The example use case is a validator indicating if an idea is environmentally friendly. Nearly any question can be passed into the query function (see below) as long as a context to a question is given.
Author: Maximilian Frank (script4all.com) - Copyleft license
Notes:

Loading the model

If here comes an error, install the packages via python3 -m pip install … --user.
You can also load a T5 plain model (not finetuned). Just replace the following code
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained('mrm8488/t5-base-finetuned-boolq')
model = AutoModelForSeq2SeqLM.from_pretrained('mrm8488/t5-base-finetuned-boolq')…
with
from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
where t5-small is one of the names in the table above.

Input

Install packages

!pip install transformers
!pip install sentencepiece

Import libraries

import json
import torch
from operator import itemgetter
from distutils.util import strtobool
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

Load model

tokenizer = AutoTokenizer.from_pretrained("mrm8488/t5-base-finetuned-boolq")
model = AutoModelForSeq2SeqLM.from_pretrained("mrm8488/t5-base-finetuned-boolq").to(
torch.device("cuda" if torch.cuda.is_available() else "cpu")
)
try:
model.parallelize()
except:
pass

Model

Training

Optional: You can leave the following out, if you don't have custom datasets. By default the number of training epochs equals 0, so nothing is trained.
Warning: This option consumes a lot of runtime and thus naas.ai credits. Make sure to have enough credits on your account.
For each dataset a stream-opener has to be provided which is readable line by line (e.g. file, database). In the array with key keys are all dictionary keys which exist in the jsonl-line. So in this example the first training dataset has the keys question for the questions (string),passage for the contexts (string) and answer for the answers (boolean). Adjust these keys to your dataset.
At last you have to adjust the number of epochs to be trained (see comment # epochs).
srcs = [
{
"stream": lambda: open("boolq/train.jsonl", "r"),
"keys": ["question", "passage", "answer"],
},
{
"stream": lambda: open("boolq/dev.jsonl", "r"),
"keys": ["question", "passage", "answer"],
},
{
"stream": lambda: open("boolq-nat-perturb/train.jsonl", "r"),
"keys": ["question", "passage", "roberta_hard"],
},
]
model.train()
for _ in range(0): # epochs
for src in srcs:
with src["stream"]() as s:
for d in s:
q, p, a = itemgetter(src["keys"][0], src["keys"][1], src["keys"][2])(
json.loads(d)
)
tokens = tokenizer(
"question:" + q + "\ncontext:" + p, return_tensors="pt"
)
if len(tokens.input_ids[0]) > model.config.n_positions:
continue
model(
input_ids=tokens.input_ids,
labels=tokenizer(str(a), return_tensors="pt").input_ids,
attention_mask=tokens.attention_mask,
use_cache=True,
).loss.backward()
model.eval()
# ; suppresses long output on jupyter

Define query function

As the model is ready, define the querying function.
def query(q="question", c="context"):
return strtobool(
tokenizer.decode(
token_ids=model.generate(
input_ids=tokenizer.encode(
"question:" + q + "\ncontext:" + c, return_tensors="pt"
)
)[0],
skip_special_tokens=True,
max_length=3,
)
)

Output

Querying on the task

Now the actual task begins: Query the model with your ideas (see list ideas).
if __name__ == "__main__":
ideas = [
"The idea is to pollute the air instead of riding the bike.", # should be false
"The idea is to go cycling instead of driving the car.", # should be true
"The idea is to put your trash everywhere.", # should be false
"The idea is to reduce transport distances.", # should be true
"The idea is to put plants on all the roofs.", # should be true
"The idea is to forbid opensource vaccines.", # should be true
"The idea is to go buy an Iphone every five years.", # should be false
"The idea is to walk once every week in the nature.", # should be true
"The idea is to go buy Green bonds.", # should be true
"The idea is to go buy fast fashion.", # should be false
"The idea is to buy single-use items.", # should be false
"The idea is to drink plastic bottled water.", # should be false
"The idea is to use import goods.", # should be false
"The idea is to use buy more food than you need.", # should be false
"The idea is to eat a lot of meat.", # should be false
"The idea is to eat less meat.", # should be false
"The idea is to always travel by plane.", # should be false
"The idea is to opensource vaccines.", # should be false
]
for idea in ideas:
print("🌏 Idea:", idea)
print(
"\t✅ Good idea"
if query("Is the idea environmentally friendly?", idea)
else "\t❌ Bad idea"
)