Ask boolean question to T5
Tags: #huggingface #ml
Author: Jeremy Ravenel

T5-base fine-tuned on BoolQ (a SuperGLUE task)

This notebook demonstrates how to train and use the Text-to-Text Transfer Transformer (better known as T5) on boolean questions (BoolQ). The example use case is a validator that indicates whether an idea is environmentally friendly. Nearly any question can be passed to the query function (see below), as long as a context for the question is provided.
Author: Maximilian Frank (script4all.com) - Copyleft license
Notes:

Loading the model

If an error occurs here, install the packages via python3 -m pip install … --user.
You can also load a plain T5 model (not fine-tuned). Just replace the following code

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained('mrm8488/t5-base-finetuned-boolq')
model = AutoModelForSeq2SeqLM.from_pretrained('mrm8488/t5-base-finetuned-boolq')

with

from transformers import T5Tokenizer, T5ForConditionalGeneration
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')
where t5-small is one of the names in the table above.

Input

Install packages

!pip install transformers
!pip install sentencepiece
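
Before running the remaining cells, it can help to check which of the required packages are already importable. The following is a minimal sketch that uses only the standard library. The helper name missing_packages is hypothetical and not part of the original notebook.

```python
from importlib.util import find_spec

def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]

# Top-level packages this notebook relies on (torch is imported in the next cell).
required = ['transformers', 'sentencepiece', 'torch']
missing = missing_packages(required)
if missing:
    print('Install first:', ' '.join(missing))
else:
    print('All required packages are available.')
```

If the list is not empty, install the named packages as described above and rerun the check.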

Import libraries

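import json
import torch
from operator import itemgetter
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

try:
    from distutils.util import strtobool
except ImportError:  # distutils was removed in Python 3.12
    def strtobool(val):
        # Simplified stand-in: anything not listed here counts as false
        return val.strip().lower() in ('y', 'yes', 't', 'true', 'on', '1')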

Load model

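tokenizer = AutoTokenizer.from_pretrained('mrm8488/t5-base-finetuned-boolq')
# Use the GPU if one is available, otherwise fall back to the CPU
model = AutoModelForSeq2SeqLM.from_pretrained('mrm8488/t5-base-finetuned-boolq').to(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
try:
    model.parallelize()  # spread the layers over several GPUs (deprecated in newer transformers releases)
except Exception:
    pass  # no GPU or method unavailable: keep the model where it is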

Model

Training

Optional: You can skip this step if you don't have custom datasets. By default, the number of training epochs is 0, so nothing is trained.
Warning: Training consumes a lot of runtime and therefore naas.ai credits. Make sure you have enough credits on your account.
For each dataset, provide a stream opener that can be read line by line (e.g. a file or a database). The list stored under the key keys names the dictionary keys that exist in each JSONL line. In this example, the first training dataset has the keys question for the questions (string), passage for the contexts (string), and answer for the answers (boolean). Adjust these keys to your dataset.
Finally, adjust the number of epochs to train (see the comment # epochs).
srcs = [
    {'stream': lambda: open('boolq/train.jsonl', 'r'),
     'keys': ['question', 'passage', 'answer']},
    {'stream': lambda: open('boolq/dev.jsonl', 'r'),
     'keys': ['question', 'passage', 'answer']},
    {'stream': lambda: open('boolq-nat-perturb/train.jsonl', 'r'),
     'keys': ['question', 'passage', 'roberta_hard']}
]
# The optimizer applies the gradients to the weights.
# lr=1e-4 is an assumed starting value; tune it for your data.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
max_len = getattr(model.config, 'n_positions', 512)  # falls back to 512, T5's usual input length
model.train()
for _ in range(0):  # epochs
    for src in srcs:
        with src['stream']() as s:
            for d in s:
                q, p, a = itemgetter(*src['keys'])(json.loads(d))
                tokens = tokenizer('question:'+q+'\ncontext:'+p, return_tensors='pt').to(model.device)
                if len(tokens.input_ids[0]) > max_len:
                    continue  # skip examples that exceed the input limit
                loss = model(input_ids=tokens.input_ids,
                             attention_mask=tokens.attention_mask,
                             labels=tokenizer(str(a), return_tensors='pt').input_ids.to(model.device)
                             ).loss
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
model.eval();  # ; suppresses long output on jupyter
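
To make the dataset format concrete, the sketch below reads two JSONL lines from an in-memory stream and extracts the three fields the same way the training loop does. The sample records and the helper name read_examples are made up for illustration. Only the stream/keys layout comes from the notebook.

```python
import io
import json
from operator import itemgetter

def read_examples(src):
    """Yield (question, passage, answer) tuples from one dataset description."""
    get_fields = itemgetter(*src['keys'])
    with src['stream']() as stream:
        for line in stream:
            if line.strip():  # ignore blank lines
                yield get_fields(json.loads(line))

# Hypothetical two-line dataset held in memory instead of a file.
sample = (
    '{"question": "is water wet", "passage": "Water is a liquid.", "answer": true}\n'
    '{"question": "is fire cold", "passage": "Fire is hot.", "answer": false}\n'
)
src = {'stream': lambda: io.StringIO(sample),
       'keys': ['question', 'passage', 'answer']}

for q, p, a in read_examples(src):
    # str(a) gives 'True' or 'False', the label text used for training
    print('question:' + q + '\ncontext:' + p, '->', str(a))
```

Because the opener is a callable, the same dataset can be reopened at the start of every epoch.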

Define query function

Now that the model is ready, define the query function.
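def query(q='question', c='context'):
    # Build the prompt in the 'question:...\ncontext:...' layout and move it to the model's device
    input_ids = tokenizer.encode('question:'+q+'\ncontext:'+c, return_tensors='pt').to(model.device)
    # Generate the answer tokens for the single input sequence
    output_ids = model.generate(input_ids=input_ids)[0]
    # Decode to text and convert 'True' / 'False' to a truth value
    return strtobool(tokenizer.decode(output_ids, skip_special_tokens=True))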
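
strtobool from distutils raises a ValueError when the generated text is not a recognised truth value, so a more defensive parser can be useful. The following is a sketch that works on plain strings and does not need the model. The helpers build_prompt and parse_answer are hypothetical names, not part of the notebook.

```python
def build_prompt(question, context):
    """Format the input in the same layout the query function uses."""
    return 'question:' + question + '\ncontext:' + context

def parse_answer(text):
    """Map generated text to True/False, or None if it is not a clear answer."""
    normalized = text.strip().lower()
    if normalized in ('true', 'yes', '1'):
        return True
    if normalized in ('false', 'no', '0'):
        return False
    return None  # unexpected output, let the caller decide what to do

print(build_prompt('Is the idea environmentally friendly?',
                   'The idea is to reduce transport distances.'))
print(parse_answer('True'), parse_answer(' false '), parse_answer('maybe'))
```

Returning None instead of raising lets the caller report an undecided answer separately from a clear yes or no.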

Output

Querying on the task

Now the actual task begins: query the model with your ideas (see the list ideas).
if __name__ == '__main__':
    ideas = [
        'The idea is to pollute the air instead of riding the bike.',  # should be false
        'The idea is to go cycling instead of driving the car.',  # should be true
        'The idea is to put your trash everywhere.',  # should be false
        'The idea is to reduce transport distances.',  # should be true
        'The idea is to put plants on all the roofs.',  # should be true
        'The idea is to forbid opensource vaccines.',  # should be true
        'The idea is to go buy an iPhone every five years.',  # should be false
        'The idea is to walk once every week in the nature.',  # should be true
        'The idea is to go buy Green bonds.',  # should be true
        'The idea is to go buy fast fashion.',  # should be false
        'The idea is to buy single-use items.',  # should be false
        'The idea is to drink plastic bottled water.',  # should be false
        'The idea is to use imported goods.',  # should be false
        'The idea is to buy more food than you need.',  # should be false
        'The idea is to eat a lot of meat.',  # should be false
        'The idea is to eat less meat.',  # should be true
        'The idea is to always travel by plane.',  # should be false
        'The idea is to opensource vaccines.'  # should be false
    ]
    for idea in ideas:
        print('🌏 Idea:', idea)
        print('\t✅ Good idea' if query('Is the idea environmentally friendly?', idea) else '\t❌ Bad idea')
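
The "should be" comments can be turned into data so that predictions are scored automatically. In the sketch below, fake_query is a stand-in keyword rule, not the T5 model; pass the real query function instead to score the model. The helper accuracy and the choice of three labelled examples are assumptions for illustration.

```python
def accuracy(labelled_ideas, query_fn):
    """Fraction of ideas for which query_fn agrees with the expected label."""
    question = 'Is the idea environmentally friendly?'
    hits = sum(bool(query_fn(question, idea)) == expected
               for idea, expected in labelled_ideas)
    return hits / len(labelled_ideas)

# A few (idea, expected answer) pairs taken from the list above.
labelled_ideas = [
    ('The idea is to go cycling instead of driving the car.', True),
    ('The idea is to put your trash everywhere.', False),
    ('The idea is to always travel by plane.', False),
]

def fake_query(q, c):
    """Stand-in for query(): a keyword rule so the example runs without the model."""
    return 'cycling' in c

print(accuracy(labelled_ideas, fake_query))
```

Wrapping the result in bool() keeps the comparison valid whether the query function returns True/False or 1/0.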