MindNLP
Introduction |
Quick Links |
Installation |
Get Started |
Tutorials |
Notes
Introduction
MindNLP is an open source NLP library based on MindSpore. It supports a platform for solving natural language processing tasks, containing many common approaches in NLP. It can help researchers and developers to construct and train models more conveniently and rapidly.
The master branch works with MindSpore master.
Major Features
- Comprehensive data processing: Several classical NLP datasets are packaged into friendly module for easy use, such as Multi30k, SQuAD, CoNLL, etc.
- Friendly NLP model toolset: MindNLP provides various configurable components. It is friendly to customize models using MindNLP.
- Easy-to-use engine: MindNLP simplified complicated training process in MindSpore. It supports Trainer and Evaluator interfaces to train and evaluate models easily.
Quick Links
Installation
Dependency
- mindspore >= 1.8.1
Install from source
To install MindNLP from source, please run:
Get Started
We will next quickly implement a sentiment classification task by using mindnlp.
Define Model
from mindspore import nn
from mindspore import ops
from mindspore.common.initializer import Uniform, HeUniform
from mindnlp.abc import Seq2vecModel
class SentimentClassification(Seq2vecModel):
def construct(self, text):
_, (hidden, _), _ = self.encoder(text)
context = ops.concat((hidden[–2, :, :], hidden[–1, :, :]), axis=1)
output = self.head(context)
return output
Define Hyperparameters
The following are some of the required hyperparameters in the model training process.
hidden_size = 256
output_size = 1
num_layers = 2
bidirectional = True
drop = 0.5
lr = 0.001
Data Preprocessing
The dataset was downloaded and preprocessed by calling the interface of dataset in mindnlp.
Load dataset:
imdb_train, imdb_test = load(‘imdb’, shuffle=True)
Initializes the vocab and tokenizer for preprocessing:
from mindnlp.transforms import BasicTokenizer
embedding, vocab = Glove.from_pretrained(‘6B’, 100, special_tokens=[“<unk>”, “<pad>”], dropout=drop)
tokenizer = BasicTokenizer(True)
The loaded dataset is preprocessed and divided into training and validation:
imdb_train = process(‘imdb’, imdb_train, tokenizer=tokenizer, vocab=vocab,
bucket_boundaries=[400, 500], max_len=600, drop_remainder=True)
imdb_test = process(‘imdb’, imdb_test, tokenizer=tokenizer, vocab=vocab,
bucket_boundaries=[400, 500], max_len=600, drop_remainder=False)
Instantiate Model
# build encoder
lstm_layer = nn.LSTM(100, hidden_size, num_layers=num_layers, batch_first=True,
dropout=dropout, bidirectional=bidirectional)
encoder = RNNEncoder(embedding, lstm_layer)
# build head
head = nn.SequentialCell([
nn.Dropout(1 – dropout),
nn.Sigmoid(),
nn.Dense(hidden_size * 2, output_size,
weight_init=HeUniform(math.sqrt(5)),
bias_init=Uniform(1 / math.sqrt(hidden_size * 2)))
])
# build network
network = SentimentClassification(encoder, head)
loss = nn.BCELoss(reduction=‘mean’)
optimizer = nn.Adam(network.trainable_params(), learning_rate=lr)
Training Process
Now that we have completed all the preparations, we can begin to train the model.
from mindnlp.engine.trainer import Trainer
# define metrics
metric = Accuracy()
# define trainer
trainer = Trainer(network=net, train_dataset=imdb_train, eval_dataset=imdb_valid, metrics=metric,
epochs=5, loss_fn=loss, optimizer=optimizer)
trainer.run(tgt_columns=“label”)
print(“end train”)
License
This project is released under the Apache 2.0 license.
Feedbacks and Contact
The dynamic version is still under development, if you find any issue or have an idea on new features, please don’t hesitate to contact us via Github Issues.
Acknowledgement
MindSpore is an open source project that welcome any contribution and feedback.
We wish that the toolbox and benchmark could serve the growing research
community by providing a flexible as well as standardized toolkit to reimplement existing methods
and develop their own new semantic segmentation methods.
Citation
If you find this project useful in your research, please consider citing:
title={{MindNLP}: a MindSpore NLP library},
author={MindNLP Contributors},
howpublished = {url{https://github.com/mindlab-ai/mindnlp}},
year={2022}
}