Some practical methods available when the amount of text data is not large enough

Deep learning often requires a lot of data, otherwise overfitting will occur, the author of this paper proposes some practical methods that can be used when the amount of text data is not large enough to give value to small data sets.

As a data scientist, choosing the right modeling approach and algorithm for your problem should be one of your most important skills.

A few months ago, I was working on a text classification problem where the key was to determine which news articles were relevant to my client.

I only have a few thousand annotated news datasets, so I started with simple classical machine learning modeling methods to solve this problem, such as TF-IDF for logistic regression classification.

In general, these models are suitable for text classification of long documents (such as news, blog posts, etc.), however on my task they performed poorly, only slightly better than random classification.

After researching where the model went wrong, I found that the bag of words representation was not enough for this task, I needed a model that could deeply understand the semantics of the document.

Deep learning models have shown very good results on complex tasks that require deep understanding of text semantics, such as machine translation, automatic question answering, text summarization, natural language inference, etc.

This seems like a perfect approach for my task, but training deep learning models often requires hundreds of thousands or even millions of labeled data, and I only have a small dataset. How to do it?

Usually, we need a lot of data to train deep learning models in order to avoid overfitting. Deep neural networks have a very, very large number of parameters, so if they don't train them with enough data, they tend to memorize the entire training set, which results in good training results but poor test results .

To avoid this situation due to lack of a lot of data, we need to use some special tricks! One hit kill skills!

In this post, I will show some methods developed by myself or found in articles, blogs, forums, Kaggle and some other places, and see how they can make deep learning better without big data complete my task. Many of these methods are based on best practices widely used in computer vision.

A small disclaimer: I'm not a deep learning expert, and this project is just one of the first big projects I've done with deep learning. Everything in this post is a summary of my personal experience, and it's possible that my approach doesn't apply to your problem.

Regularization

Regularization methods are presented in machine learning models in different forms and can be used to avoid overfitting. These methods are very theoretical and are a general approach to most problems.

L1 and L2 regularization

These methods are probably the oldest and have been used for many years in many machine learning models.

When using this method, we add the size of the weights to the model loss function we are trying to minimize. In this way, the model will try to make the weights as small as possible, while those weights that do not affect the model significantly will be reduced to zero.

In this way, we can use a smaller number of weights to remember the training set.

more details:

https://towardsdatascience.com/only-numpy-implementing-different-combination-of-l1-norm-l2-norm-l1-regularization-and-14b01a9773b

dropout

Dropout is another newer regularization method. Its specific approach is that during training, each node (neuron) in the neural network is discarded according to the probability of P (that is, the weight is set to zero). In this way, the network does not depend on specific neurons and their interactions, but must learn each pattern in a different part. This allows the model to focus on learning important patterns that are easier to apply to new data.

Early stopping

Early stopping is a simple regularization method, just monitor the validation set performance and stop training if you see that the validation set performance is no longer improving. This approach is very important in the absence of big data, because the model usually starts to overfit after 5-10 iterations or even less.

reduce the number of parameters

If you don't have a large dataset, then you should carefully design the number of layers in the network and the number of neurons in each layer. Also, special layers like convolutional layers have fewer parameters than fully connected layers, so it can be very useful to use them if possible.

data augmentation

Data augmentation is a method of creating more training data by changing the training data without changing the labels of the data. In computer vision, many image transformation methods are used to augment the dataset size, such as flipping, cropping, scaling, rotating, etc.

These transformations are useful for image-type data, but not for text, such as flipping out a meaningless sentence like "dog loves me", and training a model with it would have no effect. Here are some enhancements to text-based data:

synonym substitution

In this method, we randomly pick some words and replace them with their synonyms, for example, we change the sentence "I like this movie a lot" to "I like this movie very much", so that the sentence still has the same meaning, most likely have the same label. But this approach is not useful for my task, because synonyms have very similar word vectors, so the model treats the two sentences as the same sentence without actually augmenting the dataset.

back translation

In this method, we use machine translation to translate a piece of English into another language, and then translate back to English. This method has been successfully used in the Kaggle malicious comment classification competition.

For example, if we translate "I like this movie very much" into Russian, we get "ÐœÐ½Ðµ Ð¾Ñ‡ÐµÐ½ÑŒ Ð½Ñ€Ð°Ð²Ð¸Ñ‚ÑÑ ÑÑ‚Ð¾Ñ‚ Ñ„Ð¸Ð»ÑŒÐ¼", and when we translate back to English we get "I really like this movie". The method of back translation is not only similar to the ability of synonym replacement, it also has the ability to add or remove words and reorganize sentences while maintaining the original meaning.

document cropping

News articles are usually quite long, and when looking at the data, I found out that the entire article is not needed for classification. Also, I find that the main idea of â€‹â€‹the article is often repeated.

This got me thinking about cropping the article into several sub-articles for data augmentation so that I'll get more data. At the beginning I tried to extract a few sentences from the document and create 10 new documents. These newly created document sentences have no logical relationship, so the classifiers trained with them perform poorly. The second time, I tried breaking each article into paragraphs, each consisting of five consecutive sentences from the article. This method works very well and improves the performance of the classifier by a large amount.

Generative Adversarial Networks

GANs are one of the most exciting recent advances in deep learning, and they are often used to generate new images. The following blog post explains how to use GAN for data augmentation of image data, but some of its methods may also be applicable to text data.

Blog link:

https://towardsdatascience.com/generative-adversarial-networks-for-data-augmentation-experiment-design-2873d586eb59

transfer learning

Transfer learning refers to solving your own problems using network parameters trained for other tasks, usually trained on large datasets. Transfer learning is sometimes used as initialization for certain layers, and sometimes directly for feature extraction to save us from training new models. In computer vision, starting with a pre-trained ImageNet model is a common practice for problem solving, but NLP does not have the large datasets available for transfer learning like ImageNet.

Pretrained word vectors

Deep learning network architectures commonly used in natural language processing usually start with an Embedding Layer, which converts a word from One-Hot Encoding to a numeric vector representation. We can train the embedding layer from scratch or use pretrained word vectors such as Word2Vec, FastText or GloVe.

These word vectors are obtained by training large amounts of data through unsupervised learning methods or directly training on domain-specific datasets.

Pretrained word vectors are very effective because based on big data they provide the model with the context of the words and reduce the parameters of the model, thus significantly reducing the possibility of overfitting.

More about word embeddings:

https://

Pretrained Sentence Vectors

We can convert the input of the model from words to sentences, and in this way we get a simple model with few parameters and good performance. To do this, we can use a pretrained sentence encoder such as Facebook's InferSent or Google's Universal Sentence Encoder.

We can also train a sentence encoder model on the unlabeled data in the dataset using methods such as skip-thought vectors or language models.

More about unsupervised sentence vectors:

https://blog.myyellowroad.com/unsupervised-sentence-representation-with-deep-learning-104b90079a93

Pretrained language models

Many recent papers have achieved amazing results using pre-trained language models on large corpora to handle natural language tasks, such as ULMFIT, Open-AI transformer and BERT. A language model predicts the next word that will appear in a sentence from the previous words.

This pre-training didn't really help me get better results, but the article gave me some ways I haven't tried to help me fine-tune better.

A great blog about pretrained language models:

http://ruder.io/nlp-imagenet/

Pre-trained unsupervised or self-supervised learning

Given a large amount of unlabeled data, we can use unsupervised methods such as autoencoders or masked language models to train the model, which can be done with just the text itself.

Another better option for me is to use a self-supervised model. Self-supervised models can automatically extract labels without human labeling. The Deepmoji project is a good example.

In the Deepmoji project, the authors trained a model to predict emoji in tweets, and when the model performed well, they used the network to pre-train a tweeter's sentiment analysis model to obtain the status of the emoji prediction model.

Emoji prediction and sentiment analysis are clearly very related, so it performs very well as a pre-training task. The use of self-supervision in news data includes predicting headlines, newspapers, the number of comments, the number of retweets, etc. Self-supervision is a very good pre-training method, but it is often difficult to tell the association of surrogate labels from ground truth labels.

Pre-training with off-the-shelf networks

In many companies, most machine learning models for different tasks are built on the same dataset or similar datasets. For example a tweet, we can predict its subject, opinion, number of retweets, etc. It's best to pre-train your network with a network that is already well-established. For my task, applying this method did improve performance.

feature engineering

I know that deep learning "killed" feature engineering, and it's a bit outdated to talk about feature engineering. But when you don't have a lot of data, helping the network learn complex patterns through feature engineering can greatly improve performance. For example, in my classification of news articles, author, newspaper, number of comments, tags, and many more features can help predict tags.

Multimodal Architecture

We can combine document-level features into our model with a multimodal architecture. In the multimodal architecture, we build two different networks, one for text and one for features, merge their output layers (without softmax) and add more layers. These models are difficult to train because these features usually have stronger signals than text, so the network is mainly influenced by the features.

Great Keras tutorial on multimodal networks:

https://medium.com/m/global-identity?redirectUrl=https://becominghuman.ai/neural-networks-for-algorithmic-trading-multimodal-and-multitask-deep-learning-5498e0098caf

This approach improved my model by less than 1%.

word-level features

Word-level features are another type of feature engineering, such as part-of-speech tagging, semantic role tagging, entity extraction, etc. We can combine a one-hot encoded representation or an embedding of a word feature with the word embedding and use it as the input to the model.

We can also use other word features in this method, for example in sentiment analysis task we can take a sentiment dictionary and add another dimension to embed it, with 1 for words in the dictionary and 0 for other words, so the model can easily to learn some of the words it needs to focus on. In my task, I added the dimension of some important entities, which gave a nice performance boost to the model.

Feature Engineering Preprocessing

A final feature engineering approach is to preprocess the input text in a way that makes it easier for the model to learn. An example is "stemming", if sport is not an important label, we can replace the words football, baseball and tennis with sport, this will help the neural network model to understand that the difference between different sports is not important, The parameters in the network can be reduced.

Another example is the use of automatic summarization. As I said before, neural networks do not perform well on long texts, so we can run an automatic summarization algorithm like TextRank on the text and feed the neural network only the important sentences.

my model

After trying different combinations of the methods discussed in this article, the best performing model in my project is the Hierarchical Attention Network (HAN) mentioned in this article, the model uses dropout and early stopping as regularization, and adopts document clipping method for dataset augmentation. I use pre-trained word vectors and a pre-trained network from my company (this network uses the same data, just for a different task).

When doing feature engineering, I added entity word-level features to word embedding vectors. These changes improved my model's accuracy by almost 10%, and the model went from slightly better than random to a level of significant business value.

The application of deep learning on small datasets is still in the early stages of this research area, but it looks like it is gaining popularity, especially for pre-trained language models, I hope researchers and practitioners will find more ways Use deep learning to bring value to every dataset.

Smart Watch

16+ Years Experience Smart Watch manufacturer, ITOPNOO Provide One-stop Smart Wearable devices Solutions For You.

Our Smart Wearable products include android smart watches, Watch For iPhone, Bracelet and Wristband etc.

Leading healthcare navigation services for individuals and families who are generally healthy or face serious medical issues, and health services for employers.

The Trends New Watches Designs. Custom smart watch products designed with the vision of our clients' brands in mind.

Wholesale smart watches,Best Smart Watches,Gifts Wholesalers, smart watch manufacturer

TOPNOTCH INTERNATIONAL GROUP LIMITED , https://www.mic11.com