FastText embeddings done right
An important feature of FastText embeddings is their use of subword information. In addition to the vocabulary, FastText also stores vectors for words' character n-grams. This additional information is useful for handling Out-Of-Vocabulary words, extracting meaning from a word's etymology, and dealing with misspellings.
Unfortunately, all these advantages go unused in most open source projects.
We can easily see this on GitHub. The point is that the regular Embedding layer maps each whole word to a single fixed vector stored in memory. In this case all word vectors have to be generated in advance, so none of the cool subword features can work.
The good news is that using FastText correctly is not so difficult! FacebookResearch provides an example of the proper way to use FastText with the PyTorch framework. Instead of the regular Embedding you should choose the EmbeddingBag layer. It combines a word's n-grams into a single word vector which can be used as usual. This way our neural network gets all the advantages listed above.
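To make this concrete, here is a minimal sketch of the idea (not FacebookResearch's exact example): it loads pretrained fastText vectors into an EmbeddingBag and looks a word up through its subword n-grams. The model path is an assumption, and fasttext refers to the official Python bindings.

import fasttext
import torch
import torch.nn as nn

ft = fasttext.load_model("cc.en.300.bin")  # assumed path to a pretrained .bin model
matrix = torch.tensor(ft.get_input_matrix())  # rows: vocabulary words + n-gram buckets
embedding = nn.EmbeddingBag.from_pretrained(matrix, mode="mean")

def word_vector(word):
    # get_subwords returns the word's n-grams together with their row indices
    _, ids = ft.get_subwords(word)
    ids = torch.tensor(ids, dtype=torch.long)
    offsets = torch.tensor([0])  # a single bag, i.e. one word
    return embedding(ids, offsets)

print(word_vector("misspeling").shape)  # works even for an Out-Of-Vocabulary word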
...or you can just extend the collate_fn that is passed to the DataLoader in PyTorch =)
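A hedged sketch of what such a collate_fn might look like, assuming the Dataset yields (list of n-gram ids, label) pairs; all names here are illustrative:

import torch
from torch.utils.data import DataLoader

def collate_ngrams(batch):
    # Pack variable-length n-gram id lists into the flat (ids, offsets)
    # format that nn.EmbeddingBag expects.
    ids, offsets, labels = [], [0], []
    for ngram_ids, label in batch:
        ids.extend(ngram_ids)
        offsets.append(offsets[-1] + len(ngram_ids))
        labels.append(label)
    return (torch.tensor(ids, dtype=torch.long),
            torch.tensor(offsets[:-1], dtype=torch.long),
            torch.tensor(labels))

loader = DataLoader(dataset, batch_size=32, collate_fn=collate_ngrams)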
Parallel preprocessing with multiprocessing
Using multiple processes to construct training batches can significantly reduce the total training time of your network. Basically, if you are training on a GPU, you can reduce the additional batch construction time almost to zero. This is achieved through pipelining of computations: while the GPU crunches numbers, the CPU does the preprocessing. Python's multiprocessing module allows us to implement such pipelining about as elegantly as is possible in a language with a GIL.
PyTorch's DataLoader class, for example, also uses multiprocessing in its internals. But DataLoader suffers from a lack of flexibility: it is impossible to create a batch with an arbitrarily complex structure within the standard DataLoader class. So it is useful to be able to apply raw multiprocessing yourself.
The multiprocessing module gives us a set of useful APIs to distribute computations among several processes. Processes do not share memory with each other, so data is transmitted via inter-process communication; on Linux-like operating systems, for example, multiprocessing uses pipes. This organization leads to some pitfalls that I am going to tell you about.
Pool.map and Pool.imap may be used to apply preprocessing to batches. Both of them take a processing function and an iterable as arguments. The difference is that imap is lazy: it returns processed elements as soon as they are ready, so all the processed batches do not have to be stored in RAM simultaneously. For training a neural network you should always prefer imap:
from multiprocessing import Pool
with Pool(threads) as pool:
    for batch in pool.imap(foo, batch_reader):
        train_on_batch(batch)  # placeholder for your GPU training step
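For contrast, here is a sketch of what to avoid: Pool.map materializes the whole iterable and waits for every result before the loop can start, so all preprocessed batches end up in RAM at once.

with Pool(threads) as pool:
    for batch in pool.map(foo, batch_reader):  # blocks until ALL batches are processed
        train_on_batch(batch)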
Another pitfall is associated with the need to transfer objects via pipes. In addition to the processing results, multiprocessing will also serialize the transformation object if it is used like this:
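# A hypothetical illustration (the original snippet is not reproduced here):
# passing a bound method means the whole transformer object is pickled
# and sent to the worker processes along with the tasks.
with Pool(threads) as pool:
    for batch in pool.imap(transformer.foo, batch_reader):
        train_on_batch(batch)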
Here transformer will be serialized and sent to the subprocesses, which may lead to problems if the transformer object has large properties. In this case it may be better to store the large properties as singleton class variables, since class attributes are not included when an instance is pickled (with the fork start method on Linux the worker processes simply inherit them from the parent):
class Transformer:
    large_dictionary = None  # stored on the class, so it is not pickled with each instance

    def __init__(self, large_dictionary, **kwargs):
        self.__class__.large_dictionary = large_dictionary

    def foo(self, x):
        y = self.large_dictionary[x]
        return y
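A hedged usage sketch under these assumptions (batch_reader and train_on_batch are the placeholders from above):

transformer = Transformer(large_dictionary)  # fills the class attribute once, in the parent process
with Pool(threads) as pool:
    for batch in pool.imap(transformer.foo, batch_reader):
        train_on_batch(batch)  # only the lightweight instance is pickled per task now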
Another difficulty you may encounter is when the preprocessor is faster than the GPU training step. In this case preprocessed batches accumulate in memory, and if your memory is not large enough you will get an Out-of-Memory error. One way to solve this problem is to pause batch preprocessing until the GPU has caught up. A Semaphore is the perfect tool for this task:
from multiprocessing import Pool, Semaphore

def batch_reader(semaphore):
    for batch in source:
        semaphore.acquire()  # blocks once limit batches are already waiting to be consumed
        yield batch

def plus(x):
    return x + 1

def pooling():
    with Pool(threads) as pool:
        semaphore = Semaphore(limit)
        for x in pool.imap(plus, batch_reader(semaphore)):
            yield x
            semaphore.release()  # one batch consumed, allow reading the next one

for x in pooling():
    print(x)
Semaphore has an internal counter synchronized across all worker processes. acquire() decrements the counter and blocks once it reaches zero, so no more than limit batches can be read ahead of the consumer; release() increments it back after a batch has been consumed.