In the first part of this series we saw the differences between multiprocessing and multithreading in Python. Although we saw some examples there, I think the subject deserves a more practical look at how these tools can be used in a real data science project.
In very broad terms, a data science project follows four steps:

1. Get the data
2. Process the data
3. Do something with the data
4. Store other data
Naturally, your project might follow somewhat different steps, but I’m sure this blog post will still give you some ideas.
Remember the golden rule: multithreading for I/O-bound tasks and multiprocessing for CPU-bound tasks.
Now, it should not be hard to see that the “get the data” and “store other data” steps are probably the ones where we should use multithreading. The others (“process the data” and “do something with the data”) are most probably better off using multiprocessing. You may argue that processing a lot of data can present an I/O bottleneck, and in some cases you may be right. If that is your case, you should try to break this processing up into “getting data” and “processing data” parts, so that the best strategy can be applied to each of them, as in the sketch below.
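To make this split concrete, here is a minimal sketch (not from the original project; the URLs and functions are hypothetical placeholders) of an I/O-bound “get the data” phase handled by threads, followed by a CPU-bound “process the data” phase handled by processes:

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def fetch(url):
    # I/O bound: the thread mostly waits on the network, so threads help
    with urllib.request.urlopen(url) as response:
        return response.read()

def count_words(raw_bytes):
    # CPU bound: pure computation, so separate processes help
    return len(raw_bytes.split())

if __name__ == '__main__':
    urls = ['https://example.com/a', 'https://example.com/b']

    # multithreading for the I/O-bound "getting data" part
    with ThreadPoolExecutor(max_workers=8) as threads:
        pages = list(threads.map(fetch, urls))

    # multiprocessing for the CPU-bound "processing data" part
    with ProcessPoolExecutor(max_workers=4) as processes:
        counts = list(processes.map(count_words, pages))
```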
Let’s see how we can apply what we’ve learned to a classic Natural Language Processing (NLP) problem: text classification. The idea is to find out the category of a news article from its text alone (e.g. whether it should be categorised as “sports”, “finances”, “economy”, etc.).
If you are here only to see how multiprocessing and multithreading can be used in a real project, just follow along; you don’t really need to know anything about NLP to understand this example. However, if you are interested in the subject, you should start by studying what word vectors are. I like this article. You can also follow this GitHub repository where a colleague and I have been working on some text classification models (there are some cool notebooks there).
Our project is quite simple. We will:

1. Load the pre-trained GloVe word vectors
2. Use them to transform the news articles into document vectors
3. Train and test a classification model
By now it should be straightforward to see that step 1 can be accelerated in Python using multithreading, while step 3 should use multiprocessing.
Let’s start by loading the pre-trained GloVe word vectors. You can check the full code and execute it yourself in this notebook. Loading this pre-trained word vector file can take a lot of time. The file is considerably long, and we have to process it line by line. Each line contains a word followed by one value for each dimension of its word vector.
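For illustration, a line in the 50-dimensional file looks roughly like this (illustrative values, truncated):

```
the 0.418 0.24968 -0.41242 0.1217 0.34527 ...
```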
```python
train = pd.read_csv('data/r8-train-all-terms.txt', header=None, sep='\t')
test = pd.read_csv('data/r8-test-all-terms.txt', header=None, sep='\t')
train.columns = ['label', 'content']
test.columns = ['label', 'content']

vectorizer = GloveVectorizer()
Xtrain = vectorizer.transform(train.content)
Ytrain = train.label
Xtest = vectorizer.transform(test.content)
Ytest = test.label
```
The GloveVectorizer() loads the pre-trained vectors in its __init__ method, and can do it either asynchronously or serially. This is how it performs the basic line-by-line file reading:
```python
import io
import zipfile

import numpy as np

wordVec_dict = {}  # maps each word to its vector
embedding = []     # list of vectors, indexed by position
idx2word = []      # word at each position

with zipfile.ZipFile('data/glove.6B.50d.txt.zip') as zf:
    with io.TextIOWrapper(zf.open('glove.6B.50d.txt'), encoding='utf-8', errors='ignore') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            wordVec_dict[word] = vec
            embedding.append(vec)
            idx2word.append(word)
```
And here is the multithreaded implementation:
```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_vector(line, wordVec_dict, embedding, idx2word):
    values = line.split()
    word = values[0]
    vec = np.asarray(values[1:], dtype='float32')
    wordVec_dict[word] = vec
    embedding.append(vec)
    idx2word.append(word)

executor = ThreadPoolExecutor(max_workers=workers)
futures = []

with zipfile.ZipFile('data/glove.6B.50d.txt.zip') as zf:
    with io.TextIOWrapper(zf.open('glove.6B.50d.txt'), encoding='utf-8', errors='ignore') as f:
        for line in f:
            futures.append(executor.submit(process_vector, line, wordVec_dict, embedding, idx2word))

# block until every submitted task has finished
for future in as_completed(futures):
    pass
```
Again, I strongly recommend checking how it was implemented in the full code.
The ThreadPoolExecutor runs its threads asynchronously. The last for loop guarantees that execution will only continue after all the threads submitted to the executor have finished. Check the Python documentation for more details on how ThreadPoolExecutor works.
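As an aside, an equivalent way to get the same guarantee is to use the executor as a context manager: on exit it calls shutdown(wait=True), which blocks until every submitted task has finished, so the explicit as_completed loop becomes unnecessary. This variant is my own sketch, not how the original notebook does it:

```python
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=10) as executor:
    with zipfile.ZipFile('data/glove.6B.50d.txt.zip') as zf:
        with io.TextIOWrapper(zf.open('glove.6B.50d.txt'), encoding='utf-8', errors='ignore') as f:
            for line in f:
                executor.submit(process_vector, line, wordVec_dict, embedding, idx2word)
# execution only reaches this point after all threads are done
```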
But how much faster does the vector loading get with multithreading? On my MacBook Air, the serial version loads the 400,000 word vectors in around 269.2 seconds. The asynchronous approach loads them in around 27.6 seconds using 10 workers (it could probably reach the same execution time with even fewer workers, as the bottleneck is reading the lines, not processing them).
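If you want to reproduce this kind of measurement yourself, a simple way (my assumption, not necessarily how the notebook times it) is to wrap the load with time.time():

```python
import time

start = time.time()
vectorizer = GloveVectorizer()  # loads the 400,000 vectors in __init__
print(f'Loaded in {time.time() - start:.2f}s')
```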
Now the cool part: training and testing.
Luckily for us, scikit-learn offers multiprocessing natively, simply by setting it in the model’s parameters. The two following code examples train the same model with the same data, first serially and then using a set number of jobs (which are mapped to processes in scikit-learn).
```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200)
model.fit(Xtrain, Ytrain)
```
Using multiprocessing in scikit-learn is as easy as setting the n_jobs model parameter. Here, we will set it to two:
```python
model = RandomForestClassifier(n_estimators=200, n_jobs=2)
model.fit(Xtrain, Ytrain)
```
It’s so easy you may doubt it works. But it does. The following bar graph shows the training time of this same model for different numbers of jobs:
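If you want to run such a comparison yourself, here is a minimal sketch (the timing loop is my own assumption, not the original benchmark code):

```python
import time
from sklearn.ensemble import RandomForestClassifier

for n_jobs in [1, 2, 4, 8]:
    model = RandomForestClassifier(n_estimators=200, n_jobs=n_jobs)
    start = time.time()
    model.fit(Xtrain, Ytrain)
    print(f'n_jobs={n_jobs}: trained in {time.time() - start:.2f}s')
```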
Oh, and if you are curious, the model accuracy (RandomForestClassifier.score) turns out not bad at all:
```
Train score: 0.9992707383773929
Test score: 0.9346733668341709
```
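These numbers come from scikit-learn’s score method, which returns the mean accuracy; a minimal way to print them (the exact print format is my assumption):

```python
print(f'Train score: {model.score(Xtrain, Ytrain)}')
print(f'Test score: {model.score(Xtest, Ytest)}')
```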
I think this covers the most important parts. This last example shows how Python’s multiprocessing and multithreading features can be used to accelerate real projects, sometimes with little to no code modification. All that glitters is not gold, though: you will soon find out, when you move on to more complex parallel and asynchronous executions in Python, that things can get quite messy.