In Tensorflow, most users are familiar with what is known as feeding data. The Tensorflow feed mechanism allows you to inject data into any Tensor in your computational graph. Here is a simple example (original source).

Build And Profile A Model

with tf.Session():
  input = tf.placeholder(tf.float32)
  classifier = ...
  print(classifier.eval(feed_dict={input: my_python_preprocessing_fn()}))

We have built a simple model based on this paper that uses a feed dict to train. You can find the source code for our model here. There are two important aspects to this model.

First, you will see that we are using a feed_dict to run an iterative optimization operation.

_, l = sess.run([train_op, loss], feed_dict={
    u_inputs: user_batch,
    i_inputs: item_batch,
    ratings_input: rating_batch
})

Second, note that the Batch_Dataset class is iterable and has the __next__ method defined.

class Batch_Dataset(object):
    ...
    def __iter__(self):
        return self

    def __next__(self):
        self.iter += 1
        if self.iter > self.stop_iter:
            self.iter = 0
            raise StopIteration
        else:
            rand_indices = np.random.choice(range(len(self.ratings)), size=self.batch_size, replace=False)
            return (np.array(self.user_review_list[rand_indices]),
                    np.array(self.item_review_list[rand_indices]),
                    np.array(self.ratings[rand_indices]).reshape(self.batch_size, 1))

As defined, our __next__ method selects a random subset from three NumPy arrays: one containing user reviews of items, one containing reviews of an item by other users, and one containing the rating the user gave the item. It is important to note that this text data has already been cleaned; it only needs to be converted to indices, and then to embeddings, before being passed through our network.
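
For reference, here is a rough sketch of how the iterable dataset and the feed_dict call fit together in the training loop (the constructor arguments and epoch count are illustrative assumptions, not the repository's exact values):

# Sketch: drive the feed_dict training loop with the iterable Batch_Dataset.
dataset = Batch_Dataset(batch_size=128)  # illustrative arguments

for epoch in range(num_epochs):
    # __next__ raises StopIteration after stop_iter batches, ending the inner loop.
    for user_batch, item_batch, rating_batch in dataset:
        _, l = sess.run([train_op, loss], feed_dict={
            u_inputs: user_batch,
            i_inputs: item_batch,
            ratings_input: rating_batch
        })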

We use Tensorflow to convert the text to indices and then to embeddings by defining a HashTable and an embeddings Variable.

table = tf.contrib.lookup.HashTable(
  tf.contrib.lookup.KeyValueTensorInitializer(keys, values), -1
)

word_embeddings = tf.get_variable(
    "word_embeddings",
    shape=[len(dictionary), emb_size]
)
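
To illustrate how these two pieces might be used together, here is a minimal sketch of the string-to-index-to-embedding conversion (the tokens tensor and the session setup are assumptions for illustration; the repository's exact wiring may differ):

# `tokens` stands in for a dense [batch, time] tensor of word strings.
word_ids = table.lookup(tokens)                                    # strings -> vocabulary indices
word_vectors = tf.nn.embedding_lookup(word_embeddings, word_ids)   # indices -> dense vectors

with tf.Session() as sess:
    sess.run(tf.tables_initializer())             # the HashTable must be initialized before use
    sess.run(tf.global_variables_initializer())   # as must the embeddings Variable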

We found that this model took 59.807 seconds to train to completion. Further, we found that the GPU reached at most 30% utilization.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2027      C   python                                     10941MiB |
+-----------------------------------------------------------------------------+
Sat Apr 21 22:34:11 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   63C    P0    65W / 149W |  10954MiB / 11441MiB |     30%      Default |
+-------------------------------+----------------------+----------------------+

Getting The Most Out Of Tensorflow

For our case this may be sufficient. Waiting only one minute for a model to train is amazing compared to models like AlphaGo, which took 4 - 6 weeks to train even with all the resources available to DeepMind. According to the Tensorflow documentation here:

If GPU utilization is not approaching 80-100%, then the input pipeline may be the bottleneck.

To see what might be holding us back from full GPU utilization, we ran a check with the Python profiler.

python -m cProfile -s tottime DeepCoNN\ -\ feed\ dict.py > profile.txt

Looking inside profile.txt we find the following.

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      825   17.388    0.021   17.388    0.021 {built-in method _pywrap_tensorflow_internal.TF_Run}
    58294   15.318    0.000   15.318    0.000 {built-in method numpy.core.multiarray.array}
    ...

As expected, our model spends the most time in Tensorflow's TF_Run calls, which take approximately 17.388 seconds in total. Interestingly, the next line shows that the second most expensive operation is a cast to NumPy arrays, which takes almost as long as the TF_Run calls. This raises the question: how do we remove this cast to NumPy arrays?

Of course, Tensorflow has a built-in API for Importing Data. There is even a performance guide, which we will use as a roadmap for changing our model's training process.

We can visualize the problem with our model using the following image, taken from the Tensorflow performance guide linked above.

No Pipelining

Our problem is actually twofold. First, our model spends too much time waiting for Python to break the dataset into random batches of NumPy arrays. Second, though less obvious, we then have to transfer that data from Python's environment into the Tensorflow session. Both of these problems can be solved at once with Tensorflow's tf.data API. This API can ingest data from multiple sources, such as CSV files, text files, and in-memory strings, and even multiple files of those types. We will use tfrecords because they are described as the standard Tensorflow format. For now we will assume you have your data in a binary *.tfrecords file of the proper format; because documentation on this is sparse, we will be writing another, shorter post on how to create these data files and work with the API.
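
To give a flavor of what creating such a file involves before that post, here is a minimal sketch of serializing our (user review, item review, rating) triples into a TFRecord file. The file name and feature names below are illustrative assumptions, not the ones used in the repository:

# Sketch: write (user review, item review, rating) triples to a *.tfrecords file.
writer = tf.python_io.TFRecordWriter("reviews.tfrecords")  # hypothetical file name
for user_review, item_review, rating in zip(user_review_list, item_review_list, ratings):
    example = tf.train.Example(features=tf.train.Features(feature={
        "user_review": tf.train.Feature(bytes_list=tf.train.BytesList(value=[user_review.encode()])),
        "item_review": tf.train.Feature(bytes_list=tf.train.BytesList(value=[item_review.encode()])),
        "rating": tf.train.Feature(float_list=tf.train.FloatList(value=[float(rating)])),
    }))
    writer.write(example.SerializeToString())
writer.close()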

Additionally, we will change our model to use the Estimator API to ingest data, which handles variable, queue, and table initialization for us and removes our explicit sess.run calls and feed_dict usage. You can find the final model here. Importantly, you will note that we have defined our model construction in a single function, model_fn(features, labels, mode). The model is fed data through the features parameter via an iterator built by the following function.

def get_dataset_iterator(loc, batch_size, max_len, pad_value):
    dataset = tf.data.TFRecordDataset(loc)
    dataset = dataset.prefetch(batch_size)
    dataset = dataset.map(parse_fn, num_parallel_calls=batch_size)
    dataset = dataset.map(split_fn, num_parallel_calls=batch_size)
    dataset = dataset.map(get_truncate_fn(max_len), num_parallel_calls=batch_size)
    dataset = dataset.padded_batch(batch_size, padded_shapes=([max_len], [max_len], [None]), padding_values=(pad_value, pad_value, 0.0))
    dataset = dataset.shuffle(26352)
    iterator = dataset.make_one_shot_iterator()
    return iterator

As we can see, this reads a tfrecords file from the given location, prefetches a batch of records, parses the data, splits the data (we are working with strings), truncates sequences that are too long, pads and batches the sequences, shuffles the dataset (in our case we know there are 26352 data points), and then returns an iterator that traverses the data. For now we will assume some familiarity with reading tfrecord files; the follow-up post will cover this in more detail.
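
The parse_fn, split_fn, and get_truncate_fn helpers live in the repository; to make the pipeline above a bit more concrete, a parse_fn for the record layout sketched earlier might look roughly like this (the feature names again being assumptions):

def parse_fn(serialized_example):
    # Sketch: deserialize one TFRecord into (user review, item review, rating).
    features = tf.parse_single_example(serialized_example, features={
        "user_review": tf.FixedLenFeature([], tf.string),
        "item_review": tf.FixedLenFeature([], tf.string),
        "rating": tf.FixedLenFeature([], tf.float32),
    })
    # Keep the rating as a length-1 vector so it batches alongside the padded reviews.
    return (features["user_review"], features["item_review"],
            tf.reshape(features["rating"], [1]))

On the Estimator side, the tensors returned by the iterator's get_next() become the features (and labels) arguments of model_fn. A sketch of how the pieces might be wired together, with placeholder paths and hyperparameters:

def input_fn():
    iterator = get_dataset_iterator("reviews.tfrecords", batch_size=128,
                                    max_len=200, pad_value="<PAD>")  # placeholder values
    user_reviews, item_reviews, ratings = iterator.get_next()
    # The first return value becomes model_fn's `features`, the second its `labels`.
    return {"user_reviews": user_reviews, "item_reviews": item_reviews}, ratings

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="checkpoints")
estimator.train(input_fn=input_fn)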

By using tfrecords and the tf.data API, our model trains in 30.879 seconds. Again, think in percentages of improvement rather than absolute terms: our roughly 29-second improvement is almost 50% less time to train the model. That could be a 4 - 6 week training time turning into a 2 - 3 week training time. Looking at our GPU utilization, we now reach as high as 61% at some points, approaching the 80% utilization the Tensorflow documentation gives as a rough heuristic.

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2101      C   python                                     10941MiB |
+-----------------------------------------------------------------------------+
Sat Apr 21 22:35:18 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   66C    P0   106W / 149W |  10954MiB / 11441MiB |     61%      Default |
+-------------------------------+----------------------+----------------------+

Finally, looking at the Python profile, we can see that our cast to NumPy arrays is gone, and the most time-consuming call is TF_Run at about 24 seconds, which makes sense because it now includes all of our data ingestion as well as model training.

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      851   24.341    0.029   24.341    0.029 {built-in method _pywrap_tensorflow_internal.TF_Run}

What we can picture happening now is that the tf.data pipeline reads and preprocesses data before our model requests its next batch. Rather than spending all that time idle as above, each training iteration now looks like the bottom diagram in the following image.

Parallel Data Ingestion
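
The performance guide also suggests ending the pipeline with a prefetch so that preparation of the next batch overlaps with the training step on the current one. Our pipeline above prefetches raw records at the front; a variant along these lines (a sketch, not the repository's code) buffers ready-to-train batches instead:

# Sketch: prefetch prepared batches so preprocessing overlaps with the training step.
dataset = dataset.padded_batch(batch_size, padded_shapes=([max_len], [max_len], [None]),
                               padding_values=(pad_value, pad_value, 0.0))
dataset = dataset.prefetch(1)  # keep one batch buffered ahead of the model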

In conclusion, model training and inference times have room for vast improvement using simple, well-known tools and concepts. Simply using the recommended input format and parallelizing data collection and preprocessing reduced our training time by roughly 50%, which for long-running models is significant.