How do I deal with additional input information other than images in a convolutional neural network?

Question

I am trying to convert the game state of a board game into the input for a convolutional neural network. A convolutional neural network is useful here because the players have to place items on the board, and the network can take advantage of that spatial structure. I can therefore describe the board well with a binary feature plane for each player (1 if the player has an item on that position and 0 if not).
The players can not only place items but also collect cards. A player can own a maximum of 19 cards, drawn from five different card types. How many cards a player has and of which types is important information for the neural network, but I cannot describe this with another feature plane, because it has nothing to do with spatial structure. So how do I give the convolutional neural network such additional information as input, for example that the player has 6 cards of type A?
There are also "places" where a player can place their items. Each place has a number from 2 to 12 (roughly, how good the place is). I can again describe these places with a feature plane. But I wonder whether the network distinguishes well enough between two numbers like 6 and 7. I could imagine that it distinguishes much better between ones and zeros.

ncasas · Accepted Answer

I think there are three questions here:
How to incorporate non-spatial information into the network?
When combining different information modalities, a typical approach is to do it at the internal representation level, that is, at the point where you lose the spatial information (normally with a flatten operation) after the convolutions. You can process the extra information with an MLP and concatenate its output with the representations obtained by the convolutional layers.
How to represent the cards as input to the network?
To represent the cards a player has, you can treat them as discrete elements (i.e. tokens), just like text is usually handled in neural networks. This way, you can use an embedding layer, which receives the index of each card as input. As the player can have any number of cards, you could process the resulting sequence with an LSTM. To mark the end of the card collection, you can add a special token, and yet another one for padding, which is useful for building minibatches from hands with different numbers of cards.
How to represent the places?
You should decide whether these are better represented as discrete or continuous values, or just try both options and choose the better-performing one. For continuous features, you could add, as you suggested, another feature plane. For discrete features, you would use an embedding layer and then concatenate its output to the other channels.
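The two options for the place numbers can be sketched in plain Python as follows. This is only an illustration: the 3x3 board, the scaling by 12, the embedding size, and all helper names are assumptions made for the example, and the randomly initialized table stands in for an embedding layer that a real framework would train.

```python
import random

# Hypothetical 3x3 board: 0 means "no place here", otherwise the place number (2 to 12).
board = [
    [0, 6, 0],
    [7, 0, 12],
    [0, 2, 0],
]

MIN_NUM, MAX_NUM = 2, 12

def continuous_plane(board):
    """Continuous option: one plane holding the place number scaled by MAX_NUM."""
    return [[cell / MAX_NUM if cell else 0.0 for cell in row] for row in board]

# Discrete option: one embedding vector per possible cell value.
# Row 0 is "no place"; rows 1..11 correspond to the numbers 2..12.
EMBED_DIM = 4  # assumed size
rng = random.Random(0)
place_table = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(12)]

def place_index(cell):
    """Map a cell value to its row in the embedding table."""
    return 0 if cell == 0 else cell - MIN_NUM + 1

def embedded_planes(board, table):
    """Return one plane per embedding component, ready to stack with other channels."""
    return [
        [[table[place_index(cell)][d] for cell in row] for row in board]
        for d in range(len(table[0]))
    ]

value_plane = continuous_plane(board)                 # 1 extra channel
place_planes = embedded_planes(board, place_table)    # EMBED_DIM extra channels
```

With the continuous plane, 6 and 7 differ only slightly in magnitude; with the embedding, each number gets its own vector, so the network is free to treat them as unrelated.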
Update: Some clarifications:

A "token" is a term used in NLP to refer to a value which is discrete, that is, the number of values it can take is finite and normally small. In your case, a card token can take one of five values, one per card type, plus the special end and padding tokens. Usually, we refer to tokens by the index they occupy in the list of all possible values.
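As a rough sketch of what this looks like for the cards, the snippet below maps a hand to token indices, appends an end token, and pads a minibatch to a common length. The card type names "A" to "E", the choice of index 0 for padding and 1 for the end token, and the function names are all assumptions made for illustration.

```python
# Hypothetical vocabulary: two special tokens followed by five card types.
PAD, END = 0, 1
CARD_TYPES = ["A", "B", "C", "D", "E"]  # placeholder names
TOKEN_INDEX = {card: i + 2 for i, card in enumerate(CARD_TYPES)}

def encode_hand(cards):
    """Map a list of card types to token indices and append the END token."""
    return [TOKEN_INDEX[card] for card in cards] + [END]

def pad_batch(encoded_hands):
    """Right-pad every encoded hand with PAD up to the longest hand in the batch."""
    longest = max(len(hand) for hand in encoded_hands)
    return [hand + [PAD] * (longest - len(hand)) for hand in encoded_hands]

hands = [["A", "A", "C"], ["E"], []]
batch = pad_batch([encode_hand(hand) for hand in hands])
print(batch)
```

Each row of `batch` now has the same length, so the rows can be stacked into one integer tensor and fed to an embedding layer.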

In order to represent discrete values in neural networks, we normally represent each different value as a fixed-size vector.

An embedding table is just a table with the fixed-size vectors used to represent your discrete elements. The embedding layer is normally the first in the network architecture. It receives as inputs token indexes and outputs their associated vectors. The entries of the embedding table are updated during the backpropagation process.
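Stripped of any framework, the lookup and the update of an embedding table can be sketched like this. The vocabulary size of 7 (five card types plus padding and end tokens), the vector size, the learning rate, and the function names are assumptions for the example; in practice the embedding layer of your deep learning library does both steps for you.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

VOCAB_SIZE = 7  # assumed: 5 card types + padding + end-of-collection
EMBED_DIM = 3   # assumed vector size

# The embedding table: one fixed-size vector per token index.
embedding_table = [
    [rng.gauss(0.0, 0.1) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)
]

def embed(token_indices, table):
    """Forward pass of an embedding layer: return a copy of each token's vector."""
    return [list(table[i]) for i in token_indices]

def sgd_update(table, token_indices, grads, lr=0.1):
    """One SGD step: only the rows that were looked up receive a gradient."""
    for index, grad in zip(token_indices, grads):
        for d, g in enumerate(grad):
            table[index][d] -= lr * g

tokens = [2, 2, 4, 1]
vectors = embed(tokens, embedding_table)  # 4 vectors of length EMBED_DIM
```

A token that appears several times in the input, like index 2 above, accumulates one gradient contribution per occurrence, while rows of tokens that were not looked up stay unchanged.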

You don't "concatenate an MLP with a convolutional network", you concatenate their outputs. Specifically, once the output of the last convolutional layer is computed, you normally "flatten" it, meaning you remove the spatial information and just place the output tensor elements in a single-dimension vector. That vector is what you concatenate with the output of the MLP, which is also a single-dimension vector (apart from the minibatch dimension).
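The flatten-then-concatenate step can be sketched without any framework as below. Everything concrete here is made up for illustration: the tiny 2-channel feature map stands in for the last convolutional layer's output, a single dense layer with ReLU stands in for the MLP, and a card-count vector scaled by 19 is used as a simple non-spatial input in place of the token-based representation. The minibatch dimension is left out to keep the example short.

```python
def flatten(feature_maps):
    """Turn a [channels][height][width] nested list into one flat vector."""
    return [value for channel in feature_maps for row in channel for value in row]

def dense_relu(x, weights, bias):
    """One fully connected layer followed by ReLU (a minimal stand-in for an MLP)."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(w_row, x)) + b)
        for w_row, b in zip(weights, bias)
    ]

# Stand-in for the last convolutional layer's output: 2 channels of 2x2.
conv_output = [
    [[0.1, 0.2], [0.3, 0.4]],
    [[0.5, 0.6], [0.7, 0.8]],
]

# Assumed non-spatial input: number of cards per type, scaled by the maximum of 19.
card_counts = [6 / 19, 0.0, 2 / 19, 0.0, 1 / 19]

# Made-up weights: 3 hidden units over 5 inputs.
weights = [[0.5] * 5, [-0.5] * 5, [1.0, 0.0, 0.0, 0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

spatial_vector = flatten(conv_output)                   # length 2 * 2 * 2 = 8
extra_vector = dense_relu(card_counts, weights, bias)   # length 3
combined = spatial_vector + extra_vector                # length 11, fed to later dense layers
```

The layers that follow the concatenation see both sources of information in one vector and can learn how the cards should influence the evaluation of the board.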
