Clustering of unlabeled ship images

Question

I want to create a ship detection classifier from a dataset of 4000 photos (3072 × 2048 pixels).

However, the dataset I currently have is not labeled, so I cannot feed it to a CNN yet. I therefore want to cluster it into 2 labels (or 2 directories): ship and no_ship. I tried running k-means, but the results were disappointing. Is there a more effective way to do this?

Vincent Yong · Answer

Here's what you could try.

Find a pre-trained network that is capable of detecting ships (for example, a network trained on ImageNet). You only need the layers up to the one just before the Softmax() layer (or the output right after the Flatten() layer), which gives you a feature embedding for each image.
If there are multiple types of ships you want to detect, I would pass multiple images of ships and non-ships into the network. For each ship/non-ship image, you will obtain a 1-D feature embedding. You could then average the embeddings of all the ship images and, separately, of all the non-ship images you chose. The two averages tell you roughly what the embedding of a picture with or without a ship should look like.
Lastly, pass each image in your unlabelled dataset through the network and use a distance metric to see whether its embedding is closer to the one that represents ships or the one that represents non-ships. You could use different metrics as shown here: https://dataaspirant.com/2015/04/11/five-most-popular-similarity-measures-implementation-in-python/
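
To make the averaging and distance steps concrete, here is a minimal NumPy sketch. It assumes the embeddings have already been extracted with whatever pre-trained network you picked; the random vectors below are only a synthetic stand-in for them, and the function names (build_prototypes, assign_labels) are made up for this example. Cosine similarity is used as the metric, but any of the measures from the link could be swapped in.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)

def build_prototypes(ship_emb, no_ship_emb):
    """Average the hand-picked reference embeddings of each class.

    Row 0 is the no_ship prototype, row 1 is the ship prototype.
    """
    protos = np.stack([no_ship_emb.mean(axis=0), ship_emb.mean(axis=0)])
    return l2_normalize(protos)

def assign_labels(embeddings, prototypes):
    """Return (labels, margin); label 1 = ship, label 0 = no_ship."""
    sims = l2_normalize(embeddings) @ prototypes.T   # shape (n_images, 2)
    labels = sims.argmax(axis=1)
    # A small margin means the image sits between both prototypes.
    margin = np.abs(sims[:, 1] - sims[:, 0])
    return labels, margin

# Synthetic stand-in for real CNN embeddings (assumption: 64-D vectors).
rng = np.random.default_rng(0)
dim = 64
ship_dir = rng.normal(size=dim)
no_ship_dir = rng.normal(size=dim)

def fake_embeddings(center, n):
    """Hypothetical helper: noisy copies of a class direction."""
    return center + 0.3 * rng.normal(size=(n, dim))

ship_refs = fake_embeddings(ship_dir, 10)        # a few hand-picked ship images
no_ship_refs = fake_embeddings(no_ship_dir, 10)  # a few hand-picked non-ship images
unlabeled = np.vstack([fake_embeddings(ship_dir, 50),
                       fake_embeddings(no_ship_dir, 50)])

prototypes = build_prototypes(ship_refs, no_ship_refs)
labels, margin = assign_labels(unlabeled, prototypes)
print("images labelled ship:", int(labels.sum()))
```

The margin is worth keeping: images with a small margin are the ones you would probably want to inspect by hand rather than trust the automatic label.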

Valentin Calomme · Answer

First of all, keep this in mind:

After all, if it was easy to do this without any labels, then, what would be the point of needing the labels in the first place?

I can see two options:

Use a pre-trained image classifier to represent your images

As Vincent Yong suggests, you can find pre-trained networks which have been trained on similar detection tasks. ModelZoo is a good place to find pre-trained networks for the framework you are using.

Try mean-shift instead of K-Means

K-Means is straightforward but has some flaws. It doesn't deal well with clusters of uneven size and will lean towards creating roughly spherical clusters because it relies on Euclidean distance.

Mean-shift can deal with arbitrary feature spaces and can use arbitrary kernel functions. You may not end up with 2 clusters, but you may be able to find useful clusters regardless. On this note, if you try using more than 2 clusters with K-Means, you may find some clusters being "pure" (containing a single class) while some may be mixed. These pure clusters can be a good start.
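
For illustration, here is a small from-scratch sketch of mean shift with a Gaussian kernel in NumPy. It is a toy version, not a tuned implementation: the 2-D blobs stand in for image embeddings, the bandwidth value is an assumption you would have to tune, and the helper names are invented for this example. If you use scikit-learn, `sklearn.cluster.MeanShift` provides a ready-made implementation (with a flat kernel).

```python
import numpy as np

def mean_shift(points, bandwidth, n_iter=50, tol=1e-4):
    """Move every point uphill to a density mode using a Gaussian kernel."""
    modes = points.copy()
    for _ in range(n_iter):
        # Squared distances from each current position to all original points.
        d2 = ((modes[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
        weights = np.exp(-d2 / (2 * bandwidth ** 2))
        # Each position becomes the kernel-weighted mean of the data.
        shifted = weights @ points / weights.sum(axis=1, keepdims=True)
        moved = np.linalg.norm(shifted - modes, axis=1).max()
        modes = shifted
        if moved < tol:
            break
    return modes

def group_modes(modes, merge_radius):
    """Points whose modes lie within merge_radius share a cluster id."""
    centers = []
    labels = np.empty(len(modes), dtype=int)
    for i, m in enumerate(modes):
        for k, c in enumerate(centers):
            if np.linalg.norm(m - c) < merge_radius:
                labels[i] = k
                break
        else:
            # No existing center is close enough: start a new cluster.
            centers.append(m)
            labels[i] = len(centers) - 1
    return labels, np.array(centers)

# Toy data: three blobs of uneven size (a stand-in for image embeddings).
rng = np.random.default_rng(0)
blobs = [((0.0, 0.0), 200), ((5.0, 5.0), 30), ((-5.0, 5.0), 30)]
points = np.vstack([np.asarray(c) + 0.5 * rng.normal(size=(n, 2))
                    for c, n in blobs])

bandwidth = 1.0  # assumption: must be tuned for real embeddings
modes = mean_shift(points, bandwidth)
labels, centers = group_modes(modes, merge_radius=bandwidth)
print("clusters found:", len(centers))
print("cluster sizes:", np.bincount(labels))
```

Note that the number of clusters is never passed in; it falls out of the bandwidth. This version computes all pairwise distances at every step, so for 4000 high-dimensional embeddings you may want to reduce the dimensionality first.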

I wrote a chapter on Mean Shift on my website, including other resources, if you want to read it.

Graph4Me Consultant · Answer

In addition to the answers provided, you can:
1.) Jointly train a CNN (or autoencoder) with a clustering objective on your data (DCN, kmeansNet, ...).
2.) Pretrain a CNN using self-supervision on your data (have a look at the vast self-supervision literature, e.g. this work).
3.) Use an alternating scheme to train a CNN classifier on soft labels provided by a clustering algorithm such as k-means (e.g. this work).
