Seeking for advice on near real time object detection for mobile (detect garbage within images)

Question

I am trying to build a near-real time object detection model which should run on a mobile device. As I am new to this specific area of computer vision I would appreciate every advice on my current progress and feedback on what I could do differently to achieve the goal.
The goal
The goal is to detect garbage in images and classify them into one of the following disposal methods (3 target classes):

yellow sack/ can (German)
paper
glass

In addition to that the model should be lightweight so that it is possible to efficiently run it on a mobile device.
The dataset
I am using the trashnet dataset which includes exactly 2527 images that are distributed among the classes: glass, paper, plastic, trash, cardboard, metal. Notable here is that there is only one item per image. Also the background of every image is the same (plain white).

https://www.kaggle.com/asdasdasasdas/garbage-classification

The methodology
Quiet frankly I am following the YouTube Tutorial from Sentdex on Mac'n'cheese detection and this medium post on gun detection.
Therefore I am using Google Colab as my environment. Also I am trying to retrain a pretrained model (ssd_mobilenet_v2_coco_2018_03_29). Training the model and exporting the inference graph is done by using the provided methods from the tensorflow API (model_main.py and export_inference_graph.py). I am using the samples config from tensorflow for this model.
My steps so far

I've set up my Google Colab environment similar to the Colab Notebook from the Medium post I mentioned before.
I split the data into training and test data by 3/4 and 1/4 respectively.
I labeled my data by using the popular labelImg tool so that every object has a bounding box.
I deleted every image where the object fills the whole space or ranges out of the image since the bounding box wouldn't make that much sense.
I created the label_map, csv and tfrecord files.
I played around with the initial_learning_rate, the l2_regularizer > weight rate of the box predictor and feature extrator, set use_dropout=true and increased the batch_size=32.

My current results
Most of the models I built had a very bad AP/AR, kinda high loss and tended to overfit. Also the model is only able to detect one object at a time within new images (maybe because of the dataset?).
Here are some screenshots from my tensorboard. These were made after around 12k steps. I think this is also the point were the overfitting begins to show since the AP is suddenly rising and predicted images have an accuarcy around 90-100%.
Scalars:

DetectionBoxes_Precision
DetectionBoxes_Recall
Loss

Predicted images:

Predicted cardboard (Altpapier)
Predicted glass (Glascontainer)

Questions from my side

Is it problematic that every image has only one object in it? Might this cause problems when running the model on a video stream?
Are these enough images to build an accurate model?
Does anyone of you guys have experience in this area and could give me advice on how to fine tune the pretrained model?
I also ran the model on a video stream from my webcam but all models tend to classify the whole screen. So it seems that the model is detecting an object but draws the bounding box all over the screen. Might this be related to the nature of the dataset/ the poor model quality?

This has been a long post so thank you in advance for taking time to read this. I hope I was able to make my goal clear and provided enough details for you guys to follow my current progress.
Current adjusted configuration for the pretrained ssd_mobilenet_v2_coco_2018_03_29 model:

model {
  ssd {
    num_classes: 3
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        #use_dropout: false
        use_dropout: true
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              #weight: 0.00004
              weight: 0.001
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            #weight: 0.00004
            weight: 0.001
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 1
        max_total_detections: 1
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 32
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.01
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "PATH"
  fine_tune_checkpoint_type:  "detection"
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path:"PATH"
  }
  label_map_path: "PATH"
}

eval_config: {
  num_examples: 197
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  #max_evals: 10
  num_visualizations: 20
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "PATH"
  }
  label_map_path: "PATH"
  shuffle: false
  num_readers: 1
}

Anoop A Nair · Answer

You can follow the give steps as a rough outline to approach your end result.
Step 1:  Since your main aim is to identify objects from the background and then classify them into three categories. You can initially implement a Haar Cascade classifier to identify the object from different backgrounds. This might take some work in regard to creation of training set. But you can always crop out a few samples from the trash data set.
Step2:   After applying the trained Haar Cascade Classifier on you real-world images it will return the images containing trash, and sometimes it may return the background too. You can classify the images using a normal CNN network.
This is light enough to be implement on low end hardware.

Seeking for advice on near real time object detection for mobile (detect garbage within images)

The goal

The dataset

The methodology

My steps so far

My current results

Questions from my side

One Answer

Add your own answers!

Ask a Question