Seeking advice on near real-time object detection for mobile (detecting garbage in images)

Data Science question, asked by aei jaei on January 16, 2021

I am trying to build a near real-time object detection model that should run on a mobile device. As I am new to this specific area of computer vision, I would appreciate any advice on my current progress and feedback on what I could do differently to achieve the goal.

The goal

The goal is to detect garbage in images and classify them into one of the following disposal methods (3 target classes):

  • yellow sack / can (the German recycling category for packaging waste)
  • paper
  • glass

In addition, the model should be lightweight so that it can run efficiently on a mobile device.

The dataset

I am using the trashnet dataset, which contains exactly 2527 images distributed among the classes glass, paper, plastic, trash, cardboard and metal. Notably, there is only one item per image, and the background of every image is the same (plain white).

The methodology

Quite frankly, I am following the YouTube tutorial from Sentdex on Mac’n’cheese detection and this Medium post on gun detection.
Therefore I am using Google Colab as my environment and retraining a pretrained model (ssd_mobilenet_v2_coco_2018_03_29). Training the model and exporting the inference graph are done with the scripts provided by the TensorFlow Object Detection API (model_main.py and export_inference_graph.py). I am using the samples config from TensorFlow for this model.
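
For reference, the two scripts are run from Colab cells roughly as follows; the paths and the checkpoint step number are placeholders for my own directories:

!python model_main.py \
    --pipeline_config_path=training/ssd_mobilenet_v2_coco.config \
    --model_dir=training/ \
    --alsologtostderr

!python export_inference_graph.py \
    --input_type=image_tensor \
    --pipeline_config_path=training/ssd_mobilenet_v2_coco.config \
    --trained_checkpoint_prefix=training/model.ckpt-12000 \
    --output_directory=inference_graph/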

My steps so far

  1. I’ve set up my Google Colab environment similar to the Colab Notebook from the Medium post I mentioned before.
  2. I split the data into training and test sets, 3/4 and 1/4 respectively (a rough split sketch is shown below).
  3. I labeled my data by using the popular labelImg tool so that every object has a bounding box.
  4. I deleted every image where the object fills the whole frame or extends beyond the image, since the bounding box would not make much sense there.
  5. I created the label_map, csv and tfrecord files.
  6. I played around with the initial_learning_rate and the l2_regularizer > weight of the box predictor and feature extractor, set use_dropout: true and increased the batch_size to 32.
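
The train/test split in step 2 is a simple file shuffle; a minimal sketch (the "images" folder name and the exact file handling are assumptions) looks like this:

import os
import random
import shutil

# Split labelImg-annotated images (JPEG + XML pairs) into train/test folders.
# The 75/25 ratio matches the 3/4 : 1/4 split described above.
random.seed(42)
stems = [f[:-4] for f in os.listdir("images") if f.endswith(".jpg")]
random.shuffle(stems)
cut = int(0.75 * len(stems))

for subset, names in (("train", stems[:cut]), ("test", stems[cut:])):
    os.makedirs(os.path.join("images", subset), exist_ok=True)
    for name in names:
        for ext in (".jpg", ".xml"):
            shutil.move(os.path.join("images", name + ext),
                        os.path.join("images", subset, name + ext))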

My current results

Most of the models I built had very bad AP/AR, fairly high loss and tended to overfit. Also, the model is only able to detect one object at a time in new images (maybe because of the dataset?).

Here are some screenshots from my TensorBoard, taken after around 12k steps. I think this is also the point where the overfitting begins to show, since the AP rises suddenly and predicted images show an accuracy of around 90-100%.

Scalars: (TensorBoard screenshot)

Predicted images: (TensorBoard screenshot)

Questions from my side

  1. Is it problematic that every image has only one object in it? Might this cause problems when running the model on a video stream?
  2. Are there enough images here to build an accurate model?
  3. Does anyone have experience in this area and could you give me advice on how to fine-tune the pretrained model?
  4. I also ran the model on a video stream from my webcam (the inference loop is sketched below), but all models tend to classify the whole screen. So it seems that the model is detecting an object but draws the bounding box over the entire frame. Might this be related to the nature of the dataset or the poor model quality?
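
For context, the per-frame inference loop looks roughly like the sketch below; the frozen-graph path, webcam index and the 0.5 score threshold are placeholders, while the tensor names are the standard outputs of a graph exported with export_inference_graph.py:

import cv2
import numpy as np
import tensorflow as tf

# Load the exported frozen inference graph (path is a placeholder).
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile("inference_graph/frozen_inference_graph.pb", "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.compat.v1.import_graph_def(graph_def, name="")

cap = cv2.VideoCapture(0)  # default webcam
with tf.compat.v1.Session(graph=detection_graph) as sess:
    image_tensor = detection_graph.get_tensor_by_name("image_tensor:0")
    det_boxes = detection_graph.get_tensor_by_name("detection_boxes:0")
    det_scores = detection_graph.get_tensor_by_name("detection_scores:0")
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        boxes, scores = sess.run([det_boxes, det_scores],
                                 feed_dict={image_tensor: np.expand_dims(rgb, 0)})
        # Draw detections above an arbitrary confidence threshold.
        h, w = frame.shape[:2]
        for box, score in zip(boxes[0], scores[0]):
            if score < 0.5:
                continue
            ymin, xmin, ymax, xmax = box
            cv2.rectangle(frame, (int(xmin * w), int(ymin * h)),
                          (int(xmax * w), int(ymax * h)), (0, 255, 0), 2)
        cv2.imshow("detections", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
cap.release()
cv2.destroyAllWindows()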

This has been a long post, so thank you in advance for taking the time to read it. I hope I was able to make my goal clear and provide enough details for you to follow my current progress.

Current adjusted configuration for the pretrained ssd_mobilenet_v2_coco_2018_03_29 model:

model {
  ssd {
    num_classes: 3
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        #use_dropout: false
        use_dropout: true
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              #weight: 0.00004
              weight: 0.001
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v2'
      min_depth: 16
      depth_multiplier: 1.0
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            #weight: 0.00004
            weight: 0.001
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      classification_loss {
        weighted_sigmoid {
        }
      }
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 1
        max_total_detections: 1
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 32
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.01
          decay_steps: 800720
          decay_factor: 0.95
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  fine_tune_checkpoint: "PATH"
  fine_tune_checkpoint_type:  "detection"
  # Note: The below line limits the training process to 200K steps, which we
  # empirically found to be sufficient enough to train the pets dataset. This
  # effectively bypasses the learning rate schedule (the learning rate will
  # never decay). Remove the below line to train indefinitely.
  num_steps: 200000
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
}

train_input_reader: {
  tf_record_input_reader {
    input_path:"PATH"
  }
  label_map_path: "PATH"
}

eval_config: {
  num_examples: 197
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  #max_evals: 10
  num_visualizations: 20
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "PATH"
  }
  label_map_path: "PATH"
  shuffle: false
  num_readers: 1
}

One Answer

You can follow the steps below as a rough outline to approach your end result.

Step 1: Since your main aim is to separate objects from the background and then classify them into three categories, you can initially implement a Haar Cascade classifier to identify the object against different backgrounds. This might take some work with regard to creating the training set, but you can always crop out a few samples from the trash dataset.

Step 2: After applying the trained Haar Cascade classifier to your real-world images, it will return the image regions containing trash, and sometimes it may return background too. You can then classify these regions using a normal CNN.

This pipeline is light enough to be implemented on low-end hardware; a rough sketch follows below.
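
A minimal sketch of the two-stage pipeline, assuming a cascade trained on trash crops saved as trash_cascade.xml and a small Keras CNN saved as trash_classifier.h5 (both hypothetical file names; the 96x96 input size and the detectMultiScale parameters are arbitrary):

import cv2
import numpy as np
from tensorflow.keras.models import load_model

# Stage 1: a custom-trained Haar cascade proposes candidate trash regions.
# Stage 2: a small CNN classifies each crop into one of the three classes.
# Model files and class names are placeholders.
cascade = cv2.CascadeClassifier("trash_cascade.xml")
classifier = load_model("trash_classifier.h5")
class_names = ["yellow_sack", "paper", "glass"]

def detect_and_classify(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    regions = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for (x, y, w, h) in regions:
        crop = cv2.resize(frame[y:y + h, x:x + w], (96, 96))
        probs = classifier.predict(np.expand_dims(crop / 255.0, 0))[0]
        results.append(((x, y, w, h),
                        class_names[int(np.argmax(probs))],
                        float(np.max(probs))))
    return results

Each stage is cheap on its own (grayscale cascade detection plus classification of small crops), which is why this split fits low-end hardware.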

Answered by Anoop A Nair on January 16, 2021
