METHOD FOR OBJECT DETECTION USING SHALLOW NEURAL NETWORKS

Info

Publication number: 20200311517
Type: Application
Filed: Nov 13, 2019
Publication Date: Oct 1, 2020
Inventors: Igal Raichelgauz (Tel Aviv), Roi Saida (Acco)
Application Number: 16/681,885

Abstract

A method that may include feeding an input image and downscaled versions of the input image to multiple branches of an object detector; calculating, by the multiple branches, candidate bounding boxes; and selecting bounding boxes. The multiple branches comprise multiple shallow neural networks that are followed by multiple region units. Each branch includes a shallow neural network and a region unit. The multiple shallow neural networks are multiple instances of a single trained shallow neural network. The single trained shallow neural network is trained to detect objects having a size that is within a predefined size range and to ignore objects having a size that is outside the predefined size range.

Description

CROSS REFERENCE

This application claims priority from U.S. provisional patent application 62/827,121, filed Mar. 31, 2019.

BACKGROUND

Object detection is required in various systems and applications.

There is a growing need to provide a method and a system that may be able to provide highly accurate object detection at a low cost.

SUMMARY

There may be provided a method for object detection, the method may include receiving an input image by an input of an object detector; wherein the object detector may include multiple branches; generating at least one downscaled version of the input image; feeding the input image to a first branch of the multiple branches; feeding each one of the at least one downscaled version of the input image to a unique branch of the multiple branches, one downscaled version of the image per branch; calculating, by the multiple branches, candidate bounding boxes that may be indicative of candidate objects that appear in the input image and each one of the at least one downscaled version of the input image; selecting bounding boxes out of the candidate bounding boxes, by a selection unit that follows the multiple branches; wherein the multiple branches may include multiple shallow neural networks that may be followed by multiple region units; wherein each branch may include a shallow neural network and a region unit; wherein the multiple shallow neural networks may be multiple instances of a single trained shallow neural network; and wherein the single trained shallow neural network may be trained to detect objects having a size that may be within a predefined size range and to ignore objects having a size that may be outside the predefined size range.

The method may include generating the multiple downscaled versions of the input image by applying a same downscaling ratio (a) between the input image and a first downscaled version of the input image and (b) between the first downscaled version of the input image and a second downscaled version of the input image.

There may be provided a non-transitory computer readable medium for detecting an object by an object detector, wherein the non-transitory computer readable medium may store instructions for: receiving an input image by an input of the object detector; wherein the object detector may include multiple branches; generating at least one downscaled version of the input image; feeding the input image to a first branch of the multiple branches; feeding each one of the at least one downscale version of the input image to a unique branch of the multiple branches, one downscale version of the image per branch; calculating, by the multiple branches, candidate bounding boxes that may be indicative of candidate objects that appear in the input image and each one of the at least one downscaled version of the input image; selecting bounding boxes out of the candidate bounding boxes, by a selection unit that follows the multiple branches; wherein the multiple branches may include multiple shallow neural networks that may be followed by multiple region units; wherein each branch may include a shallow neural network and a region unit; wherein the multiple shallow neural networks may be multiple instances of a single trained shallow neural network; and wherein the single trained shallow neural network may be trained to detect objects having a size that may be within a predefined size range and to ignore objects having a size that may be outside the predefined size range.

The non-transitory computer readable medium may store instructions for generating the multiple downscaled versions of the input image by applying a same downscaling ratio (a) between the input image and a first downscaled version of the input image and (b) between the first downscaled version of the input image and a second downscaled version of the input image.

There may be provided an object detection system that may include an input, a downscaling unit, multiple branches, and a selection unit; wherein the input may be configured to receive an input image; wherein the downscaling unit may be configured to generate at least one downscaled version of the input image; wherein the multiple branches may be configured to receive the input image and the at least one downscaled version of the input image, one image per branch; wherein the multiple branches may be configured to calculate candidate bounding boxes that may be indicative of candidate objects that appear in the input image and each one of the at least one downscaled version of the input image; wherein the selection unit may be configured to select bounding boxes out of the candidate bounding boxes; wherein the multiple branches may include multiple shallow neural networks that may be followed by multiple region units; wherein each branch may include a shallow neural network and a region unit; wherein the multiple shallow neural networks may be multiple instances of a single trained shallow neural network; and wherein the single trained shallow neural network may be trained to detect objects having a size that may be within a predefined size range and to ignore objects having a size that may be outside the predefined size range.

The downscaling unit may be configured to generate the multiple downscaled versions of the input image by applying a same downscaling ratio (a) between the input image and a first downscaled version of the input image and (b) between the first downscaled version of the input image and a second downscaled version of the input image.

The predefined size range may range from about ten by ten pixels to about one hundred by one hundred pixels.

The predefined size range may range from about sixteen by sixteen pixels to about one hundred and twenty by one hundred and twenty pixels.

The predefined size range may range from about eighty by eighty pixels to about one hundred by one hundred pixels.

The multiple branches may be three branches and wherein there may be two downscaled versions of the input image.

The at least one downscaled version of the image may be multiple downscaled versions of the input image.

The first downscaled version of the input image may have a width that may be one half of a width of the input image and a length that may be one half of a length of the input image.

Each shallow neural network may have up to four layers.

Each shallow neural network may have up to five layers.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments of the disclosure will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:

FIG. 1 illustrates an example of an object detection system;

FIG. 2 illustrates an example of an image, two objects, two bounding boxes and a bounding box output;

FIG. 3 illustrates an image and various objects;

FIG. 4 illustrates an example of a training process; and

FIG. 5 illustrates an example of a method for object detection.

DESCRIPTION OF EXAMPLE EMBODIMENTS

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.

The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings.

It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.

Because the illustrated embodiments of the present invention may for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.

Any reference in the specification to a method should be applied mutatis mutandis to a device or system capable of executing the method and/or to a non-transitory computer readable medium that stores instructions for executing the method.

Any reference in the specification to a system or device should be applied mutatis mutandis to a method that may be executed by the system, and/or may be applied mutatis mutandis to non-transitory computer readable medium that stores instructions executable by the system.

Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a device or system capable of executing instructions stored in the non-transitory computer readable medium and/or may be applied mutatis mutandis to a method for executing the instructions.

Any combination of any module or unit listed in any of the figures, any part of the specification and/or any claims may be provided.

There may be provided a low power object detection system (detector), non-transitory computer readable medium and method. The object detection system, non-transitory computer readable medium and method also provide high-level semantic multi-scale feature maps, without impairing the speed of the detector.

Each additional convolution layer increases the physical receptive field of the detector. Therefore, enlarging the maximum object size that is managed by the detector results in an increase in the required number of convolution layers.
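As a non-limiting illustration, the growth of the receptive field with the number of layers can be computed with the standard receptive-field recurrence for stacked convolution or pooling layers. The sketch below is an assumption-laden example: the function name `receptive_field` and the layer configurations (3×3 kernels with stride 2) are hypothetical and are not taken from this disclosure.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of a stack of layers.

    `layers` is a sequence of (kernel_size, stride) pairs, first layer first.
    Standard recurrence: r += (kernel - 1) * jump; jump *= stride.
    """
    r, jump = 1, 1
    for kernel, stride in layers:
        r += (kernel - 1) * jump
        jump *= stride
    return r


# Hypothetical stacks of 3x3 convolutions with stride 2.
shallow = [(3, 2)] * 4
deeper = [(3, 2)] * 6
print(receptive_field(shallow))  # 31
print(receptive_field(deeper))   # 127
```

Under these assumed layers, adding two layers roughly quadruples the side of the receptive field, which is why covering larger objects with a single network tends to require more layers.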

Since each layer of the convolutional network has a fixed receptive field, it is not optimal to detect objects of different scales utilizing only features generated by the last convolutional layer.

Shallow feature maps have small receptive fields that are used to detect small objects, and deep feature maps have large receptive fields that are used to detect large objects.

Nevertheless, shallow features might have less semantic information, which may impair the detection of small objects.

The above approach was prevalent among the first object detectors, released up to 2016. In contrast, in the last few years there has been a trend of integrating very deep networks into state-of-the-art object detectors. Hence, state-of-the-art object detectors detect small objects using feature maps extracted from enormous receptive fields.

That implementation forces ineffective forward propagation of small-object features from earlier network stages to deeper network stages.

Thus, while managing larger objects requires a deeper network, the ineffective detection of small objects increases the number of channels along the network or complicates the memory data transfer between layers.

One explanation for using feature maps that have large receptive fields for small objects is that detecting a small object benefits from the context information surrounding it. For example, a small car driving on a roadway can easily be distinguished from a boat sailing on the sea by using the surrounding background information, which differs far more than the internal information of the two small objects.

However, real-time automotive applications cannot take advantage of deeper, wider, or more complex networks because such networks do not meet power consumption constraints.

FIG. 1 illustrates an object detection system 9000 that includes an input 9010 (illustrated as receiving input image 9001), a downscaling unit 9011, multiple branches (such as three branches 9013(1), 9013(2) and 9013(3)), and a selection unit 9016 such as a non-maximal suppression unit.

Input 9010 may be configured to receive an input image.

Downscaling unit 9011 may be configured to generate at least one downscaled version of the input image.

The multiple branches 9013(1), 9013(2) and 9013(3) may be configured to receive the input image and the at least one downscaled version of the input image, one image per branch.

Input image 9001 is fed to first branch 9013(1) that is configured to calculate first candidate bounding boxes that may be indicative of candidate objects that appear in the input image.

First downscaled version of the input image (DVII) 9002 is fed to second branch 9013(2) that is configured to calculate second candidate bounding boxes that may be indicative of candidate objects that appear in first DVII 9002.

Second DVII 9003 is fed to third branch 9013(3) that is configured to calculate third candidate bounding boxes that may be indicative of candidate objects that appear in second DVII 9003.
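The generation of the inputs to the three branches can be sketched as follows. This is a minimal, non-limiting example: the helper names `downscale_by_two` and `build_pyramid` are hypothetical, the image is a single-channel list of rows, and 2×2 block averaging is an assumed downscaling filter (the text does not specify one).

```python
def downscale_by_two(image):
    """Halve height and width by averaging each 2x2 block of pixels.

    `image` is a list of rows of numeric pixel values (single channel).
    Averaging is an assumed filter; the text does not specify one.
    """
    height, width = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


def build_pyramid(image, num_downscaled=2):
    """Return [input image, first DVII, second DVII, ...]."""
    pyramid = [image]
    for _ in range(num_downscaled):
        pyramid.append(downscale_by_two(pyramid[-1]))
    return pyramid


# A small 8x12 synthetic image stands in for input image 9001.
image = [[float(x + y) for x in range(12)] for y in range(8)]
pyramid = build_pyramid(image)
shapes = [(len(level), len(level[0])) for level in pyramid]
print(shapes)  # [(8, 12), (4, 6), (2, 3)]
```

Each entry of `pyramid` would be fed to one branch, one image per branch.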

The multiple branches may include multiple shallow neural networks that may be followed by multiple region units.

In first branch 9013(1), a first shallow neural network 9012(1) is followed by first region unit 9014(1).

The first shallow neural network 9012(1) outputs a first shallow neural network output (SNNO-1) 9003(1) that may be a tensor with multiple features per segment of the input image. The first region unit 9014(1) is configured to receive SNNO-1 9003(1) and calculate and output first candidate bounding boxes 9005(1).

The second shallow neural network 9012(2) outputs a second SNNO (SNNO-2) 9003(2) that may be a tensor with multiple features per segment of the first DVII 9002. The second region unit 9014(2) is configured to receive SNNO-2 9003(2) and calculate and output second candidate bounding boxes 9005(2).

The third shallow neural network 9012(3) outputs a third SNNO (SNNO-3) 9003(3) that may be a tensor with multiple features per segment of the second DVII 9003. The third region unit 9014(3) is configured to receive SNNO-3 9003(3) and calculate and output third candidate bounding boxes 9005(3).

The multiple shallow neural networks 9012(1), 9012(2) and 9012(3) may be multiple instances of a single trained shallow neural network.
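One possible way to realize multiple instances of a single trained network is for every branch to reference the same trained parameters. The sketch below assumes this; the class names `ShallowNetwork` and `Branch`, the placeholder weights, and the stubbed forward pass are all hypothetical. Copying the trained weights into separate network objects would be an equally valid reading.

```python
class ShallowNetwork:
    """Placeholder for the single trained shallow neural network."""

    def __init__(self, weights):
        self.weights = weights  # trained once, reused by every branch

    def __call__(self, image):
        # A real network would return a feature tensor; this stub
        # only records the size of the image it was given.
        return {"rows": len(image), "cols": len(image[0])}


class Branch:
    """One branch; the region unit is omitted in this sketch."""

    def __init__(self, network, scale):
        self.network = network  # shared reference, not a copy
        self.scale = scale      # downscaling factor of this branch's input

    def features(self, image):
        return self.network(image)


trained = ShallowNetwork(weights=[0.1, 0.2, 0.3])  # hypothetical weights
branches = [Branch(trained, scale) for scale in (1, 2, 4)]

# All branches point at the same trained parameters.
print(all(b.network is trained for b in branches))  # True
```

With this arrangement the number of stored parameters does not grow with the number of branches.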

The single trained shallow neural network may be trained to detect objects having a size that may be within a predefined size range and to ignore objects having a size that may be outside the predefined size range.

The selection unit 9016 may be configured to select bounding boxes (denoted BB output 9007) out of the first, second and third candidate bounding boxes.
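Where the selection unit is a non-maximal suppression unit, its operation can be sketched as a greedy procedure over the pooled candidates. The example below is illustrative only: boxes are assumed to be `(x1, y1, x2, y2, score)` tuples already expressed in input-image coordinates, and the 0.5 overlap threshold and the sample values are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, score) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, iou_threshold=0.5):
    """Keep the highest-scoring box of each overlapping group."""
    kept = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) <= iou_threshold for k in kept):
            kept.append(box)
    return kept


# Candidates from three branches, already in input-image coordinates.
candidates = [
    (10, 10, 50, 50, 0.9),      # from the first branch
    (12, 12, 52, 52, 0.6),      # from the second branch, overlaps the box above
    (200, 100, 300, 220, 0.8),  # from the third branch
]
selected = non_max_suppression(candidates)
print(selected)  # [(10, 10, 50, 50, 0.9), (200, 100, 300, 220, 0.8)]
```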

The selected bounding boxes may be further processed to detect the objects. Additionally or alternatively—the bounding boxes may provide the output of the object detection system.

The branch that receives the input image is configured to detect objects that have a size that is within the predefined size range.

The predefined size range may span along certain fractions of the input image (for example—between less than a percent to less than ten percent of the input image—although other fractions may be selected).

The predefined size range may be tailored to the expected size of objects within a certain distance range from the sensor.

The predefined size range may span along certain numbers of pixels—for example between (a) about 10, 20, 30, 40, 50, 60, 70, 80, and 90 pixels by about 10, 20, 30, 40, 50, 60, 70, 80, and 90, and (b) about 100, 110, 120, 130, 140, 150, 160 pixels by about 100, 110, 120, 130, 140, 150, 160 pixels.

Each branch that receives a downscaled version of the input image (assuming a certain downscaling factor) may detect objects having a size (within the downscaled version of the input image) that is within the predefined size range, and thus may detect objects that appear in the input image having a size that is within a size range that equals the predefined size range multiplied by the downscaling factor.
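The relation between the predefined size range and the object sizes covered by each branch is a simple multiplication, as the following non-limiting sketch shows. The function name `effective_size_range`, the 20 to 100 pixel range, and the downscaling factors 1, 2 and 4 are example assumptions.

```python
def effective_size_range(predefined_range, downscaling_factor):
    """Map the trained size range to input-image pixels for one branch."""
    low, high = predefined_range
    return (low * downscaling_factor, high * downscaling_factor)


predefined = (20, 100)  # side length in pixels, example values only
for factor in (1, 2, 4):
    print(factor, effective_size_range(predefined, factor))
# 1 (20, 100)
# 2 (40, 200)
# 4 (80, 400)
```

With these example values the ranges of adjacent branches overlap, so an object near a range boundary may be reported by two branches and resolved by the selection unit.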

Assume, for example, that the input image is of 576×768 pixels (each pixel is represented by three colors), that the first DVII is of 288×384 pixels (each pixel is represented by three colors), that the second DVII is of 144×192 pixels (each pixel is represented by three colors), that SNNO-1 has 85 features per segment out of 36×48 segments, that SNNO-2 has 85 features per segment out of 18×24 segments, and that SNNO-3 has 85 features per segment out of 9×12 segments.
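The example dimensions can be checked with a few lines of arithmetic. Dividing each image dimension by the corresponding segment count gives 16 in all three cases, so the sketch below assumes that each segment covers 16×16 pixels of the branch input; that segment size is inferred from the example numbers rather than stated in the text, and the function name `segment_grid` is hypothetical.

```python
def segment_grid(height, width, segment_size=16):
    """Number of segments along each axis for a given input size."""
    return (height // segment_size, width // segment_size)


inputs = {
    "input image": (576, 768),
    "first DVII": (288, 384),
    "second DVII": (144, 192),
}
FEATURES_PER_SEGMENT = 85

for name, (h, w) in inputs.items():
    rows, cols = segment_grid(h, w)
    # Total number of values in the shallow neural network output.
    print(name, (rows, cols), rows * cols * FEATURES_PER_SEGMENT)
# input image (36, 48) 146880
# first DVII (18, 24) 36720
# second DVII (9, 12) 9180
```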

The assumption above as well as the example below are merely non-limiting examples of various values. Other values may be provided.

Under these assumptions, each shallow neural network may detect an object having a size between 20×20 and 100×100 pixels and may have a physical receptive field of around 200×200 pixels. This assumes that automotive objects can be effectively represented using bounding box dimensions below 100×100.

In contrast to a single model trained end to end, the following architecture contains several identical shallow neural networks.

The first branch detects small objects (as appearing in the input image), the second branch detects medium objects (as appearing in the input image), and the third branch detects large objects (as appearing in the input image); all may be within a limited predefined size range.

The number of branches, scales, and the downscale factor may differ from those illustrated in FIG. 1. For example—there may be two or more than three branches, the downscaling factor may differ from 2×2, downscaling factors between different images may differ from each other, and the like.

FIG. 2 illustrates an example of an image 9020, two objects—pedestrian 9021 and car 9022, two bounding boxes 9023 (bounding pedestrian 9021) and 9024 (bounding car 9022) and a bounding box output 9025.

The bounding box output 9025 may include coordinates (x,y,h,w) of the bounding boxes, objectiveness and class. The coordinates indicate the location (x,y) as well as the height and width of the bounding boxes. Objectiveness provides a confidence level that an object exists. Class indicates the class of the object (for example cat, dog, vehicle, person, and the like). The (x,y) coordinates may represent the center of the bounding box.
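One entry of such a bounding box output can be represented as a small record, as in the following non-limiting sketch. The type name `Detection`, the field names, and both helper functions are hypothetical; scaling a box found in a downscaled image back to input-image pixels by the downscaling factor is an assumption about how results of different branches may be compared.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    x: float              # center x
    y: float              # center y
    h: float              # height
    w: float              # width
    objectiveness: float  # confidence that an object exists
    cls: str              # class label


def to_corners(det):
    """Return (x_min, y_min, x_max, y_max) for a center-format box."""
    return (det.x - det.w / 2, det.y - det.h / 2,
            det.x + det.w / 2, det.y + det.h / 2)


def to_input_coordinates(det, downscaling_factor):
    """Scale a box found in a downscaled image back to the input image."""
    f = downscaling_factor
    return det._replace(x=det.x * f, y=det.y * f, h=det.h * f, w=det.w * f)


pedestrian = Detection(x=60.0, y=80.0, h=40.0, w=20.0,
                       objectiveness=0.9, cls="person")
print(to_corners(pedestrian))  # (50.0, 60.0, 70.0, 100.0)
```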

The object detection may be compliant with any variant of YOLO, but other object detection schemes may be applied.

FIG. 3 illustrates an image 9030 and various objects 9031, 9032, 9033 and 9034.

Objects 9033 and 9034 are outside the predefined size range and should be ignored. The single trained neural network is trained to detect objects 9031 and 9032 (within the predefined size range) and to ignore objects 9033 and 9034.
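The separation of annotated objects into those to be detected and those to be ignored can be sketched as a size filter. This is a non-limiting example: the function names, the 20 to 100 pixel limits, the rule that both sides must fall within the range, and the annotation values are all assumptions.

```python
def within_size_range(box, min_side=20, max_side=100):
    """True if both sides of an (x, y, h, w) box fall inside the range."""
    _, _, h, w = box
    return min_side <= h <= max_side and min_side <= w <= max_side


def split_targets(boxes, min_side=20, max_side=100):
    """Separate annotated boxes into detect and ignore groups."""
    detect = [b for b in boxes if within_size_range(b, min_side, max_side)]
    ignore = [b for b in boxes if not within_size_range(b, min_side, max_side)]
    return detect, ignore


# Hypothetical annotations loosely mirroring objects 9031 to 9034.
annotations = [
    (100, 120, 40, 30),    # in range
    (300, 200, 90, 60),    # in range
    (50, 400, 8, 8),       # too small
    (400, 300, 250, 300),  # too large
]
detect, ignore = split_targets(annotations)
print(len(detect), len(ignore))  # 2 2
```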

FIG. 4 illustrates an example of a training process.

Test images 9040 are fed to single shallow neural network 9017 that outputs, for each test image, a single shallow neural network output that may be a tensor with multiple features per segment of the test image. The region unit 9018 is configured to receive the output from single shallow neural network 9017 and calculate and output candidate bounding boxes per test image. Actual results such as the output candidate bounding boxes per test image or an output of a selecting unit 9019 (that follows region unit 9018) may be fed to error calculation unit 9050.

Error calculation unit 9050 also receives desired results 9045—objects of a size of the predefined range that should be detected by the single shallow neural network 9017.

Error calculation unit 9050 calculates an error 9055 between the actual results and the desired results, and the error is fed to the single shallow neural network 9017 during the training process.
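The text does not specify how the error is computed. Purely as an illustration, the sketch below uses a sum of squared differences between actual and desired boxes that are assumed to be already matched one to one; the function name `box_error` and the sample values are hypothetical, and practical detectors typically use richer loss terms.

```python
def box_error(actual, desired):
    """Sum of squared differences between matched (x, y, h, w) boxes."""
    return sum(
        (a - d) ** 2
        for box_a, box_d in zip(actual, desired)
        for a, d in zip(box_a, box_d)
    )


actual = [(100.0, 120.0, 42.0, 28.0)]   # output for one test image
desired = [(100.0, 120.0, 40.0, 30.0)]  # in-range object to be detected
print(box_error(actual, desired))  # 8.0
```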

FIG. 5 illustrates an example of a method 9100 for object detection.

Method 9100 may include the following steps:

- Step 9101 of receiving an input image by an input of an object detector. The object detector may include multiple branches. The multiple branches may include multiple shallow neural networks that may be followed by multiple region units. Each branch may include a shallow neural network and a region unit. The multiple shallow neural networks may be multiple instances of a single trained shallow neural network. The single trained shallow neural network may be trained to detect objects having a size that may be within a predefined size range and to ignore objects having a size that may be outside the predefined size range.
- Step 9102 of generating at least one downscaled version of the input image.
- Step 9103 of feeding the input image to a first branch of the multiple branches.
- Step 9104 of feeding each one of the at least one downscale version of the input image to a unique branch of the multiple branches, one downscale version of the image per branch.
- Step 9105 of calculating, by the multiple branches, candidate bounding boxes that may be indicative of candidate objects that appear in the input image and each one of the at least one downscaled version of the input image.
- Step 9106 of selecting bounding boxes out of the candidate bounding boxes, by a selection unit that follows the multiple branches.
- Step 9107 of outputting the bounding boxes and/or further processing the bounding boxes.
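The steps listed above can be sketched end to end as follows. This is a non-limiting example with stub components: `detect_objects` and the three stub functions are hypothetical names, a downscaling factor of two between consecutive images is assumed, and the stub branch returns a fixed box instead of running a trained network and region unit.

```python
def detect_objects(image, branch_fn, downscale_fn, select_fn, num_downscaled=2):
    """Run steps 9101 to 9107 with caller-supplied components."""
    # Steps 9101-9102: receive the image and build downscaled versions.
    images = [image]
    for _ in range(num_downscaled):
        images.append(downscale_fn(images[-1]))

    # Steps 9103-9105: one image per branch; reusing branch_fn models
    # multiple instances of a single trained network plus region unit.
    candidates = []
    factor = 1
    for level in images:
        for (x, y, h, w, score) in branch_fn(level):
            # Express each box in input-image pixels.
            candidates.append(
                (x * factor, y * factor, h * factor, w * factor, score))
        factor *= 2  # assumed downscaling factor between levels

    # Steps 9106-9107: select and output bounding boxes.
    return select_fn(candidates)


# Stub components so the sketch runs end to end.
def stub_downscale(img):
    return [row[::2] for row in img[::2]]


def stub_branch(img):
    # Pretend one 4x4 object is found at the center of the given image.
    rows, cols = len(img), len(img[0])
    return [(cols / 2, rows / 2, 4.0, 4.0, 0.5)]


def stub_select(boxes):
    return sorted(boxes, key=lambda b: b[4], reverse=True)


image = [[0] * 16 for _ in range(8)]
result = detect_objects(image, stub_branch, stub_downscale, stub_select)
print(len(result))  # 3
```

In this run the same 4×4 detection in each branch maps to boxes of 4, 8 and 16 pixels per side in the input image, showing how identical branches cover different object sizes.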

Method 9100 may include training the single trained shallow neural network.

While the foregoing written description of the invention enables one of ordinary skill to make and use what is considered presently to be the best mode thereof, those of ordinary skill will understand and appreciate the existence of variations, combinations, and equivalents of the specific embodiment, method, and examples herein. The invention should therefore not be limited by the above described embodiment, method, and examples, but by all embodiments and methods within the scope and spirit of the invention as claimed.

In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the broader spirit and scope of the invention as set forth in the appended claims.

Moreover, the terms “front,” “back,” “top,” “bottom,” “over,” “under” and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. It is understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.

Furthermore, the terms “assert” or “set” and “negate” (or “deassert” or “clear”) are used herein when referring to the rendering of a signal, status bit, or similar apparatus into its logically true or logically false state, respectively. If the logically true state is a logic level one, the logically false state is a logic level zero. And if the logically true state is a logic level zero, the logically false state is a logic level one.

Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures may be implemented which achieve the same functionality.

Any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality may be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality.

Furthermore, those skilled in the art will recognize that boundaries between the above described operations are merely illustrative. The multiple operations may be combined into a single operation, a single operation may be distributed in additional operations and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.

Also for example, in one embodiment, the illustrated examples may be implemented as circuitry located on a single integrated circuit or within a same device. Alternatively, the examples may be implemented as any number of separate integrated circuits or separate devices interconnected with each other in a suitable manner.

However, other modifications, variations and alternatives are also possible. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense.

In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles. Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.

While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

It is appreciated that various features of the embodiments of the disclosure which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the embodiments of the disclosure which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable sub-combination.

It will be appreciated by persons skilled in the art that the embodiments of the disclosure are not limited by what has been particularly shown and described hereinabove. Rather the scope of the embodiments of the disclosure is defined by the appended claims and equivalents thereof.

Claims

1. A method for object detection, the method comprises:

receiving an input image by an input of an object detector; wherein the object detector comprises multiple branches;

generating at least one downscaled version of the input image;

feeding the input image to a first branch of the multiple branches;

feeding each one of the at least one downscale version of the input image to a unique branch of the multiple branches, one downscale version of the image per branch;

calculating, by the multiple branches, candidate bounding boxes that are indicative of candidate objects that appear in the input image and each one of the at least one downscaled version of the input image;

selecting bounding boxes out of the candidate bounding boxes, by a selection unit that follows the multiple branches;

wherein the multiple branches comprise multiple shallow neural networks that are followed by multiple region units; wherein each branch comprises a shallow neural network and a region unit;

wherein the multiple shallow neural networks are multiple instances of a single trained shallow neural network; and

wherein the single trained shallow neural network is trained to detect objects having a size that is within a predefined size range and to ignore objects having a size that is outside the predefined size range.

2. The method according to claim 1 wherein the predefined size range ranges between (a) ten by ten pixels, till (b) one hundred by one hundred pixels.

3. The method according to claim 1 wherein the predefined size range ranges between (a) sixteen by sixteen pixels, till (b) one hundred and twenty pixels by one hundred and twenty pixels.

4. The method according to claim 1 wherein the predefined size range ranges between (a) eighty by eighty pixels, till (b) one hundred by one hundred pixels.

5. The method according to claim 1 wherein the multiple branches are three branches and wherein there are two downscaled versions of the input image.

6. The method according to claim 1 wherein the generating of the at least one downscaled version of the input image comprises generating multiple downscaled versions of the input image.

7. The method according to claim 6 comprising generating the multiple downscaled versions of the input image by applying a same downscaling ratio (a) between the input image and a first downscaled version of the input image and (b) between the first downscaled version of the input image and a second downscaled version of the input image.

8. The method according to claim 6 wherein a first downscaled version of the input image has a width that is one half of a width of the input image and a length that is one half of a length of the input image.

9. The method according to claim 1 wherein each shallow neural network has up to four layers.

10. The method according to claim 1 wherein each shallow neural network has up to five layers.

11. A non-transitory computer readable medium for detecting an object by an object detector, wherein the non-transitory computer readable medium stores instructions for:

receiving an input image by an input of the object detector; wherein the object detector comprises multiple branches;

generating at least one downscaled version of the input image;

feeding the input image to a first branch of the multiple branches;

feeding each one of the at least one downscale version of the input image to a unique branch of the multiple branches, one downscale version of the image per branch;

calculating, by the multiple branches, candidate bounding boxes that are indicative of candidate objects that appear in the input image and in each one of the at least one downscaled version of the input image;

selecting bounding boxes out of the candidate bounding boxes, by a selection unit that follows the multiple branches;

wherein the multiple branches comprise multiple shallow neural networks that are followed by multiple region units; wherein each branch comprises a shallow neural network and a region unit;

wherein the multiple shallow neural networks are multiple instances of a single trained shallow neural network; and

wherein the single trained shallow neural network is trained to detect objects having a size that is within a predefined size range and to ignore objects having a size that is outside the predefined size range.

12. The non-transitory computer readable medium according to claim 11 wherein the predefined size range ranges from (a) ten by ten pixels to (b) one hundred by one hundred pixels.

13. The non-transitory computer readable medium according to claim 11 wherein the predefined size range ranges from (a) sixteen by sixteen pixels to (b) one hundred and twenty by one hundred and twenty pixels.

14. The non-transitory computer readable medium according to claim 11 wherein the predefined size range ranges from (a) eighty by eighty pixels to (b) one hundred by one hundred pixels.

15. The non-transitory computer readable medium according to claim 11 wherein the multiple branches are three branches and wherein there are two downscaled versions of the input image.

16. The non-transitory computer readable medium according to claim 11 wherein the generating of the at least one downscaled version of the input image comprises generating multiple downscaled versions of the input image.

17. The non-transitory computer readable medium according to claim 16 that stores instructions for generating the multiple downscaled versions of the input image by applying a same downscaling ratio (a) between the input image and a first downscaled version of the input image and (b) between the first downscaled version of the input image and a second downscaled version of the input image.

18. The non-transitory computer readable medium according to claim 16 wherein a first downscaled version of the input image has a width that is one half of a width of the input image and a length that is one half of a length of the input image.

19. The non-transitory computer readable medium according to claim 11 wherein each shallow neural network has up to four layers.

20. The non-transitory computer readable medium according to claim 11 wherein each shallow neural network has up to five layers.

21. An object detection system that comprises an input, a downscaling unit, multiple branches, and a selection unit;

wherein the input is configured to receive an input image;

wherein the downscaling unit is configured to generate at least one downscaled version of the input image;

wherein the multiple branches are configured to receive the input image and the at least one downscaled version of the input image, one image per branch;

wherein the multiple branches are configured to calculate candidate bounding boxes that are indicative of candidate objects that appear in the input image and in each one of the at least one downscaled version of the input image;

wherein the selection unit is configured to select bounding boxes out of the candidate bounding boxes;

wherein the multiple branches comprise multiple shallow neural networks that are followed by multiple region units; wherein each branch comprises a shallow neural network and a region unit;

wherein the multiple shallow neural networks are multiple instances of a single trained shallow neural network; and

wherein the single trained shallow neural network is trained to detect objects having a size that is within a predefined size range and to ignore objects having a size that is outside the predefined size range.

22. The object detection system according to claim 21 wherein the predefined size range ranges from (a) ten by ten pixels to (b) one hundred by one hundred pixels.

23. The object detection system according to claim 21 wherein the predefined size range ranges from (a) sixteen by sixteen pixels to (b) one hundred and twenty by one hundred and twenty pixels.

24. The object detection system according to claim 21 wherein the predefined size range ranges from (a) eighty by eighty pixels to (b) one hundred by one hundred pixels.

25. The object detection system according to claim 21 wherein the multiple branches are three branches and wherein there are two downscaled versions of the input image.

26. The object detection system according to claim 21 wherein the downscaling unit is configured to generate multiple downscaled versions of the input image.

27. The object detection system according to claim 26 wherein the downscaling unit is configured to generate the multiple downscaled versions of the input image by applying a same downscaling ratio (a) between the input image and a first downscaled version of the input image and (b) between the first downscaled version of the input image and a second downscaled version of the input image.

28. The object detection system according to claim 26 wherein a first downscaled version of the input image has a width that is one half of a width of the input image and a length that is one half of a length of the input image.

29. The object detection system according to claim 21 wherein each shallow neural network has up to four layers.

30. The object detection system according to claim 21 wherein each shallow neural network has up to five layers.
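
By way of non-limiting illustration only, and not as part of the claims, the flow recited above (downscaling, one image per branch, a single shared detector, and a selection unit) may be sketched as follows. The sketch rests on several assumptions that do not appear in the application: the function names, the nearest-neighbour downscaling by one half, the box format (x, y, width, height, score), and the use of greedy non-maximum suppression as the selection unit are all hypothetical choices. The detector argument stands in for the single trained shallow neural network followed by a region unit, whose internals are not modelled here.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]                        # grayscale image as rows of pixels
Box = Tuple[float, float, float, float, float]   # (x, y, width, height, score)


def downscale_half(image: Image) -> Image:
    """Halve width and length by keeping every second pixel (assumed method)."""
    return [row[::2] for row in image[::2]]


def build_pyramid(image: Image, num_downscaled: int = 2) -> List[Image]:
    """Return the input image followed by its downscaled versions,
    using the same ratio (one half) between neighbouring versions."""
    levels = [image]
    for _ in range(num_downscaled):
        levels.append(downscale_half(levels[-1]))
    return levels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    ax, ay, aw, ah, _ = a
    bx, by, bw, bh, _ = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def select_boxes(candidates: List[Box], iou_threshold: float = 0.5) -> List[Box]:
    """Selection unit, assumed here to be greedy non-maximum suppression."""
    kept: List[Box] = []
    for box in sorted(candidates, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept


def detect(image: Image,
           detector: Callable[[Image], List[Box]],
           num_downscaled: int = 2) -> List[Box]:
    """Feed one image per branch; every branch calls the same detector."""
    candidates: List[Box] = []
    for level, scaled in enumerate(build_pyramid(image, num_downscaled)):
        factor = 2 ** level  # maps branch coordinates back to the input image
        for x, y, w, h, score in detector(scaled):
            candidates.append((x * factor, y * factor, w * factor, h * factor, score))
    return select_boxes(candidates)
```

Under these assumptions, a detector limited to objects of up to one hundred by one hundred pixels would, when applied to the version downscaled by one half, report objects that span up to two hundred by two hundred pixels of the input image, because each branch's boxes are multiplied by that branch's downscaling factor before selection.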