Understanding neutron-caused errors in electronics
17 Sep 2021
- Sujit Malde

Researchers have used ChipIr’s high-intensity neutron beam to showcase effective reliability improvements for commercially available Artificial Intelligence (AI) accelerators.

NM500 neuromorphic neural network chip

http://www.theneuromorphic.com/neuroshield/

Terrestrial neutrons, generated by reactions between cosmic rays and atoms in the atmosphere, are a potential source of errors in electronic devices that is often overlooked.

Concerns have grown over the reliability of electronics against neutron-induced errors, since these errors become more common as electronics get smaller and more ubiquitous.

One example is AI accelerators, which make machine learning calculations run faster. They're extremely attractive for high-performance computing and are increasingly used in safety-critical applications, such as computer vision in driverless cars. The outputs of machine learning processes are probabilistic, so an undetected hardware error that propagates to the output is much harder to diagnose than it would be in a process with predictable outputs.

Research into single event effects, a type of neutron error, at the ChipIr instrument has focused on improving the reliability of a range of AI accelerators. Former industrial placement student Sebastian Blower investigated the NM500 neuromorphic neural network chip from General Vision, an example of a commercially available off-the-shelf AI accelerator.

The NM500 learns by saving information about an object in its category register. It can then predict whether the features of a new object resemble what it has learnt by comparing the information it is given with the information in its register. Sebastian found that the errors that caused the device to fail were generated by neutron events in the NM500's category register.
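To make the failure mode concrete, here is a toy sketch of a pattern-matching classifier whose stored label is corrupted by a single bit flip. It is an illustration only and not the NM500's actual interface: the `Neuron` class, the use of an L1 distance for the comparison, and the example values are all assumptions made for this sketch.

```python
# Toy model only: names, data layout and distance metric are assumptions,
# not the NM500 API.
from dataclasses import dataclass


@dataclass
class Neuron:
    pattern: tuple   # stored feature vector learnt from an object
    category: int    # label held in a (hypothetical) category register


def l1_distance(a, b):
    """Sum of absolute differences between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(neurons, features):
    """Return the category of the stored pattern closest to the input."""
    best = min(neurons, key=lambda n: l1_distance(n.pattern, features))
    return best.category


def flip_bit(value, bit):
    """Model a single event upset by inverting one bit of a stored value."""
    return value ^ (1 << bit)


neurons = [
    Neuron(pattern=(10, 10, 10), category=1),
    Neuron(pattern=(200, 200, 200), category=2),
]
sample = (12, 9, 11)

before = classify(neurons, sample)

# Simulate a neutron-induced bit flip in the stored category of the first neuron.
neurons[0].category = flip_bit(neurons[0].category, 2)
after = classify(neurons, sample)

print(before, after)
```

In this sketch the input still matches the correct stored pattern, but the classifier reports a different category after the flip, because the label itself has been corrupted.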

He then investigated ways to prevent these errors from causing device failure, using a technique known as triple mode redundancy (TMR, more commonly called triple modular redundancy), in which the same critical calculation is performed three times and the majority output is chosen, as shown below.

Schematic explaining triple mode redundancy

Sebastian successfully adapted this technique for the NM500 by exploiting unused bits in the category register. At no extra cost or physical complexity, the technique corrected 96% of single event effects, a dramatic improvement in the device's reliability.
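The idea of storing redundant copies in spare register bits can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme used in the paper: the 5-bit category width, the 15-bit packed layout, and the `encode`/`decode` helpers are hypothetical.

```python
# Hypothetical layout: three copies of a 5-bit category packed into 15 bits.
# The field width and helper names are assumptions made for illustration.
CATEGORY_BITS = 5
MASK = (1 << CATEGORY_BITS) - 1


def encode(category):
    """Pack three identical copies of the category into one register value."""
    if not 0 <= category <= MASK:
        raise ValueError("category does not fit in the assumed field width")
    return (
        category
        | (category << CATEGORY_BITS)
        | (category << (2 * CATEGORY_BITS))
    )


def decode(register):
    """Recover the category by a bitwise majority vote over the three copies."""
    a = register & MASK
    b = (register >> CATEGORY_BITS) & MASK
    c = (register >> (2 * CATEGORY_BITS)) & MASK
    # Each output bit is 1 only if at least two of the three copies agree on 1.
    return (a & b) | (a & c) | (b & c)


stored = encode(19)
corrupted = stored ^ (1 << 7)  # simulate one bit flipped in the second copy

assert decode(stored) == 19
assert decode(corrupted) == 19
```

A vote of this kind corrects any single flipped bit, but it cannot recover the value if the same bit position is flipped in two of the three copies.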

The work on the NM500 serves as a promising proof of concept for further investigations into the reliability of AI accelerators. It highlights the importance of neutron testing and is a testament to the pioneering electronics research performed at ChipIr.

Further Information

The full paper can be found online at DOI: 10.1109/TNS.2021.3086686

For more information on the errors caused by terrestrial neutrons, see the diagram below, taken from Terrestrial Radiation Effects in ULSI Devices and Electronic Systems by Eishi H. Ibe, or watch this video explaining the dangers of single event effects.

Diagram explaining single event effects

Contact: de Laune, Rosie (STFC,RAL,ISIS)