Which scaling rule applies to large artificial neural networks

Végh, János

doi:10.1007/s00521-021-06456-y

Cited by 9 publications

(19 citation statements)

References 45 publications


“…Their operating principle undergoes the general distributed processing principles. As discussed in [35], they can do valuable work at a small number of cores ('toy level') and can be useful embedded components in a general-purpose processor, but have severe performance limitations at large scale systems. They are sensitive to the synchronization issues discussed here, primarily if they use feedback and recurrency [16].…”

Section: Artificial Neural Network (mentioning)

confidence: 99%

“…In the case of AI-type workload, the performance with half-precision and double precision operands differ only marginally for vast systems. For details, see [35,42].…”

Section: Half-length Operands Vs Double-length Ones (mentioning)

confidence: 99%

“…Despite all of this, the idea of non-temporal behavior was confirmed by accepting the concept of "weak scaling" [48], suggesting that all housekeeping times, such as organizing the joint work of parallelized serial processors, sharing resources, using exceptions, and OS services, delivering data between processing units and data storage units, are negligible. See [35] why weak scaling is wrong. Essentially, this is why the algorithmic scalability assumes a dependence on the number of operations (i.e., it assumes that the transfer time can be neglected aside from processing time), rather than taking into account how the effective computing time changes with the transfer time between the computing units as the physical size of the system increases.…”

Section: The Role Of Transfer Time (mentioning)

confidence: 99%

“…[64]. For a more detailed analysis see [58], and specifically for the case of artificial neural networks [35]. The figure suggests using another design principle instead of using the bus exclusively, directly from the computing component's position.…”

Section: The Serial Bus (mentioning)

confidence: 99%

“…This efficiency decrease is why these entries reduced their number of cores in the second benchmark. Their payload performance reached their "roofline" [35,72] levels at that number of cores; using all cores would decrease the system's performance by order of magnitude only because of the higher number of cores. (Started with June 2021, this "measured cores" information is missing from the published HPCG data, and even the formerly published data are removed.…”

Section: Distributed Processing (mentioning)

confidence: 99%


Revising the Classic Computing Paradigm and Its Technological Implementations

Végh

2021

Informatics

Self Cite


Today’s computing is based on the classic paradigm proposed by John von Neumann, three-quarters of a century ago. That paradigm, however, was justified for (the timing relations of) vacuum tubes only. The technological development invalidated the classic paradigm (but not the model!). It led to catastrophic performance losses in computing systems, from the operating gate level to large networks, including the neuromorphic ones. The model is perfect, but the paradigm is applied outside of its range of validity. The classic paradigm is completed here by providing the “procedure” missing from the “First Draft” that enables computing science to work with cases where the transfer time is not negligible apart from the processing time. The paper reviews whether we can describe the implemented computing processes by using the accurate interpretation of the computing model, and whether we can explain the issues experienced in different fields of today’s computing by omitting the wrong omissions. Furthermore, it discusses some of the consequences of improper technological implementations, from shared media to parallelized operation, suggesting ideas on how computing performance could be improved to meet the growing societal demands.


On the Role of Speed in Technological and Biological Information Transfer for Computations

Végh, Berki

2022

Acta Biotheor


In all kinds of implementations of computing, whether technological or biological, some material carrier for the information exists, so in real-world implementations, the propagation speed of information cannot exceed the speed of its carrier. Because of this limitation, one must also consider the transfer time between computing units for any implementation. We need a different mathematical method to consider this limitation: classic mathematics can only describe infinitely fast and small computing system implementations. The difference between mathematical handling methods leads to different descriptions of the computing features of the systems. The proposed handling also explains why biological implementations can have lifelong learning and technological ones cannot. Our conclusion about learning matches published experimental evidence, both in biological and technological computing.
