Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations) 2023
DOI: 10.18653/v1/2023.acl-demo.54
Petals: Collaborative Inference and Fine-tuning of Large Models

Abstract: Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them inaccessible to most researchers. In this work, we investigate methods for cost-efficient inference and fine-tuning of LLMs, comparing local and distributed strategies. We observe that a large enough model (50B+) can run efficiently even on geodistributed devices in a consumer-gra…
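For context, the system described in the abstract is exposed to users through a Transformers-style Python client. The sketch below shows how such a client is typically invoked, based on the publicly documented petals package; the model repository name and exact class name are illustrative assumptions and may differ between releases.

```python
# Minimal sketch of running inference through the Petals client.
# Assumes the `petals` and `transformers` packages are installed;
# the model repository name below is illustrative.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "bigscience/bloom-petals"  # assumption: a model served on a public swarm
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Only the embeddings run locally; the transformer blocks are executed by
# remote servers, which is what makes 50B+ models usable on consumer hardware.
inputs = tokenizer("A quick test of distributed inference:", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))
```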

Cited by 5 publications (3 citation statements)
References 24 publications
“…Borzunov et al. [12] proposed an innovative approach to distributing the workload of LLMs across multiple servers. This method splits the LLM across consumer hardware and executes it in a distributed manner, incorporating dynamic quantization and server load balancing along with fault tolerance.…”
Section: Related Work
Mentioning confidence: 99%
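To make the cited partitioning idea concrete, the sketch below assigns transformer blocks to heterogeneous servers so that per-server load stays roughly balanced. The greedy rule and the throughput values are illustrative assumptions, not the actual Petals load-balancing protocol.

```python
# Hedged sketch: greedy assignment of transformer blocks to servers by
# estimated capacity. Illustrates the idea described in the citation; it is
# not the algorithm used by Petals itself.
from dataclasses import dataclass, field

@dataclass
class Server:
    name: str
    throughput: float                  # assumed relative capacity of the server
    blocks: list = field(default_factory=list)

    @property
    def load(self) -> float:
        # Load grows with the number of hosted blocks, shrinks with capacity.
        return len(self.blocks) / self.throughput

def assign_blocks(num_blocks: int, servers: list) -> list:
    """Greedily give each block to the currently least-loaded server."""
    for block_id in range(num_blocks):
        target = min(servers, key=lambda s: s.load)
        target.blocks.append(block_id)
    return servers

if __name__ == "__main__":
    swarm = [Server("gpu-a", 3.0), Server("gpu-b", 1.0), Server("gpu-c", 2.0)]
    for server in assign_blocks(num_blocks=24, servers=swarm):
        print(server.name, server.blocks)
```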
“…The selection of edge devices for services is based on the pheromone trail strength, which reflects the accumulated experience of previous iterations, and a heuristic factor that incorporates domain-specific knowledge or performance metrics. This decision-making process is followed by an immediate update to the local pheromone value after each service placement (lines 11-15). This local update serves as a rapid feedback technique, guiding subsequent selections within the same cycle.…”
Section: Algorithm 1: EdgeGenACO
Mentioning confidence: 99%
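As an illustration of the selection rule quoted above, the following sketch combines a pheromone term and a heuristic term into a selection probability and applies a local pheromone update right after each placement. The parameters (alpha, beta, rho, tau0) and the update rule follow generic ant-colony-optimization conventions and are assumptions, not the exact EdgeGenACO formulation.

```python
# Hedged sketch of ant-colony-style edge-device selection with a local
# pheromone update applied immediately after each service placement.
import random

ALPHA, BETA = 1.0, 2.0   # weights of pheromone vs. heuristic information
RHO, TAU0 = 0.1, 0.01    # local evaporation rate and baseline pheromone

def select_device(pheromone: dict, heuristic: dict) -> str:
    """Pick a device with probability proportional to tau^alpha * eta^beta."""
    weights = {d: (pheromone[d] ** ALPHA) * (heuristic[d] ** BETA) for d in pheromone}
    total = sum(weights.values())
    return random.choices(list(weights), [w / total for w in weights.values()])[0]

def local_update(pheromone: dict, device: str) -> None:
    """Evaporate part of the pheromone on the chosen device (rapid feedback)."""
    pheromone[device] = (1 - RHO) * pheromone[device] + RHO * TAU0

if __name__ == "__main__":
    pheromone = {"edge-1": 0.5, "edge-2": 0.5, "edge-3": 0.5}
    heuristic = {"edge-1": 0.9, "edge-2": 0.4, "edge-3": 0.7}  # e.g. inverse latency
    for service in range(5):
        chosen = select_device(pheromone, heuristic)
        local_update(pheromone, chosen)   # guides later selections in the same cycle
        print(f"service {service} -> {chosen}")
```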
“…The concept of local-SGD (or FedAvg) has previously been applied in the realm of language modeling. Cross-device federated learning, for instance, has been utilized to pretrain and fine-tune language models (Borzunov et al., 2022; Diskin et al., 2021a; Hilmkil et al., 2021; Presser, 2020; Ro et al., 2022; Ryabinin et al., 2021). More recently, DiLoCo has extended the local-SGD methodology to larger language models, specifically proposing the use of AdamW + Nesterov momentum as the InnerOpt + OuterOpt pairing.…”
Section: Local-SGD for Language Modeling
Mentioning confidence: 99%
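To make the inner/outer optimizer pairing mentioned in the last citation concrete, here is a hedged PyTorch-style sketch of local-SGD in the DiLoCo spirit: each worker runs several AdamW steps locally, and the averaged parameter delta is applied by an outer SGD optimizer with Nesterov momentum. The worker count, step counts, and toy model are illustrative assumptions, not the cited setup.

```python
# Hedged sketch of DiLoCo-style local-SGD: AdamW as the inner optimizer on
# each worker, SGD with Nesterov momentum as the outer optimizer applied to
# the averaged update. Single-process simulation with toy sizes.
import copy
import torch
from torch import nn

torch.manual_seed(0)
global_model = nn.Linear(16, 4)                       # stand-in for a language model
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7,
                            momentum=0.9, nesterov=True)

NUM_WORKERS, INNER_STEPS = 4, 8

for outer_round in range(3):
    deltas = [torch.zeros_like(p) for p in global_model.parameters()]
    for _ in range(NUM_WORKERS):
        worker = copy.deepcopy(global_model)          # each worker starts from the global weights
        inner_opt = torch.optim.AdamW(worker.parameters(), lr=1e-3)
        for _ in range(INNER_STEPS):                  # local steps on (simulated) local data
            x, y = torch.randn(32, 16), torch.randn(32, 4)
            loss = nn.functional.mse_loss(worker(x), y)
            inner_opt.zero_grad(); loss.backward(); inner_opt.step()
        for d, p_global, p_local in zip(deltas, global_model.parameters(),
                                        worker.parameters()):
            d += (p_global.data - p_local.data) / NUM_WORKERS  # averaged outer delta
    outer_opt.zero_grad()
    for p, d in zip(global_model.parameters(), deltas):
        p.grad = d                                    # treat the averaged delta as a pseudo-gradient
    outer_opt.step()
    print(f"outer round {outer_round} done")
```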