AutoTrain: No-code training for state-of-the-art models

Abhishek Thakur
2024-10-21
Abstract: With the advancements in open-source models, training (or finetuning) models on custom datasets has become a crucial part of developing solutions which are tailored to specific industrial or open-source applications. Yet, there is no single tool which simplifies the process of training across different types of modalities or tasks. We introduce AutoTrain (aka AutoTrain Advanced), an open-source, no-code tool/library which can be used to train (or finetune) models for different kinds of tasks such as: large language model (LLM) finetuning, text classification/regression, token classification, sequence-to-sequence tasks, finetuning of sentence transformers, visual language model (VLM) finetuning, image classification/regression and even classification and regression tasks on tabular data. AutoTrain Advanced is an open-source library providing best practices for training models on custom datasets. The library is available at https://github.com/huggingface/autotrain-advanced. AutoTrain can be used in fully local mode or on cloud machines and works with tens of thousands of models shared on Hugging Face Hub and their variations.
Artificial Intelligence
What problem does this paper attempt to address?
The paper aims to simplify model training (or fine-tuning) across different tasks and modalities. Although the development of open-source models has made training or fine-tuning on custom datasets increasingly important, there is currently no general-purpose open-source solution that covers these tasks. The specific challenges mentioned in the paper include:

1. **Complexity of hyperparameter tuning**: Finding suitable hyperparameters requires a large number of experiments and expertise. Poorly chosen hyperparameters may lead to overfitting or underfitting.
2. **Model validation**: To ensure that the trained model generalizes well, a suitable validation set and evaluation metrics are required. Overfitting the training data may lead to poor performance in real-world scenarios.
3. **Distributed training**: Training with multiple GPUs on large datasets can be cumbersome and requires many modifications to the codebase. Distributed training also adds complexity in synchronization and data processing.
4. **Monitoring**: While a model trains, it is important to monitor the loss, metrics, and other artifacts to confirm that training is progressing as expected.
5. **Maintenance**: As the data changes, it may be necessary to retrain or fine-tune the model while keeping the training settings consistent.

To address these challenges, the paper introduces **AutoTrain** (also known as **AutoTrain Advanced**), an open-source, no-code tool/library that can be used for different types of tasks, such as large language model (LLM) fine-tuning, text classification/regression, token classification, sequence-to-sequence tasks, fine-tuning of sentence transformers, visual language model (VLM) fine-tuning, image classification/regression, and classification and regression tasks for tabular data.
AutoTrain provides a simple interface that enables users to train models on custom datasets without extensive programming knowledge.
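
To make the model validation challenge listed above concrete, here is a minimal Python sketch of the two steps it implies: holding out a validation set, and using validation loss to decide which epoch to keep. This is an illustration only, not AutoTrain's API or the paper's method. The function names `train_validation_split` and `best_epoch`, the 20% hold-out fraction, and the patience value are all assumptions chosen for the example.

```python
import random


def train_validation_split(examples, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction for validation.

    Hypothetical helper for illustration; not part of AutoTrain.
    """
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be strictly between 0 and 1")
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across retraining runs.
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


def best_epoch(validation_losses, patience=2):
    """Return the index of the epoch to keep under simple early stopping.

    Scanning stops once the validation loss has failed to improve for
    `patience` consecutive epochs, a common sign of overfitting.
    """
    best_index = 0
    epochs_without_improvement = 0
    for index, loss in enumerate(validation_losses):
        if loss < validation_losses[best_index]:
            best_index = index
            epochs_without_improvement = 0
        elif index > 0:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_index


if __name__ == "__main__":
    train, valid = train_validation_split(range(10))
    print(len(train), len(valid))
    # Validation loss falls, then rises for two epochs in a row.
    print(best_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 0.6]))
```

Fixing the seed matters for the maintenance challenge as well: when the model is retrained on updated data, a reproducible split keeps the evaluation comparable between runs.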