Few-shot learning is the capacity to complete a task given a small number of demonstrations. If large pre-trained language models exhibited such capabilities, a single model could be used across multiple real-world tasks.
Therefore, a recent paper on arXiv.org proposes a real-world few-shot text classification benchmark designed to measure how much recent and upcoming NLP advances benefit applications.
The benchmark focuses on naturally occurring tasks. For each task, a public training set with 50 examples and a larger unlabeled test set are released. Unsupervised pre-training on the unlabeled examples and open-domain information retrieval are encouraged. Submitted predictions are then evaluated automatically.
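To make the setup concrete, here is a minimal sketch of how a handful of labeled training examples could be turned into a few-shot prompt for a language model. The function name, the "Text:"/"Label:" layout, and the toy data are illustrative assumptions, not the format used by the paper's baselines.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    instructions: str,
    train: List[Tuple[str, str]],
    query: str,
    max_examples: int = 50,  # the benchmark releases 50 labeled examples per task
) -> str:
    """Concatenate labeled demonstrations, then append the unlabeled query."""
    parts = [instructions.strip(), ""]
    for text, label in train[:max_examples]:
        parts.append(f"Text: {text}\nLabel: {label}\n")
    # The model is expected to continue the prompt with the predicted label.
    parts.append(f"Text: {query}\nLabel:")
    return "\n".join(parts)


# Hypothetical toy data, for illustration only.
train_examples = [
    ("The battery died after two days.", "complaint"),
    ("Thanks for the quick reply!", "no complaint"),
]
prompt = build_few_shot_prompt(
    "Classify each text as 'complaint' or 'no complaint'.",
    train_examples,
    "My order never arrived.",
)
print(prompt)
```

In practice, long inputs may not all fit into a model's context window, so a real system would likely need to select or truncate demonstrations.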
The benchmark complements existing synthetic benchmarks designed to highlight where models fall short. It helps measure the gap between research and practice and provides a template for future benchmarks that mirror deployment.
Large pre-trained language models have shown promise for few-shot learning, completing text-based tasks given only a few task-specific examples. Will models soon solve classification tasks that have so far been reserved for human research assistants? Existing benchmarks are not designed to measure progress in applied settings, and so don’t directly answer this question. The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment. Baseline evaluations on RAFT reveal areas current techniques struggle with: reasoning over long texts and tasks with many classes. Human baselines show that some classification tasks are difficult for non-expert humans, reflecting that real-world value sometimes depends on domain expertise. Yet even non-expert human baseline F1 scores exceed GPT-3 by an average of 0.11. The RAFT datasets and leaderboard will track which model improvements translate into real-world benefits.
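For readers unfamiliar with the F1 scores mentioned above, the following sketch computes a per-class F1 and averages it over classes. The abstract only says "F1"; the macro averaging used here and the toy labels are assumptions made for illustration.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, for illustration only.
gold = ["a", "a", "b", "b"]
pred = ["a", "b", "b", "b"]
score = macro_f1(gold, pred)
print(round(score, 3))
```

Because every class contributes equally to the average, a model that ignores rare classes is penalized more than it would be under plain accuracy, which matters for tasks with many classes.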
Research paper: Alex, N., “RAFT: A Real-World Few-Shot Text Classification Benchmark”, 2021. Link: https://arxiv.org/abs/2109.14076