
    The Pile

    An 800GB Dataset of Diverse Text for Language Modeling

    What is the Pile?

    The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets combined.

    Download

    The Pile is hosted by the Eye.

    The Pile is distributed as jsonlines data compressed with zstandard.
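
    As a rough illustration, the snippet below streams records from a single shard in that format. The shard name 00.jsonl.zst and the "text"/"meta" record fields are assumptions about a local copy of the data, not an official reader; adjust both to match your download.

    # Minimal sketch (assumption: a local shard named "00.jsonl.zst" with a
    # {"text": ..., "meta": ...} record layout); adjust path and fields to your copy.
    import io
    import json

    import zstandard as zstd  # pip install zstandard

    def read_pile_shard(path):
        # Yield one decoded JSON record per line of a .jsonl.zst file.
        with open(path, "rb") as fh:
            dctx = zstd.ZstdDecompressor()
            with dctx.stream_reader(fh) as reader:
                for line in io.TextIOWrapper(reader, encoding="utf-8"):
                    yield json.loads(line)

    # Peek at the first few documents.
    for i, record in enumerate(read_pile_shard("00.jsonl.zst")):
        print(record["meta"], record["text"][:80])
        if i >= 2:
            break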

    Have a model that uses or evaluates on the Pile? Let us know!

    Why is the Pile a good training set?

    Recent work has shown that, especially for large models, diversity in data sources improves the general cross-domain knowledge of the model as well as its downstream generalization capability. In our evaluations, models trained on the Pile not only show moderate improvements on traditional language modeling benchmarks, but also show significant improvements on Pile BPB.

    Why is the Pile a good benchmark?

    To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.
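
    As a concrete reading of the metric, the sketch below converts a model's average per-token cross-entropy (in nats) into bits per byte. The token and byte counts are made-up numbers for illustration, not results from the evaluation harness.

    import math

    def bits_per_byte(loss_nats_per_token, num_tokens, num_bytes):
        # Total nats = loss * tokens; divide by ln(2) to get bits, then by byte count.
        total_bits = loss_nats_per_token * num_tokens / math.log(2)
        return total_bits / num_bytes

    # Hypothetical example: 2.1 nats/token over 1M tokens spanning 4.2M UTF-8 bytes.
    print(f"{bits_per_byte(2.1, 1_000_000, 4_200_000):.4f} BPB")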

    Citing

    If you use the Pile or any of the components, please cite us!

    @article{pile,
      title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
      author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
      journal={arXiv preprint arXiv:2101.00027},
      year={2020}
    }
                    

    Leaderboard

    * indicates potential test-set overlap. Zero-shot indicates that not all of the components of the Pile were present in the model's training data.

    Rank  Date         Model                Organization  Test BPB
    1     Jan 1, 2021  GPT-3 (Zero-Shot)*   OpenAI        0.7177
    2     Jan 1, 2021  GPT-2 (Zero-Shot)*   OpenAI        1.2253
