But Meta’s model is available only upon request, and it has a license that limits its use for research purposes. Hugging Face goes a step further. The meetings detailing its work over the past year are recorded and uploaded online, and anyone can download the model free of charge and use it for research or to build commercial applications.
A big focus for BigScience was to embed ethical considerations into the model from its inception, instead of treating them as an afterthought. LLMs are trained on tons of data collected by scraping the internet. This can be problematic, because these data sets include lots of personal information and often reflect dangerous biases. The group developed data governance structures specifically for LLMs that should make it clearer what data is being used and who it belongs to, and it sourced different data sets from around the world that weren’t readily available online.
The group is also launching a new Responsible AI License, which is something like a terms-of-service agreement. It is designed to act as a deterrent from using BLOOM in high-risk sectors such as law enforcement or health care, or to harm, deceive, exploit, or impersonate people. The license is an experiment in self-regulating LLMs before laws catch up, says Danish Contractor, an AI researcher who volunteered on the project and co-created the license. But ultimately, there’s nothing stopping anyone from abusing BLOOM.
The project had its own ethical guidelines in place from the very beginning, which worked as guiding principles for the model’s development, says Giada Pistilli, Hugging Face’s ethicalist, who drafted BLOOM’s ethical charter. For example, it made a point of recruiting volunteers from diverse backgrounds and locations, ensuring that outsiders can easily reproduce the project’s findings, and releasing its results in the open.
This philosophy translates into one major difference between BLOOM and other LLMs available today: the vast number of human languages the model can understand. It can handle 46 of them, including French, Vietnamese, Mandarin, Indonesian, Catalan, 13 Indic languages (such as Hindi), and 20 African languages. Just over 30% of its training data was in English. The model also understands 13 programming languages.
This is highly unusual in the world of large language models, where English dominates. That’s another consequence of the fact that LLMs are built by scraping data off the internet: English is the most commonly used language online.
The reason BLOOM was able to improve on this situation is that the team rallied volunteers from around the world to build suitable data sets in other languages even if those languages weren’t as well represented online. For example, Hugging Face organized workshops with African AI researchers to try to find data sets such as records from local authorities or universities that could be used to train the model on African languages, says Chris Emezue, a Hugging Face internal and a researcher at Masakhane , an organization working on natural-language processing for African languages.