Chinese technology company ByteDance has launched a new multimodal artificial intelligence (AI) model named Bagel. It is a visual language model (VLM) that can understand images as well as generate and edit them. Notably, the company has made it open source, and it can now be downloaded from popular AI platforms such as GitHub and Hugging Face.
Features of Bagel
Multimodal input: Capable of understanding and processing both text and images simultaneously.
14 billion parameters: only 7 billion of which are active at any one time.
Interleaved training data: Text and images were trained together, allowing Bagel to make better connections between the two.
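The "7 billion active out of 14 billion" figure reflects a mixture-style design in which only a subset of the model's weights participates in each forward pass. A minimal sketch of that arithmetic, using only the numbers ByteDance has stated (the expert split itself is illustrative, not a documented detail of Bagel's architecture):

```python
# Illustrative only: Bagel's published totals, not its internal layout.
total_params = 14_000_000_000   # full parameter count
active_params = 7_000_000_000   # parameters used per forward pass

# Fraction of the model actually computed on for a given input.
active_fraction = active_params / total_params
print(f"Active fraction per forward pass: {active_fraction:.0%}")
```

In practice, this is why such a model can offer 14B-scale capacity while its per-query compute cost is closer to that of a 7B dense model.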
Advanced image editing capabilities
ByteDance claims that Bagel outperforms existing open-source VLMs at image editing. It can handle tasks such as adding expressions to a face, removing, changing, or adding an element, style transfer, and free-form editing, i.e. making arbitrary changes without a fixed template.
Also capable of world modeling
Bagel has been trained to understand the world in visual form, for example the relationships between objects and the effects of natural factors such as light and gravity. ByteDance says that in its internal tests, Bagel surpassed Qwen2.5-VL-7B in image understanding, Janus-Pro-7B and Flux-1-dev in image generation, and Gemini-2-exp in image editing on the GEdit-Bench benchmark.