Aria is here and may be here to stay. Artificial intelligence has a new contender in Rhymes AI, a Tokyo-based company that has developed an open-source multimodal LLM capable of processing text, code, images and video, all within a single architecture.
What is really remarkable about this model is its combination of versatility and efficiency: it is not a large model, so its resource demands are modest, which also makes it more environmentally friendly, since it consumes less power than many of its rivals.
How has this been achieved? By employing a Mixture-of-Experts (MoE) framework. This architecture is similar to having a team of mini-experts, each trained to excel in specific areas or tasks.
When the model receives a new input, only the relevant subset of experts is activated instead of the entire model. Running just that section of the model is lighter than running an all-knowing entity that tries to process everything.
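The routing idea described above can be sketched in a few lines. This is a toy illustration of top-k expert routing in general, not Aria's actual implementation; the gating weights and "experts" (simple linear maps) are made up for the example.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Toy Mixture-of-Experts step: a gating network scores every
    expert, but only the top-k actually run on the input."""
    scores = x @ gate_w                      # one score per expert
    top_k = np.argsort(scores)[-k:]          # indices of the k best experts
    # softmax over the selected scores only
    weights = np.exp(scores[top_k] - scores[top_k].max())
    weights /= weights.sum()
    # only the chosen experts do any computation at all
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d, n_experts = 8, 6
gate_w = rng.normal(size=(d, n_experts))
# each "expert" here is just a small linear layer
expert_mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda x, M=M: x @ M for M in expert_mats]

y = moe_forward(rng.normal(size=d), gate_w, experts, k=2)
print(y.shape)  # (8,)
```

With k=2 of 6 experts selected, only a third of the expert weights are ever touched for this input, which is the source of the efficiency gain the article describes.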
This is what makes Aria more efficient: unlike traditional dense models, which activate all of their parameters for every task, Aria selectively activates only 3.5 billion of its 24.9 billion parameters per token, reducing the computational load and improving performance.
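A quick back-of-the-envelope check on those figures shows how small the active fraction is:

```python
# Figures quoted in the article
total_params = 24.9e9   # total parameters
active_params = 3.5e9   # parameters activated per token

fraction = active_params / total_params
print(f"{fraction:.1%} of parameters active per token")  # ≈ 14.1%
```

In other words, roughly one seventh of the model does the work for any given token.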
The way Aria works also allows for greater scalability, as new expert subsets could be added to handle specialized tasks without overloading the system.
First multimodal MoE
The most important thing about the model created by Rhymes AI is that it is the first open-source multimodal MoE. While it is true that Mixtral-8x7B is a MoE and Pixtral is a multimodal LLM, Aria is the first to combine both architectures, creating what may be the next AI revolution.
In the latest benchmarks, this Tokyo-based model has outperformed strong models from much larger companies, such as Pixtral 12B and Llama 3.2-11B, and can almost go toe-to-toe with AI giants such as GPT-4o, Gemini-1 Pro and Claude 3.5 Sonnet, showing multimodal performance on par with OpenAI's model.
Aria is released under the Apache 2.0 license, allowing developers to adapt and build on the model. Like the open models released by Meta and Mistral, it can keep growing and improving in the hands of the open-source community.
Versatile
Aria's versatility also shines across a variety of tasks and makes it a virtually unstoppable model. When researchers uploaded a financial report, Aria returned an accurate analysis of the data, calculated profit margins and provided detailed breakdowns.
Aria goes even further: after different data was loaded into the system, it generated Python code to create graphs with weather details; it was also able to watch video tutorials, extract code snippets and even debug them; and it identified 19 different scenes within an hour-long video on Michelangelo's David.
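The article does not reproduce the code Aria generated, but the weather-graph task it describes amounts to something like the following hypothetical snippet; the data values and file name here are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display window needed
import matplotlib.pyplot as plt

# Hypothetical weather data of the kind described in the article
days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
temps_c = [18.2, 19.5, 17.8, 21.0, 22.3]

fig, ax = plt.subplots()
ax.plot(days, temps_c, marker="o")
ax.set_xlabel("Day")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Daily temperatures")
fig.savefig("weather.png")
```

Producing a small, runnable script like this from raw tabular input is exactly the kind of text-plus-code task a multimodal model is expected to handle.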