Personalisation is everywhere these days: think of social media feeds curating content based on your profile, or of browsing the web and uncannily spotting ads for products you Googled only moments ago. We've all become all too accustomed to these practices, where a lot of money is made by tools that carefully transform who we are and what we like into fat dollar signs.
But the voices demanding better privacy protection and ownership of our own data are growing louder; there is a backlash against these practices.
At Xayn, this is exactly our mission. We started out building a mobile news app that featured personalisation without user data ever leaving the device. Based on user interactions, such as explicitly liking articles or implicitly reading them for long enough, we would keep adjusting the news feed so that each new batch of articles aligned more and more closely with the user's interests.
This was no simple feat: to maintain absolute privacy, we couldn't rely on a server-side solution to process the user's interactions, or the data would have left the device.
Personalisation means tracking what users like. In our case, we transform a liked article into a centre of interest, or, if another centre already exists and is found to be close enough, we merge the article into that existing centre and evolve it instead. You can think of a centre of interest as a collection of keywords derived from longer texts, but represented in a purely mathematical way, as an embedding vector. These centres allow us to better understand what a user likes, and in turn to present better-matching articles going forward.
So, to optimise our feed, we compute the semantic similarity between candidate articles and the centres of interest, and present the top-matching articles to the user.
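The centre-of-interest update described above can be sketched in plain Rust roughly as follows. This is a minimal illustration, not our production code: the `CentreOfInterest` struct, the cosine-similarity measure, the running-mean merge, and the similarity threshold are all simplifying assumptions made for this example.

```rust
/// Cosine similarity between two embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

struct CentreOfInterest {
    embedding: Vec<f32>,
    weight: f32, // how many articles have contributed to this centre
}

/// Merge a liked article into the closest existing centre if one is
/// similar enough; otherwise open a new centre for it.
fn update_centres(centres: &mut Vec<CentreOfInterest>, article: &[f32], threshold: f32) {
    // Find the index and score of the most similar existing centre, if any.
    let best = centres
        .iter()
        .enumerate()
        .map(|(i, c)| (i, cosine_similarity(&c.embedding, article)))
        .max_by(|(_, s1), (_, s2)| s1.total_cmp(s2));

    match best {
        Some((i, sim)) if sim >= threshold => {
            // Evolve the existing centre: weighted running mean of its members.
            let centre = &mut centres[i];
            let w = centre.weight;
            for (e, &x) in centre.embedding.iter_mut().zip(article) {
                *e = (*e * w + x) / (w + 1.0);
            }
            centre.weight += 1.0;
        }
        _ => centres.push(CentreOfInterest {
            embedding: article.to_vec(),
            weight: 1.0,
        }),
    }
}
```

With a scheme like this, a stream of liked articles gradually settles into a small set of centres, and ranking the feed is just a matter of scoring each candidate article against those centres.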
Since we don't use a server-side solution, we bundled our A.I. model and related logic with the app and began an uphill battle to reach an acceptable level of performance; running a model is generally an expensive task. Solutions like ChatGPT, for example, require a ton of energy to run: some estimates place it at 24 kgCO2e per day.
Granted, we were not embedding our own chatbot on a device; nevertheless, our model had to be as small as possible, and the token count for our embeddings had to be low enough that the device could compute similarities within an acceptable time.
We chose a tiny, lightweight student model and further reduced the number of layers to 1, the intermediate size to 512, and the embedding size to 128. Next, we reduced the sequence length from 512 to 128 to lower inference latency.
Via knowledge distillation, we then trained the student model to reproduce the behaviour of a large teacher model.
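In the field names of a typical Hugging Face-style transformer configuration, that reduction looks roughly like this (an illustration of the numbers above, not our actual config file):

```json
{
  "num_hidden_layers": 1,
  "intermediate_size": 512,
  "hidden_size": 128,
  "max_position_embeddings": 128
}
```

Each of these cuts shrinks both the parameter count and the per-token compute, which is what makes on-device inference feasible.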
Finally, we converted the model to ONNX so that we could run it with Tract.
Tract is a neural network inference engine; think of it as a lightweight version of TensorFlow. It's maintained as an open-source project by Sonos and is available for the Rust programming language.
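Loading and running an ONNX model with the `tract_onnx` crate looks roughly like this. This is a generic sketch, not our exact pipeline: the file name, the `f32` input type, and the shapes are illustrative (a real text model would be fed tokenizer output rather than zeros), and the snippet needs the `tract-onnx` crate as a dependency.

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    let model = tract_onnx::onnx()
        // Load the ONNX file produced by the conversion step.
        .model_for_path("model.onnx")?
        // Fix the input fact: batch of 1, sequence length 128.
        .with_input_fact(0, f32::fact([1, 128]).into())?
        // Let Tract optimise the computation graph.
        .into_optimized()?
        // Freeze inputs/outputs into a runnable plan.
        .into_runnable()?;

    // In a real pipeline, this tensor would hold tokenizer output;
    // zeros stand in here.
    let input: Tensor = tract_ndarray::Array2::<f32>::zeros((1, 128)).into();
    let outputs = model.run(tvec!(input.into()))?;
    println!("output shape: {:?}", outputs[0].shape());
    Ok(())
}
```

Because the whole plan is built ahead of time, the per-inference cost is just the optimised graph execution, which suits a latency-sensitive mobile app.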
Rust itself is well known for its excellent performance and memory safety. That alone made it an ideal candidate for our project; however, speed was not the only reason to choose Rust. The mobile market is dominated by Android and iOS, and the two platforms still run on a range of different CPU architectures.
With Rust, we could compile against the most common architectures, even targeting WebAssembly if we wanted to, and we could optimise our binary further still, using link-time optimisation (LTO) for example.
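As a sketch, the relevant release-profile settings in `Cargo.toml` look something like this (the values are an illustrative example, not our exact configuration):

```toml
[profile.release]
lto = "fat"        # whole-program link-time optimisation
codegen-units = 1  # fewer codegen units give LLVM more room to optimise
opt-level = 3      # optimise for speed
strip = true       # strip symbols to shrink the binary
```

Cross-compilation then only requires installing the right targets, e.g. `rustup target add aarch64-apple-ios wasm32-unknown-unknown`.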
Today, we are no longer building this app; instead, we've taken what we learned and now offer our personalisation and semantic search as a service for B2B clients. Our goal has shifted to providing a service that is reliable, affordable, and, of course, performs well on benchmarks.
With Rust, we were able to create a robust web application on top of the existing framework we had developed for mobile devices, and we expose our functionality via APIs specified with the OpenAPI standard.
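To illustrate what such a specification looks like, here is a hypothetical OpenAPI 3 fragment for a semantic-search endpoint; the path and field names are made up for this example and are not our actual API:

```yaml
openapi: 3.0.3
info:
  title: Semantic search (illustrative example)
  version: "1.0"
paths:
  /documents/search:
    post:
      summary: Return documents ranked by semantic similarity to a query
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [query]
              properties:
                query:
                  type: string
                top_k:
                  type: integer
                  default: 10
      responses:
        "200":
          description: Ranked matches with similarity scores
```

Specifying the API this way lets clients generate typed bindings for their own stack instead of hand-writing HTTP calls.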
Given all the optimisations mentioned in this article, moving to the cloud gave us a clear competitive edge: with our model Xaynia, we can offer semantic similarity and semantic search that process millions of documents with minimal energy consumption, and therefore at a very low price. We are also continuing our efforts to provide more Natural Language Processing functionality down the line in the same energy- and cost-efficient manner.
We have tested and optimised Xaynia extensively for energy efficiency, comparing it to other popular transformer models such as paraphrase-xlm-r-multilingual-v1 and Google's BERT base. In these settings, Xaynia was up to 20 times more energy efficient. Compared to even larger language models such as OpenAI's text-embedding-ada-002, we estimate Xaynia to be at least 40x more energy efficient, due in part to their 2x larger embedding space.
For more details, we invite you to take a look at our white paper explaining the approach in depth. If you'd like to discuss our approach with us and share ideas on how we can improve it further, please don't hesitate to reach out to us at email@example.com.