
Microsoft rolled out Z-code-powered enhancements to Translator last October, adding support for 12 new languages including Georgian, Tibetan, and Uyghur. Now, the company says that an improved version of Z-code - Z-code Mixture of Experts (MoE), which launched this week - can better understand “low-resourced” language nuances.

Like all models, Microsoft’s learn from examples in large datasets sourced from a mixture of public and private archives (e.g., ebooks, websites such as Wikipedia, and hand-translated documents). Low-resource languages are generally defined as having under 1 million example sentences, which adds to the challenge of developing models - AI models usually perform better when given more examples.

Because many languages share linguistic elements, Microsoft develops Z-code models multilingually across different languages, and that knowledge is transferred between languages. For example, a model’s translation skills might be used to improve its ability to understand natural (i.e., everyday) language.

The AI models used in modern text translation, MoE or no, contain components called “neurons” that are organized into distinctive layers. Each neuron is a mathematical operation that plays a key role in how the model “learns” to interpret and translate languages. MoEs are made up of small clusters of neurons that are only active under special, specific circumstances. Lower layers extract certain “features” from the text to be translated - i.e., characteristics - and “experts” - i.e., clusters - are called upon to evaluate those features. For example, each expert cluster can learn to handle a separate part of speech or semantic or grammatical rule.
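
To make the routing idea concrete, here is a minimal sketch of a mixture-of-experts layer with top-2 gating, written in Python with NumPy. It illustrates the general technique only - the layer sizes, the gating function, and the number of experts are invented for the example and are not taken from Microsoft’s Z-code models.

    # Minimal mixture-of-experts (MoE) layer with top-2 gating.
    # Illustrative only: sizes and routing details are hypothetical, not Z-code's.
    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_hidden, n_experts, top_k = 8, 16, 4, 2

    # Each "expert" is a small feed-forward block with its own weights.
    experts = [
        (rng.standard_normal((d_model, d_hidden)), rng.standard_normal((d_hidden, d_model)))
        for _ in range(n_experts)
    ]

    # The gating network scores every expert for a given token representation.
    gate_weights = rng.standard_normal((d_model, n_experts))

    def moe_layer(token):
        """Route one token vector to its top-k experts and mix their outputs."""
        scores = token @ gate_weights                            # one score per expert
        top = np.argsort(scores)[-top_k:]                        # indices of the k best-scoring experts
        probs = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over the chosen experts only

        output = np.zeros(d_model)
        for weight, idx in zip(probs, top):
            w_in, w_out = experts[idx]
            hidden = np.maximum(token @ w_in, 0.0)               # ReLU feed-forward expert
            output += weight * (hidden @ w_out)                  # weighted mix of expert outputs
        return output

    token = rng.standard_normal(d_model)
    print(moe_layer(token).shape)  # (8,) - only 2 of the 4 experts did any work

The key property is visible in the loop: no matter how many experts the layer holds, only top_k of them do any arithmetic for a given token, which is what keeps the per-token compute low.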

“Z-code MoE models are a promising way forward in the language domain since they are more efficient and need fewer systems to run. The same underlying model can be fine-tuned to perform different language understanding tasks such as translating between languages, summarizing a speech, offering ways to complete a sentence or generating suggested tweets, instead of having to develop separate models for each of those narrow purposes,” Xuedong Huang, chief technology officer at Microsoft’s Azure AI division, told VentureBeat via email. “While the Z-code MoE models learn universal representation, specific parts of the model can specialize in particular languages and linguistic characteristics to enable better translation.”
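
The “one underlying model, many tasks” idea in Huang’s quote can be pictured with a short sketch: a single shared encoder produces a representation, and small task-specific heads are fine-tuned on top of it. The class names, task names, and sizes below are hypothetical, chosen only to illustrate the pattern rather than Microsoft’s actual setup.

    # Toy sketch of a shared multilingual backbone with per-task heads.
    # Everything here (names, sizes, tasks) is invented for illustration.
    import numpy as np

    rng = np.random.default_rng(1)
    d_model = 8

    class SharedEncoder:
        """Stand-in for the large multilingual model shared by every task."""
        def __init__(self):
            self.w = rng.standard_normal((d_model, d_model))

        def encode(self, token_vector):
            return np.tanh(token_vector @ self.w)

    class TaskHead:
        """A small per-task layer; only this part differs between tasks."""
        def __init__(self, n_outputs):
            self.w = rng.standard_normal((d_model, n_outputs))

        def predict(self, encoded):
            return encoded @ self.w

    encoder = SharedEncoder()                    # trained once, reused everywhere
    heads = {
        "translate": TaskHead(n_outputs=32000),  # e.g., a target-language vocabulary
        "summarize": TaskHead(n_outputs=32000),
        "complete_sentence": TaskHead(n_outputs=32000),
    }

    token = rng.standard_normal(d_model)
    shared = encoder.encode(token)               # the expensive part is computed once
    for task, head in heads.items():
        print(task, head.predict(shared).shape)

In this picture, fine-tuning mostly touches the small heads while the backbone is reused, which is why one model can serve translation, summarization, and sentence completion instead of three separate models.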

Compared with other model architectures, MoEs have some advantages. The experts can receive a mix of data, but only a few experts remain active at any one time, meaning that even a huge model needs only a small amount of processing power in order to develop or run. In fact, MoE is one of the few architectures demonstrated to scale to more than a trillion parameters. (Parameters are the part of the model that’s learned from example text data, and generally speaking - especially in language - the correlation between the number of parameters and sophistication has held up remarkably well.) To illustrate, an MoE model containing 1.6 trillion parameters requires compute resources approximately equal to those of a 10 billion-parameter conventional model, by Microsoft’s estimation.

The cost isn’t insubstantial, to be fair - a 2020 study from startup AI21 Labs pegged the expenses for developing a text-generating model with only 1.5 billion parameters at between $80,000 and $1.6 million. Microsoft’s and Nvidia’s recently released Megatron 530B language model, which has 530 billion parameters, was originally developed across 560 Nvidia DGX A100 servers. But MoE is more efficient than other methods.
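
The arithmetic behind that efficiency claim is easy to sketch: in a sparse MoE, most parameters sit in experts that a given token never touches, so the work per token tracks the few active experts rather than the full parameter count. The numbers below are invented for illustration and are not Microsoft’s actual configuration.

    # Back-of-the-envelope view of sparse activation in an MoE model.
    # All figures are hypothetical, not Microsoft's real deployment numbers.
    n_experts = 128                       # experts per MoE layer
    top_k = 2                             # experts consulted per token
    params_per_expert = 12_000_000_000    # parameters inside each expert
    shared_params = 2_000_000_000         # attention/embedding parameters every token uses

    total_params = shared_params + n_experts * params_per_expert
    active_params = shared_params + top_k * params_per_expert

    print(f"total parameters: {total_params / 1e12:.2f} trillion")
    print(f"active per token: {active_params / 1e9:.0f} billion")
    # In this toy setup a ~1.5-trillion-parameter MoE does roughly as much work
    # per token as a ~26-billion-parameter dense model - the same effect behind
    # Microsoft's 1.6-trillion-vs-10-billion comparison above.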

MoEs were first proposed in the ’90s, and research papers in recent years from companies including Google describe experiments with trillion-parameter-plus MoE language models. But Microsoft claims that Z-code MoE is the first MoE language model to reach production.

“Using an MoE approach allows us to achieve performance and quality benefits more efficiently, as it only engages a portion of the model to complete a task, as opposed to other architectures that have to activate an entire AI model to run every request. This architecture allows massive scale in the number of model parameters while keeping the amount of compute constant,” Huang continued. “For our production model deployment, we are using 5 billion parameter models, which are 80 times larger than Microsoft’s currently deployed models. A single MoE model can replace 20 of the current translation models, increasing efficiency of training MoE models while also improving translation accuracy.”

Future work

While Microsoft says that Z-code MoE has led to great strides in improving language translation, the problem isn’t solved.
