Unbabel’s Language Operations Platform gives businesses the ability to thrive across cultures and geographies by eliminating language barriers.
Users generally praise Unbabel for its seamless integration and effectiveness at combining AI with human review to enhance translation quality. However, some complaints focus on occasional inaccuracies and a lack of support for certain niche languages. Pricing opinions are mixed; some see it as fair given the service quality, while others find it somewhat high. Overall, Unbabel holds a solid reputation for its innovative approach to translation tasks, although it may not be the perfect fit for all language needs.
Mentions (30d): 1
Reviews: 0
Platforms: 2
Sentiment: 0% (0 positive)
Industry: information technology & services
Employees: 690
Funding Stage: Merger / Acquisition
Total Funding: $111.7M
We benchmarked TranslateGemma against 5 other LLMs on subtitle translation across 6 languages. At first glance the numbers told a clean story, but then human QA added a chapter. [D]
We evaluated six models on English subtitle translation into Spanish, Japanese, Korean, Thai, Chinese Simplified, and Chinese Traditional - 167 segments per language pair, scored with two reference-free QE metrics.

Models tested:
- TranslateGemma-12b
- claude-sonnet-4-6
- deepseek-v3.2
- gemini-3.1-flash-lite-preview
- gpt-5.4-mini
- gpt-5.4-nano

Scoring

We used MetricX-24 (lower = better) and COMETKiwi (higher = better) - both reference-free QE metrics. We also developed a combined score:

TQI = COMETKiwi × exp(−MetricX / 10)

The exponential term converts MetricX into a multiplicative fidelity penalty. When MetricX is near 0, TQI ≈ COMETKiwi. As MetricX grows, the multiplier decays exponentially toward 0 and pulls TQI down with it. TQI is our own metric, not an industry standard.

Top-level results (avg TQI across all 6 languages)

| Rank | Model | Avg TQI |
|------|-------|---------|
| #1 | TranslateGemma-12b | 0.6335 |
| #2 | gemini-3.1-flash-lite-preview | 0.5981 |
| #3 | deepseek-v3.2 | 0.5946 |
| #4 | claude-sonnet-4-6 | 0.5811 |
| #5 | gpt-5.4-mini | 0.5785 |
| #6 | gpt-5.4-nano | 0.5562 |

All models sit between 0.75 and 0.79 on COMETKiwi (fluency). Models diverge significantly on MetricX-24 fidelity scores - that's where the TQI separation comes from.

A few things worth discussing:

1. Metric-model affinity concern

One caveat: MetricX-24 is a Google metric and TranslateGemma is a Google model. COMETKiwi - from Unbabel - shows a noticeably smaller gap between TranslateGemma and the field. The direction of the result holds either way, but the size of the lead may be partially inflated by metric-model affinity.

2. Claude collapses in Japanese

claude-sonnet-4-6 ranked last (#6) in Japanese - MetricX 3.90, its worst result across all languages. Its COMETKiwi (0.79) was decent. Classic fluency-fidelity mismatch: output that sounds natural but drifts from source meaning.

3. Gemini Flash Lite outperforms full-sized frontier models

A "lite" model consistently ranked #2-3, beating Claude Sonnet and both GPT-5.4 variants across most languages.

4. TranslateGemma ranked #1 - then human QA found something the metrics had missed entirely

TranslateGemma topped every language. When our linguists reviewed the Traditional Chinese (zh-TW) output, they found the model was outputting Simplified Chinese for both the zh-CN and zh-TW language codes. We then investigated community reports suggesting zh-Hant as the correct explicit tag for Traditional Chinese and retested with it. Result: 76% of segments still came back Simplified, 14% Traditional, 10% ambiguous (segments too short or script-neutral to classify).

https://preview.redd.it/h6gfrd0ew4vg1.jpg?width=773&format=pjpg&auto=webp&s=fbe0afae3831528440b956167456e94004bcbe09

MetricX-24 and COMETKiwi scored both outputs identically and highly - no indication of a problem from either metric.

As it turns out, this is a confirmed, publicly documented issue caused by training data bias: TranslateGemma's fine-tuning corpus is heavily skewed toward Simplified Chinese. The locale tags are accepted without error but not honored by the model's weights. This affects all model sizes (4B, 12B, 27B) - upgrading to a larger model won't fix it, since the root cause is training data composition, not capacity.

A workaround exists (OpenCC s2twp post-processing), but standard QE metrics will look fine the whole time - that's exactly the problem for any pipeline relying on automated validation.

submitted by /u/ritis88
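To make the validation gap described in the post concrete, here is a minimal Python sketch that computes TQI as the post defines it and runs a crude script check next to it. This is not the authors' tooling. The character sets, the function names (`classify_script`, `script_shares`, `validate`), the thresholds, and the example scores are all assumptions made for illustration. In particular, the fifteen character pairs are a tiny sample, and a real check would need a full Simplified/Traditional mapping.

```python
import math
from collections import Counter

# Illustrative sample only: characters whose Simplified and Traditional forms
# differ. A real check needs a full mapping, not these fifteen pairs.
SIMPLIFIED_ONLY = set("们这说时书东车门问语国会来对为")
TRADITIONAL_ONLY = set("們這說時書東車門問語國會來對為")

LABELS = ("simplified", "traditional", "ambiguous")


def tqi(comet_kiwi: float, metricx: float) -> float:
    """Combined score from the post: TQI = COMETKiwi * exp(-MetricX / 10)."""
    return comet_kiwi * math.exp(-metricx / 10)


def classify_script(segment: str) -> str:
    """Label a segment by which script-specific characters dominate."""
    simp = sum(ch in SIMPLIFIED_ONLY for ch in segment)
    trad = sum(ch in TRADITIONAL_ONLY for ch in segment)
    if simp > trad:
        return "simplified"
    if trad > simp:
        return "traditional"
    return "ambiguous"  # too short, script-neutral, or evenly mixed


def script_shares(segments: list[str]) -> dict[str, float]:
    """Fraction of segments falling under each label."""
    counts = Counter(classify_script(s) for s in segments)
    total = len(segments) or 1  # avoid division by zero on empty input
    return {label: counts[label] / total for label in LABELS}


def validate(comet_kiwi: float, metricx: float, segments: list[str],
             expected_script: str, min_tqi: float = 0.5,
             min_share: float = 0.9) -> dict:
    """Hypothetical gate: pass only if both the score and the script look right."""
    score = tqi(comet_kiwi, metricx)
    share = script_shares(segments)[expected_script]
    return {
        "tqi": score,
        "script_share": share,
        "passed": score >= min_tqi and share >= min_share,
    }


if __name__ == "__main__":
    # Placeholder scores and segments, not data from the benchmark.
    output = ["我们这里说中文", "他们来了", "好"]
    print(script_shares(output))
    print(validate(0.79, 2.0, output, expected_script="traditional"))
```

Plugging in the post's Japanese figures for claude-sonnet-4-6 (COMETKiwi 0.79, MetricX 3.90) gives a TQI of roughly 0.53. In the placeholder run above, the TQI threshold alone would pass the output, while the script share for "traditional" is zero and the gate fails. A check of this kind only detects the problem; the post's suggested fix is OpenCC s2twp post-processing.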
Unbabel uses a tiered pricing model. Visit their website for current pricing details.
Key features include: Customer portal, Editor interface. The company also states that it takes security seriously.
Unbabel is commonly used for: Customer support in multiple languages, Localizing marketing content for global audiences, Translating product documentation and manuals, Facilitating multilingual communication in remote teams, Enhancing user experience on international websites, Improving accessibility for non-English speaking customers.
Unbabel integrates with: Zendesk, Salesforce, Slack, Shopify, WordPress, HubSpot, Microsoft Teams, Jira, Intercom, Google Workspace.

Agent Workspace for Zendesk
Mar 14, 2022