About
🤔 Problem: Dialect differences cause performance issues for many types of users of language technologies. If we want fair, inclusive, and equitable NLP, our systems need to be dialect invariant: performance should be constant over dialect shifts.
💡 Solution: Multi-VALUE is a suite of resources for evaluating and achieving English dialect invariance. It contains tools for systematically modifying written text in accordance with 189 attested linguistic patterns from 50 varieties of English. Researchers can use this to (1) build dialect stress tests and (2) train more robust models using Multi-VALUE as data augmentation.
🧪 Experiments: You can reproduce experiments showing significant performance disparities in dialectal question answering (QA), machine translation (MT), and semantic parsing tasks. To close these gaps, you can start by training on synthetic dialect data (see the augmentation sketch after the CoQA table below).
Attribution: Multi-VALUE was designed and built at the SALT Lab 🧂 at Stanford University 🌲 and the Georgia Institute of Technology 🐝 and was supported by Amazon Fairness in AI. This resource draws heavily from the Electronic World Atlas of Varieties of English (eWAVE), which aggregates the work of over 80 field linguists.
What Multi-VALUE can do for you
In short, Multi-VALUE can produce synthetic forms of dialectal text with modular control over which linguistic features are expressed. This allows you to systematically isolate and measure the effects of specific syntactic and morphological structures on the performance of English NLP systems (a toy sketch of this modular control follows the list below). Research applications include:
📐 Benchmarking: ML researchers can more comprehensively evaluate task performance across domain shifts.
⚖️ Bias and Fairness: Fairness, accountability and transparency researchers can more directly examine the ways NLP systems systematically harm disadvantaged or protected groups.
🌏 Linguistic Typology: Computational linguists can systematically probe the internal representations of large language models against the theoretical and field linguistics literature.
🌱 Low-resource NLP: Practitioners can adapt models to low-resource dialects.
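To make "modular control" concrete, here is a toy illustration of the idea (this is not the Multi-VALUE API): each attested pattern is an independent rewrite rule, and a pseudo-dialect is simply the subset of rules you switch on.

```python
import re

# Toy illustration of modular feature control (not the Multi-VALUE API):
# each attested pattern is an independent rewrite rule, and a pseudo-dialect
# is defined by the subset of rules that are enabled.
RULES = {
    # Zero past-tense marking (toy version: handles a single verb).
    "bare_past_tense": lambda s: re.sub(r"\btalked\b", "talk", s),
    # Copula deletion: "she is nice" -> "she nice".
    "copula_deletion": lambda s: re.sub(r"\b(?:is|are)\s+", "", s),
}

def transform(sentence: str, features: list[str]) -> str:
    """Apply only the selected feature rules, in order."""
    for feature in features:
        sentence = RULES[feature](sentence)
    return sentence

print(transform("I talked with them yesterday.", ["bare_past_tense"]))
# -> I talk with them yesterday.
```

Because each feature can be toggled independently, you can hold everything else constant and attribute a performance change to a single structure.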
Getting Started
Use the demo below to get started with Multi-VALUE.
Demo
[VECTOR DIALECT] Hong Kong English (abbr: HKE)
Region: South and Southeast Asia
Latitude: 22.26
Longitude: 114.25
Transformed sentence: 'I talk with them yesterday.'
Applied rules: {(2, 7): {'value': 'talk', 'type': 'bare_past_tense'}}
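The rule trace suggests the input was 'I talked with them yesterday.': the character span (2, 7) ('talked') is rewritten to 'talk' under the attested bare past tense pattern. Below is a minimal sketch of reproducing this programmatically; the module, class, and attribute names (multivalue, Dialects.HongKongDialect, transform, executed_rules) are assumptions inferred from the demo output, so check the released code for the exact API.

```python
# Sketch only: names below are assumptions inferred from the demo output,
# not a confirmed API; consult the Multi-VALUE repository before use.
from multivalue import Dialects  # assumed module layout

hke = Dialects.HongKongDialect()  # assumed class for the HKE feature vector
print(hke.transform("I talked with them yesterday."))
# expected: 'I talk with them yesterday.'
print(hke.executed_rules)  # assumed trace of applied rules
# expected: {(2, 7): {'value': 'talk', 'type': 'bare_past_tense'}}
```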
Important Considerations
Limitations
🤔 Multi-VALUE does not cover orthographic (writing) or lexical variation; researchers can draw such variation from corpus data.
🤔 Multi-VALUE covers only what linguists have observed frequently enough to document, and the catalogue is incomplete. Speech and communication can vary in myriad forms not captured by this resource.
Additional Considerations
Please keep the following points in mind when using Multi-VALUE:
📐 Benchmarking: Dialects are not fixed or deterministic – they are living elements of the communities which speak them.
⚖️ Bias and Fairness: Synthetic transformations are designed for stress-testing NLP systems and not for impersonating spoken or written dialect.
Data Agreement
I will not use Multi-VALUE for malicious purposes, including but not limited to: deception, impersonation, mockery, discrimination, hate speech, targeted harassment, and cultural appropriation. In my use of this resource, I will respect the dignity and privacy of all people.
BibTeX
@inproceedings{ziems-etal-2022-multi-value,
    title = "Multi-VALUE: Evaluating Cross-dialectal NLP",
    author = "Ziems, Caleb and
      Held, Will and
      Yang, Jingfeng and
      Yang, Diyi",
    year = "2022"
}
CoQA Experiments
CoQA performance by training set (rows) and test dialect (columns). Values in parentheses are relative performance changes. Dialect abbreviations: SAE = Standard American English, AppE = Appalachian English, ChcE = Chicano English, CollSgE = Colloquial Singapore English, IndE = Indian English, UAAVE = Urban African American Vernacular English.

| Base Model | Train Set | SAE | AppE | ChcE | CollSgE | IndE | UAAVE | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT Base | SAE | 77.2 | 77.4 (-3.8%) | 76.6 (-0.7%) | 61.5 (-25.4%) | 70.8 (-9.0%) | 71.2 (-8.4%) | 71.9 (-7.3%) |
| | AppE | 76.3 (-1.1%) | 76.4 (-1.0%) | 76.1 (-1.4%) | 64.7 (-19.3%) | 72.8 (-6.0%) | 73.2 (-5.4%) | 73.3 (-5.3%) |
| | ChcE | 76.8 | 74.7 (-3.3%) | 76.5 (-0.8%) | 63.6 (-21.3%) | 71.6 (-7.8%) | 71.4 (-8.1%) | 72.4 (-6.5%) |
| | CollSgE | 75.7 (-1.9%) | 74.1 (-4.2%) | 75.5 (-2.2%) | 74.7 (-3.3%) | 73.6 (-4.8%) | 73.4 (-5.1%) | 74.5 (-3.6%) |
| | IndE | 76.0 (-1.5%) | 75.4 (-2.4%) | 75.7 (-2.0%) | 63.2 (-22.0%) | 75.1 (-2.7%) | 74.1 (-4.1%) | 73.3 (-5.3%) |
| | UAAVE | 76.1 (-1.4%) | 75.6 (-2.0%) | 76.0 (-1.5%) | 64.6 (-19.5%) | 74.5 (-3.6%) | 75.3 (-2.5%) | 73.7 (-4.7%) |
| | Multi | 76.2 (-1.2%) | 75.6 (-2.0%) | 76.1 (-1.3%) | 73.7 (-4.7%) | 74.9 (-3.1%) | 75.1 (-2.7%) | 75.3 (-2.5%) |
| | In-Dialect | 77.2 | 76.4 (-1.0%) | 76.5 (-0.8%) | 74.7 (-3.3%) | 75.1 (-2.7%) | 75.3 (-2.5%) | 75.9 (-1.7%) |
| RoBERTa Base | SAE | 81.8 | 79.1 (-3.4%) | 81.5 (-0.3%) | 68.8 (-18.9%) | 76.1 (-7.5%) | 76.6 (-6.7%) | 77.3 (-5.8%) |
| | AppE | 82.0 (+0.3%) | 81.8 | 81.8 | 71.2 (-14.9%) | 79.0 (-3.5%) | 79.6 (-2.8%) | 79.2 (-3.2%) |
| | ChcE | 81.7 (-0.1%) | 79.3 (-3.1%) | 81.5 (-0.4%) | 68.8 (-18.9%) | 76.5 (-7.0%) | 77.3 (-5.9%) | 77.5 (-5.5%) |
| | CollSgE | 81.5 (-0.4%) | 80.1 (-2.2%) | 81.2 (-0.7%) | 80.2 (-2.0%) | 79.4 (-3.0%) | 78.7 (-3.9%) | 80.2 (-2.0%) |
| | IndE | 81.1 (-0.8%) | 80.5 (-1.5%) | 80.9 (-1.1%) | 67.2 (-21.7%) | 80.3 (-1.9%) | 79.2 (-3.3%) | 78.2 (-4.6%) |
| | UAAVE | 81.6 (-0.2%) | 81.1 (-0.9%) | 81.5 (-0.3%) | 69.2 (-18.2%) | 79.6 (-2.7%) | 81.1 (-0.9%) | 79.0 (-3.5%) |
| | Multi | 80.6 (-1.5%) | 80.4 (-1.7%) | 80.5 (-1.6%) | 78.5 (-4.2%) | 79.7 (-2.7%) | 80.0 (-2.2%) | 80.0 (-2.3%) |
| | In-Dialect | 81.8 | 81.8 | 81.5 (-0.4%) | 80.2 (-2.0%) | 80.3 (-1.9%) | 81.1 (-0.9%) | 81.1 (-0.9%) |
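The Multi rows above pool synthetic training data from all of the pseudo-dialects, which recovers most of the in-dialect performance. Below is a minimal sketch of that augmentation strategy, reusing the assumed transform API from the demo section; the dialect class names here are likewise assumptions, not the confirmed API.

```python
# Hypothetical sketch of multi-dialect data augmentation (the "Multi" rows).
# All dialect class names are assumptions; substitute whatever the released
# multivalue package actually exposes.
from multivalue import Dialects

PSEUDO_DIALECTS = [
    Dialects.AppalachianDialect(),              # AppE (assumed name)
    Dialects.ChicanoDialect(),                  # ChcE (assumed name)
    Dialects.ColloquialSingaporeDialect(),      # CollSgE (assumed name)
    Dialects.IndianDialect(),                   # IndE (assumed name)
    Dialects.UrbanAfricanAmericanVernacular(),  # UAAVE (assumed name)
]

def augment(examples):
    """Yield each (text, label) pair once in SAE and once per pseudo-dialect."""
    for text, label in examples:
        yield text, label  # keep the original SAE example
        for dialect in PSEUDO_DIALECTS:
            yield dialect.transform(text), label

train_set = list(augment([("I talked with them yesterday.", "past")]))
```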