Multi-VALUE:
  A toolkit for Cross-Dialectal NLP

About

🤔   Problem: Dialect differences cause performance degradation for many users of language technologies. If we want fair, inclusive, and equitable NLP, our systems need to be dialect invariant: performance should remain constant across dialect shifts.

💡   Solution: Multi-VALUE is a suite of resources for evaluating and achieving English dialect invariance. It contains tools for systematically modifying written text in accordance with 189 attested linguistic patterns from 50 varieties of English. Researchers can use this to (1) build dialect stress tests and (2) train more robust models using Multi-VALUE for data augmentation.

🧪   Experiments: You can reproduce experiments showing significant performance disparities on dialect versions of question answering (CoQA), machine translation (WMT19), and semantic parsing (Spider) tasks. To close these gaps, you can start by training on synthetic dialect data, as sketched below.
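For a concrete picture of that augmentation step, here is a minimal sketch in Python. It assumes the value package shown in the demo below; the toy training pairs are illustrative, and IndianDialect is an assumed class name analogous to the demo's HongKongDialect.

import value

# Toy (text, label) training pairs; substitute your own dataset.
train_set = [("I talked with them yesterday", "past")]

# HongKongDialect appears in the demo below; IndianDialect is an
# assumed analogous class name, used here only for illustration.
dialects = [value.HongKongDialect(), value.IndianDialect()]

# Mix dialect-transformed copies back into the training data.
# Labels are preserved: the transformations change surface form only.
augmented = list(train_set)
for text, label in train_set:
    for dialect in dialects:
        augmented.append((dialect.transform(text), label))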


Attribution: Multi-VALUE was designed and built at the SALT Lab 🧂 at Stanford University 🌲 and the Georgia Institute of Technology 🐝 and was supported by Amazon Fairness in AI. This resource draws heavily from the Electronic World Atlas of Varieties of English (eWAVE), which aggregates the work of over 80 field linguists.

What Multi-VALUE can do for you

In short, Multi-VALUE can produce synthetic forms of dialectal text with modular control over which linguistic features are expressed. This allows you to systematically isolate and measure the effects of specific syntactic and morphological structures on the performance of English NLP systems (see the sketch after this list). Research applications include:


📐   Benchmarking: ML researchers can more comprehensively evaluate task performance across domain shifts.


⚖️   Bias and Fairness: Fairness, accountability and transparency researchers can more directly examine the ways NLP systems systematically harm disadvantaged or protected groups.


🌏   Linguistic Typology: Computational linguists can systematically probe the internal representations of large language models in light of the theoretical and field linguistics literature.


🌱   Low-resource NLP: Practitioners can adapt models to low-resource dialects.
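As a sketch of the modular feature control mentioned above: this page does not document a constructor for custom feature sets, so value.DialectFromFeatureList and its argument are assumptions; the feature name 'bare_past_tense' is the rule type that executed_rules() reports in the demo below.

import value

# Hypothetical constructor (not documented on this page): build a
# synthetic variety expressing only the chosen features, so the effect
# of each pattern on task performance can be isolated.
probe = value.DialectFromFeatureList(["bare_past_tense"])
print(probe.transform("I talked with them yesterday"))
# Expected, if only past-tense unmarking applies: 'I talk with them yesterday.'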

Getting Started

Use the following demo to get started with Multi-VALUE.

Demo

>>> import value                          # load the Multi-VALUE toolkit
>>> hke = value.HongKongDialect()         # instantiate a dialect transformer
>>> print(hke)
[VECTOR DIALECT] Hong Kong English (abbr: HKE)
      Region: South and Southeast Asia
      Latitude: 22.26
      Longitude: 114.25

>>> hke.transform("I talked with them yesterday")   # rewrite SAE text in this dialect
'I talk with them yesterday.'

>>> hke.executed_rules()                  # rules that fired, keyed by character span
{(2,7): {'value': 'talk', 'type': 'bare_past_tense'}}
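
Building on this demo, a dialect stress test applies the same transform across an entire evaluation set and compares scores. In the sketch below, the toy examples and model.predict are placeholders for your own benchmark and inference harness.

import value

hke = value.HongKongDialect()

# Toy labeled examples; substitute your benchmark's evaluation split.
eval_set = [("I talked with them yesterday", "past")]
stress_set = [(hke.transform(text), label) for text, label in eval_set]

def accuracy(model, examples):
    # model.predict is a placeholder for your task-specific inference call.
    return sum(model.predict(text) == label for text, label in examples) / len(examples)

# The gap between the two scores is the model's dialect disparity:
#   accuracy(model, eval_set) - accuracy(model, stress_set)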

Important Considerations

Limitations

🤔   Multi-VALUE does not cover orthographic (writing) or lexical variation; researchers can draw such variation from corpus data.


🤔   Multi-VALUE covers only what linguists have observed frequently enough to document, and the catalogue is incomplete. Speech and communication can vary in myriad forms not captured by this resource.


Additional Considerations

Please keep the following points in mind when using Multi-VALUE:


📐   Benchmarking: Dialects are not fixed or deterministic; they are living elements of the communities that speak them.


⚖️   Bias and Fairness: Synthetic transformations are designed for stress-testing NLP systems and not for impersonating spoken or written dialect.


Data Agreement

I will not use Multi-VALUE for malicious purposes, including but not limited to: deception, impersonation, mockery, discrimination, hate speech, targeted harassment, and cultural appropriation. In my use of this resource, I will respect the dignity and privacy of all people.

BibTeX



@inproceedings{ziems-etal-2022-multi-value,
    title = "Multi-VALUE: Evaluating Cross-dialectal NLP",
    author = "Ziems, Caleb and
        Held, Will and
        Yang, Jingfeng and
        Yang, Diyi"
}


Experiments

We find that current NLP systems show significant discrepancies on dialect versions of popular benchmarks, including CoQA (conversational question answering), Spider (semantic parsing), and WMT19 (machine translation). We report some of these discrepancies here.

CoQA Experiments

CoQA results by training set (rows) and test dialect (columns); parenthesized values are relative performance changes. Abbreviations: SAE = Standard American English, AppE = Appalachian English, ChcE = Chicano English, CollSgE = Colloquial Singapore English, IndE = Indian English, UAAVE = Urban African American Vernacular English.

Model          Train Set    SAE            AppE           ChcE           CollSgE        IndE           UAAVE          Average
BERT Base      SAE          77.2           77.4 (-3.8%)   76.6 (-0.7%)   61.5 (-25.4%)  70.8 (-9%)     71.2 (-8.4%)   71.9 (-7.3%)
               AppE         76.3 (-1.1%)   76.4 (-1%)     76.1 (-1.4%)   64.7 (-19.3%)  72.8 (-6%)     73.2 (-5.4%)   73.3 (-5.3%)
               ChcE         76.8           74.7 (-3.3%)   76.5 (-0.8%)   63.6 (-21.3%)  71.6 (-7.8%)   71.4 (-8.1%)   72.4 (-6.5%)
               CollSgE      75.7 (-1.9%)   74.1 (-4.2%)   75.5 (-2.2%)   74.7 (-3.3%)   73.6 (-4.8%)   73.4 (-5.1%)   74.5 (-3.6%)
               IndE         76.0 (-1.5%)   75.4 (-2.4%)   75.7 (-2%)     63.2 (-22%)    75.1 (-2.7%)   74.1 (-4.1%)   73.3 (-5.3%)
               UAAVE        76.1 (-1.4%)   75.6 (-2%)     76.0 (-1.5%)   64.6 (-19.5%)  74.5 (-3.6%)   75.3 (-2.5%)   73.7 (-4.7%)
               Multi        76.2 (-1.2%)   75.6 (-2%)     76.1 (-1.3%)   73.7 (-4.7%)   74.9 (-3.1%)   75.1 (-2.7%)   75.3 (-2.5%)
               In-Dialect   77.2           76.4 (-1%)     76.5 (-0.8%)   74.7 (-3.3%)   75.1 (-2.7%)   75.3 (-2.5%)   75.9 (-1.7%)
RoBERTa Base   SAE          81.8           79.1 (-3.4%)   81.5 (-0.3%)   68.8 (-18.9%)  76.1 (-7.5%)   76.6 (-6.7%)   77.3 (-5.8%)
               AppE         82.0 (0.3%)    81.8           81.8           71.2 (-14.9%)  79.0 (-3.5%)   79.6 (-2.8%)   79.2 (-3.2%)
               ChcE         81.7 (-0.1%)   79.3 (-3.1%)   81.5 (-0.4%)   68.8 (-18.9%)  76.5 (-7%)     77.3 (-5.9%)   77.5 (-5.5%)
               CollSgE      81.5 (-0.4%)   80.1 (-2.2%)   81.2 (-0.7%)   80.2 (-2%)     79.4 (-3%)     78.7 (-3.9%)   80.2 (-2%)
               IndE         81.1 (-0.8%)   80.5 (-1.5%)   80.9 (-1.1%)   67.2 (-21.7%)  80.3 (-1.9%)   79.2 (-3.3%)   78.2 (-4.6%)
               UAAVE        81.6 (-0.2%)   81.1 (-0.9%)   81.5 (-0.3%)   69.2 (-18.2%)  79.6 (-2.7%)   81.1 (-0.9%)   79.0 (-3.5%)
               Multi        80.6 (-1.5%)   80.4 (-1.7%)   80.5 (-1.6%)   78.5 (-4.2%)   79.7 (-2.7%)   80.0 (-2.2%)   80.0 (-2.3%)
               In-Dialect   81.8           81.8           81.5 (-0.4%)   80.2 (-2%)     80.3 (-1.9%)   81.1 (-0.9%)   81.1 (-0.9%)