Project 002

Korean ABSA Benchmark

End-to-end benchmark of three approaches — Classical ML, Transformer, and LLM — for aspect-based sentiment analysis on Korean restaurant reviews covering FOOD, PRICE, SERVICE, and AMBIENCE.

0.94 Best Mention F1
3 Models Tested
4 Aspects Detected
127M Best Model Params

Model Benchmark Results

TF-IDF + LR

Classical ML lightweight baseline (~192K params)

Mention F1: 0.91
Sentiment F1: 0.51
Inference Time: 0.05s
Disk Size: 2.36 MB
Strength: Fastest

A two-stage pipeline using character n-gram TF-IDF vectorization and multi-output Logistic Regression. Over 20× faster than KcELECTRA and uses less than 0.5% of its disk space, making it ideal for low-resource or real-time deployments.
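The baseline described above can be sketched as follows. This is a minimal illustration, not the project's exact code: the training examples, label values, and hyperparameters are assumptions; only the overall shape (character n-gram TF-IDF feeding a multi-output Logistic Regression) comes from the description.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

ASPECTS = ["FOOD", "PRICE", "SERVICE", "AMBIENCE"]

# Character n-grams cope well with Korean agglutination and spacing noise.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

# One classification head per aspect: 0 = Not Mentioned, 1 = Negative, 2 = Positive.
model = make_pipeline(
    vectorizer,
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)

# Toy training data (illustrative only); one label per aspect, in ASPECTS order.
train_texts = [
    "음식은 맛있는데 서비스가 최악이에요",   # food positive, service negative
    "가격이 너무 비싸요",                    # price negative
    "분위기가 아늑해요",                     # ambience positive
]
train_labels = [
    [2, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 2],
]

model.fit(train_texts, train_labels)
pred = model.predict(["분위기가 좋아요"])[0]  # vector of 4 aspect labels
```

`MultiOutputClassifier` fits one independent Logistic Regression per aspect over the shared TF-IDF features, which is what keeps the whole model under a few MB.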

Qwen 2.5 LLM

7B-parameter LLM via Ollama — few-shot prompting

Mention F1: 0.81
Sentiment F1: 0.44
Inference Time: 1801s
Disk Size: 4700 MB
Strength: Hard Cases

Underperforms on standard metrics in its few-shot setting, but shows clear advantages on complex multi-aspect and sarcastic reviews. Fine-tuning could close the gap significantly. Requires no training at all, so training time is zero.
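A few-shot prompt for this setting can be assembled as below. The exact prompt wording, examples, and output format used in the project are not shown here, so everything in this sketch is an assumption; the constructed string would then be sent to Qwen 2.5 through Ollama's chat API.

```python
# Illustrative few-shot examples (not the project's actual prompt).
FEW_SHOT = [
    ("음식은 맛있는데 서비스가 최악이에요", "FOOD=Positive; SERVICE=Negative"),
    ("가격 대비 괜찮아요", "PRICE=Positive"),
]

def build_prompt(review: str) -> str:
    """Assemble a few-shot ABSA prompt for one Korean restaurant review."""
    lines = [
        "Aspects: FOOD, PRICE, SERVICE, AMBIENCE.",
        "For each aspect mentioned, output ASPECT=Positive or ASPECT=Negative.",
    ]
    for text, labels in FEW_SHOT:
        lines.append(f"Review: {text}\nLabels: {labels}")
    lines.append(f"Review: {review}\nLabels:")
    return "\n\n".join(lines)

prompt = build_prompt("분위기는 좋은데 좀 비싸요")
# The prompt would then go to the local model, e.g. via the ollama client:
#   ollama.chat(model="qwen2.5:7b", messages=[{"role": "user", "content": prompt}])
```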

Benchmark Scope

Each model is evaluated on per-aspect F1 scores, mention F1, sentiment F1, inference time, model size, and parameter count. The dataset was split 70% train / 15% validation / 15% test using MultilabelStratifiedShuffleSplit to preserve aspect label distribution.
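The 70/15/15 ratios can be illustrated with a two-stage split. Note the project used `MultilabelStratifiedShuffleSplit` from the iterative-stratification package to preserve aspect label distribution; the plain `train_test_split` below is a simplified stand-in (no multilabel stratification) showing only how the ratios compose.

```python
from sklearn.model_selection import train_test_split

texts = [f"review {i}" for i in range(100)]  # placeholder data

# Stage 1: hold out 15 of 100 samples for the test set.
train_val, test = train_test_split(texts, test_size=15, random_state=42)

# Stage 2: carve another 15 samples (15% of the original total) for validation.
train, val = train_test_split(train_val, test_size=15, random_state=42)
# Result: 70 train / 15 validation / 15 test.
```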

Per-Aspect F1 Scores

Aspect TF-IDF + LR KcELECTRA Qwen 2.5
FOOD 0.787 0.876 0.540
PRICE 0.788 0.818 0.690
SERVICE 0.783 0.891 0.770
AMBIENCE 0.719 0.780 0.522

Key Insight

KcELECTRA leads across all four aspects. Qwen 2.5 struggles most with FOOD and AMBIENCE, suggesting that few-shot prompting without fine-tuning is insufficient for fine-grained aspect detection in Korean.

Efficiency & Model Complexity

Model Parameters Size on Disk Training Time Inference Time
TF-IDF + LR 192K 2.36 MB 1.01s 0.05s
KcELECTRA 127M 485.28 MB 146s 1.04s
Qwen 2.5 LLM 7B 4700 MB 0s (no training) 1801s

Hard Multi-Aspect Cases (29 Reviews)

Aspect TF-IDF + LR Errors KcELECTRA Errors Qwen 2.5 Errors
FOOD 12 6 8
PRICE 5 3 2
SERVICE 4 4 3
AMBIENCE 13 7 4

Hard Cases Takeaway

Qwen 2.5 performs best on hard multi-aspect and sarcastic reviews, suggesting its generative reasoning is valuable for complex implicit sentiment — even when overall metrics trail KcELECTRA.

Aspect Coverage

FOOD
PRICE
SERVICE
AMBIENCE

Why This Matters

Korean restaurant reviews often express multiple opinions in a single sentence, requiring models to cleanly separate aspect mentions before assigning the correct sentiment label. A sentence like "The food was delicious, but the service was terrible" must yield FOOD → Positive and SERVICE → Negative — not a single blended label.

Technical Stack

Transformers · KcELECTRA · Qwen 2.5 · TF-IDF + Logistic Regression · Python · scikit-learn · Ollama · Streamlit · Prompt Engineering · Korean NLP · M-ABSA Dataset · AdamW Optimizer

Dataset

No purpose-built Korean restaurant ABSA dataset existed, so the multilingual M-ABSA corpus was filtered, restructured, and relabeled for the four target aspects. Sentiment labels are encoded as 0 = Not Mentioned, 1 = Negative, 2 = Positive.
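The encoding above maps each review to one code per aspect. A minimal sketch (the helper name and example input are illustrative, not from the project):

```python
ASPECTS = ["FOOD", "PRICE", "SERVICE", "AMBIENCE"]
LABELS = {"not_mentioned": 0, "negative": 1, "positive": 2}

def encode(aspect_sentiments: dict) -> list:
    """Map {aspect: sentiment} to a fixed-order vector of 0/1/2 codes."""
    return [LABELS[aspect_sentiments.get(a, "not_mentioned")] for a in ASPECTS]

# "The food was delicious, but the service was terrible"
row = encode({"FOOD": "positive", "SERVICE": "negative"})
# → [2, 0, 1, 0]
```

A fixed-length vector per review is what lets all three models share one evaluation harness for mention F1 (code > 0) and sentiment F1 (1 vs 2).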

Key Takeaway

KcELECTRA offers the best overall performance and production viability. TF-IDF + LR is a strong lightweight fallback. Fine-tuning Qwen 2.5 on the ABSA task remains the most promising avenue for future improvement.

View Code on GitHub →