Project 002

Korean ABSA Benchmark

End-to-end benchmark of three approaches — Classical ML, Transformer, and LLM — for aspect-based sentiment analysis on Korean restaurant reviews covering FOOD, PRICE, SERVICE, and AMBIENCE.

0.94 Best Mention F1
3 Models Tested
4 Aspects Detected
127M Best Model Params

Model Benchmark Results

TF-IDF + LR

Classical ML lightweight baseline (~192K params)

Mention F1: 0.91
Sentiment F1: 0.51
Inference Time: 0.05s
Disk Size: 2.36 MB
Strength: Fastest

A two-stage pipeline using character n-gram TF-IDF vectorization and multi-output Logistic Regression. Over 20× faster than KcELECTRA and uses less than 0.5% of its disk space, making it ideal for low-resource or real-time deployments.
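The baseline described above can be sketched as follows. This is a minimal illustration, not the project's exact code: the training examples, label values, and hyperparameters are assumptions; only the overall shape (character n-gram TF-IDF feeding a multi-output Logistic Regression) comes from the description.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

ASPECTS = ["FOOD", "PRICE", "SERVICE", "AMBIENCE"]

# Character n-grams cope well with Korean agglutination and spacing noise.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))

# One classification head per aspect: 0 = Not Mentioned, 1 = Negative, 2 = Positive.
model = make_pipeline(
    vectorizer,
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)

# Toy training data (illustrative only); one label per aspect, in ASPECTS order.
train_texts = [
    "음식은 맛있는데 서비스가 최악이에요",   # food positive, service negative
    "가격이 너무 비싸요",                    # price negative
    "분위기가 아늑해요",                     # ambience positive
]
train_labels = [
    [2, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 2],
]

model.fit(train_texts, train_labels)
pred = model.predict(["분위기가 좋아요"])[0]  # vector of 4 aspect labels
```

`MultiOutputClassifier` fits one independent Logistic Regression per aspect over the shared TF-IDF features, which is what keeps the whole model under a few MB.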

Qwen 2.5 LLM

7B-parameter LLM via Ollama — few-shot prompting

Mention F1: 0.81
Sentiment F1: 0.44
Inference Time: 1801s
Disk Size: 4700 MB
Strength: Hard Cases

Underperforms on standard metrics in its few-shot setting, but shows clear advantages on complex multi-aspect and sarcastic reviews. Fine-tuning could close the gap significantly. Requires no training at all, so training time is zero.
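A few-shot prompt for this setting can be assembled as below. The exact prompt wording, examples, and output format used in the project are not shown here, so everything in this sketch is an assumption; the constructed string would then be sent to Qwen 2.5 through Ollama's chat API.

```python
# Illustrative few-shot examples (not the project's actual prompt).
FEW_SHOT = [
    ("음식은 맛있는데 서비스가 최악이에요", "FOOD=Positive; SERVICE=Negative"),
    ("가격 대비 괜찮아요", "PRICE=Positive"),
]

def build_prompt(review: str) -> str:
    """Assemble a few-shot ABSA prompt for one Korean restaurant review."""
    lines = [
        "Aspects: FOOD, PRICE, SERVICE, AMBIENCE.",
        "For each aspect mentioned, output ASPECT=Positive or ASPECT=Negative.",
    ]
    for text, labels in FEW_SHOT:
        lines.append(f"Review: {text}\nLabels: {labels}")
    lines.append(f"Review: {review}\nLabels:")
    return "\n\n".join(lines)

prompt = build_prompt("분위기는 좋은데 좀 비싸요")
# The prompt would then go to the local model, e.g. via the ollama client:
#   ollama.chat(model="qwen2.5:7b", messages=[{"role": "user", "content": prompt}])
```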

Benchmark Scope

Each model is evaluated on per-aspect F1 scores, mention F1, sentiment F1, inference time, model size, and parameter count. The dataset was split 70% train / 15% validation / 15% test using MultilabelStratifiedShuffleSplit to preserve aspect label distribution.
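The 70/15/15 ratios can be illustrated with a two-stage split. Note the project used `MultilabelStratifiedShuffleSplit` from the iterative-stratification package to preserve aspect label distribution; the plain `train_test_split` below is a simplified stand-in (no multilabel stratification) showing only how the ratios compose.

```python
from sklearn.model_selection import train_test_split

texts = [f"review {i}" for i in range(100)]  # placeholder data

# Stage 1: hold out 15 of 100 samples for the test set.
train_val, test = train_test_split(texts, test_size=15, random_state=42)

# Stage 2: carve another 15 samples (15% of the original total) for validation.
train, val = train_test_split(train_val, test_size=15, random_state=42)
# Result: 70 train / 15 validation / 15 test.
```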

Per-Aspect F1 Scores

Aspect TF-IDF + LR KcELECTRA Qwen 2.5
FOOD 0.787 0.876 0.540
PRICE 0.788 0.818 0.690
SERVICE 0.783 0.891 0.770
AMBIENCE 0.719 0.780 0.522

Key Insight

KcELECTRA leads across all four aspects. Qwen 2.5 struggles most with FOOD and AMBIENCE, suggesting that few-shot prompting without fine-tuning is insufficient for fine-grained aspect detection in Korean.

Efficiency & Model Complexity

Model Parameters Size on Disk Training Time Inference Time
TF-IDF + LR 192K 2.36 MB 1.01s 0.05s
KcELECTRA 127M 485.28 MB 146s 1.04s
Qwen 2.5 LLM 7B 4700 MB 0s (no training) 1801s

Hard Multi-Aspect Cases (29 Reviews)

Aspect TF-IDF + LR Errors KcELECTRA Errors Qwen 2.5 Errors
FOOD 12 6 8
PRICE 5 3 2
SERVICE 4 4 3
AMBIENCE 13 7 4

Hard Cases Takeaway

Qwen 2.5 performs best on hard multi-aspect and sarcastic reviews, suggesting its generative reasoning is valuable for complex implicit sentiment — even when overall metrics trail KcELECTRA.

Aspect Coverage

FOOD
PRICE
SERVICE
AMBIENCE

Why This Matters

Korean restaurant reviews often express multiple opinions in a single sentence, requiring models to cleanly separate aspect mentions before assigning the correct sentiment label. A sentence like "The food was delicious, but the service was terrible" must yield FOOD → Positive and SERVICE → Negative — not a single blended label.

Technical Stack

Transformers · KcELECTRA · Qwen 2.5 · TF-IDF + Logistic Regression · Python · scikit-learn · Ollama · Streamlit · Prompt Engineering · Korean NLP · M-ABSA Dataset · AdamW Optimizer

Dataset

No purpose-built Korean restaurant ABSA dataset existed, so the multilingual M-ABSA corpus was filtered, restructured, and relabeled for the four target aspects. Sentiment labels are encoded as 0 = Not Mentioned, 1 = Negative, 2 = Positive.
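The encoding above maps each review to one code per aspect. A minimal sketch (the helper name and example input are illustrative, not from the project):

```python
ASPECTS = ["FOOD", "PRICE", "SERVICE", "AMBIENCE"]
LABELS = {"not_mentioned": 0, "negative": 1, "positive": 2}

def encode(aspect_sentiments: dict) -> list:
    """Map {aspect: sentiment} to a fixed-order vector of 0/1/2 codes."""
    return [LABELS[aspect_sentiments.get(a, "not_mentioned")] for a in ASPECTS]

# "The food was delicious, but the service was terrible"
row = encode({"FOOD": "positive", "SERVICE": "negative"})
# → [2, 0, 1, 0]
```

A fixed-length vector per review is what lets all three models share one evaluation harness for mention F1 (code > 0) and sentiment F1 (1 vs 2).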

Key Takeaway

KcELECTRA offers the best overall performance and production viability. TF-IDF + LR is a strong lightweight fallback. Fine-tuning Qwen 2.5 on the ABSA task remains the most promising avenue for future improvement.

View Code on GitHub →