End-to-end benchmark of three approaches (Classical ML, Transformer, and LLM) for aspect-based sentiment analysis on Korean restaurant reviews covering FOOD, PRICE, SERVICE, and AMBIENCE.

TF-IDF + LR
A two-stage pipeline using character n-gram TF-IDF vectorization and multi-output Logistic Regression. Over 20× faster than KcELECTRA and uses less than 0.5% of its disk space, making it ideal for low-resource or real-time deployments.
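A scikit-learn sketch of this two-stage pipeline; the n-gram range, solver settings, and toy examples below are illustrative assumptions, not the project's exact configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Stage 1: character n-gram TF-IDF (robust to Korean spacing variation).
# Stage 2: one Logistic Regression per aspect via MultiOutputClassifier.
pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
)

# Toy data: label columns are [FOOD, PRICE, SERVICE, AMBIENCE],
# encoded 0 = Not Mentioned, 1 = Negative, 2 = Positive.
texts = [
    "음식이 정말 맛있어요",              # food positive
    "가격이 너무 비싸요",                # price negative
    "직원이 친절하고 분위기가 좋아요",    # service and ambience positive
    "서비스가 최악이었어요",              # service negative
]
labels = [
    [2, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 1, 0],
]

pipeline.fit(texts, labels)
preds = pipeline.predict(["음식이 맛있어요"])
print(preds.shape)  # one row, four aspect columns: (1, 4)
```

Character n-grams (`char_wb`) sidestep Korean tokenization entirely, which is part of why the model stays so small and fast.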
KcELECTRA
The strongest overall performer, fine-tuned with a joint mention + sentiment head sharing a single [CLS] embedding. It leads across all four aspects and offers the best performance-to-cost ratio for production use.
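The joint head can be sketched in PyTorch; the head layout, hidden size, and dropout rate here are illustrative assumptions rather than the project's exact architecture, and a random tensor stands in for the encoder's [CLS] output:

```python
import torch
import torch.nn as nn

NUM_ASPECTS = 4     # FOOD, PRICE, SERVICE, AMBIENCE
NUM_SENTIMENTS = 3  # 0 = Not Mentioned, 1 = Negative, 2 = Positive
HIDDEN = 768        # typical base-model hidden size (assumption)

class JointABSAHead(nn.Module):
    """Mention and sentiment heads sharing one [CLS] embedding."""
    def __init__(self, hidden=HIDDEN):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        # Mention head: one logit per aspect (mentioned vs. not).
        self.mention = nn.Linear(hidden, NUM_ASPECTS)
        # Sentiment head: one 3-way distribution per aspect.
        self.sentiment = nn.Linear(hidden, NUM_ASPECTS * NUM_SENTIMENTS)

    def forward(self, cls_embedding):
        h = self.dropout(cls_embedding)
        mention_logits = self.mention(h)
        sentiment_logits = self.sentiment(h).view(-1, NUM_ASPECTS, NUM_SENTIMENTS)
        return mention_logits, sentiment_logits

# Stand-in for the encoder's [CLS] output (batch of 2 reviews).
cls = torch.randn(2, HIDDEN)
mention_logits, sentiment_logits = JointABSAHead()(cls)
print(mention_logits.shape, sentiment_logits.shape)
```

Sharing the [CLS] embedding lets both tasks regularize each other while adding only two small linear layers on top of the encoder.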
Qwen 2.5
Underperforms on standard metrics in its few-shot setting, but shows clear advantages on complex multi-aspect and sarcastic reviews. It requires no training at all, and fine-tuning could close the gap significantly.
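A few-shot setup along these lines might look as follows; the prompt wording, JSON output format, and helper names are assumptions for illustration, and no model is actually called here:

```python
import json

ASPECTS = ["FOOD", "PRICE", "SERVICE", "AMBIENCE"]

# One in-context example (hypothetical; the real prompt may differ).
FEW_SHOT = """\
Review: 음식은 맛있는데 서비스가 별로였어요
Labels: {"FOOD": "Positive", "PRICE": "Not Mentioned", "SERVICE": "Negative", "AMBIENCE": "Not Mentioned"}
"""

def build_prompt(review: str) -> str:
    """Few-shot prompt asking for one JSON object over the four aspects."""
    return (
        "Classify each aspect of this Korean restaurant review as "
        "Positive, Negative, or Not Mentioned. Answer with JSON only.\n\n"
        + FEW_SHOT
        + f"\nReview: {review}\nLabels:"
    )

def parse_response(text: str) -> dict:
    """Parse the model's JSON reply, tolerating surrounding text."""
    start, end = text.find("{"), text.rfind("}") + 1
    return json.loads(text[start:end])

# Mocked model reply, to show the parsing step.
reply = '{"FOOD": "Positive", "PRICE": "Not Mentioned", "SERVICE": "Negative", "AMBIENCE": "Not Mentioned"}'
labels = parse_response(reply)
print(labels["SERVICE"])  # Negative
```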
Each model is evaluated on per-aspect F1 scores, mention F1, sentiment F1, inference time, model size, and parameter count. The dataset was split 70% train / 15% validation / 15% test using MultilabelStratifiedShuffleSplit to preserve aspect label distribution.
Per-aspect F1 scores:

| Aspect | TF-IDF + LR | KcELECTRA | Qwen 2.5 |
|---|---|---|---|
| FOOD | 0.787 | 0.876 | 0.540 |
| PRICE | 0.788 | 0.818 | 0.690 |
| SERVICE | 0.783 | 0.891 | 0.770 |
| AMBIENCE | 0.719 | 0.780 | 0.522 |
KcELECTRA leads across all four aspects. Qwen 2.5 struggles most with FOOD and AMBIENCE, suggesting that few-shot prompting without fine-tuning is insufficient for fine-grained aspect detection in Korean.
| Model | Parameters | Size on Disk | Training Time | Inference Time |
|---|---|---|---|---|
| TF-IDF + LR | 192K | 2.36 MB | 1.01s | 0.05s |
| KcELECTRA | 127M | 485.28 MB | 146s | 1.04s |
| Qwen 2.5 LLM | 7B | 4700 MB | 0s (no training) | 1801s |
| Aspect | TF-IDF + LR Errors | KcELECTRA Errors | Qwen 2.5 Errors |
|---|---|---|---|
| FOOD | 12 | 6 | 8 |
| PRICE | 5 | 3 | 2 |
| SERVICE | 4 | 4 | 3 |
| AMBIENCE | 13 | 7 | 4 |
Qwen 2.5 performs best on hard multi-aspect and sarcastic reviews, suggesting its generative reasoning is valuable for complex implicit sentiment — even when overall metrics trail KcELECTRA.
Korean restaurant reviews often express multiple opinions in a single sentence, requiring models to cleanly separate aspect mentions before assigning the correct sentiment label. A sentence like "The food was delicious, but the service was terrible" must yield FOOD → Positive and SERVICE → Negative — not a single blended label.
No purpose-built Korean restaurant ABSA dataset existed, so the multilingual M-ABSA corpus was filtered, restructured, and relabeled for the four target aspects. Sentiment labels are encoded as 0 = Not Mentioned, 1 = Negative, 2 = Positive.
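Under that encoding, each review maps to one four-slot vector over FOOD, PRICE, SERVICE, and AMBIENCE; the helper below is an illustrative sketch (names are not from the project):

```python
ASPECTS = ["FOOD", "PRICE", "SERVICE", "AMBIENCE"]
NOT_MENTIONED, NEGATIVE, POSITIVE = 0, 1, 2

def encode_labels(opinions: dict) -> list:
    """Map {aspect: 'Positive'/'Negative'} to the 0/1/2 label vector."""
    code = {"Negative": NEGATIVE, "Positive": POSITIVE}
    return [code.get(opinions.get(a), NOT_MENTIONED) for a in ASPECTS]

# "The food was delicious, but the service was terrible"
vector = encode_labels({"FOOD": "Positive", "SERVICE": "Negative"})
print(vector)  # [2, 0, 1, 0]
```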
KcELECTRA offers the best overall performance and production viability. TF-IDF + LR is a strong lightweight fallback. Fine-tuning Qwen 2.5 on the ABSA task remains the most promising avenue for future improvement.