Category: Dataset Curation
Timeline: (preliminary) February 2026, (official) mid-2026
Skills needed: Instruction Tuning, Data Collection and Curation, Benchmarking

Instruction tuning (also called supervised fine-tuning, or SFT) refines a pretrained model’s capabilities for specific tasks or languages. It is relatively cheap and straightforward, which is why most foundation model providers include instruction-tuned models in their releases.

Since most foundation model providers rely on open-source datasets to train their models, we can indirectly influence their development pipelines by contributing a high-quality instruction-tuning dataset to the open ecosystem. This effort also paves the way for training our own Filipino-centric language models.

What exactly are we trying to do?

Specifically, we will curate a high-quality instruction-tuning dataset for four to six of the most widely spoken Philippine languages: Tagalog, Cebuano (Bisaya), Hiligaynon, Ilokano, and Bikolano.

Ultimately, we want to answer the following question:

How does post-training data composition (synthetic, human-annotated, or web-crawled) affect LLM performance on FilBench under a low annotation budget?

By doing so, we aim to explore the following aspects:

  • Data sourcing and composition: Where can we find high-quality instruction data for Philippine languages? Should we prioritize synthetic generation, community platforms like Reddit, existing datasets like Aya, or a combination of these sources? And what mix of sources maximizes quality and diversity? (A minimal sampling sketch follows this list.)

  • Data efficiency: How much instruction-tuning data is needed for strong performance on Filipino NLP benchmarks such as FilBench? Can we identify the point of diminishing returns to guide efficient data collection? (See the second sketch below.)

  • Task relevance: Which tasks and capabilities are most valuable for Filipino-centric use cases? How can we ensure our instruction dataset covers the linguistic and cultural nuances that matter most to Filipino language users?
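On the composition question, here is a minimal, hypothetical sketch of how we might assemble candidate mixes: sample from each source pool in proportion to a mix weight, capped by a fixed annotation budget. The pools, weights, and budget below are illustrative placeholders, not project decisions.

```python
import random

# A minimal sketch (not the project's actual pipeline) of budget-constrained
# source mixing. The pools below are toy stand-ins for the real candidates:
# synthetic generations, Aya-style human annotations, and filtered web text.
BUDGET = 10_000  # total instruction-response pairs we can afford to curate

pools = {
    "synthetic": [{"instruction": f"syn-{i}", "source": "synthetic"} for i in range(50_000)],
    "human":     [{"instruction": f"hum-{i}", "source": "human"} for i in range(8_000)],
    "web":       [{"instruction": f"web-{i}", "source": "web"} for i in range(30_000)],
}

def compose(pools, mix, budget, seed=0):
    """Sample each pool in proportion to its mix weight, capped by pool size."""
    rng = random.Random(seed)
    dataset = []
    for source, weight in mix.items():
        n = min(int(budget * weight), len(pools[source]))
        dataset.extend(rng.sample(pools[source], n))
    rng.shuffle(dataset)  # keep "source" fields so ablations can slice by origin
    return dataset

# One candidate composition; the research question is which weighting
# maximizes FilBench performance at the same budget.
sft_set = compose(pools, {"synthetic": 0.5, "human": 0.3, "web": 0.2}, BUDGET)
print(len(sft_set))  # 10,000 examples
```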
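And on the efficiency question, a sketch of the sweep we have in mind, continuing from the snippet above: fine-tune on nested subsets of increasing size and look for where the benchmark curve flattens. Here `finetune` and `evaluate_filbench` are hypothetical stand-ins, not real APIs.

```python
import math

# Hypothetical stand-ins for an SFT run and a FilBench evaluation; the toy
# score mimics a saturating curve purely so the sketch runs end to end.
def finetune(examples):
    return {"n_train": len(examples)}  # stand-in for a fine-tuned model handle

def evaluate_filbench(model):
    return 1.0 - math.exp(-model["n_train"] / 5_000)  # placeholder score in [0, 1]

subset_sizes = [1_000, 2_000, 5_000, 10_000]  # nested subsets of sft_set above
scores = [evaluate_filbench(finetune(sft_set[:n])) for n in subset_sizes]

# Flag diminishing returns: the first step where extra data buys < 1 point.
for i in range(1, len(scores)):
    if scores[i] - scores[i - 1] < 0.01:
        print(f"Returns flatten between {subset_sizes[i - 1]} and {subset_sizes[i]} examples")
        break
```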

We have a loose timeline, but we plan to officially start the project in mid-2026. We don’t have a publication target yet (likely sometime in 2027); most of our findings will be shared through a technical report on arXiv. If you’re interested in contributing or collaborating, you can read the full research brief (and other materials) via the links below:

Document         Description
Research Brief   Research proposal.
Project Journal  Running log of project progress.
Google Drive     Shared folder containing other materials.

I want to help out!

Reach out to me (Lj) first! Although the official project will start in mid-2026, I plan to run some experiments as early as January. During that period, this will be a smaller effort than the benchmark project, and I prefer a focused team of three people, including me. I’m also happy to receive support in the form of compute credits and grants (if you know of any, please point them our way)!