Falcon Twig: Fine-tuning Falcon H1 on Tool Calling
Abu Dhabi is making strides toward becoming the world's first fully AI-native government by 2027. A key player in this initiative is the Technology Innovation Institute (TII), the Abu Dhabi research center behind the Falcon models, which are central to advancing AI's role in government services.
In a recent project, I looked at how the Falcon models perform on the Berkeley Function Calling Leaderboard (BFCL). The older models scored poorly, and even the newer H1 models still leave room for improvement on tool calling.
I fine-tuned Falcon-H1-7B-Instruct with QLoRA to make it better at calling tools (functions). On the BFCL benchmark, my model ("Falcon Twig") underperformed the base model overall, but showed a small improvement on parallel calls. The main lessons: data quality matters a lot (especially synthetic parallel samples), prompts and stop tokens can silently block multi-call behavior, and strict schema formatting is easy to break in practice.
What this project is about
Many real apps need an LLM that can call tools: check the weather, run a search, manage a calendar, and so on. The goal here was simple: teach a compact, efficient model to call tools more reliably without huge compute.
Plain-English idea: Take a proven model (Falcon-H1-7B-Instruct), add small trainable adapters (QLoRA), feed it a mix of tool-calling and normal tasks, and train it to output valid JSON tool calls with correct arguments.
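To make that concrete, here is a minimal illustration of the kind of output the model is trained to produce. The <tool_call> tags match the template discussed later in this post; the function name, fields, and exact JSON layout are illustrative assumptions, not the project's exact schema.

```text
User: What's the weather in Abu Dhabi right now, in celsius?

Assistant:
<tool_call>
{"name": "get_current_weather", "arguments": {"location": "Abu Dhabi, UAE", "unit": "celsius"}}
</tool_call>
```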
What I built (simple view)
- Base model: Falcon-H1-7B-Instruct
- Method: QLoRA (low-rank adapters on a 4-bit base; a setup sketch follows this list)
- Data: a custom mix of function-calling datasets (single, multi, parallel) plus some non-tool tasks to reduce forgetting
- Eval: BFCL (Berkeley Function-Calling) with live tools and AST checks
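Below is a minimal sketch of what this setup looks like with the Hugging Face stack (transformers + peft + bitsandbytes). The model id follows TII's naming, but the LoRA hyperparameters and target modules are illustrative assumptions rather than the project's exact values.

```python
# A sketch only: QLoRA = 4-bit quantized base model + small trainable LoRA adapters.
# LoRA rank/alpha and target modules are illustrative assumptions (Falcon-H1's hybrid
# blocks may expose different module names).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/Falcon-H1-7B-Instruct"

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
base = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")
base = prepare_model_for_kbit_training(base)

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```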
Key configuration highlights
- Parameter-efficient tuning: LoRA
- Quantization: 4-bit (QLoRA)
- Optimizer: AdamW, cosine LR with warmup
- Batching: length-grouped, gradient accumulation
- Regularization: weight decay + gradient clipping
- Early stopping: on validation loss
- Optional: WiSE-FT-style weight interpolation for stability
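A minimal sketch of roughly this training configuration, assuming the Hugging Face Trainer API; every numeric value is a placeholder, not the project's exact hyperparameters, and `model`, `train_ds`, `val_ds` are assumed to exist already.

```python
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="falcon-twig-qlora",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,      # effective batch size of 32
    learning_rate=2e-4,
    optim="adamw_torch",                # AdamW
    lr_scheduler_type="cosine",         # cosine LR ...
    warmup_ratio=0.03,                  # ... with warmup
    weight_decay=0.01,
    max_grad_norm=1.0,                  # gradient clipping
    group_by_length=True,               # length-grouped batching
    eval_strategy="steps",              # "evaluation_strategy" in older transformers
    eval_steps=200,
    save_steps=200,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # early stop on val loss
)
trainer.train()
```

WiSE-FT-style interpolation, listed as optional above, means blending the fine-tuned weights back toward the base weights after training, θ = (1 − α)·θ_base + α·θ_ft, trading a little task-specific gain for stability.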
Data at a glance
I trained on a 78k-example mixture of verified and synthetic tool-calling data, plus no-tool instruction data to preserve general skills. The high-level split was roughly 60.6k train / 7.6k test / 7.6k validation; the sources, including a 7k synthetic parallel-call subset, are described in the report. In hindsight, some synthetic samples had format and coverage issues, which likely hurt results.
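For readers who want to reproduce the shape of this mixture, here is a rough sketch using the Hugging Face `datasets` library; the file names and the 80/10/10 split are placeholders standing in for the sources listed in the report.

```python
from datasets import load_dataset, concatenate_datasets

tool_single = load_dataset("json", data_files="single_calls.jsonl")["train"]
tool_parallel = load_dataset("json", data_files="synthetic_parallel_calls.jsonl")["train"]
no_tool = load_dataset("json", data_files="no_tool_instructions.jsonl")["train"]

# Assumes the three files share the same columns (e.g. a chat-formatted "messages" field).
mixture = concatenate_datasets([tool_single, tool_parallel, no_tool]).shuffle(seed=42)

split = mixture.train_test_split(test_size=0.2, seed=42)          # ~80% train
holdout = split["test"].train_test_split(test_size=0.5, seed=42)  # remaining 20% split in half
train_ds, test_ds, val_ds = split["train"], holdout["train"], holdout["test"]
```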
Results (what happened)
On BFCL, Falcon Twig scored lower than the base 7B model on most tracks, though it did relatively better on parallel tool-calling (still not great, just better than the base). Overall accuracy dropped, the simple and multi-call AST tracks suffered most, and many failures came down to schema or argument mistakes (for example, writing city instead of the required location, or omitting the unit argument).
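To illustrate the failure mode, here is a hypothetical pair; the function and values are invented, but the mistakes mirror the ones above.

```text
Expected by the AST checker:
  {"name": "get_current_weather", "arguments": {"location": "Berlin, Germany", "unit": "celsius"}}

Typical failing output (misnamed key, missing required argument):
  {"name": "get_current_weather", "arguments": {"city": "Berlin"}}
```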
Why things failed (most likely)
- Prompt and stops discouraged multiple calls. The system prompt used a single-call example, the temperature was clamped very low, and stop tokens could cut generation off after the first </tool_call>. Together, these settings nudged the model toward a single call even when two were required.
- Schema strictness. Small format errors break the AST checker: wrong field name, missing required arg, or wrong value normalization.
- Synthetic data issues. Some parallel samples were low quality or English-only, so the model learned fragile patterns that failed on live checks.
- Handler differences. Falcon-H1's tool-call format differs from OpenAI-style APIs, so the custom handler and its parsing logic added more places for subtle bugs.
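The parsing point is easiest to see in code. Here is a minimal sketch of a handler-side parser that collects every tool call rather than stopping at the first; the tag convention follows the </tool_call> format mentioned above, but the regex and error handling are illustrative, not the project's exact handler.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(generation: str) -> list[dict]:
    """Return every well-formed tool call found in the model output, in order."""
    calls = []
    for raw in TOOL_CALL_RE.findall(generation):
        try:
            calls.append(json.loads(raw))
        except json.JSONDecodeError:
            continue  # a stricter handler could log or try to repair malformed JSON
    return calls

# Note: if generation stops on "</tool_call>", only the first call ever reaches this
# function; stopping on the end-of-turn token instead keeps parallel calls intact.
```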
What I would do differently next time
Better data
- Curate synthetic parallel samples with stricter validators and multi-language coverage.
- Add near-miss hard negatives (almost-right arguments) to train calibration; a sketch follows this list.
- Mix in APIGen-style verified sets to raise the quality floor.
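As promised above, a rough sketch of how near-miss hard negatives could be generated by perturbing gold calls. The confusions (location → city, dropping a required argument) mirror the failures described earlier; the strategy itself is an assumption, not an existing pipeline.

```python
import copy
import random

WRONG_NAMES = {"location": "city", "unit": "units"}  # illustrative key confusions

def make_near_miss(gold_call: dict, required_args: list[str]) -> dict:
    """Return an almost-correct copy of a gold tool call to use as a hard negative."""
    bad = copy.deepcopy(gold_call)
    args = bad.get("arguments", {})
    droppable = [a for a in required_args if a in args]
    if droppable and random.random() < 0.5:
        args.pop(random.choice(droppable))                         # missing required argument
    elif args:
        key = random.choice(list(args))
        args[WRONG_NAMES.get(key, key + "_name")] = args.pop(key)  # misnamed argument
    return bad
```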
Better supervision
- Add multiple-call exemplars directly in the prompt and templates.
- Use format rewards (e.g., JSON diff penalties) or a small format critic to punish schema errors; a sketch follows this list.
- Try execution-in-the-loop fine-tuning or lightweight RL for argument correctness.
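A minimal sketch of the kind of format reward mentioned above: penalize missing required arguments and misnamed keys against an OpenAI-style tool schema ("parameters" with "properties" and "required"). The penalty weights are arbitrary illustrations, not a tuned reward function.

```python
def format_reward(call: dict, schema: dict) -> float:
    """Reward well-formed calls; penalize wrong names, missing or hallucinated arguments."""
    params = schema.get("parameters", {})
    props = params.get("properties", {})
    required = params.get("required", [])
    args = call.get("arguments", {})

    reward = 1.0
    if call.get("name") != schema.get("name"):
        reward -= 1.0                                           # wrong function entirely
    reward -= 0.5 * sum(1 for r in required if r not in args)   # missing required args
    reward -= 0.25 * sum(1 for k in args if k not in props)     # misnamed/extra args
    return max(reward, -1.0)
```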
Better decoding & infra
- Relax temperature slightly and remove single-call bias in examples.
- Avoid stop tokens that can clip the second tool call.
- Unit-test the handler with gold JSON snippets and edge cases.
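For that last point, here is a pytest-style sketch of what those tests could look like, reusing the parser sketched earlier; "falcon_twig_handler" is a hypothetical module name and the snippets are invented.

```python
from falcon_twig_handler import parse_tool_calls  # hypothetical module holding the parser

def test_single_call_parses():
    out = '<tool_call>{"name": "get_time", "arguments": {"timezone": "UTC"}}</tool_call>'
    assert parse_tool_calls(out) == [{"name": "get_time", "arguments": {"timezone": "UTC"}}]

def test_parallel_calls_are_not_clipped():
    out = (
        '<tool_call>{"name": "get_time", "arguments": {"timezone": "UTC"}}</tool_call>\n'
        '<tool_call>{"name": "get_time", "arguments": {"timezone": "Asia/Dubai"}}</tool_call>'
    )
    assert len(parse_tool_calls(out)) == 2

def test_malformed_json_is_skipped():
    out = '<tool_call>{"name": "get_time", "arguments":}</tool_call>'
    assert parse_tool_calls(out) == []
```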