Bridging the Gap in Financial AI: How PIXIU is Pushing LLMs Forward
The rapid advancement of Large Language Models (LLMs) has revolutionized many domains, yet finance remains underserved by specialized open-source models. The PIXIU project seeks to address this gap by introducing a comprehensive financial LLM, instruction-tuning datasets, and evaluation benchmarks. Unlike general-purpose models such as GPT-4, or the proprietary, finance-focused BloombergGPT, PIXIU is fully open and is designed to excel at financial tasks by leveraging domain-specific datasets and benchmarks.
Research Motivation
The research behind PIXIU is driven by three key questions:
- How can we develop efficient and openly available LLMs tailored for finance?
- How can we construct large-scale, high-quality financial instruction datasets?
- How can we build a holistic financial evaluation benchmark for assessing financial LLMs?
By addressing these questions, PIXIU aims to push forward the development of financial AI and promote open research in this critical domain.
Core Components of PIXIU
Financial Instruction Tuning Dataset (FIT)
The heart of PIXIU is the Financial Instruction Tuning Dataset (FIT), which consists of 136,000 samples. This dataset covers diverse financial tasks, including sentiment analysis, news headline classification, named entity recognition, question answering, and stock movement prediction.
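To make the instruction-tuning format concrete, here is a minimal sketch of what a single FIT-style record might look like. The field names (`instruction`, `input`, `output`) and the helper function are illustrative, not the exact FIT schema:

```python
# Illustrative instruction-tuning record for a sentiment-analysis task
# (field names are assumptions, not the exact FIT schema).
fit_record = {
    "instruction": ("Analyze the sentiment of this financial statement. "
                    "Answer with positive, negative, or neutral."),
    "input": "Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in 2004.",
    "output": "positive",
}

def to_prompt(record):
    """Concatenate instruction and input into a single training prompt."""
    return f"{record['instruction']}\nText: {record['input']}\nAnswer:"

print(to_prompt(fit_record))
```

During fine-tuning, the model would be trained to generate the `output` field given such a prompt.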
Financial Language Understanding and Prediction Evaluation Benchmark (FLARE)
To evaluate the performance of financial LLMs, PIXIU introduces FLARE, which consists of:
- Four financial NLP tasks with six datasets.
- One financial prediction task with three datasets.
FLARE ensures that financial LLMs are rigorously tested across various financial contexts, from text-based sentiment analysis to numerical stock movement prediction.
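At its simplest, benchmarking a model on a FLARE classification task means comparing generated labels against gold labels. The sketch below shows that core loop with exact-match accuracy; the example predictions are hypothetical, and FLARE also reports richer metrics (e.g. F1) for some tasks:

```python
def accuracy(preds, golds):
    """Fraction of exact label matches: the simplest benchmark metric."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Hypothetical model outputs vs. gold labels for a sentiment subset.
preds = ["positive", "neutral", "negative", "positive"]
golds = ["positive", "neutral", "positive", "positive"]
print(accuracy(preds, golds))  # 0.75
```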
Open-Sourced Datasets for Instruction Tuning
Unlike the self-instruct approach used to build models such as Alpaca, PIXIU relies on open-source datasets for instruction tuning. The reasons for this approach include:
- Open-source datasets are annotated by domain experts, ensuring high quality.
- They are cost-effective and free from commercial use restrictions.
- They cover a variety of financial text types, including news, reports, tweets, and multi-modal data like time series and tables.
Key Financial NLP Tasks in PIXIU
1. Financial Sentiment Analysis
PIXIU leverages two major datasets:
- Financial Phrase Bank (FPB): A collection of financial news sentences labeled with sentiment (positive, negative, neutral).
- FiQA-SA: Sentiment analysis on financial news and microblog posts, scored on a scale from -1 (most negative) to 1 (most positive).
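Because FiQA-SA provides continuous scores in [-1, 1] while FPB provides discrete labels, one common way to put them on the same footing is to discretize the scores. The neutral band width below is an illustrative choice, not FiQA's or PIXIU's specification:

```python
def score_to_label(score, band=0.1):
    """Map a continuous sentiment score in [-1, 1] to a discrete label.
    The +/-0.1 neutral band is an illustrative assumption."""
    if score > band:
        return "positive"
    if score < -band:
        return "negative"
    return "neutral"

print(score_to_label(0.64))   # positive
print(score_to_label(-0.37))  # negative
print(score_to_label(0.05))   # neutral
```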
2. News Headline Classification
PIXIU uses the Gold News Headline Dataset, which contains news headlines about gold from 2000 to 2019. However, a potential limitation is that financial journalism varies significantly across asset classes (forex, equities, ETFs, commodities). Training solely on gold-related news might not generalize well to other assets.
3. Named Entity Recognition (NER)
The FIN dataset is used for NER tasks, consisting of sentences from financial agreements in SEC filings. Entities are classified into LOCATION (LOC), ORGANIZATION (ORG), and PERSON (PER).
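NER datasets like FIN are conventionally distributed with token-level BIO tags. The sentence and helper below are illustrative, restricted to the three entity types named above:

```python
# Illustrative BIO-tagged sentence in the FIN style (entity types
# limited to LOC, ORG, and PER, as in the dataset described above).
tokens = ["This", "agreement", "is", "between", "Acme", "Corp",
          "and", "John", "Smith", "."]
tags   = ["O", "O", "O", "O", "B-ORG", "I-ORG", "O", "B-PER", "I-PER", "O"]

def extract_entities(tokens, tags):
    """Collect (entity_text, entity_type) spans from BIO tags."""
    entities, current, etype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                entities.append((" ".join(current), etype))
            current, etype = [], None
    if current:
        entities.append((" ".join(current), etype))
    return entities

print(extract_entities(tokens, tags))
# [('Acme Corp', 'ORG'), ('John Smith', 'PER')]
```

Evaluating an LLM on this task then reduces to comparing the entity spans it generates against spans extracted this way from the gold tags.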
4. Question Answering (QA)
Two datasets support financial QA tasks:
- FinQA: Expert-annotated question-answering pairs based on earnings reports.
- ConvFinQA: An extension of FinQA that introduces multi-turn conversation-based QA.
Notably, FinMA, the financial LLM derived from PIXIU, underperforms GPT-4 in QA tasks due to its limited context window and reasoning capabilities. Addressing these limitations is crucial for improving financial QA performance.
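The context-window pressure is easy to see in multi-turn QA: each new question carries the report context plus all earlier turns. The sketch below shows one plausible way to flatten a ConvFinQA-style dialogue into a single prompt (the format is an assumption, not ConvFinQA's official one):

```python
def build_convqa_prompt(context, turns):
    """Flatten a multi-turn financial QA dialogue into one prompt.
    Earlier Q/A pairs are prepended, so the prompt grows every turn."""
    lines = [f"Context: {context}"]
    for question, answer in turns:
        lines.append(f"Q: {question}")
        if answer is not None:
            lines.append(f"A: {answer}")
    return "\n".join(lines)

prompt = build_convqa_prompt(
    "Revenue was $120m in 2020 and $150m in 2021.",
    [("What was revenue in 2021?", "$150m"),
     ("How much did it grow from 2020?", None)],  # current, unanswered turn
)
print(prompt)
```

Because the prompt accumulates the full history, long reports and long conversations quickly exhaust a 2,048-token input window.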
5. Stock Movement Prediction
PIXIU includes three stock movement prediction datasets:
- BigData22
- ACL18
- CIKM18
These datasets frame stock movement prediction as a binary classification problem: a return above 0.55% is labeled positive (1), a return below -0.5% is labeled negative (-1), and examples falling between the two thresholds are typically discarded. However, this approach may not fully capture the complexities of financial markets, given the non-stationary nature of returns. Applying triple barrier labeling techniques could improve model robustness.
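The threshold-based labeling scheme can be sketched in a few lines; the function below uses the 0.55% / -0.5% cutoffs stated above and returns `None` for the ambiguous middle band:

```python
def label_return(pct_return, up=0.55, down=-0.5):
    """Binary stock-movement label using the datasets' thresholds.
    Returns falling strictly between the thresholds get None,
    reflecting the common practice of discarding ambiguous samples."""
    if pct_return >= up:
        return 1
    if pct_return <= down:
        return -1
    return None

print(label_return(0.8))   # 1
print(label_return(-1.2))  # -1
print(label_return(0.1))   # None
```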
Model Fine-Tuning and Performance
PIXIU fine-tunes two versions of FinMA on LLaMA-7B and LLaMA-30B:
- FinMA-7B: Fine-tuned for 15 epochs on 8 A100 40GB GPUs.
- FinMA-30B: Fine-tuned for 20 epochs on 128 A100 40GB GPUs.
Key training configurations:
- Optimizer: AdamW
- Batch size: 32 (7B) / 24 (30B)
- Learning rate: 8e-6
- Weight decay: 1e-5
- Warmup steps: 5% of training steps
- Input length: 2048 tokens
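The hyperparameters above can be collected into a single config; the sketch below uses the 7B values (swap in batch size 24 and 20 epochs for the 30B run). The dict keys and the warmup helper are illustrative, not PIXIU's actual training code:

```python
# FinMA-7B hyperparameters from the section above, as a plain config.
finma_7b_config = {
    "optimizer": "AdamW",
    "num_epochs": 15,
    "batch_size": 32,
    "learning_rate": 8e-6,
    "weight_decay": 1e-5,
    "warmup_ratio": 0.05,    # warmup over 5% of training steps
    "max_input_tokens": 2048,
}

def warmup_steps(total_steps, ratio):
    """Number of warmup steps implied by a warmup ratio."""
    return int(total_steps * ratio)

print(warmup_steps(10_000, finma_7b_config["warmup_ratio"]))  # 500
```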
Performance Comparison with GPT-4
FinMA achieves strong performance in sentiment analysis and headline classification but underperforms GPT-4 in:
- Named Entity Recognition (NER)
- FinQA and ConvFinQA
- Stock movement prediction (BigData22)
This suggests that while domain-specific tuning improves task-specific performance, general-purpose models still outperform in tasks requiring broader reasoning and extensive context.
Open Questions for Future Research
- How can we extend FinMA’s context window to improve its performance in QA tasks?
- Can triple barrier methods improve stock movement prediction accuracy?
- How does training on gold-related news impact the model’s ability to generalize to other financial assets?
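On the triple barrier question: the idea, popularized by López de Prado, is to label a price path by the first of three barriers it touches: an upper profit-taking barrier, a lower stop-loss barrier, or a vertical time barrier. A minimal sketch, with illustrative barrier widths rather than any values used by PIXIU:

```python
def triple_barrier_label(prices, upper=0.02, lower=-0.02, horizon=5):
    """Label a price path by the first barrier touched:
    +1 if the cumulative return crosses `upper`,
    -1 if it crosses `lower`,
     0 if the time horizon expires first.
    Barrier widths and horizon here are illustrative assumptions."""
    start = prices[0]
    for price in prices[1 : horizon + 1]:
        ret = (price - start) / start
        if ret >= upper:
            return 1
        if ret <= lower:
            return -1
    return 0

print(triple_barrier_label([100, 100.5, 101, 102.5, 101], horizon=4))  # 1
print(triple_barrier_label([100, 100.2, 99.8, 100.1], horizon=3))      # 0
```

Unlike the fixed-threshold scheme, this ties each label to the path of returns rather than a single endpoint, which is why it is often more robust to non-stationary return distributions.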
Conclusion
PIXIU represents a major step forward in developing open-source financial LLMs. By providing a tailored instruction dataset (FIT) and a rigorous evaluation benchmark (FLARE), PIXIU ensures that financial AI research remains transparent and accessible. While FinMA shows promise in specific tasks, further improvements in context window length and predictive methodologies are necessary to match or exceed the performance of general-purpose LLMs like GPT-4.
The financial AI community now has a strong foundation to build upon, and future iterations of PIXIU could further enhance its capabilities across a broader range of financial tasks.
For more details, visit the PIXIU GitHub repository.