Public development banks (PDBs) play a critical role in the global transition toward low-emissions climate-resilient development. Particularly in emerging markets and developing economies (EMDEs), PDBs are instrumental in accelerating climate investment by financing transition projects and developing de-risking mechanisms to “crowd in” private finance, shaping policy frameworks at a national and international level, and providing advisory services and technical assistance to accelerate the growth of climate project pipelines (CPI 2024).
Accordingly, comprehensive tracking of PDB climate ambition can provide important insights into where PDB support for low-emissions climate-resilient transition is strongest, as well as where greater efforts are needed to raise their ambition. For example, initial tracking reports in 2023 and 2022 showed that PDB climate ambition had thus far been largely concentrated among multilateral development banks (MDBs) and bilateral development finance institutions (DFIs) located in advanced economies.
In 2024, to test the robustness of these and other previous findings, we have increased tracking from 70 to 170 institutions, aiming specifically to increase coverage of smaller PDBs in EMDEs and add multilingual functionality to include non-English sources. However, implementing these additions meant that a drastically larger volume of unstructured information would need to be ingested in order to begin the analysis, a task that the data backend developed in 2022 and used again in 2023 was not fit to handle.
While the taxonomy of climate commitments was retained from the original methodology, in 2024, we completed an extensive overhaul of the commitments tracking process, utilizing artificial intelligence (AI) and machine learning (ML) tools to process substantially larger primary datasets and capture more robust information, leading to deeper analytical insights. See the 2024 report on PDBs’ climate commitments for key findings and recommendations that were informed by the AI/ML-enabled data-gathering process. In this technical blog post, we dive more deeply into the details of our new data methodology in the spirit of joint learning and knowledge exchange among climate finance analysts.
Figure 1: AI/ML-enabled Data Collection Pipeline for Tracking PDBs’ Climate Commitments
In this post, we present the revised methodology, which leverages AI/ML to solve the following data collection challenges:
- How can we identify the climate commitments adopted by PDBs?
- After PDBs’ climate commitments are identified, how do we extract relevant metadata that allows for detailed analysis of the commitments—e.g., when was the commitment made, what is the scale of the commitment’s ambition, etc.?
We have approached these challenges with the development of two specific AI/ML-enabled solutions:
- A suite of large language model (LLM) text classifiers that leverages natural language processing (NLP) and ML to label text snippets collected from PDB websites when they contain references to specific climate commitments.
- A complementary set of ChatGPT prompts that extract key metadata fields from labeled text snippets to form structured time series data that contains information on the scale of PDB ambition contained in each commitment, which can be expanded as new data is collected.
In working on these solutions, we have not only developed new research tools to deepen understanding of global financial institutions’ climate ambition but also uncovered key learnings that will inform the future deployment of AI/ML tools to support our analytical projects.
As the demand for comprehensive climate finance data continues to grow, CPI and other research and advisory organizations in the space will face a growing number of opportunities—and challenges—to utilize AI/ML approaches to enhance data gathering. The experience of developing AI/ML-enabled solutions for the purpose of tracking PDB climate commitments has yielded a few key learnings that can inform future efforts by CPI and/or partners:
- The development of AI/ML-enabled data collection tools is particularly advantageous when structured datasets are not readily available and/or when large volumes of data must be processed. At the moment, no structured dataset of PDB climate commitments exists outside of the tracking done by CPI, which can now be updated by ingesting thousands of text sources on an annual or semi-annual basis with minimal manual processing. Specifically, web scraping through Google Programmable Search provides a way to locate PDB climate commitments at scale, which are then automatically converted into a structured data table using LLM text classification models and metadata extraction prompts.
- Standardized data formatting imposed using AI/ML tools can facilitate linkages to complementary datasets, allowing for deeper analysis based on more comprehensive information. For example, the time series data structure returned by the AI/ML data pipeline is compatible with matching PDBs’ climate ambition against investment flows measured by CPI’s climate finance tracking, as well as other data sources that characterize the operating contexts faced by PDBs (e.g., level of host government policy support, maturity of financial system, climate investment pipeline, etc.). As a result, our analysis of PDB climate ambition is much more robust in 2024 than in previous years.
- Particularly in the case of NLP models, AI/ML tools can be continuously retrained and adapted to different information collection tasks, efficiently integrating new information into data pipelines without incurring large computing costs or expending significant analyst resources. The initial text classification model is re-trained from the ClimateBERT model for climate commitments and actions, which is subsequently adapted to create a series of secondary models that label commitment types according to CPI’s climate commitment taxonomy, which then feed into metadata extraction using a series of ChatGPT prompts. As such, we are able to produce a novel structured dataset from unstructured text inputs, supported by a processing pipeline where performance gains are passed through to downstream tasks, leading to continuous quality improvements without reliance on costly high-performance computing or manual collection.
As tracking of PDB climate commitments continues, further improvements can be made to the AI/ML-enabled data collection process to produce better quality data and maximize efficiencies in subsequent years. Opportunities include:
- Refining the list of key words used to web scrape search results from PDB websites by evaluating the effectiveness of each individual key word set, assessing which most effectively retrieve validated commitments. This would center on analysis of true positives and false negatives of each set to determine which key word combinations most consistently yield climate commitments.
- Exploring additional text classification models to better support ChatGPT’s parsing of metadata, with options including decision trees, random forests, logistic regression, Support Vector Machines (SVM), and k-Nearest Neighbors (KNN). For example, because KNN handles high-dimensional data inefficiently, it could be paired with TF-IDF (Term Frequency-Inverse Document Frequency) grouped tokenized terms, while an SVM could be used with the ungrouped tokenized terms (see the sketch below).
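As a hedged sketch of what such alternatives might look like, the snippet below pairs TF-IDF features with an SVM and a KNN classifier using scikit-learn. The variables texts and labels (snippet texts and binary sub-type labels) and all parameter choices are illustrative assumptions, not the project's implementation.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# SVM on the full (ungrouped) TF-IDF feature space
svm_clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())

# KNN on a reduced ("grouped") TF-IDF feature space, since KNN handles
# high-dimensional inputs poorly
knn_clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=500),
    KNeighborsClassifier(n_neighbors=5),
)

svm_clf.fit(texts, labels)
knn_clf.fit(texts, labels)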
In addition, structured data produced with the new AI/ML-enabled data collection process presents a number of potential opportunities for future research, such as:
- Inferential assessment of the effect that climate commitments have on the volume and sectoral composition of PDBs’ direct financing flows.
- Comparison of PDBs’ climate ambition to that of private financial institutions tracked by CPI’s Net Zero Finance Tracker.
- Integration of tracked PDB climate ambition into CPI’s Climate Finance Reform Compass as a progress indicator.
The following sections describe the technical data science methods used to develop the aforementioned AI/ML-enabled tracking tools. This includes a detailed discussion of model training and fine-tuning, as well as an evaluation of how the models performed after being applied to unfamiliar data outside of a train/test environment.
Solution part 1: Training an LLM text classifier to identify climate commitments
We started the process of identifying climate commitments made by PDBs by scraping relevant text snippets from PDB websites using Google Programmable Search queries. Queries are constructed around a set of English key words, which are translated into Spanish, French, and Portuguese (languages commonly found among the sample of tracked PDBs) to enable more exhaustive data collection. The key words, shown below, were selected to mimic the basic vocabulary of climate commitments made by PDBs.
Commitment area | Key word |
---|---|
Paris alignment | (announce | commit | pledge | target | aim) AND (align | aligning | alignment) AND Paris AND (agreement | “climate agreement” | accords | goals) |
Mitigation targets | (announce | commit | pledge | target | aim | achieve | align) AND (“net zero” | net-zero | ((climate OR carbon) AND (neutral | neutrality))) |
Mitigation targets | (announce | commit | pledge | target | aim | achieve) AND (reduce | reduction | cut | slash | decrease | peak) AND (emissions | carbon | GHG) |
Climate investment goals | (announce | commit | pledge | dedicate | establish | aim) AND (green | climate | renewable | “low carbon” | “clean energy” | waste | sustainable | SDG | ESG | adaptation) AND (finance | invest | fund | financing) AND (goal | target | objective) |
Climate investment goals | (announce | commit | pledge | dedicate | establish | aim) AND (finance | invest | fund) AND (protection | preservation | restoration | conservation) AND (biodiversity | forest | pollution | water) |
Divestment and exclusion policies | (divest | stop | end | exclude | reduce | “phase out” | “phase down” | quit | divest | “cut off”) AND (fossil fuels | coal | oil | gas | methane | unabated | deforestation) |
Integration actions | climate AND (action | transition) AND (management | strategy | plan | framework | “capacity building” | engagement | disclosure | department | product | offering) |
Integration actions | (announce | adopt | set | establish | apply | implement) AND carbon AND (price | tariff | credit) |
Integration actions | (assess | report | evaluate | monitor | disclose | integrate | manage | screen) AND climate AND (risk | vulnerability) |
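For illustration, a minimal sketch of how one such keyword set could be submitted to the Google Programmable Search JSON API is shown below. The API key, search engine ID, and the exact query rendering are placeholders, not the project's implementation; the production pipeline also issues translated queries in Spanish, French, and Portuguese.

import requests

# Placeholders: supply your own API key and a Programmable Search engine
# configured to search the target PDB websites
API_KEY = "<your-api-key>"
ENGINE_ID = "<your-search-engine-id>"

# One illustrative rendering of the "Mitigation targets" keyword set from the table above
query = '(announce OR commit OR pledge OR target OR aim OR achieve) ("net zero" OR net-zero OR "carbon neutral")'

response = requests.get(
    "https://www.googleapis.com/customsearch/v1",
    params={"key": API_KEY, "cx": ENGINE_ID, "q": query},
)

# Each result item carries a short text snippet used for downstream classification
snippets = [item["snippet"] for item in response.json().get("items", [])]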
However, search results often include “non-commitment” text snippets that contain some assortment of relevant key words but do not refer to an actual climate commitment. In order to separate text snippets that do reference climate commitments announced by PDBs from these non-commitment text snippets, we trained an LLM text classifier. The model is re-trained from the ClimateBERT model for climate commitments and actions. ClimateBERT is an LLM developed from the DistilRoBERTa NLP model and trained on over 2 million climate-related paragraphs.
To re-train the ClimateBERT model, we used the Python packages transformers and torch (PyTorch), which tokenize the search result text content and run it through a deep learning (transformer neural network) model training process.
The model is re-trained on a rebalanced dataset that contains a 50-50 split between labeled commitments and non-commitments, across a total of 3,066 observations. This training set is sourced from the results of previous climate ambition tracking among public (CPI 2023 and 2022) and private (CPI 2022 and 2021) financial institutions. However, since the full results of these previous tracking efforts show a roughly 75-25 split between non-commitments and commitments, rebalancing allows the model to better “learn” the defining features of the minority class (i.e., commitments) than it would if trained on an unbalanced dataset.
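A minimal sketch of one way to produce such a 50-50 split is shown below, assuming a labeled DataFrame (labeled_df) with a binary Commitment column; whether the project downsampled or upsampled is not specified, so downsampling the majority class is shown purely for illustration.

import pandas as pd

# Downsample the majority class so commitments and non-commitments are equally represented
commitments = labeled_df[labeled_df['Commitment'] == 1]
non_commitments = labeled_df[labeled_df['Commitment'] == 0]

n = min(len(commitments), len(non_commitments))
balanced_df = pd.concat([
    commitments.sample(n=n, random_state=42),
    non_commitments.sample(n=n, random_state=42),
]).sample(frac=1, random_state=42)  # shuffle the combined training set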
These data are randomly separated into training and validation sets on an 80-20 basis. The model is fine-tuned using a cross-entropy loss function and Adam optimization. Model performance is evaluated on the basis of precision (% of true positives out of all classified positives), recall (% of total positives captured), and F1 score (the harmonic mean of both). Accuracy on the validation set is also considered but is less insightful than the aforementioned measures, as it does not indicate which class (i.e., positive or negative) the model performs better on, a key nuance needed to guide model fine-tuning. The latest version of the commitment classification model performs with a validation accuracy of 90.23%, an F1 score of 90.08, precision of 91.14, and recall of 89.81. See a demonstration of the model in the code snippet below:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

# Load the re-trained commitments classification model and its tokenizer
model_name = "nkc98/commitments-classification-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Build a text classification pipeline (device=-1 runs inference on CPU)
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=-1)

# Classify an example search result snippet
result = pipe('British International Investment accelerates climate finance ... Alongside increasing its delivery of climate finance, BII is committed to Paris alignment and is developing a strategy for reaching net zero at a portfolio ...', padding=True, truncation=True)
result
In the example above, running the text snippet “British International Investment accelerates climate finance … Alongside increasing its delivery of climate finance, BII is committed to Paris alignment and is developing a strategy for reaching net zero at a portfolio …” through the classification model returns a label of “yes” (indicating that the text snippet does indeed correspond to an announced climate commitment) with an associated probability of 99.5%.
Once a text snippet has been classified as a climate commitment, it is then assigned an additional label corresponding to sub-types of commitments within CPI’s climate commitment taxonomy:
- Targets. Signaling intent to achieve specific climate-relevant objectives, potentially resulting in engagement and climate finance flows. This dimension tracks both qualitative commitments and quantitative targets adopted to address climate change, such as:
- Paris alignment
- Mitigation targets
- Net zero targets
- Carbon neutrality targets
- Interim emissions targets
- Climate investment goals
- Integration actions. Measures to mainstream climate into PDB decision-making, potentially increasing climate finance flows (or decreasing flows to projects without climate benefits or even negative climate effects). These are qualitative changes to institutional policies, governance, and investment approaches including:
- Institutional climate strategies
- Exclusion and divestment policies
- Counterparty engagement guidelines
These sub-type labels are provided by a secondary set of transformer neural network classification models that are derived from the primary commitments model described above. Specifically, manually validated climate commitment text snippets that correspond to each sub-type (i.e., Paris alignment, net zero, climate investment goals) are used to iteratively re-train the primary model so that it “learns” a new task of accurately identifying text snippets that correspond to a particular sub-type of commitment. These secondary models can essentially be understood as estimating the conditional probability that, given a text snippet has already been labeled as a commitment, it can further be categorized within a commitment sub-type, e.g., P(net zero | commitment).
import torch
from datetime import datetime
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load commitments model and tokenizer as the starting point for each sub-type model
model_name = "nkc98/commitments-classification-model"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
...
# Tokenize the manually validated snippets for the commitment sub-type being learned
inputs = tokenizer(list(training_df['Text']), return_tensors="pt", padding="max_length", truncation=True, max_length=128)
...
# (elided above: construction of sequence_classification_model, optimizer,
# and train/validation data loaders from the tokenized inputs)
start_time = datetime.now()

# Training
for epoch in range(num_epochs):
    sequence_classification_model.train()
    total_loss = 0.0
    for batch in train_loader:
        input_ids = batch[0]
        attention_mask = batch[1]
        labels = batch[2]
        optimizer.zero_grad()
        outputs = sequence_classification_model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss  # cross-entropy loss returned by the model
        loss.backward()
        optimizer.step()
    ...

    # Validation on the held-out 20% split
    sequence_classification_model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for batch in val_loader:
            input_ids = batch[0]
            attention_mask = batch[1]
            labels = batch[2]
            ...
Note that secondary labels are not mutually exclusive—for example, the text snippet “British International Investment accelerates climate finance … Alongside increasing its delivery of climate finance, BII is committed to Paris alignment and is developing a strategy for reaching net zero at a portfolio …” actually corresponds to four overlapping secondary labels: Paris alignment, net zero, climate investment goals, and institutional climate strategies.
After text snippets are processed through ML text classification, they are returned as a structured data observation in the format below:
Text Snippet | Commitment | Paris Alignment | Net Zero | Carbon Neutral | Interim Target | Investment | Divestment | Institutional Strategy | Counterparty Engagement |
---|---|---|---|---|---|---|---|---|---|
British International Investment accelerates climate finance … Alongside increasing its delivery of climate finance, BII is committed to Paris alignment and is developing a strategy for reaching net zero at a portfolio … | True | True | True | False | False | True | False | True | False |
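To illustrate how the primary and secondary classifiers are chained to produce a row like the one above, a simplified sketch follows. Only the primary commitments model name is confirmed; the secondary model names are hypothetical placeholders for the sub-type models described earlier, and we assume they use the same yes/no label scheme as the primary model.

from transformers import pipeline

# Primary commitments classifier (confirmed model) and secondary sub-type
# classifiers (hypothetical placeholder names; one model per sub-type label)
primary = pipeline("text-classification", model="nkc98/commitments-classification-model", device=-1)
secondary = {
    "Paris Alignment": pipeline("text-classification", model="<paris-alignment-model>", device=-1),
    "Net Zero": pipeline("text-classification", model="<net-zero-model>", device=-1),
    # ... remaining sub-type models
}

def classify_snippet(text):
    row = {"Text Snippet": text}
    # Primary label: is this snippet a climate commitment at all?
    row["Commitment"] = primary(text, truncation=True)[0]["label"] == "yes"
    # Secondary labels: sub-types are only assigned to confirmed commitments,
    # mirroring P(sub-type | commitment); labels are not mutually exclusive
    for label, clf in secondary.items():
        row[label] = row["Commitment"] and clf(text, truncation=True)[0]["label"] == "yes"
    return row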
Solution part 2: Extracting commitment metadata with ChatGPT
In order to maximize the efficiency of ChatGPT-supported metadata extraction, we incorporated text classification machine learning models, such as neural networks, to subset the dataset to observations that contain the relevant metadata fields. Subsequently, we employed various task optimization techniques to guide ChatGPT’s extraction process. Finally, we used a schema validation script to verify that ChatGPT’s responses align with the expected data type and value ranges.
Step 1: Applying labels to subset extraction tasks
To streamline and optimize the process of extracting metadata from commitments, we incorporated ChatGPT into the data gathering workflow to complete tasks that otherwise would have been done manually. However, in order to optimize the cost of utilizing ChatGPT [1], we first distinguish between text snippets based on their relevance to each extraction prompt. To complete this pre-processing task, we first tried using a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer to categorize unlabeled data for relevancy to specific commitment sub-types; however, after manual validation of the metadata fields, we instead used a set of transformer neural network models (described in Solution Part 1) to label commitment types and then pass text snippets to appropriate extraction prompts.
Specifically, after commitments were classified into specific sub-categories such as mitigation targets, climate investment goals, or exclusion and divestment policies, they were then fed to extraction prompts based on the types of metadata values they would likely contain. For example, an interim mitigation target is very likely to contain a percentage carbon or greenhouse gas emissions value, while a commitment to an institutional climate strategy would not be.
In addition to those secondary commitment type labels, we also implemented helper functions to identify if a commitment contains a date, monetary values, or percentages to further subset the data for metadata extraction. Overall, this strategy saved costs associated with API calls by ~78%. Additionally, similar to Retrieval Augmented Generation (RAG) (Gao et al. 2024), this methodology should theoretically reduce the probability of ChatGPT generating a “hallucination,” i.e., fabricating a value that is not actually present in the text snippet.
# Flag snippets that mention a percentage value
df['contains_percent'] = 0
df.loc[df['Text'].str.contains(r'%|percent|per cent', case=False, na=False), 'contains_percent'] = 1

# Keep only mitigation commitments that contain a percentage before prompting
df = df[(df['Mitigation'] == 1) & (df['contains_percent'] == 1)]

# Send the subset to the relevant extraction prompt (project-internal helper and prompt variable)
_df = prompt.open_ai_prompt(df, pv.commitment_announcement_v4)
In this example, without targeted sub-setting the query would have contained 1,378,582 tokens, but after implementing this helper function strategy, the total token count was reduced to 291,279 tokens, saving about 78% of query costs.
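Token counts such as those reported above can be estimated before any API call is made. Below is a minimal sketch using the tiktoken package; this is an assumption for illustration, as the counting method actually used in the project is not documented here.

import tiktoken

# Count the tokens that would be sent for a given subset of text snippets
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
total_tokens = sum(len(encoding.encode(text)) for text in df['Text'])
print(total_tokens)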
Step 2: ChatGPT supported field extraction
Within subsets of the total text snippet dataset, ChatGPT is then used to parse relevant metadata values. Unlike the tools used in the previous stage, which were trained on external data to classify commitments, ChatGPT is able to pinpoint relevant information within a text snippet without any prior training, owing to its named entity recognition and event extraction capabilities (Huang, Y. and Huang, J. 2020). As such, use of ChatGPT removed the time- and labor-intensive process of manual data labeling and model fine-tuning.
Prior to the release of gpt-4o-mini, we implemented gpt-3.5-turbo-0125 to support metadata extraction due to its significantly lower cost for development and testing. With the release of gpt-4o-mini, we may switch to the newer and cheaper model in future iterations; however, for the purposes of this report, we refer to gpt-3.5-turbo-0125 when discussing methodology. Users can follow the official documentation to make a connection to the OpenAI API [2]. Additionally, we recommend following the official documentation for safe API practices [3].
In order to properly utilize ChatGPT responses [4], we set the parameter response_format={"type": "json_object"} to ensure responses were returned as JSON objects.
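For illustration, a minimal sketch of how such a call could be made with the openai Python SDK (v1.x) is shown below. The function name is hypothetical, and the production pipeline differs slightly (for example, it streams responses, as shown in a later fragment); the model name and JSON-API persona mirror those discussed in this post.

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def extract_metadata(prompt, text_snippet):
    # Request a JSON-only response from the model used for metadata extraction
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": f'You are an API that "responds only in JSON" for parsing of meta data from text snippets.\n{prompt}'},
            {"role": "user", "content": text_snippet},
        ],
    )
    return completion.choices[0].message.content  # JSON string, parsed later by response_parse()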
import json

def response_parse(df):
    ...
    # (elided: loop over DataFrame rows by index)
    string = df.loc[index, 'response']
    json_file = json.loads(string)  # parse the JSON object returned by ChatGPT
    for i in json_file:
        df.loc[index, i] = json_file[i]  # write each extracted field to its own column
    ...
    return df
During the development of the initial pipeline, we opted for iterating prompts as opposed to building a fine-tuned model. Some evidence suggests that fine-tuned models have a propensity to overfit on training data, based on a theory that these models may exaggerate actual performance on an underlying task (Brown et al. 2020). Moreover, given that the extraction task is fairly straightforward and already supported by a number of pre-processing steps (e.g., indexing relevance from commitment type labels and helper functions), we believed that prompt engineering would be more efficient than fine-tuning models for each specific task. However, the balance of tradeoffs between prompt engineering and fine-tuning can vary by task; other users can find in-depth strategies for their own use case in OpenAI’s prompt engineering guide.
Specifically, we implemented prompt engineering suggestions to “split complex tasks into simpler subtasks”, “provide reference text”, “test changes systematically”, “adopt a persona”, “use delimiters”, and “provide examples”. We instructed ChatGPT to adopt an API persona that parses data and returns the response as a JSON object. For the prompts themselves, we implemented a “few-shot” examples approach.
...
# Messages passed to the chat completions call (the surrounding API call is elided);
# the system message combines the JSON-API persona with the task-specific prompt,
# and the user message is the text snippet to parse
messages=[{"role": "system", "content":(f"""
You are an API that "responds only in JSON" for parsing of meta data from text snippets.
{prompt}
""")},
{"role": "user", "content": row['Text']}],
stream=True,)
...
percentage_prompt = """
You are tasked with identifying percentages within text snippets.
These snippets include percent reduction in carbon emissions, ghg emissions, or unspecified emissions;
Or include percent reduction or increase in finances.
Provide the values as floats. If there is a range go with the lower number.
If there is no data use "NULL".
Please provide the information as a JSON.
Your response should look like:
{"carbon_pct": "xx.xx",
"ghg_pct": "xx.xx",
"unknown_emissions_pct" : "xx.xx",
"finance_reduction_pct" : "xx.xx",
"finance_increase_pct" : "xx.xx",
"unknown_finance_pct" : "xx.xx"}
or
{"carbon_pct": "NULL",
"ghg_pct": "NULL",
"unknown_emissions_pct" : "NULL",
"finance_reduction_pct" : "NULL",
"finance_increase_pct" : "NULL",
"unknown_finance_pct" : "NULL"}
...
"""
For the prompts themselves, research suggests that few-shot examples improve the accuracy of “fill in the blank” tasks by 18% when compared to zero-shot prompting (Brown et al. 2020).
"""
...
Example 1:
"North American Development Bank SUMMARY OF PROJECT ...
50% and will achieve nearly 24% lower carbon dioxide (CO2) emissions.
The reduction in criteria pollutant emissions is even higher for compressed natural ..."
should return:
{"carbon_pct": "24.00",
"ghg_pct": "NULL",
"unknown_emissions_pct" : "50.00",
"finance_reduction_pct" : "NULL",
"finance_increase_pct" : "NULL",
"unknown_finance_pct" : "NULL"}
Example 2:
"PRIVATE SECTOR DIAGNOSIS (CPSD): CREATING... to 10.5% in 2015 before gradually decreasing to reach 5.5% in 2019,
... country consists of a reduction in greenhouse gas emissions of 11.14%.. ."
should return:
{"carbon_pct": "NULL",
"ghg_pct": "11.14",
"unknown_emissions_pct" : "10.5",
"finance_reduction_pct" : "NULL",
"finance_increase_pct" : "NULL",
"unknown_finance_pct" : "NULL"}
"""
Below is an example of a commitment and the associated response from ChatGPT (in JSON format) to the prompt provided:
Commitment | Response |
---|---|
KBN Green Bond Framework 2024 2030 target to reduce our own emissions by. 55% vs 2019 levels. The previous target set in 2020, was to achieve a 50% reduction by. 2030. KBN’s greenhouse gas … | {“carbon_pct”: “NULL”, “ghg_pct”: “[55.0, 50.0]”, “unknown_emissions_pct”: “NULL”, “finance_reduction_pct”: “NULL”, “finance_increase_pct”: “NULL”, “unknown_finance_pct”: “NULL”} |
The JSON response object is then converted into a data frame as seen below.
text | carbon_pct | ghg_pct | unknown_emissions_pct | finance_reduction_pct | finance_increase_pct | unknown_finance_pct |
---|---|---|---|---|---|---|
KBN Green Bond Framework 2024 2030 target to reduce our own emissions by. 55% vs 2019 levels. The previous target set in 2020, was to achieve a 50% reduction by. 2030. KBN’s greenhouse gas … | NULL | [55.0, 50.0] | NULL | NULL | NULL | NULL |
Step 3: Finalization of commitments table schema
To validate the responses of ChatGPT, we used the Python package Pandera together with YAML schema files to verify data types and enforce expected ranges of values. This iterative process can also be used as a tool to identify edge cases or diagnose issues in the pipeline.
As a first validation step, we incorporated a YAML file to coerce data types and acceptable ranges. We opted for YAML due to its user readability and its support for nested structures and schema definitions (YAML 2024). To scaffold the schema, we used Pandera’s infer_schema function, which infers and coerces the data types of the data frame; after generating the inferred schema, we revised all values to reflect the specifications of the project. To learn more about YAML files in Pandera, developers can read the documentation.
from pandera.io import from_yaml
import pandera as pa

# Infer a draft schema from the extracted-metadata data frame
schema_inf = pa.infer_schema(df)

# Serialize the inferred schema to YAML, then load it back as a schema object
yaml_inf = schema_inf.to_yaml()
yaml_schema = from_yaml(yaml_inf)

# Save the YAML schema to disk for manual revision
with open('./schema_test.yaml', 'w') as file:
    file.write(yaml_inf)
schema_type: dataframe
version: 0.20.3
columns:
...
text:
title: null
description: null
dtype: str
nullable: false
checks: null
unique: false
coerce: true
required: true
regex: false
net_zero:
title: null
description: null
dtype: bool
nullable: false
checks: null
unique: false
coerce: true
required: true
regex: false
...
ghg_pct:
title: null
description: null
dtype: float64
nullable: true
checks:
greater_than_or_equal_to: 0.0
less_than_or_equal_to: 100.0
unique: false
coerce: false
required: true
regex: false
...
dtype: null
coerce: true
strict: false
name: null
ordered: false
unique: null
report_duplicates: all
unique_column_names: false
add_missing_columns: false
title: null
description: null
After creating and modifying the schema, we used a script to validate the data against the schema structure. This function returned all rows (i.e., text snippets) that failed to meet the validation criteria. A few example validation cases are shown in the table below.
schema_context | column | check | check_number | failure_case | index |
---|---|---|---|---|---|
Column | publication_date | coerce_dtype(‘datetime64[ns]’) | None | 6 days ago . | 429 |
Column | money_amount | coerce_dtype(‘Int64’) | None | nan | 449 |
Column | money_amount | coerce_dtype(‘Int64’) | None | 300,000,000 | 451 |
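For reference, a minimal sketch of how such failure cases can be collected with Pandera’s lazy validation is shown below; the schema file name follows the earlier example, and the exact validation script used in the pipeline may differ.

import pandera as pa
from pandera.io import from_yaml

# Load the revised YAML schema and validate the extracted metadata against it.
# lazy=True collects every failing check instead of raising on the first one.
with open('./schema_test.yaml') as file:
    schema = from_yaml(file.read())

try:
    validated_df = schema.validate(df, lazy=True)
except pa.errors.SchemaErrors as err:
    # failure_cases lists schema_context, column, check, failure_case, and index
    print(err.failure_cases)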
Where ChatGPT metadata extraction returned values flagged as invalid according to the schema, they were corrected programmatically for each failure case. For example, in failure cases where date fields were returned as relative time (e.g., “6 days ago”) instead of absolute time (e.g., “December 10, 2024”), they were corrected using the date when the text snippet was web scraped. That is, a failure case date field value of “6 days ago” was corrected using a reference web scraping date of August 22, 2024, replacing the value with August 16, 2024.
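A minimal sketch of such a relative-date correction is shown below; the helper function is an assumed illustration rather than the project’s exact code.

from datetime import datetime, timedelta
import re

# Convert values like "6 days ago" to an absolute date using the scrape date
def resolve_relative_date(value, scrape_date):
    match = re.match(r'(\d+)\s+days?\s+ago', str(value).strip())
    if match:
        return scrape_date - timedelta(days=int(match.group(1)))
    return value

resolve_relative_date('6 days ago .', datetime(2024, 8, 22))  # -> datetime(2024, 8, 16)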
After extracted metadata was validated automatically using the YAML schema, we conducted a final manual validation to resolve any lingering data quality issues, directly validating responses in Excel and appending corrected values as additional columns. Manual validation also included a review of secondary commitment classification labels (i.e., Paris alignment, net zero, climate investment goals, etc.).
Step 4: Evaluating performance of AI/ML solutions
To understand how well the AI/ML tools performed, we compared manually validated metadata values against extracted values to produce a confusion matrix, using an additional new subset of text snippets collected after initial training/testing. Accordingly, we evaluated the ability of each transformer neural network model and ChatGPT extraction prompt to classify and/or extract text data into correct categories and values by assessing the relative prevalence of true positive and true negative results against false positive and false negative results.
In addition to traditional performance metrics (accuracy, specificity, precision, recall), we also included both F1 (values range from 0 to 1) and Matthews correlation coefficient (MCC; values range from -1 to 1) to measure each model’s ability to classify information across all classes (positive and negative). For both metrics, a score of “1” indicates perfect classification. MCC can be especially useful for measuring performance on imbalanced datasets (Chicco and Jurman 2020), which is particularly relevant to this exercise given that the PDB climate commitments dataset tends to have a low ratio of true positives to true negatives.
Furthermore, unlike F1, MCC incorporates predictions that were accurately predicted as negative, correcting an evaluative bias that occurs because F1 over-indexes model performance on the positive class while ignoring true negatives. Accordingly, unlike F1, MCC only indicates strong performance when a model accurately predicts both positive and negative classes, regardless of class balance. Finally, we also used scikit-learn’s classification report to evaluate performance, which provides macro and weighted averaging, as well as individual class evaluations.
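To illustrate why MCC is reported alongside F1, consider a toy example (illustrative only, not project data) in which a classifier predicts the positive class for every observation in an imbalanced sample: F1 remains high while MCC collapses to zero.

from sklearn.metrics import f1_score, matthews_corrcoef

# Toy example: 90 positives, 10 negatives, and a model that predicts "positive" every time
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100

print(f1_score(y_true, y_pred))           # ~0.95, despite the model never identifying a negative
print(matthews_corrcoef(y_true, y_pred))  # 0.0, reflecting no real discriminative power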
Evaluating climate commitment labeling models (i.e., Solution part 1)
As described in Solution Part 1, a set of transformer neural network models were trained to label text snippets as various sub-types of climate commitments. To evaluate how well these more sophisticated models perform against alternatives, we created a baseline model for comparison, which classifies text snippet observations using simple string-matching (i.e., labels are determined only on the basis of certain terms and sequences of words appearing within a given text snippet). Rather than an alternative baseline, such as a stratified classifier based on class distribution, string-matching was selected due to its relatively intuitive nature, which allows for interpretability. String-matching terms were selected by using N-gram (i.e., text features of lengths 1 to n) plots specific to each commitment type.
A sample of the N-grams most frequently appearing within net zero and carbon neutrality commitments can be seen below:
We use these text features to subset the commitments to observations that contain a sequence of “net” and “zero” while avoiding features that would indicate a commitment belonging to the separate carbon neutral category. This method should theoretically limit the number of misclassified commitments where terms are used interchangeably (e.g., “net zero carbon”).
def baseline_bools(df):
    ...
    # Initialize all boolean columns to 0
    bool_cols = [..., 'Net Zero', ...]
    df[bool_cols] = 0
    ...
    net_zero_pattern = r'\bnet[\s-]+zero\b'        # Identify net-zero commitments
    zero_carbon_pattern = r'\bzero[\s-]+carbon\b'  # Ignore mentions of zero-carbon; these are assumed to be carbon neutral commitments
    df.loc[(df['Text'].str.contains(net_zero_pattern, case=False, regex=True)) & (~df['Text'].str.contains(zero_carbon_pattern, case=False, regex=True)), 'Net Zero'] = 1
    ...
    return df
When commitments correspond to a narrow and specific set of text features, such as the above “net zero” example, classifying with string-matching is expected to be more effective. Unlike a neural network, this method avoids training on confounding features that a model can pick up on, thus reducing the probability of overfitting when generalizing the model. However, more verbose (i.e., greater variety of corresponding text features) or vaguely defined commitments will be harder to classify using this simplistic baseline model.
For example, we found that net zero and carbon neutrality commitments tend to be articulated using similar language with overlapping contextual text features, possibly leading to false positive classification by the transformer neural network (i.e., carbon neutrality commitments incorrectly labeled as net zero and vice versa). When evaluating a baseline string-matching model side by side with the transformer neural network, we can identify the prevalence of this and other confounding issues by comparing the respective false positive and false negative rate of each method.
Examples of the confusion matrix and classification report results for the net zero and climate investment goal classification models are shown below. In both cases, transformer neural network performance was compared to the string-matching baseline model described previously.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

def cm(_df, i):
    ...
    # Re-create labels for the baseline: drop the neural network predictions and
    # apply the string-matching rules instead
    df_sm = _df.drop(['Paris Aligned', 'Net Zero', 'Carbon Neutral', 'Interim Target',
                      'Mitigation', 'Investment', 'Divestment', 'Institutional Strategy'], axis=1)
    df_sm = baseline_bools(df_sm)
    y_true_sm = df_sm[f'{i} True']
    y_pred_sm = df_sm[i]
    cm1 = confusion_matrix(y_true_sm, y_pred_sm)

    # Neural network predictions for the same commitment type
    y_true_nn = _df[f'{i} True']
    y_pred_nn = _df[i]
    cm2 = confusion_matrix(y_true_nn, y_pred_nn)

    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    ...
    # Compute evaluation metrics for both models
    metrics = {}
    for model_name, y_true, y_pred, cm in zip(['String Match', 'Neural Network'],
                                              [y_true_sm, y_true_nn],
                                              [y_pred_sm, y_pred_nn],
                                              [cm1, cm2]):
        tn, fp, fn, tp = cm.ravel()
        accuracy = (cm.diagonal().sum()) / cm.sum()
        precision = precision_score(y_true, y_pred, average='weighted')
        recall = recall_score(y_true, y_pred, average='weighted')
        f1 = f1_score(y_true, y_pred, average='weighted')
        specificity = tn / (tn + fp) if (tn + fp) > 0 else 0
        mcc = matthews_corrcoef(y_true, y_pred)
        metrics[model_name] = {'Accuracy': accuracy,
                               'Precision': precision,
                               'Recall': recall,
                               'F1 Score': f1,
                               'Specificity': specificity,
                               'MCC': mcc}
    ...
    metrics_df = pd.DataFrame(metrics).T
    metrics_df.index.name = 'Model'
    plt.tight_layout()
    plt.show()
[Figure: Confusion matrices for climate investment goal classification, string match vs. neural network]
Model | Accuracy | Precision | Recall | F1 Score | Specificity | MCC |
---|---|---|---|---|---|---|
String Match | 0.718526 | 0.911170 | 0.718526 | 0.788316 | 0.724384 | 0.195209 |
Neural Network | 0.797339 | 0.941686 | 0.797339 | 0.845673 | 0.789589 | 0.396129 |
[Figure: Confusion matrices for net zero classification, string match vs. neural network]
Model | Accuracy | Precision | Recall | F1 Score | Specificity | MCC |
---|---|---|---|---|---|---|
String Match | 0.943705 | 0.945839 | 0.943705 | 0.944643 | 0.964021 | 0.708473 |
Neural Network | 0.849028 | 0.934560 | 0.849028 | 0.873396 | 0.834951 | 0.568485 |
Interestingly, performance results were fairly close between the transformer neural network and string-matching classification models, and for some labeling tasks the string-matching model performed better than the transformer neural network. Looking at the performance of the transformer neural network, despite class re-balancing to an even 50-50 distribution, we still observed a tendency to over-predict positive classes while also making fewer false negative classifications when compared to the string-matching baseline model. Notably, for classifying both net zero and climate investment goal commitments, the transformer neural network performs with F1 scores and MCC scores that would be considered “good” or even “strong” by rule of thumb standards for model performance, while the string-matching model performs excellently on net zero commitments but fairly poorly on climate investment goals (MCC score, in particular, is quite low).
In theory, the simplicity of net zero commitments text features allowed the string match to perform better, in that any text snippet containing a combination of “net” and “zero” was very likely to be a net zero commitment. Conversely, the transformer neural network struggled to some degree with identifying a clear text feature signal to train on, and likely overfit on some spurious features of the training text snippets.
On the other hand, for more complex labeling tasks such as identifying climate investment goals (”Investment” above), the transformer neural network performed better than string matching. A likely explanation is that, due to the complexity and variations of text features across climate investment goals, string matching does not effectively discern between positive and negative classes due to the multiple ways in which climate investment goals can be stated, leading to a high volume of potential edge cases where text snippets containing key features such as “investment” or “finance” do not consistently correspond to positive or negative classes.
Based on these results, we recommend implementing a mixed model testing approach after evaluating the relative complexity of a commitment’s text features, to determine if a simple method such as string matching should be used as opposed to a more complex method such as a neural network. For this particular project, a transformer neural network model (ClimateBERT) was readily available to be re-trained on climate commitments data, providing an accessible solution to improve data gathering. In other scenarios where existing models are not available and a model must be developed without a baseline, starting with simpler models, such as a logistic regression, may be an effective strategy.
Evaluating metadata extraction with ChatGPT (i.e., Solution part 2)
In addition to evaluating the secondary labeling of climate commitment types, we created an evaluation function to assess the performance of ChatGPT when extracting key metadata fields (Solution part 2). Performance was assessed on the following metrics: false negatives, false extractions, hallucinations, and correct extractions:
- False negatives, in this context, are observations that ChatGPT has incorrectly identified as Null when there is actually a true value included in the text snippet.
- False extractions are observations where ChatGPT has mistakenly extracted a different value contained in the text snippet that does not correspond to the true value for the targeted metadata field.
- Hallucinations are observations where an invented value was inferred by the LLM when no metadata value was actually present (i.e., Null).
- Finally, correct extractions are observations where ChatGPT has extracted the true values from text snippets.
Note that all error cases were corrected during manual validation before data was finalized for analysis. This includes out of sample false negatives, where text snippets were not passed to ChatGPT for metadata extraction due to mislabeling upstream, but were later discovered during manual validation.
def evaluate_gpt_performance(df):
    # Map each metadata field to the commitment sub-type label(s) used to subset its sample
    category_lookup = {
        'Carbon Neutral Announced': ['Carbon Neutral'],
        'Carbon Neutral Target': ['Carbon Neutral'],
        ...
    # Accumulators for per-field evaluation results
    metrics = {
        'Column': [],
        'Accuracy': [],
        'Total_Observations': [],
        'Total_Samples': [],
        ...
    # (elided: loop over metadata fields, comparing extracted values against manually validated values)
    predictions = df[pred_col].fillna('nan').astype(str)
    ground_truth = df[true_col].fillna('nan').astype(str)
    hallucination_sample_rate = (hallucinations / total_samples
                                 if total_samples > 0 else 0)
    false_negative_sample_rate = (false_negatives / total_samples
                                  if total_samples > 0 else 0)
    ...
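As an illustration of the four outcome categories defined above, a simplified bucketing rule might look like the sketch below, assuming the string-typed predictions and ground_truth series from the snippet above, where 'nan' marks an absent value; the project’s evaluation function handles additional edge cases.

def bucket_outcome(pred, true):
    """Assign one of the four evaluation categories to a single extraction."""
    if pred == 'nan' and true != 'nan':
        return 'false_negative'    # ChatGPT returned NULL but a value was present
    if pred != 'nan' and true == 'nan':
        return 'hallucination'     # ChatGPT invented a value where none exists
    if pred != 'nan' and pred != true:
        return 'false_extraction'  # a value was extracted, but not the right one
    return 'correct'               # matching value, or correctly returned NULL

outcomes = [bucket_outcome(p, t) for p, t in zip(predictions, ground_truth)]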
The table below highlights a sample of metadata fields and their evaluations. Samples, in this context, refer to the subset of data fed into ChatGPT. For example, the text snippet sample parsed for “carbon neutral announced” (i.e., the year in which the commitment was made) values only refers to text snippets where ‘mitigation’ and ‘contains year’ Booleans are equal to one.
Metadata Field | Total Sample Observations | Correct Extractions | Sample Accuracy Rate | Sample False Negatives | Sample False Extraction | Sample Hallucinations | Out Of Sample False Negatives |
---|---|---|---|---|---|---|---|
Carbon Neutral Announced Date | 549 | 481 | 87.61% | 0 | 0 | 68 | 0 |
Carbon Neutral Target Date | 549 | 484 | 88.16% | 23 | 1 | 41 | 2 |
Net Zero Announced Date | 549 | 469 | 85.43% | 0 | 0 | 80 | 0 |
Net Zero Target Date | 549 | 479 | 87.25% | 36 | 5 | 29 | 4 |
Carbon Tonnage Reduction | 779 | 710 | 91.14% | 7 | 5 | 57 | 0 |
Carbon Percent Reduction | 222 | 196 | 88.29% | 14 | 2 | 10 | 1 |
GHG Tonnage Reduction | 549 | 498 | 90.71% | 3 | 3 | 45 | 2 |
GHG Percent Reduction | 235 | 175 | 74.47% | 48 | 3 | 9 | 2 |
Overall, these evaluation results suggest that AI/ML tools developed for this project perform with relatively high application accuracy (i.e., when applied to data outside of the test/train set) on the labeling and extraction tasks they were used for. However, these results also reveal a few key areas in which the tools can be refined for future iterations of the PDB climate ambition tracking project or other use cases.
Footnotes
[1] API costs can change over time and models can become deprecated, so we recommend reviewing available models for the scale of your tasks: https://openai.com/api/pricing/
[2] CPI used Python to interact with the API, but practitioners can refer to the official documentation for their preferred programming language: https://platform.openai.com/docs/api-reference/authentication
[3] CPI suggests following API safety protocols to prevent unauthorized use of API keys: https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
[4] Once a connection is made to ChatGPT, practitioners can refer to the official OpenAI streaming documentation on how to prompt for information using their preferred programming language.
[5] Practitioners can utilize various strategies to create reproducible prompts by following OpenAI’s best practices guide https://cookbook.openai.com/examples/structured_outputs_intro
Acknowledgements
This methodology blog has been reviewed by CPI colleagues Eddie Dilworth, Jake Connolly, and Chris Grant.
This project is supported by the Sequoia Climate Foundation.