1. Introduction
Throughout the information age, scientists have embraced computers as allies—from automating simple tasks to postprocessing big datasets. Human weather forecasters provide added value over raw numerical weather prediction (NWP) guidance, both quantitatively (Novak et al. 2011) and in public communication (Stuart et al. 2006). Today, generative AI is rapidly entering the mainstream in the wake of intuitive web-based “chatbot” interfaces, such as OpenAI’s ChatGPT large-language model (https://chatgpt.com, accessed on 1 July 2024). Within this flurry of AI-model development, meteorologists now hold unprecedented but little-explored tools for improving the weather-forecasting enterprise. The addition of “sight” to language models—so-called multimodal models that can ingest more than text—enables weather images and charts to be interpreted not just by the human eye but by machine intelligence. Given this combination of natural-language processing, intelligence, and ingestion of imagery, we ask the following question: Can this rapid technological advance serve meteorologists in forecasting, research, and public communication? We do not seek to replace the human forecaster; rather, we evaluate the competence of the first public release of GPT-4V to augment the meteorologist’s toolbox.
In the era of big data, scientists are increasingly turning to multimodal AI models like GPT-4V to increase efficiency and achieve more in less time. Recent AI models can perform specific tasks as well as humans (Bubeck et al. 2023; Yang et al. 2023) and may disrupt society more than the last generation of large language models (LLMs) such as GPT-3 (Floridi and Chiriatti 2020; Tamkin et al. 2021; Bender et al. 2021). Yet, despite typically coherent and confident responses to prompts, there is the risk of hallucinations: instances where output confidently and convincingly contains erroneous information. Frankfurt (2005) and Hicks et al. (2024) eccentrically detail the distinction between hallucinations and the indifference of some LLMs to the truthfulness of their output. The language model generates plausible output based on patterns in the training corpus but does not fact-check its inferences out of the box. For the human developer or user, it is difficult to diagnose sources of error in proprietary LLMs when little information about training data is given. Even with the corpus dataset in hand, we estimate from Brown et al. (2020) and Kaplan et al. (2020) that GPT-4V learned from 1 to 2 TB of filtered text and an order of magnitude more of image data. Severe-weather forecasting is a risk-averse endeavor; small errors may have disproportionately negative consequences. Given this risk sensitivity and the importance of specific wording when communicating weather hazards to the public (Rothfusz et al. 2018; Trujillo-Falcón et al. 2021), AI language models present the potential to improve public communication of hazard risks (Olteanu et al. 2014) tailored to the audience.
These LLMs are trained on massive datasets of text and code, enabling them to perform a variety of tasks, including translation, summarization of large bodies of text, and tailored answering of questions (prompts) across a wide range of topics. During the writing of the present manuscript, OpenAI has demonstrated further abilities, such as more natural interaction via speech (GPT-4o), deeper reasoning and chain-of-thought (GPT-o1), and text-to-video (Sora). The GPT-4V system card (OpenAI et al. 2023) details the testing and evaluation performed by OpenAI, but the lack of transparency makes it difficult to assess performance through a report card. The reader is encouraged to consult https://openai.com/index/gpt-4v-system-card/ (accessed on 1 August 2024) for more thorough testing. Our goal is to discern whether this somewhat nebulous skill set within GPT-4V can constructively contribute to meteorological applications; we probe its capabilities and limits through three questions:
- Can GPT-4V correctly interpret weather charts and imagery?
- Can GPT-4V effectively communicate weather hazards in language tailored for the audience?
- What constitutes a useful answer—what are our expectations and are they reasonable?
2. Method
We choose GPT-4 as our large language model due to OpenAI’s claimed GPT-4V performance metrics and the release of the “vision” ability enabling image inputs. Further, at the time of project conception, keyword searches on Google for ChatGPT and OpenAI were an order of magnitude more frequent (trends.google.com, accessed on 15 March 2024; not shown) than those for competitors and their AI models, such as Anthropic (Claude) and Google (Gemini or Bard). Herein, we used the ChatGPT web portal: the online graphical front end to OpenAI GPT models.
During initial exploration, we posed many types of challenges to GPT-4V; here, we focus on the two tasks most representative of its range of abilities. First, we ask GPT-4V to interpret multiple weather charts and deduce the risk of meteorological hazards. Second, we give GPT-4V a synoptic-scale forecast chart marked with regions of general weather type, such as thunderstorms, and request plain-language summaries in both Spanish and English. The conversations were processed by the lead author within ChatGPT, performed within 2 weeks of 1 October 2023 with the same version released in stages to the public starting on 25 September (https://help.openai.com/en/articles/6825453-chatgpt-release-notes#h_e3a2ee7903, accessed on 1 November 2023). We used no custom instructions, which can be provided to personalize responses within ChatGPT. We include sections of our GPT-4V conversations in the manuscript, but full conversations (unabridged other than trimming of unimportant procedural text) are found in the online supplemental material. Text quoted verbatim from GPT-4V conversation is italicized. Our experiments were conducted within a closed environment (i.e., ChatGPT did not have access to the internet); the training corpus did not include information beyond September 2021.
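Although we interacted with GPT-4V solely through the web portal, readers wishing to script comparable experiments could do so through the OpenAI API. The sketch below is illustrative only: the model identifier, file name, and prompt are our assumptions, not the exact configuration used in this study.

```python
# Minimal sketch (not our exact setup): send one weather chart and a text
# prompt to a GPT-4V-class endpoint via the OpenAI Python SDK (v1.x).
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Return the base64 string the API expects for a local image."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


chart_b64 = encode_image("gfs_500hpa_height.png")  # hypothetical file name
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # the GPT-4V API model name at time of writing
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Summarize the synoptic-scale flow in this chart and "
                     "note any features relevant to severe convection."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{chart_b64}"}},
        ],
    }],
    max_tokens=500,
)
print(response.choices[0].message.content)
```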
3. GPT-4V and meteorological prediction
Yang et al. (2023) and Bubeck et al. (2023) showed that GPT-4 is able to interpret and conceptualize data spatially, such as text-based navigation after a description of a location. Indeed, Xu and Tao (2024) found LLMs displayed an ability to hold large batches of maps in memory to form basic spatial awareness, but this was tempered by low reliability of image identification and black-box behavior that precluded reproducible evaluation analysis. GPT-4V also displays a nascent ability to anticipate, a key human characteristic (Dennett 2015). Combining both factors, can GPT-4V conceptualize the atmospheric state from a sequence of charts?
a. Initial set of weather charts
We want to determine whether GPT-4V can grasp the three-dimensional atmospheric flow, and hence we show GPT-4V multiple pairs of charts depicting North American Model (NAM) and Global Forecast System (GFS) guidance data, visualized by Pivotal Weather (https://home.pivotalweather.com/, accessed on 15 October 2023). We provide two models to help GPT-4V capture basic characteristics of forecast uncertainty. While this is a smaller corpus of guidance than would be available to human forecasters, we give GPT-4V sufficient information that a human forecaster could capture the general flow pattern from the same resources:
- Geopotential height at 300 and 500 hPa
- Dry-bulb temperature at 500 hPa, 850 hPa, and 2 m
- Wind at 300 hPa, 500 hPa, 850 hPa, and 10 m
- Mean sea level pressure
- Equivalent potential temperature at 2 m
- Simulated composite reflectivity
Figure 1 shows a subset of these charts given to GPT-4V, including geopotential height and wind vectors on multiple levels critical for understanding the flow pattern. The forecast synoptic-scale flow pattern displays a vertically stacked extratropical cyclone over the U.S. Great Plains on the poleward side of a jet stream (Figs. 1a,b), also evident in simulated composite reflectivity (Figs. 1c,d). The warm and cold sectors of the cyclone are seen in 850-hPa temperature (Figs. 1e,f). Our first instructional prompt after providing the charts (Fig. 2) asks GPT-4V to summarize the uncertainty conveyed by this first batch of maps and to state how further charts might improve its conceptualization of the atmospheric state. Much of the response is not specific enough to provide utility to end users (Fig. 2), referring to some differences, slight variations, and differences in forecasted weather that might affect predictions.
Fig. 1. Three pairs of weather charts taken from the full set given to GPT-4V. Images reproduced with kind permission of Pivotal Weather, LLC. (a),(b) Geopotential height and wind speed at 300 hPa; (c),(d) simulated composite reflectivity; (e),(f) dry-bulb temperature at 850 hPa. (left) GFS; (right) NAM.
Fig. 2. GPT-4V response to a collection of weather charts. We subjectively highlight vague/incorrect responses in red and useful/correct sections in blue.
Responses display hallucinations: features falsely detected in the provided charts. For example, Fig. 2 refers to precipitation in the eastern United States not shown in the charts (Figs. 1c,d). The choice of terminology is often nonstandard: GPT-4V identifies a more pronounced jet streak over the northeastern United States. If we interpret “pronounced” as meaning stronger in magnitude, this is evidently a mix-up between the two models, despite GPT-4V’s ability to read small text in images (not shown). It is unlikely a forecaster would find utility in GPT-4V’s responses in Fig. 2: vague replies require the human to continue their task by fetching further guidance on top of fact-checking the responses. However, a human forecaster would also have access to many more meteorological charts; accordingly, we now heed GPT-4V’s request to provide additional charts.
b. Second set of weather charts
At the end of the response in Fig. 2, GPT-4V requests five further pairs of charts to better grasp atmospheric uncertainty and structure:
- Upper-air soundings, to assist understanding of stability and convective potential
- Surface dewpoint temperature (implying GPT-4V may not have processed the 2-m equivalent-potential-temperature charts)
- Sea level pressure and analyzed surface frontal boundaries (something GPT-4V can attempt to infer from the provided charts)
- Vorticity at 500 hPa (again, GPT-4V might infer this from existing upper-level charts)
- Humidity charts (at unspecified pressure levels)
The request for humidity data is sensible given GPT-4V’s lack of access to upper-level humidity information; we provide GFS and NAM charts of 700-hPa relative humidity and wind vectors and then request that GPT-4V infer frontal positions from these and previous charts. When asked about the equivalent-potential-temperature charts, given initially to improve understanding of near-surface cyclone structure (not shown), GPT-4V is not able to recall them. It is unclear whether this is because those charts reside so deep in the conversation memory that they are deprioritized: GPT-4V will progressively deprioritize information if the model architecture deems it less relevant to imminent or recent prompts (Vaswani et al. 2017; OpenAI et al. 2023). Continuing its response, GPT-4V makes numerous mistakes, such as another conflation of the NAM and GFS 300-hPa wind maxima and identification of discrepancies indiscernible to human eyes over the Great Lakes region of the 700-hPa humidity charts (not shown; supplemental material S1). GPT-4V’s response also suggests a warm front north of the high-humidity region in the Great Lakes but mistakenly locates a cold front in the northeastern United States rather than one stretching from Ohio to the Carolinas.
The errors and lack of useful information in these responses early in the conversation may cause a chain reaction of useless or harmful responses if not caught quickly enough by a human. This failure to check self-consistency stems from the lack of a hybrid fast-and-slow thought process (Kahneman 2011) to oversee fidelity: a known limitation of AI in emulating human thought and decision-making (Booch et al. 2021; Weston and Sukhbaatar 2023). Bubeck et al. (2023) found that, when deriving or processing mathematical logic, where a statement is often either correct or incorrect, GPT-4V can reach a correct answer despite generating contradictory rationale. Similarly, as we progress through the conversation, we receive a useful final answer despite previous logical and factual errors—self-inconsistency does not preclude a course correction. As such, we turn to GPT-4V’s ability to recognize meteorological hazards in the charts provided and find GPT-4V indeed has a reasonable grasp of the atmospheric state despite the prior errors. This further accentuates the need for humans to persist with their GPT-4V conversation rather than halting after one-shot tasks, partly due to low interpretability of response variation from a black-box model (e.g., Flora et al. 2024).
c. Issuing a mock forecast
The Storm Prediction Center (SPC) issues convective outlooks for mesoscale hazards such as tornadoes, damaging hail, and strong winds (spc.noaa.gov), providing invaluable information to the public about potential threats to life and property (Herman et al. 2018). Emulating the quality of SPC human forecasters is a highly complex task, and thus we do not expect GPT-4V to perform at a near-human level. In pivoting to convective severe weather, we provide a final chart of most unstable convective available potential energy (MUCAPE) and ask for a reevaluation of uncertainty (Fig. 3). This request to reevaluate follows OpenAI-recommended practice of splitting a long, complex task into smaller, manageable subtasks whose quality can be assessed by the supervising human.
Fig. 3. Conversation snippet with MUCAPE maps, request, and generated response. We have removed discussion of maps not shown in Fig. 1.
The GPT-4V response synthesizes its knowledge base to correctly identify a swath of instability from Missouri to Illinois. The response names states and geographical regions, highlights regions of high uncertainty, and locates a jet streak, noting the relevance of the wind maximum to a convective forecast. The improvement in response quality lends support to our course corrections and provision of further maps; concerningly, however, some generated responses appear to equate uncertainty with elevated severe potential.
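The decomposition strategy above can be made concrete as follows. This sketch is a schematic of the workflow, not our verbatim prompts: it reuses the hypothetical client from the sketch in section 2, and the subtask wording is paraphrased.

```python
# Schematic of task decomposition within one conversation: each reply is
# appended to the running message history so later subtasks build on earlier ones.
subtasks = [
    "Summarize the forecast uncertainty implied by the GFS/NAM differences.",
    "Reevaluate that uncertainty in light of the MUCAPE charts.",
    "Issue an SPC-style convective outlook valid 0000 UTC 14 October 2023.",
]
messages = []  # chart-bearing messages (as in the earlier sketch) would be added first
for task in subtasks:
    messages.append({"role": "user", "content": task})
    reply = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=messages,
        max_tokens=600,
    ).choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    # The supervising human reviews each reply before issuing the next subtask.
```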
Having primed GPT-4V with charts and subtasks, we request an ambitious SPC-style convective outlook (Fig. 4), valid 0000 UTC 14 October 2023 (consistent with the valid time of the previous charts). The response contains some vague and generic rationale (significant differences, discrepancies), but GPT-4V nonetheless provides five specific regions with “elevated risk” that ultimately resemble the SPC forecast (Fig. 4b).
- GPT-4V identifies Texas as a region of elevated risk from its proximity to instability and moisture; however, variations in MUCAPE represent uncertainty, not magnitude.
- Missouri and Arkansas are noted for moisture convergence, a diagnostic found suboptimal for predicting convective initiation (Banacos and Schultz 2005); again, discrepancies between the GFS and NAM charts are given as evidence of a risk of hazards.
- The rationale for choosing Illinois and Indiana is faulty, as uncertainty does not indicate the potential for severe weather, though notably GPT-4V appears to consider multiple vertical levels with specific reference to 700-hPa moisture.
- GPT-4V finally identifies Ohio as an area with a lower risk of severe weather than states farther west; there is little rationale given, but the description tallies with the broad SPC thunderstorm risk (Fig. 4b).
Fig. 4. Conversation snippet showing (a) the GPT-4V outlook and (b) the corresponding SPC-issued outlook for the same period.
The main difference between the SPC forecast chart (Fig. 4b) and the GPT-4V response is that GPT-4V does not identify the U.S. Southeast as another area of thunderstorm risk.
We want GPT-4V to evaluate its own forecast through comparison with the actual human-issued SPC charts (Fig. 5), gauging whether GPT-4V has a sensible grasp of the atmospheric state. We ask GPT-4V to create “emulations” to evaluate itself, where each emulation represents a human evaluator with a different bias. This represents a wisdom-of-crowds approach—the rationale also behind ensemble weather-prediction systems. We ask GPT-4V to summarize each emulation’s response. Some responses would not be uttered by human SPC evaluators (“narrowing down the focus areas based on more real-time data” is not pertinent for a 48-h forecast); other responses correspond to the SPC outlook (“refining the exact boundaries would have been the key to align better”). When evaluating itself, GPT-4V is too generous at times and even critiques the SPC outlook based on previous guidance. GPT-4V’s final evaluation appears to treat the SPC outlook as truth (“align better”), despite our stating in the prompt that there is no correct answer. In sum, GPT-4V continues to construct arguments containing poor logic to support its responses but ultimately yields a respectable outlook when compared to the analogous chart produced by humans at the SPC.
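A prompt in the spirit of this emulation step can be sketched as below; the personas and wording are illustrative assumptions on our part, and the actual prompt appears in the supplemental material.

```python
# Sketch of a "wisdom of crowds" self-evaluation prompt: GPT-4V role-plays
# several evaluators with distinct biases, then summarizes each critique.
personas = [
    "a cautious SPC forecaster who weights climatology heavily",
    "an aggressive forecaster who favors the latest guidance",
    "a verification specialist focused on spatial placement errors",
]
evaluation_prompt = (
    "Emulate the following three independent human evaluators:\n"
    + "\n".join(f"{i + 1}. {p}" for i, p in enumerate(personas))
    + "\nHave each emulation critique the outlook you issued earlier, then "
      "summarize each emulation's response in two sentences. "
      "There is no single correct answer."
)
# Continues the conversation built in the decomposition sketch above.
messages.append({"role": "user", "content": evaluation_prompt})
```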
Fig. 5. “Wisdom of crowds” method of self-evaluation of the outlook in Fig. 4a, having been provided the human equivalent in Fig. 4b.
4. GPT-4V and bilingual weather communication
GPT-4V has shown the potential to grasp the general atmospheric state from a sequence of maps. Until this point, however, the generated text we have requested and received has not been tailored to the layperson. Different communication styles and languages are required for different communities: for instance, surveys (https://nces.ed.gov/programs/digest/d21/tables/dt21_225.70.asp, accessed on 1 November 2023) show that over 20% of the U.S. population speaks a language other than English at home (Dietrich and Hernandez 2022). Given the importance of risk communication appropriate for each community’s culture, we now test GPT-4V’s ability not only to interpret meteorological charts but also to communicate an accessible summary to two audiences: one in plain-language Spanish appropriate for Spanish speakers in the United States and one in plain-language American English. Recent models approach machine–human parity in natural-language translation (Läubli et al. 2020), but this is sensitive to the model architecture and the following of best practices (Hassan et al. 2018), and performance varies between source and target languages. Indeed, certain idiomatic expressions may be untranslatable between natural languages, requiring awareness of idiomatic translation in which the true meaning is preserved. Translation is inherently creative, and different cultures describe geographical and weather phenomena uniquely.
We analyze GPT-4V’s generated 200-word plain-language summary of a synoptic analysis issued by the U.S. Weather Prediction Center, valid 2 October 2023 over the contiguous United States. We request the generated text first in a Spanish localization appropriate for the U.S. population (Fig. 6). Subsequently, we request the same response but in American English. Responses from GPT-4V are not identical even if prompts are, due to the GPT temperature parameter, which controls creativity or randomness in the generated response: manually setting temperature to zero yields near-deterministic responses for a given prompt, whereas larger values increase the variability of responses. To better account for this variation in GPT-4V’s responses—where a single test may give an unfair, unrepresentative assessment—we ask identical questions in three independent conversations (shown across Figs. 6 and 7).
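Sampling this response variability programmatically might look like the sketch below. The temperature argument is exposed by the API rather than by the ChatGPT portal we used, so its use here is an assumption, and the prompt is abbreviated.

```python
# Sketch: issue an identical prompt in several independent conversations
# (fresh message lists) to sample response variability.
# Reuses the hypothetical client from the sketch in section 2.
def one_shot(prompt: str, temperature: float = 1.0) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": prompt}],  # fresh conversation
        temperature=temperature,  # 0 -> near-deterministic; larger -> more varied
        max_tokens=400,
    )
    return resp.choices[0].message.content

summary_prompt = "Write a 200-word plain-language summary of the attached analysis ..."
samples = [one_shot(summary_prompt) for _ in range(3)]  # three independent responses
```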
Fig. 6. Map, prompt, and response for bilingual hazard communication.
Fig. 7. The second and third responses to prompting identical to that in Fig. 6, performed in distinct conversation threads, regarding bilingual communication of hazards.
We combine the two actions of interpretation and translation in a single prompt, naïvely assuming that GPT-4V will split these into independent tasks. Despite prompting for Spanish first, we find the Spanish version is a direct translation of the English rather than an idiomatic one, with both containing hallucinations. This presents as GPT-4V “reasoning” in English first, perhaps due to the large percentage of English-language text in the training corpora. A more-studied predecessor of GPT-4V (GPT-3) was so limited in linguistic diversity that evaluation considered English-language results alone (Brown et al. 2020). There is a complex causal relationship between language and logic (e.g., Gleick 2011, 40–44), underscoring the need to continue work on a known “area of further improvement” (Brown et al. 2020, p. 14). Direct translations can cause confusion, as context and key messages are more likely to become lost in translation (Trujillo-Falcón et al. 2022). This yields some linguistic ambiguity:
- GPT-4V translated some terms inconsistently with the field’s standards, such as clima for weather, which most Spanish speakers can also read as climate. Such inconsistency causes confusion among bilingual groups (Trujillo-Falcón et al. 2021).
- Geographical terminology is used vaguely (noroeste for northeast; la región central del país for “center of the country”). Especially for multilingual groups that were not born in the United States, the lack of suitable geographical context can present challenges in comprehension and eventual decision-making (Trujillo-Falcón et al. 2024).
- Acronyms did not translate completely into Spanish. For example, the eastern daylight time (EDT) acronym remained the same in English and Spanish; to be understandable for bilingual speakers, the acronym should be spelled out in Spanish.
- As seen in the English responses, there are vague calls for action (tome precauciones, “take precautions”) and descriptions of hazards (mezcla de precipitaciones, “mixed precipitation”), rather than the pointed advice required from an expert guidance system.
Given established machine–human near parity in natural-language translation (Läubli et al. 2020), this poor translation technique is disappointing and surprising. Preliminary work with text-only English–Spanish translation was similarly disappointing (not shown), supporting the idea that translation itself is inherently poor in this version. Indeed, a more natural, conversational version of GPT-4o was presented during the writing of this manuscript (“Advanced Voice Mode,” https://help.openai.com/en/articles/8400625-voice-mode-faq, accessed on 1 September 2024), and its translation performance awaits future evaluation. Due to the black-box nature of GPT-4V, it is difficult to determine reasons behind poor performance. Although the English response does not suffer from inappropriately direct translation, it shares similar shortcomings in vagueness and nonstandard terminology:
- There are hallucinations of scattered rain in the northeast and thunderstorms in the northwest.
- The spelling of traveling in the response is not American English, highlighting errors in localization for both languages.
- The response should address hazards specifically rather than discussing adverse weather in general—each sentence should carry weight given the need for a concise summary.
To its credit, GPT-4V’s responses in both languages identify severe weather potential in a broad swath from Texas to Illinois; further, state names appear correctly in all three responses. This tentatively increases our optimism that further tuning, prompt optimization, and a method of course correction during a conversation can all contribute to substantially more useful responses.
In summary, issuing a bilingual hazard outlook is a complex task that requires competence in image recognition, understanding in time and space, communication to a lay audience, and translation. Further investigation (not shown) revealed that many errors were shared with a similar Spanish–English translation task restricted to textual input. GPT-4V produces responses that appear as if an English response were directly translated into Spanish, perhaps stemming from the disproportionately high share (93% in GPT-3) of English in the training corpora (Bender et al. 2021; Byrd 2023), in contrast to the open-access BLOOM model, where English comprises about 30% (BigScience Workshop 2022). For GPT-4V, English may act as a bridge language, especially if the corpus does not contain culturally nuanced weather terminology. However, previous investigations found performance improved when employing English as the bridge language between less common translation pairs (Bakhshaei et al. 2010; Kunchukuttan 2021). In pursuit of improved meteorological public communication, systems constrained to domain-specific terminology tailored for the community in question (Trujillo-Falcón et al. 2021; Bitterman et al. 2023; Trujillo-Falcón et al. 2024) are likely to remedy some lexical issues shown above; one way to impose such constraints is sketched below.
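The sketch illustrates how a system message might constrain terminology along the lines recommended above. The glossary entry is grounded in the clima/tiempo example; the remaining instructions paraphrase the issues identified in our responses, and none of this reflects a tested or validated configuration.

```python
# Sketch: constrain localization via a system message with a small glossary.
# The entry and instructions are illustrative, not a validated lexicon.
glossary = {
    "weather": "tiempo",  # "clima" also reads as "climate" and causes confusion
}
system_msg = (
    "You write plain-language weather summaries in U.S. Spanish. Translate "
    "idiomatically rather than word for word, spell out acronyms in Spanish, "
    "name specific states and regions, and use this glossary: "
    + "; ".join(f"{en} -> {es}" for en, es in glossary.items())
)
messages = [
    {"role": "system", "content": system_msg},
    {"role": "user", "content": "Summarize the attached synoptic analysis ..."},
]
```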
5. Synthesis and recommendations
a. Seeking fidelity in coherence
GPT-4V frequently gives coherent answers, but in the manner of a student attempting to veil their lack of knowledge with a wealth of regurgitation. The answer may be misleading or incorrect (hallucinations), but the delivery is convincing. The variety of responses for a given prompt is too wide, and the language too vague, for applications such as scientific communication, where there is finite correct, useful information but many incorrect answers (a low signal-to-noise ratio). This variety is likely a result of an excessively large default temperature value that yields inappropriate creativity for scientific tasks. Answers also contain useless filler text, awkward direct translation of natural language, and vague geography. Useful information may well be obtained, but it is identical to that found on good-quality internet sites. For balance, GPT-4V has the advantages of speed, handling of large datasets, and customization of responses. Alas, our experiments show mixed results in meteorological applications, similar to Kadiyala et al. (2024), with fidelity too poor for real-world deployment.
With this said, it is remarkable that we have technology that recognizes so much content in images. There are glimpses of real utility, such as issuing and self-evaluating a mock SPC-style outlook and interpreting and explaining weather charts. This promise must be balanced with the adage of the blind squirrel: it will find an acorn eventually. Birhane and McGann (2024) argue that language serves LLMs as a mechanism to communicate rather than to conceptualize; responses suggesting that GPT-4V can think spatially (Bubeck et al. 2023) may be an illusion of “seeing the map not the territory” (Birhane and Prabhu 2021).
Specific instructions (or resources such as uploaded PDF guidelines) may assist with improving elements such as preferred terminology. Indeed, testbeds for National Weather Service (NWS) forecasters found AI products required adaptation to the individual themselves (Cains et al. 2024): something achievable through custom instructions given to GPT-4V before each user prompt (OpenAI et al. 2023). Further, to better discriminate between meaningless coherence and useful truth, the supervising human is able to—and should—cross-reference statements with established meteorological fact and expert input. For complex tasks, a continuous conversation between human and AI allows course corrections and is encouraged over one-shot attempts at eliciting useful responses.
There is a trade-off between confidence/determinism and creativity/uncertainty, and asking specifically for uncertainty and honesty (to avoid hallucinations) was not consistently effective during testing. This parallels over/underconfidence in a probabilistic weather forecast or over/underfitting an AI model. We show that self-evaluation is possible with conceptual copies of GPT-4V’s own output, but whether GPT-4V actually does more than emulate an emulation is unclear (Schaeffer et al. 2023). Further, GPT-4V gives full confidence to responses that could be misinterpreted. Such a misunderstanding of a prognosis fully accepted as true is an example of “catastrophic error” in information theory (Pierce 1980). Indeed, when a system is perceived as infallible, an incorrect prediction becomes exponentially more damaging as the probabilities linearly approach the limits of zero or unity (Cover and Thomas 2012). This can be remedied in GPT-4V communication, as with humans, by instructions never to issue binary forecasts and by use of appropriate error estimates for the time and spatial scale.
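One way to formalize this divergence, as our own illustration using the logarithmic (ignorance) score rather than a result from the cited texts: the penalty for assigning probability p to an event that subsequently occurs is

```latex
\mathrm{IGN}(p) = -\log_2 p , \qquad \lim_{p \to 0^{+}} \mathrm{IGN}(p) = \infty ,
```

so a categorical forecast of p = 0 that verifies incorrectly incurs an unbounded penalty, whereas hedged probabilities keep the cost of any single error finite.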
b. Recommendations
ChatGPT and its products, as with all AI assistants, are best considered a copilot in academic realms—especially so during idea generation, simplifying complex ideas, and narrowing the scope of large paragraphs. However, in operations tasked with protecting lives and property, there is little room for error in issuing timely, correct warnings for hazards and little time for thorough vetting of language-model output. Rigorous testing must be completed before humans can be removed further from the operational loop. We do not have detailed knowledge of how the text is generated by a system that is neither transparent nor explainable (Flora et al. 2024), a continuing concern with AI products that may assist NWS forecasters in anticipating high-impact hazards (McGovern et al. 2017, 2023). Thus, it can be difficult to anticipate and identify errors or to refine model inputs (text or image prompts in the case of GPT-4V) to improve model accuracy.
An effective session should resemble a conversation: be prepared to correct and nudge the discussion to refine an answer and even to reprimand GPT-4V for laziness (Zhao et al. 2024), whose cause is, remarkably, still unknown at the time of writing. When GPT-4V and the user disagree, course correction can be difficult: developers place guardrails to limit generation of incorrect, unethical, or illegal responses (Byrd 2023; Zhang et al. 2024; Geiping et al. 2024), and after correction, the AI model may interpret (justified) user obstinacy as an attempt to redefine truthfulness or reality during the conversation as a way to circumvent these guardrails. The drawback of a longer discussion is running out of “token memory” (i.e., how much of the conversation GPT-4V remembers), demanding user recapitulation. A dialog framework designed with “scratchpads” and examples (Liu et al. 2024) may overcome forgetfulness and improve performance in tandem with better and larger memory management (Kwon et al. 2023). Some answers show a lack of logical consistency; the user must play the role of an overarching “monitor” able to tell when the model is wrong, why it is wrong, and how to correct the missing knowledge (Booch et al. 2021). This highlights the importance of trustworthy, interpretable output from AI systems. Without human trust in AI output, there is little reason to use the generated text over, say, peer-reviewed or human-expert guidance.
c. Future work
Ultimately, a tornado was observed in western Illinois during the time period covered in Figs. 4 and 5. Further work could test whether AI can generate an SPC forecast comparable in skill to a human’s. Future testing might ask for more detailed responses from each emulation, such as asking GPT-4V to evaluate itself before seeing the SPC convective outlook. Prompts and responses should be concise to limit the risk of losing crucial information from earlier in the conversation.
We suggest potential research avenues that explore the following:
- Ability to request nested emulations to create a synthesis “wisdom of crowds” or ensemble approach to prompting (or by simply using multiple independent chats with identical prompts; see the sketch after this list);
- Uses to improve accessibility for those who are visually impaired;
- Prompt engineering, or finding an optimal question, especially in the presence of high stochasticity (this includes testing the order of translation for bilingual communication);
- Manual control and optimization of the temperature or creativity parameter;
- Open-source large multimodal models and local optimization of the model for meteorological applications, including a lower stochasticity value and prescription of specific terms for use in multilingual warning communications;
- Having GPT-4V iteratively critique and optimize its own prompts and responses.
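The ensemble avenue might build on the earlier variability sketch as follows; the aggregation step is our illustrative assumption, not a tested workflow.

```python
# Sketch: an ensemble of independent chats answers the same prompt, and a
# final synthesis call acts as the "wisdom of crowds" aggregator.
# Reuses the hypothetical one_shot() helper defined in section 4.
outlooks = [one_shot("Issue a convective outlook for ...") for _ in range(5)]
synthesis = one_shot(
    "Here are five independently generated outlooks:\n\n"
    + "\n---\n".join(outlooks)
    + "\n\nSynthesize them into a single consensus outlook, noting where the "
      "members disagree."
)
```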
Despite its failure to give more detail within the expert section, could GPT-4V give discussion tailored for aviation, emergency managers, etc., with appropriate conveyance of uncertainty in both complexity and geography? The structure of such experimentation could follow that of human-subject studies such as Shivers-Williams and Klockow-McClain (2021).
Further work includes NWS-led research into “operational integration of smart translation” (Bozeman et al. 2024) whose success would expand multilingual risk communication to less common languages not within the language expertise of the service. Moreover, modification recommendations regarding Wireless Emergency Alerts in Spanish (Trujillo-Falcón 2024) could be deployed similarly in language models to constrain (fine-tune) responses.
Acknowledgments.
The authors thank two anonymous reviewers and the editor for thoughtful critique during the review process. J. R. L. thanks faculty and students at the Department of Geography and Meteorology at Valparaiso University and family and colleagues who gave frequent feedback on real-life generative AI output. The authors thank Pamela Gardner, Kimberly Hoogewind, and Sean Ernst for useful input in the review stage of this paper. J. R. L. and S. N. L. are funded by Uintah County Special Service District 1 and the Utah Legislature. C. K. P.’s contribution to this work comprised regular duties at federally funded NOAA/NSSL. Funding for M. L. F. and J. E. T.-F. was provided by NOAA/Office of Oceanic and Atmospheric Research under NOAA–University of Oklahoma Cooperative Agreement NA21OAR4320204, U.S. Department of Commerce. Partial funding for D. M. S. was provided to the University of Manchester by the Natural Environment Research Council through Grants NE/V012681/1, NE/W000997/1, and NE/X018539/1. Coauthors following D. M. S. in the author list are in alphabetical order and contributed equally to this manuscript. Weather charts herein and in supplemental material are reproduced with kind permission of Pivotal Weather, LLC, where labeled with a watermark. Outside of experiments, GPT-4 was used to generate preliminary ideas for project development. No AI-generated text was used verbatim herein.
Data availability statement.
Solely textual data were generated for the present study and are contained entirely within the supplemental material.
REFERENCES
Bakhshaei, S., S. Khadivi, and N. Riahi, 2010: Farsi—German statistical machine translation through bridge language. 2010 Fifth Int. Symp. on Telecommunications, Tehran, Iran, Institute of Electrical and Electronics Engineers, 557–561, https://doi.org/10.1109/ISTEL.2010.5734087.
Banacos, P. C., and D. M. Schultz, 2005: The use of moisture flux convergence in forecasting convective initiation: Historical and operational perspectives. Wea. Forecasting, 20, 351–366, https://doi.org/10.1175/WAF858.1.
Bender, E. M., T. Gebru, A. McMillan-Major, and S. Shmitchell, 2021: On the dangers of stochastic parrots: Can language models be too big? Proc. 2021 ACM Conf. on Fairness, Accountability, and Transparency, Online, Association for Computing Machinery, 610–623, https://doi.org/10.1145/3442188.3445922.
BigScience Workshop, 2022: BLOOM: A 176B-parameter open-access multilingual language model. arXiv, 2211.05100v4, https://doi.org/10.48550/arXiv.2211.05100.
Birhane, A., and V. U. Prabhu, 2021: Large image datasets: A pyrrhic win for computer vision? 2021 IEEE Winter Conf. on Applications of Computer Vision (WACV), Waikoloa, HI, Institute of Electrical and Electronics Engineers, 1536–1546, https://doi.org/10.1109/WACV48630.2021.00158.
Birhane, A., and M. McGann, 2024: Large models of what? Mistaking engineering achievements for human linguistic agency. Lang. Sci., 106, 101672, https://doi.org/10.1016/j.langsci.2024.101672.
Bitterman, A., M. J. Krocak, J. T. Ripberger, S. Ernst, J. E. Trujillo-Falcón, A. G. Pabón, C. Silva, and H. Jenkins-Smith, 2023: Assessing public interpretation of original and linguist-suggested SPC risk categories in Spanish. Wea. Forecasting, 38, 1095–1106, https://doi.org/10.1175/WAF-D-22-0110.1.
Booch, G., and Coauthors, 2021: Thinking fast and slow in AI. Proc. AAAI Conf. Artif. Intell., 35, 15 042–15 046, https://doi.org/10.1609/aaai.v35i17.17765.
Bozeman, M. L., A. Montanez, J. E. Calkins, K. Farina, and R. Henry-Reeves, 2024: AWIPS software integrates AI translation technology to benefit NWS operations. 104th AMS Annual Meeting, Online, Amer. Meteor. Soc., 10B.3, https://ams.confex.com/ams/104ANNUAL/meetingapp.cgi/Paper/430225.
Brown, T. B., and Coauthors, 2020: Language models are few-shot learners. arXiv, 2005.14165v4, https://doi.org/10.48550/arXiv.2005.14165.
Bubeck, S., and Coauthors, 2023: Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv, 2303.12712v5, https://doi.org/10.48550/arXiv.2303.12712.
Byrd, A., 2023: Truth-telling: Critical inquiries on LLMs and the corpus texts that train them. Compos. Stud., 51, 135–142.
Cains, M. G., C. D. Wirz, J. L. Demuth, A. Bostrom, D. J. Gagne II, A. McGovern, R. A. Sobash, and D. Madlambayan, 2024: Exploring NWS forecasters’ assessment of AI guidance trustworthiness. Wea. Forecasting, 39, 1219–1241, https://doi.org/10.1175/WAF-D-23-0180.1.
Cover, T. M., and J. A. Thomas, 2012: Elements of Information Theory. John Wiley and Sons, 784 pp., https://doi.org/10.1002/047174882X.
Dennett, D. C., 2015: Why and how does consciousness seem the way it seems? Open MIND, T. K. Metzinger and J. M. Windt, Eds., MIND Group, 1–11, https://doi.org/10.15502/9783958570245.
Dietrich, S., and E. Hernandez, 2022: Language use in the United States: 2019. American Community Survey Rep. ACS-50, 37 pp., https://www.census.gov/content/dam/Census/library/publications/2022/acs/acs-50.pdf.
Flora, M. L., C. K. Potvin, A. McGovern, and S. Handler, 2024: A machine learning explainability tutorial for atmospheric sciences. Artif. Intell. Earth Syst., 3, e230018, https://doi.org/10.1175/AIES-D-23-0018.1.
Floridi, L., and M. Chiriatti, 2020: GPT-3: Its nature, scope, limits, and consequences. Minds Mach., 30, 681–694, https://doi.org/10.1007/s11023-020-09548-1.
Frankfurt, H. G., 2005: On Bullshit. Princeton University Press, 67 pp., https://doi.org/10.1515/9781400826537.
Geiping, J., A. Stein, M. Shu, K. Saifullah, Y. Wen, and T. Goldstein, 2024: Coercing LLMs to do and reveal (almost) anything. arXiv, 2402.14020v1, https://doi.org/10.48550/arXiv.2402.14020.
Gleick, J., 2011: The Information: A History, a Theory, a Flood. HarperCollins Publishers, 544 pp., https://play.google.com/store/books/details?id=617JSFW0D2kC.
Hassan, H., and Coauthors, 2018: Achieving human parity on automatic Chinese to English news translation. arXiv, 1803.05567v2, https://doi.org/10.48550/arXiv.1803.05567.
Herman, G. R., E. R. Nielsen, and R. S. Schumacher, 2018: Probabilistic verification of storm prediction center convective outlooks. Wea. Forecasting, 33, 161–184, https://doi.org/10.1175/WAF-D-17-0104.1.
Hicks, M. T., J. Humphries, and J. Slater, 2024: ChatGPT is bullshit. Ethics Inf. Technol., 26, 38, https://doi.org/10.1007/s10676-024-09775-5.
Kadiyala, L. A., O. Mermer, D. J. Samuel, Y. Sermet, and I. Demir, 2024: A comprehensive evaluation of multimodal large language models in hydrological applications. EarthArXiv, https://eartharxiv.org/repository/object/7176/download/13733/.
Kahneman, D., 2011: Thinking, Fast and Slow. Farrar, Straus and Giroux, 512 pp.
Kaplan, J., and Coauthors, 2020: Scaling laws for neural language models. arXiv, 2001.08361v1, https://doi.org/10.48550/arXiv.2001.08361.
Kunchukuttan, A., 2021: An empirical investigation of multi-bridge multilingual NMT models. arXiv, 2110.07304v1, https://doi.org/10.48550/arXiv.2110.07304.
Kwon, W., and Coauthors, 2023: Efficient memory management for large language model serving with paged attention. Proc. 29th Symp. on Operating Systems Principles, Koblenz, Germany, Association for Computing Machinery, 611–626, https://doi.org/10.1145/3600006.3613165.
Läubli, S., S. Castilho, G. Neubig, R. Sennrich, Q. Shen, and A. Toral, 2020: A set of recommendations for assessing human–machine parity in language translation. J. Artif. Intell. Res., 67, 653–672, https://doi.org/10.1613/jair.1.11371.
Liu, N., L. Chen, X. Tian, W. Zou, K. Chen, and M. Cui, 2024: From LLM to conversational agent: A memory enhanced architecture with fine-tuning of large language models. arXiv, 2401.02777v2, https://doi.org/10.48550/arXiv.2401.02777.
McGovern, A., K. L. Elmore, D. J. Gagne II, S. E. Haupt, C. D. Karstens, R. Lagerquist, T. Smith, and J. K. Williams, 2017: Using artificial intelligence to improve real-time decision making for high-impact weather. Bull. Amer. Meteor. Soc., 98, 2073–2090, https://doi.org/10.1175/BAMS-D-16-0123.1.
McGovern, A., R. J. Chase, M. Flora, D. J. Gagne II, R. Lagerquist, C. K. Potvin, N. Snook, and E. Loken, 2023: A review of machine learning for convective weather. Artif. Intell. Earth Syst., 2, e220077, https://doi.org/10.1175/AIES-D-22-0077.1.
Novak, D. R., C. Bailey, K. Brill, M. Eckert, D. Petersen, R. Rausch, and M. Schichtel, 2011: Human improvement to numerical weather prediction at the hydrometeorological prediction center. 24th Conf. on Weather and Forecasting and 20th Conf. on Numerical Weather Prediction, Online, Amer. Meteor. Soc., 440, https://ams.confex.com/ams/91Annual/webprogram/Manuscript/Paper181989/HumanNWP_final.pdf.
Olteanu, A., C. Castillo, and F. Diaz, 2014: CrisisLex: A lexicon for collecting and filtering microblogged communications in crises. Proc. 8th Int. AAAI Conf. Weblogs and Social Media, Ann Arbor, MI, University of Michigan, 376–385, https://doi.org/10.1609/icwsm.v8i1.14538.
OpenAI and Coauthors, 2023: GPT-4 technical report. arXiv, 2303.08774v6, https://doi.org/10.48550/arXiv.2303.08774.
Pierce, J. R., 1980: An Introduction to Information Theory: Symbols, Signals and Noise. Dover Publications, 305 pp.
Rothfusz, L. P., R. Schneider, D. Novak, K. Klockow-McClain, A. E. Gerard, C. Karstens, G. J. Stumpf, and T. M. Smith, 2018: FACETs: A proposed next-generation paradigm for high-impact weather forecasting. Bull. Amer. Meteor. Soc., 99, 2025–2043, https://doi.org/10.1175/BAMS-D-16-0100.1.
Schaeffer, R., B. Miranda, and S. Koyejo, 2023: Are emergent abilities of large language models a mirage? arXiv, 2304.15004v2, https://doi.org/10.48550/arXiv.2304.15004.
Shivers-Williams, C. A., and K. E. Klockow-McClain, 2021: Geographic scale and probabilistic forecasts: A trade-off for protective decisions? Nat. Hazards, 105, 2283–2306, https://doi.org/10.1007/s11069-020-04400-2.
Stuart, N. A., and Coauthors, 2006: The future of humans in an increasingly automated forecast process. Bull. Amer. Meteor. Soc., 87, 1497–1502, https://doi.org/10.1175/BAMS-87-11-1497.
Tamkin, A., M. Brundage, J. Clark, and D. Ganguli, 2021: Understanding the capabilities, limitations, and societal impact of large language models. arXiv, 2102.02503v1, https://doi.org/10.48550/arXiv.2102.02503.
Trujillo-Falcón, J. E., 2024: Examining warning response among Spanish speakers in the United States to enhance multilingual wireless emergency alerts. Ph.D. dissertation, University of Oklahoma, 175 pp., https://shareok.org/server/api/core/bitstreams/8192bf3d-b760-436a-a129-c8257c2a4343/content.
Trujillo-Falcón, J. E., O. Bermúdez, K. Negrón-Hernández, J. Lipski, E. Leitman, and K. Berry, 2021: Hazardous weather communication En Español: Challenges, current resources, and future practices. Bull. Amer. Meteor. Soc., 102, E765–E773, https://doi.org/10.1175/BAMS-D-20-0249.1.
Trujillo-Falcón, J. E., and Coauthors, 2022: Aviso o Alerta? Developing effective, inclusive, and consistent watch and warning translations for U.S. Spanish speakers. Bull. Amer. Meteor. Soc., 103, E2791–E2803, https://doi.org/10.1175/BAMS-D-22-0050.1.
Trujillo-Falcón, J. E., A. R. Gaviria Pabón, J. Reedy, and K. E. Klockow-McClain, 2024: Systemic vulnerabilities in Hispanic and Latinx immigrant communities led to the reliance on an informal warning system in the December 10–11, 2021, tornado outbreak. Nat. Hazards Rev., 25, 04023059, https://doi.org/10.1061/NHREFO.NHENG-1755.
Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, 2017: Attention is all you need. arXiv, 1706.03762v7, https://doi.org/10.48550/arXiv.1706.03762.
Weston, J., and S. Sukhbaatar, 2023: System 2 attention (is something you might need too). arXiv, 2311.11829v1, https://doi.org/10.48550/arXiv.2311.11829.
Xu, J., and R. Tao, 2024: Map reading and analysis with GPT-4V(ision). ISPRS Int. J. Geoinf., 13, 127, https://doi.org/10.3390/ijgi13040127.
Yang, Z., L. Li, K. Lin, J. Wang, C.-C. Lin, Z. Liu, and L. Wang, 2023: The dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv, 2309.17421v2, https://doi.org/10.48550/arXiv.2309.17421.
Zhang, Z., G. Shen, G. Tao, S. Cheng, and X. Zhang, 2024: On large language models’ resilience to coercive interrogation. 2024 IEEE Symp. on Security and Privacy (SP), San Francisco, CA, Institute of Electrical and Electronics Engineers, 826–844, https://doi.org/10.1109/SP54263.2024.00208.
Zhao, S., Y. Yuan, X. Tang, and P. He, 2024: Difficult task Yes but simple task No: Unveiling the laziness in multimodal LLMs. arXiv, 2410.11437v1, https://doi.org/10.48550/arXiv.2410.11437.