Global Water Security Center

Providing decision makers with the most reliable, ground-breaking research, applied scientific techniques, and best practices so that the hydrologic cycle and its potential impacts can be put in a context for appropriate action and response by the United States

Innovative tools in development to aid in geospatial data wrangling

This opinion article was written by GWSC Environmental Data Scientist Sambadi Majumder.

As an environmental data scientist, I often encounter complex and interesting problems in processing large spatiotemporal datasets. One such challenge presented itself during the interview process for my current role at GWSC.

I was presented with a series of tasks that required me to efficiently handle large, publicly available meteorological and environmental datasets. My usual tools for such applications are R and Python. Scripting has always been my favorite way to clean and process these datasets because it gives the user an opportunity to create reproducible methodologies. However, the learning curve for using programming languages effectively in elaborate geospatial applications can be daunting for new users.

This motivated me to start developing the R package geoRflow to help users easily manage large multidimensional geospatial datasets. This is achieved by wrapping a series of intricate tasks, performed in order, within versatile R functions. 

The “flow” part of the name comes from an analogy between this process and a pipeline through which information flows. I am using highly regarded and widely used R packages such as terra, stars, and sf to create the backbone of geoRflow. From easily downloading large climate datasets to performing complex data processing, this package simplifies and automates geospatial workflows for environmental research.
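
To make the pipeline idea concrete, here is a minimal Python sketch of wrapping ordered processing steps inside a single function. It is not code from geoRflow or PyEarthly: every name in it (clip_to_bbox, kelvin_to_celsius, run_pipeline, prepare_temperature) and the tiny sample dataset are hypothetical and chosen only for illustration. It uses only the Python standard library rather than a real geospatial library.

```python
from functools import reduce
from typing import Callable

# Hypothetical record type: (latitude, longitude, temperature in kelvin).
Record = tuple[float, float, float]
Step = Callable[[list[Record]], list[Record]]


def clip_to_bbox(min_lat: float, max_lat: float,
                 min_lon: float, max_lon: float) -> Step:
    """Build a step that keeps only records inside the bounding box."""
    def step(records: list[Record]) -> list[Record]:
        return [
            r for r in records
            if min_lat <= r[0] <= max_lat and min_lon <= r[1] <= max_lon
        ]
    return step


def kelvin_to_celsius(records: list[Record]) -> list[Record]:
    """Convert the temperature of each record from kelvin to Celsius."""
    return [(lat, lon, round(t - 273.15, 2)) for lat, lon, t in records]


def run_pipeline(records: list[Record], steps: list[Step]) -> list[Record]:
    """Apply each step in order, feeding every output into the next step."""
    return reduce(lambda data, step: step(data), steps, records)


def prepare_temperature(records: list[Record],
                        bbox: tuple[float, float, float, float]) -> list[Record]:
    """Single wrapper call that hides the ordered steps from the user."""
    return run_pipeline(records, [clip_to_bbox(*bbox), kelvin_to_celsius])


# Made-up sample data: two points inside the bounding box, one outside.
sample = [(30.0, -90.0, 300.15), (45.0, -100.0, 290.15), (60.0, 10.0, 280.15)]
result = prepare_temperature(sample, bbox=(25.0, 50.0, -125.0, -65.0))
print(result)
```

The user calls one function and the data "flows" through the steps in a fixed order. A new step can be added to the list without changing how the wrapper is called.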

As a parallel to this endeavor, I am also developing a Python counterpart of this R package, called PyEarthly, with a few additional functionalities. Updates on the development of these two packages along with their source code can be viewed in their respective GitHub repositories (geoRflow and PyEarthly).

These packages, through their compendium of functions tailored to streamline the process of geospatial data preparation, aim to be a versatile toolkit for environmental researchers and practitioners. The development of these packages is an ongoing journey, and the goal is to make them available to users by early 2025. 

In conclusion, I hope practitioners will find these tools useful in their research. As the packages evolve, I look forward to uncovering new use cases within the realm of accessible environmental data analysis.