An AI pipeline for ingesting data from a borehole

*Corresponding Author: John M. Aiken

Abstract

Researchers analyzing data collected from borehole drilling projects can face dozens of terabytes of seismic, hydrologic, geologic, and rock mechanics data, including complex imagery, physical measurements, and expert-written reports. These diverse data sets play a pivotal role in our understanding of solid earth processes. Ingesting and analyzing such data is a major challenge that typically demands a team of experts. Artificial Intelligence (AI) and machine learning offer a compelling way to help tackle the volume and complexity of drilling data. This paper presents an AI-based pipeline for ingesting data from the Oman Drilling Project's Multi-borehole Observatory. The study focuses on the alteration of peridotite core segments taken from Borehole BA1B, using a CatBoost classification model trained on an integrated data set of machine-learning-segmented core images, physical measurements, geological and lithographic data, and AI-summarized expert texts, combined with feature selection. The paper's central objective is to establish a repeatable, efficient pattern for processing such multifaceted borehole data, while also addressing the critical research question of how non-tectonic fracturing affects peridotite alteration.

Approach

Drilling expeditions generally produce at least three kinds of raw data: images taken from borehole cores; drilling reports written by expert geologists identifying various features of each meter of core that comes from the borehole; and physical, chemical, and biological measurements taken using various kinds of equipment inside the borehole. Fractures in the raw images are segmented using a random forest classifier trained on hand-labeled fracture data (Segment Anything was unable to pick out fractures).

Raw wrap-around core images are first pre-processed using Gaussian, Hessian, Roberts, and Sobel edge-enhancing filters. This flattens differences in the color content of the image and highlights abrupt changes at edges, ultimately making it easier to pick out fracture veins. Twenty images taken from 20 m segments distributed depthwise along the borehole were then labeled using the Ilastik software. We then used the built-in random forest algorithm within Ilastik to label the remaining 485 images. We drop all labeled pixel groupings with fewer than 50 pixels. We then apply a post-processing eccentricity filter to remove small, round, erroneously labeled pixel groupings, as they are not physically representative of a fracture or vein network. The result is considered the final labeled fracture/vein network data set. Using this data set, we calculate the percent fractured per depth and other fracture network connectivity measurements. These are coupled with the other physical, chemical, and biological measurements in the integrated data set.
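The post-processing steps described above (size filter, eccentricity filter, percent fractured) can be illustrated with a minimal, standard-library-only sketch. It is not the pipeline's actual code: the function names, the 8-connectivity choice, and the `min_ecc=0.9` threshold are assumptions made for illustration, and only the 50-pixel minimum comes from the text. The mask is assumed to be a list of rows of 0/1 values such as a thresholded segmentation output.

```python
from collections import deque
import math


def label_components(mask):
    """Group fracture pixels (value 1) into 8-connected components."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def eccentricity(pixels):
    """Eccentricity of the ellipse sharing the component's second moments.

    0 means a round blob, values near 1 mean an elongated, line-like shape.
    """
    n = len(pixels)
    my = sum(y for y, _ in pixels) / n
    mx = sum(x for _, x in pixels) / n
    syy = sum((y - my) ** 2 for y, _ in pixels) / n
    sxx = sum((x - mx) ** 2 for _, x in pixels) / n
    sxy = sum((y - my) * (x - mx) for y, x in pixels) / n
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_max, lam_min = half_trace + root, half_trace - root
    if lam_max <= 0:  # a single pixel has no extent
        return 0.0
    return math.sqrt(max(0.0, 1 - lam_min / lam_max))


def clean_mask(mask, min_size=50, min_ecc=0.9):
    """Keep only components that are large enough and elongated enough.

    min_size=50 follows the text; min_ecc is a hypothetical threshold.
    """
    cleaned = [[0] * len(mask[0]) for _ in mask]
    for pixels in label_components(mask):
        if len(pixels) >= min_size and eccentricity(pixels) >= min_ecc:
            for y, x in pixels:
                cleaned[y][x] = 1
    return cleaned


def percent_fractured(mask):
    """Percentage of pixels labeled as fracture in a mask (or depth slice)."""
    total = sum(len(row) for row in mask)
    return 100.0 * sum(sum(row) for row in mask) / total
```

Calling `percent_fractured` on the rows of the cleaned mask that correspond to a given depth interval would give a per-depth value; how image rows map to depth is left open here because the text does not specify it.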

Drilling reports are summarized into keywords using ChatGPT. The geologists' remarks in the drilling reports typically take the form of written notes, taken approximately once per meter of extracted core. An example:

microcrystalline carbonate visible on vein surfaces. clasts are 90% angular and 10% rounded. This indicates that the thickness of alluvium is less than few 10s cm and the bedrock is surfacing. angular fragments at 0 to 60 cm with mixed lithologies varying from serpentinised harzburgite to dunite. Serpentinization, oxidation, carbonation in veins.

Each set of remarks per depth unit (505 in total) was given to ChatGPT (gpt-3.5-turbo) with the prompt:

Please summarize the following text into ten keywords and explain why you picked each key word. The text to summarize: {text}

where {text} is replaced by the geologist's remarks. This process produced hundreds of different keywords, many of which were close duplicates or similar to one another. After filtering keywords for duplicates and similarity, and removing any keyword that ChatGPT reported fewer than 50 times (representing less than 10% of the total BA1B cored depth), 52 keywords remained. These keywords were integrated into the data set as binary variables. We then asked ChatGPT to group the keywords into topics based on the type of information they convey, and we plotted keyword occurrence against depth to obtain preliminary information about the features of the core.
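The prompt filling, frequency filter, and binary encoding described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the call to ChatGPT itself is omitted, and `normalize` (lowercasing and collapsing whitespace) is a deliberately simple stand-in for the duplicate and similarity filtering, whose exact method the text does not specify.

```python
from collections import Counter

# Prompt template quoted from the text; {text} is the placeholder to fill.
PROMPT = ("Please summarize the following text into ten keywords and explain "
          "why you picked each key word. The text to summarize: {text}")


def build_prompt(remarks):
    """Insert one depth unit's geologist remarks into the prompt template."""
    return PROMPT.format(text=remarks)


def normalize(keyword):
    """Assumed normalization: lowercase and collapse repeated whitespace."""
    return " ".join(keyword.lower().split())


def frequent_keywords(keywords_per_unit, min_count=50):
    """Return keywords reported in at least min_count depth units."""
    counts = Counter()
    for keywords in keywords_per_unit:
        # A set ensures each keyword is counted once per depth unit.
        counts.update({normalize(k) for k in keywords})
    return sorted(k for k, n in counts.items() if n >= min_count)


def binary_features(keywords_per_unit, vocabulary):
    """Encode each depth unit as 0/1 indicators over the kept keywords."""
    rows = []
    for keywords in keywords_per_unit:
        present = {normalize(k) for k in keywords}
        rows.append({k: int(k in present) for k in vocabulary})
    return rows
```

With `min_count=50` this mirrors the threshold stated in the text; each returned row could then be joined to the other measurements for the same depth unit.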

We asked ChatGPT to classify all model features in the BA1B data set into groups that we could use to build separate models for comparison analysis. We excluded the text summarization features, since ChatGPT had already seen and classified these. We gave ChatGPT the prompt:


you are an expert physicist, chemist, biologist, and computational scientist ai helper bot. i will give you a list of columns for a catboost classifier. These columns are the features in teh model. The catboost classifier is designed to determine whether a section of a borehole core has greater than 90% peridotite alteration or not. We are attempting to measure the impact of fractures in the sample against other features that impact the total alteration assuming this is related to reaction driven cracking.
You are to first:

  • define each column
  • provide an overarching category for each column
  • describe why you picked this category for the column

  • Features provided to you should be grouped into categories. Please reply saying you understand the task and then i will give you the column names.

ChatGPT replied by categorizing all of the features into groups. These categorizations are then used for model feature selection. We compare these AI-selected feature groupings to the expert-selected feature groupings.

Results

The pipeline ingests all of the data and then produces two collections of CatBoost models: 1) expert-selected models and 2) GPT-recommended models. Both collections produced groups of ROC curves demonstrating that the fracture features were the least important for predicting where high alteration occurs.
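As a rough illustration of how models built on different feature groups might be compared, the sketch below computes the area under the ROC curve from predicted scores and ranks the groups by it. It covers only the evaluation step, using the standard library: the CatBoost training is omitted, the function names are hypothetical, and the scores passed in would come from whichever per-group models are being compared. The binary label follows the "greater than 90% alteration" target described in the feature-grouping prompt.

```python
def high_alteration(alteration_percent):
    """Binary target: 1 if the core section is more than 90% altered."""
    return int(alteration_percent > 90)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sample receives a
    higher score than a randomly chosen negative one (ties count as 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def rank_feature_groups(labels, scores_by_group):
    """Sort feature groups from highest to lowest AUC.

    scores_by_group maps a group name to the scores predicted by a model
    trained on that group's features only.
    """
    aucs = {group: roc_auc(labels, scores)
            for group, scores in scores_by_group.items()}
    return sorted(aucs.items(), key=lambda item: item[1], reverse=True)
```

The pairwise form is quadratic in the number of samples, which is acceptable for a few hundred depth units; a rank-based formulation would be preferable for much larger data sets.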

The ChatGPT summarization analysis had two steps: first a keyword analysis, then a topic analysis of the selected keywords. Some keywords appear prolifically across the entire depth cross-section (e.g., Serpentine veins, Black Serpentinization, Gabbro). Others have a clear depth dependence, occurring either in the upper Dunite sequence (e.g., Irregular, Linneation, Open cracks, Alteration halo) or in the lower Harzburgite sequence (e.g., Hydrothermal, Shearing, Magmatic veins). These keywords generally appear where we would expect them to when compared against the full text reports. ChatGPT was also able to group keywords into the meaningful topics of "Veins and alteration", "oxidation and alteration", "structural features", "rock type", "mineralogy", and "physical characteristics".

This paper has presented an AI-based pipeline to ingest and analyze data from the Oman Drilling Project's Multi-borehole Observatory. Random forest image segmentation was used to process the core images, and ChatGPT was used to summarize the expert knowledge from the drilling reports; these were coupled with physical, chemical, and biological measurements and used to predict the presence of highly altered peridotites via a CatBoost model. The CatBoost model provided valuable insight into the main factors influencing peridotite alteration. It indicates that textual and physical data such as depth and mineral composition are of primary importance in the classification, but that the network analysis data taken from segmentation represent a suitable alternative and provide acceptable results. Moreover, it shows that an AI-based treatment of geological data can equal a physical-measurement-oriented method and is a viable substitute for this classification problem. While this pipeline is particular to the research questions related to the Oman Drilling Project's Borehole BA1B (namely, what is the impact of non-tectonic fracturing of rock on peridotite alteration?), much of the AI-based framework presented in this paper is applicable to a great many borehole-related data sets.
