<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.2 20190208//EN" "http://jats.nlm.nih.gov/publishing/1.2/JATS-journalpublishing1.dtd">
<!--<?xml-stylesheet type="text/xsl" href="article.xsl"?>-->
<article article-type="research-article" dtd-version="1.2" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<front>
<journal-meta>
<journal-id journal-id-type="issn">2941-1300</journal-id>
<journal-title-group>
<journal-title>ing.grid</journal-title>
</journal-title-group>
<issn pub-type="epub">2941-1300</issn>
<publisher>
<publisher-name>Universit&#228;ts- und Landesbibliothek Darmstadt</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.48694/inggrid.3983</article-id>
<article-categories>
<subj-group>
<subject>Research article</subject>
</subj-group>
</article-categories>
<title-group>
<article-title>From Ontology to Metadata: A Crawler for Script-based Workflows</article-title>
<subtitle>HOMER: a tool for extraction and re-use of ontology-based metadata in high-performance measurement and computing workflows</subtitle>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0001-8623-1464</contrib-id>
<name>
<surname>Chiapparino</surname>
<given-names>Giuseppe</given-names>
</name>
<email>giuseppe.chiapparino@tum.de</email>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-1489-6501</contrib-id>
<name>
<surname>Farnbacher</surname>
<given-names>Benjamin</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0003-0580-9717</contrib-id>
<name>
<surname>Hoppe</surname>
<given-names>Nils</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-4583-7969</contrib-id>
<name>
<surname>Ralev</surname>
<given-names>Radoslav</given-names>
</name>
<xref ref-type="aff" rid="aff-2">2</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-7213-5110</contrib-id>
<name>
<surname>Sdralia</surname>
<given-names>Vasiliki</given-names>
</name>
<xref ref-type="aff" rid="aff-3">3</xref>
<xref ref-type="aff" rid="aff-4">4</xref>
</contrib>
<contrib contrib-type="author">
<contrib-id contrib-id-type="orcid">https://orcid.org/0000-0002-6904-8315</contrib-id>
<name>
<surname>Stemmer</surname>
<given-names>Christian</given-names>
</name>
<xref ref-type="aff" rid="aff-1">1</xref>
</contrib>
</contrib-group>
<aff id="aff-1"><label>1</label>TUM School of Engineering and Design, Department of Engineering Physics and Computation, Chair of Aerodynamics and Fluid Mechanics, Technical University of Munich, Garching, Germany</aff>
<aff id="aff-2"><label>2</label>TUM School of Computation, Information and Technology, Department of Informatics, Technical University of Munich, Garching, Germany</aff>
<aff id="aff-3"><label>3</label>TUM School of Engineering and Design, Department of Engineering Physics and Computation, Chair of Aerodynamics and Fluid Mechanics, Technical University of Munich, Garching, Germany</aff>
<aff id="aff-4"><label>4</label>Munich Data Science Institute (MDSI), Technical University of Munich, Garching, Germany</aff>
<pub-date publication-format="electronic" date-type="pub" iso-8601-date="2024-07-12">
<day>12</day>
<month>07</month>
<year>2024</year>
</pub-date>
<pub-date pub-type="collection">
<year>2024</year>
</pub-date>
<volume>1</volume>
<issue>2</issue>
<fpage>1</fpage>
<lpage>18</lpage>
<history>
<date date-type="received" iso-8601-date="2023-02-07">
<day>07</day>
<month>02</month>
<year>2023</year>
</date>
<date date-type="accepted" iso-8601-date="2024-06-10">
<day>10</day>
<month>06</month>
<year>2024</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright: &#x00A9; 2024 The Author(s)</copyright-statement>
<copyright-year>2024</copyright-year>
<license license-type="open-access" xlink:href="https://creativecommons.org/licenses/by/4.0/">
<license-p>The text of this work is released under the Creative Commons license CC BY 4.0 International. You can find the contract text of the license at <uri xlink:href="https://creativecommons.org/licenses/by/4.0/">https://creativecommons.org/licenses/by/4.0/</uri>. The illustrations are excluded from this license, here the copyright lies with the respective rights holder.</license-p>
</license>
</permissions>
<self-uri xlink:href="https://www.inggrid.org/articles/doi.org/10.48694/inggrid.3983/"/>
<abstract>
<p>The present work introduces HOMER (<bold>H</bold>igh Performance Measurement and Computing tool for <bold>O</bold>ntology-based <bold>M</bold>etadata <bold>E</bold>xtraction and <bold>R</bold>e-use), a metadata crawler written in Python that automatically retrieves relevant research metadata from script-based workflows on HPC systems. The tool offers a flexible approach to metadata collection, as the metadata scheme can be read from an ontology file. With minimal user input, the crawler can be adapted to the user&#8217;s needs and easily integrated into the workflow to retrieve the relevant metadata. The obtained information can then be post-processed automatically. For example, strings may be trimmed by regular expressions or numerical values may be averaged. Currently, data can be collected from text files and HDF5 files, or hardcoded directly by the user. However, the tool has been designed in a modular way, so that the supported file types, the instruction-processing routines and the post-processing operations can be extended in a straightforward manner.</p>
</abstract>
<kwd-group>
<kwd>Metadata extraction</kwd>
<kwd>HPMC</kwd>
<kwd>Ontology</kwd>
<kwd>Research Data Management</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec>
<title>1 Introduction</title>
<p>Nowadays, scientists are called on to handle large amounts of generated data, store them in repositories and distribute them among other scientists or the scientific community, which makes it hard to keep track of all the relevant information over time and space. This can lead to the generation of a large quantity of forgotten and unused data, the so-called Dark Data [<xref ref-type="bibr" rid="B1">1</xref>], [<xref ref-type="bibr" rid="B2">2</xref>]. Although every researcher implements some sort of Research Data Management (RDM), either consciously or unconsciously, to avoid the loss of precious information, standardized RDM approaches, such as the FAIR data principles (Findable, Accessible, Interoperable, and Re-usable), have been proposed in order to provide a more structured and potentially more efficient solution to these problems. From the beginning of a project, the scientist should have a data-management plan stating how the data will be organized, where they will be stored safely and who should be able to access and re-use them. Accompanying the complete data-generation process with a proper data-management plan has two benefits. On the one hand, it will be easier for future scientists to reproduce earlier research. On the other hand, well-documented data will enable effective secondary research.</p>
<p>For that reason, the German Federal Government funded NFDI (Nationale Forschungsdateninfrastruktur [National Research Data Infrastructure]), to establish an infrastructure on RDM, providing an environment where scientists can develop solutions to research questions and make their findings and innovations sustainable by implementing the FAIR data principles. NFDI4Ing (NFDI f&#252;r die Ingenieurwissenschaften [NFDI for Engineering] [<xref ref-type="bibr" rid="B3">3</xref>]), one of the consortia funded by the NFDI initiative, brings together the engineering communities to develop, standardise and provide methods and services to make engineering research data FAIR.</p>
<p>One major factor in making data FAIR is the implementation of a controlled vocabulary with common terminology. The use of a controlled vocabulary is essential for findability, interoperability and, consequently, the re-use of data and the establishment of new user models. Most research data in the HPMC domain are undocumented and lack metadata sets, as common terminologies for HPMC in the engineering sector still need to be developed and established within the community. HOMER automatically retrieves relevant research metadata from script-based workflows on HPC systems and therefore helps researchers to collect and publish their research data within a controlled vocabulary using a standardized workflow. Controlled vocabularies, and the relations and restrictions between their terms, are implemented in practice through ontologies. An ontology defines a shared conceptualization of a common vocabulary, the semantic relations of data, and both syntactic and semantic interoperability, including machine-interpretable definitions of the basic concepts in the domain and the relations among them. The NFDI4Ing consortium has developed an ontology as a common classification of engineering data in a taxonomic hierarchy with standardized vocabulary and procedures. Metadata4Ing (Metadata for Engineering [<xref ref-type="bibr" rid="B4">4</xref>]) aims at providing a thorough framework for the semantic description of research data, with a particular focus on engineering sciences and neighboring disciplines. Metadata4Ing re-uses elements from existing terminologies and ontologies, such as DCMI Metadata Terms [<xref ref-type="bibr" rid="B5">5</xref>] or the PROV (Provenance Namespace) ontology [<xref ref-type="bibr" rid="B6">6</xref>], whose terms were imported into Metadata4Ing.
This ontology allows a thorough description of the whole data-generation process (experiment, observation, simulation), covering aspects such as the object of investigation, all sample and data manipulation procedures, a summary of the data files and the information they contain, and all personal and institutional roles. The NFDI4Ing framework comprises many working groups called &#8220;archetypes&#8221;. Among them, the role of archetype DORIS is twofold: on the one hand, to create an HPMC (sub-)ontology based on Metadata4Ing in order to establish a consistent terminology for computational fluid-dynamics (CFD) workflows on high-performance computing (HPC) systems; on the other hand, to develop a metadata crawler, presented in this work, for metadata extraction. The expansion to an HPC sub-ontology is modular and fits into the primary Metadata4Ing classes of method, tool and object of research. The expansion includes suggestions of unambiguous terms for domain-related metadata, expressed as classes, object properties (relations) and data properties. These classes have been developed in a community-based approach and represent common methods and tools for workflows in engineering research on HPMC systems. The crawler, named HOMER (<bold>H</bold>PMC tool for <bold>O</bold>ntology-based <bold>M</bold>etadata <bold>E</bold>xtraction and <bold>R</bold>e-use), is intended as an RDM tool to automate the retrieval of metadata and is designed to be used in script-based HPMC applications.</p>
<p>In the field of RDM, many solutions and tools for metadata extraction have been proposed in recent years. While all of them share with HOMER the core concept of automating metadata extraction on HPC systems, they implement different approaches to the problem and offer a great variety of capabilities.</p>
<p>For example, the RDM system at the University of Huddersfield, iCurate [<xref ref-type="bibr" rid="B7">7</xref>], provides a solution tailored to HPMC data, with functionalities for metadata retrieval, departmental archiving, workflow management, and metadata validation and self-inferencing. This last functionality requires the metadata to be mapped onto a suitable ontology. iCurate offers support for all aspects of data management, but the actual extraction of metadata is limited to the annotations made by the user in an HPC job file. While this guarantees a non-intrusive integration within the workflow, it does not allow the user to retrieve metadata from output files.</p>
<p>Another tool for research-data management is signac [<xref ref-type="bibr" rid="B8">8</xref>]. This is a lightweight framework providing all the components needed to create a searchable and shareable dataspace (a decentralized infrastructure for data sharing and exchange based on commonly agreed principles [<xref ref-type="bibr" rid="B9">9</xref>]). The core application is a semi-structured database that stores the original files on the file system along with the associated metadata, which are created on the fly and saved in a human-readable format. The tool also supports workflow management through the signac-flow application. The framework is implemented in Python and is designed to be used on HPC systems. However, in order to perform the metadata annotation, signac requires the user to wrap the original simulation code into a script.</p>
<p>The extraction of metadata from output files is possible with Xtract [<xref ref-type="bibr" rid="B10">10</xref>], [<xref ref-type="bibr" rid="B11">11</xref>]. In general, this serverless middleware provides an effective, flexible and scalable way to retrieve metadata from very large data lakes (centralized systems storing data in raw format [<xref ref-type="bibr" rid="B12">12</xref>]). With Xtract, metadata can be extracted both centrally (&#8220;central mode&#8221;, i.e. fetching the metadata all at once from the different repositories where the data are generated) and at the edge (&#8220;edge mode&#8221;, i.e. extracting and storing the metadata as soon as the data are created and at the location where they are generated). The service implements several different extractors (written in Python or Bash), which are run dynamically according to the type of file that needs to be crawled. This makes it possible to handle a vast quantity of different data formats typically employed in scientific applications. Machine learning is used to infer the type of file to be crawled, so as to choose the best extractor(s) for that file in the shortest time possible. Finally, the tool is highly portable, as it is wrapped in a Docker container [<xref ref-type="bibr" rid="B13">13</xref>] and can be linked to Globus [<xref ref-type="bibr" rid="B14">14</xref>], [<xref ref-type="bibr" rid="B15">15</xref>]. Xtract is a powerful service, but it is better suited to data lakes than to application within a workflow. Moreover, the information that can be retrieved is somewhat limited and not entirely customizable by the user.</p>
<p>From this point of view, more freedom is given by ExtractIng [<xref ref-type="bibr" rid="B16">16</xref>], a generic automated metadata-extraction toolkit. Again suited for HPC systems, ExtractIng is a standalone tool written in Java that needs to be run once the simulation outputs have been produced. It is easy to integrate within a workflow and offers both a native and a parallel implementation of the parsing algorithm, which makes the code scalable for HPC applications. The metadata extraction is based on the metadata scheme provided by EngMeta [<xref ref-type="bibr" rid="B17">17</xref>]. The tool is code-independent, in the sense that an external configuration file makes it possible to adapt the metadata extraction to the specific simulation code and computational environment. While this provides a generic extraction tool, the configuration file needs to be written and adapted manually by the user (even though this only needs to be done once per code).</p>
<p>Another solution is represented by Brown Dog [<xref ref-type="bibr" rid="B18">18</xref>], which consists of two services called DAP and DTS. The first provides file conversion, while the second performs extraction and analysis of metadata. Brown Dog aims at leveraging already existing software, libraries and services in order to provide an automated aid in RDM. The implementation of an elasticity module provides for an optimized auto-scale of the two services based on the system demand. Moreover, a tool catalog points the user to the most suitable option for file conversion and metadata extraction. The extracted metadata is returned as a JSON file. However, this operation is performed by passing the data to the web-based service Clowder. This particular aspect might pose some limitations in the use of Brown Dog.</p>
<p>Some of the services that provide metadata extraction are domain specific. For example, ScienceSearch [<xref ref-type="bibr" rid="B19">19</xref>] is a generalized and scalable search infrastructure, which employs machine learning to capture metadata. Information is retrieved not only from the data themselves, but also from their context and surrounding artifacts (proposals, publications, file system structures and images), allowing for an enrichment of the extracted metadata. The service provides a web interface where users can submit their text queries and give feedback on the collected metadata, so as to improve the search quality. However, the data model is unique to the NCEM dataset, which includes data relevant to the field of electron microscopy.</p>
<p>Within the NFDI environment, Swate [<xref ref-type="bibr" rid="B20">20</xref>] is an Excel add-in for the annotation of experimental data and computational workflows developed by the consortium NFDI4Plants. The tool is intended for metadata annotation based on the ontology provided by the user. The use of a spreadsheet environment aims at providing an intuitive and low-friction workflow, principally focusing on wet-lab applications, where the user can annotate work-relevant metadata as the experiment is performed. Hence, in this tool, the work is done manually by the user.</p>
<p>A second tool within the NFDI infrastructure is presented in this paper. Developed within the NFDI4Ing consortium at the Technical University of Munich (TUM), HOMER is a metadata crawler to be integrated in script-based (HPMC) workflows, aiming at retrieving metadata that can be attached to the raw data published by researchers. The tool is designed to be flexible, adjustable to the user&#8217;s needs and easy to implement in potentially any HPMC workflow. This development approach tries to overcome the shortcomings highlighted for the RDM solutions and tools reviewed in this section, and to make HOMER suitable for a wide range of applications. The crawler can retrieve metadata from text and binary (HDF5) files, as well as from user annotations and terminal commands, at any stage of the workflow without interfering with the other processes composing the workflow. The automated extraction of metadata can be performed in both edge and central mode, making the tool suitable for extracting information also from central repositories (such as data lakes). The metadata extraction is based on the ontology schemes chosen by the users. However, the users do not need to adhere strictly to a fixed scheme, but can adjust and customize it according to their needs. Moreover, although developed primarily with engineering sciences in mind, HOMER can be employed to retrieve metadata from HPMC workflows in a wide variety of research fields. Finally, the tool has been written with a modular structure, so it can easily be developed further to include new features. Hence, HOMER is proposed as a flexible and consistent RDM tool that can be used in a wide variety of applications and fields with limited user input, in order to promote the FAIR principles and enrich the data created by the user.</p>
<p>The code structure is described in section 3, while a simple application is described in section 4. Finally, section 5 gives an overview of the future steps in the code development.</p>
</sec>
<sec>
<title>2 Characterization of the problem</title>
<p>Many numerical applications, such as optimization problems or parametric studies, require the user to solve essentially the same problem with slightly different inputs each time. As an example from the field of CFD, assume that it is necessary to assess the aerodynamic characteristics of an aircraft wing during different phases of the flight (take-off, climb, cruise and so on). The user will then perform a certain number of simulations employing the same wing geometry while varying the freestream conditions (pressure, density, temperature, Reynolds number and so on) provided as input to the simulation. In such a case, especially when the number of simulations to be performed is large, automating the workflow (or parts of it) by means of script-based processes enables an efficient use of the available computing resources. For example, the user could create a script that changes the freestream input parameters as soon as a simulation ends, so that the following one with new conditions starts immediately. Together with the data generation, the researcher should also aim at retrieving and storing the relevant metadata for all the computations performed, in order to comply with the FAIR principles and add value to the data gathered. In the wing-study example, the most obvious relevant metadata would be the different input freestream conditions associated with each simulation result, but the user could also be interested in storing information on the specific hardware or software used for the simulations (version of the code, version of the compiler, ID of the computational node, and so on). The information that needs to be extracted might be scattered across the different files that are usually generated during numerical calculations, such as the input and output files associated with the simulation, as well as files generated by the HPC system.
Therefore, the user will have to go through all these files for each simulation and repeatedly extract and store the metadata. Doing such a job by hand would certainly be time consuming and would be a viable option only if the number of simulations is very limited. In HPMC applications where hundreds of simulations are performed, this approach would be prohibitive, and the use of a dedicated extraction routine would be the preferable and most efficient choice. At this point, the user can either write an extraction routine from scratch, which would guarantee perfect compatibility even for very specific codes but requires the user&#8217;s own time and resources, or employ an already available tool, at the price of possible overheads to properly implement the tool within the workflow. In the latter case, HOMER offers valid support for researchers. In fact, HOMER is intended to be used, potentially, for any script-based application and is designed to be easily adjusted to different research fields.</p>
</sec>
<sec>
<title>3 Code Description</title>
<sec>
<title>3.1 Characteristics</title>
<p>HOMER is a code written in Python and, at the moment, its complete extraction workflow consists of five steps, as shown in figure <xref ref-type="fig" rid="F1">1</xref>. The code has been developed as a collection of modular routines, each performing a different action, rather than as a single script. This guarantees a flexible application of the crawler, as the user does not always have to perform all the steps described in the next subsection, especially once the initial setup of the workflow has been done for the first time. The modularity of the code also allows the developers and the users to easily modify and expand the capabilities of the crawler, so that the code can be tailored for specific applications, if needed. HOMER can be employed both locally and on HPC systems. However, it should be noted that, as of now, it does not support a parallel implementation to parse the target files. Moreover, the tool only covers the stages of planning, creating/collecting and processing/analyzing within the data life cycle (figure <xref ref-type="fig" rid="F2">2</xref>).</p>
<fig id="F1">
<caption>
<p><bold>Figure 1:</bold> The five steps the crawler is currently composed of and related input/output files.</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="inggrid-3983_chiapparino-g1.png"/>
</fig>
<fig id="F2">
<caption>
<p><bold>Figure 2:</bold> Data Life Cycle [<xref ref-type="bibr" rid="B21">21</xref>].</p>
</caption>
<graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="inggrid-3983_chiapparino-g2.png"/>
</fig>
<p>Conceived as a tool to be integrated in script-based workflows, the crawler should be run after the simulation (or, potentially, any processing step), similarly to ExtractIng. Hence, the metadata are naturally extracted in edge mode (where the data are generated). However, the tool can also be used to retrieve metadata from centrally-stored, previously collected data, similarly to Xtract. In this case, though, the user has to perform some extra steps according to the specific case at hand (an example is given at the end of section 4). Together with metadata extraction, the tool gives the user the opportunity to perform some simple post-processing operations as well, such as trimming strings or calculating the minimum, maximum or mean of a series of values.</p>
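<p>The post-processing operations mentioned above can be illustrated with a minimal Python sketch. It is not taken from the HOMER source code: the function names <monospace>trim_string</monospace> and <monospace>reduce_values</monospace>, the sample string and the sample values are assumptions introduced here purely for illustration.</p>
<code language="python">
import re
from statistics import mean


def trim_string(raw, pattern):
    """Return the first capture group of pattern in raw, or raw itself if nothing matches."""
    match = re.search(pattern, raw)
    return match.group(1) if match else raw


def reduce_values(values, operation):
    """Reduce a series of numerical values to a single number (min, max or mean)."""
    operations = {"min": min, "max": max, "mean": mean}
    return operations[operation](values)


# Hypothetical raw metadata, as they might be read from a log file.
version = trim_string("solver version 1.2.3 (build 42)", r"version\s+([\d.]+)")
samples = [0.5, 0.25, 0.75]

print(version)
print(reduce_values(samples, "min"), reduce_values(samples, "max"), reduce_values(samples, "mean"))
</code>
<p>Keeping each operation in a small function of its own mirrors the modular design described above, since a new operation would only require one more entry in the lookup table.</p>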
</sec>
<sec>
<title>3.2 Implementation</title>
<p>When running the code for the first time, five steps (figure <xref ref-type="fig" rid="F1">1</xref>) are needed, two of which require direct user input. An overview of this five-step workflow is given in the next paragraphs, while an application to a CFD-based example is shown in section 4.</p>
<p>The first step consists in reading the ontology file and creating an empty dictionary containing the flat classes as specified in the ontology. A &#8220;flat class&#8221; in this context is the initial state of a class, in which the properties have not yet been specified. The step is performed by the routine <monospace>ClassUtils.py</monospace>, which relies on the Python package <monospace>Owlready2</monospace> to work on the ontology. The empty dictionary lists the classes and their attributes as they appear in the ontology file. It acts as a template in which all the flat classes are listed but not yet filled in. Hence, no specific instance of a class is created at this point. The file, however, contains the fields that allow the user to specify how many instances of each class, and how many related properties, are to be created.</p>
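<p>As a rough illustration of what such a template could look like, the following Python sketch builds a flat-class dictionary from a plain mapping of class names to property names. The mapping is a hypothetical stand-in for the information that HOMER reads from the ontology file through <monospace>Owlready2</monospace>, and the function <monospace>build_flat_template</monospace> is an assumption, not part of <monospace>ClassUtils.py</monospace>. Only the keywords <monospace>__count__</monospace> and <monospace>__restrictions__</monospace> and the placeholder value <monospace>[1]</monospace> follow the dictionary format used in this article.</p>
<code language="python">
import json


def build_flat_template(class_properties):
    """Create an unfilled template with one entry per ontology class."""
    template = {}
    for class_name, properties in class_properties.items():
        # Bookkeeping keywords first, then one placeholder per property.
        entry = {"__count__": 1, "__restrictions__": ""}
        for prop in properties:
            entry[prop] = [1]
        template[class_name] = entry
    return template


# Hypothetical stand-in for the classes and properties found in an ontology.
classes = {
    "Processing_Step": ["Name", "Parameter", "Has_numerical_value"],
    "Tool": ["Name"],
}

flat = build_flat_template(classes)
print(json.dumps(flat, indent=4))
</code>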
<p>The second step has to be performed manually and serves the purpose of preparing the dictionary file to be used by the &#8220;Multiplexer&#8221;, as explained in the third step. The user needs to specify how many instances of each class need to be retrieved. This is done by giving a numerical value to the keyword <monospace>__count__</monospace> in the corresponding class. Similarly, the user can specify how many properties each instance of a class needs to have by indicating the numerical value in the corresponding property list.</p>
<p>The third step consists in the Multiplexer and is performed by the routine <monospace>Multiplexer.py</monospace>. The output is the original flat-class dictionary which has been now expanded according to the needs of the user, so that the new file contains a list of all the needed instances for each class and all the properties for each instance of a class.</p>
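<p>The expansion can be pictured with the following Python sketch, which should be read as an assumption about the general idea rather than as the logic of <monospace>Multiplexer.py</monospace>. The function <monospace>multiplex</monospace> and the layout of the expanded dictionary (a list of instances per class, each property holding a list of empty slots with the keys <monospace>path</monospace>, <monospace>type</monospace> and <monospace>pattern</monospace>) are hypothetical.</p>
<code language="python">
import copy

# Hypothetical empty slot, to be filled in by the user in the following step.
EMPTY_SLOT = {"path": "", "type": "", "pattern": ""}


def multiplex(flat_template):
    """Expand each flat class into __count__ instances with the requested property slots."""
    expanded = {}
    for class_name, spec in flat_template.items():
        instances = []
        for _ in range(spec["__count__"]):
            instance = {}
            for key, value in spec.items():
                if key.startswith("__"):
                    continue  # skip bookkeeping keywords such as __count__
                # value is a one-element list holding the number of slots requested
                instance[key] = [copy.deepcopy(EMPTY_SLOT) for _ in range(value[0])]
            instances.append(instance)
        expanded[class_name] = instances
    return expanded


# Flat class edited by the user: two instances, each with three parameters.
flat = {
    "Processing_Step": {
        "__count__": 2,
        "__restrictions__": "",
        "Name": [1],
        "Parameter": [3],
    }
}

expanded = multiplex(flat)
print(len(expanded["Processing_Step"]))
print(len(expanded["Processing_Step"][0]["Parameter"]))
</code>
<p>Each slot is deep-copied so that filling in one instance later does not silently modify the others.</p>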
<p>The fourth step again has to be performed manually. The user has to fill in the multiplexed dictionary by specifying, for each instance and property, where the crawler should look for the data and how it should retrieve them. This information is to be provided by specifying the three keywords <monospace>&#34;path&#34;, &#34;type&#34;</monospace> and <monospace>&#34;pattern&#34;</monospace>. The filled expanded dictionary works as a configuration file for the final step.</p>
<p>The fifth and last step consists in the actual extraction of the metadata and is performed by the routine <monospace>EntityUtils.py</monospace>. Once the metadata have been extracted, they are written to a file in a structured, human-readable format (JSON or YAML). Currently, metadata can be extracted from files (text files using regular expressions, or HDF5 files using <monospace>h5py</monospace>) and from the output of an operating-system command, or they can be hardcoded directly during the fourth step, if needed.</p>
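<p>The extraction from a text file by means of a regular expression can be sketched in Python as follows. This is a simplified illustration under stated assumptions and not the implementation found in <monospace>EntityUtils.py</monospace>: the function <monospace>extract_value</monospace>, the values <monospace>"text"</monospace> and <monospace>"hardcoded"</monospace> of the keyword <monospace>"type"</monospace>, the file name and the file contents are all hypothetical. Only the three keywords <monospace>"path"</monospace>, <monospace>"type"</monospace> and <monospace>"pattern"</monospace> are taken from the description above.</p>
<code language="python">
import re
import tempfile
from pathlib import Path


def extract_value(entry):
    """Return the metadata value described by one configuration entry."""
    if entry["type"] == "hardcoded":
        # Assumption: a hardcoded value is stored directly in the pattern field.
        return entry["pattern"]
    if entry["type"] == "text":
        content = Path(entry["path"]).read_text(encoding="utf-8")
        match = re.search(entry["pattern"], content)
        return match.group(1) if match else None
    raise ValueError("unsupported type: " + entry["type"])


# Create a small, made-up output file and extract one value from it.
with tempfile.TemporaryDirectory() as workdir:
    log_file = Path(workdir) / "simulation.log"
    log_file.write_text("Mach number = 0.78\nReynolds number = 6.5e6\n", encoding="utf-8")
    entry = {
        "path": str(log_file),
        "type": "text",
        "pattern": r"Mach number\s*=\s*([\d.]+)",
    }
    mach = extract_value(entry)

print(mach)
</code>
<p>Dispatching on the value of <monospace>"type"</monospace> keeps the sketch open to further sources, such as HDF5 files or operating-system commands, which would each add one more branch.</p>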
<p>As mentioned, the user must first configure the crawler to the specific simulation code in use (type and amount of metadata available to be extracted, location and format of the target files, extraction methods&#8230;). This involves the two (lengthy) manual steps just described. However, once the first setup has been completed, the configuration file (created at the end of step four) can be re-used with little to no modification every time a new simulation is performed by the user and new metadata need to be extracted. Specifically, step five simply needs to be performed in the subsequent runs. This allows for a seamless integration of the tool within the workflow.</p>
</sec>
</sec>
<sec>
<title>4 Example Application</title>
<p>In this section, a simple use of the tool is shown for an application within the CFD field. Namely, the test problem of the aerodynamic optimization of a wing mentioned in section 2 is considered, in order to show the capabilities of the tool directly applied to an engineering application.</p>
<p>Nonetheless, the reader is also invited to try out the more generic step-by-step test case based on a simplified &#8220;Pizza ontology&#8221; available in the GitLab code repository (all the relevant files are in the directory <monospace>/SimpleApplication_PizzaOntology</monospace>). This purposely generic example is intended to provide a complete overview of the implementation of HOMER within a script-based workflow in a simple, clear and application-independent way. The ontology file used in the initial step is loosely based on the &#8220;Pizza ontology&#8221; provided by Stanford University in their Prot&#233;g&#233; tutorial [<xref ref-type="bibr" rid="B22">22</xref>].</p>
<sec>
<title>4.1 Application to a CFD case</title>
<p>Taking the case of simulations of an airplane wing at different freestream conditions as an example, the user could be interested in extracting values such as freestream pressure, temperature, Mach and Reynolds number, as well as the ID number of the node(s) where the calculation was performed and the software version used at the time of the calculation. These metadata would then be attached to the results of the corresponding simulations to help make the data compliant with the FAIR principles.</p>
<p>In this example, the code NSMB [<xref ref-type="bibr" rid="B23">23</xref>] is employed to perform the simulations. All the relevant pieces of information are extracted from the input and output files of the code and are classified according to a simplified version of the Metadata4Ing ontology. The reference classes are therefore limited to <monospace>&#34;Processing_Step&#34;, &#34;Tool&#34;</monospace> and <monospace>&#34;Method&#34;</monospace>, with information such as parameter names and numerical values being considered as properties of those classes.</p>
<p>In this example, it is assumed that the crawler is employed in edge mode, meaning that the HOMER is invoked right after each simulation has finished to immediately extract the corresponding metadata. This means that all the file paths specified by the keyword <monospace>&#34;path&#34;</monospace> are relative paths, as the crawler runs from the same directory where the simulations are performed.</p>
<p>The example shows all the five steps described in section 3, which would be ideally required for the first usage of the tool. After that, HOMER can be seamlessly integrated in the workflow without further modifications.</p>
<p>After setting up the (optional) python virtual environment and having installed the crawler, the first command to run is <monospace>ClassUtils.py</monospace>, which retrieves the ontology and creates a dictionary with the flat classes (i.e. classes to be filled-out by the user and where, right after step 1, the properties have simply placeholder values) in the form of a <monospace>.json</monospace> file. The content of such a file would look like the one shown in the next lines.</p>
<code>
{
&#160;&#160;&#160;&#160;&#34;Processing_Step&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;__count__&#34;: 1,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;__restrictions__&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Name&#34;: [1],
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Parameter&#34;: [1],
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Has_numerical_value&#34;: [1]
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#34;Tool&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;__count__&#34;: 1,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;__restrictions__&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;System_component&#34;: [1],
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Name&#34;: [1],
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;ID&#34;: [1]
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#8230;
}
</code>
<p><bold>Listing 1:</bold> Example of the content inside the file produced after step 1.</p>
<p>This empty dictionary has to be manually adjusted to the specific case by the user. In this example, only one processing step is foreseen (running the simulation), for which five parameters are going to be extracted (pressure, temperature, Mach number, Reynolds number and starting time of the simulation). Therefore, the <monospace>&#34;Processing_Step&#34;</monospace> class will occur once (<monospace>&#34;__count__&#34;</monospace>: 1) and will contain one <monospace>&#34;Name&#34;</monospace> property and five each of the <monospace>&#34;Parameter&#34;</monospace> and <monospace>&#34;Has_numerical_value&#34;</monospace> properties. For the <monospace>&#34;Tool&#34;</monospace> class, assume that a distinction is made between &#8220;Hardware&#8221; and &#8220;Software&#8221;. Hence, two instances of this class (<monospace>&#34;__count__&#34;</monospace>: 2) need to be created, each with its own properties <monospace>&#34;System_component&#34;, &#34;Name&#34;</monospace> and <monospace>&#34;ID&#34;</monospace>. The result of this manual file manipulation is shown below.</p>
<code>
{
&#160;&#160;&#160;&#160;&#34;Processing_Step&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;__count__&#34;: 1,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;__restrictions__&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Name&#34;: [1],
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Parameter&#34;: [5],
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Has_numerical_value&#34;: [5]
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#34;Tool&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;__count__&#34;: 2,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;__restrictions__&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;System_component&#34;: [1, 1],
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Name&#34;: [1, 1],
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;ID&#34;: [1, 1]
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#8230;
}
</code>
<p><bold>Listing 2:</bold> Filled-in .json file after step 2.</p>
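<p>Because step 2 is a manual edit, a quick consistency check before running the multiplexer can catch slips. The following minimal Python sketch is not part of HOMER; the function name <monospace>check_flat_classes</monospace> is hypothetical, and it assumes, as Listing 2 suggests, that every property list holds one entry per class instance.</p>
<code language="python"><![CDATA[
```python
def check_flat_classes(flat):
    """Return a list of problems found in a flat-class dictionary.

    Assumption (read off Listing 2): each property maps to a list with
    one entry per instance, so its length must equal "__count__".
    """
    problems = []
    for class_name, spec in flat.items():
        count = spec.get("__count__")
        if not isinstance(count, int) or count < 1:
            problems.append(f"{class_name}: '__count__' must be a positive integer")
            continue
        for prop, repeats in spec.items():
            if prop.startswith("__"):  # skip bookkeeping keys such as __count__
                continue
            if len(repeats) != count:
                problems.append(
                    f"{class_name}.{prop}: expected {count} entries, found {len(repeats)}"
                )
    return problems


# Same content as Listing 2, but with a deliberate slip in "ID".
flat = {
    "Processing_Step": {
        "__count__": 1, "__restrictions__": "",
        "Name": [1], "Parameter": [5], "Has_numerical_value": [5],
    },
    "Tool": {
        "__count__": 2, "__restrictions__": "",
        "System_component": [1, 1], "Name": [1, 1], "ID": [1],
    },
}
problems = check_flat_classes(flat)
```
]]></code>
<p>Here the check reports the <monospace>&#34;ID&#34;</monospace> property of <monospace>&#34;Tool&#34;</monospace>, which lists one entry although two instances were requested.</p>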
<p>Running <monospace>Multiplexer.py</monospace> expands the classes according to the parameters indicated in the previous step. The output is shown below. The new empty dictionary contains all the instances and corresponding properties the crawler will use in the creation of the metadata file.</p>
<code>
{
&#160;&#160;&#160;&#160;&#34;Processing_Step&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;__restrictions__&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Name&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;path&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;pattern&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;postprocessor&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;args&#34;: &#34;&#34;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;}
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Parameter_1&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;path&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;pattern&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;postprocessor&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;args&#34;: &#34;&#34;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;}
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#8230;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Has_numerical_value_1&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;path&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;pattern&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;postprocessor&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;args&#34;: &#34;&#34;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;}
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#8230;
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#34;Tool_1&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;__restrictions__&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;System_component&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;path&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;pattern&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;postprocessor&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;args&#34;: &#34;&#34;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;}
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#8230;
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#8230;
}
</code>
<p><bold>Listing 3:</bold> Dictionary with all the classes and their properties expanded by the multiplexer in step 3.</p>
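<p>The expansion rule can be read off by comparing Listing 2 with Listing 3: every class is repeated <monospace>&#34;__count__&#34;</monospace> times, every property is repeated as often as its list entry says, and each resulting property receives an empty extraction template. The following is a minimal Python sketch of that rule under stated assumptions; it is not the code of <monospace>Multiplexer.py</monospace>. The names <monospace>multiplex</monospace>, <monospace>numbered</monospace> and <monospace>EMPTY_ENTRY</monospace> are hypothetical, and the convention that a numeric suffix is added only when a name occurs more than once is inferred from the listings.</p>
<code language="python"><![CDATA[
```python
import copy

# Empty extraction template attached to every expanded property (cf. Listing 3).
EMPTY_ENTRY = {
    "path": "",
    "type": "",
    "pattern": "",
    "postprocessor": {"type": "", "args": ""},
}


def numbered(name, index, total):
    """Append a 1-based suffix only when the name occurs more than once."""
    return name if total == 1 else f"{name}_{index}"


def multiplex(flat):
    """Expand a filled-in flat-class dictionary into per-instance templates."""
    expanded = {}
    for class_name, spec in flat.items():
        count = spec["__count__"]
        for i in range(count):
            instance = {"__restrictions__": spec.get("__restrictions__", "")}
            for prop, repeats in spec.items():
                if prop.startswith("__"):  # bookkeeping keys are not properties
                    continue
                total = repeats[i]  # repetitions of this property in instance i
                for j in range(1, total + 1):
                    instance[numbered(prop, j, total)] = copy.deepcopy(EMPTY_ENTRY)
            expanded[numbered(class_name, i + 1, count)] = instance
    return expanded


# Input corresponding to Listing 2.
flat = {
    "Processing_Step": {
        "__count__": 1, "__restrictions__": "",
        "Name": [1], "Parameter": [5], "Has_numerical_value": [5],
    },
    "Tool": {
        "__count__": 2, "__restrictions__": "",
        "System_component": [1, 1], "Name": [1, 1], "ID": [1, 1],
    },
}
expanded = multiplex(flat)
```
]]></code>
<p>With this input, the sketch yields the classes <monospace>&#34;Processing_Step&#34;</monospace>, <monospace>&#34;Tool_1&#34;</monospace> and <monospace>&#34;Tool_2&#34;</monospace>, with <monospace>&#34;Parameter_1&#34;</monospace> to <monospace>&#34;Parameter_5&#34;</monospace> inside the processing step. Each template is deep-copied so that filling in one property in step 4 cannot alter another.</p>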
<p>The multiplexed dictionary has to be filled in manually again by the user. How to fill in the dictionary depends on how the user wants to retrieve the data and where the information is stored. In this example, all data are extracted from plain text files and the crawler uses regular expressions to locate and read them. This is done by specifying the keywords <monospace>&#34;path&#34;, &#34;type&#34;</monospace> and <monospace>&#34;pattern&#34;</monospace>. The entries in <monospace>&#34;postprocessor&#34;</monospace> can be left empty for the sake of this example. The lines below show three cases: hardcoding metadata by setting <monospace>&#34;type&#34;</monospace> to <monospace>&#34;string&#34;</monospace> and giving the text in <monospace>&#34;pattern&#34;</monospace> (property <monospace>&#34;Parameter_1&#34;</monospace>, where the user directly provides the name of the parameter); retrieving information from a file using a regular expression (<monospace>&#34;Has_numerical_value_1&#34;</monospace>, read from the file specified in <monospace>&#34;path&#34;</monospace>); and retrieving the output of a terminal command (<monospace>&#34;ID&#34;</monospace>, where the terminal command is given in <monospace>&#34;pattern&#34;</monospace>).</p>
<code>
{
&#160;&#160;&#160;&#160;&#34;Processing_Step&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#8230;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Parameter_1&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;path&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;string&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;pattern&#34;: &#34;Freestream Mach number&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;postprocessor&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;args&#34;: &#34;&#34;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;}
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#8230;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Has_numerical_value_1&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;path&#34;: &#34;input.dat&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;regex&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;pattern&#34;: &#34;Mach :\\s(.*)\\n&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;postprocessor&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;args&#34;: &#34;&#34;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;}
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#8230;
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#34;Tool_1&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#8230;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;ID&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;path&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;os&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;pattern&#34;: &#34;hostname&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;postprocessor&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;type&#34;: &#34;&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;args&#34;: &#34;&#34;
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;}
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#8230;
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#8230;
}
</code>
<p><bold>Listing 4:</bold> Filled-in dictionary in step 4.</p>
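<p>To make the meaning of the three <monospace>&#34;type&#34;</monospace> values in Listing 4 concrete, the following Python sketch resolves a single filled-in entry to a value. It illustrates the idea only and is not taken from HOMER: the function <monospace>extract_value</monospace> is hypothetical, returning the first capture group of the regular expression is an assumption, and so is stripping the trailing newline from the command output.</p>
<code language="python"><![CDATA[
```python
import re
import subprocess
import tempfile
from pathlib import Path


def extract_value(entry, base_dir="."):
    """Resolve one filled-in property entry to a metadata value (hypothetical helper)."""
    kind, pattern = entry["type"], entry["pattern"]
    if kind == "string":
        # Hardcoded metadata: the text in "pattern" is returned unchanged.
        return pattern
    if kind == "regex":
        # Search the file given in "path"; return the first capture group,
        # or None when the pattern does not match.
        text = (Path(base_dir) / entry["path"]).read_text()
        match = re.search(pattern, text)
        return match.group(1) if match else None
    if kind == "os":
        # Run the terminal command given in "pattern" and return its output.
        done = subprocess.run(
            pattern, shell=True, capture_output=True, text=True, check=True
        )
        return done.stdout.strip()
    raise ValueError(f"unsupported type: {kind!r}")


# Demo on a throwaway file that mimics two lines of a solver input file.
with tempfile.TemporaryDirectory() as run_dir:
    (Path(run_dir) / "input.dat").write_text("Mach : 0.35\nPressure : 61640\n")
    name = extract_value(
        {"path": "", "type": "string", "pattern": "Freestream Mach number"}
    )
    mach = extract_value(
        {"path": "input.dat", "type": "regex", "pattern": "Mach :\\s(.*)\\n"},
        base_dir=run_dir,
    )
```
]]></code>
<p>The <monospace>base_dir</monospace> argument mirrors the edge-mode assumption that relative paths are resolved against the directory in which the simulation ran.</p>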
<p>Finally, <monospace>EntityUtils.py</monospace> is used to run the actual extraction routine, which retrieves the metadata according to the parameters specified in the previous step. The output file is shown below and can be either a <monospace>.json</monospace> or a <monospace>.yaml</monospace> file, according to the user&#8217;s needs.</p>
<code>
{
&#160;&#160;&#160;&#160;&#34;Processing_Step&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Name&#34;: &#34;Wing simulation&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Parameter_1&#34;: &#34;Freestream Mach number&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Parameter_2&#34;: &#34;Freestream pressure&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Parameter_3&#34;: &#34;Freestream temperature&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Parameter_4&#34;: &#34;Freestream unit Reynolds number&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Parameter_5&#34;: &#34;Start time&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Has_numerical_value_1&#34;: &#34;0.35&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Has_numerical_value_2&#34;: &#34;61640&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Has_numerical_value_3&#34;: &#34;262&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Has_numerical_value_4&#34;: &#34;5.607E6&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Has_numerical_value_5&#34;: &#34;10:13:31&#34;
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#34;Tool_1&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;System_component&#34;: &#34;Hardware&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Name&#34;: &#34;lrz-coolmuc2-linux-cluster-2022&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;ID&#34;: &#34;i22r07c05s05&#34;
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#34;Tool_2&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;System_component&#34;: &#34;Software&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Name&#34;: &#34;NSMB&#34;,
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;ID&#34;: &#34; 6.09.21    Date:  28 - January - 2021     &#34;
&#160;&#160;&#160;&#160;},
&#160;&#160;&#160;&#160;&#34;Method&#34;: {
&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#160;&#34;Name&#34;: &#34;LU-SGS&#34;
&#160;&#160;&#160;&#160;}
}
</code>
<p><bold>Listing 5:</bold> Final .json file containing the extracted metadata after step 5.</p>
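<p>In the file of Listing 5, a parameter name and its value appear to be linked only through their shared index suffix. A downstream script that consumes the metadata may therefore want to pair them explicitly. The following Python sketch shows one way to do so; the helper <monospace>pair_parameters</monospace> is hypothetical and relies on the naming pattern seen in the listing.</p>
<code language="python"><![CDATA[
```python
import re


def pair_parameters(step):
    """Map each parameter name to the value that carries the same suffix."""
    pairs = {}
    for key, name in step.items():
        # Matches "Parameter" as well as "Parameter_1", "Parameter_2", ...
        match = re.fullmatch(r"Parameter(_\d+)?", key)
        if match:
            suffix = match.group(1) or ""
            pairs[name] = step.get("Has_numerical_value" + suffix)
    return pairs


# Excerpt of the "Processing_Step" class from Listing 5.
step = {
    "Name": "Wing simulation",
    "Parameter_1": "Freestream Mach number",
    "Parameter_2": "Freestream pressure",
    "Has_numerical_value_1": "0.35",
    "Has_numerical_value_2": "61640",
}
pairs = pair_parameters(step)
```
]]></code>
<p>Values are kept as strings, as in the listing; converting them to numbers is left to the consumer because entries such as the start time are not numeric.</p>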
<p>At this point, the generated metadata file can be stored together with the output data from the simulation. Whenever the user performs a new simulation and wants to extract the same type of metadata, there is no need to repeat all five steps of the process. In fact, the filled multiplexed dictionary created in step 4 will not change and acts as a configuration file that can be directly re-used in step 5. This means that the user needs only to add the command that runs step 5 in the script-based workflow. This corresponds to using HOMER in edge mode, which means invoking the crawler each time new data is generated at the end of a CFD simulation.</p>
<p>The other option would be to use the crawler after all the simulations have been run in order to retrieve all the metadata at once, which corresponds to using the crawler in central mode. In this case, the user will need to specify absolute file paths (via the keyword <monospace>&#34;path&#34;</monospace> in the multiplexed <monospace>.json</monospace> file) to point to the files containing the information. This means that the user needs to create an extra script that allows the crawler to search all the relevant folders and files. Such a script would be specialized according to the user&#8217;s simulation environment and workflow. Hence, no example of such a usage can be given in the context of the generic CFD showcase described in this work. However, an example script is provided in the GitLab folder for the Pizza-ontology tutorial. It must be noted that central and edge modes are not features of the tool itself, but different ways of using HOMER. It is up to the users to decide which usage is best based on their needs and preferences, which again shows the flexibility of the tool.</p>
<p>Another remark is that the user does not need to adhere strictly to the chosen ontology file, nor does the user have to use an ontology based on Metadata4Ing. At any of the steps where manual input is needed, the user can adjust the classes and properties according to the case-specific needs. For example, the user could rename the property <monospace>&#34;Parameter&#34;</monospace> to <monospace>&#34;Variable&#34;</monospace> in the class <monospace>&#34;Processing_Step&#34;</monospace> by manually amending the <monospace>&#34;.json&#34;</monospace> file while filling out the flat-class template during step 2. In fact, an ontology file is, in principle, not even necessary for the actual extraction of the metadata. The user could even create their own <monospace>.json</monospace> file with their own classes and properties, skipping the first two steps altogether. On the one hand, of course, this approach requires a certain amount of overhead from the researcher in terms of planning and preparing the <monospace>.json</monospace> files. On the other hand, it gives much more freedom to the user when it comes to adapting the crawler to the specific case at hand.</p>
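<p>Returning to central mode, the core task of the extra script is to find every simulation folder and turn the relative <monospace>&#34;path&#34;</monospace> entries of the filled multiplexed dictionary into absolute ones. The following Python sketch illustrates the idea on a made-up directory layout; it is not the example script from the repository. The function names, the use of <monospace>input.dat</monospace> as a marker file and the folder names are all assumptions, and how the resulting dictionaries are handed to the crawler depends on the user&#8217;s setup.</p>
<code language="python"><![CDATA[
```python
import copy
import tempfile
from pathlib import Path


def absolutize(config, run_dir):
    """Return a copy of a filled multiplexed dictionary with absolute paths."""
    run_dir = Path(run_dir).resolve()
    resolved = copy.deepcopy(config)
    for instance in resolved.values():
        for entry in instance.values():
            # Skip plain strings such as "__restrictions__" and empty paths.
            if isinstance(entry, dict) and entry.get("path"):
                entry["path"] = str(run_dir / entry["path"])
    return resolved


def configs_for_runs(config, root, marker="input.dat"):
    """Yield (run directory, dictionary) for every folder containing the marker file."""
    for found in sorted(Path(root).rglob(marker)):
        yield found.parent, absolutize(config, found.parent)


# Excerpt of a filled multiplexed dictionary with an edge-mode relative path.
config = {
    "Processing_Step": {
        "__restrictions__": "",
        "Parameter_1": {"path": "", "type": "string",
                        "pattern": "Freestream Mach number"},
        "Has_numerical_value_1": {"path": "input.dat", "type": "regex",
                                  "pattern": "Mach :\\s(.*)\\n"},
    }
}

# Demo on a temporary tree with two hypothetical simulation folders.
with tempfile.TemporaryDirectory() as root:
    for run in ("run_a", "run_b"):
        (Path(root) / run).mkdir()
        (Path(root) / run / "input.dat").write_text("Mach : 0.35\n")
    runs = list(configs_for_runs(config, root))
```
]]></code>
<p>The original dictionary is left untouched, so the same configuration file can still be used in edge mode.</p>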
<p>Regarding the limitations of HOMER, the tool works best within standardized workflows, where the structure of the files containing the metadata to be extracted changes very little or not at all over time. Although, as shown, the crawler, and in particular the multiplexed dictionary, can be adapted to new file structures thanks to the flexibility of the tool, such an operation could take a considerable amount of time and effort from the user if performed for every new application of a (changing) workflow. Hence, it appears sensible to limit the use of the crawler to cases where well-known and relatively fixed data structures are employed, as is common in most numerical and experimental research projects. The second limitation is the range of data formats the crawler can currently extract metadata from, which is limited to text and HDF5 files, together with outputs of terminal commands and hardcoded lines. Although the regular-expression parser allows information to be retrieved from virtually any text file regardless of its extension, dedicated readers for commonly used formats such as <monospace>.xml</monospace> have not yet been implemented. Because the crawler is designed to be flexible, adding them would be a straightforward process.</p>
</sec>
</sec>
<sec>
<title>5 Conclusion and Future Developments</title>
<p>In this work, HOMER (<bold>H</bold>PMC tool for <bold>O</bold>ntology-based <bold>M</bold>etadata <bold>E</bold>xtraction and <bold>R</bold>e-use), a tool to automate metadata extraction in script-based workflows, has been presented. The crawler, written in Python, allows for a flexible approach to metadata retrieval. As a starting point, the user can provide an ontology file, whose metadata scheme represents the backbone of the extracted information. The classes and attributes from the ontology can be tailored to the specific case at hand and expanded by means of the multiplexer. Once the user has filled in the final dictionary, the actual metadata extraction is executed. This can happen either in edge mode (the natural application for script-based workflows) or, with some further user input, in central mode. The extracted metadata can then be post-processed by routines included in the code. The tool requires some user input and tuning for the first application, but after that it can be integrated seamlessly into potentially any workflow.</p>
<p>Currently, metadata can be retrieved from text and HDF5 files and from outputs of console commands, or can be hardcoded directly in the configuration file. This limitation can be overcome in the future, as the code is designed in a modular way that allows for a simple integration of new building blocks. According to the user&#8217;s needs, new readers/writers for other file formats can be added. The same applies to the post-processing capabilities for the extracted metadata. Moreover, work to increase the number of readable file formats is planned, focusing at first on the most common formats in CFD applications.</p>
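<p>One possible way to picture such a modular design is a registry that maps a format name to a reader function, so that supporting a new format only means registering one more function. The following Python sketch is an assumption made for illustration and does not reproduce HOMER&#8217;s internal structure: the names <monospace>READERS</monospace>, <monospace>register_reader</monospace> and <monospace>extract</monospace>, as well as the two example readers and their input, are hypothetical.</p>
<preformat>
```python
import csv
import io

# Hypothetical registry that maps a format name to a reader function.
READERS = {}


def register_reader(fmt):
    """Decorator that adds a reader function to the registry."""
    def decorator(func):
        READERS[fmt] = func
        return func
    return decorator


@register_reader("text")
def read_key_value(stream):
    """Read 'key = value' lines into a dictionary."""
    result = {}
    for line in stream:
        if "=" in line:
            key, value = line.split("=", 1)
            result[key.strip()] = value.strip()
    return result


@register_reader("csv")
def read_csv_row(stream):
    """Read the first data row of a CSV file with a header line."""
    return dict(next(csv.DictReader(stream)))


def extract(fmt, stream):
    """Dispatch to the reader registered for `fmt`."""
    if fmt not in READERS:
        raise ValueError(f"no reader registered for format '{fmt}'")
    return READERS[fmt](stream)


text_meta = extract("text", io.StringIO("solver = example_solver\ncores = 480\n"))
csv_meta = extract("csv", io.StringIO("solver,cores\nexample_solver,480\n"))
```
</preformat>
<p>With this layout, the calling code does not change when a further reader is added, which is the property the modular design aims at.</p>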
<p>As of now, HOMER can already be integrated into HPMC workflows to enrich each processing step (e.g. mesh generation, simulation, post-processing, report) with the corresponding metadata. This capability allows for the collection of valuable data (such as the energy consumption of a set of simulations) to enable secondary research and the development of new methodologies in HPC systems. In its current state, the tool provides the best performance when used to extract metadata right after the creation of the data, as no parallel implementation for file parsing is present yet. In the five-step implementation described in this work, HOMER was mainly employed in the planning, creating/collecting and processing/analyzing stages of the data life cycle. A future development of the tool would be to cover the complete data life cycle in a holistic approach, by providing the possibility to automatically preserve and publish/share the extracted metadata along with the research dataset. By publishing not just the data but also administrative (preservation) metadata, third-party users will be able to retrieve crucial information about accessibility, access rights and licenses, among others. Bibliographic (author, identifiers) and descriptive (research domain, tools, methods, processing steps) metadata can be published in repositories together with the referenced research data, or be linked to the research data by persistent links and identifiers, if technical or organizational reasons impede a joint provisioning (for example, if the research data are too large to be stored in a common repository).</p>
<p>One main factor in making data FAIR is the use of a controlled vocabulary with common terminology. HOMER ensures this by supporting the use of semantic ontologies as metadata schemes. These schemes have to be matched with searchable metadata fields in the corresponding repositories. Only a few repositories offer such publishing options, for example DaRUS (University of Stuttgart) [<xref ref-type="bibr" rid="B24">24</xref>], which uses predefined metadata blocks, or Coscine (RWTH) [<xref ref-type="bibr" rid="B25">25</xref>], which provides the possibility to use standardized or self-created metadata application profiles [<xref ref-type="bibr" rid="B26">26</xref>]. These schemes still have to be matched with the corresponding metadata fields in the extracted metadata file in order to provide the metadata in a standardized, searchable and indexable front end. In parallel, the NFDI4Ing consortium is working on a generic interface that combines different kinds of metadata and data repositories behind one standard-based interface. This enables the linking of all data and metadata of the research data life cycle, including experiments, raw data, software, subject-specific metadata sets, and the tracking of usage and citations. Standardized and automatically extracted metadata files can easily be made findable and accessible through this new generic interface [<xref ref-type="bibr" rid="B27">27</xref>]. Therefore, HOMER can be a crucial piece of the metadata toolchain, from the use of common vocabularies and automated extraction to FAIR publishing. The already mentioned Metadata4Ing ontology was used as the reference during the early stages of the development of HOMER. In the meantime, an HPMC sub-ontology has been developed within Metadata4Ing. Hence, one of the next steps will be to further adapt HOMER to this new sub-ontology, allowing the tool to be more effective in the complete data life cycle of a CFD workflow on HPC systems.</p>
</sec>
<sec>
<title>Data availability</title>
<p>Data can be found here: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://gitlab.lrz.de/nfdi4ing/crawler/-/tree/master/SimpleApplication_PizzaOntology">https://gitlab.lrz.de/nfdi4ing/crawler/-/tree/master/SimpleApplication_PizzaOntology</ext-link></p>
</sec>
<sec>
<title>Software availability</title>
<p>Software can be found here: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://doi.org/10.14459/2022mp1694401">https://doi.org/10.14459/2022mp1694401</ext-link></p>
</sec>
</body>
<back>
<sec>
<title>6 Acknowledgements</title>
<p>The authors would like to thank the Federal Government and the Heads of Government of the L&#228;nder, as well as the Joint Science Conference (GWK), for their funding and support within the framework of the NFDI4Ing consortium. Funded by the German Research Foundation (DFG) - project number 442146713. Moreover, the authors gratefully acknowledge the Gauss Centre for Supercomputing e.V. (<ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.gauss-centre.eu">www.gauss-centre.eu</ext-link>) for funding this project by providing computing time on the GCS Supercomputer SuperMUC-NG at Leibniz Supercomputing Centre (<ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.lrz.de">www.lrz.de</ext-link>).</p>
</sec>
<sec>
<title>7 Roles and contributions</title>
<p><bold>Giuseppe Chiapparino:</bold> Conceptualization; Investigation; Methodology; Software - testing; Validation; Writing &#8211; original draft</p>
<p><bold>Benjamin Farnbacher:</bold> Data curation; Investigation; Writing &#8211; original draft (Introduction and Conclusions)</p>
<p><bold>Nils Hoppe:</bold> Conceptualization; Investigation; Methodology; Software - development and design</p>
<p><bold>Radoslav Ralev:</bold> Software - design, development, implementation and testing</p>
<p><bold>Vasiliki Sdralia:</bold> Writing &#8211; original draft (Introduction)</p>
<p><bold>Christian Stemmer:</bold> Funding acquisition; Resources; Supervision; Writing &#8211; review and editing of original</p>
</sec>
<ref-list>
<ref id="B1"><label>[1]</label><mixed-citation publication-type="journal"><string-name><given-names>P. B.</given-names> <surname>Heidorn</surname></string-name>, <article-title>&#8220;Shedding Light on the Dark Data in the Long Tail of Science,&#8221;</article-title> <source>Library Trends</source>, vol. <volume>57</volume>, no. <issue>2</issue>, pp. <fpage>280</fpage>&#8211;<lpage>299</lpage>, <year>2008</year>. DOI: <pub-id pub-id-type="doi">10.1353/lib.0.0036</pub-id>.</mixed-citation></ref>
<ref id="B2"><label>[2]</label><mixed-citation publication-type="journal"><string-name><given-names>B.</given-names> <surname>Schembera</surname></string-name> and <string-name><given-names>J. M.</given-names> <surname>Dur&#224;n</surname></string-name>, <article-title>&#8220;Dark Data as the New Challenge for Big Data Science and the Introduction of the Scientific Data Officer,&#8221;</article-title> <source>Philosophy &amp; Technology</source>, vol. <volume>33</volume>, pp. <fpage>93</fpage>&#8211;<lpage>115</lpage>, <year>2020</year>. DOI: <pub-id pub-id-type="doi">10.1007/s13347-019-00346-x</pub-id>.</mixed-citation></ref>
<ref id="B3"><label>[3]</label><mixed-citation publication-type="webpage"><collab>NFDI4Ing Consortium</collab>. <article-title>&#8220;Website.&#8221;</article-title> (<year>2022</year>), [Online]. Available: <uri>https://nfdi4ing.de</uri>.</mixed-citation></ref>
<ref id="B4"><label>[4]</label><mixed-citation publication-type="webpage"><collab>Metadata4Ing Workgroup</collab>. <article-title>&#8220;Metadata4ing: An ontology for describing the generation of research data within a scientific activity.&#8221;</article-title> (<year>2022</year>), [Online]. Available: <uri>https://nfdi4ing.pages.rwth-aachen.de/metadata4ing/metadata4ing/index.html#ref</uri>.</mixed-citation></ref>
<ref id="B5"><label>[5]</label><mixed-citation publication-type="webpage"><collab>DCMI Usage Board</collab>. <article-title>&#8220;Dcmi metadata terms.&#8221;</article-title> (<year>2020</year>), [Online]. Available: <uri>https://www.dublincore.org/specifications/dublin-core/dcmi-terms/</uri>.</mixed-citation></ref>
<ref id="B6"><label>[6]</label><mixed-citation publication-type="webpage"><string-name><surname>Lebo</surname>, <given-names>Timothy</given-names></string-name> and <string-name><surname>Sahoo</surname>, <given-names>Satya</given-names></string-name> and <string-name><surname>McGuinness</surname>, <given-names>Deborah</given-names></string-name>. <article-title>&#8220;Prov-o: The prov ontology.&#8221;</article-title> (<year>2013</year>), [Online]. Available: <uri>https://www.w3.org/TR/prov-o/</uri>.</mixed-citation></ref>
<ref id="B7"><label>[7]</label><mixed-citation publication-type="book"><string-name><given-names>S.</given-names> <surname>Liang</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Holmes</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Antoniou</surname></string-name>, and <string-name><given-names>J.</given-names> <surname>Higgins</surname></string-name>, <chapter-title>&#8220;Icurate: A research data management system,&#8221;</chapter-title> in <source>Multi-disciplinary Trends in Artificial Intelligence</source>, <string-name><given-names>A.</given-names> <surname>Bikakis</surname></string-name> and <string-name><given-names>X.</given-names> <surname>Zheng</surname></string-name>, Eds., <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>, <year>2015</year>, pp. <fpage>39</fpage>&#8211;<lpage>47</lpage>, ISBN: 978-3-319-26181-2. DOI: <pub-id pub-id-type="doi">10.1007/978-3-319-26181-2_4</pub-id>.</mixed-citation></ref>
<ref id="B8"><label>[8]</label><mixed-citation publication-type="journal"><string-name><given-names>C. S.</given-names> <surname>Adorf</surname></string-name>, <string-name><given-names>P. M.</given-names> <surname>Dodd</surname></string-name>, <string-name><given-names>V.</given-names> <surname>Ramasubramani</surname></string-name>, and <string-name><given-names>S. C.</given-names> <surname>Glotzer</surname></string-name>, <article-title>&#8220;Simple data and workflow management with the signac framework,&#8221;</article-title> <source>Computational Materials Science</source>, vol. <volume>146</volume>, pp. <fpage>220</fpage>&#8211;<lpage>229</lpage>, <year>2018</year>, ISSN: 0927-0256. DOI: <pub-id pub-id-type="doi">10.1016/j.commatsci.2018.01.035</pub-id>.</mixed-citation></ref>
<ref id="B9"><label>[9]</label><mixed-citation publication-type="journal"><string-name><given-names>L.</given-names> <surname>Nagel</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Lycklama</surname></string-name>, <article-title>&#8220;Design principles for data spaces - position paper,&#8221;</article-title> version 1.0, <year>2021</year>. DOI: <pub-id pub-id-type="doi">10.5281/zenodo.5105744</pub-id>.</mixed-citation></ref>
<ref id="B10"><label>[10]</label><mixed-citation publication-type="book"><string-name><given-names>T. J.</given-names> <surname>Skluzacek</surname></string-name>, <chapter-title>&#8220;Dredging a data lake: Decentralized metadata extraction,&#8221;</chapter-title> in <source>Proceedings of the 20th International Middleware Conference Doctoral Symposium</source>, ser. Middleware &#8217;19, <publisher-loc>Davis, California</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>, <year>2019</year>, pp. <fpage>51</fpage>&#8211;<lpage>53</lpage>, ISBN: 9781450370394. DOI: <pub-id pub-id-type="doi">10.1145/3366624.3368170</pub-id>.</mixed-citation></ref>
<ref id="B11"><label>[11]</label><mixed-citation publication-type="book"><string-name><given-names>T. J.</given-names> <surname>Skluzacek</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Chard</surname></string-name>, <string-name><given-names>R.</given-names> <surname>Wong</surname></string-name>, <italic>et al.</italic>, <chapter-title>&#8220;Serverless workflows for indexing large scientific data,&#8221;</chapter-title> in <source>Proceedings of the 5th International Workshop on Serverless Computing</source>, ser. WOSC &#8217;19, <publisher-loc>Davis, CA, USA</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>, <year>2019</year>, pp. <fpage>43</fpage>&#8211;<lpage>48</lpage>, ISBN: 9781450370387. DOI: <pub-id pub-id-type="doi">10.1145/3366623.3368140</pub-id>.</mixed-citation></ref>
<ref id="B12"><label>[12]</label><mixed-citation publication-type="webpage"><string-name><given-names>J.</given-names> <surname>Dixon</surname></string-name>. <article-title>&#8220;Pentaho, hadoop, and data lakes.&#8221;</article-title> (<year>2010</year>), [Online]. Available: <uri>https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/</uri>.</mixed-citation></ref>
<ref id="B13"><label>[13]</label><mixed-citation publication-type="journal"><string-name><given-names>D.</given-names> <surname>Merkel</surname></string-name>, <article-title>&#8220;Docker: Lightweight linux containers for consistent development and deployment,&#8221;</article-title> <source>Linux Journal</source>, vol. <volume>2014</volume>, no. <issue>239</issue>, p. <fpage>2</fpage>, <year>2014</year>.</mixed-citation></ref>
<ref id="B14"><label>[14]</label><mixed-citation publication-type="journal"><string-name><given-names>I.</given-names> <surname>Foster</surname></string-name>, <article-title>&#8220;Globus online: Accelerating and democratizing science through cloud-based services,&#8221;</article-title> <source>IEEE Internet Computing</source>, vol. <volume>15</volume>, no. <issue>3</issue>, pp. <fpage>70</fpage>&#8211;<lpage>73</lpage>, <year>2011</year>. DOI: <pub-id pub-id-type="doi">10.1109/MIC.2011.64</pub-id>.</mixed-citation></ref>
<ref id="B15"><label>[15]</label><mixed-citation publication-type="journal"><string-name><given-names>B.</given-names> <surname>Allen</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Bresnahan</surname></string-name>, <string-name><given-names>L.</given-names> <surname>Childers</surname></string-name>, <italic>et al.</italic>, <article-title>&#8220;Software as a service for data scientists,&#8221;</article-title> <source>Communications of the ACM</source>, vol. <volume>55</volume>, no. <issue>2</issue>, pp. <fpage>81</fpage>&#8211;<lpage>88</lpage>, <year>2012</year>. DOI: <pub-id pub-id-type="doi">10.1145/2076450.2076468</pub-id>.</mixed-citation></ref>
<ref id="B16"><label>[16]</label><mixed-citation publication-type="journal"><string-name><given-names>B.</given-names> <surname>Schembera</surname></string-name>, <article-title>&#8220;Like a rainbow in the dark: Metadata annotation for HPC applications in the age of dark data,&#8221;</article-title> <source>Journal of Supercomputing</source>, vol. <volume>77</volume>, pp. <fpage>8946</fpage>&#8211;<lpage>8966</lpage>, <year>2021</year>. DOI: <pub-id pub-id-type="doi">10.1007/s11227-020-03602-6</pub-id>.</mixed-citation></ref>
<ref id="B17"><label>[17]</label><mixed-citation publication-type="book"><string-name><given-names>B.</given-names> <surname>Schembera</surname></string-name> and <string-name><given-names>D.</given-names> <surname>Iglezakis</surname></string-name>, <chapter-title>&#8220;The Genesis of EngMeta - A Metadata Model for Research Data in Computational Engineering,&#8221;</chapter-title> in <source>Metadata and Semantic Research</source>, <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer International Publishing</publisher-name>, <year>2019</year>, pp. <fpage>127</fpage>&#8211;<lpage>132</lpage>, ISBN: 978-3-030-14401-2. DOI: <pub-id pub-id-type="doi">10.1007/978-3-030-14401-2_12</pub-id>.</mixed-citation></ref>
<ref id="B18"><label>[18]</label><mixed-citation publication-type="book"><string-name><given-names>S.</given-names> <surname>Padhy</surname></string-name>, <string-name><given-names>G.</given-names> <surname>Jansen</surname></string-name>, <string-name><given-names>J.</given-names> <surname>Alameda</surname></string-name>, <italic>et al.</italic>, <chapter-title>&#8220;Brown dog: Leveraging everything towards autocuration,&#8221;</chapter-title> in <source>IEEE International Conference on Big Data (Big Data)</source>, <year>2015</year>, pp. <fpage>493</fpage>&#8211;<lpage>500</lpage>. DOI: <pub-id pub-id-type="doi">10.1109/BigData.2015.7363791</pub-id>.</mixed-citation></ref>
<ref id="B19"><label>[19]</label><mixed-citation publication-type="book"><string-name><given-names>G. P.</given-names> <surname>Rodrigo</surname></string-name>, <string-name><given-names>M.</given-names> <surname>Henderson</surname></string-name>, <string-name><given-names>G. H.</given-names> <surname>Weber</surname></string-name>, <string-name><given-names>C.</given-names> <surname>Ophus</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Antypas</surname></string-name>, and <string-name><given-names>L.</given-names> <surname>Ramakrishnan</surname></string-name>, <chapter-title>&#8220;ScienceSearch: Enabling search through automatic metadata generation,&#8221;</chapter-title> in <source>IEEE 14th International Conference on e-Science (e-Science)</source>, <year>2018</year>, pp. <fpage>93</fpage>&#8211;<lpage>104</lpage>. DOI: <pub-id pub-id-type="doi">10.1109/eScience.2018.00025</pub-id>.</mixed-citation></ref>
<ref id="B20"><label>[20]</label><mixed-citation publication-type="webpage"><string-name><given-names>K.</given-names> <surname>Frey</surname></string-name>, <string-name><given-names>K.</given-names> <surname>Schneider</surname></string-name>, <string-name><given-names>O.</given-names> <surname>Maus</surname></string-name>, and <string-name><given-names>T.</given-names> <surname>M&#252;hlhaus</surname></string-name>. <article-title>&#8220;Swate: A swate workflow annotation tool for excel.&#8221;</article-title> (<year>2022</year>), [Online]. Available: <uri>https://github.com/nfdi4plants/Swate</uri>.</mixed-citation></ref>
<ref id="B21"><label>[21]</label><mixed-citation publication-type="journal"><collab>UK Data Service, modified by TUM University Library (UB)</collab>. <article-title>&#8220;Data life cycle - icons.&#8221;</article-title> (<year>2022</year>).</mixed-citation></ref>
<ref id="B22"><label>[22]</label><mixed-citation publication-type="journal"><string-name><given-names>M. A.</given-names> <surname>Musen</surname></string-name>, <article-title>&#8220;The prot&#233;g&#233; project: A look back and a look forward,&#8221;</article-title> <source>AI Matters</source>, vol. <volume>1</volume>, no. <issue>4</issue>, pp. <fpage>4</fpage>&#8211;<lpage>12</lpage>, <year>2015</year>. DOI: <pub-id pub-id-type="doi">10.1145/2757001.2757003</pub-id>.</mixed-citation></ref>
<ref id="B23"><label>[23]</label><mixed-citation publication-type="book"><string-name><given-names>J.</given-names> <surname>Vos</surname></string-name>, <string-name><given-names>N.</given-names> <surname>Duquesne</surname></string-name>, and <string-name><given-names>H. J.</given-names> <surname>Lee</surname></string-name>, <chapter-title>&#8220;Shock wave boundary layer interaction studies using the NSMB flow solver,&#8221;</chapter-title> in <source>3rd European Symposium on Aerothermodynamics for Space Vehicles, ESA SP-426</source>, <year>1999</year>.</mixed-citation></ref>
<ref id="B24"><label>[24]</label><mixed-citation publication-type="webpage"><collab>University of Stuttgart</collab>. <article-title>&#8220;Darus.&#8221;</article-title> (<year>2022</year>), [Online]. Available: <uri>https://www.izus.uni-stuttgart.de/en/fokus/darus/</uri>.</mixed-citation></ref>
<ref id="B25"><label>[25]</label><mixed-citation publication-type="webpage"><collab>RWTH Aachen University</collab>. <article-title>&#8220;Coscine.&#8221;</article-title> (<year>2022</year>), [Online]. Available: <uri>https://coscine.rwth-aachen.de</uri>.</mixed-citation></ref>
<ref id="B26"><label>[26]</label><mixed-citation publication-type="webpage"><collab>RWTH Aachen University</collab>. <article-title>&#8220;Aims &#8211; applying interoperable metadata standards.&#8221;</article-title> (), [Online]. Available: <uri>https://www.wzl.rwth-aachen.de/cms/wzl/Forschung/Forschungsumfeld/Forschungsprojekte/Projekte/~ivong/ProMiDigit-Process-Mining-fuer-No-Code/</uri>.</mixed-citation></ref>
<ref id="B27"><label>[27]</label><mixed-citation publication-type="webpage"><collab>NFDI4Ing Consortium</collab>. <article-title>&#8220;Metadata hub.&#8221;</article-title> (<year>2022</year>), [Online]. Available: <uri>https://git.rwth-aachen.de/nfdi4ing/s-3/s-3-3/metadatahub</uri>.</mixed-citation></ref>
</ref-list>
</back>
</article>