Computed Geomaterials Structured Vocabulary
  (A Data Release Site)
Summary: Using lexical and statistical methods, the taxonomic (hierarchical) relations among a large number of geomaterials concepts are derived and rendered as adjacency matrices and ontologies. In the process, word terms are characterized for use in Natural Language Processing and other applications. This site serves the data from this project. The data are made available for review of the products, and potentially so that applications can be designed around them.
NSF is thanked for supporting the program:
Awards EAR 1242909 ('Dark Data') and 1047776 ('Seamless Strandline').

Explanation

The state of vocabularies for earth landscape materials - geomaterials - is chaotic. Over 10,000 different names are thought to exist, and the fields of geology, geomorphology, pedology and agriculture, foundations engineering, cryology and glaciology, marine geology, benthic habitats, and coastal survey each have their own vocabularies. A structured vocabulary would open very large opportunities for data mining and data integration around the single issue of earth surface materials.

The vocabulary was compiled from many glossaries, dictionaries, thesauri, schemas and data models. The sources defined or described rock lithologies, sediments, soils, fluids, landscapes and habitats, and ice formations. Over 3600 terms are represented, drawn from 18 different but overlapping linguistic corpora.

The motivations were several: (i) to have a resource that could be used to identify documents and datasets relevant to the geosciences, particularly in the detection of 'dark data'; (ii) to organize geomaterials terms as a semantic net, identifying the similarity and hierarchic relationships between their concepts; (iii) to investigate whether a semantic approach to lithologies could improve the way dbSEABED handles word-based data.

On the last of these, the dictionary for dbSEABED now holds over 15,000 terms (including clichéd phrases) and is becoming unwieldy. There is potential to automate the dictionary and its processing with methods such as Natural Language Processing (NLP) and WordNet. However, the results need to be of very high reliability because dbSEABED is used for real-world decision-making and risk assessments.
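To make the idea of hierarchic relationships held as an adjacency matrix concrete, here is a minimal sketch in Python. It is not the project's software: the term pairs, the function names `build_adjacency` and `ancestors`, and the row-is-child, column-is-parent convention are all assumptions chosen for illustration, and the released data may be laid out differently.

```python
# Illustrative (child, parent) "is-a" pairs. These are hypothetical examples,
# not entries copied from the released vocabulary.
IS_A = [
    ("granite", "igneous rock"),
    ("basalt", "igneous rock"),
    ("igneous rock", "rock"),
    ("sandstone", "sedimentary rock"),
    ("sedimentary rock", "rock"),
]


def build_adjacency(pairs):
    """Return (terms, matrix) where matrix[i][j] == 1 means terms[i] is-a terms[j]."""
    terms = sorted({term for pair in pairs for term in pair})
    index = {term: i for i, term in enumerate(terms)}
    size = len(terms)
    matrix = [[0] * size for _ in range(size)]
    for child, parent in pairs:
        matrix[index[child]][index[parent]] = 1
    return terms, matrix


def ancestors(term, terms, matrix):
    """Collect every broader term reachable from `term` by following is-a links."""
    index = {t: i for i, t in enumerate(terms)}
    found = []
    seen = set()
    stack = [index[term]]
    while stack:
        row = stack.pop()
        for col, flag in enumerate(matrix[row]):
            if flag and col not in seen:
                seen.add(col)
                found.append(terms[col])
                stack.append(col)
    return found


terms, matrix = build_adjacency(IS_A)
print(ancestors("granite", terms, matrix))
```

A dense list-of-lists is used only to keep the sketch dependency-free; with thousands of terms a sparse representation would likely be more practical.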

Served Items

a. Documentation of the latest developments in using lexical, nomenclatural and statistical methods to mine for vocabulary, structure (taxonomy), and ontology. ["http://instaar.colorado.edu/~jenkinsc/dbseabed/resources/geomaterials/GeomaterialsVocab.pdf"]

b. A zipfile of some of the data products as explained in the documentation. ["http://instaar.colorado.edu/~jenkinsc/dbseabed/resources/geomaterials/GeomaterialsVocab.zip"]

Send queries or comments to "chris.jenkins [at] colorado.edu".

Continuation

This work is continuing on a collaborative basis, to extend the vocabulary, deepen the structuring, and extend the methods.






Graphviz compilation of the geomaterials terms from the core 'Keystone' corpus plus the WMO Sea-Ice glossary and the International Permafrost Glossary. This run of the software was assisted by the NSIDC, Boulder, CO, USA.

Above: the entire network, including orphan terms.
Below: a close-up of part of the network.
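For readers who want to draw a similar network, the sketch below shows one way parent-child term pairs and orphan terms could be written out as Graphviz DOT text. It is an assumption-laden illustration, not the software used for the figures: the function name `to_dot`, the graph name, the bottom-to-top layout, and the sample terms are all hypothetical.

```python
def to_dot(edges, orphan_terms=(), name="geomaterials"):
    """Return DOT source for a directed graph of child -> parent term links.

    Orphan terms (no links at all) are emitted as isolated nodes so they
    still appear in the rendered network.
    """
    def quote(term):
        # DOT quoted identifiers only need embedded double quotes escaped.
        return '"' + term.replace('"', '\\"') + '"'

    lines = [f"digraph {name} {{", "  rankdir=BT;"]  # broader terms drawn at the top
    for term in sorted(orphan_terms):
        lines.append(f"  {quote(term)};")
    for child, parent in edges:
        lines.append(f"  {quote(child)} -> {quote(parent)};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample terms, chosen only to illustrate the output format.
sample_edges = [("pancake ice", "sea ice"), ("sea ice", "ice")]
sample_orphans = ["permafrost"]
print(to_dot(sample_edges, sample_orphans))
```

Saving the printed text to a file such as `terms.dot` and running `dot -Tsvg terms.dot -o terms.svg` should produce a drawing, assuming Graphviz is installed.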




Author: Chris Jenkins
Date: 3 Apr 2014
Location: Boulder