Crunchbase Challenge. Alexiev, V. GitHub gist, March 2022.
Here's a challenge for the "Knowledge Graphs Construction" Community Group (KGC CG): Take Crunchbase: 14M rows across 18 tables, served as CSV and updated daily. The data of some nodes comes from multiple tables (e.g. Organization from organizations, org_parents, org_descriptions). RDFize and store the total dataset within 1-2 hours. Using the approach described here, GraphDB 9.11 with OntoRefine takes 76-119 minutes (1.3-2 hours), depending on hardware, to produce and load 138M triples (19-30k triples per second). Update the data daily, replacing the data of recently updated rows. Using the approach described here, it takes about 15 minutes to update all of Crunchbase. Do it with your favorite RDFization toolkit, and preferably do it declaratively.
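The throughput range quoted in the abstract follows directly from the triple count and the two load times. A minimal sketch of that arithmetic, using only the figures stated above (the function and variable names are illustrative, not taken from the gist):

```python
# Figures taken from the abstract: 138M triples, loaded in 76 to 119 minutes.
TOTAL_TRIPLES = 138_000_000
FASTEST_MINUTES = 76
SLOWEST_MINUTES = 119


def triples_per_second(total_triples: int, minutes: float) -> float:
    """Average load rate for a run that takes the given number of minutes."""
    return total_triples / (minutes * 60)


fast_rate = triples_per_second(TOTAL_TRIPLES, FASTEST_MINUTES)
slow_rate = triples_per_second(TOTAL_TRIPLES, SLOWEST_MINUTES)

# Report the range in thousands of triples per second, plus the run times in hours.
print(f"rate: {slow_rate / 1000:.1f}k to {fast_rate / 1000:.1f}k triples per second")
print(f"time: {FASTEST_MINUTES / 60:.1f} to {SLOWEST_MINUTES / 60:.1f} hours")
```

Both rates land inside the 19-30k triples per second band given in the abstract, and the run times round to the stated 1.3-2 hours.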
@Misc{Alexiev-Crunchbase-challenge-2022,
author = {Vladimir Alexiev},
title = {Crunchbase Challenge},
howpublished = {Github gist},
month = Mar,
year = 2022,
url = {https://gist.github.com/VladimirAlexiev/d5d67feb002dbcfa6b3d4c3dd59b52da},
date = {2022-03-24},
keywords = {Crunchbase, ontologies, knowledge graphs, KG construction, RDFization, declarative transformations, semantic model, semantic data integration, ETL, semantic conversion, declarative approaches},
abstract = {Here's a challenge for the "Knowledge Graphs Construction" Community Group (KGC CG):
Take Crunchbase: 14M rows, across 18 tables, served as CSV, updated daily.
The data of some nodes comes from multiple tables (e.g. Organization from organizations, org_parents, org_descriptions).
RDFize and store the total dataset within 1-2 hours.
Using the approach described here, GraphDB 9.11 with OntoRefine takes 76-119 minutes (1.3-2 hours) depending on hardware to produce and load 138M triples (19-30k triples per second).
Update the data daily, replacing the data of recently updated rows.
Using the approach described here, it takes about 15 minutes to update all of Crunchbase.
Do it with your favorite RDFization toolkit, and preferably do it declaratively.},
}