Addressing Data Quality with Pre-Digital Native Content (VIDEO)

Jan 21, 2020

By Joyce Wells

At Data Summit 2019, Bob Kasenchak, director of business development at Access Innovations, discussed how to manage the unique challenges of pre-digital source content.

According to Kasenchak, in scholarly publishing, starting sometime around 1996, most STEM-type publishers' data was born digital. "The XML is a native format that this content was produced and stored in," he said. "Content that was created too early to be natively digital has been transformed into XML from some previous format, maybe SGML, maybe PDFs, or physical copies that were then ported over to PDFs and turned into XML, in order to bring entire collections into the same XML format. Usually, this was done with a combination of automated and brute-force manual processes. So, it's very common for the structured field of data to be less than accurate in content that was pre-native digital. If you add to this the problems that are propagated by OCR, which I'm sure we all love, which has improved recently, but many organizations have OCR that was done in the '80s or even before that, and it can be a very unreliable conversion to digital text."

DBTA’s next Data Summit conference will be held May 19-20, 2020, in Boston, with pre-conference workshops on Monday, May 18.

As a result, he noted, "You have the potential for quite a bit of mess in what is supposed to be clean, structured data. This article from Nature is from 1869. So this leads us to the topic of data quality. The assumption is that since we have all this nice fielded data, whether it's in XML or another database or some other source, that the data contained in it is clean. But if there's anyone here at this data conference who's ever encountered a completely clean data set, I have yet to meet them. But for the type of scholarly content that I'm discussing in this talk, things like dates and journal information are pretty much unproblematic."

The problem really comes with the uncontrolled values, particularly things like names, he said. "I'll talk more about this in a second. Obviously, I think, extracting dirty data into another structure is simply asking for errors to be compounded. And using dirty data to make inferences and drive analysis is useless because you're not going to get correct answers out of your data. But I suspect at this particular conference, I don't need to preach to the choir about the value of clean data."

Many presenters have made their slide decks available on the Data Summit 2019 website at www.dbta.com/DataSummit/2019/Presentations.aspx.

To access the video of Kasenchak's full presentation, "From Structured Text to Knowledge Graphs: Creating RDF Triples From Published Scholarly Data," go to https://datasummit.brightcovegallery.com/detail/video/6040884584001/a204.-from-structured-text-to-knowledge-graphs:-creating-rdf-triples-from-published-scholarly-data?autoStart=true&q=kasenchak#links