Plain Text & Character Encoding: A Primer for Data Curators. Erickson, S. Journal of eScience Librarianship, 10(3):1211, August, 2021.
Plain text data consists of a sequence of encoded characters or “code points” from a given standard such as the Unicode Standard. Some of the most common file formats for digital data used in eScience (CSV, XML, and JSON, for example) are built atop plain text standards. Plain text representations of digital data are often preferred because plain text formats are relatively stable, and they facilitate reuse and interoperability. Despite its ubiquity, plain text is not as plain as it may seem. The set of standards used in modern text encoding (principally, the Unicode Character Set and the related encoding format, UTF-8) have complex architectures when compared to historical standards like ASCII. Further, while the Unicode standard has gained in prominence, text encoding problems are not uncommon in research data curation. This primer provides conceptual foundations for modern text encoding and guidance for common curation and preservation actions related to textual data.
@article{erickson_plain_2021,
	title = {Plain {Text} \& {Character} {Encoding}: {A} {Primer} for {Data} {Curators}},
	volume = {10},
	issn = {2161-3974},
	shorttitle = {Plain {Text} \& {Character} {Encoding}},
	url = {https://escholarship.umassmed.edu/jeslib/vol10/iss3/12/},
	doi = {10.7191/jeslib.2021.1211},
	abstract = {Plain text data consists of a sequence of encoded characters or “code points” from a given standard such as the Unicode Standard. Some of the most common file formats for digital data used in eScience (CSV, XML, and JSON, for example) are built atop plain text standards. Plain text representations of digital data are often preferred because plain text formats are relatively stable, and they facilitate reuse and interoperability. Despite its ubiquity, plain text is not as plain as it may seem. The set of standards used in modern text encoding (principally, the Unicode Character Set and the related encoding format, UTF-8) have complex architectures when compared to historical standards like ASCII. Further, while the Unicode standard has gained in prominence, text encoding problems are not uncommon in research data curation. This primer provides conceptual foundations for modern text encoding and guidance for common curation and preservation actions related to textual data.},
	number = {3},
	urldate = {2022-05-05},
	journal = {Journal of eScience Librarianship},
	author = {Erickson, Seth},
	month = aug,
	year = {2021},
	pages = {1211},
}
