The OpenD Programming Language

arsd.characterencodings

This is meant to help get data from the wild into utf8 strings so you can work with them easily inside D.

The main function is convertToUtf8(), which takes a byte array of your raw data (a byte array because it isn't really a D string until it is UTF-8) and a runtime string naming its current encoding.

The current encoding argument is meant to come from the data's metadata and is flexible about the exact format: it is case insensitive and accepts several variations on the names.

This way, you should be able to pass it the encoding string directly from an XML document, an HTTP header, or whatever you have, and it ought to just work.
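
As a rough illustration of where such a label might come from, the sketch below pulls the charset parameter out of an HTTP Content-Type value. The helper charsetFromContentType is hypothetical and not part of arsd.characterencodings; it uses only Phobos and does not try to cover every corner of the header grammar.

```d
import std.algorithm.iteration : splitter;
import std.algorithm.searching : startsWith;
import std.ascii : toLower;
import std.string : strip;

/// Hypothetical helper (not part of arsd.characterencodings): returns the
/// charset label from a Content-Type value such as
/// "text/html; charset=ISO-8859-1", or null if no charset is given.
string charsetFromContentType(string contentType)
{
    enum prefix = "charset=";
    foreach (param; contentType.splitter(';'))
    {
        auto p = param.strip;
        // Parameter names are case insensitive, so compare in lower case.
        if (p.startsWith!((a, b) => toLower(a) == b)(prefix))
        {
            auto value = p[prefix.length .. $].strip;
            // Drop optional surrounding double quotes.
            if (value.length >= 2 && value[0] == '"' && value[$ - 1] == '"')
                value = value[1 .. $ - 1];
            return value;
        }
    }
    return null;
}

unittest
{
    assert(charsetFromContentType("text/html; charset=ISO-8859-1") == "ISO-8859-1");
    assert(charsetFromContentType("text/plain") is null);
}
```

The returned label is meant to be handed to convertToUtf8 as its dataCharacterEncoding argument unchanged. When the helper returns null, the caller has to pick a default or inspect the content instead.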

Members

Functions

convertToUtf8
string convertToUtf8(immutable(ubyte)[] data, string dataCharacterEncoding)

Takes data in a given character encoding and returns it as UTF-8.

convertToUtf8Lossy
string convertToUtf8Lossy(immutable(ubyte)[] data, string dataCharacterEncoding)

Like convertToUtf8, but if the encoding is unknown, it strips all bytes greater than 127 and returns what is left instead of throwing.
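
The fallback described above can be pictured with a few lines of D. This is a minimal sketch of the stated behaviour, not the module's actual code, and stripNonAscii is a hypothetical name.

```d
import std.array : appender;

/// Illustrative only: keeps bytes 0..127 and drops everything else,
/// mirroring the fallback described for convertToUtf8Lossy.
string stripNonAscii(immutable(ubyte)[] data)
{
    auto result = appender!string();
    foreach (b; data)
    {
        // Bytes below 128 are plain ASCII and therefore already valid UTF-8.
        if (b < 128)
            result.put(cast(char) b);
    }
    return result.data;
}

unittest
{
    // "caf" followed by 0xE9 (e-acute in Latin-1), which gets dropped.
    immutable(ubyte)[] raw = [0x63, 0x61, 0x66, 0xE9];
    assert(stripNonAscii(raw) == "caf");
}
```

The result is always valid UTF-8, but every non-ASCII character is lost, so the lossy variant fits best where a readable approximation is preferable to an exception.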

tryToDetermineEncoding
string tryToDetermineEncoding(ubyte[] rawdata)

Tries to determine the current encoding based on the content. Only really helps with the UTF variants. Returns null if it can't be reasonably sure.
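
One common way to recognise the UTF variants from content alone is to look for a byte order mark. The sketch below shows that idea under stated assumptions: sniffBom is a hypothetical function, the labels it returns are this sketch's own choice, and nothing here describes the heuristics or return values tryToDetermineEncoding actually uses.

```d
import std.algorithm.searching : startsWith;

/// Hypothetical BOM sniffer. Returns a label for the detected UTF variant,
/// or null when the data does not start with a byte order mark.
string sniffBom(const(ubyte)[] raw)
{
    static immutable ubyte[] utf32le = [0xFF, 0xFE, 0x00, 0x00];
    static immutable ubyte[] utf32be = [0x00, 0x00, 0xFE, 0xFF];
    static immutable ubyte[] utf8    = [0xEF, 0xBB, 0xBF];
    static immutable ubyte[] utf16le = [0xFF, 0xFE];
    static immutable ubyte[] utf16be = [0xFE, 0xFF];

    // UTF-32 is tested before UTF-16 because the UTF-32LE mark begins
    // with the same two bytes (FF FE) as the UTF-16LE mark.
    if (raw.startsWith(utf32le)) return "UTF-32LE";
    if (raw.startsWith(utf32be)) return "UTF-32BE";
    if (raw.startsWith(utf8))    return "UTF-8";
    if (raw.startsWith(utf16le)) return "UTF-16LE";
    if (raw.startsWith(utf16be)) return "UTF-16BE";
    return null; // no mark found, so this sketch makes no guess
}

unittest
{
    immutable(ubyte)[] withBom = [0xEF, 0xBB, 0xBF, 0x68, 0x69];
    assert(sniffBom(withBom) == "UTF-8");
    immutable(ubyte)[] plain = [0x68, 0x69];
    assert(sniffBom(plain) is null);
}
```

Like the documented function, this returns null when it cannot be reasonably sure, which leaves the caller free to fall back on the encoding declared in the metadata.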

Examples

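import arsd.characterencodings;
import std.file;

// std.file.read returns void[]; cast it to the byte array convertToUtf8 expects
auto data = cast(immutable(ubyte)[])
	std.file.read("my-windows-file.txt");
string utf8String = convertToUtf8(data, "windows-1252");
// utf8String can now be used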

The encodings currently implemented for decoding are:

  • UTF-8 (a no-op; it simply casts the array to string)
  • UTF-16
  • UTF-32
  • Windows-1252
  • ISO 8859 parts 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, and 16
  • KOI8-R

It treats ISO 8859-1, Latin-1, and Windows-1252 the same way, since those labels are de facto interchangeable in documents found in the wild (people mislabel them a lot, and I found it more useful to just deal with it than to be pedantic).

This module currently makes no attempt to look at control characters.
