Library unicode

Library methods for handling Unicode strings.

Author:

  • Daniel Miller

Copyright © Same as Nmap--See https://nmap.org/book/man-legal.html

Source: https://svn.nmap.org/nmap/nselib/unicode.lua

Functions

chardet (buf, len)

Determine (poorly) the character encoding of a string

cp437_dec (buf, pos)

Decodes a CP437 character

cp437_enc (cp)

Encode a Unicode code point to CP437

decode (buf, decoder, bigendian)

Decode a buffer containing Unicode data.

encode (list, encoder, bigendian)

Encode a list of Unicode code points

transcode (buf, decoder, encoder, bigendian_dec, bigendian_enc)

Transcode a string from one format to another

utf16_dec (buf, pos, bigendian)

Decodes a UTF-16 character.

utf16_enc (cp, bigendian)

Encode a Unicode code point to UTF-16. See RFC 2781.

utf16to8 (from)

Helper function for the common case of UTF-16 to UTF-8 transcoding, such as from a Windows/SMB Unicode string to a printable ASCII (subset of UTF-8) string.

utf8_dec (buf, pos)

Decodes a UTF-8 character.

utf8_enc (cp)

Encode a Unicode code point to UTF-8. See RFC 3629.

utf8to16 (from)

Helper function for the common case of UTF-8 to UTF-16 transcoding, such as from a printable ASCII (subset of UTF-8) string to a Windows/SMB Unicode string.

Functions

chardet (buf, len)

Determine (poorly) the character encoding of a string

First, the string is checked for a byte order mark (BOM). If one is present, it identifies the encoding as UTF-8 or as UTF-16 with a specific endianness. If no BOM is found, the bytes of the string are examined.

If null bytes are encountered, UTF-16 is assumed. Endianness is determined by byte position, assuming the null is the high-order byte. Otherwise, if byte values over 127 are found, UTF-8 decoding is attempted. If this fails, the result is 'other', otherwise it is 'utf-8'. If no high bytes are found, the result is 'ascii'.
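The heuristic described above can be sketched in a few lines. This is an illustrative Python sketch, not the library's Lua implementation; the function name `chardet_sketch` and the use of an incremental decoder (so that a sample cut off in the middle of a multi-byte sequence is not rejected) are assumptions made for the example.

```python
import codecs


def chardet_sketch(buf: bytes, length: int = 100) -> str:
    """Guess an encoding label using the BOM-then-content heuristic."""
    # Step 1: a byte order mark settles the question immediately.
    if buf.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if buf.startswith(codecs.BOM_UTF16_BE):
        return "utf-16be"
    if buf.startswith(codecs.BOM_UTF16_LE):
        return "utf-16le"

    # Step 2: inspect at most `length` bytes of content.
    sample = buf[:length]
    high_bytes = False
    for i, b in enumerate(sample):
        if b == 0:
            # The null is assumed to be the high-order byte of a 16-bit unit.
            # At an even 0-based offset it comes first, which means big-endian.
            return "utf-16be" if i % 2 == 0 else "utf-16le"
        if b > 127:
            high_bytes = True

    if not high_bytes:
        return "ascii"

    # Step 3: high bytes present, so try UTF-8. final=False tolerates a
    # sequence that was truncated by the sample length.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(sample, final=False)
    except UnicodeDecodeError:
        return "other"
    return "utf-8"
```

Because the null-byte test runs before the UTF-8 test, plain ASCII text stored as UTF-16 without a BOM is still classified as UTF-16, while text in a legacy 8-bit code page typically falls through to 'other'.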

Parameters

buf
The string/buffer to be identified
len
The number of bytes to inspect in order to identify the string. Default: 100

Return value:

A string describing the encoding: 'ascii', 'utf-8', 'utf-16be', 'utf-16le', or 'other' meaning some unidentified 8-bit encoding

cp437_dec (buf, pos)

Decodes a CP437 character

Parameters

buf
A string containing the character
pos
The index in the string where the character begins

Return values:

  1. pos The index in the string where the character ended
  2. cp The code point of the character as a number

cp437_enc (cp)

Encode a Unicode code point to CP437

Returns nil if the code point has no equivalent in CP437.
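The "encode or return nothing" behaviour can be illustrated with Python's built-in `cp437` codec. This is a sketch only: the name `cp437_enc_sketch` is hypothetical, `None` stands in for Lua's nil, and Python's mapping table may differ from the library's in places such as the control-character range.

```python
from typing import Optional


def cp437_enc_sketch(cp: int) -> Optional[bytes]:
    """Return the single CP437 byte for a code point, or None if unmapped."""
    if not 0 <= cp <= 0x10FFFF:
        return None  # outside the Unicode code space entirely
    try:
        return chr(cp).encode("cp437")
    except UnicodeEncodeError:
        # The code point exists in Unicode but not in the CP437 table.
        return None
```

For instance, U+00E9 (é) maps to the byte 0x82, while U+20AC (the euro sign) has no CP437 equivalent and yields `None`.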

Parameters

cp
The Unicode code point as a number

Return value:

A string containing the related CP437 character

decode (buf, decoder, bigendian)

Decode a buffer containing Unicode data.

Parameters

buf
The string/buffer to be decoded
decoder
A Unicode decoder function (such as utf8_dec)
bigendian
For encodings that care about byte-order (such as UTF-16), set this to true to force big-endian byte order. Default: false (little-endian)

Return value:

A list-table containing the code points as numbers

encode (list, encoder, bigendian)

Encode a list of Unicode code points

Parameters

list
A list-table of code points as numbers
encoder
A Unicode encoder function (such as utf8_enc)
bigendian
For encodings that care about byte-order (such as UTF-16), set this to true to force big-endian byte order. Default: false (little-endian)

Return value:

An encoded string

transcode (buf, decoder, encoder, bigendian_dec, bigendian_enc)

Transcode a string from one format to another

The string is decoded and re-encoded in a single pass. This saves some overhead compared with passing the output of unicode.decode to unicode.encode.
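The single-pass idea can be sketched as a loop that decodes one character and immediately encodes it, so no intermediate list of code points is built. The Python below is illustrative only; `transcode_sketch`, `utf16le_dec`, and `utf8_enc` are hypothetical names, positions are 0-based, and the decoder is assumed to return the position of the next character along with the code point.

```python
from typing import Callable, Tuple

Decoder = Callable[[bytes, int], Tuple[int, int]]  # (buf, pos) -> (next_pos, cp)
Encoder = Callable[[int], bytes]                   # cp -> encoded bytes


def transcode_sketch(buf: bytes, decoder: Decoder, encoder: Encoder) -> bytes:
    """Decode and re-encode one character at a time."""
    out = []
    pos = 0
    while pos < len(buf):
        pos, cp = decoder(buf, pos)  # read one character
        out.append(encoder(cp))      # write it straight back out
    return b"".join(out)


def utf16le_dec(buf: bytes, pos: int) -> Tuple[int, int]:
    """Minimal little-endian UTF-16 decoder (no surrogate validation)."""
    unit = int.from_bytes(buf[pos:pos + 2], "little")
    pos += 2
    if 0xD800 <= unit <= 0xDBFF:
        trail = int.from_bytes(buf[pos:pos + 2], "little")
        pos += 2
        return pos, 0x10000 + ((unit - 0xD800) << 10) + (trail - 0xDC00)
    return pos, unit


def utf8_enc(cp: int) -> bytes:
    """Encode one code point to UTF-8 using Python's built-in codec."""
    return chr(cp).encode("utf-8")
```

Calling `transcode_sketch(data, utf16le_dec, utf8_enc)` on a little-endian UTF-16 byte string mirrors what the utf16to8 helper is described as doing.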

Parameters

buf
The string/buffer to be transcoded
decoder
A Unicode decoder function (such as utf16_dec)
encoder
A Unicode encoder function (such as utf8_enc)
bigendian_dec
Set this to true to force big-endian decoding.
bigendian_enc
Set this to true to force big-endian encoding.

Return value:

An encoded string

utf16_dec (buf, pos, bigendian)

Decodes a UTF-16 character.

Does not check that the returned code point is a real character. Specifically, it can be fooled by out-of-order lead- and trail-surrogate characters.
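The caveat about surrogates is easier to see with a concrete decoder. The following Python sketch assumes a decoder that treats any 16-bit unit in the surrogate range as the first half of a pair without checking the order; whether the library works exactly this way is an assumption, and the name `utf16_dec_sketch` and the 0-based positions are specific to this example.

```python
from typing import Tuple


def utf16_dec_sketch(buf: bytes, pos: int = 0, bigendian: bool = False) -> Tuple[int, int]:
    """Decode one UTF-16 character; return (next_pos, code_point)."""
    order = "big" if bigendian else "little"
    unit = int.from_bytes(buf[pos:pos + 2], order)
    pos += 2
    if 0xD800 <= unit <= 0xDFFF:
        # Any surrogate is treated as a lead surrogate: no check that it is
        # in D800-DBFF or that the next unit is a trail surrogate (DC00-DFFF).
        high = (unit - 0xD800) << 10
        trail = int.from_bytes(buf[pos:pos + 2], order)
        pos += 2
        return pos, 0x10000 + high + (trail - 0xDC00)
    return pos, unit
```

With a well-formed pair such as D83D DE00 the result is U+1F600. If the two units arrive swapped (DE00 D83D), the same arithmetic silently produces a number above 0x10FFFF, which is not a valid code point.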

Parameters

buf
A string containing the character
pos
The index in the string where the character begins
bigendian
Set this to true to decode big-endian UTF-16. Default is false (little-endian)

Return values:

  1. pos The index in the string where the character ended
  2. cp The code point of the character as a number

utf16_enc (cp, bigendian)

Encode a Unicode code point to UTF-16. See RFC 2781.

Windows versions prior to Windows 2000 support only UCS-2, so be careful when using this function to encode code points above 0xFFFF.
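The reason code points above 0xFFFF matter is that UTF-16 represents them as a surrogate pair of two 16-bit units, which a UCS-2-only consumer would read as two separate characters. The Python sketch below follows the encoding procedure in RFC 2781; the name `utf16_enc_sketch` is hypothetical and no range validation is attempted.

```python
def utf16_enc_sketch(cp: int, bigendian: bool = False) -> bytes:
    """Encode one code point as UTF-16 (2 bytes, or 4 for a surrogate pair)."""
    order = "big" if bigendian else "little"
    if cp <= 0xFFFF:
        # Basic Multilingual Plane: a single 16-bit unit, identical to UCS-2.
        return cp.to_bytes(2, order)
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    lead = 0xD800 | (offset >> 10)     # high 10 bits
    trail = 0xDC00 | (offset & 0x3FF)  # low 10 bits
    return lead.to_bytes(2, order) + trail.to_bytes(2, order)
```

For U+1F600 the offset is 0xF600, giving the lead surrogate 0xD83D and the trail surrogate 0xDE00.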

Parameters

cp
The Unicode code point as a number
bigendian
Set this to true to encode big-endian UTF-16. Default is false (little-endian)

Return value:

A string containing the code point in UTF-16 encoding.

utf16to8 (from)

Helper function for the common case of UTF-16 to UTF-8 transcoding, such as from a Windows/SMB Unicode string to a printable ASCII (subset of UTF-8) string.

Parameters

from
A string in UTF-16, little-endian

Return value:

The string in UTF-8

utf8_dec (buf, pos)

Decodes a UTF-8 character.

Does not check that the returned code point is a real character.
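A structural UTF-8 decoder of this kind can be sketched as follows. The Python is illustrative rather than a copy of the library's code: `utf8_dec_sketch` is a hypothetical name, positions are 0-based, the error strings are invented for the example, and `None` plays the role of Lua's nil in the first return value.

```python
def utf8_dec_sketch(buf: bytes, pos: int = 0):
    """Decode one UTF-8 character; return (next_pos, cp) or (None, message)."""
    first = buf[pos]
    if first <= 0x7F:
        return pos + 1, first           # single-byte (ASCII) character
    if 0xC0 <= first <= 0xDF:
        n, cp = 1, first & 0x1F         # 2-byte sequence
    elif 0xE0 <= first <= 0xEF:
        n, cp = 2, first & 0x0F         # 3-byte sequence
    elif 0xF0 <= first <= 0xF7:
        n, cp = 3, first & 0x07         # 4-byte sequence
    else:
        return None, "invalid lead byte"
    tail = buf[pos + 1:pos + 1 + n]
    if len(tail) < n:
        return None, "truncated sequence"
    for b in tail:
        if b & 0xC0 != 0x80:
            return None, "invalid continuation byte"
        cp = (cp << 6) | (b & 0x3F)     # append 6 payload bits
    # Only the byte structure was checked; cp may still be a surrogate
    # or another value that is not a real character.
    return pos + 1 + n, cp
```

The bytes ED A0 80, for example, are structurally valid and decode to 0xD800, a surrogate rather than a character, which is the kind of result the note above warns about.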

Parameters

buf
A string containing the character
pos
The index in the string where the character begins

Return values:

  1. pos The index in the string where the character ended or nil on error
  2. cp The code point of the character as a number, or an error string

utf8_enc (cp)

Encode a Unicode code point to UTF-8. See RFC 3629.

Does not check that cp is a real character; that is, it does not exclude the surrogate range U+D800 to U+DFFF or a handful of other code points.
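The bit layout from RFC 3629 can be sketched directly. This Python version is illustrative and the name `utf8_enc_sketch` is hypothetical; like the function documented here, it performs no validity check, so surrogates are encoded like any other value.

```python
def utf8_enc_sketch(cp: int) -> bytes:
    """Encode one code point to UTF-8 without validating it."""
    if cp <= 0x7F:
        return bytes([cp])                          # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

Passing 0xD800 produces the bytes ED A0 80, which a strict UTF-8 consumer would reject, so callers that need well-formed output have to filter such values themselves.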

Parameters

cp
The Unicode code point as a number

Return value:

A string containing the code point in UTF-8 encoding.

utf8to16 (from)

Helper function for the common case of UTF-8 to UTF-16 transcoding, such as from a printable ASCII (subset of UTF-8) string to a Windows/SMB Unicode string.

Parameters

from
A string in UTF-8

Return value:

The string in UTF-16, little-endian