Library `unicode`

Library methods for handling unicode strings.

Author:

Daniel Miller

Copyright © Same as Nmap--See https://nmap.org/book/man-legal.html

Source: https://svn.nmap.org/nmap/nselib/unicode.lua

Functions

chardet (buf, len): Determine (poorly) the character encoding of a string
cp437_dec (buf, pos): Decodes a CP437 character
cp437_enc (cp): Encode a Unicode code point to CP437
decode (buf, decoder, bigendian): Decode a buffer containing Unicode data.
encode (list, encoder, bigendian): Encode a list of Unicode code points
transcode (buf, decoder, encoder, bigendian_dec, bigendian_enc): Transcode a string from one format to another
utf16_dec (buf, pos, bigendian): Decodes a UTF-16 character.
utf16_enc (cp, bigendian): Encode a Unicode code point to UTF-16. See RFC 2781.
utf16to8 (from): Helper function for the common case of UTF-16 to UTF-8 transcoding, such as from a Windows/SMB unicode string to a printable ASCII (subset of UTF-8) string.
utf8_dec (buf, pos): Decodes a UTF-8 character.
utf8_enc (cp): Encode a Unicode code point to UTF-8. See RFC 3629.
utf8to16 (from): Helper function for the common case of UTF-8 to UTF-16 transcoding, such as from a printable ASCII (subset of UTF-8) string to a Windows/SMB unicode string.

Functions

chardet (buf, len)

Determine (poorly) the character encoding of a string

First, the string is checked for a Byte-order Mark (BOM). A BOM identifies the string as UTF-8 or as UTF-16 with a specific endianness. If no BOM is found, the bytes themselves are examined.

If null bytes are encountered, UTF-16 is assumed. Endianness is determined by byte position, assuming the null is the high-order byte. Otherwise, if byte values over 127 are found, UTF-8 decoding is attempted. If this fails, the result is 'other', otherwise it is 'utf-8'. If no high bytes are found, the result is 'ascii'.

Parameters

buf: The string/buffer to be identified
len: The number of bytes to inspect in order to identify the string. Default: 100

Return value:

A string describing the encoding: 'ascii', 'utf-8', 'utf-16be', 'utf-16le', or 'other' meaning some unidentified 8-bit encoding
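The detection steps described above can be sketched in Python. This is an illustration of the documented heuristic only, not the library's Lua code; the name `chardet_sketch` and the use of Python's incremental UTF-8 decoder are assumptions made for the example.

```python
import codecs

def chardet_sketch(buf: bytes, length: int = 100) -> str:
    """Guess an encoding using the steps described for chardet (illustrative only)."""
    # Step 1: a byte-order mark settles the question immediately.
    if buf.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if buf.startswith(b"\xfe\xff"):
        return "utf-16be"
    if buf.startswith(b"\xff\xfe"):
        return "utf-16le"

    # Step 2: inspect at most `length` bytes.
    sample = buf[:length]
    saw_high_byte = False
    for index, byte in enumerate(sample):
        if byte == 0:
            # Assume the null is the high-order byte of a 16-bit unit:
            # first in the pair means big-endian, second means little-endian.
            return "utf-16be" if index % 2 == 0 else "utf-16le"
        if byte > 127:
            saw_high_byte = True

    if not saw_high_byte:
        return "ascii"

    # Step 3: high bytes present, so try UTF-8. If the sample was cut short,
    # tolerate an incomplete trailing sequence (final=False).
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=len(buf) <= length)
    except UnicodeDecodeError:
        return "other"
    return "utf-8"
```

As the word "poorly" suggests, the heuristic is easy to fool: for example, any 8-bit text that happens to contain a null byte would be reported as UTF-16.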

cp437_dec (buf, pos)

Decodes a CP437 character

Parameters

buf: A string containing the character
pos: The index in the string where the character begins

Return values:

pos: The index in the string where the character ended
cp: The code point of the character as a number

cp437_enc (cp)

Encode a Unicode code point to CP437

Returns nil if the code point cannot be found in CP437

Parameters

cp: The Unicode code point as a number

Return value:

A string containing the corresponding CP437 character, or nil if the code point has no CP437 equivalent
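For intuition, the CP437 mapping can be explored with Python's built-in `cp437` codec. This is a rough sketch under assumptions: the function names are hypothetical, positions are 0-based with the returned position taken to be the next unread byte, and Python's mapping table may differ from the library's for some bytes (for instance in the control-character range).

```python
def cp437_dec_sketch(buf: bytes, pos: int = 0):
    """Decode one CP437 byte; returns (next position, code point)."""
    # CP437 is a single-byte encoding, so every character is exactly one byte.
    char = buf[pos:pos + 1].decode("cp437")
    return pos + 1, ord(char)

def cp437_enc_sketch(cp: int):
    """Encode a code point to CP437; returns None when there is no mapping."""
    try:
        return chr(cp).encode("cp437")
    except UnicodeEncodeError:
        # Mirrors the documented "returns nil" behaviour for unmapped code points.
        return None
```

For example, byte 0x82 decodes to U+00E9 (é), while a code point such as U+4E2D has no CP437 equivalent and yields `None`.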

decode (buf, decoder, bigendian)

Decode a buffer containing Unicode data.

Parameters

buf: The string/buffer to be decoded
decoder: A Unicode decoder function (such as utf8_dec)
bigendian: For encodings that care about byte-order (such as UTF-16), set this to true to force big-endian byte order. Default: false (little-endian)

Return value:

A list-table containing the code points as numbers

encode (list, encoder, bigendian)

Encode a list of Unicode code points

Parameters

list: A list-table of code points as numbers
encoder: A Unicode encoder function (such as utf8_enc)
bigendian: For encodings that care about byte-order (such as UTF-16), set this to true to force big-endian byte order. Default: false (little-endian)

Return value:

An encoded string

transcode (buf, decoder, encoder, bigendian_dec, bigendian_enc)

Transcode a string from one format to another

The string is decoded and re-encoded in a single pass. This saves some overhead compared with passing the output of unicode.decode to unicode.encode, which builds an intermediate list of code points.

Parameters

buf: The string/buffer to be transcoded
decoder: A Unicode decoder function (such as utf16_dec)
encoder: A Unicode encoder function (such as utf8_enc)
bigendian_dec: Set this to true to force big-endian decoding.
bigendian_enc: Set this to true to force big-endian encoding.

Return value:

An encoded string
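The single-pass idea can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function names, the 0-based positions, the convention that a decoder returns the next unread position together with the code point, and the omission of the byte-order arguments. The library's Lua code may be organised differently.

```python
def latin1_dec(buf: bytes, pos: int):
    """Hypothetical decoder: one byte per character, returns (next_pos, cp)."""
    return pos + 1, buf[pos]

def utf8_enc_py(cp: int) -> bytes:
    """Hypothetical encoder built on Python's own UTF-8 codec."""
    return chr(cp).encode("utf-8")

def transcode_sketch(buf: bytes, decoder, encoder) -> bytes:
    """Decode and re-encode character by character, with no code point list."""
    pieces = []
    pos = 0
    while pos < len(buf):
        # Decode exactly one character, then encode it straight away.
        pos, cp = decoder(buf, pos)
        pieces.append(encoder(cp))
    return b"".join(pieces)

def two_pass_sketch(buf: bytes, decoder, encoder) -> bytes:
    """Equivalent result, but materialises every code point first."""
    code_points = []
    pos = 0
    while pos < len(buf):
        pos, cp = decoder(buf, pos)
        code_points.append(cp)
    return b"".join(encoder(cp) for cp in code_points)
```

Both functions produce the same bytes; the single-pass version simply avoids building the intermediate list.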

utf16_dec (buf, pos, bigendian)

Decodes a UTF-16 character.

Does not check that the returned code point is a real character. Specifically, it can be fooled by out-of-order lead- and trail-surrogate characters.

Parameters

buf: A string containing the character
pos: The index in the string where the character begins
bigendian: Set this to true to decode big-endian UTF-16. Default is false (little-endian)

Return values:

pos: The index in the string where the character ended
cp: The code point of the character as a number
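The surrogate caveat is easier to see in code. The following Python sketch decodes one UTF-16 character without validating surrogate order, in the spirit of the warning above. It is not the library's implementation; the name `utf16_dec_sketch`, the 0-based positions, and the choice to return the next unread position are assumptions.

```python
import struct

def utf16_dec_sketch(buf: bytes, pos: int = 0, bigendian: bool = False):
    """Decode one UTF-16 character; returns (next position, code point)."""
    fmt = ">H" if bigendian else "<H"
    (unit,) = struct.unpack_from(fmt, buf, pos)
    pos += 2
    if 0xD800 <= unit <= 0xDFFF:
        # Any surrogate is treated as the start of a pair. Nothing checks that
        # a lead surrogate (D800-DBFF) comes first and a trail surrogate
        # (DC00-DFFF) second, so swapped pairs still yield a number.
        (unit2,) = struct.unpack_from(fmt, buf, pos)
        pos += 2
        cp = 0x10000 + ((unit & 0x3FF) << 10) + (unit2 & 0x3FF)
        return pos, cp
    return pos, unit
```

Feeding this sketch a trail surrogate followed by a lead surrogate produces a code point without any error, which is the kind of input the caveat warns about.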

utf16_enc (cp, bigendian)

Encode a Unicode code point to UTF-16. See RFC 2781.

Windows versions prior to Windows 2000 support only UCS-2, so be careful when using this function to encode code points above 0xFFFF.

Parameters

cp: The Unicode code point as a number
bigendian: Set this to true to encode big-endian UTF-16. Default is false (little-endian)

Return value:

A string containing the code point in UTF-16 encoding.
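The encoding rule from RFC 2781 can be sketched in Python as follows. This is an independent illustration of the RFC's algorithm, not the library's Lua code, and the name `utf16_enc_sketch` is hypothetical.

```python
import struct

def utf16_enc_sketch(cp: int, bigendian: bool = False) -> bytes:
    """Encode one code point to UTF-16 following RFC 2781."""
    fmt = ">H" if bigendian else "<H"
    if cp <= 0xFFFF:
        # Code points in the Basic Multilingual Plane fit in one 16-bit unit.
        return struct.pack(fmt, cp)
    # Otherwise subtract 0x10000 and split the remaining 20 bits
    # into a lead surrogate (top 10 bits) and a trail surrogate (low 10 bits).
    value = cp - 0x10000
    lead = 0xD800 | (value >> 10)
    trail = 0xDC00 | (value & 0x3FF)
    return struct.pack(fmt, lead) + struct.pack(fmt, trail)
```

Code points above 0xFFFF therefore occupy four bytes, which is exactly the case a UCS-2-only consumer cannot interpret.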

utf16to8 (from)

Helper function for the common case of UTF-16 to UTF-8 transcoding, such as from a Windows/SMB unicode string to a printable ASCII (subset of UTF-8) string.

Parameters

from: A string in UTF-16, little-endian

Return value:

The string in UTF-8

utf8_dec (buf, pos)

Decodes a UTF-8 character.

Does not check that the returned code point is a real character.

Parameters

buf: A string containing the character
pos: The index in the string where the character begins

Return values:

pos: The index in the string where the character ended, or nil on error
cp: The code point of the character as a number, or an error string
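A Python sketch of single-character UTF-8 decoding with the same two-value return shape is shown below. The function name, the 0-based positions, the next-unread-position convention, and the error strings are all assumptions made for illustration; the library's actual messages and checks may differ. Like the documented function, the sketch does not check that the result is a real character (overlong forms and surrogates pass through).

```python
def utf8_dec_sketch(buf: bytes, pos: int = 0):
    """Decode one UTF-8 character; returns (next_pos, cp) or (None, error)."""
    first = buf[pos]
    if first < 0x80:
        return pos + 1, first            # single-byte (ASCII) character
    if 0xC0 <= first < 0xE0:
        extra, cp = 1, first & 0x1F      # 2-byte sequence
    elif 0xE0 <= first < 0xF0:
        extra, cp = 2, first & 0x0F      # 3-byte sequence
    elif 0xF0 <= first < 0xF8:
        extra, cp = 3, first & 0x07      # 4-byte sequence
    else:
        return None, "invalid lead byte"

    tail = buf[pos + 1:pos + 1 + extra]
    if len(tail) < extra:
        return None, "truncated sequence"
    for byte in tail:
        # Continuation bytes must look like 10xxxxxx.
        if byte & 0xC0 != 0x80:
            return None, "invalid continuation byte"
        cp = (cp << 6) | (byte & 0x3F)
    return pos + 1 + extra, cp
```

Callers would test the first return value before trusting the second, since on failure the second value carries the error description instead of a code point.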

utf8_enc (cp)

Encode a Unicode code point to UTF-8. See RFC 3629.

Does not check that cp is a real character; that is, doesn't exclude the surrogate range U+D800 - U+DFFF and a handful of others.

Parameters

cp: The Unicode code point as a number

Return value:

A string containing the code point in UTF-8 encoding.
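The bit layout defined in RFC 3629 can be sketched in Python as follows. This illustrates the RFC's encoding table rather than the library's Lua code; the name `utf8_enc_sketch` is hypothetical. Like the documented function, it performs no check that the code point is a real character, so surrogates are encoded without complaint.

```python
def utf8_enc_sketch(cp: int) -> bytes:
    """Encode one code point to UTF-8 using the RFC 3629 bit layout."""
    if cp <= 0x7F:
        return bytes([cp])                              # 0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),                 # 110xxxxx
                      0x80 | (cp & 0x3F)])              # 10xxxxxx
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),                # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),                # 11110xxx
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of Unicode range")
```

For instance, U+D800 comes out as the three bytes ED A0 80, which a strict UTF-8 decoder would reject.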

utf8to16 (from)

Helper function for the common case of UTF-8 to UTF-16 transcoding, such as from a printable ASCII (subset of UTF-8) string to a Windows/SMB unicode string.

Parameters

from: A string in UTF-8

Return value:

The string in UTF-16, little-endian
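To show what the two helpers accomplish, here is a rough Python equivalent built on Python's own codecs. The function names are hypothetical, and Python's strict codecs reject malformed input (such as lone surrogates) that the library's decoders are documented as not checking, so behaviour on bad data will differ.

```python
def utf16to8_sketch(data: bytes) -> bytes:
    """Little-endian UTF-16 bytes to UTF-8 bytes."""
    return data.decode("utf-16-le").encode("utf-8")

def utf8to16_sketch(data: bytes) -> bytes:
    """UTF-8 bytes to little-endian UTF-16 bytes (no BOM is added)."""
    return data.decode("utf-8").encode("utf-16-le")

# A Windows/SMB-style name: each ASCII letter is followed by a null byte.
wire_name = b"W\x00O\x00R\x00K\x00"
printable = utf16to8_sketch(wire_name)
```

For pure ASCII input, the UTF-16 form is simply each byte followed by a null, so `printable` above is the four bytes of "WORK".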
