Library unicode
Library methods for handling Unicode strings.
Author:
Copyright © Same as Nmap--See https://nmap.org/book/man-legal.html
Source: https://svn.nmap.org/nmap/nselib/unicode.lua
Functions
- chardet (buf, len)
Determine (poorly) the character encoding of a string
- cp437_dec (buf, pos)
Decodes a CP437 character
- cp437_enc (cp)
Encode a Unicode code point to CP437
- decode (buf, decoder, bigendian)
Decode a buffer containing Unicode data.
- encode (list, encoder, bigendian)
Encode a list of Unicode code points
- transcode (buf, decoder, encoder, bigendian_dec, bigendian_enc)
Transcode a string from one format to another
- utf16_dec (buf, pos, bigendian)
Decodes a UTF-16 character.
- utf16_enc (cp, bigendian)
Encode a Unicode code point to UTF-16. See RFC 2781.
- utf16to8 (from)
Helper function for the common case of UTF-16 to UTF-8 transcoding, such as from a Windows/SMB Unicode string to a printable ASCII (subset of UTF-8) string.
- utf8_dec (buf, pos)
Decodes a UTF-8 character.
- utf8_enc (cp)
Encode a Unicode code point to UTF-8. See RFC 3629.
- utf8to16 (from)
Helper function for the common case of UTF-8 to UTF-16 transcoding, such as from a printable ASCII (subset of UTF-8) string to a Windows/SMB Unicode string.
Functions
- chardet (buf, len)
Determine (poorly) the character encoding of a string
First, the string is checked for a byte-order mark (BOM), which identifies either UTF-8 or UTF-16 together with its endianness. If no BOM is found, the contents of the string are examined as follows.
If null bytes are encountered, UTF-16 is assumed. Endianness is determined by byte position, assuming the null is the high-order byte. Otherwise, if byte values over 127 are found, UTF-8 decoding is attempted. If this fails, the result is 'other', otherwise it is 'utf-8'. If no high bytes are found, the result is 'ascii'.
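The heuristic described above can be sketched in plain Lua 5.3+. This is only an illustration of the documented steps, not the library's code: the name chardet_sketch is hypothetical, and the use of utf8.len as the UTF-8 validity check is an assumption.

```lua
-- Hypothetical re-implementation of the documented heuristic (Lua 5.3+).
local function chardet_sketch(buf, len)
  len = math.min(len or 100, #buf)

  -- 1. Byte-order marks.
  if buf:sub(1, 3) == "\xEF\xBB\xBF" then return "utf-8" end
  if buf:sub(1, 2) == "\xFE\xFF" then return "utf-16be" end
  if buf:sub(1, 2) == "\xFF\xFE" then return "utf-16le" end

  -- 2. Scan the first len bytes for nulls and high bytes.
  local high = false
  for i = 1, len do
    local b = buf:byte(i)
    if b == 0 then
      -- The null is taken to be the high-order byte: at an odd (1-based)
      -- index the high byte comes first, which means big-endian.
      return (i % 2 == 1) and "utf-16be" or "utf-16le"
    elseif b > 127 then
      high = true
    end
  end

  if not high then return "ascii" end

  -- 3. High bytes were seen: check whether the buffer is valid UTF-8.
  --    (Assumption: the whole buffer is validated, not just len bytes.)
  return utf8.len(buf) and "utf-8" or "other"
end
```

For instance, under these assumptions a lone 0xE9 byte in otherwise ASCII text (as in Latin-1 or CP437 data) fails the UTF-8 check and is reported as 'other'.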
Parameters
- buf
- The string/buffer to be identified
- len
- The number of bytes to inspect in order to identify the string. Default: 100
Return value:
A string describing the encoding: 'ascii', 'utf-8', 'utf-16be', 'utf-16le', or 'other' meaning some unidentified 8-bit encoding
- cp437_dec (buf, pos)
Decodes a CP437 character
Parameters
- buf
- A string containing the character
- pos
- The index in the string where the character begins
Return values:
- pos The index in the string where the character ended
- cp The code point of the character as a number
- cp437_enc (cp)
Encode a Unicode code point to CP437
Returns nil if the code point cannot be found in CP437
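As a rough illustration of the nil-on-miss behaviour, a table-driven encoder might look like the following. The table is a three-entry excerpt chosen for the example, the name cp437_enc_sketch is hypothetical, and passing ASCII through unchanged is an assumption about how such an encoder would treat code points below 0x80.

```lua
-- Illustrative excerpt only: a full table would cover every code point
-- that CP437 assigns to bytes 0x80-0xFF.
local cp437_high = {
  [0x00C7] = 0x80, -- LATIN CAPITAL LETTER C WITH CEDILLA
  [0x00FC] = 0x81, -- LATIN SMALL LETTER U WITH DIAERESIS
  [0x00E9] = 0x82, -- LATIN SMALL LETTER E WITH ACUTE
}

local function cp437_enc_sketch(cp)
  if cp < 0x80 then
    return string.char(cp)  -- assumed: ASCII range maps to itself
  end
  local byte = cp437_high[cp]
  -- Unknown code points yield nil, as the description above states.
  return byte and string.char(byte) or nil
end
```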
Parameters
- cp
- The Unicode code point as a number
Return value:
A string containing the related CP437 character
- decode (buf, decoder, bigendian)
Decode a buffer containing Unicode data.
Parameters
- buf
- The string/buffer to be decoded
- decoder
- A Unicode decoder function (such as utf8_dec)
- bigendian
- For encodings that care about byte-order (such as UTF-16), set this to true to force big-endian byte order. Default: false (little-endian)
Return value:
A list-table containing the code points as numbers
- encode (list, encoder, bigendian)
Encode a list of Unicode code points
Parameters
- list
- A list-table of code points as numbers
- encoder
- A Unicode encoder function (such as utf8_enc)
- bigendian
- For encodings that care about byte-order (such as UTF-16), set this to true to force big-endian byte order. Default: false (little-endian)
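The decode and encode entries both drive a per-character function over a whole buffer or list. The sketch below shows one way that could work in plain Lua 5.3+. All names are hypothetical stand-ins, and it assumes that the pos returned by a decoder is the index of the next unread byte, which the documentation does not spell out.

```lua
-- Hypothetical decoder: one byte per character (Latin-1), returns (pos, cp).
local function latin1_dec(buf, pos)
  return pos + 1, buf:byte(pos)
end

-- Hypothetical encoder: wraps the standard utf8.char so that an extra
-- bigendian argument is ignored rather than passed through.
local function utf8_enc_stub(cp)
  return utf8.char(cp)
end

local function decode_sketch(buf, decoder, bigendian)
  local cps, pos = {}, 1
  while pos <= #buf do
    local cp
    pos, cp = decoder(buf, pos, bigendian)  -- no error handling in this sketch
    cps[#cps + 1] = cp
  end
  return cps
end

local function encode_sketch(list, encoder, bigendian)
  local out = {}
  for i, cp in ipairs(list) do
    out[i] = encoder(cp, bigendian)
  end
  return table.concat(out)
end

-- Latin-1 "café" becomes a list of four code points, then UTF-8 bytes.
local cps = decode_sketch("caf\xE9", latin1_dec)
local utf8_str = encode_sketch(cps, utf8_enc_stub)
```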
Return value:
An encoded string
- transcode (buf, decoder, encoder, bigendian_dec, bigendian_enc)
Transcode a string from one format to another
The string is decoded and re-encoded in one pass. This saves some overhead compared with passing the output of unicode.decode to unicode.encode.
Parameters
- buf
- The string/buffer to be transcoded
- decoder
- A Unicode decoder function (such as utf16_dec)
- encoder
- A Unicode encoder function (such as utf8_enc)
- bigendian_dec
- Set this to true to force big-endian decoding.
- bigendian_enc
- Set this to true to force big-endian encoding.
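A single-pass transcoder of the kind described for transcode could be sketched as follows. It avoids building an intermediate list of code points by encoding each character as soon as it is decoded. The function names are hypothetical, and the decoder is assumed to return the index of the next unread byte.

```lua
-- Hypothetical decoder (Latin-1) and encoder (UTF-8 via the standard library).
local function latin1_dec(buf, pos)
  return pos + 1, buf:byte(pos)
end

local function utf8_enc_stub(cp)
  return utf8.char(cp)
end

local function transcode_sketch(buf, decoder, encoder, bigendian_dec, bigendian_enc)
  local out, pos = {}, 1
  while pos <= #buf do
    local cp
    pos, cp = decoder(buf, pos, bigendian_dec)   -- decode one character
    out[#out + 1] = encoder(cp, bigendian_enc)   -- and re-encode it at once
  end
  return table.concat(out)
end
```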
Return value:
An encoded string
- utf16_dec (buf, pos, bigendian)
Decodes a UTF-16 character.
Does not check that the returned code point is a real character. Specifically, it can be fooled by out-of-order lead- and trail-surrogate characters.
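To make the surrogate caveat concrete, here is a minimal decoder sketch in plain Lua 5.3+ using string.unpack. It is not the library's code: the name utf16_dec_sketch is hypothetical, and the returned pos is assumed to be the index of the next unread byte.

```lua
local function utf16_dec_sketch(buf, pos, bigendian)
  local fmt = bigendian and ">I2" or "<I2"
  local cp
  cp, pos = string.unpack(fmt, buf, pos)      -- first 16-bit unit
  if cp >= 0xD800 and cp <= 0xDBFF then
    -- Lead surrogate: combine it with the next unit, which is assumed
    -- (not verified) to be a trail surrogate in 0xDC00-0xDFFF.
    local trail
    trail, pos = string.unpack(fmt, buf, pos)
    cp = 0x10000 + ((cp - 0xD800) << 10) + (trail - 0xDC00)
  end
  return pos, cp
end
```

Because the second unit is never validated, a lead surrogate followed by an ordinary character produces a meaningless code point, which is the kind of input the note above warns about.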
Parameters
- buf
- A string containing the character
- pos
- The index in the string where the character begins
- bigendian
- Set this to true to decode big-endian UTF-16. Default is false (little-endian)
Return values:
- pos The index in the string where the character ended
- cp The code point of the character as a number
- utf16_enc (cp, bigendian)
Encode a Unicode code point to UTF-16. See RFC 2781.
Windows versions prior to Windows 2000 support only UCS-2, so be careful when using this function to encode code points above 0xFFFF, which require surrogate pairs.
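The RFC 2781 encoding rule can be sketched in plain Lua 5.3+ with string.pack. The name utf16_enc_sketch is hypothetical and the code is only an illustration of the rule, not the library's implementation.

```lua
local function utf16_enc_sketch(cp, bigendian)
  local fmt = bigendian and ">I2" or "<I2"
  if cp <= 0xFFFF then
    return string.pack(fmt, cp)            -- fits in a single 16-bit unit
  end
  local v = cp - 0x10000                   -- 20-bit value
  local lead = 0xD800 + (v >> 10)          -- upper 10 bits
  local trail = 0xDC00 + (v & 0x3FF)       -- lower 10 bits
  return string.pack(fmt .. fmt, lead, trail)
end
```

Code points above 0xFFFF take the second branch and produce four bytes, which a UCS-2-only consumer would read as two separate characters.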
Parameters
- cp
- The Unicode code point as a number
- bigendian
- Set this to true to encode big-endian UTF-16. Default is false (little-endian)
Return value:
A string containing the code point in UTF-16 encoding.
- utf16to8 (from)
Helper function for the common case of UTF-16 to UTF-8 transcoding, such as from a Windows/SMB Unicode string to a printable ASCII (subset of UTF-8) string.
Parameters
- from
- A string in UTF-16, little-endian
Return value:
The string in UTF-8
- utf8_dec (buf, pos)
Decodes a UTF-8 character.
Does not check that the returned code point is a real character.
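A decoder with the documented return shape (a position and a code point, or nil and an error string) might be sketched as below. The name utf8_dec_sketch and the error messages are hypothetical, the returned pos is assumed to be the index of the next unread byte, and pos is assumed to lie within buf. Like the documented function, the sketch does not reject surrogates; it also does not reject overlong forms.

```lua
local function utf8_dec_sketch(buf, pos)
  local b = buf:byte(pos)
  local n, cp                       -- n = number of continuation bytes
  if b < 0x80 then
    return pos + 1, b               -- single-byte (ASCII) character
  elseif b >= 0xC0 and b < 0xE0 then
    n, cp = 1, b & 0x1F
  elseif b >= 0xE0 and b < 0xF0 then
    n, cp = 2, b & 0x0F
  elseif b >= 0xF0 and b < 0xF8 then
    n, cp = 3, b & 0x07
  else
    return nil, "invalid lead byte"
  end
  for i = pos + 1, pos + n do
    local c = buf:byte(i)
    if not c or (c & 0xC0) ~= 0x80 then
      return nil, "invalid continuation byte"
    end
    cp = (cp << 6) | (c & 0x3F)     -- append 6 payload bits
  end
  return pos + n + 1, cp
end
```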
Parameters
- buf
- A string containing the character
- pos
- The index in the string where the character begins
Return values:
- pos The index in the string where the character ended or nil on error
- cp The code point of the character as a number, or an error string
- utf8_enc (cp)
Encode a Unicode code point to UTF-8. See RFC 3629.
Does not check that cp is a real character; that is, doesn't exclude the surrogate range U+D800 - U+DFFF and a handful of others.
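The RFC 3629 bit layout can be sketched in plain Lua 5.3+ as follows. The name utf8_enc_sketch is hypothetical. As with the documented function, nothing here rejects surrogates, so 0xD800 is encoded like any other three-byte code point.

```lua
local function utf8_enc_sketch(cp)
  if cp < 0x80 then
    return string.char(cp)                           -- 0xxxxxxx
  elseif cp < 0x800 then
    return string.char(0xC0 | (cp >> 6),             -- 110xxxxx
                       0x80 | (cp & 0x3F))           -- 10xxxxxx
  elseif cp < 0x10000 then
    return string.char(0xE0 | (cp >> 12),            -- 1110xxxx
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F))
  else
    return string.char(0xF0 | (cp >> 18),            -- 11110xxx
                       0x80 | ((cp >> 12) & 0x3F),
                       0x80 | ((cp >> 6) & 0x3F),
                       0x80 | (cp & 0x3F))
  end
end
```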
Parameters
- cp
- The Unicode code point as a number
Return value:
A string containing the code point in UTF-8 encoding.
- utf8to16 (from)
Helper function for the common case of UTF-8 to UTF-16 transcoding, such as from a printable ASCII (subset of UTF-8) string to a Windows/SMB Unicode string.
Parameters
- from
- A string in UTF-8
Return value:
The string in UTF-16, little-endian
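For readers who want to see what a UTF-8 to little-endian UTF-16 conversion involves, the following plain Lua 5.3+ sketch uses only the standard utf8 and string libraries. The name utf8to16_sketch is hypothetical and this is not the library's implementation. Unlike the documented decoder, utf8.codes raises an error on invalid input.

```lua
local function utf8to16_sketch(from)
  local out = {}
  for _, cp in utf8.codes(from) do          -- errors on invalid UTF-8
    if cp <= 0xFFFF then
      out[#out + 1] = string.pack("<I2", cp)
    else
      -- Split into a surrogate pair, as in RFC 2781.
      local v = cp - 0x10000
      out[#out + 1] = string.pack("<I2<I2",
                                  0xD800 + (v >> 10),
                                  0xDC00 + (v & 0x3FF))
    end
  end
  return table.concat(out)
end
```

For printable ASCII input, the result is simply each byte followed by a null byte, which is the form Windows/SMB protocols commonly expect.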