  1. Jan 01, 2023
  2. Mar 07, 2022
    • Handle non-ASCII identifiers in Ada · 315e4ebb
      Tom Tromey authored
      Ada allows non-ASCII identifiers, and GNAT supports several such
      encodings.  This patch adds the corresponding support to gdb.
      
      GNAT encodes non-ASCII characters using special symbol names.
      
      For character sets like Latin-1, where all characters are a single
      byte, it uses a "U" followed by the hex for the character.  So, for
      example, thorn would be encoded as "Ufe" (0xFE being lower case
      thorn).
      
      For wider characters, despite what the manual says (it claims
      Shift-JIS and EUC can be used), in practice recent versions only
      support Unicode.  Here, characters in the base plane are represented
      using "Wxxxx" and characters outside the base plane using
      "WWxxxxxxxx".
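
As an illustration of the encoding described above, here is a minimal Python sketch of a decoder. It is not the code from this patch. The function name `decode_gnat_name` and its `source_charset` parameter are hypothetical, and the sketch assumes that the hex digits are lower case (as in the "Ufe" example) and that upper-case "U" and "W" appear only as escape markers.

```python
import re

# "WW" + 8 hex digits, "W" + 4 hex digits, or "U" + 2 hex digits.
# The longer "WW" form is listed first so it wins over the "W" form.
_ESCAPE = re.compile(r"WW([0-9a-f]{8})|W([0-9a-f]{4})|U([0-9a-f]{2})")


def decode_gnat_name(encoded, source_charset="latin-1"):
    """Decode GNAT-style U/W/WW escapes in a symbol name (illustrative only)."""

    def replace(match):
        wide8, wide4, byte2 = match.groups()
        if byte2 is not None:
            # "U" escapes hold one byte in the source character set.
            return bytes([int(byte2, 16)]).decode(source_charset)
        # "W" and "WW" escapes hold a Unicode code point directly.
        return chr(int(wide8 or wide4, 16))

    return _ESCAPE.sub(replace, encoded)
```

With the default Latin-1 charset, `decode_gnat_name("Ufe")` yields lower-case thorn, matching the example in the text. Passing a different `source_charset` changes only how "U" escapes are interpreted.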
      
      GNAT has some further quirks here.  Ada is case-insensitive, and GNAT
      emits symbols that have been case-folded.  For characters in ASCII,
      and for all characters in non-Unicode character sets, lower case is
      used.  For Unicode, however, characters that fit in a single byte are
      converted to lower case, but all others are converted to upper case.
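
The folding rule above can be sketched in Python as follows. This is an illustrative reading, not gdb's implementation. The name `gnat_fold` and the `unicode_mode` flag are hypothetical, "fits in a single byte" is assumed to mean a code point of at most 0xFF, and Python's `str.upper()` can return more than one character for a few code points, which a real case folder would need to handle.

```python
def gnat_fold(name, unicode_mode=True):
    """Case-fold a name the way the commit message describes (a sketch)."""
    if not unicode_mode:
        # Non-Unicode character sets: everything goes to lower case.
        return name.lower()
    # Unicode: code points up to 0xFF (assumed meaning of "single byte")
    # go to lower case, everything else goes to upper case.
    return "".join(
        ch.lower() if ord(ch) <= 0xFF else ch.upper() for ch in name
    )
```

Under this reading, lower-case Y WITH DIAERESIS (U+00FF) and its upper-case form (U+0178) fall on opposite sides of the 0xFF boundary, so the sketch leaves each one unchanged.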
      
      Furthermore, there is a bug in GNAT where two symbols that differ only
      in the case of "Y WITH DIAERESIS" (and potentially others, I did not
      check exhaustively) can be used in one program.  I chose to omit
      handling this case from gdb, on the theory that it is hard to figure
      out the logic, and anyway if the bug is ever fixed, we'll regret
      having a heuristic.
      
      This patch introduces a new "ada source-charset" setting.  It defaults
      to Latin-1, as that is GNAT's default.  This setting controls how "U"
      characters are decoded -- W/WW are always handled as UTF-32.
      
      The ada_tag_name_from_tsd change is needed because this function will
      read memory from the inferior and interpret it -- and this caused an
      encoding failure on PPC when running a test that tries to read
      uninitialized memory.
      
      This patch implements its own UTF-32-based case folder.  This avoids
      host platform quirks, and is relatively simple.  A short Python
      program to generate the case-folding table is included.  It simply
      relies on whatever version of Unicode is used by the host Python,
      which seems basically acceptable.
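
For readers curious what such a generator might look like, here is a rough Python sketch. It is not the script shipped with this patch. The function names and the range-based table layout are assumptions made for illustration, and only one-to-one, round-tripping case pairs are kept.

```python
import sys
import unicodedata

# The table depends on the Unicode version of the host Python.
UNICODE_VERSION = unicodedata.unidata_version


def simple_case_pairs(limit=sys.maxunicode + 1):
    """Return (lower, upper) code point pairs with a one-to-one mapping."""
    pairs = []
    for cp in range(limit):
        ch = chr(cp)
        up = ch.upper()
        # Keep only simple mappings that round-trip back to the original.
        if len(up) == 1 and up != ch and up.lower() == ch:
            pairs.append((cp, ord(up)))
    return pairs


def to_ranges(pairs):
    """Merge consecutive pairs sharing a delta into (first, last, delta)."""
    ranges = []
    for lower, upper in pairs:
        delta = upper - lower
        if ranges and ranges[-1][1] + 1 == lower and ranges[-1][2] == delta:
            first, _, _ = ranges[-1]
            ranges[-1] = (first, lower, delta)
        else:
            ranges.append((lower, lower, delta))
    return ranges
```

Calling `to_ranges(simple_case_pairs())` produces a compact list that could be printed as a static table, with `UNICODE_VERSION` recorded in a comment so that the origin of the data is clear.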
      
      Test cases for UTF-8, Latin-1, and Latin-3 are included.  These
      exercise most of the new code paths, aside from Y WITH DIAERESIS as
      noted above.
      
      