String

char

A char is actually a fixed-size 4-byte value, the same size as a u32. It holds a single Unicode scalar value.

#![allow(unused)]
fn main() {
let tao: char = '道';
println!("'道' as u32: {}", tao as u32); // 36947: a char is essentially a u32-sized Unicode scalar value
println!("U+{:x}", tao as u32);  // U+9053: printed in hexadecimal (base 16)
println!("{}", tao.escape_unicode()); // \u{9053}
println!("{}", char::from(65)); // from a u8 -> 'A'
println!("{}", std::char::from_u32(0x9053).unwrap()); // from a u32
println!("{}", std::char::from_u32(36947).unwrap()); // from a u32
println!("{}", std::char::from_u32(1234567).unwrap_or('_')); // not every u32 is a valid char

// a char always occupies 4 bytes in memory, but its UTF-8 encoding may be shorter
assert_eq!(3, tao.len_utf8()); // length of the UTF-8 encoding, in bytes
assert_eq!(4, std::mem::size_of_val(&tao));
}
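Not every u32 is a valid char, as the `unwrap_or` above hints. A small std-only sketch of the boundaries: UTF-16 surrogates and anything above `char::MAX` are rejected.

```rust
fn main() {
    // Surrogates (0xD800..=0xDFFF) are reserved for UTF-16 pairs
    // and are not Unicode scalar values, so they are not chars.
    assert_eq!(None, std::char::from_u32(0xD800));
    // Values above 0x10FFFF are outside the Unicode code space.
    assert_eq!(None, std::char::from_u32(0x110000));
    // The largest valid char:
    assert_eq!('\u{10FFFF}', char::MAX);
    println!("char::MAX is U+{:X}", char::MAX as u32);
}
```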

String

A String is essentially a Vec<u8> whose contents are guaranteed to be valid UTF-8.

Other string-related types:

  • CStr/CString
  • OsStr/OsString
  • Path/PathBuf
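A brief, std-only sketch of what those other types are for (the exact contents are illustrative):

```rust
use std::ffi::{CString, OsString};
use std::path::PathBuf;

fn main() {
    // CString: NUL-terminated, no interior NULs — meant for FFI with C.
    let c = CString::new("道").unwrap();
    assert_eq!(4, c.as_bytes_with_nul().len()); // 3 UTF-8 bytes + trailing NUL

    // OsString: the platform-native string form, not guaranteed UTF-8.
    let os: OsString = OsString::from("道");
    assert_eq!(Some("道"), os.to_str()); // Some only when it is valid UTF-8

    // PathBuf: an owned filesystem path, built on top of OsString.
    let p = PathBuf::from("/tmp").join("道.txt");
    println!("{}", p.display());
}
```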
#![allow(unused)]
fn main() {
let tao = std::str::from_utf8(&[0xe9u8, 0x81u8, 0x93u8]).unwrap();
println!("{}", tao);

let tao = String::from("\u{9053}");
println!("{}", tao);
}
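The "String is a Vec<u8>" claim can be made concrete: `into_bytes` hands back the underlying buffer, and `from_utf8` is the checked way back. A minimal sketch:

```rust
fn main() {
    // The String's buffer is literally the UTF-8 bytes of its text.
    let tao = String::from("道");
    let bytes = tao.into_bytes();
    assert_eq!(vec![0xe9u8, 0x81, 0x93], bytes);

    // from_utf8 validates the bytes before wrapping them in a String.
    let tao = String::from_utf8(bytes).unwrap();
    assert_eq!("道", tao);

    // Invalid UTF-8 is rejected instead of producing a broken String.
    assert!(String::from_utf8(vec![0xff, 0xfe]).is_err());
}
```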

A String that looks like a char might not be a char

A single "char"-looking thing, like ❤️, is not necessarily a valid char. Some of these apparently single characters need more than one code point to be represented:

#![allow(unused)]
fn main() {
assert_eq!(6, String::from("❤️").len()); // length in byte
assert_eq!(6, std::mem::size_of_val(String::from("❤️").as_str())); // same calculation as above
assert_eq!(2, String::from("❤️").chars().count()); // it consists of 2 code points
// since ❤️ takes 2 code points, we can't assign it to a char
// let heart = '❤️'; // This won't work

assert_eq!(1, String::from("道").chars().count()); // 道 fits in a char because it is a single code point
assert_eq!('道', String::from("道").chars().next().unwrap());
assert_eq!(3, String::from("道").len());
assert_eq!(3, std::mem::size_of_val(String::from("道").as_str()));
}
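To see what those 2 code points actually are, we can iterate over the string. This sketch relies on ❤️ being U+2764 HEAVY BLACK HEART followed by U+FE0F VARIATION SELECTOR-16 (the "render as emoji" hint):

```rust
fn main() {
    let heart = "❤️";
    // Collect the scalar values that make up the visible "character".
    let points: Vec<char> = heart.chars().collect();
    assert_eq!(vec!['\u{2764}', '\u{fe0f}'], points);

    // Each of the two code points encodes to 3 UTF-8 bytes: 6 in total,
    // matching the len() we measured above.
    for c in heart.chars() {
        println!("U+{:04X} uses {} UTF-8 byte(s)", c as u32, c.len_utf8());
    }
}
```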

'é' is not 'é'

As always, remember that human intuition about what a 'character' is may not map to Unicode's definitions. For example, despite looking identical, 'é' is one Unicode code point while 'é' is two:

#![allow(unused)]
fn main() {
assert_eq!(1, String::from("é").chars().count()); // '\u{00e9}' -> latin small letter e with acute
assert_eq!(2, String::from("é").chars().count()); // '\u{0065}' + '\u{0301}' -> U+0065: 'latin small letter e', U+0301: 'combining acute accent'
// They look the same in an editor but have different code points!
}
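Escaping both strings makes the difference visible without relying on how the editor renders them. A small sketch using the precomposed and decomposed forms from above:

```rust
fn main() {
    let precomposed = "\u{00e9}";        // 'é' as one code point
    let decomposed = "\u{0065}\u{0301}"; // 'e' + combining acute accent

    // They render alike, but byte-wise (and code-point-wise) they differ.
    assert_ne!(precomposed, decomposed);

    // escape_unicode reveals the underlying code points.
    println!("{}", precomposed.escape_unicode()); // \u{e9}
    println!("{}", decomposed.escape_unicode());  // \u{65}\u{301}
}
```

Treating these as "the same" text requires Unicode normalization, which lives outside the standard library.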