Day 2. The Path to Mastering Go Language: Principles of String Implementation and Efficient Usage

A few days ago, I was busy reviewing GORM. I will share that article later. Today, let's first learn about the string type in Golang.

About String#

The string type is one of the most commonly used data types in modern programming languages. C, one of Go's ancestors, has no dedicated string type: strings are represented as string literals or as char arrays terminated by \0.

#define GOAUTHORS "Robert Griesemer, Rob Pike, and Ken Thompson"
const char *s = "hello world";
char s[] = "hello gopher";

This design causes several problems for C programmers when working with strings, such as:

Poor type safety
Always needing to consider the ending \0 when manipulating strings
Mutable string data
High cost of obtaining string length (O(n) time complexity)
No built-in handling for non-ASCII characters

Go fixes these flaws of C by introducing a built-in string type that provides a unified abstraction.

Strings in Go Language#

In Go, a string constant, a string variable, and a string literal appearing in the code all have the same type, string:

const S = "hello world"

func main() {
    var s1 string = "hello string"
    fmt.Printf("%T\n", S) // string
    fmt.Printf("%T\n", s1) // string
    fmt.Printf("%T\n", "hello") // string
}

Go's string type draws from the experiences of C language string design and combines best practices from string types in other languages, resulting in the following functional characteristics of Go strings:

1. Data Immutability#

Once a value of type string is declared, whether as a constant or a variable, the data it refers to cannot change for the rest of the program's lifetime. Let's first try to modify a string by converting it to a byte slice:

func main() {
	var s1 string = "hello string"
	fmt.Printf("Original string: %s\n", s1)

	// Convert to a byte slice, then attempt to modify it
	sl := []byte(s1)
	sl[0] = 'l'
	fmt.Printf("slice: %s\n", sl)
	fmt.Printf("After modifying the slice, original string is: %s\n", s1)
}
// output:
// Original string: hello string
// slice: lello string
// After modifying the slice, original string is: hello string

As we can see, converting the string to a byte slice and modifying the slice leaves the original string untouched. The conversion []byte(s1) allocates new underlying memory for the slice and copies the string's bytes into it instead of sharing the string's storage, so the modification to the slice does not affect s1.

Let's try a more aggressive approach using unsafe:

func main() {
	var s1 string = "hello"
	fmt.Printf("Original string: %s\n", s1)

	modifyString(&s1)
	fmt.Println(s1)
}

func modifyString(s *string) {
	// Read the first word of the string header: the address of the underlying data
	p := (*uintptr)(unsafe.Pointer(s))

	// Reinterpret that address (*p, not p itself) as a pointer to the underlying 5-byte array.
	// Using p here would point at the string header instead of the string data.
	var array *[5]byte = (*[5]byte)(unsafe.Pointer(*p))
	// The second word of the header holds the length
	var l *int = (*int)(unsafe.Pointer(uintptr(unsafe.Pointer(s)) + unsafe.Sizeof((*uintptr)(nil))))

	for i := 0; i < (*l); i++ {
		fmt.Printf("%p => %c\n", &((*array)[i]), (*array)[i])
		pl := &((*array)[i])
		v := (*pl)
		(*pl) = v + 1 // try to change the byte in place
	}
}
// output (addresses differ between platforms and runs):
// Original string: hello
// 0x... => h
// unexpected fault address 0x...
// fatal error: fault

We can see that the underlying data area of this string can only be read. The data of a string literal is placed in a read-only section of the program, so the first byte can be printed, but the attempt to write to it triggers a fatal runtime fault (reported as an access violation, signal 0xc0000005, on Windows and as SIGBUS or SIGSEGV on Unix-like systems). The "tampering attack" on string data has failed again.

2. Zero Value Usability#

Go's string type supports the concept of "zero value usability." Go strings do not require consideration of the ending \0 character like in C, so their zero value is "", with a length of 0.

var s string
fmt.Printf("%q\n", s) // %q prints the value in quotes, so the empty string is visible
fmt.Printf("%d\n", len(s))
// output:
// ""
// 0

3. Time Complexity of Length Retrieval is O(1)#

Because a Go string is immutable, its length is fixed once the value is created. Go stores this length as a field in the string's internal representation (the stringStruct type in runtime/string.go, shown later), so len(s) simply reads that field instead of scanning the data, which makes it a very cheap O(1) operation.

4. Supports String Concatenation via +/+= Operators#

For developers, string concatenation using the + and += operators provides the best experience, and Go language supports this operation:

s := "hello"
s = s + " world"
s += ", golang"

fmt.Println(s)
// output: hello world, golang

5. Supports Various Comparison Operators: ==, !=, >=, <=, >, and <#

Go supports various comparison operators:

func main() {
	s1 := "hello world"
	s2 := "hello" + " world"
	fmt.Println(s1 == s2)

	s1 = "Go"
	s2 = "C"
	fmt.Println(s1 != s2)

	s1 = "12345"
	s2 = "23456"
	fmt.Println(s1 < s2)
	fmt.Println(s1 <= s2)

	s1 = "12345"
	s2 = "123"
	fmt.Println(s1 > s2)
	fmt.Println(s1 >= s2)
}
// output:
// true
// true
// true
// true
// true
// true

When testing equality, two strings of different lengths can be declared unequal without looking at their data. If the lengths are the same, the next check is whether both data pointers refer to the same storage. If they do, the two strings are equal; if not, the actual contents must be compared. The ordering operators (<, <=, >, >=) compare the strings lexicographically, byte by byte.
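The equality steps described above can be sketched in ordinary Go. This is only an illustration of the logic, not the runtime's implementation; the function name equalSketch is hypothetical, and unsafe.StringData assumes Go 1.20 or later.

```go
package main

import (
	"fmt"
	"unsafe"
)

// equalSketch mirrors the steps described in the text:
// length check, then data pointer check, then byte-wise comparison.
func equalSketch(a, b string) bool {
	// Different lengths: the strings cannot be equal.
	if len(a) != len(b) {
		return false
	}
	// Same length and same underlying storage: equal without reading the data.
	if unsafe.StringData(a) == unsafe.StringData(b) {
		return true
	}
	// Otherwise compare the contents byte by byte.
	for i := 0; i < len(a); i++ {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(equalSketch("hello world", "hello"+" world")) // true
	fmt.Println(equalSketch("Go", "C"))                       // false
	fmt.Println(equalSketch("12345", "12346"))                // false
}
```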

6. Native Support for Non-ASCII Characters#

Go source files are encoded in UTF-8. Unicode, the character set behind that encoding, is the most widely used character set today and covers virtually every character in use, not just ASCII. The characters of a string literal in Go source are therefore Unicode characters (code points), and they are stored in memory as UTF-8 encoded bytes.
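A short sketch makes the difference between bytes and characters visible. The helper name byteAndRuneLen is made up for this example; len, utf8.RuneCountInString, and range over a string are standard Go.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneLen (a hypothetical helper) returns the length of s in bytes
// and the number of Unicode code points it contains.
func byteAndRuneLen(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	s := "中国欢迎您"

	nb, nr := byteAndRuneLen(s)
	fmt.Println(nb, nr) // 15 5

	// range decodes the UTF-8 data: i is the byte offset, r the code point.
	for i, r := range s {
		fmt.Printf("%d: %c (U+%04X)\n", i, r, r)
	}
}
```

Each of these characters takes three bytes in UTF-8, which is why the byte offsets printed by the loop advance in steps of three.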

7. Native Support for Multi-line Strings#

In C, to construct multi-line strings, one must either use multiple strings for concatenation or combine them with the continuation character \, making it difficult to control the format.

Go language directly provides a way to construct multi-line strings using backticks:

func main() {

	s := `hello
world
golang`

	fmt.Println(s)
}
// output:
// hello
// world
// golang

Internal Representation of Strings#

The characteristics of Go's string type are closely tied to the internal representation of the string type in Go runtime. Go strings are represented in runtime as follows:

// $GOROOT/src/runtime/string.go
type stringStruct struct {
    str unsafe.Pointer
    len int
}

What we see as a string is actually a descriptor: it does not store the character data itself but consists of a pointer to the underlying storage and a length field. Let's look at how a string is instantiated:

// $GOROOT/src/runtime/string.go

func rawstring(size int) (s string, b []byte) {
    p := mallocgc(uintptr(size), nil, false)
    stringStructOf(&s).str = p
    stringStructOf(&s).len = size

    *(*slice)(unsafe.Pointer(&b)) = slice{p, size, size}

    return
}

Analyzing this function:

We can see that each string corresponds to a stringStruct instance. After executing rawstring, the str pointer in stringStruct points to the actual memory area storing the string data, and the len field stores the length of the string; at the same time, rawstring also creates a temporary slice, whose array pointer also points to the underlying storage memory area of the string. Note that after executing rawstring, the allocated memory area has not yet been written with data; this slice is meant to write data into memory later, such as "hello".

After writing the data, the slice will be reclaimed.

Based on the representation of string in runtime, we can find that directly passing the string type as a function/method parameter does not incur much overhead, as only a descriptor is passed, not the actual string data.
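The following sketch illustrates that only the descriptor is copied. The helper names dataPtr, receivedSameData, and tailSharesStorage are hypothetical, and unsafe.StringData assumes Go 1.20 or later.

```go
package main

import (
	"fmt"
	"unsafe"
)

// dataPtr returns the address of the first byte of s's underlying storage.
func dataPtr(s string) unsafe.Pointer {
	return unsafe.Pointer(unsafe.StringData(s))
}

// receivedSameData is called with a copy of the caller's descriptor and
// reports whether that copy still points at the storage located at p.
func receivedSameData(s string, p unsafe.Pointer) bool {
	return dataPtr(s) == p
}

// tailSharesStorage reports whether the substring s[n:] points into the
// storage of s instead of owning a copy. It assumes 0 <= n < len(s).
func tailSharesStorage(s string, n int) bool {
	return dataPtr(s[n:]) == unsafe.Add(dataPtr(s), n)
}

func main() {
	s := "hello, gopher"

	// The callee sees the same data pointer: only the descriptor was copied.
	fmt.Println(receivedSameData(s, dataPtr(s))) // true

	// A substring is a new descriptor over the same bytes.
	fmt.Println(tailSharesStorage(s, 7)) // true
}
```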

Efficient Construction of Strings#

As mentioned earlier, Go natively supports using the +/+= operators to concatenate multiple strings to construct a longer string, and this method provides the best developer experience. However, Go also offers other construction methods:

fmt.Sprintf
strings.Join
strings.Builder
bytes.Buffer

Among these methods, which one is the most efficient? We can refer to benchmark test data:

var sl []string = []string{
	"Rob Pike ",
	"Robert Griesemer ",
	"Ken Thompson ",
}

func concatStringByOperator(sl []string) string {
	var s string
	for _, v := range sl {
		s += v
	}
	return s
}

func concatStringBySprintf(sl []string) string {
	var s string
	for _, v := range sl {
		s = fmt.Sprintf("%s%s", s, v)
	}
	return s
}

func concatStringByJoin(sl []string) string {
	return strings.Join(sl, "")
}

func concatStringByStringsBuilder(sl []string) string {
	var b strings.Builder
	for _, v := range sl {
		b.WriteString(v)
	}
	return b.String()
}

func concatStringByStringsBuilderWithInitSize(sl []string) string {
	var b strings.Builder
	b.Grow(64)
	for _, v := range sl {
		b.WriteString(v)
	}
	return b.String()
}

func concatStringByBytesBuffer(sl []string) string {
	var b bytes.Buffer
	for _, v := range sl {
		b.WriteString(v)
	}
	return b.String()
}

func concatStringByBytesBufferWithInitSize(sl []string) string {
	buf := make([]byte, 0, 64)
	b := bytes.NewBuffer(buf)
	for _, v := range sl {
		b.WriteString(v)
	}
	return b.String()
}

func BenchmarkConcatStringByOperator(b *testing.B) {
	for n := 0; n < b.N; n++ {
		concatStringByOperator(sl)
	}
}

func BenchmarkConcatStringBySprintf(b *testing.B) {
	for n := 0; n < b.N; n++ {
		concatStringBySprintf(sl)
	}
}

func BenchmarkConcatStringByJoin(b *testing.B) {
	for n := 0; n < b.N; n++ {
		concatStringByJoin(sl)
	}
}

func BenchmarkConcatStringByStringsBuilder(b *testing.B) {
	for n := 0; n < b.N; n++ {
		concatStringByStringsBuilder(sl)
	}
}

func BenchmarkConcatStringByStringsBuilderWithInitSize(b *testing.B) {
	for n := 0; n < b.N; n++ {
		concatStringByStringsBuilderWithInitSize(sl)
	}
}

func BenchmarkConcatStringByBytesBuffer(b *testing.B) {
	for n := 0; n < b.N; n++ {
		concatStringByBytesBuffer(sl)
	}
}

func BenchmarkConcatStringByBytesBufferWithInitSize(b *testing.B) {
	for n := 0; n < b.N; n++ {
		concatStringByBytesBufferWithInitSize(sl)
	}
}

The test results are as follows:

goos: windows
goarch: amd64
pkg: prometheus_for_go
cpu: AMD Ryzen 7 5800H with Radeon Graphics         
BenchmarkConcatStringByOperator
BenchmarkConcatStringByOperator-16                      	16968997	        67.09 ns/op	      80 B/op	       2 allocs/op
BenchmarkConcatStringBySprintf
BenchmarkConcatStringBySprintf-16                       	 3519394	       337.6 ns/op	     176 B/op	       8 allocs/op
BenchmarkConcatStringByJoin
BenchmarkConcatStringByJoin-16                          	27803714	        42.37 ns/op	      48 B/op	       1 allocs/op
BenchmarkConcatStringByStringsBuilder
BenchmarkConcatStringByStringsBuilder-16                	13961523	        85.36 ns/op	     112 B/op	       3 allocs/op
BenchmarkConcatStringByStringsBuilderWithInitSize
BenchmarkConcatStringByStringsBuilderWithInitSize-16    	32188840	        38.04 ns/op	      64 B/op	       1 allocs/op
BenchmarkConcatStringByBytesBuffer
BenchmarkConcatStringByBytesBuffer-16                   	18743878	        63.34 ns/op	     112 B/op	       2 allocs/op
BenchmarkConcatStringByBytesBufferWithInitSize
BenchmarkConcatStringByBytesBufferWithInitSize-16       	33670977	        35.76 ns/op	      48 B/op	       1 allocs/op

We can see that the pre-initialized bytes.Buffer is the most efficient, taking only 35.76 ns per operation with a single memory allocation. Across repeated runs, this method consistently performed best.

Next is the pre-initialized strings.Builder, which is only about 2 ns slower (38.04 ns per operation), so the two are very close in performance.

Thus, we can conclude:

When the final string length can be estimated, using a pre-initialized bytes.Buffer or strings.Builder is the most efficient way to construct the string.
strings.Join has the most stable performance; if the strings are already held in a []string, strings.Join is also a good choice.
Directly using operators is the most intuitive and natural; when the compiler knows the number of strings to concatenate, it can optimize this method.
Although fmt.Sprintf is not very efficient, it is the most suitable method when a string in a specific format has to be built from several different variables.
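When the parts are already in hand, the final length does not have to be guessed. A minimal sketch, assuming a hypothetical helper named concatWithGrow, sums the lengths first and reserves exactly that much with strings.Builder.Grow:

```go
package main

import (
	"fmt"
	"strings"
)

// concatWithGrow (a hypothetical helper) concatenates parts after
// reserving the exact number of bytes the result needs.
func concatWithGrow(parts []string) string {
	total := 0
	for _, p := range parts {
		total += len(p)
	}

	var b strings.Builder
	b.Grow(total) // reserve the capacity once, up front
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	parts := []string{"Rob Pike ", "Robert Griesemer ", "Ken Thompson "}
	fmt.Printf("%q\n", concatWithGrow(parts))
}
```

The extra pass over the parts only reads their lengths, so it is cheap compared with growing the buffer several times.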

Efficient Conversion of Strings#

In an earlier example, we saw the conversion from string to []byte. A string can likewise be converted to []rune, and both conversions are reversible: a []byte or a []rune can be converted back to a string. Here is an example:

func main() {
    rs := []rune{
        0x4E2D,
        0x56FD,
        0x6B22,
        0x8FCE,
        0x60A8,
    }

    s := string(rs)
    fmt.Println(s)

    sl := []byte{
        0xE4, 0xB8, 0xAD,
        0xE5, 0x9B, 0xBD,
        0xE6, 0xAC, 0xA2,
        0xE8, 0xBF, 0x8E,
        0xE6, 0x82, 0xA8,
    }

    s = string(sl)
    fmt.Println(s)
}

$ go run string_slice_to_string.go
中国欢迎您
中国欢迎您

Whether converting from string to slice or from slice to string, the conversion has a cost: memory allocation and copying. The cost arises because string is immutable while a slice is not, so each conversion has to allocate new memory for the converted value. Let's look at the memory allocations during conversions between string and slice:

func byteSliceToString() {
	sl := []byte{
		0xE4, 0xB8, 0xAD,
		0xE5, 0x9B, 0xBD,
		0xE6, 0xAC, 0xA2,
		0xE8, 0xBF, 0x8E,
		0xE6, 0x82, 0xA8,
		0xEF, 0xBC, 0x8C,
		0xE5, 0x8C, 0x97,
		0xE4, 0xBA, 0xAC,
		0xE6, 0xAC, 0xA2,
		0xE8, 0xBF, 0x8E,
		0xE6, 0x82, 0xA8,
	}

	_ = string(sl)
}

func stringToByteSlice() {
	s := "中国欢迎您，北京欢迎您"
	_ = []byte(s)
}

func main() {
	fmt.Println(testing.AllocsPerRun(1, byteSliceToString))
	fmt.Println(testing.AllocsPerRun(1, stringToByteSlice))
}
// output:
// 1
// 0

Here, an interesting point arises. My environment is go version go1.23.4 windows/amd64. When I changed the string s to s := fmt.Sprintf("中国欢迎您，北京欢迎您"), the allocation count reported for stringToByteSlice changed from 0 to 1.

That is, with this code:

// Other code remains unchanged

func stringToByteSlice() {
	s := fmt.Sprintf("中国欢迎您，北京欢迎您")
	_ = []byte(s)
}

// output:
// 1
// 1

One caveat when reading this result: fmt.Sprintf itself allocates the string it returns, so the extra allocation counted here may come from building s rather than from the []byte(s) conversion. What the first measurement does show is that the compiler can eliminate the allocation for the conversion when it is able to prove that doing so is safe, for example when the string's content is known at compile time and the resulting slice does not escape.

Thus, we can establish a rule of thumb:

A string-to-slice conversion allocates and copies in the general case, but the compiler may optimize the allocation away when it knows enough about the string and about how the result is used. When the content is only determined at runtime, it is safest to assume the conversion may allocate and to measure it.

What about at the Go runtime level? In the Golang runtime, the functions responsible for conversions are as follows:

// $GOROOT/src/runtime/string.go
slicebytetostring: []byte -> string
slicerunetostring: []rune -> string
stringtoslicebyte: string -> []byte
stringtoslicerune: string -> []rune

Let's take the byte slice as an example and look at the specific implementations of slicebytetostring and stringtoslicebyte:

// $GOROOT/src/runtime/string.go

const tmpStringBufSize = 32
type tmpBuf [tmpStringBufSize]byte

func stringtoslicebyte(buf *tmpBuf, s string) []byte {
	var b []byte
	if buf != nil && len(s) <= len(buf) {
		*buf = tmpBuf{}
		b = buf[:len(s)]
	} else {
		b = rawbyteslice(len(s))
	}
	copy(b, s)
	return b
}

func slicebytetostring(buf *tmpBuf, ptr *byte, n int) string {
	if n == 0 {
		// Turns out to be a relatively common case.
		// Consider that you want to parse out data between parens in "foo()bar",
		// you find the indices and convert the subslice to string.
		return ""
	}
	if raceenabled {
		racereadrangepc(unsafe.Pointer(ptr),
			uintptr(n),
			getcallerpc(),
			abi.FuncPCABIInternal(slicebytetostring))
	}
	if msanenabled {
		msanread(unsafe.Pointer(ptr), uintptr(n))
	}
	if asanenabled {
		asanread(unsafe.Pointer(ptr), uintptr(n))
	}
	if n == 1 {
		p := unsafe.Pointer(&staticuint64s[*ptr])
		if goarch.BigEndian {
			p = add(p, 7)
		}
		return unsafe.String((*byte)(p), 1)
	}

	var p unsafe.Pointer
	if buf != nil && n <= len(buf) {
		p = unsafe.Pointer(buf)
	} else {
		p = mallocgc(uintptr(n), nil, false)
	}
	memmove(p, unsafe.Pointer(ptr), uintptr(n))
	return unsafe.String((*byte)(p), n)
}

The key to efficient conversion is reducing memory allocation. The runtime's conversion functions already contain code that avoids unnecessary allocations: they use the tmpBuf passed in by the caller when the result is short enough, return "" for empty input, and serve single-byte strings from a static table.

Slices are not comparable, while strings are, so we often convert a []byte to a string in order to compare it or to use it as a map key. The Go compiler optimizes such scenarios, and the runtime contains a function called slicebytetostringtmp that assists in implementing this optimization:

func slicebytetostringtmp(ptr *byte, n int) string {
	if raceenabled && n > 0 {
		racereadrangepc(unsafe.Pointer(ptr),
			uintptr(n),
			getcallerpc(),
			abi.FuncPCABIInternal(slicebytetostringtmp))
	}
	if msanenabled && n > 0 {
		msanread(unsafe.Pointer(ptr), uintptr(n))
	}
	if asanenabled && n > 0 {
		asanread(unsafe.Pointer(ptr), uintptr(n))
	}
	return unsafe.String(ptr, n)
}

What does slicebytetostringtmp optimize? Its approach is quite aggressive: it directly reuses the underlying memory of the slice, avoiding any memory allocation and copying. The prerequisite is that the original slice must not be modified while the resulting string is in use; otherwise, the string's content would change underneath it. The compiler therefore uses this function only for short-lived temporary strings, in scenarios such as the following:

b := []byte{'k', 'e', 'y'}

// 1. string(b) used as a key in a map lookup
m := make(map[string]string)
m["key"] = "value"
v := m[string(b)]
m2 := make(map[[3]string]string)
v2 := m2[[3]string{string(b), "key1", "key2"}]

// 2. string(b) used in string concatenation statement
s := "hello " + string(b) + "!"

// 3. string(b) used in string comparison
t := "tom"

if t < string(b) {
    ...
}
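In everyday code, these scenarios usually appear inside small helpers. The sketch below uses two hypothetical functions, lookup and equalsLiteral, in which string(b) exists only for the duration of a map lookup or a comparison. Whether a given toolchain actually skips the copy in each case is an assumption worth checking with testing.AllocsPerRun, as in the earlier measurements.

```go
package main

import "fmt"

// lookup (a hypothetical helper) indexes m with a []byte key.
// The converted string is only needed while the lookup runs.
func lookup(m map[string]int, key []byte) (int, bool) {
	v, ok := m[string(key)]
	return v, ok
}

// equalsLiteral (a hypothetical helper) compares a []byte with a string.
// The converted string is discarded right after the comparison.
func equalsLiteral(b []byte, want string) bool {
	return string(b) == want
}

func main() {
	m := map[string]int{"key": 42}
	b := []byte{'k', 'e', 'y'}

	fmt.Println(lookup(m, b))            // 42 true
	fmt.Println(equalsLiteral(b, "key")) // true
	fmt.Println(equalsLiteral(b, "tom")) // false
}
```

Storing string(b) in a variable that outlives the expression, or using it as the key of a map assignment, requires a real copy, because the string must then stay valid even if b changes later.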