Unit 16: Strings
Learning Objectives
After this unit, students should:
- understand that a string is an array of chars
- know that a string always ends with a
char
with value 0 (or'\0'
) - be aware of the different types of special chars
- understand what the empty string represents
- be able to use the CS1010 library functions to read strings and arrays of strings
Just a char
Array
We have seen strings as a sequence of characters stored in double-quotes, e.g., "hello!". In C, a string is nothing more than just an array of char
values (Recall from Unit 5 that sizeof char
is 1).
The only thing special about a string is that it always end with a 0
value (note: not character '0' which has a value of 48, but the value 0). Since the character with value 0 is called the null character, written as '\0'
, we refer that strings in C as null-terminated strings.
In C, we distinguish between a string and a char
by the quotes used. String uses double quotes "
, while a char
uses a single quote '
. So we can do the following and they are equivalent:
1 2 |
|
Of course, nobody initializes a string in the first way above.
We can also skip the size of the array, as mentioned above, or, more commonly, use a char *
type for a variable storing a constant string.
1 2 |
|
Special Characters
The null character is written as '\0'
. The use of the backslash \
creates an escape sequence that can be used to denote characters that would otherwise not visible on the screen. For instance, besides '\0'
, we will likely encounter '\n'
(the newline character) '\t'
(the tab character) and '\a'
(the beep character) regularly.
Furthermore, since we already use \
to indicate the escape sequence, '
to indicate a character, and "
to indicate a string, to use these characters, we need to "escape" them -- we use '\\'
for the backslash character, '\''
, the single quote character, and the '\"'
double-quote character.
String Literals
A string literal refers to a string written between two "
characters, such as "Hello world!"
. Such a string is still internally stored as an array of char
values. The location these arrays are stored depends on the platform, but usually, they are stored in a read-only region of the memory called the text
region. These strings are not meant to be modified. Hence, trying something like this:
1 2 |
|
is not allowed and would crash your program.
The following, however, is OK:
1 2 |
|
The difference between the two is that str1
points to a read-only region in the memory, while str2
contains a copy of the string on the stack.
Empty String
We commonly use the empty string ""
to indicate a special condition or to initialize a string variable, where appropriate. The empty string is nothing but an array where the 0-th element is '\0'.
Common Bugs
A very common bug for beginners to C programming is to forget about the terminiating '\0'
char in a string. For instance, suppose you want to copy one string to another:
1 2 3 4 5 |
|
The string variable dst
is not terminated and C would treat whatever follows the memory location
as part of the string!
The other common bug is to create an array that is too small for the string. Suppose you use the function strcpy
provided by C to copy the string instead.
1 2 3 |
|
strcpy
would properly copy the characters 'h'
, 'e'
, .. '!'
, AND the terminating '\0'
. That's a total of 7 characters. The code, however, only allocated the space of 6 characters for dst
. So overflow occurs and your program might crash.
So, remember to always account for the "invincible friend" at the end of every string, and leave a space for it.
CS1010 I/O Library
The CS1010 I/O library provides two functions, one to read a word (separated by white-space characters) and the other to read a line (separated by a newline character). They are cs1010_read_word()
and cs1010_read_line()
respectively. We can also read multiple words and multiple lines with cs1010_read_word_array()
and cs1010_read_line_array()
. The results are stored in an array of strings.
Problem Set 16
Problem 16.1
Write the following functions (without calling the standard C functions declared in <string.h>
such as strlen
, strcmp
, strstr
):
(a) long string_length(char *str)
return the length (i.e., the number of characters) of the string str
.
(b) bool string_equal(char *str1, char *str2)
return true
if the two strings str1
and str2
contains exactly the same content, false
otherwise. (Note: str1 == str2
does not compare if two strings have the same content. (Why?))
(c) char *string_in_string(char *needle, char *haystack)
returns a pointer to the first character of the first occurrence of needle
in haystack
if found. If needle
does not occur anywhere in haystack
, return NULL. If needle
is an empty string, haystack
is returned.