Exercise 8: Digits

Prerequisite

You have completed Exercise 7.

Learning Outcomes

Able to decompose a complex problem into smaller sub-problems
Able to use struct to encapsulate data

Deadline

This exercise is part of the CS1010 formative assessment. Submit your solution before 18 November 2024, 23:59 to receive feedback and earn your achievement badges.

Acceptance Link

The link to accept the exercise is not available publicly. Visit Canvas for the link.

Concepts and Difficulty

	Question	Struct	Difficulty
1	Digits

Question 1: Digits

In this exercise, your goal is to write a program that can recognize handwritten digits. This is a classic problem in pattern recognition and machine learning. The state-of-the-art techniques can now achieve a recognition rate of over 99%. We, however, will implement a simple algorithm called the \(k\)-nearest neighbor algorithm, which has a lower accuracy but is simpler.

Supervised Machine Learning

The \(k\)-nearest neighbor algorithm belongs to a category of algorithms called supervised machine learning_ algorithms. These algorithms teach or train the machine to learn to do something, by showing the machine many samples. These are called training samples. After showing the machine a bunch of these training samples, we can then test the machine how well it has learned, by showing the machine some testing samples.

In our case, we want to train the machine to recognize handwritten digits. During the training phase, we will present a set of handwritten digits in the form of images (as 2D arrays) to the machine, together with a label indicating which digits it is (i.e., "this picture shows the digit 1, this picture shows the digit 4, etc.").

Then, during the testing phase, we present a set of handwritten digits to the machine, and ask, "ok, which digit do you think this is?"

We will then compare the machine's answer with the label of the handwritten digit (also called the ground truth) to see whether the machine recognizes it correctly or not.

Representation of a handwritten digit

A handwritten digit is an image, represented as a 28-by-28 array of characters, consisting of . and #. An example is:

............................
............................
............................
..........#######...........
.........########...........
........##########..........
.......####....###..........
.......###....####..........
..............####..........
.............####...........
............#####...........
............####............
...........####.............
...........###..............
..........####..............
.........####...............
.........####...............
........####................
........###.................
........###...........####..
........###################.
........###################.
.........###########........
............................
............................
............................
............................
............................

The image above contains the digit 2.

Training Data

For this assignment, we will use the data provided by the MNIST Handwritten Digit Database. The original data has been post-processed into CSV file by Joseph Redmon, then post-processed again by yours truly into the format above.

The dataset from above contains 60,000 training samples. I have created two smaller subsets for you to test your code.

The first subset contains 10 samples for each digit -- this is given in the file train100.in.
The second subset contains 6 samples for digits 0 and 1 -- this is given in the file train6.in.

The full training set from MNIST is too big to be included on GitHub. You can read it from ~cs1010/ex08/train60000.in if you want to play with it.

Testing Samples

We provide two sets of testing samples. The first contains three testing samples per digit. The second contains only a single digit corresponding to the example below.

The full testing dataset from MNIST contains 10000 digits. It takes a long time to test every handwritten digit in this file, so you should do this only after you have made sure that your code is fast and correct. Again, the file is too big to be posted on GitHub. You can read it directly from ~cs1010/ex08/test10000.in.

The Algorithm

We use the \(k\)-nearest neighbor algorithm for this assignment. The idea of this algorithm is that, given a testing sample, the machine will compare it to all the training samples, and find out which digit this testing sample is most similar to.

Let's define the distance between the two handwritten digits \(d(x_1,x_2)\) as the number of pixels that are different between them, i.e., how many pixels are # in one image but is . in the other image.

Given a test sample, \(q\), we find the distance between \(q\) and all the available training samples and find the \(k\) training samples with the smallest distance (i.e., \(k\)-nearest neighbors). \(k\) is usually small -- we use \(k\) = 5 in this assignment.

The intuition is that \(q\) must be "close" to the training data that has the same labels (i.e., the same digits). So we look at these \(k\)-nearest neighbors and find the most common digit \(d\) among them. We then recognize \(q\) as containing the digit \(d\).

Breaking Ties

There are two tie-breaking scenarios that we need to consider. First, if there are more than \(k\) nearest neighbors. In this case, we pick the nearest \(k\) by favoring the smaller digit. For example, if we have the following training samples:

digit	distance
3	120
2	125
3	131
7	140
1	140
2	140

We favor 1 and 2 and eliminate 7 from the top-\(k\) since they are smaller than 7.

Second, multiple digits may tie as the most common digit. In this case, we choose the one that has the smaller distance. In the example above, after eliminating 7, both 2 and 3 are tied for the most common digit as they both appear twice among the top 5. We choose 3 since it has a smaller distance (120 vs. 125). If the closest samples of the digits have the same distance, then we again break ties by favoring the smaller digit. For example, if the digit 2 in the second entry in the table above also has a distance of 120, then we pick 2 since 2 is the smaller digit.

Efficiency

Suppose we have \(n\) training samples, recognizing a digit should take no more than \(O(kn)\) time (or \(O(n)\) since \(k\) is a constant).

There is also an opportunity to stop early the calculation of the distance between two handwritten digits if the distance is too large, pruning away redundant work.

Example

Consider the following simple example with six training samples and a test sample.

6
0
............................
............................
............................
............................
...............#####........
...............#####........
.............########.......
............##########......
...........###########......
..........########.###......
..........####.##...###.....
.........#####......###.....
........####........###.....
.......####.........###.....
.......###..........###.....
......####..........###.....
......###..........###......
......###.........####......
......###........###........
......###......####.........
......####..#######.........
......###########...........
.......########.............
........#####...............
............................
............................
............................
............................
0
............................
............................
............................
............................
.................####.......
................######......
...............#######......
.............#########......
...........###########......
..........############......
.........#############......
........#####.####.###......
.......#####..###...##......
.......####.........##......
.......##...........##......
......###...........##......
.....####..........###......
.....####..........###......
.....####.........###.......
.....####.......####........
.....##############.........
......############..........
........########............
.........######.............
............................
............................
............................
............................
0
............................
............................
............................
............................
............................
..............######........
..............########......
.............#########......
.............#########......
............##########......
..........######..####......
..........#####...####......
..........#####...####......
..........#####..#####......
..........####...#####......
.........####....####.......
........#####...#####.......
........#####...####........
........#####..#####........
........####..#####.........
........##########..........
........#########...........
........#########...........
.........######.............
..........#####.............
............................
............................
............................
1
............................
............................
............................
............................
............................
..................####......
.................#####......
.................#####......
................####........
...............#####........
...............####.........
...............####.........
..............####..........
.............####...........
............####............
............####............
...........####.............
...........####.............
..........####..............
..........####..............
.........####...............
.........####...............
........####................
........####................
.........###................
............................
............................
............................
1
............................
............................
............................
............................
............###.............
............####............
............####............
............####............
............####............
............####............
............#####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............#####..........
.............#####..........
..............####..........
..............####..........
..............####..........
...............###..........
............................
............................
............................
............................
1
............................
............................
............................
............................
............................
.............##.............
.............##.............
.............##.............
.............##.............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
..............###...........
..............##............
..............##............
..............###...........
..............##............
..............##............
............................
............................
............................

The test sample is

1
0
............................
............................
............................
............................
............................
..............#####.........
.............#######........
...........##########.......
..........####.##.####......
.........####......####.....
.........###........###.....
........####.........##.....
........###..........##.....
.......###...........##.....
.......###...........##.....
.......###...........##.....
......####...........##.....
......####..........###.....
.......###..........###.....
.......###.........###......
........###.......###.......
........####....#####.......
.........##########.........
..........#########.........
............#####...........
............................
............................
............................

The distances between the test sample and each of the training samples are 101, 120, 162, 174, 173, and 162. The \(k\) nearest neighbors belong to digits 0, 0, 0, 1, and 1. The most common digit among these neighbors is 0, so we conclude (correctly) that the test sample is digit 0.

Using struct

One of the objectives of this exercise is to practice how to use struct.

There are two opportunities to use struct in this exercise:

a struct digit that encapsulates the 2D array that represents the handwritten digit and its label,
a struct neighbor that encapsulates a neighboring digit and the distance to it.

You can define other struct as needed.

Input/Output

Write a program digits that reads, from the standard input, the following:

A positive integer \(n\), corresponding to the number of training samples, then repeatedly read \(n\) handwritten digits, containing:
a label corresponding to the digit in the next image (a number between 0 - 9)
28 lines of text, consisting of '.' and '#' only, representing a handwritten digit
followed by another positive integer \(m\), corresponding to the number testing samples, then repeatedly read \(m\) handwritten digits, containing:
a label corresponding to the digit in the next image (a number between 0 - 9)
28 lines of text, consisting of '.' and '#' only, representing a handwritten digit

The program prints, to the standard output, the following:

For each testing sample,

print the digit it is labeled as, followed by a space, followed by the digit it is recognized as.

Finally, print a double value that corresponds to the accuracy, i.e, the percentage of training testing samples correctly recognized.

We separated the training samples and testing samples into two files, so the usual way of redirecting a file into your program does not work anymore (since you need two files).

The way to run your program is to do the following:

1	`cat <training samples> <testing samples> \| ./digits`

We use cat, which concatenates two files into one to pass both the training samples and testing samples into the program using the pipe |.

The name convention for the output files is different for this assignment. The name is formatted as X-Y.out where X refers to the training samples trainX.in and Y refers to the testing samples testY.in.

Testing

In addition to the MNIST dataset, a small set of test data has been provided to help you test your code. These are given as the files digits.X.in and digits.X.out in the inputs and outputs directory. They are hand-crafted inputs that do not contain any digit-like data.

If you wish to test using only these hand-crated test cases, run

1	`./test.sh digits`

You can also run

1	`./test.sh digits <X> <Y>`

to test with the training samples trainX.in and testing samples testY.in.

Running

make

will cycle through testing with these test cases as usual, followed by a combination of training/testing data.

Sample Runs

ooiwt@pe121:~/cs1010/ex08$ cat inputs/train6.in inputs/test1.in | ./digits
0
0000
ooiwt@pe121:~/cs1010/ex08$ cat inputs/train100.in inputs/test30.in | ./digits
0
0
6
1
1
1
2
5
2
3
3
3
9
4
9
5
3
3
2
6
6
7
7
7
8
1
8
9
9
9
3333

When you use the file ~cs1010/ex08/train60000.in as the training samples, you should get 100% accuracy with test30.in and 96.56% accuracy with ~cs1010/ex08/test10000.in

To speed up the execution of your code, you can remove the following lines from compile_flags.txt (after making sure that it does not contain any memory errors).

-g
-fsanitize=bounds
-fsanitize=address