Assignment 9: Digits

Deadline

16 November 2018 (Friday), 23:59 pm

Prerequisites

You survived CS1010 until Week 13!

Learning Outcomes

Be comfortable writing C programs that involve arithmetic operations, long, double, bool, and char types, conditional if/else statements, loops with while/for/do-while statements, arrays (including 2D arrays), strings, pointers, dynamic memory allocation, and struct .
Be comfortable breaking down a problem into smaller subproblems which can be resolved using functions, including reusing existing functions written for other programs (with a tweak), writing a function that calls itself, designing what should the inputs and return values/types of a function be.
Be able to analyze the problem and come up with solutions to problems that are efficient, through eliminating redundant and repetitive work.

Setup

Accept the assignment
Login to one of the hosts of CS1010 programming environment
Run:

~cs1010/get-as09

You should see the folder as09-<github id> in your home directory with the assignment skeleton inside.

Solving The Assignments

Edit the C files to solve the corresponding question as described below.
You should break down the problem into smaller subproblems, and write one function for each of the subproblem to solve the problem.
To compile, run:

make

Submission

When you are ready, run the following command to submit:

~cs1010/submit-as09

The C file given will be uploaded to GitHub. You can submit multiple times, but only the last submission will be graded.

Editing Your Files in Multiple Locations

You should edit your code only on the CS1010 PE hosts. If you choose to edit your code in other places, such as directly on Github or in a second location (such as your own laptop), you need to be comfortable with various git command to synchronize your code across the different locations, possibly needing to resolve synchronization conflicts.

Only the C files listed above will be submitted. You can create additional C files for testing different solutions, but eventually, you must put your solution into the corresponding C file provided as the skeleton.

Identifying Yourself

In every C file that you submit to CS1010, you need to identify yourself by writing your name and tutorial group. Marks will be deducted if you fail to do so. You need to edit the line:

@author XXXX (Group YYYY)

and change it to something like:

@author Ellen Ripley (Group 10)

Grading

This assignment contributes towards 4% of your final grade. The total mark for this assignment is 40 marks. There are five marking criteria: design, efficiency, correctness, documentation, and style.

Design (8 marks): You should break down your program into smaller functions, each one performs a single coherent task or solves a well-defined subproblem. You should avoid repetitive code that can be factored out as a function. Writing everything in a single long function main or cutting-and-pasting code would likely cause you to lose marks for this criteria.
Efficiency (8 marks): Your code should run within the efficiency bound (given in big-O notation) and avoid redundant work or repetitive work.
Correctness (8 marks): Correctness here is defined in the broad sense of using the various programming constructs in C (type, function, variable, loops, arrays, struct , pointers, conditionals, arithmetic expressions, logical expressions) properly, not just producing the correct output and bug-free.
Style (8 marks): 8 marks are given by default but we may deduct marks if you have a serious violation of the style guideline. Please refer to the CS1010 C Style Guide and follow the recommended guideline.
Documentation (8 marks): 8 marks are given by default, but we will deduct marks if you do not document the functions in your code properly. Please refer to the documentation and follow the recommended format. We have added the compilation flag -Wdocumentation to help you identify documentation errors. You need to explain the purpose of every file, function, and parameter. Logic in your code that is not obvious should be explained as well.

Note that the default marks on documentation and style are given on the condition that you have made a good attempt at solving the problem and your solution is reasonably close to correct. This is to prevent students from submitting an arbitrary program to claim 16 marks :)

We reserve the right to penalize students for using banned C syntax in the assignments. In addition, each grader at his or her own discretion can penalize students for repeating mistakes / bad practices from the student's past assignments (even if it was just a warning with no marks deducted in the earlier assignments).

Digits (40 Marks)

We started CS1010 with an assignment question on digits, so it is fitting that we end with one.

In this assignment, your goal is to write a program that can recognize handwritten digits. This is a classic problem in pattern recognition and machine learning. The state of the art techniques can now achieve a recognition rate of over 99%. We, however, will implement a simple algorithm called $k$ -nearest neighbor algorithm, which has lower accuracy but is simple enough to be a freshmen intro to programming assignment!

Here is how it works.

Representation of a handwritten digit

A handwritten digit is an image, represented as a 28x28 array of characters, consisting of . and #. An example is:

............................
............................
............................
..........#######...........
.........########...........
........##########..........
.......####....###..........
.......###....####..........
..............####..........
.............####...........
............#####...........
............####............
...........####.............
...........###..............
..........####..............
.........####...............
.........####...............
........####................
........###.................
........###...........####..
........###################.
........###################.
.........###########........
............................
............................
............................
............................
............................

The image above contains the digit 2.

Training Data

Our algorithm recognizes the patterns in a given image by comparing it against existing images where the digits contained within are known. These existing images are known as the training samples. They consist of a list of handwritten digits, together with the ground truth label, i.e., the information of which digit is contained in each image.

Usually, the more training data we have, the more accurate our recognition program will be.

For this assignment, we will use the data provided by the MNIST Handwritten Digit Database. The original data has been post-procesed into CSV file by Joseph Redmon, then post-processed again by yours truly into the format above.

The dataset from above contains 60,000 training samples. I have created two smaller subsets for you to test your code.

The first subset contains 10 samples for each digit -- this is given in the file train100.in.

The second subset contains 6 samples for digits 0 and 1 -- this is given in the file train6.in.

The full training set from MNIST is too big to be included on GitHub. You can read it from ~cs1010/as09/train60000.in if you want to play with it.

Testing Samples

The testing samples refer to the handwritten digits that we wish to recognize. Each of these handwritten digits is also labeled with its ground truth. We will use these ground truth to compare against the output from our algorithm so that we can check the accuracy of our algorithm.

We provide two sets of testing samples. The first contains three testing samples per digit. The second contains only a single digit corresponding to the example below.

The full training dataset from MNIST contains 10000 digits. It takes a long time to test every handwritten digit in this file, so you should do this only after you have made sure that your code is fast and is correct. Again, the file is too big to be posted on GitHub. You can read it directly from ~cs1010/as09/test10000.in.

The Algorithm

Let's define the distance between the two handwritten digits $d(x_1, x_2)$ as the number of pixels that are different between them, i.e., how many pixels are # in one image but is . in the image.

Given a test sample, $q$ , we find the distance between $q$ and all the available training samples and find the $k$ training samples with the smallest distance (i.e., $k$ nearest neighbors). $k$ is usually small -- we use $k = 5$ in this assignment.

The intuition is that $q$ must be "close" to the training data that has the same labels (i.e., the same digits). So we look at the these $k$ nearest neighbors and find the most common digit $d$ among them. We then recognize $q$ as containing the digit $d$ . If there are more than one most common digits or more than k nearest neighbors , then we break ties by returning the smaller digit (we should do something smarter than this, by return the closer digit, but let's do this for simplicity).

Efficiency

Suppose we have $n$ training samples, recognizing a digit should take no more than $O(kn)$ time (or $O(n)$ since $k$ is a constant).

There is also an opportunity to stop early the calculation of the distance between two handwritten digits if the distance is too large, pruning away redundant work.

Example

Consider the following simple example with six training samples and a test sample.

6
0
............................
............................
............................
............................
...............#####........
...............#####........
.............########.......
............##########......
...........###########......
..........########.###......
..........####.##...###.....
.........#####......###.....
........####........###.....
.......####.........###.....
.......###..........###.....
......####..........###.....
......###..........###......
......###.........####......
......###........###........
......###......####.........
......####..#######.........
......###########...........
.......########.............
........#####...............
............................
............................
............................
............................
0
............................
............................
............................
............................
.................####.......
................######......
...............#######......
.............#########......
...........###########......
..........############......
.........#############......
........#####.####.###......
.......#####..###...##......
.......####.........##......
.......##...........##......
......###...........##......
.....####..........###......
.....####..........###......
.....####.........###.......
.....####.......####........
.....##############.........
......############..........
........########............
.........######.............
............................
............................
............................
............................
0
............................
............................
............................
............................
............................
..............######........
..............########......
.............#########......
.............#########......
............##########......
..........######..####......
..........#####...####......
..........#####...####......
..........#####..#####......
..........####...#####......
.........####....####.......
........#####...#####.......
........#####...####........
........#####..#####........
........####..#####.........
........##########..........
........#########...........
........#########...........
.........######.............
..........#####.............
............................
............................
............................
1
............................
............................
............................
............................
............................
..................####......
.................#####......
.................#####......
................####........
...............#####........
...............####.........
...............####.........
..............####..........
.............####...........
............####............
............####............
...........####.............
...........####.............
..........####..............
..........####..............
.........####...............
.........####...............
........####................
........####................
.........###................
............................
............................
............................
1
............................
............................
............................
............................
............###.............
............####............
............####............
............####............
............####............
............####............
............#####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............####...........
.............#####..........
.............#####..........
..............####..........
..............####..........
..............####..........
...............###..........
............................
............................
............................
............................
1
............................
............................
............................
............................
............................
.............##.............
.............##.............
.............##.............
.............##.............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
.............###............
..............###...........
..............##............
..............##............
..............###...........
..............##............
..............##............
............................
............................
............................

The test sample is

1
0
............................
............................
............................
............................
............................
..............#####.........
.............#######........
...........##########.......
..........####.##.####......
.........####......####.....
.........###........###.....
........####.........##.....
........###..........##.....
.......###...........##.....
.......###...........##.....
.......###...........##.....
......####...........##.....
......####..........###.....
.......###..........###.....
.......###.........###......
........###.......###.......
........####....#####.......
.........##########.........
..........#########.........
............#####...........
............................
............................
............................

The distances between the test sample and each of the training sampes are 101, 120, 162, 174, 173, 162. The $k$ nearest neighbors belong to digits 0, 0, 0, 1, and 1. The most common digit among these neighbors is 0, so we conclude (correctly) that the test sample is digit 0.

Using `struct`

One of the objectives of this assignment is to see if you know how to use struct (refer to Unit 28. Appropriate use of struct is critical for the design and correctness marks for this assignment.

Input/Output

Write a program digits that reads, from the standard input, the following:

A positive integer $n$ , corresponding to the number of training samples,
then repeatedly read $n$ handwritten digits, containing:
- a label corresponding to the digit in the next image (a number between 0 - 9)
- 28 lines of texts, consisting of '.' and '#' only, representing a handwritten digit
followed by another positive integer $m$ , corresponding to the number testing samples,
then repeatedly read $m$ handwritten digits, containing:
- a label corresponding to the digit in the next image (a number between 0 - 9)
- 28 lines of texts, consisting of '.' and '#' only, representing a handwritten digit

Then prints, to the standard output, the following:

For each testing sample, - print the digit it is labeled as, followed by a space, followed by the digit it is recognized as

Finally, print a double value that corresponds to the accuracy, i.e, the percentage of ~~training~~ testing samples correctly recognized.

We separated the training samples and testing samples into two files, so the usual way of redirecting a file into your program does not work anymore (since you need two files).

The way to run your program is to do the following:

cat <training samples> <testing samples> | ./digits

We use cat which concatenate two files into one to pass both the training samples and testing samples into the program using the pipe |. If this sounds familiar, you have seen pipe before in Exercise 3.

The name convention for the output files is different for this assignment. The name is formatted as X-Y.out where X refers to the training samples trainX.in and Y refers to the testing samples testY.in.

Sample Runs

ooiwt@pe121:~/cs1010/as09$ cat inputs/train6.in inputs/test1.in | ./digits
0 0
100.0000
ooiwt@pe121:~/cs1010/as09$ cat inputs/train100.in inputs/test30.in | ./digits
0 0
0 0
0 6
1 1
1 1
1 1
2 2
2 5
2 2
3 3
3 3
3 3
4 9
4 4
4 9
5 5
5 3
5 3
6 2
6 6
6 6
7 7
7 7
7 7
8 8
8 1
8 8
9 9
9 9
9 9
73.3333

When you use the file ~cs1010/as09/train60000.in as the training samples, you should receive 100% accuracy with test30.in and about 96.5\% accuracy with ~cs1010/as09/test10000.in