Introduction to Computer Science

Digit Recognition – 80 course points

Students are introduced to machine learning principles and object oriented programming.

General Guidelines

Write a program and submit on Autolab.

The assignment has two components:

Coding (77 points) submitted through Autolab.
- We provide a zip file (find it under Asssignment7 on Autolab) containing Digit.java, DigitMatcher.java, Driver.java, Picture.java,StdIn/StdOut.java and CSVs
- UPDATE and SUBMIT DigitMatcher.java only writing where it indicates. You can write in the Driver to test your code
Reflection (3 points) submitted through a form.
- Submit the reflection AFTER you have completed the coding component.
- Be sure to sign in with your RU credentials! (netid@scarletmail.rutgers.edu)
- You cannot resubmit reflections but you can edit your responses before the deadline by clicking the Google Form link, signing in with their netid, and selecting “Edit your response”

Observe the following rules:

DO NOT add any import statements
DO NOT add the project statement
DO NOT change the class name
DO NOT change the headers of ANY of the given functions
DO NOT add any new class fields
DO NOT use System.exit()
ONLY print the result as specified by each problem. Observe the examples’ output, display only what the problem is asking for.
DO NOT print other messages, follow the examples for each problem.
ONLY print the result as specified by each problem. Observe the examples’ output and display only what the problem is asking for.
1. DO NOT print other messages; follow the examples for each problem.

1. Digit Recognition (77 points)

Students are introduced to the concept of machine learning through the use of object oriented programming.

Overview

Machine learning is a field in computer science that focuses on teaching computers to learn from data. One of the most important tasks in this field is classification. Classification is the process of predicting the category or label of a given piece of data. Imagine you have a bunch of emails, and you want to sort them into “Spam” and “Not Spam.” This sorting is a classification task. By training the computer with examples of spam and not spam emails, it learns to distinguish them.

Classification helps in making decisions automatically. For example, banks use it to decide if a transaction is fraudulent. Doctors use classification to predict diseases based on medical data. Companies classify customer feedback as positive or negative to improve their services.

Digit recognition is a specific type of classification where the computer learns to recognize handwritten digits (0-9) from images. This task is very important in many real-world applications. For example, it helps in automating data entry by recognizing handwritten numbers on forms, like postal codes on letters. In education, it assists students in practicing and improving their handwriting by providing instant feedback. In security, it verifies handwritten signatures for authentication. We use Gradescope in this course, it reads your handwritten netid from the exam and associates the exam with your account on Gradescope.

In this assignment, you will write 3 algorithms to recognize digits from the MNIST dataset. The MNIST (Modified National Institute of Standards and Technology) dataset is a large collection of handwritten digits commonly used for training and testing in machine learning and image processing. When we attempt to identify handwritten digits, our features will be the pixels from each image. We’ll measure how similar the digits are to each other by computing what percentage of pixels match between digits.

Similarity between two digits

Imagine that you have graph paper cut into 10×10 squares with single digit numbers on these squares, like the examples that are shown below.

Imagine that you wrote any combination of digits on these pieces of paper, but made sure that you had at least one duplicate number.
Imagine then, that you hold the pieces of paper on top of one another to visualize the similarities between the digits.
If you actually do this test, you will see that:
- When you compare two digits that are the same, many squares match between the two digits.
- When you compare two different digits, not as many squares match.

The same can be done by a computer program. To identify handwritten digits, the values used to compute similarity of the digits to each other are the pixels from each image. Each square is a pixel in a cell in the a 2D array.

Similarity of the digits is computed using the percentage of matching pixels between digits.

In the diagram below, the pixels from the images of the number two are very similar. In this case, 94% of the pixels are the same between the images. In the second set of images, the difference is much greater. Not surprisingly, it appears the two digits in the top pictures are the same, while the two digits in the bottom images are not the same.

The similarity score for two digits would be computed by calculating the ratio of the number of matching pixels (squares) to the total number of pixels (squares) compared. Boxes should be considered matching based on whether or not they have any black fill in them.

In the example above, all ten squares in the first-row match, since all of them are empty. In the second row, eight of the ten squares match; the first three and last two are empty in both images, both also have black in the fourth, fifth and sixth squares; but the five has black in the seventh and eighth squares, whereas the two do not have black in the seventh and eight squares. After comparing the first two rows, 18 of the 20 boxes match between these two digits. Work through the remaining rows to compute the similarity score for the two digits.

Programming note for later: This is done by the computeSimilarity method in the Digit class. The method computes the similarity score by comparing the pixels of two digits.

Use digit similarity to classify unknown/unlabeled digits

Imagine that you have a training dataset containing thousands of digits (2D array of pixels like the graph paper). Many of these digits are duplicates with different handwritten characteristics. Each of these handwritten digits contain a label which is the integer value of the digit.
Then, given an unlabeled digit (a digit that we don’t know its integer value), compute the similarity between the unlabeled digit and every digit in the training dataset.
Once that is done, we can use an algorithm to predict the label for the unlabeled digit. Here are the algorithms we will use in this assignment, more about them later:
1. Most similar: predicts the label of the unlabeled digit to be the label of a digit from the training dataset that has the highest similarity to the unlabeled digit.
2. K-nearest neighbors: predicts the label of the unlabeled digit to be the label of a digit from the training dataset that appears most frequently amongst the k most similar to the unlabeled digit.
3. Weighted k-nearest neighbor: predicts the label of the unknown digit to be the label of a digit from the training dataset that has the highest sum of similarities amongst the k most similar to the unlabeled digit.

Files

Digit.java represents a single handwritten digit.
DigitMatcher.java is used to write the logic that will identify unlabeled Digits
- There is one instance variable Digit[] digits which is an array of digits
StdIn.java and StdOut.java
Driver.java is used to test the DigitMatcher methods.
NOTE about the datasets
- In the zip file we provide smaller datasets for you to test. The same small datasets will be used in Autolab. The full datasets are provided through the following links for you to download if you would like to do more analysis:
  - training dataset with 60000 digits
  - test dataset with 8999 digits
  - unlabeled dataset with 999 digits
mini_smallerTrainingData_700.csv
- This small training dataset file contains the pixels for the each digit along WITH their labels (their integer value). The digits are loaded into DigitMatcher’s digits array.
- Format of the input files: The MNIST dataset we use contains information about over 60,000 handwritten digits! (The first row contains the number of entries there are in the file). Each row contains the information for one digit. The first value is the label for that digit, from zero to nine. The remaining 784 values represent each pixel in the 28×28 pixel digit. Each pixel is represented by a zero or values greater than zero, where a zero would denote that the pixel is empty, and other values would denote that the pixel is filled in. The data for the digit
  
  would be represented by the comma separated values (csv) shown below. The first value, 5, specifies that the digit is in fact a five. Each of the other values represents a single pixel. A 0 means that the pixel is white; a value > 0 (greater than zero) means that the pixel is black.
  5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,
  1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,
  1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,
  1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,
  0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,
  1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
mini_testingData_126.csv contains the labels and pixels.
- The small testing dataset contains the pixels for the each digit along WITH their labels (their integer value). These digits will be used to test the accuracy of DigitMatcher’s algorithms.

Task

UPDATE and SUBMIT DigitMatcher.java in AutoLab.
You may have to write in the Driver for testing purposes.

Programming

When computer scientists create classification models, they use two distinct data sets, one to train their model and one to test its accuracy. See the datasets description above.
The DigitMatcher class will be used to write the logic that will predict unknown/unlabeled Digit. It stores the training dataset in the digits array.

DigitMatcher(String trainingInputFile) – constructor

The constructor receives as input the name of the file that includes the training dataset. Refer to the dataset format for the file format.
For each digit in the dataset file:

1. Create a Digit object with the digit information.
  String[] values = StdIn.readString().split(","); turns the line of 0,1,0,… into a String array with each 0 and 1 in its own index.
  1. The Digit contains the label: first value of each line of the input file.
  2. The Digit contains a 2D array of Pixels: the remaining 784 values of each line of the input file.
    1. The value of each pixel is expected to be:
      1. 0 (zero): if the input file contains 0 (zero).
      2. 1 (one): if the input files contains a value greater than 0.
  3. The Digit also contains a similarity score (computed later).
2. Insert the newly created object into the instance variable digits.

Testing

Use the main method in Driver.java to test. Instantiate a DigitMatcher object with the training dataset and use the getter method to get the digits instance variable. Then print a digit.

Make sure that dm’s digits array contains 700 Digits.
Print out some of the digits.
Notice that the similarity score is 0.0 because it hasn’t been computed yet.

computeSimilarity(Digit digit)

(This method assumes you’ve populated the instance variable by running DigitMatcher constructor).

The parameter digit is an unlabeled digit. In this method you will compute how similar this unlabeled digit is to each of the digits in the training dataset.

To do this invoke the computeSimilarity method for each digit in the training dataset. (computeSimilarity is a method from the Digit class)

Note that the similarity for the training digit will be updated, meaning that EACH digit in the dataset will store how similar it is to the parameter digit.

Testing

Use the main method in Driver.java to test.

The test digit at index 0 is 3
Notice how the training digit at index 4 has a higher similarity score than the training digit at index 0.

mostSimilar()

(This method assumes that computeSimilarity has been run and that each digit in the training dataset stores its similarity score to the unlabeled digit.)

In this method, you will evaluate the algorithm’s ability to recognize digits by finding the most similar digit from the training dataset for the digit in the test set (the digit used when computeSimilarity ran).

To accomplish this, loop through the digits array and find and return the Digit with the highest similarity score.

Note: If there are more than one digit with the same highest similarity score, return the first digit found. For example, if there are 5 digits with similarity scores of 0.2, 0.6, 0.6, 0.4, 0.6 the second digit is returned.

Testing

Use the main method in Driver.java to test. Notice how the most similar digit in the training dataset matches the first digit in the test dataset.

The test digit at index 0 is 3 (label)
Notice how the most similar digit in the training dataset is 5 (label). The similarity score is 0.90
The problem here is that the training dataset is too small.
- Try running the same code using the larger dataset.

In the driver, you can also use the pre-written method below to test the accuracy of the prediction of each test digit.

- testMostSimilar("mini_smallerTrainingData_700.csv", "mini_testingData_126.csv"); // This number has a 88.88% accuracy when calculating the label

findGreatestSimilarity(int low, int high)

(This method assumes that computeSimilarity has been run).

In this method, you will write code to find and return the digit from the dataset with the largest similarity value between the low and high indices inclusive.

*This is very similar to mostSimilar

rankBySimilarity()

(This method assumes that computeSimilarity has been run).

In this method you will reorder the dataset digits array to contain the most similar digit at index 0, the next most similar digit at index 1 and so on. Use the following algorithm to accomplish this.

For each index i in the digits array
- Find the most similar digit between index i and the end of the array (use findGreatestSimilarity)
- Swap the digit at index i with the digit at the index where the most similar digit lies

*Hint: This is the selection sort algorithm which you will learn in more detail later in the semester.

kNearestNeighbors

(This method assumes that rankBySimilarity has been written.)

The k-nearest neighbors algorithm is very similar to what we’ve already implemented in mostSimilar, except that instead of comparing our unlabeled data point to just the data point that it is most similar to, we’ll compare it to the k data points that it’s most similar to. Why is this advantageous? It can increase the accuracy of our algorithm by decreasing the likelihood that a single outlier from our training data may affect how we label a digit. Our algorithm currently labels a test digit by finding the digit from our training data that is most similar to this test digit. While this method can be effective, it may be influenced heavily by inconsistent or erroneous data.

Consider our housing data scenario. Suppose we want to estimate the value of a house by comparing it to other houses in our dataset. We find a house that seems like a perfect match based on square footage, location, and condition. However, if this “perfect match” house is next to a shopping mall or has a price entered with an error, it could significantly skew our value prediction for the new house.

To reduce the impact of such errors, we can enhance our mostSimilar by considering multiple similar houses rather than just one. This is the core idea behind k-nearest neighbors (k-NN) classification, where we compare our unlabeled digit to the k most similar digits, rather than relying on a single closest match.

This method finds the k Digits from the dataset that have the highest similarity scores with respect to a given unlabeled Digit and returns the label that appears most often. The method does not modify or remove any elements from the digits array.

Notice that at this point digits array is ranked by similarity, meaning that the array is ordered from most to least similar (most similar at index 0, least similar at the last array index).

- You only need to consider the first k items in the digits array. Amongst those, return the label that appears the most.

Now we’ll use these k neighbors to determine the label for the unlabeled Digit. We’ll do this by taking a vote from the k-nearest neighbors. Suppose we have identified five nearest neighbors for our unlabeled digit. When we examine these neighbors, we find:

- Three neighbors are labeled as 7 (seven).
- One neighbor is labeled as 2 (two).
- One neighbor is labeled as 1 (one).

Based on this information, we would label our test Digit as 7 (seven), because this label appears most frequently among the neighbors. This process is similar to finding the mode in the list of neighbors’ labels. We determine the label for the test Digit by selecting the most common label among its k-nearest neighbors.

These are the steps to complete for this method:

1. Use the first k items in the digits array, these are the k-nearest neighbors (the k items that are most similar to the unlabeled digit).
2. Determine the most frequent label among these k Digit objects in the digits array.
  *Hint: think about CharacterCounter. How can you store the counts of digits using an array?
3. Return this most frequent label as an int.

Testing

Use the main method in Driver.java to test. Notice as you increase the k value what happens.

The test digit at index 0 is 3
Notice it correctly predicts the test digit label as 3 when k=5.
Try other test digits as well, there 126 on the small data set we gave you.
Also try running the same code using the larger dataset.

In the driver, you can also use the pre-written method below to test the accuracy of the prediction of each test digit.

testKNearestNeighbors(#,"mini_smallerTrainingData_700.csv","mini_testingData_126.csv"); #’s from (1-10)

Notice how low the accuracy is somewhat high for k=1 (which is mostSimilar).
Try other digits and see how the accuracy changes.

weightedKNearestNeighbors

(This method assumes that rankBySimilarity has been run).

kNearestNeighbors treats each neighbor equally when voting to determine the label of an unlabeled digit. However, in practice, some neighbors may be a much better match than others. To improve accuracy, we can enhance our algorithm by using a weighted voting system instead of a simple majority vote.

Instead of counting each neighbor’s vote as just one, we can assign weights based on the similarity scores of the neighbors. This means that neighbors that are closer or more similar to the test Digit will have a greater impact on the final label. By using a weighted voting system, where the weight of each vote is proportional to the neighbor’s similarity score, we can make more accurate predictions for the test Digit.

This method determines the best label for the unlabeled digit by aggregating votes with weights based on each neighbor’s similarity score and then returns the label that receives the highest weighted vote (highest sum of similarity scores).

Testing

Use the main method in Driver.java to test. Notice as you increase the k value what happens. The following code also provides the accuracy testing.

The test digit at index 0 is 3
Again it predicts the correct label.
Try other test digits and also try running the same code using the larger dataset.

Executing and Debugging

First navigate to the DigitRecognition directory/folder

You can use the Driver class provided to test the accuracy of the 3 algorithms written. We provide a test method for each of the 3 algorithms: testMostSimilar(), testKNearestNeighbors, and testWeightedKNearestNeighbors. Each method prints the accuracy.
- The first file is the training data file
- The second file is the testing data file
What happens when you change the parameters? Do the predictions get more accurate?

Once you tested you can predict the label of one or more of the unlabeled data file using populaeArrayOfUnknownDigitsto load the unlabeled digits.
- to compile: javac DigitMatcher.java
  - If there are compile errors, try javac *.java which will compile all the .java files in the src folder
- to execute: java Driver

Author Credit: Ana Paula Centeno and Tom Swope based on a College Board Lab

Submitting Guidelines

VSCode Extensions

You can install the VSCode extension pack for Java. Take a look at this tutorial.

We suggest:

Importing VSCode Project

Download Assignment7.zip from Autolab Attachments.
Unzip the file by double clicking.
Open VSCode
- Import the folder to a workspace through File > Open Folder

Before submission

Collaboration policy. Read our collaboration policy here.

Submitting the assignment. Submit DigitMatcher.java via Autolab. To do this, click the Assignments link from the course website; click the Submit link for that assignment.

Getting help

If anything is unclear, don’t hesitate to drop by office hours or post a question on Piazza.

Find instructors office hours here
Find tutors office hours in Canvas -> Tutoring, RU CATS
Find head TAs office hours here
POST on Piazza in Canvas -> Piazza
In addition to office hours we have the CSL (Coding and Social Lounge), a community space staffed with lab assistants which are undergraduate students further along the CS major to answer questions.
Refer to our Programming Assignments FAQ for instructions regarding our assignments and policies.

Submitting

Write 1 program and submit on Autolab.

We provide a zip file (find it under Assignment7 on Autolab) containing Digit.java, DigitMatcher.java, Driver.java, StdIn/StdOut.java and CSVs
UPDATE and SUBMIT DigitMatcher.java