CSCI 3080 - Discrete Structures

Lab 4 - Compressing Text Files using Huffman Coding

1. Objective

In this lab, students will implement Huffman coding in C++ to compress and decompress text files. They will learn how to construct a Huffman tree, generate Huffman codes, encode a text file into a compressed binary format, and then decode it back to its original form. Additionally, students will analyze the efficiency of Huffman compression by comparing the file sizes before and after compression. By the end of this lab, students will be able to:

★ Understand the principles of Huffman coding and its application in data compression.
★ Use a vector to construct a Huffman tree.
★ Generate Huffman codes for individual characters based on their frequency.
★ Encode a given text file into a compressed binary format.
★ Decode the Huffman codes to restore the original text.
★ Evaluate the efficiency of the compression by comparing file sizes.

2. Lab Instructions

Step 1: Read the Input File

1. Create a text file named input.txt.
2. Write a C++ program to read the contents of input.txt into a string.

        string filename;
        cout << "Enter text file name: ";
        cin >> filename;
    
        // Read input file
        ifstream inputFile(filename);
        if (!inputFile) {
            cout << "Error opening file.\n";
            return 1;
        }
        
        string text;
        char ch;
        // Read characters from the file one by one until the end of the file is reached
        while (inputFile.get(ch)) {
            _____________ // Append each character to the string
        }
        inputFile.close();
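
As an aside, the character-by-character loop above is not the only way to load a file. The sketch below reads an entire stream in one step through its stream buffer. The helper name readAll is an assumption made for this illustration and is not part of the lab skeleton.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not part of the lab skeleton): copies everything
// that remains in the input stream into a string, including spaces and
// newline characters.
std::string readAll(std::istream& in) {
    std::ostringstream buffer;
    buffer << in.rdbuf();   // Stream the whole underlying buffer at once
    return buffer.str();
}
```

An opened ifstream can be passed directly, for example readAll(inputFile). Either approach keeps whitespace, which matters because spaces and newlines also receive Huffman codes.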

Step 2: Compute Character Frequencies

1. Traverse the input string and count the occurrence of each character.
2. Store the character-frequency mapping in an unordered_map.

        unordered_map<char, int> freqMap;
        for (char ch : text) {
            _______________ // Increment the frequency of each character in freqMap
        }
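
An unordered_map does not iterate in any particular order, so printing freqMap directly can list characters in a seemingly random sequence. The sketch below copies the entries into a vector and sorts them, which can make the counts easier to check by eye. The helper name sortedFrequencies is an assumption made for this illustration.

```cpp
#include <algorithm>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the lab skeleton): returns the
// (character, count) pairs sorted by descending count, breaking ties
// by ascending character value.
std::vector<std::pair<char, int>> sortedFrequencies(
        const std::unordered_map<char, int>& freqMap) {
    std::vector<std::pair<char, int>> items(freqMap.begin(), freqMap.end());
    std::sort(items.begin(), items.end(),
              [](const std::pair<char, int>& a, const std::pair<char, int>& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    return items;
}
```

Sorting a copy leaves freqMap itself unchanged, so later steps can keep using it as before.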

Step 3: Defining and Creating Huffman Tree Nodes

1. Define a structure (struct Node) to store:

a. A character (ch) for leaf nodes.
b. An integer (freq) representing the character's frequency.
c. Two pointers (left and right) for child nodes.

2. Define a function that creates a new node dynamically and initializes it with:

a. A character (char ch)
b. A frequency (int freq)
c. Left and right pointers set to NULL (no children initially)

3. Define a function that deletes the entire Huffman tree and frees the allocated memory.

Reference for Step 3: See InOrder.cpp for a related example of creating new nodes and deleting nodes.
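
One possible shape for the three pieces described in Step 3 is sketched below. It is a minimal illustration rather than the required solution. The names Node and newNode follow the code in later steps, while deleteTree and its return value (the number of nodes freed) are assumptions added here so the behavior can be checked.

```cpp
#include <cstddef>

// A node of the Huffman tree.
struct Node {
    char ch;      // Character stored in a leaf ('\0' for internal nodes)
    int freq;     // Frequency of the character or of the merged subtree
    Node* left;   // Left child (nullptr for a leaf)
    Node* right;  // Right child (nullptr for a leaf)
};

// Allocates a node on the heap with no children.
Node* newNode(char ch, int freq) {
    Node* node = new Node;
    node->ch = ch;
    node->freq = freq;
    node->left = nullptr;
    node->right = nullptr;
    return node;
}

// Hypothetical name: frees the whole tree in post-order (children first,
// then the node itself) and returns how many nodes were freed.
int deleteTree(Node* root) {
    if (root == nullptr) return 0;
    int freed = deleteTree(root->left) + deleteTree(root->right);
    delete root;
    return freed + 1;
}
```

Deleting the children before the parent matters: once a node is deleted, its left and right pointers can no longer be read safely.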

Step 4: Build the Huffman Tree

1. Use a vector to construct the Huffman tree.
2. Merge nodes with the lowest frequencies to build the tree.
3. Store the root node of the Huffman tree.

        // Comparison Function to sort nodes based on frequency
        bool compare(Node* a, Node* b) {
            return a->freq < b->freq;  // Sort in ascending order of frequency
        }
        
        // Build Huffman Tree with vector
        Node* buildHuffmanTree(unordered_map<char, int>& freqMap) {
            vector<Node*> nodes;  // Store nodes in a vector
        
            // Step 1: Convert frequency map into nodes
            for (auto pair: freqMap) {
                
                char character = pair.first;   // Get the character (e.g., 'a')
                int frequency = pair.second;   // Get the frequency (e.g., 15)
                
                //Create a node using newNode function with character and its frequency
                ______________________________________
        
                // Use push_back to add the newly created node to the nodes vector.
                _______________________________________
            }
        
            // Step 2: Repeatedly sort and merge nodes.
            // The loop keeps merging the two lowest-frequency nodes until only one node remains, which becomes the root of the Huffman tree.
            while (nodes.size() > 1) {
                sort(nodes.begin(), nodes.end(), compare);  //sort the nodes in ascending order of frequency 
        
                // Take two smallest frequency nodes
                Node* left = _________;  //Hint: The node at index 0 has the smallest frequency.
                Node* right = _________; //Hint: The node at index 1 has the second smallest frequency.
        
                // Create a new node combining both
                // '\0' is used here as a placeholder character for internal nodes, which hold no character of their own.
                Node* merged = newNode('\0', ___________________); //Hint: The frequency of the merged node is the sum of left and right nodes' frequencies.
                merged->left = left;
                merged->right = right;
        
                // Remove the first two nodes from the list
                nodes.erase(nodes.begin(), nodes.begin() + 2);
                
                ___________________//Hint: push_back() adds the merged node to the end of the nodes vector.
            }
            
            // Ensure the vector is not empty before accessing index 0
            if (nodes.empty()) return nullptr;
            return nodes[0];  // The last remaining node is the root
        }
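
The lab requires the vector-based approach above, which re-sorts the whole vector before every merge. For comparison only, the sketch below swaps in a different technique: a min-heap (std::priority_queue), which hands back the two smallest nodes without a full sort. The names buildWithHeap, HigherFreq, and encodedBits are assumptions made for this illustration, and the Node definition is repeated so the example stands alone.

```cpp
#include <queue>
#include <unordered_map>
#include <vector>

struct Node {
    char ch;
    int freq;
    Node* left;
    Node* right;
};

// Orders the heap so that the node with the LOWEST frequency is on top.
struct HigherFreq {
    bool operator()(const Node* a, const Node* b) const {
        return a->freq > b->freq;
    }
};

// Hypothetical alternative to buildHuffmanTree that uses a min-heap.
// The caller owns the returned tree and must free it.
Node* buildWithHeap(const std::unordered_map<char, int>& freqMap) {
    std::priority_queue<Node*, std::vector<Node*>, HigherFreq> heap;
    for (const auto& entry : freqMap) {
        heap.push(new Node{entry.first, entry.second, nullptr, nullptr});
    }
    while (heap.size() > 1) {
        Node* left = heap.top();  heap.pop();   // Smallest frequency
        Node* right = heap.top(); heap.pop();   // Second smallest
        heap.push(new Node{'\0', left->freq + right->freq, left, right});
    }
    return heap.empty() ? nullptr : heap.top();
}

// Total encoded length in bits: sum of (frequency * depth) over all leaves.
long long encodedBits(const Node* node, int depth = 0) {
    if (node == nullptr) return 0;
    if (node->left == nullptr && node->right == nullptr) {
        return static_cast<long long>(node->freq) * depth;
    }
    return encodedBits(node->left, depth + 1) + encodedBits(node->right, depth + 1);
}
```

For the frequencies a:45, b:13, c:12, d:16, e:9, f:5, the merges produce internal nodes 14, 25, 30, 55, and 100, so the root frequency is 100 and the encoded length is 14 + 25 + 30 + 55 + 100 = 224 bits. The vector-based version should reach the same total, even if ties lead to a differently shaped tree.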

Step 5: Generate Huffman Codes

1. Traverse the Huffman tree to assign binary codes to each character.
2. Store these codes in an unordered_map.

        void generateHuffmanCodes(Node* root, string code, unordered_map<char, string>& huffmanCode) {
            if (root == nullptr) return; // Base case: stop recursion if the tree is empty

            // If the node is a leaf node (contains a character)
            if (root->left == nullptr && root->right == nullptr) {
                huffmanCode[root->ch] = code; // Store Huffman code for this character
                return;
            }

            // Recurse into the left child; a left branch appends '0' to the code
            generateHuffmanCodes(_________, code + "0", huffmanCode);
            // Recurse into the right child; a right branch appends '1' to the code
            generateHuffmanCodes(_________, code + "1", huffmanCode);
        }

        // This should be placed in the main function after building the Huffman tree.
        unordered_map<char, string> huffmanCode;
        generateHuffmanCodes(root, "", huffmanCode);
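
Because characters live only in leaf nodes, no Huffman code can be the beginning of another code. This prefix-free property is what lets the decoder split the bit string without separators. The sketch below is one way to sanity-check a generated code table. The helper name isPrefixFree is an assumption made for this illustration.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of the lab skeleton): returns true when
// no code in the table is a prefix of a different character's code.
bool isPrefixFree(const std::unordered_map<char, std::string>& codes) {
    for (const auto& a : codes) {
        for (const auto& b : codes) {
            if (a.first == b.first) continue;  // Skip comparing a code with itself
            // Does b's code start with a's code?
            if (b.second.compare(0, a.second.size(), a.second) == 0) {
                return false;
            }
        }
    }
    return true;
}
```

Calling isPrefixFree(huffmanCode) after generateHuffmanCodes should return true for any input with at least two distinct characters. A false result usually points to a bug in tree construction or traversal.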

Step 6: Encode the Text File

1. Replace each character in the text file with its Huffman code.

        // Encode the text using Huffman Codes
        string encodeText(string text, unordered_map<char, string>& huffmanCode) {
            string encodedText;
            for (char ch : text) {
                _______________________ //Append the Huffman code of each character to the encoded text.
            }
            return encodedText;
        }

2. Write the encoded binary string to a compressed file called compressed.bin.

        ofstream encodedFile;
        encodedFile.open("compressed.bin");
        
        if (!encodedFile) {
            cerr << "Error: Unable to open compressed.bin for writing!" << endl;
            return 1;
        }
        
        encodedFile << ________________;  // Write encoded text
        encodedFile.close();
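
Writing a string of '0' and '1' characters stores each bit as a full one-byte character. The lab only asks for that form, but the sketch below shows, as an optional illustration, how the same bit string could be packed eight bits per byte. The helper name packBits and the choice to fill each byte from its most significant bit are assumptions made for this example.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not required by the lab): packs a string of '0'/'1'
// characters into bytes, most significant bit first. The last byte is
// padded with zero bits when the bit count is not a multiple of 8.
std::vector<unsigned char> packBits(const std::string& bits) {
    std::vector<unsigned char> bytes((bits.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < bits.size(); ++i) {
        if (bits[i] == '1') {
            bytes[i / 8] |= static_cast<unsigned char>(0x80u >> (i % 8));
        }
    }
    return bytes;
}
```

For example, the nine bits 101101011 pack into two bytes, 0xB5 and 0x80. A decoder for packed data would also need the original bit count so that it can ignore the padding bits.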

Step 7: Decode the Huffman encoded text

1. Reconstruct the original text using the Huffman tree.

        // Decode Huffman encoded text
        string decodeText(string encodedText, Node* root) {
            string decodedText;
            Node* current = root; //Start at the root of the Huffman tree
            //Loop through each bit in encodedText
            for (char bit : encodedText) {
        
                // If the bit is '0', move to the left child; if the bit is '1', move to the right child.
                current = (bit == '0') ? ______________ : ______________; //Hint: Use the left child when the bit is '0' and the right child when the bit is '1'.
                
                // Check if we reach a leaf node (character node)
                if (________________________________________________) {  
                    decodedText += current->ch; // Append the decoded character to the result
                    current = root;  // Reset to the root for the next character
                }
            }
            return decodedText; // Return the fully decoded string
        }
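
The lab decodes by walking the tree. As a cross-check, the sketch below swaps in a different technique: it decodes with the code table alone, collecting bits until they match a complete code. This works only because Huffman codes are prefix-free. The helper name decodeWithTable is an assumption made for this illustration.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of the lab skeleton): decodes a bit string
// using a character-to-code table instead of the tree.
std::string decodeWithTable(const std::string& bits,
                            const std::unordered_map<char, std::string>& codes) {
    // Build the reverse lookup: code -> character
    std::unordered_map<std::string, char> reverse;
    for (const auto& entry : codes) {
        reverse[entry.second] = entry.first;
    }

    std::string decoded;
    std::string pending;  // Bits read since the last complete code
    for (char bit : bits) {
        pending += bit;
        auto match = reverse.find(pending);
        if (match != reverse.end()) {
            decoded += match->second;
            pending.clear();
        }
    }
    return decoded;
}
```

With the table a = 0, b = 10, c = 11, the bit string 010110 splits into 0, 10, 11, 0 and decodes to abca. Comparing this result with the output of decodeText is one way to test both functions.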

2. Write the decoded string to a text file called decompressed.txt.

Step 8: Analyze Compression Efficiency

1. Compare the size of input.txt, compressed.bin, and decompressed.txt.
2. Calculate the compression ratio: \( \text{Compression Ratio} = \frac{\text{Compressed Size}}{\text{Original Size}} \times 100\% \)

       The original text file size is obtained by counting the number of characters. Because each character typically occupies one byte,
       this count equals the file size in bytes. Original text file size: text.size() bytes

       The compressed file size is estimated by taking the total number of bits in the encoded text (each character in 'encodedText' represents one bit)
       and dividing by 8, because there are 8 bits in one byte. The result is approximate because the total bit count may not be an exact multiple of 8.
       Compressed file size: encodedText.size() / 8 bytes (approximately)

       What is the size of decompressed.txt?
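
The size estimate and the ratio formula above can be sketched as two small functions. The names bitsToBytes and compressionRatioPercent are assumptions made for this illustration. Unlike the plain division by 8, bitsToBytes rounds up so that a partially filled final byte is counted.

```cpp
#include <cstddef>

// Hypothetical helper: number of whole bytes needed to hold bitCount bits,
// rounding up when bitCount is not a multiple of 8.
std::size_t bitsToBytes(std::size_t bitCount) {
    return (bitCount + 7) / 8;
}

// Hypothetical helper: compressed size as a percentage of the original size.
// Returns 0.0 for an empty original to avoid dividing by zero.
double compressionRatioPercent(std::size_t compressedBytes, std::size_t originalBytes) {
    if (originalBytes == 0) return 0.0;
    return 100.0 * static_cast<double>(compressedBytes)
                 / static_cast<double>(originalBytes);
}
```

With the sizes from the sample output in Section 4, 110 bytes compressed and 212 bytes original, the ratio is 110 / 212 × 100%, or about 51.9%. A lower percentage means better compression.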

3. Requirements

★ (1) Please create a sample text file named input.txt and write a C++ program to read its contents into a string.
★ (2) Please traverse the input string to count the occurrence of each character and store the character-frequency mapping in an unordered_map.
★ (3) Please use a vector to construct the Huffman tree by merging nodes with the lowest frequencies and store the root node of the tree.
★ (4) Please traverse the Huffman tree to assign binary codes to each character and store the codes in an unordered_map.
★ (5) Please replace each character in the text file with its Huffman code and write the encoded binary string to a compressed file named compressed.bin.
★ (6) Please reconstruct the original text using the Huffman tree, and save the decoded text as decompressed.txt.
★ (7) Please ensure that your program compiles and runs without crashing, producing errors, or entering an infinite loop.

4. Sample Input/Output

        Enter text file name: input.txt
        
        Huffman Codes:
        p : 1111101
        f : 111111
        I : 11111001
        v : 11111000
        g : 111101
        m : 111100
        i : 11101
        c : 11100
          : 110
        l : 10111
        q : 1011011
        T : 10110100
        u : 101100
        s : 000
        d : 0010
        n : 0011
        a : 1000
        H : 10110101
        e : 010
        r : 0111
        t : 0110
        b : 10010110
        o : 1010
        z : 1001010
        y : 10010111
        . : 100100
        h : 10011
        
        Encoded Text (first 100 bits): 1011010110110011111111111111110010000011110111001010001011101001111110111011101000110100011010111101...
        
        Original file size: 212 bytes
        
        Compressed file size: 110 bytes (approx)
        Compressed output saved to 'compressed.bin'
        
        Decompressed file size: 212 bytes
        Decompressed output saved to 'decompressed.txt'
        
        Deleting nodes:  s d n e t r a . z b y h o u T H q l   c i m g v I p f

5. Grading Criteria

Total Points: 100
Category	Criteria	Points
Huffman Tree Node (5 points)	The Huffman Tree Node is defined correctly.	5
New Node Function (5 points)	The function to create a new node is defined correctly.	5
Huffman tree (17 points)	The Huffman tree is constructed correctly.	17
Huffman encoding (15 points)	Huffman encoding has been successfully implemented	15
Huffman decoding (15 points)	Huffman decoding has been successfully implemented	15
File I/O operations (15 points)	File I/O operations work correctly	15
Main Function Implementation (10 points)	All functions are correctly invoked in the main function	10
Analyze compression efficiency (5 points)	Accurately compare the sizes of input.txt, compressed.bin, and decompressed.txt	5
Code Readability & Comments (5 points)	Maintain consistent indentation and spacing throughout the code.	2
Code Readability & Comments (5 points)	Include comments that explain the purpose of key sections and logic.	3
Test Cases & Output Accuracy (5 points)	The output for different test cases is correct and consistent.	5
AI disclaimer (3 points)	Please put an AI disclaimer at the top of your source code as comments.	3
Errors	Your program has syntax errors, compile errors, runtime errors, or infinite loops.	-50 points

6. Comments

★ Please add the following comments to the top of your lab4.cpp file:

            /*
                Author: Your Name
                Date: Month/Day/Year
                Lab Purpose: 
                A.I. Disclaimer: please put your A.I. disclaimer here
            */

★ Please add other necessary comments to your source code.

7. Submit

Please submit your Lab 4 source code file: lab4.cpp in D2L!

8. Jupyter Hub for Programming

By logging in to Jupyter Hub with your MTSU credentials, you can create, compile, and run your source code there.

9. Compile and Run the Source Code

Compile: c++ lab4.cpp
Run: ./a.out

Congratulations! You have finished Lab 4!
