Anagramly

Anagramly is a command-line application that groups words in a file that are anagrams of each other. It takes a file path as input and outputs the groups of anagrams to the standard output or a file.

Language & Libraries

The application is written in Java 11.

Libraries:

JUnit: A popular testing framework for Java. It is used to write unit tests for the application.
Picocli: A powerful command-line argument parsing library. It simplifies the handling of command-line options and arguments.
JFiglet: A Java implementation of the FIGlet ASCII art generator. It is used to display the application's logo in ASCII art format.
Tinylog: A lightweight logging framework that provides logging capabilities to the application. It is used to log messages and exceptions during the execution of the application.
Mockito: A mocking framework for unit tests. It is used to create mock objects and stub dependencies during testing.

How to Run

To run the application, you can use the following command:

java -jar anagramly.jar -f <file_path> [options]

<file_path>: The path to the file containing words to process.
[options]: Additional options to control the execution mode and output destination.

Options

-p, --parallel: Run the processing in parallel using multiple threads. (Default: false)
-o, --output: Specify the output file path. If not provided, the output will be written to the standard output.

Building the application

You can also build the application using the following command:

mvn clean compile assembly:single

The assembled jar will be at: target/anagramly-1.0-SNAPSHOT-jar-with-dependencies.jar

Assumptions

Category: File Input

Only one file is to be run with every command. A file has to be passed each time.
The input file contains one word per line.
The words in the input file are ordered by size.
The files may not fit into memory all at once, but all the words of the same size would.
The words can contain any characters (numbers, special characters, etc.).
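
The length-ordering assumption is what makes bounded memory use possible: words can be consumed one length at a time. The following is a minimal sketch of that idea, not the application's actual Reader. The class name LengthBatcher and the method forEachBatch are hypothetical and chosen for illustration only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper, not part of Anagramly: splits a length-ordered
// word list into batches of equal-length words.
final class LengthBatcher {

    // Reads one word per line and hands each run of equal-length words to
    // the handler. Only the current run is held in memory at any time.
    static void forEachBatch(BufferedReader reader, Consumer<List<String>> handler)
            throws IOException {
        List<String> batch = new ArrayList<>();
        int currentLength = -1;
        String line;
        while ((line = reader.readLine()) != null) {
            // A change in length closes the current batch.
            if (line.length() != currentLength && !batch.isEmpty()) {
                handler.accept(batch);
                batch = new ArrayList<>();
            }
            currentLength = line.length();
            batch.add(line);
        }
        // Flush the final batch.
        if (!batch.isEmpty()) {
            handler.accept(batch);
        }
    }

    public static void main(String[] args) throws IOException {
        String input = "a\nb\nab\nba\nabc\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            forEachBatch(reader, batch -> System.out.println(batch));
        }
    }
}
```

Because anagrams always have the same length, each batch can be grouped and written out independently before the next one is read.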

Category: Execution and Output

Exceptions and errors are logged separately in the logs folder.
The order of the groups in the output does not matter.
If the -p or --parallel option is provided, the application will perform parallel processing using multiple threads. Otherwise, it will perform linear processing.
If the -o or --output option is provided, the output will be written to the specified file. Otherwise, it will be written to the standard output.
The application does not rely on any external libraries for computing anagrams, but it may use libraries for other functional aspects, such as handling the command-line interface, testing, and I/O operations.
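
One way the two execution modes could differ is sketched below using only the JDK's stream API. This is an assumption about the general approach, not the application's actual processor classes. The class ParallelGrouping is hypothetical, and for brevity it uses sorted characters as the grouping key rather than the frequency-count key described under "Reasons behind Data Structures Chosen".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical sketch, not part of Anagramly.
final class ParallelGrouping {

    // Canonical key for this sketch: the word's UTF-16 code units in sorted order.
    static String sortedKey(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Groups words by key, either on the calling thread or on the common
    // fork-join pool via a parallel stream and a concurrent collector.
    static Map<String, List<String>> group(List<String> words, boolean parallel) {
        if (parallel) {
            return words.parallelStream()
                    .collect(Collectors.groupingByConcurrent(ParallelGrouping::sortedKey));
        }
        return words.stream()
                .collect(Collectors.groupingBy(ParallelGrouping::sortedKey));
    }

    public static void main(String[] args) {
        List<String> words = List.of("act", "cat", "race", "care", "acre");
        System.out.println(group(words, true).size()); // prints 2
    }
}
```

In the parallel branch the order of words within a group is not guaranteed, which is consistent with the assumption that output order does not matter.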

Big O Analysis

The algorithm used for grouping anagrams has a time complexity of O(N * M), where N is the number of words and M is the maximum length of a word. The space complexity is O(K), where K (<= N) is the largest number of words that share the same length, since only the words of one length are held in memory at a time.

Reasons behind Data Structures Chosen

HashMap: Used in AnagramService to store the groups of anagrams. The key is a string representation of the character frequency counts, and the value is a list of words. This allows the anagrams to be any type of character, including special characters.
Stream: Used in Reader to read lines from the input file. The Stream interface provides a convenient and efficient way to process sequences of elements, in this case, lines of text from the file without having the whole file in memory.
StringBuilder: Used in AnagramService to build the formatted output of groups of anagrams. StringBuilder provides an efficient way to concatenate strings, reducing memory overhead and improving performance when generating the output.
AtomicInteger: Used in AbstractProcessor to keep track of the current word length. The AtomicInteger provides atomic operations for incrementing and retrieving the value, ensuring thread-safe access and modification.
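
The HashMap and StringBuilder roles described above can be illustrated with a minimal sketch. It is not the code of AnagramService: the class FrequencyKeyGrouping, the key format ("codePoint:count" pairs joined by commas), and the comma-separated output format are all assumptions made for this example. The sketch also uses a TreeMap to order the counts, so building a key costs slightly more than linear time in the word length; a fixed-size count array would avoid that for a bounded alphabet.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not part of Anagramly.
final class FrequencyKeyGrouping {

    // Builds a key from character frequencies, e.g. "cat" -> "97:1,99:1,116:1".
    // Counting code points means digits and special characters work as well.
    static String frequencyKey(String word) {
        Map<Integer, Integer> counts = new TreeMap<>();
        word.codePoints().forEach(cp -> counts.merge(cp, 1, Integer::sum));
        StringBuilder key = new StringBuilder();
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (key.length() > 0) {
                key.append(',');
            }
            key.append(entry.getKey()).append(':').append(entry.getValue());
        }
        return key.toString();
    }

    // Words with the same frequency key are anagrams and share a list.
    static Map<String, List<String>> group(List<String> words) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String word : words) {
            groups.computeIfAbsent(frequencyKey(word), k -> new ArrayList<>()).add(word);
        }
        return groups;
    }

    // Formats one group per line (assumed output format for this sketch).
    static String format(Map<String, List<String>> groups) {
        StringBuilder out = new StringBuilder();
        for (List<String> group : groups.values()) {
            out.append(String.join(",", group)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(format(group(List.of("act", "cat", "a!1", "1!a"))));
    }
}
```

Since HashMap iteration order is unspecified, the groups may be printed in any order.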

Testing

Functional testing

Functional testing has been performed to ensure the correctness of the application. The test coverage report generated by JaCoCo shows the coverage of instructions, branches, complexity, lines, methods, and classes.

Code coverage

Overall coverage: 91%

For more details, the test coverage report can be found at target/site/index.html.

You can run the tests and generate the report using the following command: mvn clean test jacoco:report

Performance testing

Here are the results of load testing with different scenarios:

Randomly generated file: 10 million lines with 1 million anagrams
- Linear execution, writing to standard output: 57% cpu, 95.2s total
- Parallel threaded execution, writing to standard output: 394% cpu, 46.5s total
- Linear execution, writing to file: 157% cpu, 24.3s total
- Parallel threaded execution, writing to file: 692% cpu, 7.8s total

The load testing results show that writing to the standard output slows execution considerably and masks part of the gain from parallel processing: the parallel run is roughly 2x faster than the linear run when writing to the standard output, compared with roughly 3x faster when writing to a file. Therefore, if performance is a critical factor, it is recommended to write the output to a file instead of the standard output.

Please note that the load testing results provided are specific to the given scenario and may vary depending on the system configuration and input file characteristics.

Future Improvements and Considerations

Given more time or for future development, the following improvements and considerations could be made:

CI/CD Pipelines: Set up Continuous Integration and Continuous Deployment (CI/CD) pipelines to automate the testing, building, and deployment of the application. This ensures that changes to the codebase are thoroughly tested and deployed to a registry or production environment.
Enhanced Exception Handling: Improve the exception handling mechanism to provide more configuration options.
Additional Functionality: Extend the functionality of the application to support processing multiple files by passing a list of file paths as input. This will enable users to process multiple files in a single execution and obtain the groups of anagrams across all files.
Extend Readers and Writers: The current implementation supports reading from a file and writing to either the standard output or a file. We could consider extending the application to support additional readers and writers, such as reading data from a database or storage bucket, or fetching data from an API. This flexibility allows the application to handle a wider range of input and output sources.
Containerization with Docker: Consider containerizing the application using Docker to ensure consistent development and deployment across different systems. This would also open the possibility of running the app on an orchestration service such as Kubernetes (K8s).

With these improvements, the application can be further enhanced in terms of automation, error handling, functionality, and extensibility.

Thank you for considering this submission. If you have any further questions or need additional information, please feel free to reach out.
