Lead Software Engineer
Strong foundation in algorithms and problem solving. Solid understanding of system architectures, performance, and scalability. Good listener, efficient team player, quick deliverer; be able to juggle multiple tasks within the constraints of resources and timelines. Familiar with leading Agile team and managing business aspects of software development.
Master of Computer Science from National University of Singapore (NUS GSA Scholar; Advanced Algorithms: A, Distributed Computing: A+, Advanced Operating Systems: A+) and B.Eng of Computer Engineering from Vietnam National University HCMC (First Class Honors).
Creator of project awesome-scalability (popular GitHub repository with more than 30k stars).
My work and my graduate studies are focused on distributed systems, from the large one (back-end, data pipeline, cluster) to the small one (network-on-chip simulator). Typical projects in which I am the key member are:
Data Platform for Streamed Log Management
The team has built a platform to automate data collection process as well as to manage and process the collected data. I am responsible for maintaining and improving this platform, with a heavy focus on data management, cloud migration, and workflow automation.
As the platform consists of a mix of complicated software with different components and tiers interact with each other, I had to consider various kinds of trade-off between storage, computation, and network. I leveraged ELK stack (Elasticsearch, Logstash, and Kibana) for powering search and analytics of various kinds of transport logs in different file types. By providing computing, storage, and network infrastructure that can run high performance applications, the cloud-based migration solution could provide a significant speed-up ratio. In addition to this, as container-based virtualization techniques are becoming an important option due to their lightweight operation and cost-effective scalability, I implemented the containerization technique Docker to automate our data processing workflow. These improvements made our data-intensive platform operate better with the growth of data collected everyday and the increase of workload.
Resource Allocation and Scheduling
High-load data processing jobs constitute a significant portion of load on multi-tenant clusters and these jobs are expected to be completed before a predetermined deadline. However, predicting the amount of resources needed to achieve this goal is troublesome, mainly because of the unpredictability in resource availability and application behavior. Common approaches typically mitigate this problem by over-allocation of resources, in turn resulting in poor cluster utilization.
I developed a framework to resolve this tension between predictability and utilization. The framework was built on three key modules: a learning module, a scheduling module, and a dynamic reprovisioning module. First, the learning module automatically derive Service Level Objectives (SLO) and job resource models from log data. Second, the scheduling module enforce SLOs by isolating jobs from sharing-induced performance variability. In this scheduling module, the online arriving jobs with different periods need to be packed efficiently. To achieve this goal, I designed and implemented a online packing algorithm for periodic jobs. Finally, the dynamic reprovisioning module track real-time progress of a job and adapts allocation based on its progress to meet the job's objectives. I validated my design and implementation against historical logs from a 500 node cluster, and showed that my framework could reduce the amount of deadline violations by 5 times to 10 times, while lowering cluster footprint by 10% to 30%.
MapReduce is a prevalent distributed and parallel computing paradigm for large-scale data-intensive applications. However, tuning the MapReduce job parameters is a challenging task as the parameter configuration space is large with more than 70 parameters, and traditional offline tuning system requires multiple test runs to identify a desirable configuration.
I implemented an online performance tuning system to mitigated the aforementioned problems. The system monitors job performance counters and system logs dynamically, tunes the parameters according to the collected statistics, and changes the configuration dynamically during job execution. Thus, instead of having the same configuration for all tasks, each task will have a different configuration. I implemented the system based on YARN, a second-generation Hadoop implementation. The parameters on which I focused are the container CPU and memory allocation for every map and reduce task, the memory allocation of sort buffer in the map stage, the shuffle buffer and reduce buffer in the reduce stage. Moreover, I designed a hill climbing algorithm that searches the configuration space systematically and converges to a near-optimal parameter configurations. I evaluated this dynamic online tuning system on several benchmarks (text search, wordcount, bigram, etc), comparing the results against offline tuning and default parameter configuration. Experiment results showed that my mechanism can improve performance by 20%.
Heuristic Search and Optimization
Neuromorphic processors with a manycore architecture are emerging as promising alternatives to traditional computing platforms for brain inspired computing. The increasing trend of employing a large number of neurons in neuromorphic networks such as spiking neural networks has made it necessary for chips to contain a few hundred to thousands of cores. Due to the large core count and the exponential number of mappings, it is imperative to study how large spiking neural networks can be mapped efficiently onto a neuromorphic processor so as to achieve optimal system performance.
I developed a network-on-chip simulator and proposed a heuristic-based optimization framework framework for mapping logical cores onto physical cores to achieve an optimal clock ratio metric. The framework can handle different types of network topologies which are employed for inter-core communication. As finding an optimal mapping is an NP-hard problem, I used a heuristic algorithm to rapidly search through the design space to find near-optimal solutions, and where possible, compare them with the optimal solution obtained by a branch and bound method. The heuristic algorithm based on the Kernighan-Lin (KL) partitioning algorithm. The framework consists of an "initial mapping" stage followed by "partitioning" and "customization". In the "initial mapping" stage, an initial throughput-aware mapping is generated. By virtue of the KL partitioning algorithm, the initial mapping is then recursively bi-partitioned based on the "partitioning" stage. Between two partitions, I applied a "customization" stage to find a mapping with lowest possible clock ratio. The "partitioning" and "customization" stages are repeated reaching the leaf node. In the experiments, I showed that my heuristic is able to achieve better solution with lower runtime compared to existing techniques. I also demonstrated the scalability of my solution on large networks containing more than 3500 cores.
Please connect to me on LinkedIn for detailed CV! Thank you for visiting my homepage!