Note: I am open to full-time Software/Security Engineering positions.

My name is Himanshu Goyal (How to pronounce?). I am a graduate student at Georgia Tech pursuing a Master of Science in Computer Science with a specialization in Computing Systems. Before this, I proudly earned a Dual Degree in Computer Science (B.S./M.S.) from the Indian Institute of Technology (IIT) Bhubaneswar. My thesis centered on the development of large-scale smart systems, with a strong emphasis on Trust, Security, and Privacy guarantees, and was conducted under the supervision of Prof. Sudipta Saha. I am also an active member of the Decentralized and Smart Systems Research Group (DSSRG) at IIT Bhubaneswar.

As a skilled software engineer, my passion lies in developing large-scale secure software systems that leverage robust systems architecture, modern cryptographic methodologies, and essential open-source tools. I have worked across several areas, including Systems & Networking, Blockchain, Trustworthy Distributed Computing, Zero-Knowledge Proofs (ZKPs), Privacy-Preserving Machine Learning (PPML), and Deep Learning. I truly enjoy applying modern cryptographic techniques to building secure software systems.

I maintain a list of cryptographic resources under the Crypto Resources tab. It mostly contains references to security-related courses taught at several universities, along with some additional useful information. If you would like to contribute, you are welcome to contact me. I also occasionally blog to distill my understanding from the readings I do. In my spare time, you can find me reading or lurking on Reddit and Twitter. I am also an ardent cricket fan and love talking about it.

CV / Resume: [pdf]
Last Updated: Sept 2023

Email ID: hgoyal33@gatech.edu

Updates

May 2023: Started an in-person internship at Galois in their Portland office.
Jan 2023: Our work titled ZoneSync: Real-Time Identification of Zones in IoT-Edge got accepted at IEEE COMSNETS 2023.
Aug 2022: Our work titled ReLI: Real-Time Lightweight Byzantine Consensus in Low-Power IoT-Systems got accepted at IEEE CNSM 2022.
Apr 2022: Our work titled Multi-Party Computation in IoT for Privacy-Preservation got accepted at IEEE ICDCS 2022.
Jan 2021: Started a remote internship with Prof. Arpita Patra at the Cryptography and Information Security Lab, IISc Bangalore, on Privacy-Preserving Machine Learning.
May 2020: Started a remote internship with Chayan Sarkar at TCS Research & Innovation on developing low-resource speech-to-text translation systems.
Apr 2020: Summer internship with the ENCRYPTO Group at TU Darmstadt, Germany (cancelled due to the COVID-19 outbreak).
Mar 2020: Participated in Smart India Hackathon (SIH) 2020.
Dec 2019: Secured Bronze medal at 8th Inter-IIT Tech meet in Outreach Exercise for New Technology Ideas in TV Audience Measurement.
May 2019: Started a summer internship at CNERG, IIT Kharagpur, on characterization of workloads in multi-tier cloud applications under the guidance of Sandeep Chakraborty.
Sep 2018: Started Computer Science Coursework
May 2018: Transferred from the 4-year Metallurgical programme to the 5-year Dual Degree (B.Tech + M.Tech) in Computer Science Engineering (the only student in the entire batch to do so).
Jul 2017: Passed both the JEE Main and JEE Advanced examinations and secured admission to IIT Bhubaneswar's 4-year undergraduate programme in Metallurgical and Materials Engineering.
May 2016: Received a direct admission offer from BITS Pilani on the basis of excellent Intermediate academic performance at Birla School Pilani.



Federated Learning from an Adversarial View

Background and Rationale

Mobile phones, wearable devices, voice assistants, and autonomous vehicles are just a few of the new distributed networks generating a wealth of data each day, where each device's data follows a different statistical distribution (i.e., the data is non-IID). Due to the growing computational power of these devices, coupled with concerns about transmitting private information (which, if leaked, would allow much about the device and the user's behavior to be inferred), it is increasingly attractive to store data locally and push computation closer to the edge devices. Learning becomes even more challenging when the data contains sensitive information such as location, health, and other ambient signals, since leakage of such information becomes more damaging over time and can lead to a bad user experience.

Federated learning has emerged as a new training paradigm for such settings. Federated learning (aka collaborative learning) is a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging those samples. It stands in contrast to traditional centralized machine learning, where all data samples are uploaded to one server, as well as to more classical decentralized approaches, which assume that local data samples are identically distributed. Following [1], FL is privacy-preserving model training in heterogeneous, distributed networks.

FL Applications

• Auto-suggestion applications (e.g., next-word prediction on mobile keyboards).



• Healthcare: using federated learning to provide better care by drawing on relevant data from multiple medical organisations, which is difficult to achieve through centralised learning.



Standard Federated Training Algorithm:

Step I. The master device sends the current global model parameters to all worker devices.

Step II. The worker devices update their local model parameters, in parallel, using the current global model parameters and their local training datasets. In particular, each worker device essentially minimizes its local loss function using stochastic gradient descent (SGD). Note that a worker device may apply multiple rounds of SGD to update its local model. After updating their local models, the worker devices send them to the master device.

Step III. The master device aggregates the local models from the worker devices into a new global model according to a certain aggregation rule, and the process repeats from Step I.
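To make the loop concrete, here is a minimal NumPy sketch of these three steps. The helper names and the least-squares loss are my own illustrative choices, and the weighted averaging is a FedAvg-style rule rather than the aggregation used by any particular system:

```python
import numpy as np

def local_sgd(global_w, X, y, lr=0.1, epochs=5):
    """Step II: a worker refines the global model on its local data
    with plain SGD on a least-squares loss (illustrative only)."""
    w = global_w.copy()
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            grad = (xi @ w - yi) * xi      # gradient of 1/2 (x.w - y)^2
            w -= lr * grad
    return w

def federated_round(global_w, workers):
    """Steps I-III: broadcast, local training, weighted aggregation."""
    updates, sizes = [], []
    for X, y in workers:                          # Step I: send global_w to each worker
        updates.append(local_sgd(global_w, X, y)) # Step II: local SGD
        sizes.append(len(X))
    # Step III: FedAvg-style aggregation, weighted by local dataset size
    return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))

# toy usage: 3 workers, 5 features, 10 federated rounds
rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
workers = []
for _ in range(3):
    X = rng.normal(size=(50, 5))
    workers.append((X, X @ true_w + 0.01 * rng.normal(size=50)))
w = np.zeros(5)
for _ in range(10):
    w = federated_round(w, workers)
```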

A brief literature review of attacks on FL

Looking closely at the federated training algorithm, none of the three steps uses an individual's raw data directly, which motivates participating users to contribute their sensitive data for useful learning. At first glance, it also provides some privacy guarantee: the raw data never leaves the user's device, and only updates to the model (e.g., gradient updates) are sent to the central server, unlike in the centralized setting. On the contrary, work by Shokri et al. [4] has shown that it is possible to construct scenarios in which information about the raw data is leaked from a client to the server; for example, knowing the previous model and the gradient update from a user can allow one to infer a training example held by that user, which is popularly known as a membership-inference attack in the ML community.
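As a toy illustration of the membership-inference idea, here is a simple confidence-threshold version, not the full shadow-model attack of [4]; the dataset, model, and threshold are made up for the sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Flag a sample as a training-set member if the model's confidence in its
# true label is suspiciously high (members are typically fit more tightly).
# The gap is small for a well-regularized linear model; real attacks target
# overfitted models and use shadow models as in [4].
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)
X_train, y_train = X[:200], y[:200]   # members
X_out, y_out = X[200:], y[200:]       # non-members

model = LogisticRegression().fit(X_train, y_train)

def guess_member(model, x, label, threshold=0.9):
    """Return True if the model's confidence in the true label exceeds the threshold."""
    conf = model.predict_proba(x.reshape(1, -1))[0, label]
    return conf > threshold

members_flagged = np.mean([guess_member(model, x, l) for x, l in zip(X_train, y_train)])
nonmembers_flagged = np.mean([guess_member(model, x, l) for x, l in zip(X_out, y_out)])
print(f"flagged members: {members_flagged:.2f}, flagged non-members: {nonmembers_flagged:.2f}")
```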

To provide better user privacy, several cryptographic and data-perturbation techniques such as Homomorphic Encryption, Secure Multi-Party Computation, and Differential Privacy (or combinations of them) have been proposed, and they do provide reliable theoretical privacy guarantees. Bargav et al. [3] presented a better lower bound for Differential Privacy than other state-of-the-art bounds by perturbing the gradient at the MPC (i.e., aggregation) side instead of adding noise at the user level, for logistic regression. However, that work did not show any bounds for the deep learning algorithms that dominate most learning tasks nowadays. Bargav et al. [2] also evaluated different variants of DP, such as naive DP and Rényi DP, on several well-known datasets trained with both traditional ML and neural network algorithms, and found that the privacy loss is directly proportional to the number of epochs. The Secure Aggregation protocol from Bonawitz et al. [5] has strong privacy guarantees when aggregating client updates, and it is tailored to the setting of federated learning: for example, it is robust to clients dropping out during the execution (a common feature of cross-device FL) and scales to a large number of parties and vector lengths. However, this approach has several limitations: (a) it assumes a semi-honest server (only in the private-key infrastructure phase), (b) it allows the server to see the per-round aggregates (which may still leak information), (c) it is not efficient for sparse vector aggregation, and (d) it lacks the ability to enforce well-formedness of client inputs. Furthermore, [1] also discusses FHE, which suits computationally heavy edge devices but fails badly where the end devices are mobile phones or IoT devices, which of course is the typical case in federated learning.
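The masking idea at the heart of secure aggregation can be sketched in a few lines. This is a toy version: the real protocol of [5] uses pairwise key agreement, finite-field arithmetic, and secret-shared masks to survive dropouts, all of which are omitted here:

```python
import numpy as np

def masked_updates(updates, seed=0):
    """Each pair of clients (i, j) shares a random mask; client i adds it and
    client j subtracts it. Individual uploads look random to the server, but
    the masks cancel when the server sums them (toy secure aggregation)."""
    rng = np.random.default_rng(seed)   # stands in for pairwise key agreement
    n, d = len(updates), len(updates[0])
    masked = [u.astype(float) for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.normal(size=d)
            masked[i] += mask
            masked[j] -= mask
    return masked

updates = [np.ones(4) * k for k in range(1, 4)]   # clients' true updates
uploads = masked_updates(updates)                 # what the server actually sees
print(np.sum(uploads, axis=0))                    # equals the sum of true updates: [6. 6. 6. 6.]
```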

In the distributed datacenter and centralized settings, there has been a wide variety of work on attacks and defenses for various targeted and untargeted attacks, notably model-update poisoning and data poisoning (for example, an adversary might attempt to prevent a model from being learned at all). In these attacks, the adversary (or adversaries) directly controls some number of clients and can thus directly manipulate the reports sent to the service provider in order to bias the learned model towards their objective. They may also be able to tailor their output to have similar variance and magnitude as the correct model updates, making them difficult to detect. Several Byzantine-resilient defense mechanisms have been proposed by the machine learning and blockchain communities; one of them is to replace the averaging step on the server with median-based aggregators. Fang et al. [6] performed successful local model poisoning attacks that can substantially increase the error rates of models learned through federated learning schemes claimed to be robust against Byzantine failures of some client devices.
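For instance, the median-based aggregation mentioned above can be sketched as a coordinate-wise median over client updates. The data here is synthetic, and this ignores the refinements in actual Byzantine-robust rules:

```python
import numpy as np

def coordinate_wise_median(updates):
    """Byzantine-robust aggregation: take the median of each model coordinate
    across clients instead of the mean, so a few extreme (malicious) updates
    cannot drag the aggregate arbitrarily far."""
    return np.median(np.stack(updates), axis=0)

honest = [np.array([1.0, 2.0, 3.0]) + 0.1 * np.random.default_rng(i).normal(size=3)
          for i in range(4)]
malicious = [np.array([100.0, -100.0, 100.0])]        # one poisoned update
print(np.mean(np.stack(honest + malicious), axis=0))  # the mean is badly skewed
print(coordinate_wise_median(honest + malicious))     # the median stays near the honest updates
```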

Relationship between model update poisoning and data poisoning

Intuitively, the relation between model-update poisoning and data poisoning should depend on how over-parameterized the model is with respect to the data. The study by Fang et al. [6] proposes two methods to defend against local model poisoning attacks. Their results show that only in some cases can one of the two methods effectively defend against these attacks, and the defenses also make the global model slower to learn and adapt to new data, as that data may be identified as coming from potentially malicious local models. Thus, the proposed defenses are not effective enough in every case, highlighting the need for new defenses against model poisoning attacks on federated learning. Existing defenses against backdoor attacks either require a careful examination of the training data or full control of the training process at the server (as in the centralised setting), neither of which may hold in the federated learning setting. Zero-knowledge proofs could be used to ensure that users submit updates that follow the pre-agreed protocol. One interesting line of study would be to quantify the gap between these two types of attacks and relate this gap to the relative strength of an adversary operating under each attack model. For example, the maximum number of clients that can perform data poisoning attacks may be much higher than the number that can perform model-update poisoning attacks, especially in cross-device settings. Thus, understanding the relation between these two attack types, especially as it relates to the number of adversarial clients, would greatly improve our understanding of the threat landscape in federated learning.

Motivation while writing this post

I had the following motivation while writing this post: given n worker devices and an honest-but-curious server, formulate a protocol that can provide rigorous privacy guarantees in the following scenarios:

  • When m out of the n clients become adversaries, or an adversary controls m clients, in the cross-device or cross-silo federated setting, aiming to perform attacks such as data poisoning, model-update poisoning, and backdoor attacks.
  • Robustness of the same protocol within the federated training algorithm when a certain number of trusted users drop out due to participation constraints, leaving the network with a larger proportion of malicious clients.

References

  1. David Evans, Rachel Cummings, Martin Jaggi, et al. Advances and Open Problems in Federated Learning. arXiv:1912.04977.
  2. Bargav Jayaraman and David Evans. Evaluating Differentially Private Machine Learning in Practice. 28th USENIX Security Symposium, 2019.
  3. Bargav Jayaraman, Lingxiao Wang, David Evans, and Quanquan Gu. Distributed Learning without Distress: Privacy-Preserving Empirical Risk Minimization. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018).
  4. Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership Inference Attacks Against Machine Learning Models. IEEE Symposium on Security and Privacy (S&P), Oakland, 2017.
  5. Keith Bonawitz, H. Brendan McMahan, et al. (Google), and Antonio Marcedone (Cornell Tech). Practical Secure Aggregation for Privacy-Preserving Machine Learning. arXiv:1611.04482.
  6. Minghong Fang, Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong. Local Model Poisoning Attacks to Byzantine-Robust Federated Learning. USENIX Security Symposium, 2020.

Privacy Preserving Machine Learning

Introduction

Arthur Samuel, a pioneer in the field of computer gaming and artificial intelligence, described machine learning as a field of study that gives computers the ability to learn without being explicitly programmed. ML is being increasingly utilized for a variety of applications, from recommending new movies to programming self-driving cars. Furthermore, owing to its breakthroughs in computer vision, speech recognition, ads, social feeds, auto-complete, and more, Deep Learning (DL) has gained huge attention and has helped the human race in many respects. With the advancement of technology, we can embed these sophisticated applications into the mobile devices and computers with which individuals spend most of their time. Activities like search queries, purchase transactions, the videos we watch, and our movie preferences are a few of the types of information that are collected and stored on a regular basis, both locally (e.g., by the search engine) and on remote storage (service providers' servers), for further processing. The personal data generated through these activities is used by ML applications to give us a better user experience. Such private data is uploaded to centralized locations in clear text, where ML algorithms extract patterns from it and update models accordingly (federated learning). The sensitivity of such data increases its risk of being maliciously used by insiders as well as outsiders. Additionally, it is possible to gain insights into private datasets even if the data is anonymized and the ML models are inaccessible.

A machine learning process consists of two phases: the training phase and the evaluation phase. In the training phase, a model or classifier is built on a possibly large dataset. In the second phase, the trained model is queried and, based on its results, further updated for better accuracy at the next time-step. For every ML task, three different roles are possible: the input party (data owners or contributors), the computation party, and the results party. In such systems, the data owner(s) send their data to the computation party, which performs the required ML task and delivers the output to the results party. Such output could be an ML model that the results party can use for testing new samples. In other cases, the computation party might keep the ML model, perform the task of testing new samples submitted by the results party, and return the test results. If all three roles are played by the same entity, then privacy is naturally preserved; however, when these roles are distributed across two or more entities, privacy-enhancing technologies are needed.

Review of Data Privacy in 2018-19

2018 was a breakout year for privacy-preserving machine learning (PPML). That year's news cycle saw several major stories surface around data privacy, making it the most significant year for privacy since the Snowden leaks in 2013. Data privacy impacts our politics, security, businesses, relationships, health, and finances.

Figure 1: Google Trends for “data privacy”, 2013 — 2019



ML-as-a-Service(MLaaS)

In this technology-driven world, most organizations have deployed their trained ML and DL models in the cloud for better reliability and integrity. For a query from a data owner, the cloud provider acts as both the computation party and the results party, which can adversely affect the owner's privacy. In fact, with all of the data that is collected from individuals around the world on a daily basis, data owners might not be aware of how the data collected from them is being used (or misused), and in many cases are not even aware that some data types are being collected.

Figure 2: Most Machine learning today is done in the cloud



Why do we need privacy in ML algorithms when they already involve complex mathematics that makes them hard to decode?

About two years ago, Google researchers introduced a skin-disease classifier at a global level. They developed an application that takes photos of skin and runs them through a CNN (Convolutional Neural Network) to suggest whether you should visit a dermatologist. The prediction accuracy was observed to be at an expert level. When users query the application, they do not make any changes (e.g., noise perturbation) to the input skin image; the service provider then takes it for further processing.

Figure 3: Skin Cancer Classifier



The uploaded skin image itself may contain sensitive information about an individual, and the resulting information leakage could be harmful in some cases. Under such circumstances, the ML service could potentially become unreliable. Therefore, we need privacy at both the input and output ends when we do not have access to the model.

Note: This is just one example off the top of my head; there are hundreds of others that directly affect user privacy.

Some threats with Machine Learning

  • Membership Inference Attack

    Given an ML model and a sample (the adversary's knowledge), membership inference attacks aim to determine whether the sample was a member of the training set used to build the model (the adversary's target). Such an attack could be used to learn whether a certain individual's record was used to train an ML model associated with a specific disease. These attacks exploit differences in the model's predictions on samples that were in the training set versus those that were not.

  • De-anonymisation Attacks (Re-Identification)

    De-anonymisation attacks identify a target sample within an anonymized dataset. Example: using IMDb background knowledge to identify the Netflix records of known users. This incident demonstrates that anonymization cannot reliably protect the privacy of individuals in the face of strong adversaries.

  • Model Inversion Attack

    Some ML algorithms produce models in which explicit feature vectors are not stored, e.g., ridge regression or neural networks. Hence, the adversary's knowledge is limited to either (a) an ML model with no stored feature vectors (white-box access) or (b) only the responses returned by the computation party when the results party submits new testing samples (black-box access). The adversary's target is to create feature vectors that resemble those used to train the ML model by utilizing the responses received from it (a minimal sketch appears after this list).

  • Re-Construction Attack

    In this case, the adversary's goal is to reconstruct the raw private data using their knowledge of the feature vectors. Reconstruction attacks require white-box access to the ML model, i.e., the feature vectors in the model must be known. Such attacks become possible when the feature vectors used during training were not deleted after building the desired ML model; in fact, some ML algorithms such as SVM or kNN store feature vectors in the model itself. The final aim of these attacks is to mislead an ML system into thinking the reconstructed raw data belongs to a certain data owner. Examples: fingerprint reconstruction, mobile device touch-gesture reconstruction.
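To make the model inversion idea above concrete, here is a toy white-box sketch against a logistic-regression model: gradient ascent on the input to maximize the model's confidence in a target class, yielding a "representative" feature vector. The weights and step sizes are made up, and real attacks on neural networks are considerably more involved:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def invert_model(w, b, target=1, steps=200, lr=0.5):
    """Toy model inversion against a logistic-regression model (white-box):
    run gradient ascent on the *input* to maximize the model's confidence in
    the target class, recovering a representative feature vector."""
    x = np.zeros_like(w)
    for _ in range(steps):
        p = sigmoid(w @ x + b)
        grad = (target - p) * w   # gradient of log-confidence in the target class w.r.t. x
        x += lr * grad
    return x

# hypothetical model whose weights point toward the class-1 "prototype"
w, b = np.array([2.0, -1.0, 0.5]), -0.3
print(invert_model(w, b))         # reconstruction aligned with the weight vector
```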

Figure 4: Threats with ML Models



Some methods for Private ML

  • Cryptographic approaches: When a certain ML application requires data from multiple input parties, cryptographic protocols can be used to perform ML training/testing on encrypted data. In many of these techniques, achieving better efficiency involves having data owners contribute their encrypted data to the computation servers; this can also be viewed as training the model on ciphertext.
    • Homomorphic encryption
    • Secure Multiparty Computation (secret sharing)
    • Garbled circuits, etc.
  • Perturbation techniques: feeding perturbed data, obtained by adding noise to the actual raw data, which yields similar outputs as before without revealing anything about the actual information (a toy input-perturbation sketch follows this list).
    • Differential Privacy
      • Data owners' side, i.e., input perturbation
      • Computation party's side, i.e., algorithmic perturbation
      • Results party's side, i.e., output perturbation
    • Dimensionality reduction
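As a small illustration of input perturbation with differential privacy, here is a sketch of the Laplace mechanism applied by each data owner before sharing; the sensitivity, epsilon, and data are illustrative choices:

```python
import numpy as np

def laplace_perturb(values, sensitivity, epsilon, rng=None):
    """Input perturbation with the Laplace mechanism: each data owner adds
    Laplace(sensitivity/epsilon) noise to their (clipped) value before sharing,
    so the shared value satisfies epsilon-differential privacy."""
    rng = rng or np.random.default_rng()
    return values + rng.laplace(scale=sensitivity / epsilon, size=np.shape(values))

# toy usage: 1000 owners report an age clipped to [0, 100]; the analyst only sees noisy values
rng = np.random.default_rng(42)
ages = np.clip(rng.normal(40, 12, size=1000), 0, 100)
noisy = laplace_perturb(ages, sensitivity=100, epsilon=1.0, rng=rng)
print(round(ages.mean(), 2), round(noisy.mean(), 2))   # the mean is roughly preserved; individual ages are hidden
```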