Shushan Arakelyan

PhD Candidate

University of Southern California

Bio

I’m a final-year PhD candidate at the University of Southern California. I am part of the Intelligence and Knowledge Discovery Lab (INK Lab), supervised by Prof. Xiang Ren. In the past, I was an intern at Microsoft Research and Google.

I am currently on the industry job market. My research experience is in large language models and their generalization, in particular in applications to software and code. I am open to jobs on many topics in NLP/LLMs, with a specific interest in ensuring the robustness, reliability, and safety of deployed NLP models.

Interests

AI for Code
Large Language Models and Generalization
Machine Learning

Education

PhD in Computer Science, 2017-present
University of Southern California
MPhil in Advanced Computer Science, 2016
University of Cambridge
BSc in Applied Mathematics and Informatics, 2014
Russian-Armenian (Slavonic) University

Featured Publications

Shushan Arakelyan, Rocktim Jyoti Das, Yi Mao, Xiang Ren

March, 2023 In EMNLP'23

Exploring Distributional Shifts in Large Language Models for Code Analysis

We systematically study the capacity of two large language models for code - CodeT5 and Codex - to generalize to out-of-domain data. In this study, we consider two fundamental applications: code summarization and code generation. We split data into domains following its natural boundaries: by organization, by project, and by module within a software project. This makes recognition of in-domain vs out-of-domain data at the time of deployment trivial. We establish that samples from each new domain present both models with a significant challenge of distribution shift. We study how well different established methods can adapt models to better generalize to new domains. Our experiments show that while multitask learning alone is a reasonable baseline, combining it with few-shot finetuning on examples retrieved from training data can achieve very strong performance. In fact, according to our experiments, this solution can outperform direct finetuning in very low-data scenarios. Finally, we consider variations of this approach to create a more broadly applicable method to adapt to multiple domains at once. We find that in the case of code generation, a model adapted to multiple domains simultaneously performs on par with those adapted to each domain individually.

Shushan Arakelyan, Anna Hakhverdyan, Miltiadis Allamanis, Luis Garcia, Christophe Hauser, Xiang Ren

December, 2022 In NeurIPS'22

NS3: Neuro-symbolic Semantic Code Search

Semantic code search is the task of retrieving a code snippet given a textual description of its functionality. Recent work has been focused on using similarity metrics between neural embeddings of text and code. However, current language models are known to struggle with longer, compositional sentences, and multi-step reasoning. To overcome this limitation, we propose supplementing the query sentence with a layout of its semantic structure. The semantic layout is used to break down the final reasoning decision into a series of lower-level decisions. We use a Neural Module Network architecture to implement this idea. We compare our model - NS3 (Neuro-Symbolic Semantic Search) - to a number of baselines, including state-of-the-art semantic code retrieval methods, such as CodeBERT, CuBERT and GraphCodeBERT, and evaluate on two datasets - Code Search Net (CSN) and Code Search and Question Answering (CoSQA). On these datasets, we demonstrate that our approach results in higher performance. We also perform additional studies to show the effectiveness of our modular design when handling compositional queries.

Shushan Arakelyan, Sima Arasteh, Christophe Hauser, Erik Kline, Aram Galstyan

May, 2021

Bin2vec: learning representations of binary executable programs for security tasks

Tackling binary program analysis problems has traditionally implied manually defining rules and heuristics, a tedious and time-consuming task for human analysts. In order to improve automation and scalability, we propose an alternative direction based on distributed representations of binary programs with applicability to a number of downstream tasks. We introduce Bin2vec, a new approach leveraging Graph Convolutional Networks (GCN) along with computational program graphs in order to learn a high-dimensional representation of binary executable programs. We demonstrate the versatility of this approach by using our representations to solve two semantically different binary analysis tasks – functional algorithm classification and vulnerability discovery. We compare the proposed approach to our own strong baseline as well as published results, and demonstrate improvement over state-of-the-art methods for both tasks. We evaluated Bin2vec on 49191 binaries for the functional algorithm classification task, and on 30 different CWE-IDs including at least 100 CVE entries each for the vulnerability discovery task. We set a new state-of-the-art result by reducing the classification error by 40% compared to the source-code-based inst2vec approach, while working on binary code. For almost every vulnerability class in our dataset, our prediction accuracy is over 80% (and over 90% in multiple classes).

Experience

Student Researcher

Microsoft Research

September 2022 – March 2023 Seattle, WA, USA/Remote

Worked on large language models for code and their generalization. The resulting paper is currently under review.

Software Engineering Intern

Google

May 2018 – August 2018 London, UK

Worked on Google Assistant. Researched and developed methods for descriptions of user profiles for use with recommender systems. The resulting paper appeared in SIGIR'19.

Machine Learning Research Engineer

Qube

October 2016 – August 2017 Yerevan, Armenia

Researched, designed, and prototyped state-of-the-art algorithms for computer graphics, image processing, 3D optimization, and face reenactment from RGB input.

Software Engineering Intern

Google

June 2015 – September 2015 New York, NY, USA

Worked on Google Cloud Logging Infrastructure.

Software Engineer

OneMarketData

April 2013 – May 2015 Yerevan, Armenia

Worked on the company’s OneTick line.

Junior C++ Engineer

Questrade

August 2012 – March 2013 Yerevan, Armenia

Worked on the company’s trading platform front end, a transaction processing application.

Contact

first name at isi dot edu
DM Me