What is CS+?
CS+ is a ten-week summer program that gives Duke undergraduates the opportunity to get involved in computer science research projects with faculty in a fast-paced but supportive community environment. Students participate in teams of 3-4 and are jointly mentored by a faculty project lead and a graduate student mentor. The experience is meant as a rich entry point into computer science research and applications beyond the classroom.
- Only students enrolled at Duke University are eligible to apply.
- The program this summer will run from Monday, May 20, 2024 through Friday, July 26, 2024.
- The program is held in-person, following Duke guidelines for summer programs. There is no virtual option available, and students must reside in Durham during the summer (on or off campus) to participate.
- Students participate in this program full-time (40 hours/week). You cannot take summer courses or do other internships/fellowships while doing CS+.
- Participants receive a stipend of $5,000 to cover expenses.
- Applications received by Friday, February 16 will receive full consideration; applications received after that date will be considered only if positions remain unfilled.
If you have questions about the program, please email email@example.com.
CS+ Project Offerings Summer 2024
Leads: Danyang Zhuo and Anru Zhang
Description: Hospitals accumulate a great amount of patient-level health data, which are securely stored in privacy-preserving databases. Researchers utilize these data for various analytical purposes: to understand public health trends, track disease spread, and establish connections between symptoms and diseases. However, the manual effort required to acquire the data can be cumbersome. Our project aims to revolutionize this process by harnessing the capabilities of advanced large language models. By automating the data analytics pipeline for health data, we aim to drastically reduce manual labor for researchers and pave new ways for scientific discovery. This initiative invites students to collaborate with faculty mentors to enhance and refine our existing research prototype for data acquisition and analysis using large language models.
Goals/Deliverables: Students will:
- Deliver a running system that automates a medical data analytics pipeline.
- Learn how to use large language models to build an end-to-end AI application.
- Learn the pros/cons of fine-tuning and prompt engineering.
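To give a feel for the kind of system this project involves, here is a minimal sketch (ours, not the project's actual prototype) of an LLM-driven analytics pipeline: a natural-language question is translated into SQL, which is then executed against a database. The model call is stubbed out with a hard-coded translation, and the patient data below is entirely synthetic.

```python
# Minimal sketch of an LLM-to-SQL analytics pipeline (illustrative only).
import sqlite3

def generate_sql(question: str) -> str:
    """Stand-in for a large language model call. A real pipeline would
    prompt an LLM with the database schema and the question; here we
    hard-code one translation to show the shape of the interface."""
    if "flu" in question.lower():
        return "SELECT COUNT(*) FROM visits WHERE diagnosis = 'influenza'"
    raise NotImplementedError("unhandled question")

# Synthetic toy data -- not real patient records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (patient_id INTEGER, diagnosis TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)",
                 [(1, "influenza"), (2, "asthma"), (3, "influenza")])

sql = generate_sql("How many flu cases have we seen?")
(count,) = conn.execute(sql).fetchone()
print(count)  # 2 influenza visits in the toy data
```

The interesting engineering in the real project lies in the stubbed step: getting a model to produce correct, safe queries over a real schema.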
Student Background/Prerequisites: Proficient Python programming. Understanding of deep learning and natural language processing.
Lead: Kamesh Munagala
Description: How should the map of a state be partitioned into electoral districts? We will explore algorithms for this problem.
Goals/Deliverables: Algorithmic results: either code with experiments, or proofs of why the algorithms work.
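As a toy illustration of the constraints involved (not a districting algorithm itself): a valid plan must assign precincts to districts that are contiguous and roughly population-balanced. The sketch below checks both properties for a made-up 2x2 grid of precincts.

```python
# Toy check of the two core districting constraints: contiguity and
# population balance. Grid, populations, and assignment are invented.
from collections import deque

population = {(0, 0): 100, (0, 1): 120, (1, 0): 110, (1, 1): 90}
district   = {(0, 0): "A", (0, 1): "B", (1, 0): "A", (1, 1): "B"}

def contiguous(d):
    """BFS over grid-adjacent cells to verify district d is connected."""
    cells = [c for c in district if district[c] == d]
    seen, queue = {cells[0]}, deque([cells[0]])
    while queue:
        r, c = queue.popleft()
        for nb in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
            if nb in cells and nb not in seen:
                seen.add(nb)
                queue.append(nb)
    return len(seen) == len(cells)

totals = {}
for cell, d in district.items():
    totals[d] = totals.get(d, 0) + population[cell]

assert all(contiguous(d) for d in set(district.values()))
print(totals)  # {'A': 210, 'B': 210} -- perfectly balanced here
```

Algorithms for this problem must search over exponentially many such assignments while respecting these constraints, which is what makes it algorithmically rich.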
Student Background/Prerequisites: Strong math and programming background, preferably having taken CPS 330 or equivalent.
Lead: Pankaj Agarwal
Description: Optimal transport (OT) is a widely used method for computing the similarity (or distance) between two shapes or probability distributions. Its goal is to deform one distribution into another with minimum effort; the total effort required to transform one distribution into the other is their OT distance. Specifically, given two probability mass distributions, OT asks to pair up masses from the two distributions so that the average distance between paired masses is minimized. OT is also used to compute the representative or "mean" distribution of a collection of probability distributions. Roughly speaking, this "mean" distribution is the one that minimizes the sum of OT distances to each of the given distributions.
Many algorithms for computing OT between two distributions require time quadratic in the size of the supports of the distributions. However, the geometry of OT can be exploited to obtain faster algorithms, especially in low dimensions. The goal of this project is to develop and implement geometry-based algorithms for OT and for computing the "mean" distribution.
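A minimal sketch of the pairing view described above: for two tiny uniform distributions supported on points of the real line, the OT cost can be found by brute force over all pairings. (Real algorithms avoid this factorial search; this project targets faster, geometry-aware methods.)

```python
# Brute-force OT between two 3-point uniform distributions on the line:
# try every pairing of source points to destination points and keep the
# one with minimum total transport distance.
from itertools import permutations

src = [0.0, 1.0, 2.0]   # support of the first distribution
dst = [0.5, 1.5, 3.0]   # support of the second distribution

best_cost, best_pairing = min(
    (sum(abs(s - d) for s, d in zip(src, perm)), perm)
    for perm in permutations(dst)
)
print(best_cost)  # 2.0: pair 0.0->0.5, 1.0->1.5, 2.0->3.0
```

In one dimension, the optimal pairing simply matches points in sorted order, which is exactly the kind of geometric structure that faster algorithms exploit.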
Goals/Deliverables: Adapt existing OT algorithms, implement them, and test their efficacy and efficiency. In particular, develop a greedy algorithm for computing the "mean" distribution.
Student Background/Prerequisites: Basic knowledge of algorithms and data structures, and strong coding skills.
Lead: Rong Ge
Description: Recently, large language models (LLMs) have demonstrated strong "in-context" learning abilities: given a few examples as a prompt, the models can follow the context to make predictions on new examples. This appears to be different from traditional "in-weight" learning, where the weights of the neural network capture all the knowledge learned during training.
Although several recent works (e.g., https://arxiv.org/abs/2210.05675, https://arxiv.org/abs/2309.06054, https://arxiv.org/abs/2310.10616) have tried to understand the difference between in-context and in-weight learning, and what mechanisms enable in-context learning, this is still a major open problem. In this project we hope to find a new perspective that partly explains in-context learning.
Goals/Deliverables: The final product will be a report (or if successful, a pre-print). Outline:
- Read papers on in-context and in-weight learning
- Formalize a simple task that highlights the difference between in-context and in-weight learning
- Train small-scale transformers on the simple task with synthetic data
- Try to interpret what the small-scale model does
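The synthetic-task step in the outline above can be sketched as follows. The idea (our illustration, not the project's actual task) is that each training sequence samples a fresh random token-label rule, so the correct answer for the final "query" token can only be inferred from the in-context examples, never memorized in the weights.

```python
# Synthetic in-context learning data: a per-sequence random rule maps
# tokens to labels; the model must infer the rule from the context
# pairs to label the query token correctly.
import random

def make_sequence(num_examples=3, vocab=("a", "b", "c"), labels=(0, 1)):
    rule = {tok: random.choice(labels) for tok in vocab}  # fresh per sequence
    context = [(tok, rule[tok])
               for tok in random.choices(vocab, k=num_examples)]
    query = random.choice(vocab)
    return context, query, rule[query]  # target label for the query

random.seed(0)
context, query, target = make_sequence()
print(context, query, target)
```

Because the rule changes every sequence, a transformer trained on this data cannot rely on in-weight memorization and is pushed toward an in-context mechanism, which is what makes such tasks useful for interpretation.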
Student Background/Prerequisites: Linear algebra, calculus, and ideally some experience training neural networks in PyTorch.
Lead: Anru Zhang
Description: Generative models are an active topic of research, owing to the evolution of DALL-E, GPT, etc. In the realm of healthcare, hospitals accumulate a great amount of patient-level healthcare data, which are securely stored in privacy-preserving databases. Researchers utilize healthcare data for various analytical purposes: to understand public health trends, track disease spread, and establish connections between symptoms and diseases. The objective of our project is to investigate the potential of cutting-edge generative models in enhancing the analysis of healthcare data. We invite interested students to join us in this innovative exploration.
Goals/Deliverables: Students will learn how to apply generative models in the healthcare setting. If time permits, students are encouraged to write a research paper.
Student Background/Prerequisites: Proficient in Python programming, deep learning, and computing on GPU.
Lead: Pardis Emami-Naeini
Description: Over the past few years, artificial intelligence (AI)-enabled tools have gained tremendous popularity. One common example of such tools is AI chatbots (e.g., ChatGPT), which use conversational AI to simulate a human-like Q&A interaction with users. Healthcare is one of the domains that have recently seen an uptick in the development of chatbots (e.g., Google's Med-PaLM). Such tools provide healthcare professionals and patients with various benefits, ranging from helping clinicians with their patient note-taking to predicting the risks of cancer in patients using their medical history. Moreover, users can benefit from such tools without the need to see a doctor. For example, some current AI chatbots can provide personalized diet and mental health recommendations to users, thereby increasing their autonomy and reducing their dependence on visits to healthcare professionals.
Despite the various positive use cases of healthcare AI chatbots, these tools can pose serious privacy and safety harms to the healthcare system. To function, the chatbots rely on collecting and accessing vast amounts of potentially sensitive medical data from users through their direct and indirect interactions with the chatbots. However, these systems are vulnerable to data breaches, which could expose users to privacy risks. In addition, the models powering such chatbots could be trained on biased data, which could pose further safety harm to their users. For example, due to common algorithmic biases in AI models, doctors who over-rely on AI chatbots could misdiagnose patients from minoritized communities (e.g., Black or African American patients).
Currently, no usable information is provided to the users of chatbots regarding the privacy and fairness of the models underlying the chatbots. This project aims to design a privacy and fairness “nutrition” label for AI-enabled health chatbots that is usable and informative for both patients and healthcare providers. The project involves i) conducting a series of user research studies with various stakeholders, including AI experts, patients, and healthcare providers, to identify the factors that should be included on such labels and ii) designing a prototype label.
Goals/Deliverables: Through conducting this societal research at the intersection of AI, privacy, and health, the students will become more informed about the challenges of emerging AI technologies in healthcare and develop knowledge to empower the users to have more protective interactions with such technologies. The outcome of this research could be presented as a poster or a full research paper in a security, privacy, or human-computer interaction conference.
Student Background/Prerequisites: The project requires conducting user-centered research with users of medical AI chatbots, including doctors and patients. Therefore, it is important for the candidates to have some knowledge of conducting user research methods and analyzing user data. In addition, it would be helpful if the candidates have basic knowledge of the concept of AI chatbots and their potential risks to users' privacy and safety.
Lead: Xiaowei Yang
Description: Modern cloud computing providers offer both computing and network services. Large cloud providers such as Amazon, Google, and Microsoft now own their private backbone networks and offer services that send customers' traffic through their backbones, entirely bypassing the public Internet. In this project, we will use image processing technologies to convert the publicly available fiber maps published by these providers into computation-friendly graph data structures and then study the properties of these private networks (such as shortest path latency) using the graph data structures.
Goals/Deliverables: Students are expected to generate graph data structures from published network map images.
Student Background/Prerequisites: Basic knowledge of data structures and algorithms, with an interest in learning image processing techniques.
Lead: Matthew Lentz
Description: Personal smart devices provide users with powerful capabilities, which are derived (in part) from the ability of applications to operate over a variety of sensitive input/output (I/O) data related to the user: collecting and processing input data from sensors (e.g., fingerprint scans, location updates), or rendering output data to the user (e.g., health information). Users want to express control over the collection and processing of data on their devices; however, there is a complex ecosystem due to the large number of mutually-distrusting stakeholders, including users, manufacturers, vendors, and application owners. Our insight for resolving this tension is to rethink the software I/O stack in terms of “accountable paths”, where each path represents the chain of modules connecting an application to the underlying hardware I/O devices it uses. Given these paths, we have very natural ways to express control (policies applied between modules), reconcile the needs of different stakeholders (privileged modules), and reason about the state of the system (the set of paths).
The correctness of this Accountable I/O (AIO) system is paramount, since many applications will depend on AIO – this is similar to how applications depend on the correctness of the operating system (e.g., macOS). In your own classes and projects, you’ve most likely worked towards correctness by writing test cases. In this project, we will instead leverage “formal verification” to prove correctness in *all* cases at compile time. This involves first writing a complete mathematical specification of correctness. Given this specification, we will develop proofs that our implementation of the software system meets the specification, which will be automatically checked by a verifier. For this summer project, our goal is to continue the verification effort for AIO that we started in the Fall. This will involve iterating on the specification, implementation, and proofs to bring more features into the verified implementation. This will give you first-hand experience with software verification, which is witnessing significant growth in both academia and industry (e.g., AWS, Azure).
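To make the “accountable paths” idea concrete, here is a toy model (ours, in Python for readability; the actual AIO implementation is in Rust/Verus): a path is an ordered list of modules between a device and an application, and policies are checks applied between modules as data flows along the path. All module and policy names are invented.

```python
# Toy model of an accountable I/O path: data flows device -> modules ->
# application, with policies checked between every pair of modules.
def redact_location(data):
    # Hypothetical module: coarsens fine-grained GPS before app delivery.
    data = dict(data)
    data["gps"] = "coarse"
    return data

def no_raw_gps_policy(data):
    # Hypothetical policy applied between modules: raw coordinates
    # must never pass further down the path.
    return data.get("gps") != "raw"

def run_path(modules, policies, data):
    for module in modules:
        data = module(data)
        assert all(p(data) for p in policies), "policy violated"
    return data

sensor_reading = {"gps": "raw", "steps": 4200}
result = run_path([redact_location], [no_raw_gps_policy], sensor_reading)
print(result)  # {'gps': 'coarse', 'steps': 4200}
```

The verification effort described above amounts to proving, rather than dynamically asserting, that every path in the real system upholds its policies.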
Goals/Deliverables:
- Completion of a 2-week crash course in verification for software systems (with feedback)
- Contribute to the specification and verification effort for Accountable I/O, which will involve writing parts of the implementation and proofs in Rust using the Verus framework (see https://github.com/verus-lang/verus)
- Contribute towards a future publication of the work. The verification effort will encompass a significant portion of the paper and the evaluation will directly involve the verified software
- Presentation and poster on the work as part of the CS+ program
Student Background/Prerequisites:
- Has already taken one (or both) of: CS210/CS250 and CS230
- Experience with programming in C, C++, or Rust is a plus
What is the difference between Code+, Data+, Climate+, and CS+? All four “plus” programs have the same model: students collaborate in teams on a tech/data project over the same 10 weeks of the summer and receive a stipend of the same amount. We also partner to provide some common events (talks, social events, a final poster fair, etc.) in order to create a larger ecosystem of students working in tech and data over the summer; over 100 students participated in 2019 across the three programs running that year. Each program has its own application.
- CS+ focuses on projects in computer science research and applications and is run by the Department of Computer Science. Project leads are typically computer science faculty.
- Data+ focuses on interdisciplinary data science projects from all over the university, and is run by Rhodes I.I.D. in Gross Hall. Project leads are typically faculty from diverse areas of the university, with frequent additional participation from community and/or industry partners.
- Code+ focuses on projects in software and product development and is run by Duke OIT, taking place at the American Tobacco Campus in downtown Durham. Project leads are professional IT developers, with an emphasis on gaining real-world development experience.
- Climate+ focuses on climate-related, data-driven interdisciplinary research projects on diverse topics such as electricity consumption, wetland carbon emissions, climate change’s impacts on river and ocean ecosystems, and the use of remote sensing data to inform climate strategies. Project leads are data science experts as well as climate, environment, and energy researchers and practitioners, with additional participation from other project teams.
Do I apply to the program, or can I pick the projects I want to be a part of? You can apply specifically to the projects and faculty of interest to you.
How much background do I need? CS+ is intended for students who have some computer science experience, but students do not need to be computer science majors or rising seniors in order to apply. We welcome and encourage applications from rising 2nd and 3rd year students who have completed the introductory course sequence in computer science and have skills and interests that make them a good fit for their projects. Feel free to reach out to individual project leaders to discuss background for specific projects.