Accepted Papers

Main Track

A Greener Edge: A Framework on Carbon-aware Edge ML System Design

Xuesi Chen, Ilan Mandel (Cornell Tech); Eren Yildiz, Josiah Hester (Georgia Institute of Technology); Udit Gupta (Cornell Tech)

Public Review

Public review by Aruna Balasubramanian, Stony Brook University, Stony Brook, New York

This Emerging Ideas paper addresses the increasingly important problem of sustainability in edge computing systems, focusing on the lifecycle carbon footprint of IoT devices. There have been several recent works studying the effect of machine learning training and inference on carbon emissions, primarily in the context of datacenters. However, this paper highlights that even edge devices contribute significantly to global carbon emissions, particularly due to embodied carbon from manufacturing. The paper then argues that any carbon emission measurement requires jointly considering embodied and operational carbon, as well as deployment-specific factors such as workload characteristics and device architectures.

To this end, the authors present MicroGreen, a design space exploration framework that enables carbon-aware design of edge devices. The framework integrates component-level embodied carbon modeling with workload characterization and environment-dependent operational analysis. By combining models for power, sensing, processing, and networking subsystems, along with environmental traces such as solar availability, MicroGreen identifies carbon-optimal system configurations under different deployment scenarios. The authors further build a repository of microcontrollers in terms of their carbon emissions and cost, and estimate the holistic carbon emissions across processors and environmental conditions.
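
To make the embodied-versus-operational trade-off concrete, the following is a minimal sketch, not MicroGreen's actual model. It assumes lifecycle carbon is simply embodied carbon plus grid energy use multiplied by a constant grid carbon intensity. The `Candidate` class, the function names, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical device configuration (all values are illustrative)."""
    name: str
    embodied_kg: float   # kgCO2e emitted during manufacturing
    avg_power_w: float   # average power drawn from the grid, in watts


def lifecycle_carbon_kg(c, lifetime_hours, grid_kg_per_kwh):
    """Embodied carbon plus operational carbon over the deployment lifetime."""
    energy_kwh = c.avg_power_w * lifetime_hours / 1000.0
    return c.embodied_kg + energy_kwh * grid_kg_per_kwh


def carbon_optimal(candidates, lifetime_hours, grid_kg_per_kwh):
    """Pick the candidate with the lowest lifecycle carbon for this deployment."""
    return min(
        candidates,
        key=lambda c: lifecycle_carbon_kg(c, lifetime_hours, grid_kg_per_kwh),
    )


# Assumed example: a cheap-to-make but power-hungry part versus a
# costlier-to-make but frugal part.
candidates = [
    Candidate("small_mcu", embodied_kg=1.0, avg_power_w=0.1),
    Candidate("efficient_mcu", embodied_kg=2.0, avg_power_w=0.01),
]
short_deploy = carbon_optimal(candidates, lifetime_hours=1_000, grid_kg_per_kwh=0.5)
long_deploy = carbon_optimal(candidates, lifetime_hours=100_000, grid_kg_per_kwh=0.5)
```

Under these assumed numbers, the short deployment favors the part with lower embodied carbon, while the long deployment favors the part with lower power draw, which illustrates why deployment context can change the carbon-optimal choice.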

The reviewers appreciated the timeliness and relevance of the problem, especially given the growing scale of edge AI deployments. A key strength of the paper is its holistic perspective on lifecycle emissions, which moves beyond traditional energy-centric optimization to incorporate embodied carbon and deployment context. The paper presents a case study that demonstrates how deployment heterogeneity can significantly influence carbon-optimal design decisions.

Overall, this paper makes a strong case for lifecycle-aware design of edge systems and introduces a promising framework for enabling carbon-conscious decision-making. It also raises several open questions and research challenges around high-accuracy energy prediction and generalization across new applications and hardware.

A Large-Scale Multimodal Dataset and Benchmarks for Human Activity Scene Understanding and Reasoning

Siyang Jiang, Mu Yuan, Xiang Ji, Bufang Yang (The Chinese University of Hong Kong); Zeyu Liu (University of Illinois Urbana-Champaign); Lilin Xu (Columbia University); Yang Li, Yuting He, Liran Dong, Wenrui Lu, Zhenyu Yan (The Chinese University of Hong Kong); Xiaofan Jiang (Columbia University); Wei Gao (University of Pittsburgh); Hongkai Chen, Guoliang Xing (The Chinese University of Hong Kong)

Public Review

Public review by Mayank Goel, CMU, Pittsburgh, PA

Human activity recognition has long been a central topic in mobile and ubiquitous computing. More recently, advances in multimodal foundation models and large language models have expanded interest beyond coarse activity classification toward richer tasks such as understanding action context, describing behavior sequences, and reasoning about future intent. However, progress in these areas has been limited by the lack of large-scale multimodal datasets that pair diverse sensing modalities with semantically rich descriptions and reasoning-oriented annotations. This paper presents CUHK-X, a large-scale multimodal dataset designed to support Human Activity Recognition, Human Action Understanding and Reasoning. The dataset contains over 64,000 synchronized samples collected from 30 participants across seven sensing modalities, including RGB, depth, thermal, infrared, IMU, mmWave, and skeleton representations. Reviewers appreciated the scale of the engineering effort and the paper’s attempt to bridge recognition-oriented sensing research with emerging multimodal reasoning tasks.

The paper introduces a data collection pipeline centered around a “ground-truth-first” methodology. Rather than collecting large amounts of unstructured sensor data and labeling afterward, the authors first generate semantically coherent daily-life scenes and captions using LLM-assisted prompt generation and then collect synchronized multimodal data aligned to those scenarios. The reviewers appreciated the effort devoted to preventing spatiotemporal inconsistencies and implausible activity sequences through prompt engineering and human verification. The dataset substantially expands modality coverage compared to many prior activity datasets. CUHK-X incorporates several privacy-preserving modalities such as thermal sensing, mmWave radar, and IMUs alongside richer textual descriptions. The reviewers found this multimodal alignment valuable for the community, particularly as researchers increasingly explore sensing systems that balance contextual understanding with privacy concerns. The paper provides a broad set of benchmark tasks spanning recognition, captioning, contextual understanding, sequential ordering, and reasoning. The evaluation demonstrates that models fine-tuned on CUHK-X improve performance across several downstream tasks, and the benchmarks help expose the current limitations of state-of-the-art multimodal and reasoning models when operating on non-RGB sensing modalities.

At the same time, the reviewers raised several important limitations and open questions. One recurring concern was the ecological validity of the collected activities. While the generated scenes are carefully structured and human-verified, the dataset remains fundamentally an acted dataset with LLM-assisted scripting. Several reviewers questioned whether some reasoning tasks may partially evaluate alignment between LLM-generated scripts and downstream language models, rather than purely measuring natural human behavioral reasoning.

The reviewers also discussed the practical role and contribution of some modalities. In particular, some sensing streams—such as IMU and mmWave—currently achieve relatively low standalone performance. Questions were also raised regarding whether certain representations, such as skeleton data, should be considered independent sensing modalities when they are derived algorithmically from camera inputs. More broadly, reviewers noted that the paper’s strongest contribution lies in the dataset itself and the infrastructure around multimodal activity collection rather than in algorithmic novelty.

Despite these limitations, the reviewers agreed that CUHK-X represents a timely and valuable resource for the mobile systems and multimodal sensing communities. As research increasingly moves from isolated activity recognition toward richer contextual understanding and reasoning, datasets that align diverse sensor streams with semantically grounded descriptions will become increasingly important. CUHK-X provides an important step toward enabling this broader class of human-centered multimodal intelligence systems.

Act Before It's Too Late: Power-Efficient LLM Inference on Mobile Device

Haolin Chu, Jinxiao Fan, Jiabin Deng, Bensong Yu (Beijing University of Posts and Telecommunications); Liguang Xie (Bytedance technology company Limited); Liang Liu, Huadong Ma, Xiaolong Zheng (Beijing University of Posts and Telecommunications)

Public Review

Public review by Aruna Balasubramanian, Stony Brook University, Stony Brook, New York

The goal of this paper is to improve power efficiency for on-device large language model (LLM) inference on mobile systems. Prior work has focused on model-level optimizations, such as quantization and pruning, but these techniques can degrade accuracy when applied aggressively. Instead, this paper takes a systems-level approach by focusing on optimizing the token-generation process for power efficiency. The idea is based on the observation that GPU stalls, a phenomenon where the GPU cannot make forward progress on computations because it is waiting on resources, account for a considerable fraction of token-generation latency.

This paper exploits this observation for power efficiency by introducing TurboInfer, a millisecond-level frequency scaling technique that adapts GPU frequency based on GPU stalls. At first glance, this may sound similar to DVFS (Dynamic Voltage and Frequency Scaling) techniques that have been used to adapt GPU frequencies for latency and power improvements. However, DVFS operates at a coarse granularity, whereas the stalls occur at a microsecond granularity. To this end, TurboInfer performs millisecond-level frequency scaling that lowers GPU frequency during stalls and increases it during active computation. To detect these stalls at such fine granularity, TurboInfer uses performance monitoring unit (PMU) registers to track GPU instruction counts in real time. It also introduces a tube-based model predictive control (TMPC) formulation to predict near-optimal GPU frequencies while accounting for noise from background applications.
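
The basic control loop described above can be illustrated with a deliberately simplified sketch. It does not implement the paper's TMPC formulation; instead it substitutes a plain exponentially smoothed threshold rule. The per-window instruction counts stand in for PMU readings, and the function names, threshold, smoothing factor, and frequency values are all hypothetical.

```python
def choose_next_frequency(prev_estimate, observed_count, alpha,
                          stall_threshold, low_mhz, high_mhz):
    """Smooth the observed instruction count and pick a frequency.

    A low smoothed count is treated as a stall (lower the frequency);
    otherwise the GPU is assumed to be actively computing (raise it).
    """
    estimate = alpha * observed_count + (1 - alpha) * prev_estimate
    freq = low_mhz if estimate < stall_threshold else high_mhz
    return estimate, freq


def run_controller(counts, initial_estimate, alpha,
                   stall_threshold, low_mhz, high_mhz):
    """Apply the rule to a trace of per-window instruction counts."""
    estimate = initial_estimate
    freqs = []
    for count in counts:
        estimate, freq = choose_next_frequency(
            estimate, count, alpha, stall_threshold, low_mhz, high_mhz
        )
        freqs.append(freq)
    return freqs


# Assumed trace: two busy windows, three stalled windows, one busy window.
trace = [1000, 1000, 0, 0, 0, 1000]
reactive = run_controller(trace, 1000, alpha=1.0, stall_threshold=300,
                          low_mhz=300, high_mhz=800)
smoothed = run_controller(trace, 1000, alpha=0.5, stall_threshold=300,
                          low_mhz=300, high_mhz=800)
```

With `alpha=1.0` the rule reacts to every window immediately, while `alpha=0.5` delays the switch to the low frequency by one window. A predictive controller such as the one the paper describes would aim to choose frequencies ahead of time while tolerating measurement noise, which this threshold rule does not attempt.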

A key strength of the paper is its system-level perspective, which complements existing model-level optimizations and can be broadly applied without modifying model architectures. The paper also provides a detailed characterization of GPU stall behavior during LLM inference and demonstrates that such stalls can dominate execution time, motivating the need for fine-grained control. The system is implemented end-to-end and evaluated on two smartphones. The results show that TurboInfer can provide over 40% power savings compared to baselines without affecting token generation latency.

Overall, by identifying and exploiting fine-grained GPU workload characteristics, this paper demonstrates the potential of system-level optimizations to significantly reduce energy consumption without sacrificing performance.

AdaSprite: Resource-efficient Online Co-Adaptation for V2I Systems Under Large-scale Data Drifts

Lehao Wang (Northwestern Polytechnical University); Zhiwen Yu (Northwestern Polytechnical University, Harbin Engineering University); Sicong Liu, Kefan Chen, Fengmin Wu, Bin Guo (Northwestern Polytechnical University)

Public Review

Public review by Qin Lv, University of Colorado Boulder, Boulder, CO USA

Vehicle systems and roadside infrastructures are increasingly equipped with advanced visual sensing capabilities, enabling AI-powered perception in networked vehicle-to-infrastructure (V2I) collaboration. Compared with traditional vision models, vision-language models (VLMs) can extract task-specific cues with compact embeddings and are thus more suitable for resource- and bandwidth-limited V2I scenarios. Furthermore, vision mixture-of-experts (V-MoE) can extract distributed features across heterogeneous V2I devices, which adapts to different device constraints and supports improved tradeoffs between perception accuracy and efficiency in diverse application scenarios, such as path planning, localization, and navigation. However, there can be large data drifts in real-world deployments, and the drifts change dynamically in the distributed system. In particular, the VLMs’ visual modules on different V2I devices are more susceptible to large drifts between training and real-world traffic conditions, resulting in degraded performance.

This paper proposes AdaSprite to support resource-efficient online co-adaptation for V2I systems under large-scale data drifts. The authors investigate and tackle three key challenges: memory fragmentation and load imbalance caused by DRAM-constrained expert parallelism, bottlenecks caused by memory-I/O-bound computation for cross-task reuse, and repeated expert cache reload when scheduling asynchronous adaptation requests. Aiming to explore the upper bound of concurrent adaptation tasks on resource-constrained edge devices, the authors have developed a number of techniques that utilize expert dynamic sparsity for improved throughput. The key idea is to proactively restructure execution based on shared context in modular/partially loaded components. Specifically, AdaSprite uses lightweight dynamic programming for expert lifespan alignment and spatio-temporal co-scheduling with dynamic resource allocation. It also leverages a shared DRAM-resident expert cache for cross-task reuse and I/O-hiding prefetch with importance-aware residency/eviction. Furthermore, AdaSprite utilizes twin-buffer scheduling, which interleaves services between buffers and preloads inactive experts during task switches. Using both simulated and real-world data, the authors have tested AdaSprite on five vision-QA tasks in autonomous driving and an urban VLM-based V2I application. Results show improved efficiency and better tradeoffs between accuracy and concurrency.

Overall, the reviewers appreciated the authors’ efforts in tackling this challenging problem with good in-depth investigation and novel techniques. Reviewers suggested further clarification regarding the intended usage scenarios and the generalizability of MoE and large data drifts in real-world applications. Several other results helped complete the evaluations, including strengthening the comparison with GPU-sharing schedulers, adding accuracy-latency tradeoff results of EATA, as well as the comparison and analysis of MoE runtime/training baselines. Reviewers also noted the importance of discussing the potential impact of system instability in real-world deployments.

As more learning models are integrated into real-world systems and applications, the impact of large-scale data drifts requires careful investigation and mitigation, especially in dynamic and distributed settings with limited resources. This work represents a good step forward for an important application scenario with solid system innovations. Further investigation in related topics and other application scenarios can be beneficial for our research community with broader societal impact.

ADL-CLIP: Text-Aligned RF Representation for Continuous ADL Detection in Home Environments

Mengjing Liu (Stony Brook University); Zongxing Xie (Kennesaw State University); Fan Ye (Stony Brook University)

Public Review

Public review by Di Duan, CUHK, Hong Kong SAR

Detecting Activities of Daily Living (ADLs) remains an important, long-standing problem in contextual sensing for home settings; it serves as a cornerstone of various research thrusts and practical applications, such as assisting the elderly and monitoring and tracking chronic diseases. However, in such privacy-sensitive environments, well-studied video-based segmentation approaches are usually unacceptable. At the same time, accurate and fine-grained ADL segmentation methods for RF-based solutions remain significantly under-developed. Furthermore, in the real world, open-set activities and inherently ambiguous ADL boundaries collectively pose a significant challenge.

This paper discloses several interesting observations about the relationship between data-driven segmentation and human annotations, such as that data-driven segmentation can independently identify meaningful sub-actions within a single activity, even when those sub-actions were never labeled separately. It formulates the segmentation problem for continuous RF ADL sequences as an optimization task directly in the learned embedding space and proposes an iterative learning procedure that merges data-driven segmentation results with human annotations. In doing so, ADL-CLIP is the first to fully address RF-only sensing in realistic home environments while simultaneously tackling unknown activity boundaries and open-set generalization to unseen ADLs. Moreover, its method is thoughtfully designed: CLIP-style contrastive learning supplies strong semantic generalization from language, the data-driven segmentation in embedding space solves the core boundary-detection problem, and the iterative fusion mechanism reconciles the inherent ambiguity and temporal shifts in human annotations without sacrificing the generalization power of the CLIP-aligned representation.

The reviewers recognized this paper as the first RF-based framework for open-set, boundary-free continuous ADL detection in a realistic home environment. They appreciated the novel insight on intra- versus inter-ADL embedding clustering, the good formulation of data-driven segmentation as a label-free optimization problem, and the pragmatic iterative alignment between data-driven and human-annotated boundaries. At the same time, the reviewers raised important concerns regarding the practical handling of unknown segment counts, the realism and generalization claims given the scripted and partially simulated data collection, and the lack of direct segmentation metrics beyond end-to-end accuracy. Through the shepherding process, the authors effectively addressed these points by adopting PELT for segmentation without pre-specifying the number of segments, adding segment-level IoU evaluation, toning down overstated “in-the-wild” claims, explicitly discussing key limitations, and releasing a curated dataset.

In a nutshell, ADL-CLIP demonstrates that the approach of “data-driven segmentation + iterative alignment” is not only feasible but highly effective for continuous RF-based ADL detection in real-home environments, with performance that far surpasses both traditional sequential models and direct applications of CLIP.

Agent-X: Full Pipeline Acceleration of On-device AI Agents

Jinha Chung, Byeongjun Shin, Jiin Kim, Minsoo Rhu (KAIST)

Public Review

Public review by Longfei Shangguan, University of Pittsburgh, Pittsburgh, PA

If 2025 marked the rise of reasoning-driven LLMs, 2026 is increasingly becoming the year of agentic systems. Today, LLMs are no longer just asked to answer questions, but to plan, call tools, interact with apps, and complete tasks on behalf of users. The recent momentum around OpenClaw is a strong sign of this shift. This shift is particularly important for mobile systems, where agents are expected to be responsive, private, and always available. However, bringing such agentic capabilities onto personal devices remains difficult, because the performance bottlenecks of agents are fundamentally different from those of standard chat-style LLM workloads. This paper asks a practical and timely research question: can we systematically accelerate the full on-device agent pipeline without hurting task accuracy?

The strength of Agent-X is that its design is grounded in a clear understanding of the agent workload. The paper first experimentally demonstrates that on-device agents suffer from substantial latency in both the prefill and decode stages. It then builds a full-pipeline acceleration framework around these two bottlenecks. For the prefill stage, the paper introduces PromptWeaver, which restructures prompts to improve prefix cache reuse and turn otherwise dynamic prompt regions into more cache-friendly inputs. For the decode stage, it introduces ExSpec, which observes that many agent outputs are strongly grounded in the prompt and few-shot examples, and can therefore be speculated efficiently without relying on an additional draft LLM. The use of a lightweight n-gram lookup table as the draft model is also a good design choice. Although n-gram-style drafting has been explored before, it fits this setting particularly well as it avoids the memory and latency overhead of maintaining another LLM on-device, while still capturing a meaningful fraction of predictable output patterns in agent workflows.
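
For readers unfamiliar with n-gram drafting, the following is a minimal, generic sketch of the idea and not the ExSpec implementation. It assumes the lookup table is built from prompt tokens and that the target model's verified tokens are available for comparison; the function names and the toy token sequence are hypothetical.

```python
def build_ngram_table(tokens, n):
    """Map each n-token context in the prompt to the token that follows it.

    Later occurrences overwrite earlier ones.
    """
    table = {}
    for i in range(len(tokens) - n):
        table[tuple(tokens[i:i + n])] = tokens[i + n]
    return table


def draft(table, context, n, max_tokens):
    """Propose up to max_tokens by repeatedly looking up the last n tokens."""
    ctx = list(context)
    proposed = []
    for _ in range(max_tokens):
        key = tuple(ctx[-n:])
        if key not in table:
            break  # no match in the prompt: stop drafting
        nxt = table[key]
        proposed.append(nxt)
        ctx.append(nxt)
    return proposed


def accept_prefix(drafted, verified):
    """Count how many drafted tokens agree with the target model's tokens."""
    accepted = 0
    for d, v in zip(drafted, verified):
        if d != v:
            break
        accepted += 1
    return accepted


# Toy example with word-level "tokens" (illustrative only).
prompt = ["call", "tool", "open_app", "with", "arg", "camera"]
table = build_ngram_table(prompt, n=2)
proposal = draft(table, ["please", "call", "tool"], n=2, max_tokens=3)
n_ok = accept_prefix(proposal, ["open_app", "with", "flag"])
```

In this toy case the table proposes three tokens copied from the prompt and the first two survive verification. The draft step is a dictionary lookup, so it adds no second model to memory, which is the property the review highlights.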

At the same time, the current design is still somewhat tailored to a particular style of agent workflow, and some of its benefits rely on offline preprocessing and cached artifacts whose robustness under broader workloads remains to be better understood. Overall, Agent-X represents a meaningful early step toward efficient on-device agentic systems.

AgentProg: Empowering Long-Horizon GUI Agents with Program-guided Context Management

Shizuo Tian, Hao Wen, Yuxuan Chen (Tsinghua University); Jiacheng Liu (Peking University); Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, Yuanchun Li (Tsinghua University)

Public Review

Public review by Anlan Zhang, Adobe Research, San Jose, CA

Building intelligent assistants capable of operating mobile GUIs has long been a key goal, yet scaling them to handle long-horizon tasks that require dozens or even hundreds of sequential steps remains a significant challenge. As interaction histories grow, existing context management strategies often fail to retain critical information, lack selectivity, or introduce excessive overhead. These limitations are further compounded by the unique difficulties of mobile GUI environments, including partial observability, noisy accessibility trees, and fragmented UI ecosystems.

This paper presents AgentProg, a program-guided framework for long-horizon mobile GUI agent context management. The key idea is to treat agent interactions like program execution: an agent’s interaction history can be structured as a “program” with control flow and data flow to systematically determine what information to retain and what to discard. AgentProg introduces Semantic Task Program (STP), a DSL that blends natural language with control flow (loops, conditionals, functions) and explicit variables. Unlike rigid symbolic DSLs requiring precise APIs, STP leverages LLM priors for better robustness and generalization. During execution, an LLM-driven program counter selects active context, while an execution tree prunes completed branches. To handle the partial observability inherent in mobile GUI environments, AgentProg further integrates a Global Belief State mechanism that continuously validates and updates the agent’s hypotheses about environment states and hidden variables. Evaluated on AndroidWorld and a new long-horizon extension (AW-Extend) with tasks requiring over 30 steps, AgentProg achieves 78.0% and 68.4% success rates respectively, significantly outperforming baselines that exhibit catastrophic performance degradation on extended tasks.

The reviewers found the program abstraction for agent context management novel, intuitive, and well-motivated. The end-to-end implementation with a challenging benchmark was recognized, and the empirical gains on long-horizon tasks were strong. Meanwhile, the reviewers raised several concerns during the review process. First, reviewers questioned how AgentProg handles situations where the high-level STP itself is incorrect or infeasible. The authors reported a low STP error rate of 2.5% on AndroidWorld and explained that the Global Belief State can detect misalignment between the STP and the dynamic environment, providing feedback for potential re-planning. Second, the high latency and inference cost of AgentProg were noted as significant practical concerns for real-world deployability. The authors argued that the framework targets complex background automation tasks where accuracy is prioritized over responsiveness, and that future model optimizations will help mitigate costs. Third, reviewers requested more thorough ablation studies and technical details. The authors provided ablations showing that removing the Execution Tree drops performance from 68.4% to 39.5% on AW-Extend, confirming the importance of context pruning, while removing Explicit Variables reduces performance to 50.0%. During shepherding, the paper was revised to include additional technical details, an end-to-end pipeline walkthrough, ablation results, a discussion of broader applicability beyond mobile GUI agents, and a limitations section addressing the accuracy-latency trade-off.

Overall, AgentProg makes a meaningful contribution by introducing a principled, program-guided approach to context management for long-horizon mobile GUI agents. The work opens several directions for future research, including reducing inference cost through model distillation for on-device deployment, developing more robust re-planning strategies when STP generation fails, and extending the framework to broader classes of long-horizon task agents beyond GUI settings.

Analysis of Always-Listening Services on Android

Leo Cao, Jack West, Kassem Fawaz (University of Wisconsin-Madison)

Public Review

Public review by Sung-Ju Lee, KAIST, Daejeon, Korea

This paper examines how always-listening services operate on modern Android devices, a topic that sits squarely at the intersection of mobile systems, machine learning, and privacy. These systems span DSP runtimes, OS interfaces, and application-layer logic, much of which is proprietary and difficult to inspect in practice. To address this, the authors develop a reverse-engineering approach that combines architectural analysis, static inspection, and dynamic instrumentation, allowing them to reconstruct the end-to-end pipeline of these services.

One of the central contributions is the notion of local misactivations, where early-stage audio detectors trigger on-device processing without any user-visible indication. The authors show, through experiments across multiple devices and configurations, that such events can occur quite frequently and lead to short segments of audio being processed locally. This provides new empirical insight into how production systems behave in the wild, and highlights a subtle but important gap between system behavior and user expectations. The technical depth of the work stands out: the paper successfully traces behavior across DSP, HAL, and application layers, which is nontrivial and rarely done at this level of completeness.

Beyond the specific phenomenon of misactivations, the paper is valuable in how it makes an opaque system observable. The reverse-engineering workflow is thoughtfully constructed and demonstrates that meaningful system-level understanding can be recovered even in the presence of closed-source components and vendor-specific implementations. This is particularly relevant as modern mobile systems increasingly rely on tightly integrated hardware–software stacks, where traditional analysis tools provide limited visibility. The approach taken here could serve as a useful blueprint for studying other similarly complex subsystems.

As with any study of production systems, especially those involving nondeterministic ML components and proprietary implementations, there are natural limitations in what can be measured and generalized. Some of the observed behaviors are device- and configuration-dependent, and certain aspects of system impact are inherently difficult to quantify precisely. That said, these challenges are largely a reflection of the problem setting rather than shortcomings of the approach, and the paper does a reasonable job of scoping its claims and supporting them with empirical evidence.

Overall, this is a strong and timely contribution. It makes previously hidden aspects of always-listening pipelines observable, and opens up a practical path toward auditing such systems in real-world deployments. The insights here should be of interest to both researchers and practitioners working on mobile systems, privacy, and always-on sensing.

AquaCode: Turning Transaction Platforms into Non-Invasive Liquid Quality Checker

Sijie Ji (California Institute of Technology); Sheng Lyu (Independent Researcher); Wenjian Li, Wei Gao (California Institute of Technology)

Public Review

Public review by Binbin Xie, University of Texas at Arlington

Liquid quality assessment remains a difficult problem in retail settings, where counterfeit and spoiled products can lead to both financial losses and health risks. Existing sensing approaches, such as RF- or acoustic-based methods, often face challenges when deployed in real-world environments. In particular, their measurements can be easily affected by multipath interference and variations across hardware platforms, which limits their scalability.

This paper presents AquaCode, a system that takes a different approach by focusing on SKU-level verification rather than general classification. The key idea is to leverage the Liquid-Solid Interface (LSI) as a sensing modality. By repurposing identity tags as sensing interfaces at checkout, AquaCode enables non-invasive liquid quality verification through container walls, without relying on unstable wireless channels.

The reviewers appreciated the use of LSI charge-transfer dynamics as a new sensing primitive. This physics-based approach is inherently more stable and less susceptible to wireless interference compared to prior methods. They also found the system design to be well thought out, combining custom electrometer-grade hardware with lightweight signal processing. The experimental evaluation is comprehensive, covering sixteen types of liquids and a range of container materials, along with a user study, which together demonstrate the practicality of the approach for smart checkout systems.

At the same time, the reviewers raised several concerns about deployment and robustness. One key issue is the reliance on an active rotation motor for mechanical excitation, which may complicate integration into existing retail infrastructure and increase deployment cost. They also pointed out that LSI signals could be sensitive to temperature changes and temporal drift, potentially affecting consistency in real-world conditions. In addition, the current evaluation does not fully address irregular container shapes, and more details are needed on how the SKU-level “golden standard” handles variations across different production batches.

Given that LSI-based sensing is still relatively new in the mobile systems community, the reviewers suggested several directions for future work. These include developing temperature-aware calibration to reduce environmental drift, extending the evaluation to more diverse packaging types (e.g., non-cylindrical or paper-based containers), and exploring whether existing motions in checkout systems, such as conveyor belt vibrations, could replace the dedicated motor. Finally, providing clearer details on the decision logic and verification thresholds would help improve reproducibility and make the system easier to generalize.

AquaScope: Reliable Underwater Image Transmission on Mobile Devices

Beitong Tian, Lingzhi Zhao, Bo Chen, Mingyuan Wu, Haozhen Zheng, Deepak Vasisht, Francis Y. Yan, Klara Nahrstedt (University of Illinois Urbana-Champaign)

Public Review

Public review by Mahadev Satyanarayanan, Carnegie Mellon University, Pittsburgh, PA

This paper describes an end-to-end system for efficient transmission of images through water using acoustic transmission. RF signals do not propagate well in water, so acoustic communication is necessary. Unfortunately, acoustic communication only provides very low bandwidth, typically a few kbps. With image sizes that are typically in the many MB range, transmission time at such low bandwidths becomes unusable. Extreme compression through encoding is the only hope. Unfortunately, existing methods of encoding images fail to deliver the extreme level of compression necessary.

For distances of up to a few tens of meters, this paper shows that efficient underwater transmission of images is indeed possible. For example, a 256x256 color image can be transmitted in under 9 seconds over a distance of 20 meters. The key idea is a very efficient encoding of images using a codebook based on Generative AI. Transmission can be lossy and involve both loss and distortion of symbols. The paper describes low-level mechanisms to mitigate the effects of these hazards. Experimental evaluation using both microbenchmarks and macrobenchmarks confirms that the proposed mechanism improves significantly upon the state of the art.
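
A quick back-of-the-envelope calculation shows why extreme compression is needed. This sketch assumes a link rate of 2 kbps as one instance of "a few kbps", a 2 MB image as one instance of "many MB", and 3 bytes per pixel for an uncompressed color image. None of these specific values come from the paper, and protocol overhead is ignored.

```python
def transmission_seconds(size_bytes, rate_bits_per_s):
    """Time to send size_bytes over a link with the given raw bit rate."""
    return size_bytes * 8 / rate_bits_per_s


def payload_budget_bytes(seconds, rate_bits_per_s):
    """Bytes that fit in the given time at the given raw bit rate."""
    return seconds * rate_bits_per_s / 8


def raw_rgb_bytes(width, height):
    """Size of an uncompressed image at 3 bytes per pixel."""
    return width * height * 3


ASSUMED_RATE_BPS = 2_000  # assumption: 2 kbps acoustic link

# An assumed 2 MB image sent as-is.
uncompressed_time = transmission_seconds(2_000_000, ASSUMED_RATE_BPS)

# What a 9-second budget allows, versus a raw 256x256 color image.
budget = payload_budget_bytes(9, ASSUMED_RATE_BPS)
raw = raw_rgb_bytes(256, 256)
required_ratio = raw / budget
```

Under these assumptions, the 2 MB image would take 8,000 seconds (over two hours), and fitting a raw 256x256 color image into a 9-second budget would require shrinking 196,608 bytes to about 2,250 bytes, a reduction of roughly 87x.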

Many avenues remain for future improvement, and the paper itself identifies a number of them. End-to-end latency could be further reduced. The protocol could be made more robust in areas such as preamble detection. Phenomena such as thermal layering, which can distort signal propagation under water, could be addressed.

The reviewers were impressed with the importance of the problem being addressed, the end-to-end systems approach of the paper, the thoroughness of execution of its key ideas, and the thoroughness of the experimental evaluation. In many ways, this is a classic systems paper.

Armstrong: A Full-Duplex Backscatter Architecture for the mmWave IoT

Skanda Harisha, Jimmy G D Hester, Aline Eid (University of Michigan)

Public Review

Public review by Swarun Kumar, CMU, Pittsburgh, PA

Backscatter communication has been widely explored as an ultra low-power communication and sensing modality for the Internet of Things (IoT). Although traditional backscatter systems operate at sub-6 GHz frequency bands, mmWave backscatter has attracted much recent interest. Such backscatter tags have been widely used in communication and sensing applications. Given the wider bandwidth available at mmWave frequency bands, mmWave backscatter offers improved resolution in wireless sensing, such as the ability to precisely locate tags from a distance. Likewise, this higher bandwidth also equips mmWave backscatter communication systems to offer a path to higher throughput relative to traditional backscatter. However, much of mmWave backscatter system design, as with traditional backscatter, has focused on communication on the uplink: i.e. from the tag to the reader. Enabling communication at the downlink is challenging owing to the need for expensive mmWave components at the tag, as well as time-division protocols to separate downlink and uplink communication. More broadly, a full duplex communication system that allows for simultaneous communication at both the uplink and downlink to a mmWave backscatter tag remains an open problem.

Armstrong is a first-of-its-kind architecture for full-duplex mmWave backscatter. It allows for a low-cost approach for enabling full-duplex communication, achieving significant range improvements in connectivity at both the uplink and downlink. The system designs and integrates a high-Q, low-power regenerative amplifier in the Armstrong tag, which enhances signal strength without the need for costly high-power active RF chains. The paper further contributes a regenerative rectifier front-end design that improves the sensitivity of downlink reception. The system is fully implemented and evaluated, demonstrating a complete tag–reader prototype operating at mmWave frequencies. Armstrong achieves high Kbps bidirectional data rates for mmWave full-duplex backscatter.

The reviewers of the paper appreciated the novelty of the full-duplex backscatter concept. They recognized the system’s long range and low cost as key strengths. The reviewers also appreciated the paper striving to revive a lost art from early radio design (i.e. regenerative amplifiers) in a new context, and acknowledged the authors’ efforts in the real-world prototyping of the system.

The reviewers raised several concerns that led the authors to moderate the paper’s overall claims during the shepherding process. These concerns included lower data rates, the hidden cost of reader infrastructure, missing ISAC (integrated sensing and communication) features, and certain missing experiments (e.g. multiple tags, mobility, etc.). Through the shepherding process, the authors included several additional experimental results to address the concerns raised in the reviews and duly acknowledged the system’s limitations.

Overall, this paper is an important first step in realizing the concept of mmWave full-duplex backscatter, with much follow-on research required to make such a system truly high throughput, scalable, and ISAC-ready for pervasive adoption.

AutoRF: Towards an Agentic Framework for Automated RF Hardware Design

Ruichun Ma (Microsoft Research Asia); Lili Qiu (UT Austin, MSRA Shanghai); Wenjun Hu (Yale University); Jiazhao Wang (Singapore University of Technology and Design); Yiwen Song (Carnegie Mellon University); Hao Pan (Microsoft Research, Shanghai Jiao Tong University)

Public Review

Public review by Ambuj Varshney, National University of Singapore, Singapore

Electronic design automation has an uneven history. Digital integrated circuit design was substantially automated by the 1980s, and the resulting tooling made billion-transistor chips routine. Analog and RF hardware design has stubbornly resisted the same treatment and is often called a “black art.” The craft depends on deep understanding of physics and electromagnetic coupling, and on tacit knowledge transferred imperfectly. Decades after Balanis and Pozar became fixtures on every RF designer’s shelf, a new antenna or filter still means weeks of hand-tuning inside a full-wave simulator. Metasurfaces sharpen this tension. They hold growing potential for sensing, security, and wireless-powering applications, yet nearly all metasurfaces today are developed once, for one scenario, by one expert, and are not readily adaptable to the next frequency band, substrate, or form factor. Any change is expensive because the same scarce expertise is required. A growing body of work attempts to automate the design process using neural networks trained against large simulator-generated datasets. These models work within the distribution they were trained on but are opaque, require significant end-user expertise and vast training data, and generalize poorly across hardware, bands, and materials. Whether the capabilities of modern LLMs can shift this balance is a natural and timely question that this paper tackles.

AutoRF proposes an agentic framework that decomposes the design workflow into LLM-driven stages: a demand compiler that parses natural-language requests into structured hardware specifications, a design proposer that retrieves a suitable template and drafts a modification strategy, and a design maker that iteratively tunes the design in HFSS until specifications are met. A key enabler is the circuit-model simulator, which repurposes textbook cascaded ABCD-matrix analysis as an LLM-accessible fast screen that evaluates candidate topologies in under a second, reserving full-wave simulation for refinement. The paper is well written and educational, and the evaluation is grounded in real-world artifacts, with nine fabricated prototypes spanning 2.4 GHz to sub-THz, including a programmable phase-shifting surface and a low-cost paper-based transmissive mmWave metasurface.
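For readers unfamiliar with cascaded ABCD-matrix analysis, the textbook technique can be sketched in a few lines. This is a generic illustration of the method, not AutoRF’s simulator: the function names, the T-network low-pass topology, and the component values (8 nH, 3.2 pF, 50 Ω reference impedance) are all assumptions chosen for the example.

```python
import math

# Each two-port is a 2x2 matrix ((A, B), (C, D)) of complex numbers.

def series(z):
    """ABCD matrix of a series impedance z (ohms)."""
    return ((1, z), (0, 1))

def shunt(y):
    """ABCD matrix of a shunt admittance y (siemens)."""
    return ((1, 0), (y, 1))

def cascade(*stages):
    """Multiply stage matrices in order, from source side to load side."""
    result = ((1, 0), (0, 1))
    for (a2, b2), (c2, d2) in stages:
        (a1, b1), (c1, d1) = result
        result = ((a1 * a2 + b1 * c2, a1 * b2 + b1 * d2),
                  (c1 * a2 + d1 * c2, c1 * b2 + d1 * d2))
    return result

def s21(abcd, z0=50.0):
    """Forward transmission coefficient for equal port impedances z0."""
    (a, b), (c, d) = abcd
    return 2 / (a + b / z0 + c * z0 + d)

def lowpass_s21_db(freq_hz, l_henry=8e-9, c_farad=3.2e-12, z0=50.0):
    """|S21| in dB of a hypothetical L-C-L T-network low-pass filter."""
    w = 2 * math.pi * freq_hz
    network = cascade(series(1j * w * l_henry),
                      shunt(1j * w * c_farad),
                      series(1j * w * l_henry))
    return 20 * math.log10(abs(s21(network, z0)))

for f in (1e6, 1e9, 10e9):
    print(f"{f / 1e9:g} GHz: {lowpass_s21_db(f):.2f} dB")
```

Because each evaluation is only a handful of complex multiplications per frequency point, sweeping many candidate topologies this way is cheap compared with a full-wave solve, which is the property the review highlights.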

AutoRF raises questions that the community will want to carry forward. First, do LLM-driven approaches represent a qualitative leap over existing neural-network pipelines, or do they primarily shift the automation boundary? Neural-network pipelines require experts to curate large synthetic datasets; AutoRF requires experts to craft and maintain a template library, design rules, and reviewer feedback, each encoding substantial domain knowledge. Understanding where this tradeoff pays off is a question best answered with more systems, not just one. Second, AutoRF is closer to a topology selector and parameter optimizer than to a generative designer. Moving from template tuning toward genuine topological synthesis, while retaining the practical fabricability that expert templates encode, remains open. Distinguishing genuine discovery from sophisticated retrieval within a template space is subtle and presents an open challenge for future work. Finally, practical questions of economics and reproducibility remain: token and simulation costs per design, robustness to API deprecation, and the extent to which systems depending on closed simulators and frontier LLM APIs can be reproduced broadly. Each of these is a promising direction that AutoRF opens for the community.

Wireless researchers have spent the better part of a decade asking how to give designers more powerful simulators, more expressive optimization frameworks, and larger datasets. AutoRF asks a different question: how much of the design workflow can be offloaded to automated reasoning, so that metasurface and antenna customization becomes accessible without deep RF expertise? The present prototype is an early step, but it provides ingredients future systems are likely to build on. This paper opens a conversation worth having about the future of electronic design automation and LLMs.

BeamFormer: Transformer-based Beam Management for 6G Networks

Shunqiang Feng (University of Virginia); Swastik Kanjilal (RPI); Kun Qian (University of Virginia); Ish Kumar Jain (RPI)

Public Review

Public review by Minsung Kim, Rutgers University, New Brunswick, NJ

This paper presents BeamFormer, a transformer-based beam management framework for giga MIMO (gMIMO) with dense antenna arrays. While the massive number of antennas in gMIMO has great promise for increasing wireless capacity in 6G networks, it also entails significant latency issues for conventional and even advanced beam management methods, such as hierarchical beam sweeping and compressive sensing-based ones. As a new approach to beam management that leverages transformers, BeamFormer generates the received signal strength (RSS) of unknown query beams given RSS measurements of a few reference beams, achieving a delay of only 2.5 ms. The paper identifies three key challenges in doing so and proposes a solution to each; these solutions form the core components of BeamFormer. BeamFormer was evaluated through 28 GHz software-defined radio (SDR)-based experiments using ray-tracing training data, and it showed a 6.7 dB gain compared to existing approaches.

The reviewers agreed that the use of transformers for beam management was intriguing, and that the overall design was well validated through extensive experiments with both simulations and SDRs against various baseline schemes. However, the reviewers also pointed out several issues in the submission version, such as insufficient rationale for using the transformer (especially compared to other machine learning algorithms) and missing discussions and/or evaluations of real-world compatibility, failure cases, system latency, and scalability. The authors addressed most of these concerns well with clear explanations and more comprehensive evaluations during the revision and shepherding process; they added more than five new experimental results. Also, during the revision, the authors open-sourced the code and dataset, which will be valuable for future benchmarks.

Overall, the paper introduces an interesting approach for beam management in 6G networks: leveraging generative AI for gMIMO beam management. While this work is currently limited to the physical layer design, it makes solid contributions as an initial work in this direction, with baseline performance and open-sourced code and dataset. In general, the use of transformers, the state-of-the-art architecture for generative tasks, for wireless network applications seems worth exploring further. This work offers insights and a foundation for that exploration.

BeamMix: 3D Gaussian Mixture-of-Experts for Element-Space Wireless Channel Modeling

William Bjorndahl, Joseph Camp (Southern Methodist University)

Public Review

Public review by Ish Kumar Jain, RPI, Troy, NY

BeamMix is a novel neural network-based framework for wireless channel modeling. It can be utilized for applications such as wireless channel prediction across frequency and space, and multi-antenna beamforming. Traditional NeRF or Gaussian Splatting-based channel modeling suffers from high complexity in scene modeling, requiring the model to understand and interpret complex propagation environments. In contrast, BeamMix attempts to model the wireless channel directly in antenna element-space using sparse Gaussian primitives over the antenna array. This shift from geometry-centric to array-centric modeling is conceptually interesting and provides low-complexity training and fast inference while still achieving accurate beam selection and MU-MIMO performance.

Conceptually, the Gaussian primitives are carefully modeled so that their centers lie on the antenna array, their variances capture the spread across antennas, and each Gaussian is associated with a delay factor to model frequency dependence. This modeling is simple, intuitive, and highly interpretable. The interpretation follows from the intuition that the LOS channel has fewer large Gaussians, whereas the NLOS channel requires a larger number of smaller Gaussians to model its complex channel behavior. The Gaussian primitive model is further enhanced with local and global MLP (multi-layer perceptron) heads, trained via end-to-end backpropagation. Overall, BeamMix training converges in 10 minutes, which is 13x and 31x faster than state-of-the-art Gaussian Splatting-based and NeRF-based baselines.

BeamMix opens up exciting research directions in interpretable wireless channel modeling. It demonstrates that structured Gaussian representations can provide a promising middle ground between purely black-box neural networks and expensive environment-centric radiance field approaches, making it a valuable contribution with strong potential for future follow-on research. First, adaptive primitive allocation mechanisms can dynamically adjust the number of active Gaussian components based on propagation complexity, thereby improving scalability across LOS, NLOS, and dense-scattering environments. Second, extending the framework to mmWave and wideband FDD systems with larger UL–DL frequency gaps remains an important challenge. Third, integrating explicit angle-of-arrival and delay-domain physical constraints into the learned Gaussian representations may further strengthen interpretability and robustness. Finally, broader cross-dataset and cross-hardware validation, including outdoor and mobile deployments, could help establish BeamMix as a generalized foundation for element-space wireless digital twins and AI-native radio systems.

Overall, BeamMix guides the search for physics-guided, interpretable channel representations that generalize across scenes and antenna geometries. While interpretable AI in itself is a big beast, finding new interpretations of various neural implementations in the wireless domain remains an even bigger open problem.

BIONIC: A Co-Designed Hardware and Runtime for Time-Sensitive Battery-Free IoT

Eren Yildiz (Georgia Institute of Technology); Davide Cavedon, Stefano Antonio Putelli (University of Trento); Josiah Hester (Georgia Institute of Technology); Kasim Sinan Yildirim (University of Trento)

Public Review

Public review by Robert LiKamWA, Rice University, Houston, TX

BIONIC addresses a fundamental challenge in battery-free IoT systems: how to support time-sensitive execution when devices operate intermittently and may lose power unpredictably. Existing approaches often rely on dedicated timekeeping circuits or capacitors to estimate outage duration, but these mechanisms can consume or sequester energy that is already scarce. BIONIC takes a different approach by co-designing hardware and runtime support so that capacitor charge dynamics are used not only for time and energy estimation, but also to help power useful computation.

The core insight of the paper is simple and powerful: a pair of capacitors can be used to estimate harvested power and charging behavior while avoiding the energy waste of a purely dedicated timekeeping component. BIONIC uses a main energy-storage capacitor and a smaller sampling capacitor, along with voltage-transition interrupts, to estimate ambient power without relying on expensive ADC-based voltage sampling. The resulting estimates allow the runtime to make better scheduling decisions for time-sensitive tasks, avoiding work that is unlikely to complete successfully under the current energy conditions.
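The general physics behind estimating power from capacitor charging can be sketched as follows. This is a simplified illustration, not BIONIC’s circuit or scheduler: it assumes an ideal capacitor with no leakage or load during the measurement, and the function names, threshold voltages, capacitance, and admission rule are all hypothetical.

```python
def stored_energy_j(capacitance_f: float, voltage_v: float) -> float:
    """Energy stored in an ideal capacitor: E = 1/2 * C * V^2."""
    return 0.5 * capacitance_f * voltage_v ** 2

def estimate_harvest_power_w(capacitance_f: float, v_low: float,
                             v_high: float, charge_time_s: float) -> float:
    """Average input power while charging from v_low to v_high.

    charge_time_s is the interval between the two voltage-threshold
    crossings (for example, two interrupt timestamps). Leakage and load
    current are ignored in this sketch.
    """
    if charge_time_s <= 0:
        raise ValueError("charge_time_s must be positive")
    gained = (stored_energy_j(capacitance_f, v_high)
              - stored_energy_j(capacitance_f, v_low))
    return gained / charge_time_s

def can_finish(task_energy_j: float, deadline_s: float,
               stored_j: float, harvest_w: float) -> bool:
    """Hypothetical admission test: start a task only if stored energy plus
    the energy expected to arrive before the deadline covers its cost."""
    return stored_j + harvest_w * deadline_s >= task_energy_j

# Assumed example: a 10 uF sampling capacitor rises from 1.8 V to 2.4 V in 5 ms.
power_w = estimate_harvest_power_w(10e-6, 1.8, 2.4, 5e-3)
print(f"estimated harvested power: {power_w * 1e3:.2f} mW")
print(can_finish(task_energy_j=200e-6, deadline_s=0.05,
                 stored_j=100e-6, harvest_w=power_w))
```

Timing two threshold crossings needs only a comparator and a timer, which is why an estimate of this kind can avoid repeated ADC sampling.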

Reviewers found the paper to be a strong and well-executed contribution. The hardware design is clever, the runtime integration is natural, and the paper clearly explains how the proposed mechanism improves over existing timekeeping practices in intermittent computing. The evaluation was also viewed positively, combining physical hardware experiments with simulations across realistic harvesting scenarios, including solar and kinetic energy traces. These experiments demonstrate that BIONIC can improve useful task execution while maintaining low overhead.

Overall, the TPC found BIONIC to be a clean, clever, and practically useful contribution to intermittent computing. Its value lies in showing that timekeeping, energy estimation, and runtime scheduling can be co-designed around the same capacitor dynamics, reducing overhead while improving the ability of battery-free devices to execute time-sensitive tasks.

BladderSense: A Wearable Ultrasound System for Continuous Bladder Monitoring in Real-World Use

Kaixin Chen, Usman Saleh Toro (The Hong Kong University of Science and Technology (Guangzhou)); Jinyu Lin (Hong Kong University of Science and Technology (Guangzhou)); Chang Huang (The Hong Kong University of Science and Technology (Guangzhou)); Junfan Xiang, Lu Wang (Shenzhen University); Huachen Cui, Lei Zhu (The Hong Kong University of Science and Technology (Guangzhou)); Kaishun Wu (Hong Kong University of Science and Technology(GZ))

Public Review

Public review by Ahmed Allam, University of Cincinnati, Cincinnati, OH

The promise of wearable ultrasound is to take an imaging modality that has spent decades inside the clinic and embed it into everyday life. Recent advances in flexible transducer arrays have made the hardware plausible, but turning a wearable probe into a reliable continuous monitor of an internal organ remains an open systems challenge. Lower urinary tract dysfunction (LUTD) is a compelling first target for this vision. Patients need timely knowledge of bladder volume to support voiding and rehabilitation, yet existing solutions force a hard tradeoff: clinical scanners deliver rich imaging but cannot be worn during daily activities, while existing wearables offer only crude volume estimates without the underlying imaging that clinicians need to interpret bladder status.

This paper takes an important step toward closing that gap. It introduces BladderSense, a skin-conforming wireless wearable that delivers continuous ultrasound imaging of the bladder during everyday life. The reviewers were impressed by three particularly powerful features of the system. The first is that, to our knowledge, this is the first wearable system to provide continuous ultrasound imaging of the bladder, opening new routes for home diagnosis, telehealth, and patient monitoring by clinicians where existing approaches are either tethered or too low-fidelity to support clinical interpretation. The second is the design of a fixed X-shaped dual orthogonal phased array with a shared geometric anchor, paired with a coordinate-encoded deep learning model that enables automatic bladder tracking despite organ shifts during filling and posture changes, sidestepping the probe alignment problem that has long limited home ultrasound. The third is a complete and well-engineered end-to-end implementation, including beamforming designs that tolerate skin deformation, an envelope-extraction compression scheme that fits raw ultrasound data within a BLE transmission budget, and an evaluation across ten participants performing daily activities showing meaningful accuracy improvements over state-of-the-art wearable approaches.

Fundamentally, BladderSense raises the bar for what wearable imaging can deliver, but it does not yet validate every aspect of its claimed deployment scenario. The study population consisted of healthy young adults, while LUTD predominantly affects older adults and children with different body morphologies, leaving the question of clinical generalization open. Questions also remain regarding cross-device generalization, given the known element-to-element sensitivity variations of flexible piezoelectric arrays. In addition, continuous all-day monitoring is not yet possible because of battery and power-draw limitations.

Continuous wearable imaging of internal organs has long been a goal of the mobile health community. By bringing real-time bladder ultrasound out of the clinic and into daily life, this paper takes an important first step in that direction and opens the door for wearables that capture not only what is happening on our skin but also what is happening beneath it.

BlenDR: Bandwidth-efficient RGB-D Representation and Delivery for Live 3D Video Streaming

Jaehong Kim (Inha University); Joon Ha Kim (University of Texas at Austin); Yunheon Lee, Dongsu Han (KAIST)

Public Review

Public review by Robert LiKamWa, Rice University, Houston, TX

BlenDR addresses the practical problem of bandwidth-efficient live 3D video streaming from RGB-D capture. Existing systems often encode RGB and depth separately or convert RGB-D data into point clouds early in the pipeline, which can waste bandwidth, amplify depth artifacts, and make live delivery expensive. BlenDR instead redesigns RGB-D representation around the behavior of standard video codecs, with the goal of improving live 3D streaming without requiring a specialized codec.

The paper’s main contribution is a codec-aware RGB-D delivery pipeline. BlenDR jointly encodes luminance and depth in a high-priority stream, transmits chrominance at lower resolution, uses compression-robust depth packing, and applies sender-side RGB-guided depth completion to reduce holes before encoding. This is a pragmatic systems contribution: the novelty is an effective reorganization of RGB-D data so that commodity codecs can preserve the information most important for 3D reconstruction.

Reviewers found the paper well-motivated and clearly written. The system identifies concrete shortcomings in prior live 3D streaming approaches and evaluates a complete pipeline across representation, compression, transmission, and reconstruction. The reported improvements in bandwidth efficiency, frame rate, and rate-control stability over prior RGB-D and point-cloud streaming baselines were compelling, and the use of standard codecs strengthened the case for practical deployment. During rebuttal/shepherding, the authors further clarified the release of the full SceneHub4D dataset and provided comparable codec-based results against a close RGB-D baseline.

Overall, the TPC found BlenDR to be a focused and well-executed systems paper. Its value lies in showing that careful codec-aware RGB-D representation can substantially improve live 3D video delivery while remaining compatible with existing video infrastructure, providing a strong foundation for future work.

Bringing Confidential Computing to Android

Mark Kuhne, Supraja Sridhara, Andrin Bertschi, Nicolas Dutly, Fabio Aliberti, Srdjan Capkun, Shweta Shinde (ETH Zurich)

Public Review

Public review by Steve Ko, Simon Fraser University (SFU), Vancouver, BC, Canada

Many security-sensitive applications benefit from isolation mechanisms that protect their code and data from the host system. Android’s Virtualization Framework (AVF) is one such isolation mechanism, which supports the execution of trusted applications using protected virtual machines (pVMs). However, AVF relies on a trusted hypervisor and MMU-based isolation, which provide weaker security guarantees than hardware-based Trusted Execution Environments (TEEs) such as ARM TrustZone and Confidential Compute Architecture (CCA). This leads to the paper’s main question: how can AVF be made compatible with ARM TEEs? The paper sets out to answer this question.

The paper explores four design possibilities for deploying Android and pVMs on ARM’s TEE. In doing so, it systematically analyzes Android Compatibility Definition Document (CDD) requirements and compares them to ARM’s security model. Based on this analysis, the paper proposes Aster, a design that places Android in the normal world and pVMs in the realm world under ARM CCA. The paper argues that this achieves the best tradeoff between security and implementation overhead. The paper further addresses compatibility challenges, such as boot certificate chains and rollback protection.

The reviewers appreciated the technical depth and rigor of the paper, especially regarding the analysis of Android CDD requirements, the comparison of the four possible designs, and the detailed design of the chosen solution. We also appreciated the substantial systems engineering effort that went into implementing the design, which involved modifications to the Android kernel, firmware, and virtualization stack. Overall, the paper provides a comprehensive and well-argued solution to the problem of bringing hardware-enforced isolation to AVF.

Building Audio-Visual Digital Twins with Smartphones

Zitong Lan, Yiwei Tang (University of Pennsylvania); Yuhan Wang (University of Pennsylvania); Haowen Lai, Yiduo Hao, Mingmin Zhao (University of Pennsylvania)

Public Review

Public review by Anonymous shepherd

Digital twins are becoming increasingly important across AR/VR, robotics, smart buildings, architecture, and human-computer interaction by enabling virtual replicas of physical environments for simulation and interaction. However, most existing digital twins are primarily visual and largely overlook acoustics, despite sound being fundamental to spatial realism and environmental understanding. Realistic audio modeling is important not only for immersive AR/VR experiences, but also for applications such as auditorium design, robotic navigation, and intelligent sensing. Capturing acoustic behavior in real environments remains difficult because room acoustics require dense spatial measurements and expensive hardware setups involving microphone arrays, motion-capture systems, and specialized equipment. In addition, existing neural acoustic field representations are often difficult to modify, limiting the ability to edit materials, layouts, or objects within a digital twin environment.

To address these challenges, the paper presents AV-Twin, a practical system that constructs editable audio-visual digital twins using only commodity smartphones. The system combines mobile room impulse response (RIR) capture with visual scene reconstruction to jointly model geometry and acoustics. AV-Twin introduces a smartphone-only acoustic capture pipeline with sub-millisecond synchronization and trajectory-based data collection, allowing users to naturally walk through an environment while collecting measurements. The work further proposes a visual-assisted acoustic volume rendering framework that leverages scene geometry to improve acoustic rendering efficiency and data utilization. In addition, the system estimates per-surface acoustic material properties through differentiable acoustic rendering, enabling users to modify materials, furniture, or layouts while automatically updating both visual and acoustic behavior. The paper demonstrates substantial improvements in capture efficiency, rendering speed, and modifiable scene reconstruction through extensive evaluations and a compelling mobile demo.

The technical program committee viewed the paper as a strong and engaging contribution toward practical audio-visual digital twins and appreciated its end-to-end smartphone-first system design. Reviewers highlighted the integration of visual and acoustic sensing, the efficient dynamic-trajectory capture pipeline, and the practical demonstration of immersive audio reconstruction and scene editing. The work was recognized as an important step toward accessible acoustic capture for AR/VR and spatial computing applications. At the same time, reviewers noted several areas that would benefit from expanded evaluation. These included stronger comparisons with prior implicit acoustic field approaches, deeper analysis of potential audio artifacts, broader experiments in more complex real-world environments, and clearer discussion of assumptions such as simplified reflection models and motion constraints. Overall, the committee viewed the work as technically compelling while encouraging broader validation and discussion in future revisions.

DeformRF: Data-driven Beamforming and Direction Finding with Deformable Antenna Arrays

Xingda Chen, Mohammad Mehdi Rastikerdar, Ankur Aditya, Deepak Ganesan (University of Massachusetts Amherst)

Public Review

Public review by Akshay Gadre, University of Washington, Seattle, WA

This paper presents a new data-driven beamforming and direction-finding solution using flexible deformable antenna arrays on commonplace fabric. The key challenge with flexible fabric-based antenna arrays is the complexity of modeling their electromagnetic response while operating together, as well as the change in frequency response of the individual antennas. While there exist several prior tools, such as ANSYS HFSS, that model these behaviors accurately given the mesh and antenna design, these solutions remain prohibitively slow for real-time calibration of antenna arrays. This work instead takes a data-driven approach. The secret sauce of DeformRF lies in the curation and development of a 260,000-sample dataset mapping deformations of antenna arrays to EM characteristics, and the development of a physics-inspired ML model that predicts them at a fraction of the latency of purely equation-driven synthesis approaches. These are augmented in real time by a smartphone-based deformation sensing application that reconstructs the mesh and, in turn, the EM characteristics of the deformed array.

The system was implemented and evaluated using a 4 × 4 antenna array on a deformable mat which allows deformation along one axis. The authors test it in various indoor LOS and NLOS settings, measuring the reduction in beamforming capability relative to ideal arrays and the accuracy of tracking transmitters. The evaluation for 2-D deformation of the mat demonstrates results within 1 dB of optimal beamforming and achieves a 5° angle-of-arrival accuracy in tracking signals through buildings.

This paper presents an interesting take on the very practical problem of predicting the behavior of antenna arrays on flexible fabrics: “Can we circumvent the rigorous Maxwell-equation-based simulations with data-driven techniques to achieve significantly lower prediction latency at the cost of a bit of extra error?” The solution clearly demonstrates the feasibility of the approach and solution design at 150 MHz and is a critical evolution in wireless technology to enable flexible mobile devices and fabric-based antennas in the future. Perhaps the most significant contribution of this work is the rigorously curated dataset made available by the authors to enable other researchers to further build on their solution.

Potential Future Exploration Avenues: The work currently only explores 2D deformations of a mat, which allows the antennas’ center frequency to remain tuned to the designed frequency. The evaluation needs to evolve significantly for general-purpose fabrics like clothes and window shades. Another constraint, enforced by the sensitivity of today’s phone cameras and LIDARs, is the lack of 3D modeling resolution to accurately predict deformations smaller than a few cm. This restricted the authors’ evaluation to low frequencies, where the antennas are much larger. Thus, new approaches for modeling, emulating, and evaluating the much smaller antenna arrays used at practical frequency regimes in cellular (FR1, FR3) and WiFi settings (2.4 GHz, 5 GHz and 6 GHz) are critical to generalize the solution. Finally, the case study evaluation for tracking, while demonstrating feasibility, needs a much larger sample size in stringent multipath scenarios, which are typical for a localization system.

Diversified Neural Networks: Defeating Adversarial Attacks via Numerous Orthogonal Variants

Ying Meng, Qiang Zeng (George Mason University)

Public Review

Public review by Shubham Jain, Stony Brook University, Stony Brook, NY

The paper introduces DiverseASR, a defense system against audio adversarial examples (AEs) for automatic speech recognition (ASR) systems. The key idea is to selectively modify the model parameters that are important for AEs but unimportant for benign audio samples. The system then generates a series of ASR variants so that AEs crafted on one instance will not transfer to another. To realize DiverseASR, the authors explore the models’ asymmetric impact, which refers to the performance degradation on AEs compared to benign samples, and select and modify the weights that have a high impact on degrading the AEs. DiverseASR then identifies safe modification ranges and generates numerous ASR variants. Evaluations on Wav2Vec2 and DeepSpeech under digital and physical attack scenarios show that DiverseASR effectively defends against AEs with less than 0.03% false detection, outperforming existing defense methods.

The reviewers found the work to be insightful and interesting. The paper tackles an important and timely problem of adversarial attacks on ASR systems, and moves away from conventional detection-based defenses toward reducing the reliability of adversarial example generation itself. Rather than proposing a single hardened model, the work introduces model diversification as a defense strategy, which is a conceptually clean and defensible shift from prior approaches. The observation that weight importance differs between benign and adversarial examples is a useful insight and forms a foundation for the proposed method. The evaluation is comprehensive, and considers strong threat models, including adaptive attackers aware of the defense and capable of attempting transferable attacks, higher-perturbation strategies, and variant generation, which strengthens the empirical case.

Overall, DiverseASR is a promising technique and the first effective defense against audio adversarial examples with zero runtime overhead. Its diversification process generates a diverse set of ASR model variants that preserve recognition accuracy while improving robustness. Future directions include building defenses in other domains, mitigating algorithmic unfairness through targeted weight adjustments, and addressing voice-spoofing attacks.

FBLayout: Optimizing Memory Layout for Efficient LLM Finetuning on Mobile GPUs

Kahou Tam (University of Macau); Wei Niu (University of Georgia); Yu Bao (University of Macau); Xiaomin Ouyang (Hong Kong University of Science and Technology); ChengZhong Xu, Li Li (University of Macau)

Public Review

Public review by JeongGil Ko, Yonsei University, Seoul, Korea

As large language models (LLMs) become increasingly integrated into everyday mobile applications, there is a growing need to personalize these models directly on user devices. On-device fine-tuning offers a promising path toward privacy-preserving and adaptive AI, enabling models to continuously learn from local data without relying on cloud infrastructure. However, despite advances in mobile inference, efficiently training LLMs on resource-constrained devices remains a major challenge.

This paper identifies a fundamental bottleneck in on-device LLM fine-tuning: the mismatch between how data is accessed during the forward and backward passes of training. While modern mobile GPUs are optimized for specific memory access patterns, transformer training inherently requires tensors to be reused in different ways across forward and backward computations. These conflicting access patterns lead to frequent data layout transformations, poor cache utilization, and ultimately significant performance degradation. Existing frameworks either reuse a single layout, sacrificing efficiency in one phase, or rely on costly data transformations that introduce additional overhead.

To address this challenge, the paper introduces FBLayout, a system that rethinks how tensor data is organized for mobile GPU architectures. The key idea is to design a unified, hardware-aware memory layout that can efficiently support both forward and backward operations without requiring frequent transformations. FBLayout proposes a novel “R-Tile” layout tailored to the 2.5D texture memory architecture of mobile GPUs, enabling efficient multi-dimensional data access while preserving locality. In addition, FBLayout replaces expensive physical data movement operations with lightweight index transformations and employs a global layout optimization strategy that propagates efficient layouts across the entire computation graph.
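
The idea of replacing physical data movement with an index transformation can be illustrated with a minimal sketch. The example below is not FBLayout's R-Tile layout or its actual implementation; `TransposedView` is a hypothetical helper that exposes the transpose of a row-major matrix by remapping indices at access time, so no element is ever copied.

```python
from typing import Sequence, Tuple


class TransposedView:
    """Read-only transposed view of a row-major matrix (hypothetical helper).

    The underlying buffer is never rearranged; only the index arithmetic changes.
    """

    def __init__(self, data: Sequence[float], rows: int, cols: int) -> None:
        if len(data) != rows * cols:
            raise ValueError("buffer size does not match rows * cols")
        self.data = data
        self.rows = rows  # rows of the original matrix
        self.cols = cols  # columns of the original matrix

    @property
    def shape(self) -> Tuple[int, int]:
        # The transpose swaps the two dimensions.
        return (self.cols, self.rows)

    def __getitem__(self, index: Tuple[int, int]) -> float:
        i, j = index
        if not (0 <= i < self.cols and 0 <= j < self.rows):
            raise IndexError("index out of range")
        # Logical element (i, j) of the transpose is physical element (j, i).
        return self.data[j * self.cols + i]


if __name__ == "__main__":
    # Original 2 x 3 matrix [[1, 2, 3], [4, 5, 6]] stored row-major.
    view = TransposedView([1, 2, 3, 4, 5, 6], rows=2, cols=3)
    print(view.shape, view[2, 1])
```

The trade-off in this sketch is a small amount of extra index arithmetic per access in exchange for avoiding a full copy of the tensor; how that balance plays out on texture memory is specific to the paper's design.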

This approach reflects a careful co-design of software and hardware. Rather than adapting existing training pipelines to mobile constraints, FBLayout reshapes the underlying data representation to better match the characteristics of mobile GPUs. The system is evaluated across multiple transformer models and mobile devices, demonstrating improvements in training performance. Compared to state-of-the-art mobile deep learning frameworks, FBLayout achieves 2.2×-5.7× speedup while improving cache efficiency and reducing memory pressure, making on-device fine-tuning significantly more practical.

Beyond performance gains, the work highlights an important shift in perspective. While much of the prior effort in mobile AI has focused on optimizing inference, this paper addresses the more demanding problem of training, which introduces additional complexity due to backward computation and intermediate data dependencies. By targeting this gap, FBLayout expands the feasibility of fully on-device learning systems that can adapt in real time to user behavior and environmental changes.

At the same time, several open questions remain. The design of FBLayout is closely tied to the characteristics of current mobile GPU architectures, raising questions about how well the approach generalizes to future hardware or heterogeneous systems that include NPUs. The system also assumes relatively stable computation graphs and relies on offline optimization decisions, which may need further adaptation in dynamic or highly variable workloads. Additionally, while the work focuses on training performance, understanding the interaction between fine-tuning and concurrent inference workloads on mobile devices will be important for practical deployment.

Overall, this paper makes a strong case that efficient on-device LLM fine-tuning is not only possible but can be significantly accelerated through careful system and memory layout design. By addressing a fundamental mismatch between training workloads and mobile hardware, FBLayout takes an important step toward enabling truly personalized, privacy-preserving AI experiences directly on user devices.

Fine-grained Soundscape Control for Augmented Hearing (Best Artifact Award)

Seunghyun Oh, Malek Itani, Aseem Gauri, Shyamnath Gollakota (University of Washington)

Public Review

Public review by Mingmin Zhao, University of Pennsylvania, Philadelphia, PA

Hearables are becoming ubiquitous, yet their sound controls remain coarse: users can either enable global noise suppression or focus on a single target sound. Real-world acoustic scenes contain many simultaneous sources that users may want to adjust independently. A listener may want to tune in to a nearby conversation, amplify a safety cue like an approaching vehicle, dampen distracting chatter, or enjoy the ambient sound. Enabling fine-grained, per-class soundscape control on a wearable-class device requires low-latency multi-class sound separation, hardware-aware neural design, and a user-facing interface that adapts to dynamic acoustic environments.

This paper presents Aurchestra, the first system to provide fine-grained, real-time soundscape control on resource-constrained hearables. Aurchestra has two key components. The first is an on-device multi-output extraction network conditioned on a multi-hot encoding of user-selected classes through FiLM layers. Rather than producing one stream per trained class, it generates a small fixed set of streams and dynamically maps selected classes to them, letting users mix their environment by adjusting per-class volumes like an audio engineer mixing tracks. The second is a dynamic interface driven by a fine-tuned Audio Spectrogram Transformer that surfaces only the sound classes currently active, so users see a short, context-aware menu rather than a long static list. The model is optimized for three hardware platforms (Orange Pi, Raspberry Pi, and the GreenWaves GAP9 AI accelerator), processing 6 ms streaming audio chunks in real time on all of them. The system is evaluated through model benchmarks, in-the-wild recordings, and user studies.
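
For readers unfamiliar with FiLM conditioning, the following minimal sketch shows the general mechanism: a multi-hot vector of selected classes is mapped to a per-channel scale (gamma) and shift (beta), which then modulate the feature channels. This is a generic illustration, not Aurchestra's network; the function names, the tiny dimensions, and the hand-picked weights are all assumptions.

```python
from typing import List, Sequence


def multi_hot(selected: Sequence[int], num_classes: int) -> List[float]:
    """Encode a set of selected class indices as a multi-hot vector."""
    vec = [0.0] * num_classes
    for c in selected:
        vec[c] = 1.0
    return vec


def linear(weights: Sequence[Sequence[float]], bias: Sequence[float],
           x: Sequence[float]) -> List[float]:
    """Plain affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Sequence[Sequence[float]], cond: Sequence[float],
         w_gamma, b_gamma, w_beta, b_beta) -> List[List[float]]:
    """Apply FiLM: each channel is scaled by gamma(cond) and shifted by beta(cond)."""
    gamma = linear(w_gamma, b_gamma, cond)
    beta = linear(w_beta, b_beta, cond)
    return [[g * f + b for f in channel]
            for channel, g, b in zip(features, gamma, beta)]


if __name__ == "__main__":
    cond = multi_hot([0, 2], num_classes=3)   # user selected classes 0 and 2
    feats = [[1.0, 2.0], [3.0, 4.0]]          # 2 channels x 2 time steps
    # Hypothetical weights; a real model would learn these.
    out = film(feats, cond,
               w_gamma=[[1, 0, 0], [0, 1, 0]], b_gamma=[1, 1],
               w_beta=[[0, 0, 1], [0, 0, 0]], b_beta=[0, 0])
    print(out)
```

Because the conditioning vector only changes gamma and beta, the same network weights can serve any combination of selected classes.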

Reviewers appreciated the careful, hardware-aware system design behind this work. The multi-output extraction model conditioned on multi-hot class encoding is a technically sound approach to per-class sound manipulation. Beyond model accuracy, the system considers hardware constraints, streaming operation, and user interaction, and demonstrates real-time inference on multiple compute-limited devices. The evaluation is thorough, covering accuracy, latency, power consumption, and user studies, with in-the-wild data collection across diverse indoor and outdoor environments.

A few open questions remain for future work. The current prototype operates on a fixed set of 20 sound classes, which is sufficient to demonstrate feasibility but cannot yet handle novel sounds in the wild, nor distinguish between individuals within the same class such as multiple speakers. The authors’ prior work on speaker-level targeting is a natural complement, and integrating the two systems is a promising direction. Scaling to a larger class set or to open-set sound categories is another. Examining how classification errors affect user experience, alongside longer-term battery behavior on real hearables, would also be valuable.

Aurchestra is a well-engineered step toward fine-grained soundscape control on everyday hearables. Its core contributions, multi-output extraction conditioned on user-selected classes, hardware-tailored model design, and a dynamic context-aware interface, should remain useful as on-device neural audio capabilities mature. We look forward to seeing this line of work grow toward a new generation of intelligent hearables that enable richer, more personalized listening experiences.

Fragile Deliveries: Inconsistencies in Android Parcel and Their Security Consequences

Hongkai Chen (Arizona State University); Chao Wang (The Ohio State University); Yuqing Yang (CISPA Helmholtz Center for Information Security); Jennifer Miller, Tiffany Bao, Ruoyu Wang, Adam Doupé (Arizona State University); Zhiqiang Lin (The Ohio State University); Yan Shoshitaishvili (Arizona State University)

Public Review

Public review by Zhuolin Yang, University of Arizona, Tucson, AZ

A key component of Android is the Parcel inter-process communication (IPC) system. It allows developers to implement custom IPC data transfer protocols. However, due to insufficient security considerations in the Parcel mechanism design, incorrect implementation of Parcelable classes can lead to security vulnerabilities. In recent years, a great number of Parcel-related vulnerabilities have been reported. Nevertheless, little research has been done to systematically understand these vulnerabilities and measure the prevalence of these issues in the Android ecosystem.

This paper analyzes the security issues in Android’s Parcel mechanism. Compared to prior work, it examines the root causes of Parcel vulnerabilities, which helps Android developers debug and patch the problems. The authors also conducted the first large-scale study of these issues, covering 324 Android firmware images and 10,161 Android apps. The study confirms the prevalence of Parcel security issues in existing Android devices and apps.

The reviewers appreciated the deep explanation of Parcel vulnerabilities in Android, and the comprehensive evaluation of the prevalence of Parcel security issues in the existing Android ecosystem. Overall, the reviewers did not raise concerns about the contribution of the work, but noted that the writing clarity could be further improved, for example by clarifying the attacker’s motivation, the threat model, and the defenses against the proposed attacks.

Through the shepherding process, the authors successfully adopted reviewers’ suggestions by clarifying the motivation, refining the threat model, expanding the discussion on preliminaries to help readers understand the attacks, and adding discussions on attack impacts and defenses.

GeoTwins: Uncovering Hidden Geographic Disparities in Android Apps

Marco Alecci, Pedro Jesús Ruiz Jiménez, Jordan Samhi, Tegawende F. Bissyande, Jacques Klein (SnT, University of Luxembourg)

Public Review

Public review by Hanting Ye, Duke University, Durham, NC

Over the years, we have installed mobile apps across different regions without paying much attention to whether the “same” app may actually differ depending on where it is distributed. At first glance, such differences may seem limited to language or market localization. But in practice, geographic variation can also affect permissions, third-party libraries, and other implementation details. Studying these differences can help us better understand how developer practices, cultural preferences, and regulatory environments shape the mobile ecosystem, while also exposing potential blind spots in analyses that rely on apps collected from only a single region.

This paper takes an important and labor-intensive step in that direction. Over the course of a year, the authors collected a large set of Android APKs from seven geographically diverse Google Play regions and used them to study regional divergence at scale. A particularly interesting aspect of the paper is its GeoTwins perspective: apps with similar branding and functionality that are distributed as regional variants. Through this lens, the paper shows that apps that appear nearly identical can still differ in subtle but meaningful ways across regions.

The reviewers appreciated both the scale of the data collection effort and the care taken in the analysis. They found the dataset itself valuable and viewed the work as opening up an underexplored measurement space. At the same time, they also noted that the paper is fundamentally a measurement study, and that its broader implications should be presented in a more measured way and tied more directly to the empirical findings. In particular, the discussion would be strongest when grounded in concrete explanations for why such regional differences arise and what risks they may create in practice.

Overall, this paper makes a solid first step toward a more geographically aware understanding of mobile apps and their distribution. While many questions remain about the causes and consequences of these cross-region differences, the paper provides both a useful dataset and an important perspective for future work on app analysis, reproducibility, and mobile security.

GlucOS: Security, correctness, and simplicity for automated insulin delivery

Hari Venugopalan, Shreyas Madhav Ambattur Vijayanand, Caleb Stanford, Stephanie Crossen, Samuel T. King (UC Davis)

Public Review

Public review by Matthai Philipose, Microsoft, Seattle, WA

How can you ensure that an automated insulin delivery system does not deliver too much insulin while still allowing sophisticated (e.g., AI-based) dosing? The question of how to harness the deep reasoning abilities of AI-based interfaces and control while mitigating their inherent uncertainties is rapidly gaining importance in engineering systems. An implanted medication delivery system, as discussed in this paper, is a high-stakes version of this setting. Challenges include potentially adversarial failures in the AI subsystem, implementation errors in the core software infrastructure, and strong user preferences and possible user error.

The paper addresses these challenges through a carefully designed multi-layered approach: GlucOS uses a separation architecture that isolates components so that any dosing algorithm (including ML) can be used while a formally verified safety module (just 25 lines of Swift) moderates insulin doses, mitigating adversarial or erroneous AI outputs; it applies formal methods to prove correctness of the critical insulin delivery path, directly addressing implementation bugs; and it incorporates a human-centric defensive strategy that gives users agency and clear information to take corrective actions, accounting for user preferences and possible user error. A notable aspect of the paper is that the resulting system was tested on human participants (with positive results reported), indicating an unusual state of technology readiness.
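
The separation idea, in which an untrusted algorithm proposes a value and a small trusted module bounds it, can be sketched generically. The example below is not the GlucOS safety module (which the paper implements in Swift and formally verifies); `moderate_dose`, its bounds, and the choice to fall back to the lower bound on invalid input are all assumptions made for illustration, and the values are unitless.

```python
import math


def moderate_dose(proposed: float, lower: float, upper: float) -> float:
    """Bound an untrusted proposed value to the trusted range [lower, upper].

    Hypothetical sketch: non-finite proposals (NaN, +/-inf) are treated as
    invalid and replaced by the lower bound, the conservative choice assumed here.
    """
    if lower > upper:
        raise ValueError("lower bound must not exceed upper bound")
    if not math.isfinite(proposed):
        return lower
    return min(max(proposed, lower), upper)


if __name__ == "__main__":
    # Unitless illustrative values; real limits would come from clinical policy.
    for proposal in (1.5, 5.0, -1.0, float("nan")):
        print(proposal, "->", moderate_dose(proposal, lower=0.0, upper=2.0))
```

Keeping the trusted path this small is what makes formal verification tractable: the dosing algorithm can be arbitrarily complex, but only the bounding logic has to be proven correct.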

The reviewers were impressed by the importance of the problem being addressed, the high-level architectural approach, the realism of the validation, and the potential for high impact. They appreciated the paper as an example of careful, end-to-end system design, implementation, and validation.

That said, reviewers noted limitations worth acknowledging. Most notably, the full end-to-end system (including automated insulin delivery) was deployed to only a single individual (“Bob”) for 9 months; the six additional participants used only the predictive alerting component. Bob was self-selected, highly motivated, and technically sophisticated, so it remains unclear how well the system would serve less engaged or less technically adept users. Additionally, Bob’s case study was deemed “self evaluation” and exempt from full IRB review, which is atypical for a safety-critical medical device deployment and may limit the weight one can assign to the clinical findings.

The paper is an informative early step in understanding how to build reliable AI-based systems.

Hide-and-Sweep: Detecting Concealed Cameras via LED Illumination Sweeps

Jonghyuk Yun, Jaeyoung Moon, Yunseo Park (KAIST); Sean Rui Xiang Tan (National University of Singapore); Byunghyun Kim (KAIST); Rajesh Krishna Balan (Singapore Management University); Jun Han (KAIST)

Public Review

Public review by Hanting Ye, Duke University, Durham, NC

Hidden camera detection has long been a socially important problem. More than two decades ago, researchers had already realized that cameras could be detected by exploiting the reflective properties of the miniature lens assemblies required for image formation. Since then, a wide range of hidden-camera detection techniques have emerged, leveraging different modalities such as wireless signals, thermal signals, and leaked electromagnetic emissions. Yet a fundamental limitation of many of these approaches is that they require the camera to be active. In contrast, optical detection remains a more robust direction because it can operate regardless of whether the hidden camera is turned on, unless the deployed device does not rely on image formation at all.

Recent work has revisited optical hidden-camera detection by using the built-in infrared illumination and infrared camera in time-of-flight sensors to detect retro-reflection from camera optics. However, as this paper insightfully identifies, the co-located coupling between the light source and the camera in such hardware severely limits the ability to distinguish the optical signatures of real hidden cameras from those of other reflective objects. Fundamentally, this paper explores the benefits of decoupling the camera from the illumination source in order to obtain richer optical reflection patterns that cannot be observed by prior time-of-flight-based systems. To this end, the authors present a thoughtful low-cost smartphone accessory that uses a controllable LED array to generate desirable scanning patterns and improve hidden-camera detection accuracy.

The reviewers appreciated the paper’s deep and careful exploration of the decoupled camera–illumination design, as well as its end-to-end system implementation and evaluation. At the same time, the reviewers also noted the complexity of optical reflections in real-world environments. In practice, objects used to conceal hidden cameras may have diverse geometric structures and surface materials. As the authors observe in their experiments, even a diffusive surface placed in front of a camera can weaken the reflected optical signal and reduce the detection rate, and the camera cannot form an image in this case either.

More broadly, this paper highlights the importance of building a comprehensive benchmark that can capture the diversity of real-world hidden camera deployments and provide a meaningful basis for evaluating different detection tools. Building such a benchmark will not be easy, but this work offers an encouraging step in that direction. A community effort to develop a public, standardized dataset of realistic hidden-camera deployment scenarios would significantly strengthen future research and help us better understand the true capabilities and limitations of these systems.

Ultimately, improving hidden-camera detection rates remains a shared goal. No one wants to discover that there was a camera hidden in their Airbnb room.

HoloMobile: Photorealistic Avatar on Mobile via Compact Dynamic Gaussians from a Monocular Video

Yuting He (The Chinese University of Hong Kong); Yihua Huang, Xiaojuan Qi (The University of Hong Kong); Zhenyu Yan, Guoliang Xing (The Chinese University of Hong Kong)

Public Review

Public review by Robert LiKamWa, Rice University, Houston, TX

HoloMobile addresses an important systems problem at the intersection of mobile computing, immersive communication, and neural scene representation: how to make high-fidelity dynamic human avatars practical on mobile devices. Recent advances in 3D Gaussian Splatting and human avatar reconstruction have shown impressive visual quality, but these representations are often too heavy for mobile deployment, requiring substantial storage, bandwidth, preprocessing, or rendering resources. For mobile AR/VR, telepresence, training, and other immersive applications, the challenge is not only to reconstruct a realistic avatar, but to support the full pipeline of capture, compression, transmission, and rendering under mobile constraints.

The paper presents HoloMobile, an end-to-end system for generating and rendering photorealistic dynamic human avatars from monocular smartphone video. The system combines a hybrid explicit-implicit avatar reconstruction approach with adaptive polynomial compression of dynamic Gaussian sequences and a browser-based rendering pipeline optimized for mobile devices. The paper attempts to systematize the complete workflow, from capture through mobile playback.

Reviewers found several aspects of the work compelling. First, the paper targets a timely and technically challenging problem: enabling dynamic avatar representations to operate within the bandwidth, compute, and deployment constraints of mobile systems. Second, the system integrates multiple techniques into a coherent pipeline, including an explicit body scaffold, implicit refinement for higher-fidelity motion and appearance, compression of Gaussian trajectories, and mobile-oriented rendering. Third, the evaluation demonstrates promising results in visual quality, compression efficiency, and rendering performance on mobile devices. The reported ability to support high-frame-rate rendering with compact representations was central to the paper’s contribution.

At the same time, the paper prompted substantial discussion among the reviewers and TPC. A major point of discussion was the motivation for photorealistic avatars. The reviewers encouraged the authors to more clearly explain why photorealistic dynamic avatars are needed beyond conventional 2D live video or simpler symbolic avatars. A stronger motivation is that volumetric, background-free, view-controllable human representations can be naturally embedded into AR/VR scenes, support arbitrary-viewpoint viewing, and enable interactive immersive applications in ways that 2D video cannot. However, the reviewers also noted that photorealistic human representations raise responsible-use concerns, including impersonation and synthetic-content risks, and the paper needed to acknowledge these concerns more directly, which the authors did in rebuttal and during shepherding.

Reviewers also raised several technical and evaluation concerns. These included the limited size and diversity of the original dataset, and the need for clearer comparison against state-of-the-art avatar and dynamic-scene methods. Importantly, the narrative needed to justify the architectural choice of transmitting compressed dynamic Gaussian sequences rather than transmitting a canonical avatar with streamed pose parameters. The authors addressed these points in rebuttal and during shepherding by clarifying the intended workflow, expanding discussion of mobile deployment assumptions, explaining comparison choices, adding user-study details, and strengthening the motivation and limitations discussion.

Overall, the TPC found that the authors of HoloMobile present a technically sound and timely system that advances mobile-friendly capture, compression, transmission, and rendering of high-fidelity dynamic avatars. The work opens broader questions around the best representation architecture for mobile avatar streaming, as well as photorealistic avatar motivation and responsible use. This paper provides a concrete end-to-end system and useful empirical results around which the mobile systems community can continue to build.

Insecurity of Lost/Stolen Phone Reporting Services: Vulnerabilities, Attacks, and Countermeasures (Best Paper Award)

Min-Yue Chen (Michigan State University); Yiwen Hu (University of Maryland, Baltimore County); Yu-An Chen (Michigan State University); Chi-Yu Li (National Yang Ming Chiao Tung University); Tian Xie (Utah State University); Guan-Hua Tu (Michigan State University)

Public Review

Public review by Alec Wolman, Microsoft, Redmond, WA

Today, when a user’s mobile phone is lost or stolen, mobile operators provide a service that allows the user to report the device’s IMEI (a unique device identifier). Once reported, the device is added to a blacklist database that is shared among operators, preventing the stolen phone from being used on public networks. This mechanism, standardized by 3GPP as the Central Equipment Identity Register (CEIR), is designed to deter mobile phone theft by making stolen devices unusable across participating networks.

This paper investigates the susceptibility of lost or stolen phone reporting services to abuse. These services are vulnerable due to the difficulty of correctly identifying device ownership, which can enable denial-of-service attacks. The paper examines in detail the practices and mechanisms used by the three major U.S. mobile operators and the MVNOs associated with them. The findings reveal significant and unexpected security risks. The paper identifies six new security vulnerabilities that span devices, operators, and cross-carrier domains.

By combining these vulnerabilities, the paper demonstrates two practical end-to-end attacks. The first is a targeted attack that incapacitates home security systems and prevents both homeowners and security service centers from receiving critical alarms, while the second is a broad denial-of-service attack against new flagship smartphones at product launch.

Overall, the reviewers appreciated this paper as novel, well-executed, and highly relevant. They highlighted the paper’s strong empirical analysis of IMEI-based blacklisting systems and its demonstration of impactful real-world attacks. They particularly valued the identification of multiple vulnerabilities across device, carrier, and cross-carrier domains, as well as the practical attack demonstrations. Some reviewers were concerned about the practicality and cost-benefit trade-offs of mounting such attacks, including the challenges of combining multiple vulnerabilities and the duration of the resulting disruptions.

Looking forward, the paper proposes four recommended changes: 1) extending the 3GPP standards for device conformance testing; 2) upgrading operator reporting services to use third-party ID verification providers; 3) upgrading online reporting services to use multi-dimensional validation; and 4) strengthening the cross-operator practices and security metadata for the CEIR. This paper will raise awareness of the existing vulnerabilities in operator reporting services, and encourage action by both standards bodies and mobile operators in addressing these vulnerabilities.

KVSwap: Disk-aware KV Cache Offloading for Long-Context On-device Inference

Huawei Zhang, Chunwei Xia, Zheng Wang (University of Leeds)

Public Review

Public review by Juheon Yi, Microsoft, Beijing, China

As Large Language Models (LLMs) increasingly enable on-device applications like meeting summarization and document analysis, the linear growth of the Key-Value (KV) cache with longer contexts hits a severe “memory capacity wall” on resource-constrained edge devices. Unlike server-grade hardware, mobile/embedded systems typically utilize a unified memory architecture shared by the CPU and GPU, meaning traditional GPU-to-CPU KV offloading techniques fail to yield actual system memory savings. Consequently, offloading the KV cache to non-volatile storage (disk) becomes the only viable option. However, this introduces a new bottleneck due to the drastically lower (e.g., ≈100×) I/O bandwidth of mobile disks compared to RAM.
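
To make the linear growth concrete, the KV cache size can be estimated as 2 (keys and values) × layers × KV heads × head dimension × bytes per element × context length. The short sketch below does this arithmetic; the model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values) are illustrative assumptions, not figures taken from the paper.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   bytes_per_elem: int, num_tokens: int) -> int:
    """Estimate KV cache size: keys and values for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * num_tokens


# Hypothetical model shape (an assumption for illustration only).
ASSUMED = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

if __name__ == "__main__":
    for tokens in (4_096, 32_768, 131_072):
        gib = kv_cache_bytes(num_tokens=tokens, **ASSUMED) / 2**30
        print(f"{tokens:>7} tokens -> {gib:.1f} GiB")
```

Under these assumed dimensions the cache costs 128 KiB per token, so a 32,768-token context alone would need 4 GiB, which is why the cache quickly competes with the model weights for the shared memory on an edge device.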

This paper presents KVSwap, a novel framework for memory-efficient long-context LLM inference on resource-constrained mobile/embedded devices. KVSwap offloads the full KV cache to the disk while keeping a highly compressed low-rank copy of the K cache in RAM to accurately predict and prefetch only the most critical tokens for the upcoming layer. By orchestrating memory read patterns to fetch these entries in contiguous groups for hardware-friendly disk access, and by seamlessly overlapping this asynchronous I/O with active computation, KVSwap effectively hides the latency penalties of disk access, delivering high throughput even under tight memory budgets.
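
The prediction and grouping steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not KVSwap's algorithm: the function names are hypothetical, approximate attention scores are taken as a dot product between a low-rank query and low-rank keys held in RAM, and a group is ranked by its maximum token score (an assumption; other aggregations are possible).

```python
from typing import List, Sequence, Tuple


def approx_scores(query_lr: Sequence[float],
                  keys_lr: Sequence[Sequence[float]]) -> List[float]:
    """Approximate per-token attention scores from low-rank keys kept in RAM."""
    return [sum(q * k for q, k in zip(query_lr, key)) for key in keys_lr]


def select_groups(scores: Sequence[float], group_size: int,
                  num_groups: int) -> List[int]:
    """Pick the highest-scoring contiguous token groups (ranked by max score)."""
    total = (len(scores) + group_size - 1) // group_size
    group_scores = [max(scores[g * group_size:(g + 1) * group_size])
                    for g in range(total)]
    top = sorted(range(total), key=lambda g: group_scores[g],
                 reverse=True)[:num_groups]
    return sorted(top)  # ascending order keeps disk reads sequential


def group_ranges(groups: Sequence[int], group_size: int,
                 num_tokens: int) -> List[Tuple[int, int]]:
    """Convert group ids to half-open token ranges to read from disk."""
    return [(g * group_size, min((g + 1) * group_size, num_tokens))
            for g in groups]


if __name__ == "__main__":
    keys = [[0.1, 0.0], [0.9, 0.0], [0.2, 0.0], [0.0, 0.0],
            [0.3, 0.0], [0.1, 0.0], [0.8, 0.0], [0.7, 0.0]]
    scores = approx_scores([1.0, 0.0], keys)
    chosen = select_groups(scores, group_size=2, num_groups=2)
    print(group_ranges(chosen, group_size=2, num_tokens=len(keys)))
```

Reading whole groups rather than individual tokens trades some wasted bandwidth on low-scoring neighbors for fewer, larger, contiguous reads, which is the access pattern that storage devices generally handle best.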

The reviewers agreed that the paper tackles a timely and important problem of on-device long-context LLM inference, and appreciated the solid system design which leads to notable throughput gains compared to state-of-the-art solutions (mostly designed for server GPUs). The reviewers also left some suggestions for further improvement, especially regarding the system’s generality (some of which were partially addressed during the shepherding process). For instance, since the evaluation was conducted solely on the NVIDIA Jetson platform, it would be beneficial to evaluate the system on a wider variety of mobile devices (such as smartphones) with potentially larger memory budgets. Furthermore, in real-world mobile usage scenarios, memory bandwidth is often heavily contended due to various concurrent background tasks, making it important to consider dynamic memory bandwidth. It would also be helpful to improve the robustness of the proposed low-rank adapter matrix training against domain shifts.

Overall, KVSwap makes a meaningful systems contribution to the community. The paper effectively identifies the critical memory challenges during long-context LLM inference stemming from the unique memory architecture of mobile/embedded devices, and proposes a solid, practically deployable solution. This work lays a strong foundation for exciting future research, paving the way for scaling to 1M-token contexts and expanding to multi-modal applications such as video understanding.

Listen Over the Air: Towards Long-Range Low-Power Backscatter Downlink

YiJie Chen, Shuai Tong, Jiliang Wang (Tsinghua University)

Public Review

Public review by Renjie Zhao, Johns Hopkins University, Baltimore, MD

In recent years, ultra-low-power (ULP) communication has drawn significant attention as a path toward long-standing visions of ubiquitous computing and smart dust. While backscatter has enabled impressive progress on the uplink (from a tiny tag to an infrastructure receiver), downlink communication to the tag remains a major bottleneck: tag receivers must either sacrifice operating range to stay within a tiny power budget, or spend substantially more power to achieve the sensitivity needed for long-range reception. Furthermore, even if many IoT workloads are uplink-heavy, downlink remains essential for practical low-power MAC designs (e.g., coordination and control), and continuous listening can dominate a tag’s energy budget. This gap limits how interactive and scalable ultra-low-power IoT systems can become.

This paper addresses that bottleneck with DUET, a system that enables long-range, low-power downlink to backscatter-style tags. Its main contribution is an end-to-end design that (1) leverages commodity LoRa transmitters to keep infrastructure costs low, (2) improves receiver sensitivity on the tag while staying within tens-of-microwatts power budgets through careful intermediate frequency design, and (3) reduces average listening power via a two-stage wake-up and decoding pipeline. The authors build a prototype from commercial components and evaluate it against representative prior approaches, demonstrating meaningful downlink distances at very low power.

The reviewers appreciate the paper’s clear framing of downlink as a key barrier, its cohesive system design, and its prototype-driven evaluation, including comparisons against existing alternatives and strong sensitivity gains. The reviewers also note natural extensions as the work evolves. For example, the current prototype is optimized for communicating with one tag at a time and would benefit from scaling to parallel multi-tag downlink, which likely requires a stronger wake-up signature design and MAC-level coordination. In addition, the design can be affected by certain forms of interference whose amplitude variations resemble the intended downlink; future iterations could strengthen robustness with improved filtering and thresholding, or more distinctive signatures to reduce false wake-ups and decoding errors.

Overall, this paper is a solid step toward making ultra-low-power backscatter systems more practical by addressing one of their most important missing pieces: an efficient, reliable, and long-range downlink.

MemoLens: Empowering Augmented Reality Glasses with Super Memory

Samiul Alam, Shakhrul Iman Siam, Mi Zhang (The Ohio State University)

Public Review

Public review by Mingmin Zhao, University of Pennsylvania, Philadelphia, PA

Augmented reality (AR) glasses are increasingly equipped with always-on egocentric cameras, microphone arrays, and eye trackers. The combination of continuous first-person video and on-device eye tracking creates an opportunity to build intelligent agents that actively observe, interpret, and remember a user’s important daily experiences. A transformational application is what the authors call “super memory”: the ability to accurately recall objects and people an individual has been paying attention to in the physical world. Unlike traditional life-logging, super memory prioritizes information that enters the user’s focus rather than storing everything, and supports real-time retrieval rather than offline browsing.

This paper presents MemoLens, a spatial computing framework with two key ideas. First, MemoLens uses eye gaze to identify what the user is paying attention to, and applies gaze-aware spatio-temporal token merging to generate compact memory snippets for each video segment. Second, it organizes these snippets into a hierarchical tree, with leaf nodes for short-term memory and higher-level nodes for longer-term memory, and adopts a coarse-to-fine top-down retrieval that scales sub-linearly with memory size. The authors implement the system across Meta Aria glasses, a Galaxy S23 Ultra smartphone, and a cloud GPU backend, and evaluate it on a 100-hour subset of the Nymeria egocentric dataset.
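The coarse-to-fine, top-down retrieval described above can be sketched as a greedy descent through a tree of memory summaries. This is a minimal sketch under stated assumptions: the `Node` class, the cosine-similarity scoring, the single-path (beam width 1) descent, and the toy 2-D vectors are all hypothetical and are not taken from MemoLens.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    vec: tuple
    children: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(root, query):
    """Descend from the root, following the most similar child at each level.
    Returns the reached leaf and the number of similarity comparisons made."""
    node, comparisons = root, 0
    while node.children:
        comparisons += len(node.children)
        node = max(node.children, key=lambda c: cosine(c.vec, query))
    return node, comparisons


# Toy hierarchy: higher-level nodes summarize longer time spans.
morning = Node("morning", (1.0, 0.2), [
    Node("keys on kitchen counter", (1.0, 0.0)),
    Node("coffee mug", (0.8, 0.5)),
])
evening = Node("evening", (0.1, 1.0), [
    Node("phone on sofa", (0.0, 1.0)),
    Node("book on shelf", (0.4, 0.9)),
])
root = Node("day", (0.5, 0.5), [morning, evening])

leaf, comparisons = retrieve(root, query=(0.1, 0.9))
print(leaf.label, comparisons)
```

In a balanced tree the number of comparisons in this sketch grows with depth times branching factor rather than with the number of leaves, which is the intuition behind sub-linear scaling.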

The reviewers appreciated the engineering depth behind this work. Using eye gaze to guide attention and memory formation is intuitive and well motivated. A key strength is that gaze serves as a soft signal rather than a hard filter, allowing potentially useful background information to stay for downstream question answering. The hierarchical tree organization and coarse-to-fine retrieval are similarly well motivated and effective. The system is evaluated comprehensively across retrieval accuracy, response quality, latency, energy, and storage, all on physical hardware.

A few open questions remain for future work. Eye fixation does not always mean attention or memorable content: a user can foveate while mind-wandering, or attend to relevant information using covert attention in the peripheral field. Conversely, the regions a user ignored might be the most informative, for example when recalling where a phone or set of keys was placed. MemoLens’s soft-signal token merging partially mitigates this, and user studies on real recall tasks would further validate the design. Several heuristic choices, such as hierarchy depth and per-layer reduction ratios, are tuned empirically and would benefit from a sensitivity analysis. Privacy implications of AR glasses with always-on video recording also remain an important direction.

MemoLens is a thoughtful and well-engineered first step toward practical super memory in next-generation AR glasses. Its core ideas, selective capture guided by attention, hierarchical memory organization, and coarse-to-fine retrieval, should remain useful as backbone models and hardware evolve. We hope this paper sparks further work on both the system and human-centric sides of this problem.

Microwatt Microwave (M2) Oscillator: Going Beyond the Delegation Architecture of Low-power Wireless Communication

Pramuka Sooriya Patabandige, Dhairya Shah, Rajashekar Reddy Chinthalapani, Spanddhana Sara (National University of Singapore); Prabal Dutta (UC Berkeley); Ambuj Varshney (National University of Singapore)

Public Review

Public review by Aline Eid, University of Michigan, Ann Arbor, MI

This paper presents the M²-oscillator, a novel microwatt-level RF oscillator architecture that challenges the long-standing trade-off between power consumption and frequency stability in low-power wireless systems. By coupling a tunnel diode with a high-Q surface acoustic wave (SAW) resonator, the design decouples frequency stability from the inherently unstable behavior of tunnel-diode oscillators. The resulting system achieves ppm-level frequency stability while consuming under 105 µW, a significant improvement over prior tunnel-diode-based designs and comparable to substantially higher-power conventional oscillators. Beyond the oscillator itself, the work also demonstrates how this design can enable a standalone low-power transceiver through self-oscillating mixing and autodyning.

The paper’s main strengths lie in its clear architectural insight, strong experimental validation, and practical system implications. Reviewers appreciated the deep exploration of this fundamental component in wireless transceivers, as well as the effort to connect circuit-level innovation to system-level capabilities. The end-to-end design and extensive evaluation further strengthen the work, including long-term stability measurements, environmental sensitivity analysis (temperature, humidity, and motion), and over-the-air communication experiments across a range of settings. The demonstrated communication ranges and robustness highlight the potential impact of this approach on low-power wireless systems.

The reviewers also highlighted several limitations and concerns, most of which were addressed in the revision. These included questions regarding device variability, system-level evaluation, and the scope of supported communication scenarios. The revised version strengthens the paper through additional experiments across multiple prototypes and tunnel diode types, providing a clearer understanding of variability in frequency stability and output power. It also includes new results on symmetric M²-to-M² communication, multi-user operation via baseband channelization, adjacent-channel interference, and outdoor deployments. These additions significantly improve the completeness of the evaluation and clarify the system-level implications of the design.

Overall, the paper presents a solid and well-validated contribution that bridges circuit-level innovation and system-level impact. By enabling stable, standalone microwatt radios, it opens new directions for low-power wireless system design and provides a strong foundation for future exploration in this space.

Mobile GUI Agents under Real-world Threats: Are We There Yet?

Guohong Liu (Institute for AI Industry Research (AIR), Tsinghua University); Jialei Ye (University of Electronic Science and Technology of China); Jiacheng Liu (Peking University); Wei Liu, Pengzhi Gao, Jian Luan (MiLM Plus, Xiaomi Inc.); Yuanchun Li, Yunxin Liu (Institute for AI Industry Research (AIR), Tsinghua University)

Public Review

Public review by Sunjae Lee, SungKyunKwan University (SKKU), Suwon, Korea

This paper addresses a timely and increasingly relevant question in mobile computing: are mobile GUI agents, an emerging paradigm for how users may interact with their devices, robust enough to withstand real-world adversarial threats? To investigate this, the authors introduce AgentHazard, a benchmark and evaluation framework designed specifically to measure the resilience of mobile GUI agents against malicious or misleading third-party content encountered during task execution.

Unlike existing benchmarks for mobile GUI agents, which primarily evaluate task completion performance, AgentHazard focuses exclusively on an agent’s ability to resist threat models that are realistic in mobile environments, such as deceptive social media posts, misleading overlay content, or adversarial instructions embedded in app interfaces. To construct such a benchmark at scale, the authors develop a content instrumentation framework that can modify real app interfaces to simulate these attacks. This framework supports both a static dataset of 3,000 adversarial scenarios derived from real commercial apps and a dynamic task execution environment comprising 122 reproducible tasks. Their experiments across both open-source and commercial GUI agents reveal that all tested agents suffer significant performance degradation when exposed to third-party adversarial content, confirming the breadth and severity of this vulnerability.

That said, there are promising directions in which this work could be extended. The current attack scenarios primarily target misleading clicks and premature task termination; a natural next step would be to explore whether the framework can capture more severe outcomes such as privacy leakage or malicious redirection, which would further demonstrate the real-world stakes of these vulnerabilities. Additionally, while the paper’s core contribution lies in systematically measuring agent robustness rather than proposing new attack or defense primitives, the benchmark and instrumentation framework it provides could serve as a valuable testbed for developing and evaluating novel defense mechanisms in future work.

Nevertheless, mobile GUI agents represent a plausible and actively pursued direction for the future of mobile interaction. As a research community, it is important that we anticipate and understand the threats that may accompany this shift before these systems see widespread deployment. In that regard, this paper makes a valuable contribution: it identifies a concrete and realistic class of vulnerabilities, provides a practical framework for evaluating them, and offers empirical evidence that current agents are not yet equipped to handle the threats. The work lays useful groundwork for future efforts on both benchmarking and hardening mobile GUI agents against real-world threats.

MulDar: Unleashing the Potential of Distributed COTS mmWave Radar by Exploiting Cross-Device Channels

Xinghua Sun, Qiancheng Li, Akshay Gadre (University of Washington)

Public Review

Public review by Parth Pathak, George Mason University, Fairfax, VA, USA

mmWave radars are being widely deployed in autonomous vehicles, industrial, and smart home sensing applications. However, a key limitation of today’s mmWave radar sensing is specular reflections which result in sparse imaging. Many urban objects and their materials, such as metal, glass, and concrete, act as specular reflectors, directing signals away from the radar. While prior works have proposed leveraging multiple monostatic radars and fusing their observations to improve imaging, using multiple radars in a multistatic, distributed manner, where signals transmitted from one radar can be received by another after reflection from the target, requires addressing a range of practical challenges. Such a multistatic radar sensing system requires radar synchronization, a corresponding signal processing pipeline and an imaging algorithm that can accurately combine multistatic observations to provide improved imaging.

This paper presents MulDar, a bistatic distributed mmWave radar system. It addresses two key problems. First, today’s commodity radars are designed to actively reject interference and suffer from frequency and amplitude offsets between unsynchronized chirps of transmitting and receiving radars. MulDar addresses this through a one-time calibration and leveraging the direct path between radars for finer synchronization. Second, it proposes a new bistatic sparse imaging algorithm based on back-projection matched filtering to reconstruct 2D images for arbitrarily placed radars. MulDar is implemented on TI’s commodity mmWave radars and evaluated for imaging indoor and outdoor objects, achieving a 66.79% reduction in the Chamfer Distance between the observed and true point clouds.
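For readers unfamiliar with the metric, the Chamfer Distance between an observed and a true point cloud can be sketched as follows. Definitions vary (squared versus unsquared distances, sum versus mean); this version uses the mean unsquared nearest-neighbour distance in both directions, which is an assumption and may differ from the paper's exact formulation. The point sets are made-up toy data, not results from the paper.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer Distance between two non-empty point sets:
    mean nearest-neighbour distance from a to b plus the same from b to a."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)


# Toy 2D example: a sparse image misses one of the two true points.
truth = [(0.0, 0.0), (1.0, 0.0)]
sparse = [(0.0, 0.0)]
dense = [(0.0, 0.0), (1.0, 0.1)]

cd_sparse = chamfer_distance(truth, sparse)
cd_dense = chamfer_distance(truth, dense)
reduction = 1.0 - cd_dense / cd_sparse
print(cd_sparse, cd_dense, reduction)
```

A lower value means the reconstructed points lie closer to the ground truth and cover it more completely, so a reported percentage reduction compares the metric of one system against a baseline in the way `reduction` does here.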

The reviewers of the paper broadly agreed that MulDar addresses key practical issues in realizing a bistatic mmWave radar sensing system. The end-to-end sensing solution based on two-stage synchronization and imaging is recognized as a positive feature by the reviewers. The explanation and empirical evidence on why conventional monostatic systems fall short in practice are convincing and adequately motivate the proposed design. The reviewers also appreciated that the system works on commodity mmWave radar hardware without modifications. The reviewers raised some concerns regarding the comparison with prior work that has proposed multistatic sensing (albeit more analytical), the need for recalibration and the line-of-sight path between radars, and clarifications for non-coherent image combining. The authors addressed these concerns through significant revision, new experiments, and elaborated discussions, even expanding on limitations and failure cases of the solution.

Overall, this paper is an important step forward in realizing a practical bistatic/multistatic mmWave radar sensing solution that addresses a number of system challenges, paving the way for follow-up work addressing the sparse imaging issue faced by today’s mmWave radars.

MuPose: Breaking the Scalability Barrier of mmWave Multi-User Pose Estimation in the Wild

Duo Zhang, Zhehui Yin, Xusheng Zhang, Junzhe Wang, Hongliu Yang, Zhiyun Yao, Zizhou Fan, Wenwei Li, Daqing Zhang (Peking University)

Public Review

Public review by Yingying Chen, Rutgers University, New Brunswick, NJ

Millimeter-wave (mmWave) radar has emerged as a promising modality for privacy-preserving human pose estimation (HPE), as it enables device-free sensing without exposing identifiable visual information. While most existing systems focus on single-user scenarios, real-world domestic environments are inherently multi-person and cluttered. Recent multi-user HPE efforts have made progress, but they are designed and validated under idealized, open-space conditions. In realistic indoor settings, these systems suffer from three compounding challenges: inter-person signal collisions, multipath interference from surrounding objects, and severe occlusion. None of the existing solutions addresses all three challenges simultaneously.

This paper presents MuPose, a system that integrates signal processing and deep learning to achieve robust multi-person pose estimation in realistic cluttered indoor environments. To tackle inter-person signal collisions, MuPose re-architects the point cloud generation pipeline with a null steering mechanism, constructing subject-exclusive Range-Doppler maps for each individual rather than separating users at the point cloud level as prior work does. MuPose further exploits the geometric reciprocity between the Angle of Arrival and Angle of Departure to eliminate multipath ghosts by filtering out reflections that violate line-of-sight consistency without relying on static environmental priors. In addition, to handle occlusion, MuPose employs a masked spatio-temporal transformer that infers missing joints by learning from temporal motion continuity and cross-joint correlations.

Reviewers recognized several notable strengths of this paper. The null steering mechanism addresses inter-person signal collisions at the signal level, which was overlooked by prior mmWave systems. The AoA-AoD reciprocity check is grounded in physical signal properties rather than learned heuristics, making it both interpretable and generalizable across environments. In addition, the masked spatio-temporal transformer offers an effective solution to joint occlusion by jointly capturing spatial structure and temporal motion. The evaluation is comprehensive and well-designed, covering a diverse range of indoor environments. The reported improvements over state-of-the-art baselines are substantial and convincing. Meanwhile, several concerns were raised during the review process. The core signal processing techniques, null steering and AoA-AoD filtering, are well-established in the radar community, so the novelty claims need more clarification. Reviewers also noted that the ceiling-mounted radar setup inherently reduces occlusion, raising questions about the system’s robustness in more challenging configurations. Additional suggestions included covering prior studies on AoA-AoD-based filtering in the related work section and strengthening the ablation study. During shepherding, the paper was improved by clarifying the AoA-AoD threshold selection, the distinction between angular resolution and estimation accuracy, and the generalizability of the system across environments.

Overall, MuPose makes a solid and timely contribution to mmWave-based multi-user pose estimation. Its integrated approach, combining principled signal processing with a spatio-temporal learning model, yields a strong empirical improvement over prior art across diverse real-world conditions. This work opens promising directions for further research. Future work could move toward end-to-end multi-person estimation without rule-based user separation and scale to larger spaces through distributed radar deployments. Integrating the system with higher-level semantic understanding would further unlock applications such as activity recognition and clinical health monitoring.

Needle in a Haystack: Tracking UAVs from Massive Noise in Real-World 5G-A Base Station Data

Chengzhen Meng, Chenming He, Yidong Jiang (University of Science and Technology of China); Xiaoran Fan (Independent Researcher); Dequan Wang, Lingyu Wang, Jianmin Ji, Yanyong Zhang (University of Science and Technology of China)

Public Review

Public review by Mohamed Ibrahim, HPE Labs, Berkeley Heights, NJ

Unmanned Aerial Vehicles (UAVs) have become increasingly pervasive, enabling a wide range of applications, including traffic observation, infrastructure assessment, aerial logistics, disaster response, and environmental monitoring. However, the potential misuse of UAVs for unauthorized surveillance and attacks on ground targets has raised significant security concerns. Consequently, accurate long-range tracking of UAVs from terrestrial sensing platforms is critical for reliable airspace intrusion detection, safeguarding ground-based operations, and ensuring the safety of flight operations.

Device-free tracking is a challenging localization problem in which the target does not cooperate with the sensing system, requiring reliance on opportunistic wireless reflections. The sensing architecture may be monostatic (colocated transmitter and receiver), as in radar systems, or bistatic/multistatic, as in passive WiFi-based sensing. A fundamental challenge stems from multipath propagation, where the received signal is a superposition of reflections from numerous static and dynamic scatterers. Consequently, isolating the target-specific component from background clutter and indirect paths remains a critical and non-trivial task.

This paper addresses the problem of passive UAV tracking using point clouds generated by a commercial 5G-Advanced (5G-A) base station. The proposed system employs a multi-stage filtering pipeline to suppress noise and extract reliable target trajectories within a range of up to 1 km. The input consists of point clouds estimated by a 5G-A base station operating at 4.9 GHz with 100 MHz bandwidth, where each point includes 3D spatial coordinates, Doppler velocity, and signal quality metrics. The pipeline begins with noise suppression by modeling the background distribution within a 3D spatial cube to eliminate spurious points. Next, a DBSCAN-based clustering algorithm groups points corresponding to potential objects, followed by spatial and velocity consistency checks to further refine the detections. Finally, a Kalman filter is employed for trajectory estimation, and a transformer-based model is used to classify trajectories as valid UAV tracks or false positives. The system is evaluated in two urban environments using a single base station and a UAV executing nine distinct flight patterns, achieving an average localization error of 4.9 m.
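To make the trajectory-estimation stage more concrete, the sketch below runs a textbook constant-velocity Kalman filter along a single axis. It is a minimal sketch under stated assumptions: the one-dimensional state, the process and measurement noise values `q` and `r`, the initial covariance, and the synthetic measurements are all illustrative choices and are not taken from the paper.

```python
def kalman_cv_1d(measurements, dt=1.0, q=1e-3, r=1.0):
    """Constant-velocity Kalman filter on scalar position measurements.
    State is (position, velocity); covariance is the 2x2 matrix
    [[p00, p01], [p10, p11]]. Returns the filtered (position, velocity) list."""
    pos, vel = measurements[0], 0.0
    p00, p01, p10, p11 = r, 0.0, 0.0, 10.0  # assumed initial uncertainty
    track = [(pos, vel)]
    for z in measurements[1:]:
        # Predict: x = F x, P = F P F^T + Q with F = [[1, dt], [0, 1]].
        pos = pos + dt * vel
        p00, p01, p10, p11 = (
            p00 + dt * (p01 + p10) + dt * dt * p11 + q,
            p01 + dt * p11,
            p10 + dt * p11,
            p11 + q,
        )
        # Update with a position-only measurement, H = [1, 0].
        s = p00 + r                  # innovation variance
        k0, k1 = p00 / s, p10 / s    # Kalman gain
        y = z - pos                  # innovation
        pos += k0 * y
        vel += k1 * y
        p00, p01, p10, p11 = (
            (1 - k0) * p00,
            (1 - k0) * p01,
            p10 - k1 * p00,
            p11 - k1 * p01,
        )
        track.append((pos, vel))
    return track


# Synthetic target moving at 2 units per step along one axis.
measurements = [2.0 * t for t in range(20)]
track = kalman_cv_1d(measurements)
print(track[-1])
```

A real tracker would extend the state to three dimensions and could fold the measured Doppler velocity into the update step; both are left out here for brevity.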

The reviewers recognized the significance of the problem and appreciated both the proposed multistage filtering pipeline and the comprehensive experimental evaluation. However, several concerns were raised, including the system’s ability to handle multi-UAV scenarios and its robustness to ambient interference from birds and other moving objects. Additionally, the reviewers suggested including a more detailed discussion of the generic pipeline used to generate point clouds from 5G waveforms. The authors have addressed most of these concerns by expanding both the evaluation and the background sections. As a direction for future work, further improvements could be achieved by mitigating noise at earlier stages of the pipeline, particularly during point cloud generation through a more careful analysis of the underlying 5G sensing waveform. Moreover, developing robust noise models would benefit from extensive evaluations across diverse outdoor environments, including urban, suburban, and rural settings.

Overall, this work has the potential to inspire further research in passive UAV tracking and to motivate the exploration of broader applications enabled by 5G-Advanced integrated sensing and communication (ISAC) capabilities.

Ouroboros: Instilling Motion Awareness in ViTs for Efficient Video Analytics on the Edge

Chanjeong Park, Donggyu Yang, Sooyoung Kwon, Gibum Park (Seoul National University); Carlee Joe-Wong (Carnegie Mellon University); Kyunghan Lee (Seoul National University)

Public Review

Public review by Anlan Zhang, Adobe Research, San Jose, CA

Vision Transformers (ViTs) are powerful models for visual recognition, but their quadratic self-attention cost makes them challenging to deploy on edge devices. For video analytics, a promising strategy is to reuse tokens across frames by exploiting temporal redundancy. However, prior methods often rely on rigid pixel-wise frame differencing, which conflates positional consistency with true visual redundancy: a patch is considered redundant only if it matches the same spatial index in the previous frame. As a result, when motion shifts content across the patch grid, these methods fail to detect redundancy and trigger unnecessary recomputation.

This paper presents Ouroboros, an end-to-end framework that addresses this misalignment by introducing motion awareness into ViT-based video analytics. The key insight is to place consecutive frames into a shared global coordinate system using affine transformations estimated from motion vectors provided by video codecs, so that invariant content across frames falls onto the same patch in the global input space. This design raises two further challenges that the paper tackles with elegant solutions. First, as accumulated motion shifts content beyond the boundaries of the input space, Ouroboros employs a toroidal (wrap-around) topology so that escaped content reappears on the opposite side, preserving the full token pool without discarding information. Second, because wrapping disrupts spatial proximity at the borders, Ouroboros disentangles and reassigns positional encodings to restore the ViT’s perception of spatial continuity. Combined with a cache-guided partial computation scheme that processes only updated tokens while reusing cached Keys, Values, and final-layer features, Ouroboros achieves up to 87.0% computation reduction, translating to a 2.61× speedup and 64.5% energy savings on NVIDIA Jetson Orin devices, with less than 1% accuracy loss on object detection and instance segmentation tasks.
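The toroidal (wrap-around) topology mentioned above amounts to modular arithmetic on patch coordinates. The sketch below is illustrative only: the function name `wrap_patch`, the integer patch-level shifts, and the 14×14 grid are assumptions for this example, and it does not model the affine alignment or the positional-encoding reassignment.

```python
def wrap_patch(row, col, d_row, d_col, grid_h, grid_w):
    """Shift a patch index by (d_row, d_col) on a toroidal grid, so content
    that leaves one border re-enters on the opposite side."""
    return ((row + d_row) % grid_h, (col + d_col) % grid_w)


# Hypothetical 14x14 patch grid.
right_edge = wrap_patch(0, 13, 0, 2, 14, 14)    # moves past the right border
top_left = wrap_patch(0, 0, -1, -1, 14, 14)     # moves past the top-left corner
interior = wrap_patch(5, 5, 1, -2, 14, 14)      # ordinary shift, no wrapping
print(right_edge, top_left, interior)
```

Because Python's `%` returns a non-negative result for a positive modulus, negative shifts wrap correctly without extra handling.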

The reviewers found the paper well-written and recognized its novelty in explicitly modeling geometric redundancy via motion-aware global alignment, a clear departure from prior pixel-differencing methods. They also appreciated the comprehensive evaluation across multiple ViT architectures, edge devices, and scenarios. Meanwhile, several concerns were raised. A key issue was the robustness of the single global affine transformation, which may struggle with non-rigid motion, multiple independently moving objects, or depth-induced parallax. Reviewers also questioned the theoretical validity of KV caching in non-causal ViT encoders, since self-attention depends globally on all tokens, making partial updates inexact. Additional suggestions included analyzing performance across motion magnitudes and discussing reliance on video codec motion vectors and their hardware accessibility. In their rebuttal, the authors addressed most concerns with quantitative evidence on alignment robustness and clarified caching and implementation details. During shepherding, the paper was further improved with clearer discussion of KV caching, motion vector generalizability, and failure cases under complex motion.

Overall, Ouroboros makes a compelling contribution to efficient ViT-based video analytics on edge devices. By reconceptualizing temporal redundancy elimination through the lens of geometric alignment rather than pixel differencing, it unlocks substantially higher computation reuse and provides a principled, modular framework that is compatible with diverse ViT architectures. The toroidal input space and positional encoding reassignment are particularly creative solutions. This work opens promising directions for further research, including extending the framework to handle multi-object non-rigid dynamics and integrating with temporal attention mechanisms for holistic video understanding.

Physical Self-Supervised Learning: IMU Sensing without Manual Labels

Yuyang Leng, Renyuan Liu (George Mason University); Shaohan Hu, Peijun Zhao (JPMorganChase); Chun-Fu Chen (JPMorgan Chase); Songqing Chen, Shuochao Yao (George Mason University)

Public Review

Public review by Anonymous shepherd

Over the past decade, deep neural networks have advanced mobile and wearable sensing across applications such as smart health, robotics, augmented reality, and IoT systems. Among sensing modalities, inertial measurement units (IMUs) are widely used due to their low cost and availability in consumer devices. However, scalable IMU learning continues to face two major challenges: the high cost of collecting labeled data and the strong heterogeneity of sensing environments caused by variations in users, device placements, orientations, and hardware characteristics. Existing supervised and self-supervised approaches reduce but do not eliminate the need for labeled data during adaptation to new conditions. As a result, robust deployment outside controlled environments remains difficult. Motivated by these limitations, the paper explores whether physical structure inherent in IMU sensing tasks can be integrated into learning frameworks to reduce label dependence while improving generalization.

To address this problem, the paper proposes a framework called physical self-supervised learning for IMU sensing tasks such as inertial tracking and full-body motion capture. The core idea is to replace the conventional neural decoder in an autoencoder framework with an adaptive physics decoder based on learnable kinematic equations. This design aims to guide the model toward physically meaningful representations while adapting to diverse sensing environments. The framework also introduces a hybrid IMU encoder with structured latent-space reconstruction to improve robustness against sensor noise. Additional components include probabilistic frequency-spatial constraints, a multi-view kinematic tree for sparse physical supervision, and uncertainty-aware modeling to address ambiguity in IMU inference. Experiments across public datasets and realistic deployments demonstrate strong improvements in tracking and motion capture performance, particularly under cross-user and cross-dataset generalization settings.

The technical program committee viewed the paper as addressing an important challenge in wearable sensing and appreciated its effort to combine physical priors with neural learning frameworks. Reviewers highlighted the technical depth, broad evaluation, and strong empirical performance in challenging generalization scenarios. The committee also recognized the importance of reducing label dependence while improving robustness across heterogeneous sensing environments. At the same time, reviewers noted areas requiring further improvements, including scalability, deployment feasibility on resource-constrained devices, and the explainability of the performance gains over supervised baselines. Additional validation across more diverse physical conditions and applications was also suggested as future work. Overall, the paper was seen as opening a promising research direction at the intersection of self-supervised learning and mobile sensing.

Pinpointing Transmitting LEO Satellites from a Single Passive Array

Ishani Janveja (University of Illinois Urbana-Champaign); Jida Zhang (Stanford University); Emerson Sie, Deepak Vasisht (University of Illinois Urbana-Champaign)

Public Review

Public review by Mohamed Ibrahim, HPE Labs, Berkeley Heights, NJ

Low Earth Orbit (LEO) satellites have gained significant attention in recent years from both research communities and industry due to their growing role in global communications, Earth observation, navigation, and remote sensing. Rapid deployment of large constellations, such as Starlink, is enabling unprecedented connectivity but also creating new challenges for spectrum management, interference monitoring, and space situational awareness. Accurate tracking and identification of LEO satellites is therefore critical to ensure reliable communications and mitigate potential conflicts in the increasingly crowded low-Earth orbit environment.

This paper introduces StarLoc, a 3D localization system designed to track LEO satellites using a single passive receiver. StarLoc emphasizes a low-cost and compact solution, employing only three horn antennas to achieve precise localization. The system tackles several key challenges: improving angle estimation with a minimal antenna array, resolving range ambiguities, mitigating angular ambiguities through large antenna separations, and modeling range as a function of estimated angles while accounting for satellite orbital motion. It further leverages Doppler shifts to refine range and trajectory estimates, enabling robust 3D localization. Evaluations on a real-world testbed demonstrate that StarLoc can track Starlink satellites with angular errors within 0.7° and range errors within 5 km, highlighting its effectiveness for practical spectrum monitoring applications.

Reviewers acknowledged the significance of the problem and the technical soundness of the proposed solution, while raising questions regarding localization accuracy, angular coverage, ground truth reliability, and practical deployment considerations. They recommended clarifying the accuracy requirements for target applications and comparing angular coverage with prior systems. The reviewers also asked for more details on ground truth measurement and potential sources of error, and for an analysis of the tradeoffs between cost and accuracy as a function of the number of antennas. The authors addressed most of these points during shepherding.

Overall, this work represents a meaningful step toward practical, high-accuracy, and low-cost LEO satellite tracking. By demonstrating a compact system capable of precise 3D localization, StarLoc can inspire further research in the community, motivating innovations in affordable satellite tracking solutions. Its contributions are likely to benefit a range of applications, from spectrum monitoring and interference management to broader space situational awareness, offering a foundation for future studies and deployments in a rapidly growing LEO ecosystem.

Rapid Plant Health Monitoring for Leafy Greens through Carbon Dioxide Sensing

Ngoc Que Anh Tran, Liang He (University of Nebraska-Lincoln)

Public Review

Public review by Vikram Iyer, University of Washington, Seattle, WA

Agricultural research depends heavily on greenhouse trials to screen new seed lines for plant-breeding programs and treatments. Such trials can require weeks to months for completion of a full crop cycle. Identifying plant stress early, and thereby catching failing trials, is critical to minimizing wasted resources and improving efficiency. The traditional approach has been slow, manual visual inspection for signs of stress or disease. Recent imaging technologies and invasive sensing approaches that seek to address these problems either require tightly controlled conditions or damage the very plants under study. Methods that can rapidly, cheaply, and non-invasively flag a struggling trial are therefore of significant practical value to the agricultural and sensing communities.

This paper presents CarbonSense, a non-invasive plant stress detection system that uses near-canopy CO2 concentration as a behavioral side channel for plant health. The key insight is that leaf pore activity produces minute-scale fluctuations in ambient CO2 near the plant that can be captured with off-the-shelf sensors, without contacting the plant. The authors observe that under the fixed lighting schedules standard in greenhouse operation, healthy plants follow repeatable CO2 patterns tied to LED transitions, while stressed plants deviate from these patterns. CarbonSense operates without information about the greenhouse or plant and instead self-learns an adaptive, healthy CO2 baseline during an initial window, then watches for sustained deviations and distortions in the daily CO2 curve shape. The authors evaluate the system over 79 days and three lettuce crop cycles in a greenhouse testbed, reporting 93.7% daily accuracy and stress detection roughly 41 hours earlier than visible symptoms appear in RGB imagery.
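The "learn a healthy baseline, then watch for sustained deviations" loop can be sketched in a few lines. This is a simplified illustration, not CarbonSense's detector: it assumes each day has already been reduced to one scalar feature (for example, a hypothetical CO2 swing in ppm after a lighting transition), and the thresholds `init_days`, `k`, and `sustain` are made-up parameters.

```python
from statistics import mean, stdev


def detect_stress(daily_features, init_days=5, k=3.0, sustain=2):
    """Return the index of the day on which stress is confirmed, else None.

    The first `init_days` values form the healthy baseline. A day deviates
    when it lies more than k standard deviations from the baseline mean;
    stress is confirmed after `sustain` consecutive deviating days.
    """
    baseline = list(daily_features[:init_days])
    run = 0
    for day in range(init_days, len(daily_features)):
        mu, sigma = mean(baseline), stdev(baseline)
        value = daily_features[day]
        if abs(value - mu) > k * max(sigma, 1e-9):
            run += 1
            if run >= sustain:
                return day
        else:
            run = 0
            baseline.append(value)  # adapt only with days judged healthy
    return None


# Hypothetical daily features: stable for seven days, then drifting upward.
stressed_trial = [50, 52, 49, 51, 50, 51, 50, 70, 72, 75]
healthy_trial = [50, 52, 49, 51, 50, 51, 50, 49, 52]
first_alarm = detect_stress(stressed_trial)
no_alarm = detect_stress(healthy_trial)
```

Requiring consecutive deviating days trades a little latency for fewer false alarms from one-off disturbances. The paper's detector additionally considers distortions in the shape of the daily curve, which this sketch omits.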

The reviewers appreciated the core idea that near-canopy CO2 dynamics can be used to infer plant health. They found the self-learning, online formulation well matched to the realities of agricultural trials, where stress patterns for new seed lines or treatments are not known in advance. The latency improvement over visual baselines was seen as a meaningful practical result that could materially shorten the feedback loop in greenhouse experimentation. The reviewers also noted some limitations, such as the fact that most experiments were performed over a limited number of crop cycles in a compact, enclosed greenhouse, where CO2 concentrations may build up more easily. During the review and shepherding process, the authors added results showing that these patterns also hold in larger spaces, such as an office setting with airflow disturbances from HVAC systems. Questions were raised about the initialization period, sensitivity to sensor placement, canopy density, and cross-plant CO2 mixing in larger or multi-plot deployments, as well as about the system’s behavior near a plant’s end-of-life.

Overall, the committee saw CarbonSense as a promising proof of concept for this new approach toward rapid, low-overhead stress detection in greenhouse trials.

Reliable Metal Foreign Object Detection for Mobile Wireless Charging via Harmonic Fingerprinting

Shenyao Jiang (City University of Hong Kong); Yang Liu (Florida State University); Lixiang Han (City University of Hong Kong); Xinyu Wang, Hao Zhou (University of Science and Technology of China); Zhenjiang Li (City University of Hong Kong)

Public Review

Public review by Minhao Cui, Seoul National University, Seoul, Republic of Korea

In the 1890s, Nikola Tesla first proposed the concept of wireless charging using the Tesla coil, enabling electricity to be transmitted through resonant inductive coupling over the air rather than through cables. Today, this visionary idea has evolved into a multi-billion-dollar industry, powering everyday devices such as smartphones, wearables, and even electric vehicles. However, like any technology, wireless charging has its drawbacks. While it offers convenient and flexible power delivery, it also introduces safety concerns. For example, small metal objects—such as coins, SIM ejector tools, or transit cards—may inadvertently enter the charging zone (e.g., inside a phone case). These objects can absorb electromagnetic energy and heat up rapidly, posing risks of burns, device damage, or even fire. The dominant approach for detecting such foreign objects relies on monitoring energy loss caused by their presence. However, this method—as well as other techniques based on tracking electrical parameters—often fails to detect foreign objects in a timely manner.

This paper explores an interesting observation: metal foreign objects alter the electromagnetic field during wireless charging by acting as a low-pass filter, disproportionately attenuating the high-frequency harmonics of in-band communication signals. To capture these electromagnetic signatures and enable robust foreign object detection in real-world scenarios, the authors design an end-to-end system spanning both hardware and software. By combining a planar sensing coil array with a contrastive learning model, the authors implement a low-cost prototype. The system is evaluated across a variety of wireless chargers and common everyday metal objects. The results demonstrate that it can reliably detect foreign objects well before their temperature rises to a dangerous level.
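The low-pass intuition can be made concrete with a first-order filter applied to the odd harmonics of an ideal square wave. This is a toy model under stated assumptions, not the paper's channel model: the 2 kHz fundamental and both cutoff frequencies are arbitrary illustrative values, and the function names are hypothetical.

```python
import math


def lowpass_gain(freq_hz, cutoff_hz):
    """Magnitude response of a first-order low-pass filter."""
    return 1.0 / math.sqrt(1.0 + (freq_hz / cutoff_hz) ** 2)


def harmonic_fingerprint(fundamental_hz, cutoff_hz, harmonics=(1, 3, 5, 7)):
    """Harmonic amplitudes after filtering, normalised to the fundamental.

    An ideal square wave has odd harmonics with amplitude 4/(pi*n).
    """
    amps = [
        (4.0 / (math.pi * n)) * lowpass_gain(n * fundamental_hz, cutoff_hz)
        for n in harmonics
    ]
    return [a / amps[0] for a in amps]


# Assumed cutoffs: a high one for a clear pad, a low one with metal present.
clear_pad = harmonic_fingerprint(2000.0, 100000.0)
with_object = harmonic_fingerprint(2000.0, 5000.0)
```

Normalising to the fundamental removes the overall signal level, so the fingerprint reflects the spectral tilt rather than absolute power. In this toy setting the higher harmonics shrink noticeably once the cutoff drops, which is the kind of signature a classifier could learn.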

The reviewers appreciated the key insight of modeling foreign objects as a low-pass filter, as well as the end-to-end system design and the thorough evaluation across diverse real-world scenarios. Reviewers also noted that the current implementation relies on additional hardware, including a dedicated MCU board and sensing antenna. As such, an important direction for future work is to integrate the system into commercial wireless charging devices. Another promising direction is to extend the proposed detection approach to other wireless charging scenarios, such as electric vehicle charging systems.

The insights from this work provide the community with a new perspective on how external factors can fundamentally shape wireless charging and near-field communication channels, and influence their roles in the overall process.

RF Super Resolution: A Deep Learning Approach to Spatial Enhancement for LoRa

Andreas Kuster (Nanyang Technological University (NTU), Singapore); Huatao Xu (Hong Kong University of Science and Technology); Rui Tan (Nanyang Technological University); Mo Li (The Hong Kong University of Science and Technology)

Public Review

Public review by Ju Wang, Northwest University, Xi’an, China

Low-power wide-area networks (LPWANs), particularly LoRa, have achieved widespread adoption due to their ability to provide long-range communication in highly noisy environments. However, to reliably extract and demodulate these weak signals, current LoRa receivers rely on heavy analog oversampling, typically operating at up to 8× the signal bandwidth. This imposes a significant and persistent “oversampling tax” on the analog-to-digital converter (ADC) and the subsequent digital signal processing (DSP) pipelines. Consequently, this high sampling rate creates a major energy and hardware bottleneck, especially for battery-constrained end-nodes and low-power gateways.

This paper tackles the oversampling problem by drawing elegant inspiration from visual computing. Just as modern graphics rendering uses a “render low, upscale high” paradigm (e.g., DLSS) to save compute power while maintaining visual fidelity, the authors ask: can a similar super-resolution principle be applied to radio frequency (RF) signals? To explore this, the paper introduces RF Super Resolution (RF-SR), a lightweight, real-time neural upscaler for LoRa. The system allows the analog front-end to operate at a significantly reduced sampling rate (2× Nyquist). It then pairs an efficient digital interpolation algorithm with a hardware-friendly, shallow four-layer Convolutional Neural Network (CNN) to mathematically reconstruct the signal and correct interpolation artifacts and noise. The end-to-end evaluation demonstrates that RF-SR can match the demodulation performance of a native 8× oversampled system while providing additional signal-to-noise ratio (SNR) gains.

The reviewers appreciated the novel “sample low, upscale high” conceptual framing applied to the RF domain, as well as the practical, lightweight architecture design that makes deployment on resource-constrained hardware feasible. The reviewers also praised the rigorous evaluation conducted using a large-scale over-the-air (OTA) dataset. Naturally, proposing to replace classical analog fidelity with digital neural computation sparked valuable discussions. The reviewers raised thoughtful questions regarding the true “AI tax”, i.e., whether the energy saved by a lower-rate ADC justifies the computational power required by the CNN, and how robust the learned mapping is across different hardware components and varying environmental conditions.

Through the shepherding process, the authors successfully addressed these concerns by detailing system-level power budgets and proving the model’s generalizability across diverse spreading factors and hardware fingerprints. Ultimately, as edge-AI hardware continues to become more efficient, trading analog oversampling for smart, lightweight digital computation presents a highly promising paradigm for the future design of energy-efficient IoT communication systems.

Ringmaster: How to juggle high-throughput host OS system calls from TrustZone TEEs

Richard Habeeb (Yale University); Man-Ki Yoon (North Carolina State University); Hao Chen (CertiK); Zhong Shao (Yale University)

Public Review

Public review by Steve Ko, Simon Fraser University (SFU), Vancouver, BC, Canada

A Trusted Execution Environment (TEE) is a hardware-enforced domain where code and data can run in isolation from the rest of the system, e.g., hypervisor, OS, or applications. This domain also provides confidentiality and integrity, i.e., the code and data in a TEE are hidden and cannot be tampered with. Due to these guarantees, many security-sensitive applications rely on it, such as mobile payments, secure wallets, and DRM. Intel, ARM, and AMD all provide a TEE, though the exact implementation varies across platforms.

A long-standing problem with TEEs is how to handle I/O. A TEE can rely on the host OS for rich I/O services, but this design exposes the TEE to timing vulnerabilities (among other things) since a malicious OS can delay or deny I/O requests. On the other hand, a TEE can implement its own I/O services, but this design inflates the Trusted Computing Base (TCB) and, accordingly, the attack surface. Thus, the question is, how can a TEE safely use the host OS for I/O without being vulnerable to timing attacks, while also keeping the TCB small?

Ringmaster answers this question for safety-critical systems, such as drones and autonomous vehicles, running in ARM TrustZone TEEs. The key idea is to use asynchronous system calls via Linux’s io_uring, and make all interactions with the untrusted OS non-blocking. This prevents the untrusted OS from blocking TEE execution through the system call interface. Further, when the OS denies service, enclaves can continue running on Ringmaster’s minimal kernel, while time-sensitive I/O is handled through small Ringmaster-owned devices.

The reviewers found that the problem and motivation are well articulated and compelling. They also recognized the novelty of the paper’s availability-oriented TEE design. Additionally, the reviewers appreciated the system’s practicality and completeness, as well as its high throughput and low overhead. Overall, the reviewers found the paper thorough in addressing many systems challenges in a well-designed system.

SAIL: Redesigning Collaborative Language Inference with a Single Server-to-Mobile Handoff

Gibum Park, Sanghyun Han, Yonghwa Cho, Chanjeong Park, Kyunghan Lee (Seoul National University)

Public Review

Public review by Hong Jia, University of Auckland, Auckland, New Zealand

This paper introduces SAIL, a collaborative inference framework for mobile LLM applications that seeks to balance high accuracy with low latency by combining a server-side LLM with an on-device SLM. Its central idea, Prefix Handoff Inference (PHI), is that the server generates the more difficult early portion of the output, after which generation is handed off to the device. By reducing communication to a single handoff, SAIL avoids the repeated synchronization overhead that constrains prior collaborative approaches such as split computing and speculative decoding.
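The control flow of a single handoff can be sketched as follows. This is a minimal illustration with a fixed handoff point, whereas SAIL decides the handoff adaptively; the function names, the `NextToken` callable type, and the toy models are all hypothetical stand-ins for a real server LLM and on-device SLM.

```python
from typing import Callable, List, Sequence

# A "model" here is just a function from the tokens so far to the next token.
NextToken = Callable[[Sequence[str]], str]


def prefix_handoff_generate(
    prompt: Sequence[str],
    server_next: NextToken,
    device_next: NextToken,
    handoff_after: int,
    max_new_tokens: int,
    eos: str = "<eos>",
) -> List[str]:
    """Generate tokens with the server first, then the device, switching once."""
    context = list(prompt)
    generated: List[str] = []
    for step in range(max_new_tokens):
        # The server writes the first `handoff_after` tokens (the prefix);
        # every later token comes from the on-device model.
        model = server_next if step < handoff_after else device_next
        token = model(context + generated)
        if token == eos:
            break
        generated.append(token)
    return generated


def toy_server(tokens: Sequence[str]) -> str:
    return f"S{len(tokens)}"


def toy_device(tokens: Sequence[str]) -> str:
    return f"D{len(tokens)}"


output = prefix_handoff_generate(
    ["hi"], toy_server, toy_device, handoff_after=2, max_new_tokens=5
)
```

Because the switch happens exactly once, only the prompt and the server-written prefix need to cross the network, in contrast to schemes that synchronize on every token or every draft.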

The paper’s motivating observation is intuitive: early decoding tokens are often the hardest, while a smaller model can continue effectively once provided with a strong prefix. The design is also quite complete: SAIL includes adaptive handoff decisions, branch prediction on the device, and adaptive control under varying system conditions. The evaluation is likewise thorough. It covers a range of tasks and model families, and the results indicate that SAIL can significantly outperform mobile-only inference while preserving nearly the full accuracy of the server LLM under latency constraints.

Reviewers noted a few limitations where the paper could be improved. The advantages of SAIL seem strongest in settings where server-only inference cannot satisfy latency or throughput requirements, but the on-device model is still capable of completing the suffix efficiently. The current study also focuses mainly on LLM/SLM pairs from the same model family, so the broader generality of PHI across more heterogeneous model combinations remains to be established.

Overall, reviewers agree the central idea of this paper is insightful, the design is well executed, and the evaluation is convincing enough to demonstrate practical value. While there remain some open questions about deployment scope and generality, the paper makes a meaningful contribution to collaborative LLM inference for mobile systems and should be of interest to the MobiSys community.

ShadowNPU: System and Algorithm Co-design for NPU-Centric On-Device LLM Inference

Wangsong Yin (Key Lab of HCST (PKU), MOE; SCS, Peking University, China); Daliang Xu, Mengwei Xu (Beijing University of Posts and Telecommunications); Gang Huang, Xuanzhe Liu (Key Lab of HCST (PKU), MOE; SCS, Peking University, China)

Public Review

Public review by Chulhong Min, Nokia Bell Labs, Cambridge, UK

Running large language models on mobile devices is critical to enable privacy-preserving AI. One opportunity in this direction is to leverage the Neural Processing Units (NPUs) incorporated into modern mobile SoCs, which offer high-throughput integer computation with superior energy efficiency. However, the execution of end-to-end LLM operations on NPUs causes significant degradation in accuracy because the attention operation involves activation tensors that are highly sensitive to quantization. Thus, the attention operation typically falls back to the CPU or GPU in state-of-the-art frameworks.

This paper presents shadowNPU, a sparse attention module that addresses this challenge. The key observation is that identifying important tokens through attention score estimation is more tolerant to low-precision quantization than computing exact attention values. This is because estimation only requires preserving relative ordering. Based on this observation, the authors devised a method to offload the dense estimation stage to the NPU using INT8 quantization and then transfer only the indices of important tokens to the CPU/GPU for sparse high-precision computation. To make this practical, the authors further introduce a series of techniques: NPU compute-graph bucketing, a head-wise NPU-CPU/GPU pipeline, and per-head fine-grained sparsity ratios.
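The estimate-then-refine split can be illustrated in plain Python. This is a toy sketch under stated assumptions, not shadowNPU's kernels: it uses simple symmetric per-vector INT8 quantization, a single attention head, and made-up query, key, and value data. All function names are hypothetical.

```python
import math


def quantize_int8(vec):
    """Symmetric per-vector INT8 quantization; returns (ints, scale)."""
    scale = max(abs(x) for x in vec) / 127.0 or 1.0
    return [max(-127, min(127, round(x / scale))) for x in vec], scale


def topk_indices_int8(query, keys, k):
    """Low-precision stage: rank keys by an integer dot product with the query."""
    q_int, _ = quantize_int8(query)
    scored = []
    for i, key in enumerate(keys):
        k_int, k_scale = quantize_int8(key)
        int_dot = sum(a * b for a, b in zip(q_int, k_int))
        # Rescale by the per-key scale so scores are comparable across keys.
        scored.append((int_dot * k_scale, i))
    return sorted(i for _, i in sorted(scored, reverse=True)[:k])


def sparse_attention(query, keys, values, indices):
    """High-precision stage: exact softmax attention over the selected tokens."""
    d = len(query)
    logits = [
        sum(q * x for q, x in zip(query, keys[i])) / math.sqrt(d) for i in indices
    ]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [
        sum(w / total * values[i][c] for w, i in zip(weights, indices))
        for c in range(len(values[0]))
    ]


query = [1.0, 0.5, -0.25, 0.0]
keys = [
    [0.9, 0.4, -0.2, 0.1],
    [-1.0, 0.2, 0.3, 0.0],
    [0.1, 0.1, 0.1, 0.1],
    [0.8, 0.9, 0.0, -0.3],
    [-0.2, -0.7, 0.5, 0.2],
]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 2.0], [3.0, 3.0]]

selected = topk_indices_int8(query, keys, k=2)
context = sparse_attention(query, keys, values, selected)
```

In this example the quantized ranking selects the same two tokens as a full-precision ranking would, even though the individual INT8 scores are only approximate. Only the index list crosses from the low-precision stage to the high-precision one, which mirrors the small transfer described above.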

The reviewers appreciated that the authors tackled a practical topic and motivated the proposed design well. The key observation was particularly compelling: token importance estimation is resilient to quantization. The authors conducted an extensive evaluation, covering multiple models, datasets, and design alternatives, and demonstrated up to 4.5× end-to-end speed-up and up to 7.7× energy reduction, with only a 0.4 percentage point accuracy loss.

The reviewers also left some suggestions for further improvement. The evaluation is conducted exclusively on Qualcomm Hexagon NPUs. Extending the evaluation to other NPU families would demonstrate the generalizability of the proposed solution more convincingly. Validating shadowNPU in deployment scenarios with realistic mobile LLM applications would further establish its practical impact beyond the evaluation with datasets.

Overall, this paper makes a meaningful systems contribution to the community. It identifies a practical problem, proposes a technically sound solution grounded in insightful observation, and supports its effectiveness with a comprehensive evaluation.

Solving Scarce Wireless Signal Dilemma in Model Training using Cross-Modal Learning Leveraging Limited Video Data

Qiufan Ji (New Jersey Institute of Technology); Honglu Li (Rutgers University); Cong Shi (New Jersey Institute of Technology); Yan Wang (Temple University); Jerry Q. Cheng (New York Institute of Technology); Yingying Chen (Rutgers University)

Public Review

Public review by Matthai Philipose, Microsoft, Seattle, WA

How do we learn models to map RF signals to human state given the extreme paucity of training data (i.e., mappings from RF maps to human state, e.g., activities)? This paper provides a way to convert spatiotemporal visible light maps (aka video) to spatiotemporal RF maps. Given the abundance of video data (which can be converted to RF maps using these models), this could yield plentiful RF map training data, which in turn could allow training RF models in data-poor settings. The straightforward approach to training a video-to-RF model would be supervised training on jointly gathered video/RF data. Although conceptually simple, success is not guaranteed. Perhaps most importantly, the amount of training data available is limited. Second, it is unclear that such joint models can be trained to high accuracy. Finally, it is unclear that models will generalize well to new settings. This paper shows how to make it work.

The paper uses a UNet model, the standard approach for [video] frame to [RF] frame translation, as a baseline. However, instead of training the direct video-to-RF translation (which would be vulnerable to data scarcity), it uses various forms of prior supplementary information. First, it exploits the information content of pre-trained vision analysis models to preprocess image data into depth, edge, material, and position data. Second, informed by physics-based intuitions, it incorporates carefully structured convolution layers into the UNet. Finally, it simulates domain shift on training data to encourage generalization to new domains. Measurements (including extensive ablation studies) show the benefit of the approach.

Reviewers were unanimous about the potential impact of the approach on cracking the RF data scarcity problem. They were impressed by the carefully reasoned design modifications to a solid baseline design. The results were convincing and extensive. That said, there is still significant scope for future work. For example, accuracy degrades noticeably under compound domain shifts, with unseen environments yielding ∼70% and unseen device positions ∼69%, leaving meaningful room for improvement in realistic deployments where multiple factors change simultaneously.

The approach should make a difference in training RF-based models in certain common settings.

SpecSentry: Micro-power Wideband Spectrum Surveillance for Ephemeral Transmissions

Aritrik Ghosh, Nirupam Roy (University of Maryland, College Park)

Public Review

Public review by Ambuj Varshney, National University of Singapore, Singapore

Continuous, wideband monitoring of the radio spectrum has long been an important problem. Regulators rely on it to keep the airwaves orderly and enforce spectrum rules. Security operators depend on it to detect covert emissions, a concern that has grown more acute as the proliferation of embedded devices, approaching Pister’s vision of smart dust, creates new surfaces for both innocuous interference and deliberate exfiltration. As early as 1945, Theremin’s passive resonant-cavity “Thing” bug, concealed in a gift to the US ambassador, demonstrated the importance of monitoring the airwaves. It remained undetected for seven years, transmitting only when externally illuminated by an RF signal. Modern threats follow similar logic. A tampered sensor can stay radio-silent for hours and then exfiltrate data in a brief, low-duty-cycle burst designed precisely to slip past conventional receivers.

Detecting such ephemeral transmissions is hard for a fundamental reason. The conventional receive chain (antenna, low-noise amplifier, local oscillator, mixer, and high-speed ADC) is power-consuming, with the dominant cost in carrier generation, downconversion, and Nyquist-rate digitization over hundreds of megahertz. Spectrum analyzers reduce this cost by sweeping the band, visiting one narrow slice at a time, but in doing so routinely miss the very transmissions one would most want to catch.

SpecSentry proposes a clever solution to this problem. It replaces the active receive chain with an entirely passive frontend built around a Schottky-diode envelope detector. Envelope detectors are already deployed in billions of RFID tags worldwide, but their use has been largely limited to short-range communication with simple modulation schemes. The authors find a clever way past this constraint by coupling the detector with a phase-coded microstrip band-pass filter. The insight is that the very property RF designers have spent decades engineering away, the non-linear group-delay profile of a practical band-pass filter, can instead be repurposed as a frequency-dependent code that imprints a deterministic phase signature on any signal traversing it. The subsequent self-mixing in the envelope detector folds the wideband content down to a narrow baseband that can be sampled at roughly 20x below the Nyquist rate of the scanned band. On a 400 MHz band, SpecSentry achieves median detection errors of 0.3 MHz in bandwidth and 7.89 MHz in center frequency. The authors also fold the imperfections of the FR4 prototype and the small phase imbalance of the splitter into the synthetic training set, bridging the gap from clever idea to working system.

SpecSentry leaves open questions for the community to address. First, Schottky-diode receivers are inherently sensitivity-limited, and improving passive-frontend sensitivity without compromising low power consumption is nontrivial. Traditional solutions such as adding an LNA would substantially increase power consumption, and alternative approaches are needed. Second, the passive frontend consumes essentially no power, but the prototype currently digitizes through a software defined radio. The authors present an early design of the processing platform landing at roughly 221 mW; reducing this to match the frontend’s minimal power consumption remains a challenge. Third, by construction, an envelope-detector frontend struggles with narrowband modulation schemes such as PM and FM. Finally, multi-signal and co-channel interference scenarios remain notoriously difficult for any envelope-detector architecture. While SpecSentry takes a promising step toward separating concurrent signals, robustly disentangling dense, overlapping transmissions at a low power budget remains an open challenge.

Wireless researchers have spent the better part of a decade asking how to push more capability into the receive chain, more bandwidth, more sensitivity, more sophisticated decoding. SpecSentry asks the inverse question: how much receive functionality can we strip away and still see what matters? As embedded devices proliferate, the ability to maintain wideband spectrum awareness within a low power budget will only grow in importance. This paper is a meaningful step in that direction.

SpMAP: Transparent Sparsity for LLMs

Wonkyo Choe, Felix Xiaozhu Lin (University of Virginia)

Public Review

Public review by Longfei Shangguan, University of Pittsburgh, Pittsburgh, PA

Small is the next big. Over the past few years, progress in LLMs has largely been driven by scale: more parameters, more compute, and larger serving infrastructure. But as we look ahead to a future where people increasingly rely on LLMs for everyday assistance, running these models directly on phones or other personal devices becomes a more natural choice, with clear benefits in privacy, latency, and personalization. However, running LLMs on resource-constrained personal devices is very challenging largely because of the unprecedented memory footprint of LLMs.

Recent works in the LLM community have shown that during decoding, only a small fraction of neurons need to be activated for each token generation, creating an important opportunity to reduce memory use through sparsity. The challenge, however, is that exploiting this sparsity in mobile systems often introduces new overheads, such as dynamic tensor shapes, extra memory copies, and complicated I/O pipelines. This paper tackles this question in a simple and elegant way: can we preserve the benefit of sparsity without making the system much more complex?

The core idea of SpMAP is simple in the best sense. Instead of building custom operators or complicated user-level pipelines to manage sparse tensors, the paper leverages a mature OS abstraction, virtual memory, to support what the authors call transparent sparsity. SpMAP creates a full tensor-sized anonymous mapping for zero-filled regions, and then remaps only the predicted active neurons with file-backed pages. In this way, the ML framework still sees a static dense tensor shape, while inactive regions naturally read as zeros. This design avoids dynamic tensor reshaping, reduces extra memory copies, and lets the OS handle data loading through existing paging mechanisms. For the SIGMOBILE community, this is the kind of systems paper we like to see: a simple idea, a working system, and clear gains under resource constraints.
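The mapping trick can be demonstrated from Python by calling the C library's `mmap` through `ctypes`. This is a Linux-only sketch under stated assumptions, not SpMAP's implementation: it treats each page as one "neuron row", hardcodes the Linux value of `MAP_FIXED` (Python's `mmap` module does not export it), and uses a temporary file as a stand-in for a weight file. The helper name `map_sparse_tensor` is hypothetical.

```python
import ctypes
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE
MAP_FIXED = 0x10  # assumption: Linux value, hardcoded for this sketch

libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
MAP_FAILED = ctypes.c_void_p(-1).value


def map_sparse_tensor(path, n_rows, active_rows):
    """Map a dense-shaped region where only `active_rows` are file-backed."""
    size = n_rows * PAGE
    # Step 1: a full-size anonymous mapping; untouched pages read as zeros.
    base = libc.mmap(None, size, mmap.PROT_READ,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if base == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "anonymous mmap failed")
    fd = os.open(path, os.O_RDONLY)
    try:
        # Step 2: overlay file-backed pages at the same addresses for the
        # rows predicted to be active.
        for row in active_rows:
            addr = libc.mmap(base + row * PAGE, PAGE, mmap.PROT_READ,
                             mmap.MAP_PRIVATE | MAP_FIXED, fd, row * PAGE)
            if addr == MAP_FAILED:
                raise OSError(ctypes.get_errno(), "file-backed mmap failed")
    finally:
        os.close(fd)  # existing mappings stay valid after the descriptor closes
    return base, size


# Stand-in weight file: row r is one page filled with the byte value r + 1.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for row in range(4):
        f.write(bytes([row + 1]) * PAGE)
    weights_path = f.name

base, size = map_sparse_tensor(weights_path, 4, active_rows=[1, 3])
first_bytes = [ctypes.string_at(base + r * PAGE, 1)[0] for r in range(4)]
libc.munmap(base, size)
os.unlink(weights_path)
```

A reader of the region sees a fixed four-row layout throughout; rows 0 and 2 read as zeros without ever touching the file, while rows 1 and 3 are paged in on demand by the kernel.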

Looking ahead, efficient LLM execution on phones and other resource-constrained devices will likely require both algorithmic innovations and system optimizations. Sparsity, quantization, shared-memory SoCs, and better coordination across various accelerators (e.g., CPUs, NPUs, and GPUs) will all matter. SpMAP is not the full answer, and dense execution may remain attractive as device memory continues to grow. Its current design also primarily targets CPU inference, where virtual memory and page-fault handling are mature enough to transparently support this approach. While modern GPUs already provide virtual memory hardware support, the corresponding software stack is still not ready to handle page faults in the same way, which limits the immediate applicability of this design beyond CPUs. But this paper makes a strong case that new workloads do not always require entirely new system mechanisms. One broader lesson is that classic OS techniques, refined over many years, can still be revisited and repurposed to support emerging workloads such as on-device LLM inference. That perspective is likely to matter well beyond this one paper.

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

Minh K. Quan, Pubudu N. Pathirana (Deakin University)

Public Review

Public review by Mayank Goel, CMU, Pittsburgh, PA

Modern edge devices promise ambient intelligence, environmental sensing, and context-aware interaction, and they increasingly rely on continuous audio recognition. However, the representation learning techniques that power state-of-the-art audio models are fundamentally designed for large-batch, server-scale training environments. This creates a tension between the continuous, resource-constrained nature of edge devices and the computational and memory demands of modern contrastive learning systems.

This paper presents StreamSplit, a framework for enabling adaptive, continuous audio representation learning across heterogeneous edge devices and cloud infrastructure. Rather than treating edge execution and representation learning as separate problems, the work proposes a co-designed system that jointly addresses the algorithmic challenges of small-batch streaming learning and the systems challenges of runtime volatility. The reviewers particularly appreciated three aspects of the work:

First, the paper tackles an important problem at the intersection of edge systems and modern machine learning. Although several prior systems focus on static split computing or model compression, StreamSplit explicitly addresses the mismatch between continuous streaming audio and the discrete large-batch assumptions underlying contrastive learning.

Second, the system introduces an adaptive architecture that couples uncertainty-aware scheduling with representation learning objectives. StreamSplit replaces large negative-sample memory queues with a compact distributional memory model that synthesizes virtual negatives on-device, reducing memory and bandwidth demands while preserving representation quality.

Third, the paper demonstrates efficiency improvements across heterogeneous hardware platforms. The evaluation shows reductions in bandwidth, latency, and energy consumption while maintaining performance close to server-centric approaches.

More broadly, the paper reflects a growing trend in mobile and edge systems research toward integrating representation learning objectives directly into runtime systems decisions. Rather than viewing learning models as fixed workloads, StreamSplit explores how semantic properties of the data itself can influence scheduling and execution.

Surface Characterization with mmWave Signals

Haowen Lai, Zitong Lan, Dongyin Hu, Mingmin Zhao (University of Pennsylvania)

Public Review

Public review by Ruichun Ma, Microsoft Research Asia

Over the past several years, mmWave radar has evolved from a coarse motion detector into a high-resolution imaging system capable of reconstructing 3D scene geometry. These advances have positioned RF signals as a compelling sensing alternative in conditions that challenge cameras, such as poor lighting or visual occlusion. Can we move beyond geometry and recover the intrinsic physical properties of the environment we image? Physical properties such as permittivity and roughness govern how materials interact with EM waves; recovering them would advance RF perception from geometric to semantic scene understanding.

This paper introduces SurfRadar, a fully automatic mmWave system that characterizes surface properties in indoor environments with a rotating radar mounted on a mobile robot. The reviewers were impressed by several aspects of the work. First, SurfRadar introduces a new representation for material sensing: coherent 2D surface reflection images produced by high-resolution RF imaging. This shift provides the spatial degrees of freedom needed to disentangle intrinsic material parameters, such as dielectric constant and roughness, from confounding extrinsic factors like distance and incidence angle. Second, the paper develops a physics-based forward model, inspired by classical BRDFs but tailored to monostatic mmWave radar. Through forward synthesis and backward optimization, the system recovers the parameters that best explain the measured image. Third, the system is fully integrated on a mobile robot with automated surface segmentation, multi-viewpoint fusion, and a voting strategy that resolves co-planar material ambiguities, enabling continuous material mapping during free navigation.

The reviewers appreciated the solid advancement and promising results, but also raised concerns regarding modeling assumptions, motivation, and evaluation coverage. Most of these concerns were addressed through rebuttal and shepherding. A few fundamental challenges remain as promising future directions: extending the isotropic BRDF model to handle anisotropic surfaces with strong directional texture, and improving the treatment of composite, layered, or degraded materials beyond the current single-material assumption.

SurfRadar represents a solid step toward utilizing RF imaging for material characterization. As mmWave systems continue to improve in resolution and coverage, the ability to perceive not just where surfaces are but what they are made of will become increasingly valuable for robotics and environmental sensing.

TimelyLLM: Time-sensitive LLM Serving System for Physical-I/O Limited Agents Best Paper Award Runner-Up Best Artifact Award - Runner Up

Neiwen Ling, Guojun Chen, Anurag Khandelwal, Lin Zhong (Yale University)

Public Review

Public review by JeongGil Ko, Yonsei University, Seoul, Korea

Large Language Models (LLMs) are rapidly moving beyond chat interfaces into the physical world, powering robots, drones, and voice assistants that interact with users and their environments in real time. However, unlike purely digital systems, these agents operate under a fundamental constraint: while LLMs can generate outputs at high speed, the physical world unfolds much more slowly. This mismatch raises a natural question: “are we using LLM computation efficiently when serving real-world, time-sensitive agents?”

This paper observes that existing LLM serving systems largely overlook this gap. Modern serving frameworks are designed to maximize throughput or reduce response latency by continuously generating outputs once a request begins. However, for physical-I/O-limited agents, generating an entire plan immediately is often unnecessary. A robot may take several seconds to execute a simple command, and a voice assistant is bounded by human listening speed. As a result, substantial opportunities exist to better align LLM computation with the pace of real-world execution.

To address this, the authors introduce TimelyLLM, a new serving system that explicitly coordinates LLM generation with agent-side execution. The key idea is to treat generation as a segmented and schedulable process rather than a monolithic one. Instead of producing a full response upfront, TimelyLLM incrementally generates self-contained segments, such as a robot action or a spoken phrase, and temporarily suspends generation once an executable unit is available. While the agent executes this segment, the system reallocates compute resources to other requests. A slack-aware scheduler then determines when to resume generation, prioritizing requests based on how urgently they need additional output to avoid delays.
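The slack-aware prioritization described above can be sketched as a least-slack-first queue. This is a minimal illustration, not TimelyLLM's scheduler: it assumes slack is simply the time remaining before the agent finishes executing its current segment, and the names `PendingRequest`, `slack`, and `resume_order`, as well as the example timings, are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class PendingRequest:
    slack: float                              # seconds until the agent runs out of output
    request_id: str = field(compare=False)    # excluded from ordering


def slack(now, segment_started_at, estimated_exec_time):
    """Time left before the agent finishes executing its current segment."""
    return (segment_started_at + estimated_exec_time) - now


def resume_order(suspended, now):
    """Return request ids in the order generation should resume (least slack first)."""
    heap = [
        PendingRequest(slack(now, started_at, exec_time), request_id)
        for request_id, started_at, exec_time in suspended
    ]
    heapq.heapify(heap)
    return [heapq.heappop(heap).request_id for _ in range(len(heap))]


# (request id, time the current segment started executing, estimated execution time)
suspended = [
    ("robot-arm", 0.0, 5.0),   # 5 s motion started at t = 0 s
    ("voice", 1.0, 1.5),       # 1.5 s phrase started at t = 1 s
    ("drone", 0.5, 3.0),       # 3 s maneuver started at t = 0.5 s
]
order = resume_order(suspended, now=2.0)
```

At `now=2.0` the voice request has 0.5 s of slack, the drone 1.5 s, and the robot arm 3.0 s, so generation resumes for the voice request first. The sketch depends entirely on the execution-time estimates being reasonable, which is one of the open questions the review raises later.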

This design introduces a new perspective on LLM serving. Specifically, rather than optimizing solely for throughput or token latency, TimelyLLM optimizes for time utility, aligning computation with when outputs are actually needed. The system is implemented on top of an existing LLM serving framework and evaluated using real-world workloads from robotic platforms and spoken agents. The results demonstrate that TimelyLLM can significantly improve responsiveness under multi-agent workloads, achieving up to 1.52× higher time utility and reducing agent waiting time by up to 84% compared to a state-of-the-art baseline.

The work is particularly notable for showing that meaningful system-level improvements can be achieved without modifying the underlying LLM. By leveraging execution dynamics and preserving intermediate computation state, TimelyLLM efficiently interleaves multiple requests while maintaining correctness and responsiveness. This execution-aware perspective is well aligned with emerging applications where LLMs serve as the decision-making core of embodied or interactive systems.

At the same time, the approach raises several important questions for future work. TimelyLLM relies on estimating execution time and identifying meaningful segmentation points, both of which may vary across tasks and environments. The use of heuristic-based segmentation, while practical, may require further validation under more diverse or less structured outputs. Additionally, the benefits of segmented generation may depend on the availability of sufficient execution slack, and it remains to be seen how the approach generalizes to highly dynamic or tightly coupled interaction loops. Questions of scalability, including deployment across large batch sizes or multi-GPU settings, also present interesting directions for exploration.

Overall, this paper highlights an important shift in how we think about LLM systems: as components embedded in time-sensitive, real-world processes rather than isolated text generators. By bridging the gap between fast model inference and slower physical execution, TimelyLLM takes a meaningful step toward more efficient and responsive AI systems for the physical world.

TIPS: Thermal Image based Plastics Sorting

Long Duong (University at Buffalo); Charuvahan Adhivarahan (University at Buffalo, State University of New York); Roshan Ayyalasomayajula, Karthik Dantu (University at Buffalo)

Public Review

Public review by Vikram Iyer, University of Washington, Seattle, WA

Plastic waste is one of the most pressing environmental and economic problems of our time, with the vast majority of post-consumer plastic in the United States ending up in landfills or incinerated rather than recycled. A core bottleneck in plastics recycling is the ability to reliably sort mixed streams into their six common resin types (PET, HDPE, LDPE, PVC, PP, PS) at material recovery facilities. Existing automated approaches based on RGB vision and NIR/MIR hyperspectral imaging face limitations in cost, deployability, and accuracy on difficult categories such as black plastics (15% of household plastic waste streams), which absorb NIR light and are effectively invisible to today’s industrial sorters.

This paper presents TIPS, an active thermal imaging system that classifies plastics by inferring their intrinsic thermal properties rather than relying on appearance or spectral reflectance. A sample is heated with a laser and its cooling curve is observed with a thermal camera. The authors develop a physics-informed neural network and training strategies. A two-stage training pipeline (pretraining on PDE-based simulated data, fine-tuning on a sparse set of real measurements) addresses data sparsity, and the system achieves high accuracy. The paper also includes end-to-end experiments on real samples showing high classification accuracies.

The reviewers appreciated the high-value real-world target the paper addresses: black plastics are largely landfilled today, and TIPS’s strong performance on this category gives the work a clear and compelling application. The technical implementation was viewed as a thoughtful bridge between physics-based modeling and the latency requirements of inference-time systems. Reviewers also noted that because TIPS measures intrinsic material properties rather than visual shape, it succeeds on the crushed, torn, and dirty post-consumer shards that are challenging for RGB-based computer vision pipelines. A key question discussed by the reviewers was the length of the measurement. Because the method depends on heating, the current system requires a 60-second measurement cycle per object, which could be a challenge for high-throughput sorting lines. Additionally, reviewers noted limitations in the paper’s evaluation of moving samples that could affect performance in a real-world conveyor belt setting, as well as the potential energy costs of heating, which remain to be explored in future work.

Overall, the committee appreciated the authors bringing this approach and problem into the mobile systems community, demonstrating a stepping stone toward broader deployment of physics-informed sensing in waste management.

Towards Fast and Fully Automatic Drone Mapping

Jingao Xu, Xiangliang Chen, Mihir Bala, Thomas Eiszler, Aditya Chanana, Jan Harkes, Babu Pillai, Mahadev Satyanarayanan (Carnegie Mellon University)

Public Review

Public review by Tara Boroushaki, Yale University, New Haven, CT, USA

Drones are widely used for mapping, with applications ranging from urban planning and infrastructure inspection to scene understanding for security and rescue operations. However, current drone mapping pipelines are often slow and expensive: they require long flights to collect large volumes of data, followed by hours of offline processing to produce a 3D model. If coverage gaps or reconstruction errors are detected, additional guided flights may be needed, further increasing deployment time and cost. As a result, existing approaches are poorly suited for time-critical or resource-constrained scenarios.

This paper addresses the challenge of fast and fully automatic drone mapping by combining vision foundation models (VFMs) with classical SLAM algorithms in an iterative, AI-in-the-loop workflow. The system introduces a hierarchical keyframe selection mechanism to identify the most informative frames for dense reconstruction under GPU memory constraints. Specifically, ORB-SLAM3 is used to select content-novel keyframes, which are further filtered using VGGT’s token encoder to obtain a compact set of keyframes suitable for VFM processing. These keyframes are then fed into the full VGGT pipeline to generate a dense initial map while the drone hovers. To guide subsequent flights, the paper proposes a map quality evaluation method that combines explicit multi-view geometric support with VGGT’s uncertainty estimates for neighboring points, enabling the system to detect holes and low-confidence regions and plan follow-up flights accordingly.

Reviewers highlighted the system’s ability to drastically reduce end-to-end mapping latency, bringing the combined flight and reconstruction time down from hours to only 5–15 minutes. Importantly, this speedup does not come at the expense of coverage, as the resulting maps remain highly complete. The reviewers also appreciated the modular design of the system, which enables future improvements as vision foundation models and other components continue to advance.

Reviewers also noted limitations that point to directions for future work. First, the reconstruction accuracy of the resulting point clouds is limited to the 1–2 meter range, producing maps that capture high-level geometry but lack fine-grained detail. This reflects the trade-off in this system that favors lower latency and improved coverage over precise reconstruction. Second, the current implementation is constrained by a 70-frame keyframe budget, which may be insufficient for very large or visually complex structures. The paper acknowledges these limitations and highlights potential extensions. In particular, the core contributions (the hierarchical keyframe selection, the quality evaluation module, and the iterative mapping framework) are independent of the specific VFM used. While the current prototype relies on VGGT, it can incorporate newer and more advanced models to improve absolute reconstruction accuracy. Similarly, scaling to larger buildings or urban environments may be achieved through stronger GPUs, lighter models, or multi-stage reconstruction approaches such as VGGT-Long that support multi-phase fusion.

In summary, this paper addresses the challenge of slow and manual drone-based mapping by proposing a fast, automated, and iterative mapping framework. By prioritizing responsiveness and completeness over survey-grade accuracy, the work opens up new possibilities for time-critical situational awareness and rapid aerial mapping applications.

Towards Seeing Bones at Radio Frequency

Yiwen Song (Carnegie Mellon University); Hongyang Li (University of Wisconsin-Madison); Kuang Yuan, Ran Bi, Swarun Kumar (Carnegie Mellon University)

Public Review

Public review by Anh Nguyen, University of Montana, Missoula, MT

Wireless sensing has long aspired to achieve X-ray-like vision using radio frequency (RF) signals, leveraging their ability to penetrate occlusions. However, prior work has been fundamentally limited by poor resolution due to long wavelengths, strong attenuation, and complex diffraction effects. This paper takes a significant step toward this vision by introducing OssiSense, a penetration-based microwave imaging system that reconstructs cross-sectional images of bone structures inside flesh at sub-centimeter resolution.

The core technical contribution lies in the design of an end-to-end system that combines (i) a penetration-based synthetic aperture algorithm (PSAR) to overcome large RF aperture limitations and achieve sub-wavelength resolution, and (ii) a learning-based artifact removal pipeline that mitigates diffraction effects via a multi-frequency U-Net model. The system is carefully implemented and evaluated, demonstrating substantial improvements over prior RF imaging approaches in both resolution and reconstruction fidelity. Overall, the work reflects strong systems thinking, integrating modeling, optimization, and machine learning into a cohesive pipeline. The reviewers appreciated the ambition of tackling deep-tissue RF imaging and the rigor of the system design. The formulation of RF propagation as a combination of penetration and diffraction components, along with the two-stage reconstruction pipeline, represents a thoughtful and technically sophisticated approach to a long-standing challenge.

That said, the reviewers raised concerns regarding the scope, evaluation, and positioning of the work. The experimental validation is conducted exclusively on relatively small ex vivo meat models, which enables controlled ground-truth acquisition but limits the ability to assess anatomical variability, motion effects, and in vivo feasibility. Furthermore, the dataset used for the learning-based component is modest in size, raising questions about robustness and generalization, particularly given the reliance on a neural network to correct diffraction artifacts. In addition, the system’s framing relative to clinical imaging modalities such as CT and X-ray requires careful calibration. While the paper positions OssiSense as enabling X-ray-like imaging at RF, the current results establish the system as a proof of concept rather than a direct alternative. The reviewers also noted several practical considerations (e.g., mechanical scanning, resolution bounds, controlled acquisition conditions) that merit further discussion. The authors acknowledge these limitations and highlight potential directions for further raising the bar.

Overall, this paper presents a compelling and well-engineered step toward RF-based internal imaging. While important challenges remain before such systems can be deployed in practical or clinical settings, the work opens up a promising new direction at the intersection of wireless sensing, computational imaging, and machine learning.

TwinFocus: Autofocus for Handheld mmWave SAR Imaging via Physical and Digital Twin References

Yadong Li, Xinghua Sun, Qiancheng Li, Akshay Gadre (University of Washington)

Public Review

Public review by Rajalakshmi Nandakumar, Cornell Tech, NYC, NY

mmWave radar systems have high spatial resolution and can penetrate through materials, making them ideal for imaging objects in applications such as non-destructive testing. However, traditional mmWave imaging systems are built on synthetic aperture radar (SAR) techniques, where the radar is moved with sub-millimeter accuracy by bulky structures to ensure coherent signal integration. Hence, such systems remain unsuitable for practical and ubiquitous deployment. More recently, handheld SAR imaging has focused on overcoming this issue by supplementing the radar with visual inertial odometry (VIO). However, because of the short wavelength, small errors in VIO cause severe misalignment and defocused images.

This paper presents TwinFocus, a reference-guided autofocus framework for handheld SAR imaging with a mmWave radar. The goal is to enable high-resolution mmWave imaging without the accurate position information provided by motion stages. Instead, the authors propose correcting motion-induced phase errors by leveraging a reference object in the scene. TwinFocus estimates phase errors from a reference object and transfers the correction to the target. The paper explores two variants: a physical-twin approach, where a known reference object has a pre-measured SAR template, and a digital-twin approach, where a vision-driven pipeline generates a digital twin and synthetic SAR response. The method compensates for trajectory errors by aligning amplitude-domain features between the reference and target images. Extensive real-world experiments demonstrate robust performance, achieving up to 23.3% and 41.7% improvement in structural similarity for practical SAR imaging using physical and digital twin references, respectively, including NLoS scenarios. The evaluation demonstrates improved image quality under injected tracking errors and handheld trajectories.

The reviewers agreed that the paper solves an important problem in handheld imaging and that the idea of estimating phase errors from a reference object and transferring the correction to the target is elegant and conceptually appealing. They also appreciated the use of a simulated response for a reference object for in-the-wild scenarios as a good step towards applicability of the solution. The reviewers also pointed out some limitations that need to be addressed, which are discussed in the paper. The assumption of sub-millimeter accuracy of the VIO system limits it to high-end systems, and not all objects in the scene can be a good reference.

Overall, this paper takes a good first step in enabling auto-focused images in handheld mmWave imaging systems using existing common reference objects in the scene. This can be potentially improved to enable accurate imaging in different real-world scenarios.

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices Best Paper Award Runner-Up

Xiangyu Li (Institute for AI Industry Research (AIR), Tsinghua University); Chengyu Yin (Beijing Jiaotong University); Weijun Wang (Institute for AI Industry Research (AIR), Tsinghua University); Jianyu Wei (University of Science and Technology of China); Ting Cao, Yunxin Liu (Institute for AI Industry Research (AIR), Tsinghua University)

Public Review

Public review by Ang Li, University of Maryland, College Park, MD

Vec-LUT considers an important problem in ultra-low-bit LLM inference on edge devices. Prior LUT-based kernels perform well for single-token decoding, but their performance degrades under parallel inference because scalar LUTs require repeated, irregular table lookups across tokens, which leads to poor memory-bandwidth utilization. This paper addresses that limitation by introducing a vector-LUT design that enables shared 1-to-N lookups across parallel tokens, and by building a corresponding implementation with careful layout, packing, and cache-aware execution optimizations.
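The 1-to-N lookup idea can be illustrated with a small sketch. It assumes ternary weights in {-1, 0, 1} grouped two at a time, so each group maps to one of nine table entries, and each entry stores one partial sum per token. The group size, the table layout, and the names `build_vector_lut`, `build_luts`, and `lut_matmul` are illustrative assumptions; the paper's actual packing and cache-aware layout are not reproduced here.

```python
from itertools import product

GROUP = 2  # weights covered by one lookup index (assumed for illustration)
PATTERNS = list(product((-1, 0, 1), repeat=GROUP))  # all 3**GROUP ternary patterns
INDEX = {pattern: i for i, pattern in enumerate(PATTERNS)}


def build_vector_lut(act_group):
    """act_group[j][t] is activation j of this group for token t.

    Entry i holds, for every token, the dot product of pattern i with the
    group's activations, so one lookup yields N values at once.
    """
    n_tokens = len(act_group[0])
    return [
        [sum(w * act_group[j][t] for j, w in enumerate(pattern)) for t in range(n_tokens)]
        for pattern in PATTERNS
    ]


def build_luts(activations):
    """Build one vector LUT per activation group; shared by all weight rows."""
    return [
        build_vector_lut(activations[g:g + GROUP])
        for g in range(0, len(activations), GROUP)
    ]


def lut_matmul(weights, luts):
    """Multiply ternary weights (rows x K) by activations (K x N) via lookups."""
    n_tokens = len(luts[0][0])
    outputs = []
    for row in weights:
        acc = [0] * n_tokens
        for gi, lut in enumerate(luts):
            key = tuple(row[gi * GROUP:(gi + 1) * GROUP])
            entry = lut[INDEX[key]]          # one lookup returns N partial sums
            acc = [a + e for a, e in zip(acc, entry)]
        outputs.append(acc)
    return outputs


weights = [[1, -1, 0, 1], [0, 0, -1, 1]]                        # 2 rows, K = 4
activations = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]   # K = 4, N = 3 tokens
result = lut_matmul(weights, build_luts(activations))
```

With a scalar table, the same computation would need one lookup per weight group per token; here each weight group triggers a single lookup whose result is reused across all three tokens, which is the sharing the review describes.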

The reviewers appreciated the paper’s clear motivation, the clean and intuitive core idea, and the substantial engineering effort behind the system design. They also found the evaluation convincing, noting that the paper covers multiple devices and models, reports strong speedups, and includes ablations and breakdowns that help validate the main design choices. Overall, the paper was viewed as a solid systems contribution with clear relevance to efficient edge-side LLM inference.

The reviewers also noted several ways the paper could be improved. In particular, they felt that the paper would benefit from a clearer high-level explanation of how Vec-LUT fits into the end-to-end transformer inference pipeline and from more discussion of when Vec-LUT is preferable to scalar LUT in realistic workloads. They also suggested expanding the discussion of energy efficiency, NPU comparisons, numerical behavior, memory overhead, and the regimes in which the proposed approach may be less effective. These points are natural directions for strengthening the presentation and broadening the study.

Overall, the reviewers agreed that the paper makes a meaningful contribution. It identifies a real and practically relevant bottleneck, proposes a thoughtful and technically interesting solution, and supports that solution with strong implementation and evaluation. While there remain opportunities to further expand the discussion and experimental scope, such open questions are expected for a paper in this area and do not detract from the value of the contribution. The paper merits acceptance.

VLMCache: Efficient On-Device Vision-Language Model Inference

Yinyuan Zhang (Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University); Daliang Xu (State Key Laboratory of Networking and Switching Technology; School of Computer Science, Beijing University of Posts and Telecommunications); Zhiyang Chen (Institute for Artificial Intelligence, Peking University); Chenghua Wang (State Key Laboratory of Networking and Switching Technology; School of Computer Science, Beijing University of Posts and Telecommunications); Ying Zhang (Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University); Mengwei Xu (State Key Laboratory of Networking and Switching Technology; School of Computer Science, Beijing University of Posts and Telecommunications); Gang Huang (Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University)

Public Review

Public review by Juheon Yi, Microsoft, Beijing, China

Vision Language Models (VLMs) are increasingly deployed in real-time on-device applications like UI agents and video analysis. However, achieving low latency on resource-constrained edge devices is highly challenging, mainly due to the heavy prefill phase. While exploiting temporal redundancy of visual inputs (i.e., reusing block-level computations across highly redundant consecutive frames) has been widely proven effective in previous Convolutional Neural Networks, applying it to Vision Transformers (ViTs) is highly challenging. This is because ViTs lack translation invariance: due to their global self-attention mechanism, even a single-pixel mismatch can drastically alter feature maps, thereby invalidating the strict-prefix KV cache reuse mechanisms in LLMs.

This paper presents VLMCache, a pioneering system to enable effective block-level visual prefix caching and reuse for VLM inference. VLMCache introduces a novel image tokenization scheme that semantically disaggregates static background and dynamic foreground visual blocks across consecutive frames. It encodes the unchanged background blocks into a semantically-safe, reusable KV-cache prefix, while only appending and recomputing the dynamically changed foreground blocks. By pairing this with an isolate-then-fuse strategy to independently process stable features and dynamically restore lost cross-block attention and positional coherence, VLMCache fundamentally overcomes the structural rigidity of ViTs, delivering significant inference speedups on mobile devices with minimal drop in model accuracy.
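The reuse pattern described above, a stable prefix of unchanged blocks followed by recomputed changed blocks, can be sketched as follows. This is only an illustration: it substitutes a simple per-block mean absolute pixel difference for the paper's semantic background/foreground disaggregation, and the function name `split_static_dynamic`, the threshold, and the toy frames are hypothetical.

```python
def split_static_dynamic(prev_blocks, curr_blocks, threshold=1.0):
    """Classify block indices as static (cache reusable) or dynamic (recompute).

    Assumption: a block counts as static when its mean absolute pixel
    difference from the previous frame is at most `threshold`.
    """
    static, dynamic = [], []
    for idx, (prev, curr) in enumerate(zip(prev_blocks, curr_blocks)):
        mean_abs_diff = sum(abs(a - b) for a, b in zip(prev, curr)) / len(curr)
        (static if mean_abs_diff <= threshold else dynamic).append(idx)
    return static, dynamic


# Each frame is a list of blocks; each block is a flat list of pixel values.
prev_frame = [[10, 10, 10, 10], [20, 20, 20, 20], [30, 30, 30, 30], [40, 40, 40, 40]]
curr_frame = [[10, 10, 10, 10], [20, 21, 20, 20], [90, 80, 70, 60], [40, 40, 40, 40]]

static, dynamic = split_static_dynamic(prev_frame, curr_frame)

# Static blocks come first so their cached entries form a reusable prefix;
# dynamic blocks are appended and are the only ones that need recomputation.
token_order = static + dynamic
recompute_fraction = len(dynamic) / len(curr_frame)
```

In this toy frame only one of four blocks changes, so three quarters of the visual prefix could in principle be served from cache. Restoring cross-block attention and positional coherence after such a reordering is the part the paper's isolate-then-fuse strategy addresses, and it is not modeled here.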

The reviewers agreed that the paper tackles a timely and important problem of accelerating on-device VLM inference, and appreciated the solid system design. The reviewers particularly highlighted the conceptual novelty of the semantic disaggregation and re-fusion mechanism, which effectively enables ViTs to exploit temporal visual locality to deliver significant reductions in Time-to-First-Token (TTFT). There were also some suggestions for further improvement. First, it would help to extend VLMCache to moving camera scenarios which may require motion vector matching across consecutive frames. Additionally, reducing the training overhead of the background segmentation module would greatly enhance practical deployability.

Overall, VLMCache makes a meaningful systems contribution to the community. The paper effectively identifies the critical challenges of applying visual caching to ViTs and proposes a solid solution. This work lays a strong foundation for exciting future research, paving the way for highly responsive, real-time multimodal applications such as on-device UI agents and video analytics.

WheatScout: A Handheld Analyzer for Pre-Harvest Wheat Protein and Moisture Monitoring

Shanwen Chen, Haiyan Hu (HKUST); Meng Yang (Shandong Agricultural University); Qian Zhang (HKUST)

Public Review

Public review by Qin Lv, University of Colorado Boulder, Boulder, CO USA

As one of the most important food sources globally, wheat is planted throughout the world, mainly by smallholder farmers with limited resources. Timely harvesting is critical for farm profitability, as it impacts both yield preservation and grain quality. Grain protein and moisture content change quickly within a narrow window of approximately 7-10 days, and pre-harvest monitoring is crucial to optimize harvesting time and market value. However, existing solutions require expensive laboratory spectrometers or threshed kernels. A handheld device for wheat pre-harvest monitoring that is effective, affordable, non-destructive, and field-deployable has the potential to improve harvest decision making with significant global impact.

This paper presents WheatScout, a hardware-algorithm co-designed system that aims to support quantitative, laboratory-level protein and moisture analysis using whole wheat ears while reducing the cost for in-field use by smallholder farmers. The work tackles three major challenges. First, variations in moisture content cause large and non-linear shifts in the signal, obscuring the weaker signal from protein. Second, morphological differences of wheat varieties change light scattering, adding noise to spectral patterns based on chemical composition. Third, in-situ use of the handheld device subjects the system to diverse environments, introducing significant and unpredictable noise to the spectral signal. The authors propose an optical-algorithm collaborative method for scattering correction, which integrates a fixed-geometry probe with double-sided diffusion optics and a multi-scale CNN to mitigate geometric artifacts and stabilize spectral baselines. In addition, for water compensation and chemistry-focused regression, the authors have developed a moisture-first spectral decoupling framework using conditional GANs and dual-domain adversarial training to reconstruct latent dry-basis features from wet-ear measurements. In evaluations with 600 real-world wheat ear samples collected across 10 different varieties during a 15-day harvest window, WheatScout achieves lab-grade performance with significant reduction in cost.

The reviewers appreciated the authors’ efforts in tackling a challenging mobile sensing problem with potentially significant real-world impact. Driven by an in-depth investigation of the domain-specific challenges, the proposed hardware-algorithm codesign of the mobile sensing system is well thought out with solid execution. The in-field data collection and system evaluations are important in demonstrating the effectiveness of the proposed solution for real-world use. The reviewers noted the need to provide more details and rationale regarding the 3D enclosure design and the use of cGAN instead of simpler methods. Further investigation and discussion of system robustness in real-world use and the impact of crop color are also suggested.

Mobile sensing is an active line of research in the community and can have significant real-world impact. The integrated investigation of innovative sensing systems, applications, and services calls for convergent research and close collaboration with domain experts. The focus on real-world deployability and robustness is also of high importance.

Wideband Low-complexity High-speed 5G NR Backscatter

Zhenzhe Lin (George Mason University); Yoon Chae (University of Texas at Arlington); Panneer Selvam Santhalingam (Brooklyn College, CUNY); Mingyo Jeong, Parth Pathak (George Mason University)

Public Review

Public review by Akshay Gadre, University of Washington, Seattle, WA

This paper presents a novel design of a cellular-technology-based backscatter tag that operates with 5G New Radio (NR) base stations. The challenge in translating the prior state of the art to the FR2 regime is the significantly larger bandwidth of the radios and the smaller tolerance to synchronization errors, both of which must be handled within the small energy and complexity budget of backscatter systems. The authors develop a dual-task transformer which enables the system to simultaneously perform channel estimation and frame synchronization at large bandwidths. This transformer is then pruned to enable operation at the low SWaP regimes of a backscatter tag.

The solution is implemented and evaluated using a 28 GHz prototype connected to a 4×4 patch array to receive and backscatter signals. The system is extensively evaluated in prototype scenarios both indoors (three settings) and outdoors (two settings), demonstrating a clear improvement in throughput and BER compared to baseline systems. There are detailed ablation studies in the manuscript that demonstrate the generalizability of the design.

This paper presents a clear push towards organic evolution of backscatter tags for applications in urban and indoor settings leveraging the ubiquitous cellular signals. Potential applications of this technology can enable live tracking, sensor backhaul, and even robotic deployments for delivery. The proposed system can be further enabled by development of an RFIC which achieves much smaller power numbers (∼ µW) than those presented in the current work (mW regime). The results clearly show that WiNB enables 71 Mbps throughput with a BER of ∼ 10⁻⁴, which is a significant improvement over the baselines.

Potential Future Exploration Avenues: Given the transformer-based dual-task optimization presented in the work, there is a clear concern about the generalization of the architecture in practical deployment. While the authors demonstrate generalizability in different contexts, significant further evaluation studies will need to be performed prior to translation-to-practice. Moreover, developing an explainable algorithm based on the learnings of the transformer may enable formal modeling of BER and SNR regimes for transformer-based backscatter architectures. Another potential avenue would be to develop real-world applications leveraging the above system design. Finally, the results in the manuscript demonstrate sensitivity to mobility that will require more analysis and solution development to overcome.

WiMirror: Towards 802.11-Compliant RIS Control Using Compressed Beamforming Reports

Youdong Wang, Chenhao Wu (The Chinese University of Hong Kong); Jun Huang (City University of Hong Kong); Guoliang Xing (The Chinese University of Hong Kong)

Public Review

Public review by Parth Pathak, George Mason University, Fairfax, VA, USA

Reconfigurable Intelligent Surfaces (RIS) have emerged as a promising solution for combating blind spots and interference in dense indoor environments by reshaping wireless propagation, thereby improving coverage and capacity. However, a major barrier to RIS adoption is the need for complex control, which requires coherently configuring the phase shifts of RIS elements to adaptively beamform the reflected signals. Existing solutions based on RSSI provide only a coarse approximation, while those based on CSI (Channel State Information) are limited to proprietary WiFi chipsets, since CSI access is not standardized in WiFi protocols. A potential solution is to leverage the Compressed Beamforming Report (CBR), in which the receiver sends compressed beamforming feedback to the transmitter to facilitate beamforming, as it is available on all WiFi chipsets through built-in protocol support.

This paper presents WiMirror, a CBR-based, WiFi-compliant RIS control system. WiMirror addresses two challenges. First, CBR only provides a compressed channel representation and cannot be directly translated into RIS element phase shift configurations without a brute-force search. WiMirror leverages the underlying channel structure, which remains relatively stable, to calculate multipath magnitudes under new RIS configurations in a computationally efficient manner. Second, WiMirror addresses the random phase errors introduced by the CBR by developing a novel RIS control with in-depth modeling of path magnitudes and RIS reflection angles without relying on accurate phase information.

The reviewers agreed that the CBR-based RIS control proposed by WiMirror is well-motivated and practically valuable compared to CSI-based solutions. The reviewers also appreciated how the authors handled CBR’s phase error issue by building the estimator based on magnitude rather than the conventional phase-based pipeline. The analytical modeling of the CBR, the development of a protocol-compliant working system, and the evaluation of it through extensive experiments further strengthen the work. The reviewers pointed out several concerns that were addressed in the revision. These included a need for explanation of how the AP-RIS-reflector-client paths contribute to the channel and how the paths are matched across configurations. Furthermore, the end-to-end latency, processing time, and overhead of sounding exchange were not adequately studied. The authors addressed these issues in the revision and also added new results on performance under dynamic scenarios involving environmental changes and client mobility.

Overall, the paper is an important step toward realizing protocol-compliant real-time RIS control through beamforming feedback, improving deployability, with additional follow-up work needed to scale the system (e.g., utilizing CBR from multiple clients) and further reduce complexity in practice.

ZA-SLAM: Leveraging Vision-Language Model for Zero-Shot Acoustic SLAM

Zhuochen Yu, David K.Y. Yau (Singapore University of Technology and Design); Yijie Shen (ShanghaiTech University); Xiaoran Fan (Google); Tao Chen (Independent Researcher); Qun Song (City University of Hong Kong)

Public Review

Public review by Qiang Yang, University of Cambridge, Cambridge, UK

Recent work on indoor localization has made steady progress across sensing modalities, yet many existing systems still depend on environment-specific data collection and retraining, which limits scalability in practice. This paper takes a different approach by introducing ZA-SLAM, a zero-shot acoustic SLAM system that leverages vision-language models (VLMs) to generalize across unseen environments. By aligning acoustic features with visual representations, the system transfers the generalization capability of VLMs into the acoustic domain and removes the need for per-environment retraining.

What stands out in this work is the way it combines ideas from multimodal learning and mobile sensing into a coherent system design. The paper presents a well-integrated pipeline, including acoustic–visual feature alignment, semantic-guided image selection, and trajectory-level validation for loop closures. These components are thoughtfully designed to address practical challenges such as noisy inputs and false positives, and together enable robust performance in real-world settings.

The reviewers noted that there are several directions that could further sharpen the contribution. In particular, making the role of cross-modal alignment and representation design more central in the narrative could help distinguish the work from a straightforward application of VLMs. In addition, while the zero-shot capability is appealing, some discussion of how the system behaves under broader real-world variations, such as different motion dynamics, deployment conditions, or scaling to more diverse environments, would provide a clearer picture of its practical scope.

This work marks a promising step toward more generalizable sensing systems and highlights the potential of combining foundation models with domain-specific system design. It also illustrates how advances in one modality or domain (e.g., vision-language models) can be thoughtfully adapted to another (e.g., acoustic sensing), offering a useful perspective for future cross-modal system design.

Emerging Ideas Track

Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI

You Rim Choi, Subeom Park, Hyung-Sin Kim (Seoul National University)

Public Review

Public review by Maria Gorlatova, Duke University, Durham, NC

This Emerging Ideas track paper opens with an insightful discussion that sets the tone for the rest of the work: while modern AI systems have evolved along a computation-centric trajectory, focusing on better models, more data, and optimized inference pipelines, biological perception is organized quite differently. In biological systems, sensing is not passive but actively regulated through layered mechanisms that shape what information is captured before higher-level reasoning takes place. This observation motivates the paper’s central argument that physical AI systems may benefit from rethinking the role of sensing.

A considerable body of work in mobile and edge computing systems has focused on optimizing inference through model compression, scheduling, and edge and cloud offloading, while treating sensing as a fixed input pipeline. This paper instead argues for a principled sensor-first perspective, where sensing is an active and adaptive component of the system. To this end, it introduces Artificial Tripartite Intelligence (ATI), a bio-inspired architectural framework that decomposes the system into layers responsible for reflexive control, continuous sensor calibration, and hierarchical inference. Together, these layers form a unified closed-loop perception stack in which sensor control and inference are tightly coupled. The paper instantiates this design in a vision-based prototype, illustrating how adaptive sensing and structured coordination between on-device and remote inference can improve performance under challenging conditions.

The reviewers agreed that this paper is a good fit for the Emerging Ideas track of the conference. Among other things, the reviewers highlighted the paper’s clear and compelling architectural vision, noting that mapping the human visual system to a physical AI design offers an intuitive and thought-provoking perspective. They also appreciated this decomposition of intelligence into interacting components, describing it as a refreshing and principled direction that resonates with the mobile sensing community. The included use case was also well received, as it demonstrates the practicality of the approach. Reviewers expressed a desire to see stronger evidence of generalization beyond the presented vision-based prototype, particularly across additional sensing modalities and use cases. During shepherding, the authors revised the paper’s discussion to more explicitly articulate how the proposed architecture extends to other modalities (audio, tactile, and proprioceptive sensing), partially addressing this concern while leaving broader empirical validation as an opportunity for future work.

ConCord: Human-in-the-Loop, Cooperative Robot Exploration

Mayooran Thavendra, Akhitha Manjitha, Kasthuri Jayarajah (New Jersey Institute of Technology)

Public Review

Public review by Inseok Hwang, POSTECH, Pohang, South Korea

This paper contributes to the emerging area of human–robot collaborative exploration by proposing a system, ConCord, which features a new collaborative model that aims to automatically complement human work in parallel, as well as to supplement corrective support. To do so without explicit commands, ConCord integrates human sensing signals (e.g., gaze, pose, depth) into robot decision-making. By treating humans as active, asymmetric agents whose behaviors implicitly guide robot exploration, the work outlines a system architecture that moves beyond traditional robot-centric or teleoperation-based approaches toward more tightly coupled collaboration.

The problem addressed is both timely and important. As robotics continues to move from controlled environments into unstructured, real-world settings, effective coordination between humans and robots becomes critical. Enabling robots to leverage implicit human signals, rather than requiring explicit commands, has the potential to reduce cognitive load and unlock more natural and scalable collaboration paradigms.

Reviewers highlighted the substantial engineering effort behind the system, including an end-to-end prototype, integration of AR/HMD sensing with robotics frameworks, and the development of a simulation environment for human-in-the-loop experimentation. The work also demonstrates the feasibility of combining multimodal human sensing with robotic exploration, and takes initial steps toward evaluating such systems in both simulated and real-world settings.

At the same time, this work opens up several challenges for future exploration. The current system relies mainly on relatively simple proxies of human behavior rather than deeper modeling of human intent or cognition. The evaluation is limited in both scale and task diversity, calling for future work to expand this initiative in breadth and depth.

Overall, this work identifies a promising direction for integrating implicit signals of human awareness and intent into collaborative robotic systems and provides an initial realization of the system. Both controlled studies and real-world deployment offer encouraging evidence of the system’s potential. This work lays a foundation for future research on deeper modeling of human factors, broader task domains, and more effective forms of asymmetric human-robot collaboration.

IoTGen: Towards LLM-driven IoT Hardware Generation

Qinpei Luo (University of California, San Diego); Ruichun Ma (Microsoft Research); Xinyu Zhang (University of California, San Diego); Lili Qiu (The University of Texas at Austin)

Public Review

Public review by Qiang Yang, University of Cambridge, Cambridge, UK

Over the past few years, advances in large language models (LLMs) have significantly accelerated software development by enabling users to translate high-level intent into executable programs. In contrast, hardware design, particularly PCB design for IoT systems, remains a manual and expertise-intensive process, creating a gap between software flexibility and hardware realization. This paper presents IoTGen, an LLM-driven system that takes an important step toward bridging this gap by enabling end-to-end PCB generation from natural language specifications.

The reviewers found this direction timely and compelling, and appreciated the well-designed end-to-end system that integrates semantic component retrieval, schematic generation, and PCB layout into a unified workflow. In particular, the introduction of semantic-rich programming abstractions provides a useful interface for structuring hardware design as a generative process. The system is well presented and demonstrates promising feasibility through both quantitative evaluation and fabricated case studies.

The reviewers believe the work can be further strengthened along several natural extensions. For example, while the system shows strong structural performance, ensuring correctness remains critical in the context of hardware generation. Unlike software, errors in PCB design can lead to costly fabrication failures; thus, it would be valuable to better discuss mechanisms for correctness assurance, potential validation steps, or practical safeguards in the workflow. Relatedly, although a human-in-the-loop design is incorporated, it is important to discuss how it can best support users with different levels of expertise. In addition, exploring more complex and diverse hardware designs, such as richer sensor configurations, would further demonstrate the generality of the approach. Finally, as LLM capabilities continue to evolve rapidly, this line of work is well positioned to benefit from and contribute to these advances.

Overall, this paper represents a compelling step toward LLM-driven hardware design and opens up an exciting new research direction at the intersection of AI and systems. More broadly, it also offers useful insights for the growing body of LLM-based application research, highlighting the importance of articulating system-level novelty and core technical contributions, such as domain-specific abstractions and workflow design, beyond simply using LLMs as a generation tool.

OrbitTransit: Traffic Delivery and Diffusion for Earth Observation via Satellite Mobility

Haoyuan Zhao, Long Chen, Yi Ching Chou, Hao Fang, Jiangchuan Liu (Simon Fraser University)

Public Review

Public review by Zehua Sun, National University of Singapore

Earth Observation (EO) satellites generate massive data volumes that must be offloaded through ground stations (GSs), which have become congested due to uneven global deployment constrained by geographic, political, and budgetary factors. While inter-satellite links (ISLs) can forward traffic to alternative GSs, existing approaches suffer from prolonged routing paths and unsustainable energy consumption due to biased GS distribution. Delay-tolerant EO data presents an opportunity to leverage satellite mobility for pickup-carry-offload (PCO) delivery, yet prior PCO methods overlook orbital resource contention, leaving satellites underutilized.

This paper proposes OrbitTransit, a hybrid PCO-ISL framework addressing these challenges through three key components. First, the orbit-as-node (OAN) abstraction simplifies the dynamic satellite topology by treating orbits as logical nodes, reducing modeling complexity while preserving optimality. Second, OAN-based traffic diffusion redistributes tasks across neighboring orbits via limited ISL hops to balance GS loads and avoid congestion caused by deployment bias. Third, contention-avoidant delivery coordinates PCO and ISL usage to prevent onboard resource conflicts while ensuring deadline-aware delivery.

The reviewers appreciated the well-motivated problem analysis, the elegant OAN framework, and comprehensive evaluation across multiple constellations. The idea of ISL-enabled EO satellites represents a forward-looking vision. The trace-driven study effectively exposes limitations of existing methods. OrbitTransit achieves 47.16% battery reduction and 1.09× fewer failures compared to baselines. However, the evaluation relies entirely on simulation without real traffic traces or in-orbit measurements. The control plane assumes near-global state visibility, yet the impact of telemetry staleness on performance is not thoroughly evaluated.

As integrated EO-communication systems emerge, OrbitTransit decouples Earth observation capacity from ground infrastructure constraints to extend satellite operational lifespans through energy-efficient delivery, which opens promising directions across diverse application scenarios.

Phonotonos: Through-Skin Ultrasonic Blood Flow Sensing Using Smartphones

Shirui Cao (University of Massachusetts Amherst); Jie Xiong (Nanyang Technological University); Riishav Guptaa (University of Maryland, Baltimore County); Sunghoon Ivan Lee (UMass Amherst); Jeremy Gummeson (University of Massachusetts Amherst); Dong Li (University of Maryland, Baltimore County)

Public Review

Public review by Ahmed Allam, University of Cincinnati, Cincinnati, OH

Smartphones have steadily expanded their role in medical diagnosis over the past decade. Camera-based pulse oximetry, accelerometer-driven gait analysis, and add-on electrode kits for ECG have moved measurements that once required clinical instruments into apps that anyone can use at home. One conspicuous gap remains in this trend: blood flow sensing. Doppler ultrasound, the standard modality for the velocity waveforms and indices used in cardiovascular screening, depends on dedicated transducer arrays operating in the megahertz range, beamforming hardware to focus the acoustic energy on a single vessel, special transducer designs to penetrate skin and tissue, and trained sonographers to align the probe correctly. Reproducing any of this on an unmodified smartphone seems implausible. The device offers none of these capabilities, and its acoustic signal chain was never designed for coherent sensing.

This paper takes an important step toward closing that gap. It introduces Phonotonos, a software-only system that converts a commodity smartphone into a through-skin Doppler sensor capable of recovering clinically relevant blood flow indices, with no added hardware. The reviewers were impressed by three particularly powerful features of the system. The first is the core physical insight that the asymmetric transmit-receive geometry of a smartphone, combined with single-bin DFT phase tracking, creates an effective sample volume analogous to that of a clinical scanner, recovering velocity waveforms despite the absence of explicit beamforming. The second is a carefully engineered signal processing pipeline that confronts the realities of smartphone hardware head-on, including LMS-based nonlinearity cancellation to suppress speaker and microphone distortion and a link-budget and channel analysis that grounds the design in the underlying acoustics. The third is a triple-modality fusion of Doppler ultrasound, arterial sound, and IMU data that enables artery localization and motion rejection for non-expert users, paired with rigorous multi-stage validation spanning Monte Carlo simulations, phantom experiments, multiple devices and arteries, and a 20-subject IRB-approved human study showing accuracy comparable to a commercial Doppler reference across four standard indices.
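For readers unfamiliar with the term, the following is a minimal sketch of what single-bin DFT phase tracking generally means, not the Phonotonos implementation. It evaluates the DFT at one carrier frequency per frame, unwraps the resulting phase across frames, and reads a Doppler offset from the phase slope. All function names, the 18 kHz carrier, the 48 kHz sample rate, and the 10 ms frame length are assumptions chosen for illustration.

```python
import cmath
import math


def synth_tone(freq_hz, sample_rate_hz, num_samples):
    """Synthetic received signal: a real cosine at a single frequency."""
    return [math.cos(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(num_samples)]


def single_bin_dft(frame, freq_hz, sample_rate_hz):
    """Complex DFT coefficient of `frame` at one frequency only (no full FFT)."""
    step = -2j * math.pi * freq_hz / sample_rate_hz
    return sum(x * cmath.exp(step * n) for n, x in enumerate(frame))


def track_phase(signal, carrier_hz, sample_rate_hz, frame_len):
    """Unwrapped carrier-bin phase (radians) over non-overlapping frames."""
    phases, offset, prev_raw = [], 0.0, None
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        coeff = single_bin_dft(frame, carrier_hz, sample_rate_hz)
        # Remove the carrier's own phase advance up to this frame, so an
        # unshifted carrier yields a constant phase from frame to frame.
        coeff *= cmath.exp(-2j * math.pi * carrier_hz * start / sample_rate_hz)
        raw = cmath.phase(coeff)
        # Unwrap: undo jumps larger than pi between consecutive frames.
        if prev_raw is not None:
            if raw - prev_raw > math.pi:
                offset -= 2 * math.pi
            elif raw - prev_raw < -math.pi:
                offset += 2 * math.pi
        prev_raw = raw
        phases.append(raw + offset)
    return phases


def doppler_shift_hz(phases, frame_duration_s):
    """Average frequency offset from the carrier implied by the phase slope."""
    total_phase = phases[-1] - phases[0]
    elapsed_s = frame_duration_s * (len(phases) - 1)
    return total_phase / (2 * math.pi * elapsed_s)
```

As a usage illustration under the same assumptions, `track_phase(synth_tone(18005.0, 48000, 24000), 18000.0, 48000, 480)` followed by `doppler_shift_hz(phases, 0.01)` should return a value close to the 5 Hz offset that was synthesized. A real pipeline would also have to handle noise, clutter from static reflections, and hardware nonlinearity, which this sketch ignores.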

Phonotonos demonstrates a fundamentally new sensing capability for commodity smartphones, but it does not yet establish clinical readiness, which is appropriate for an Emerging Ideas contribution. Because the insonation angle between the smartphone and the artery is unknown, the system recovers scaled velocity waveforms rather than absolute flow, leaving full quantitative measurement open. The human study is limited to healthy adult volunteers, so diagnostic utility in pathological populations and the false alarm behavior of any downstream screening model remain to be established. Longitudinal use, cross-population generalization across age, BMI, and skin properties, and integration with disease-specific decision support are all natural next steps that the paper highlights as future work.
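The dependence on the insonation angle can be seen from the textbook Doppler relation for a co-located transmitter and receiver, shown here as a simplified illustration. It is not the paper's own model, which involves an asymmetric transmit-receive geometry. The symbols are assumptions for this sketch: f_0 is the transmitted frequency, f_d the measured Doppler shift, c the speed of sound in tissue, v the blood velocity, and θ the angle between the beam and the flow direction.

```latex
f_d = \frac{2 f_0 \, v \cos\theta}{c}
\qquad\Longrightarrow\qquad
v \cos\theta = \frac{c \, f_d}{2 f_0}
```

With θ unknown, only the product v cos θ can be recovered from f_d, which is a constant-scaled copy of the true velocity waveform as long as the angle stays fixed during the measurement.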

Acoustic sensing on smartphones has produced a steady stream of compelling results over the past decade, from gesture tracking to physiological monitoring. By pushing this line of work into through-skin Doppler imaging of blood flow, this paper opens a new frontier for what the audio subsystem of a commodity smartphone can do, and it lays a credible foundation for accessible, at-home cardiovascular screening that until now has required dedicated medical hardware.

Towards Practical Metabolic Sensing with Wearable TEGs

Bosco Nkurunziza, Antoine Nzeyimana, Luke Arieta, Michael Busa, Jeremy Gummeson (University of Massachusetts Amherst)

Public Review

Public review by Rajesh Balan, Singapore Management University, Singapore

Metabolic rate measures the amount of energy that your body consumes to perform any required task or action. The most commonly used form of this is the Basal Metabolic Rate (BMR), which measures the minimum amount of energy needed by your body when at rest. That is, it is the minimum amount of energy needed to keep all vital organs and circulation functioning while you are in a state of complete rest. Knowing your BMR is important for understanding the minimum energy (in calories) that your body needs every day to avoid going into deficit. Unfortunately, it is not easy to accurately measure your metabolic rate, and thus most people just use estimates based on their age, weight, and height.

This paper presents a novel way to collect metabolic rate information using the heat produced by our body as a sensing signal. The authors leverage existing harvesting-optimised commercial wrist-mounted thermoelectric generators (TEGs), which already convert body heat into electricity, as sensors that also measure metabolic rate. They show how to use existing hardware, with minimal hardware modifications, along with new mathematical models and software to translate the heat captured by the TEG into metabolic rate. The paper carefully explains all the steps needed to turn commercial TEGs into metabolic sensors. In addition, they present detailed results comparing the performance of their approach against using commercial smartwatches as sensors.

The reviewers appreciated the deep and thoughtful exploration of how existing commercial TEGs could be reused as metabolic sensors. We also appreciated the full systems approach where everything needed to achieve this new capability is clearly and carefully presented. The biggest concern the reviewers had was on the performance of the solution compared to the baselines. However, as an emerging idea, we believe that this work is still valuable in opening up exciting new possibilities for health sensing and we look forward to seeing this approach used in future systems.

ACM MobiSys 2026

June 21 - 25, 2026 • Cambridge, UK
