Institute for Network Sciences and Cyberspace

Name: Long Wang

Title: Tenured Associate Professor

Phone: +86-10-62603270

Email: longwang@tsinghua.edu.cn

Homepage: https://longwang1.github.io

Currently leads the REASONS lab (https://reasons-lab.github.io).

Education

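Ph.D. (Electrical and Computer Engineering), University of Illinois at Urbana-Champaign, USA, 2010
M.S. (Computer Science), University of Illinois at Urbana-Champaign, USA, 2002
B.S. (Computer Software), Peking University, China, 2000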

Work Experience

Tenured Associate Professor, Tsinghua University, China, Feb. 2021 – present

Researcher / Senior Researcher, IBM T. J. Watson Research Center, USA, Dec. 2010 – Jan. 2021

Professional Service

Has served for many years on the program committees and organizing committees of IEEE DSN, ISSRE, and other international conferences.

Research Areas

Trustworthy computing; secure and reliable systems

Distributed systems

Cloud computing

System modeling

System measurement and evaluation

Big data analytics

Machine learning

Research Overview

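Spent ten years in research at the IBM T. J. Watson Research Center, studying the monitoring and dependability of complex distributed systems. Led multiple research projects on IBM cloud systems and headed the security and reliability department of IBM Health Cloud, which comprised several teams for system reliability services and system security services. The research results have been applied and deployed in several IBM cloud systems, including SCE, CMS, and Watson Health, serving 500+ customers and 20,000+ users. Among them, the project "Disaster Recovery for the Enterprise Private Cloud CMS" received the IBM Outstanding Achievement Award. For outstanding academic contributions to cloud resiliency, was elected an IEEE Senior Member, and served as an adjunct professor at North Carolina State University, teaching a course on cloud resiliency.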
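(1) Was the first to propose a reference architecture for cloud resiliency. Based on this reference architecture, developed multiple resiliency services for complex cloud systems, including automated platform-wide disaster recovery; efficient backup and system restore that preserve data consistency on multi-cloud, multi-organization platforms; component maintenance and scheduling under high-availability requirements; improved availability of orchestration and automation in hybrid clouds; and long-term continuous monitoring of cloud service availability. Analyzed in depth and clearly articulated the impact and significance of resource oversubscription for cloud system design and operations, and proposed solutions to the performance degradation that resource oversubscription can cause.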
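(2) In-depth study of checkpoint/recovery behavior in complex real-world systems. High-frequency VM checkpointing provides efficient snapshots but performs poorly in real cloud systems because it does not account for the effects of error propagation. Measured error propagation and proposed algorithms to cope with it. Built accurate models to analyze the complex behavior of coordinated checkpointing on supercomputers, covering checkpoint protocols, the behavior of compute nodes and I/O nodes, errors during checkpoint or recovery, and burst errors, and carried out a detailed analysis of the stability and scalability of coordinated checkpointing.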
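(3) Proposed REPTrace, a general-purpose deep monitoring and measurement technique that constructs complete end-to-end request execution paths for any system that processes service requests and covers a complete set of execution scenarios. It does not depend on source code or documentation, and system components may come from different vendors. Experiments show that this tracing technique effectively detects errors in service execution on complex platforms and automatically extracts knowledge about how the system operates, including platform-wide structure diagrams of multi-tier systems and important undocumented features.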
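(4) Precise fault injection and formal verification. Designed a dedicated language for describing execution scenarios and developed a runtime engine that unfolds system scenarios and injects faults according to scripts written in that language. This work greatly increases the scenario coverage of fault injection and can also be used for root cause analysis of failures. Used formal methods to reason logically about Linux kernel behavior and found corner scenarios of kernel hangs, demonstrating for the first time that formal methods are effective even for systems as complex as the Linux kernel.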

Awards and Honors

IBM Project Outstanding Achievement Award (2015)

IEEE Senior Member (2016)

IBM Patent Invention Achievement Award (2019)

IEEE ISSRE Best Industry Paper Nominee (2018)

Academic Achievements

Papers (journals, conferences, etc.)

[1] “Failure Diagnosis for Distributed Systems using Targeted Fault Injection,” Cuong Pham, Long Wang, Byung Chul Tak, Salman Baset, Chunqiang Tang, Zbigniew Kalbarczyk, Ravishankar K. Iyer, IEEE Transactions on Parallel and Distributed Systems (TPDS), Volume: 28, Issue: 2, Feb. 2017.

[2] “VM-μCheckpoint: Design, Modeling, and Assessment of Lightweight In-Memory VM Checkpointing,” Long Wang, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Arun Iyengar, IEEE Transactions on Dependable and Secure Computing (TDSC), vol.12, no. 2, 2015.

[3] "How Far Have We Come in Detecting Anomalies in Distributed Systems? An Empirical Study with a Statement-Level Fault Injection Method", Yong Yang, Yifan Wu, Karthik Pattabiraman, Long Wang, Ying Li, IEEE International Symposium on Software Reliability Engineering (ISSRE), 2020.

[4] "Scheduling Physical Machine Maintenance on Qualified Clouds: What if Migration is not Allowed?," Long Wang, Harigovind V Ramasamy, Richard Harper, The IEEE Int’l Conference on Cloud Computing (CLOUD), 2020.

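[5] “关键计算系统的合规性问题—从参考架构到案例分析” [Compliance Issues of Critical Computing Systems: From Reference Architecture to Case Studies], Long Wang, in 2019 China Cyberspace Security Frontier Technology Development Report, System Security volume, Posts & Telecom Press, 2020.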

[6] "LADRA: Log-based abnormal task detection and root-cause analysis in big data processing with Spark," Siyang Lu, Xiang Wei, Bingbing Rao, Byungchul Tak, Long Wang, Liqiang Wang, Future Generation Computer Systems Journal (FGCS), Volume 95, June 2019.

[7] "System Restore in a Multi-Cloud Data Pipeline Platform," Long Wang, Harigovind V Ramasamy, Valentina Salapura, et al., The Int’l Conference on Dependable Systems and Networks (DSN), Industry Track, 2019.

[8] "Transparently Capturing Execution Path of Service/Job Request Processing," Yong Yang, Long Wang, Jing Gu, Ying Li. International Conference on Service-Oriented Computing (ICSOC) 2018. Lecture Notes in Computer Science, vol 11236.

[9] "KEREP: Experience in Extracting Knowledge on Distributed System Behavior through Request Execution Path," Jing Gu, Long Wang, Yong Yang and Ying Li, Best Paper Nominee, IEEE International Symposium on Software Reliability Engineering (ISSRE), Industry Track, 2018.

[10] “DevOps Practices for Building Secure and Resilient Cloud-Native Web Applications,” Long Wang, Harigovind V Ramasamy, Richard Harper, Ruchi Mahindru, The Int’l Conference on Dependable Systems and Networks (DSN), Tutorial, 2018.

[11] "Planning, Building, and Managing Resiliency on the Cloud," Harigovind V Ramasamy, Long Wang and Richard Harper, ACM Symposium on Operating Systems Principles (SOSP) Tutorial, 2017.

[12] "Log-based Abnormal Task Detection and Root Cause Analysis for Spark," Siyang Lu, Bingbing Rao, Xiang Wei, Byungchul Tak, Long Wang, Liqiang Wang, IEEE International Conference on Web Services (ICWS), 2017.

[13] “Providing Resiliency to Orchestration and Automation Engines in Hybrid Cloud,” Long Wang, Harigovind V Ramasamy, Alexei Karve, Richard Harper, The Int’l Conference on Dependable Systems and Networks (DSN), Industry Track, 2017.

[14] “Predicting Misconfiguration-induced Unsuccessful Executions of Jobs in Big Data System,” Hongyan Tang, Ying Li, Long Wang, Jing Gu, Zhonghai Wu, IEEE Computer Society Signature Conference on Computers, Software and Applications (COMPSAC), 2017.

[15] “Disaster Recovery for Cloud-Hosted Enterprise Applications,” Long Wang, Richard Harper, Ruchi Mahindru, Harigovind V Ramasamy, The IEEE Int’l Conference on Cloud Computing (CLOUD), San Francisco, USA, 2016.

[16] "Auto-tuning Performance of MPI Parallel Programs Using Resource Management in Container-Based Virtual Cloud," Hongyi Ma, Liqiang Wang, Byung Chul Tak, Long Wang, Chunqiang Tang, The IEEE Int’l Conference on Cloud Computing (CLOUD), San Francisco, USA, 2016.

[17] “Activating Protection and Exercising Recovery Against Large-Scale Outages on the Cloud,” Long Wang, Harigovind V Ramasamy, Richard Harper, Ruchi Mahindru, The Int’l Conference on Dependable Systems and Networks (DSN), Tutorial, Toulouse, France, 2016.

[18] “Designing Survivability for Big Data Software-as-a-Service Systems,” Hari Ramasamy, Long Wang, Richard Harper, IEEE International Symposium on Software Reliability Engineering (ISSRE), Tutorial, 2016.

[19] "Public Cloud Service Agreements: What to Expect and What to Negotiate" (book), Claude Baudoin, Long Wang, Jordan Flynn, John Meegan, et al., Cloud Standards Customer Council, Aug. 2016.

[20] “A Methodology for Continuous Evaluation of Cloud Resiliency,” Xiaoyong Yuan, Long Wang, Tiancheng Liu, Yue Zhang, American Journal of Engineering and Applied Sciences (AJEAS), Volume 9 No. 2, 2016.

[21] “Building and Managing Business Resiliency on the Cloud,” Long Wang, Richard Harper, Harigovind V Ramasamy, Mahesh Viswanathan, ACM Middleware conference (MIDDLEWARE), Tutorial, Vancouver, Canada, 2015.

[22] “Experiences with Building Disaster Recovery for Enterprise-Class Clouds,” Long Wang, Harigovind V Ramasamy, Richard Harper, Mahesh Viswanathan, E. Plattier, The Int’l Conference on Dependable Systems and Networks (DSN), Rio de Janeiro, Brazil, 2015.

[23] “Disaster Recovery for Enterprise-Class Clouds,” Long Wang, Richard Harper, Harigovind V Ramasamy, Mahesh Viswanathan, The Int’l Conference on Dependable Systems and Networks (DSN), Tutorial, Rio de Janeiro, Brazil, 2015.

[24] “Approximate Fault Localization using Message Flow Reconstruction and Targeted Fault Injection,” Cuong Pham, Long Wang, Byung Chul Tak, Salman Baset, Chunqiang Tang, Zbigniew Kalbarczyk, Ravishankar K. Iyer, USENIX Annual Technical Conference (USENIX), Poster Session, 2014.

[25] “Toward Achieving Operational Excellence in a Cloud,” Salman A. Baset, Long Wang, Byung Chul Tak, Cuong Pham, Chunqiang Tang, IBM Journal of Research and Development, Volume 58, Issue 2, 2014.

[26] “CAP3: A Cloud Auto-Provisioning Framework for Parallel Processing Using On-demand and Spot Instances,” He Huang, Liqiang Wang, Byung Chul Tak, Long Wang, Chunqiang Tang, The IEEE Int’l Conference on Cloud Computing (CLOUD), Santa Clara, CA, USA, 2013.

[27] “Dissecting Open Source Cloud Evolution: An OpenStack Case Study,” Salman A. Baset, Chunqiang Tang, Byung Chul Tak, Long Wang, 5th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), San Jose, CA, USA, 2013.

[28] “PseudoApp: Performance Prediction for Application Migration to Cloud,” Byung Chul Tak, Chunqiang Tang, Hai Huang, Long Wang, IEEE International Symposium on Integrated Network Management (IM), Ghent, Belgium, 2013.

[29] “Universal Script Wrapper – An Innovative Solution for Endpoint Management in Large and Heterogeneous Environments,” Sai Zeng, Shang Guo, Fred Wu, Constantin Adam, Long Wang, Cashchakanithara Venugopal, Rajeev Puri, Ramesh Palakodeti, IEEE International Symposium on Integrated Network Management (IM), Ghent, Belgium, 2013.

[30] “Remediating Overload in Over-subscribed Computing Environments,” Long Wang, Rafah A. Hosn, Chunqiang Tang, The IEEE Int’l Conference on Cloud Computing (CLOUD), Honolulu, Hawaii, USA, 2012.

[31] “Towards an understanding of oversubscription in cloud,” Salman A. Baset, Long Wang, Chunqiang Tang, 2nd USENIX Workshop on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services (HotICE), San Jose, CA, USA, 2012.

[32] “Checkpointing Virtual Machines Against Transient Errors,” Long Wang, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Arun Iyengar, Proc. of Int’l On-Line Testing Symposium (IOLTS), Corfu Island, Greece, 2010.

[33] “Formalizing Operating System Behavior for Evaluating System Hang Detector,” Long Wang, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Proc. of Int'l Symp. on Reliable Distributed Systems (SRDS), Napoli, Italy, 2008.

[34] “Count&Check: Counting Instructions to Detect Incorrect Paths,” Long Wang, Ravishankar K. Iyer, the CATARS Workshop in The Int’l Conference on Dependable Systems and Networks (DSN), Anchorage, Alaska, USA, 2008.

[35] “A Model-based Simulation Approach to Error Analysis of IT Services,” Long Wang, Akhil Sahai, James Pruyne, IFIP/IEEE International Symposium on Integrated Network Management (IM), Munich, Germany, 2007.

[36] “Reliability MicroKernel: Providing Application-Aware Reliability in OS,” Long Wang, Zbigniew Kalbarczyk, Weining Gu, Ravishankar K. Iyer, IEEE Transactions on Reliability (TR), Vol. 56, No. 4, Dec. 2007 (invited paper).

[37] “An OS-level Framework for Providing Application-Aware Reliability,” Long Wang, Zbigniew Kalbarczyk, Weining Gu, Ravishankar K. Iyer, Best Paper, IEEE Pacific Rim International Symposium on Dependable Computing (PRDC), Riverside, CA, USA, 2006.

[38] “A Self-checking and Reconfigurable Framework for Application Reliability Exploiting Execution Characteristics,” Long Wang, Zbigniew Kalbarczyk, Weining Gu, Ravishankar K. Iyer, The Int’l Conference on Dependable Systems and Networks (DSN), fast abstract, 2006, Philadelphia, PA, USA.

[39] “Modeling Coordinated Checkpointing for Large-Scale Supercomputers,” Long Wang, Karthik Pattabiraman, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Lawrence Votta, Christopher Vick, Alan Wood, The Int’l Conference on Dependable Systems and Networks (DSN), Yokohama, Japan, 2005.

[40] “Checkpointing of Control Structures in Main Memory Database Systems,” Long Wang, Zbigniew Kalbarczyk, Ravishankar K. Iyer, H. Vora, T. Chahande, The International Conference on Dependable Systems and Networks (DSN), Florence, Italy, 2004.

[41] “Application Fault Tolerance Employing ARMOR Middleware,” Zbigniew Kalbarczyk, Ravishankar K. Iyer, Long Wang, IEEE Internet Computing, Vol 9, Issue 2, 2005.

[42] “Group Communication Protocols under Errors,” Claudio Basile, Long Wang, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Proc. of Int'l Symp. on Reliable Distributed Systems (SRDS), Florence, Italy, 2003.