Master 2 Recherche  Systèmes et Logiciel, 2006-2007

Bibliographie pour le cours ``Algorithmique et techniques de base des systèmes répartis''

S. Krakowiak

La page web du cours est http://sardes.inrialpes.fr/people/krakowia/Enseignement/M2R-SL/SR

Cette page sera remise à jour au cours de l'année

N.B.
Deux astérisques (**) : si vous ne lisez qu'un article par thème, c'est celui-là
Un astérisque (*) : références principales pour le thème

Références générales


[Guerraoui & Rodrigues 06]

R. Guerraoui, L. Rodrigues. Reliable Distributed Programming, Springer, 2006



[Tanenbaum & van Steen 02]
A. S. Tanenbaum, M. van Steen, Distributed Systems - Principles & Paradigms, Prentice Hall, 2002
[Mullender 93]
S. Mullender (ed.), Distributed Systems (2nd edition), Addison-Wesley, 1993
[Coulouris et al. 03]
G. Coulouris, J. Dollimore, and T. Kindberg, Distributed Systems, Concepts and Design, 3rd. ed. Addison-Wesley, 2003
[Singhal & Shivaratri 94]
M. Singhal, N. G. Shivaratri, Advanced Concepts in Operating Systems, McGraw-Hill, 1994

Datation, causalité, horloges logiques, horloges vectorielles

Quelques références en ligne :
*[Lamport 78]
L. Lamport, Time, clocks, and the ordering of events in a distributed system, Communications of the ACM, 21, 7, july 1978, pp. 125-133
*[Mattern 88, 92]
F. Mattern, Virtual time and global states in distributed systems, International Conference on Parallel and Distributed Algorithms, North Holland (1988), pp. 215-226 ; voir version plus récente et plus complète : On the Relativistic Structure of Logical Time in Distributed Systems, in Datation et Contrôle des Exécutions Réparties, Bigre, nr. 78, édité par l'IRISA, mars 1992
**[Babaoglu 93]
Ö. Babaoglu, K. Marzullo, Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms, in [Mullender 93]
[Raynal 91b]
M. Raynal, A. Schiper, S. Toueg, The causal ordering abstraction and a simple way to implement it, Information Processing Letters, 39, 6, Sept. 1991, pp. 343-350
[Raynal 92]
M. Raynal, About Logical Clocks for Distributed Systems, ACM Operating Systems Review, 26, 1, Jan. 1992, pp. 41-48

État global, reprise

Quelques références en ligne :
  • état global
  • *[Chandy 85]
    K.M. Chandy, L. Lamport, Distributed snapshots: determining global states of distributed systems, ACM Transactions on Computer Systems, 3, 1, Feb. 1985, pp. 63-75
    [Koo 87]
    R. Koo, S. Toueg, Checkpointing and rollback-recovery in distributed systems, IEEE Trans. on Software Engineering, SE-13, 1, jan. 1987, pp. 23-31
                **Recovery, chap. 12 of [Singhal 94]
    [Johnson 95]
    D.B. Johnson S.W. Smith and J.D. Tygar, "Completely asynchronous optimistic recovery with minimal rollbacks," In Proc. of the 25th Int'l Symp. on Fault Tolerant Computing Systems, pp. 361-370, Jun. 1995.
    *[Elnozahy et al. 96]
    M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A Survey of Rollback-Recovery Protocols in Message-Passing Systems, Technical Report CMU-CS-96-181, Carnegie Mellon University, October 1996

    Algorithmes répartis

    *[Chang 79]
    E.J. Chang, R. Roberts, An improved algorithm for decentralized extrema-finding in circular configurations of processors, Communications of the ACM, 22, 5, May 1979, pp. 281-283
    *[Dijkstra 80]
    E.W. Dijkstra, C. S. Scholten, Termination detection for diffusing computations, Information Processing Letters, 11, 1, Aug. 1980, pp. 1-4
    *[Misra 83]
    J. Misra, Detecting termination of distributed computations using markers, Proc. 2nd ACM Symposium on Principles of Distributed Computing, Montreal, Canada, Aug. 1983, pp. 290-294
    *[Ricart 81]
    G. Ricart, A.K. Agrawala, An optimal algorithm for mutual exclusion in computer networks, Communications of the ACM, 24, 1, Jan. 1981, pp. 9-17
     
    **Distributed Mutual Exclusion, chap. 6 of [Singhal 94]
     
    Un site (M. Ben Ari) présentant des animations en ligne de divers algorithmes répartis (dont Ricart-Agrawala, Chandy-Lamport, Dijkstra-Scholten, Généraux byzantins).

    Tolérance aux fautes et problèmes connexes

    Serveurs fiables, duplication
     
    **[Guerraoui 97]
    R. Guerraoui, A. Schiper. Software-based Replication for Fault-Tolerant Systems, IEEE Computer, vol. 30, 4, April 1997, pp. 68-74 - version étendue

    Validation atomique

    **[Babaoglu 93]
    Ö. Babaoglu, S. Toueg, Non-blocking Atomic Commitment, in [Mullender 93]
    [Bernstein et al 87]
    Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman: Concurrency Control and Recovery in Database Systems. (Chapter 7) Addison-Wesley 1987. Voir version en ligne

    [Guerraoui 95]
    R. Guerraoui, A. Schiper. The decentralized non-blocking atomic commitment protocol, IEEE Int. Symposium on Parallel and Distributed Processing, 1995.
    *[Guerraoui 98]
    R. Guerraoui, M. Raynal, A. Schiper. Validation atomique et consensus : une approche synthétique. Technique et Science Informatiques, vol. 17, 3, 1998, pp. 279-298.

    Diffusion fiable, groupes
     
    [Birman 87]
    K.P. Birman, T.A. Joseph, Reliable communication in the presence of failures, ACM Transactions on Computer Systems, 5,1, Feb. 1987, pp. 47-7


    *[Chokler 01]
    Group Communication Specification: A Comprehensive Study, Gregory Chockler, Idit Keidar and Roman Vitenberg, In ACM Computing Surveys 33(4), pages 1-43, December 2001. Point récent sur les spécifications, illustre la difficulté d'une spécification précise et complète



    [Cristian 85]
    F. Cristian, H. Aghili, R. Strong, D. Dolev, Atomic broadcast: from simple message diffusion to byzantine agreement, Proc. 15th Conf. on Fault-Tolerant Computing Systems (FTCS-15), Ann Arbor (1985) [article historique]


    **[Hadzilacos 93]
    V. Hadzilacos , S. Toueg, Fault-Tolerant Broadcast and Related Problems, in [Mullender 93]
     version étendue sous le titre :  "A modular approach to the specification and implementation of fault-tolerant broadcasts''.  Technical Report TR94-1425. Department of Computer Science, Cornell University, Ithaca NY. May 1994
    *[Kaashoek 89]
    M.F. Kaashoek, A.S. Tanenbaum, S. Flynn Hummel, H.E. Bal, An efficient reliable broadcast protocol, ACM Operating Systems Review, 23,4, Oct. 1989, pp. 5-19
     
    Consensus

    Références générales


    **[Turek 92]
    J. Turek, D. Shasha, The many faces of consensus in distributed systems, IEEE Computer, June 1992, pp. 8-17
                maintenant un peu ancien (compléter par Paxos et détecteurs de pannes) 
    [Barborak 93]
    M. Barborak, M. Malek, A. Dahbura, The consensus problem in fault-tolerant computing, ACM Computing Surveys, 25, 2, June 1993, pp. 171-220

    Agreement Protocols, chap. 8 of [Singhal 94]


    Théorie

    [Fisher 85]
    M. Fisher, N. Lynch, M. Patterson. Impossibility of distributed consensus with one faulty processor, Journal of the ACM, 32, 2 (1985), pp. 373-382. Article historique.
    *[Fisher 86]
    M. Fisher, N. Lynch, M. Merritt. Easy impossibility proofs for distributed consensus problems, Distributed Computing, vol. 1 (1986), pp. 26-39


    Aspects pratiques

    [Guerraoui 96
    R. Guerraoui and A. Schiper. Consensus Service: A Modular Approach for Building Fault-Tolerant Agreement Protocols in Distributed Systems, IEEE International Symposium on Faul-Tolerant Computing Systems (FTCS), Sendai (Japan), June 1996
    *[Guerraoui 97]
    Rachid Guerraoui and André Schiper. The generic consensus serviceIEEE Transactions on Software Engineering, 27(1):29-41, January 2001.

    Algorithme de Paxos

    *[Lampson 01]
    The ABCDs of Paxos, Principles of Distributed Computing, 2001
    [Lampson 96]
    How to Build a Highly Available System Using Consensus. in Distributed Algorithms, ed. Babaoglu and Marzullo, Lecture Notes in Computer Science, Springer, 1996, pp 1-17.
    [Lamport 01]
    Paxos Made Simple, ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 18-25,

    Détecteurs de pannes non fiables

    *[Chandra 96]
    Unreliable failure detectors for reliable distributed systems, Journal of the ACM, vol. 43, 2, March 1996.
    [Chandra 96]
    T. D. Chandra, V. Hadzilacos, S.Toueg, The weakest failure detector for solving consensus, Journal of the ACM, vol. 43, 4, July 1996, pp. 685-722

    Consensus byzantin

    [Lamport 82]
    L. Lamport, R. Shostak, M. Pease, The byzantine generals problem, ACM TOPLAS, 4, july 1982, pp. 382-401 Article historique

    *[Brès 91]
    G. Brès, Panorama sur les généraux byzantins, Technique et Science Informatiques, 10, 5, 1991, pp. 341-353.


    voir aussi [Lampson 01] The ABCDs of Paxos, mentionné ci-dessus

    Détection et réparation de pannes

    Voir site du projet ROC

    Gestion répartie de données

    Désignation


    *[Needham 93], Names, in [Mullender 93]

    **chapitre "Naming" de [Tanenbaum02]


    [Lampson 86]
    B. Lampson. Designing a global name service. Proc. 4th ACM Symposium on Principles of Distributed Computing, Minaki, Ontario, 1986, pp 1-10.

    *[Saltzer 78]
    J. Saltzer. Naming and Binding of Objects, in: Operating Systems: an Advanced Course, LNCS 60, Springer Verlag, 1978, pp. 99-208. 

    Duplication, cohérence

    **voir chapitre 6 de [Tanenbaum & van Steen 02]

    Diffusion épidémique

    [Demers 87]
    A. Demers, et al. Epidemic Algorithms for Replicated Database Maintenance, Proc. 6th ACM Symp. on Principles of Distributed Computing, Vancouver, pp. 1-12, 1987

    [Van Renesse et al. 03]
    R. van Renesse, K. P. Birman, W. Vogels.  Astrolabe: A Robust and Scalable Technology for Distributed Systems Monitoring, Management and Data Mining. ACM Trans. on Computer Systems, 21(2), May 2003

    *[Eugster 04]
    Patrick Eugster, Rachid Guerraoui, Anne-Marie Kermarrec, and Laurent Massoulié. From Epidemics to Distributed Computing. IEEE Computer, 37(5), May 2004.  voir ici

    Systèmes Pair à Pair (P2P)

    [S. Androusellis-Theotokis et al.]
    *S. Androusellis-Theotokis, D. Spinellis, A Survey of Peer-to-Peer Content Distribution Technologies, ACM Computing Surveys, 36(4), Dec. 2004. Voir aussi :
    http://www.eltrun.aueb.gr/whitepapers/#3rd

    Gnutella : http://www.gnutella.com
    Freenet : http://www.freenetproject.org

    Tables de hachage réparties (DHT)

    Chord : http://pdos.csail.mit.edu/chord/
    Pastry : http://freepastry.org/

    Gestion répartie d'objets, études de cas

    **[Hagimont 96a]
    D. Hagimont, J. Mossière, ``Problèmes de désignation, de localisation et d'accès dans les systèmes répartis à objets'', Technique et Science Informatiques, 15, 1, pp. 9-36, 1996.
    *[Birrell 95]
    Andrew Birrell, Greg Nelson, Susan Owicki, and Edward Wobber. Network Objects, Digital SRC Research Report 115, 1995. Version étendue de : Andrew Birrell, Greg Nelson, Susan Owicki, and Edward Wobber. Network objects. In Proceedings of Fourteenth Symposium on Operating Systems Principles (SOSP), pages 217-230, 1993.


    [Globe]
    The Globe project, Vrije Universiteit, Amsterdam, NL.
    [Hagimont 98]
    D. Hagimont, D. Louvegnies, ``Javanaise: Distributed Shared Objects for Internet Cooperative Applications'', Proc. IFIP Int. Conf. on Distributed Systems Platforms and Open Distributed Processing (Middleware'98), The Lake District, 15-18 September 1998

    Mémoire virtuelle répartie

    *[Amza 96]
    C. Amza, A.L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel,TreadMarks: Shared Memory Computing on Networks of Workstations, IEEE Computer, Vol. 29, No. 2, pp. 18-28, February 1996. Voir aussi page web du projet Treadmarks.
    [Bershad 93]
    Bershad, Zekauskas and Sawdon, The Midway Distributed Shared Memory System. IEEE Compcon 93.