The social network Odnoklassniki runs on more than 8,000 servers located in several data centers. Each of these machines was allocated to a specific task, both for failure isolation and to enable automated infrastructure management. At a certain point, it became clear that a new data center management system could use the hardware more efficiently, simplify access management, automate resource management, shorten time to market for new services, and speed up recovery from incidents and even full outages. The new system has to manage all of our servers, so it becomes our largest and most critical distributed system, which places strict requirements on its resilience under any conditions, especially during major failures and outages. Meeting those requirements took both thorough fault tolerance planning and unique architectural solutions. We'll discuss interesting details of one-cloud's internal workings as well as our experience running containerized Java apps under high load.
Oleg Anastasyev, Odnoklassniki
Oleg Anastasyev started his career in computer programming in 1995. He has developed banking, telecom, and public transportation software, as well as software for the government of Latvia. Oleg has been a leading developer at Odnoklassniki (http://ok.ru/) since 2007. As a Platform Team member, his primary responsibilities are developing architectures and solutions for highly loaded and big data services and solving performance and availability problems. His most recent successful projects include a NewSQL, ACID-compliant, distributed, fault-tolerant database and a private cloud system that helps manage the whole fleet of Odnoklassniki machines.