As promised, here is my opinion on the paper “Resource-Aware Migratory Services in Wide-Area Shared Computing Environments”. I enjoyed reading it: overall it is easy to follow and interesting.
To begin, since this work depends heavily on the Ajanta mobile agent system (*), I think the paper should give a very brief review of that system’s performance evaluation, if one exists.
Overall, it seems the authors were rushed into publishing this work without a proper performance evaluation. Throughout the paper I was hoping to find metrics such as bandwidth, clients per primary agent, request arrival rate, or available memory. These metrics could have been useful both in the migration policies and during monitoring. They do mention a few in the “future work” chapter. I’m still not very familiar with PlanetLab (I will be by the end of the semester), but I assume that the heterogeneous nodes have different bandwidths, which should influence file transfers in the tests, although in the last test they simulate this transfer with a fixed period of time.
Secondly, I would like to know how the number of replicas actually influences the performance of the system. They only performed tests with a fixed number of three replicas. Instead of strict active replication, the replicas could also receive requests, trading weaker consistency for better workload balancing and possibly better performance. Even serving only read requests from the secondary replicas would already improve workload balancing. In that case, would it be worth making this number dynamic, depending on the number of incoming requests? They do mention a mechanism for a dynamic number of replicas, but it depends only on a fixed threshold on the number of alive replicas that triggers the creation of new replicas.
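To make the contrast concrete, here is a minimal sketch of the two policies. It is not taken from the paper: the function names, the direction of the threshold check, the per-replica capacity figure, and the replica bounds are all assumptions chosen for illustration.

```python
import math


def replicas_to_create_threshold(alive: int, min_alive: int, target: int) -> int:
    """Threshold rule (assumed form): once the number of alive replicas
    drops below min_alive, create enough replicas to get back to target."""
    if alive < min_alive:
        return target - alive
    return 0


def target_replicas_by_load(request_rate: float,
                            per_replica_capacity: float,
                            min_replicas: int = 3,
                            max_replicas: int = 10) -> int:
    """Load-based rule (the alternative suggested above): size the replica
    set from the incoming request rate, clamped to hypothetical bounds."""
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    needed = math.ceil(request_rate / per_replica_capacity)
    return max(min_replicas, min(max_replicas, needed))
```

The first rule only reacts to failures, while the second reacts to demand, which is the behaviour I would have liked to see evaluated.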
There is also something that wasn’t very clear to me in the paper. It says there is a primary agent, but it also mentions that “the service agent handles a client request only if it is active and not overloaded. Otherwise, the service agent sends a redirect response to a client to indicate another service replica agent that should be used as the service access point”. This somewhat resembles what I mentioned before, but only when the primary node is overloaded. How is consistency between replicas maintained in this specific case? Does this affect session state?
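My reading of the quoted behaviour can be sketched as follows. This is only an interpretation: the `ServiceAgent` class, the load counter used as the overload test, and the rule for choosing the redirect target are hypothetical, and the sketch deliberately leaves out the consistency and session-state questions raised above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ServiceAgent:
    name: str
    active: bool
    load: int       # requests currently being served (hypothetical metric)
    capacity: int   # load at which the agent counts as overloaded

    def overloaded(self) -> bool:
        return self.load >= self.capacity


def handle_request(agent: ServiceAgent,
                   replicas: List[ServiceAgent]) -> Tuple[str, Optional[str]]:
    """Serve the request if the agent is active and not overloaded;
    otherwise point the client at another usable replica."""
    if agent.active and not agent.overloaded():
        agent.load += 1
        return ("handled", agent.name)
    for other in replicas:
        if other is not agent and other.active and not other.overloaded():
            return ("redirect", other.name)
    return ("unavailable", None)
```

Nothing in this flow says whether the replica that receives the redirected client has the same state as the primary, which is exactly the part I could not find in the paper.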
Another question that came to mind is the lack of justification for implementing a service deployment agent. It is supposed to interact with agents to find them a node to migrate to that fulfills their requirements. Basically, it queries the PlanetLab Resource Monitoring Service for the list of nodes, filters them, and returns a smaller set to the agent. Since no additional purpose is specified, I cannot see why agents couldn’t do this themselves. The only reason I could come up with is to prevent services from migrating to nodes in a greedy or unfair way, that is, if there is a need to enforce or limit the amount of resources an agent can request.
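The query-filter-return step is small enough that a sketch shows why I wonder whether it needs its own agent. The `Node` record and its fields are invented for illustration and do not reflect the actual PlanetLab Resource Monitoring Service data format; the ranking by free CPU is likewise an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    host: str
    free_cpu: float    # fraction of CPU available, 0.0 to 1.0 (hypothetical)
    free_mem_mb: int   # available memory in MB (hypothetical)


def candidate_nodes(nodes: List[Node],
                    min_cpu: float,
                    min_mem_mb: int,
                    limit: int = 3) -> List[Node]:
    """Keep nodes that meet the agent's requirements and return the
    `limit` best ones, ranked by free CPU (highest first)."""
    suitable = [n for n in nodes
                if n.free_cpu >= min_cpu and n.free_mem_mb >= min_mem_mb]
    suitable.sort(key=lambda n: n.free_cpu, reverse=True)
    return suitable[:limit]
```

A central deployment agent would only add something over this if it also tracked what it had already handed out, for example to cap how many agents are sent to the same node.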
I would also like to underline that this system’s performance is highly dependent on the type of service provided. For services that make intensive use of secondary storage, migration is a very costly solution. The paper mentions in the future work that one approach could be to proactively select a potential target node and transfer secondary storage to it ahead of any future relocation. This could take bandwidth metrics into account.
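As a rough illustration of how bandwidth could feed into that choice, the sketch below estimates the transfer time of the stored state and picks the candidate with the shortest estimate. The function names and the bandwidth map are hypothetical, and the estimate ignores latency, protocol overhead, and competing traffic.

```python
from typing import Dict


def transfer_seconds(state_bytes: int, bandwidth_mbps: float) -> float:
    """Estimated time to ship state_bytes over a link of bandwidth_mbps
    (megabits per second), assuming the link is fully available."""
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    return (state_bytes * 8) / (bandwidth_mbps * 1_000_000)


def best_prefetch_target(state_bytes: int,
                         bandwidth_by_node: Dict[str, float]) -> str:
    """Pick the candidate node with the lowest estimated transfer time."""
    return min(bandwidth_by_node,
               key=lambda node: transfer_seconds(state_bytes,
                                                 bandwidth_by_node[node]))
```

For 1 GB of state, a 100 Mbps link gives an estimate of 80 seconds against 800 seconds at 10 Mbps, which is why a fixed simulated transfer time hides a lot.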
Finally, they also mention that since most failures are due to machine reboots, agent regeneration capabilities would be crucial in such an environment. I think they are referring to the possibility of re-accessing the data on secondary storage instead of creating a new node and performing replication. In this case, and assuming short downtimes, it could definitely improve the overall performance of the system.
To conclude, there is still a lot of research that can be done using this representation of services as autonomous agents. I do believe this paper is a good contribution, but it could have achieved even more. Then again, it’s easy to say this when we are not the ones doing all the initial research and the actual implementation.
My next step is to research autonomic scalable services on WANs: more specifically, the use of models to estimate service capacity, and how to scale service capacity autonomically by dynamically controlling the degree of service replication based on the estimated service capacity and the observed load. This would be combined with adaptive load distribution that depends on the varying service capacities of the replicas.
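To give an idea of what I mean, here is a toy sketch of the two pieces: splitting load in proportion to each replica’s estimated capacity, and deciding when the replica set is too small for the observed load. The function names, the capacity figures, and the 0.8 utilization target are placeholders of my own, not results from any paper.

```python
from typing import Dict


def load_shares(capacities: Dict[str, float]) -> Dict[str, float]:
    """Fraction of incoming requests each replica should receive,
    proportional to its estimated capacity."""
    total = sum(capacities.values())
    if total <= 0:
        raise ValueError("total capacity must be positive")
    return {name: cap / total for name, cap in capacities.items()}


def needs_more_replicas(observed_load: float,
                        capacities: Dict[str, float],
                        target_utilization: float = 0.8) -> bool:
    """True when the observed load exceeds the share of total estimated
    capacity that we are willing to use."""
    return observed_load > target_utilization * sum(capacities.values())
```

The hard part, of course, is the capacity model that produces those estimates in the first place.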