Surgient Success

Community Support Portal

Cannot pool ESX Host to Default Pool - badly in need of help!

Last post 01-31-2008 1:10 PM by EzraPagel. 3 replies.
Page 1 of 1 (4 items)
  • 01-24-2008 12:00 PM

    Cannot pool ESX Host to Default Pool - badly in need of help!

     

     

     

    Any help is greatly appreciated. Thanks

     

    I am trying to add an ESX Server to the default pool. The ESX Server is version 3.0.1 without any Patches.

     

     

    The ESX Server virtual networking is configured as follows:

     

    1) There is a virtual switch with the following:

     a) a service console port with a VLAN ID of 160

     b) a vmkernel port with a VLAN ID of 160

     c) a virtual machine portgroup (the "default network") with a VLAN ID of 168

    This virtual switch has one outbound NIC. The port on the physical switch to which this NIC is cabled is configured as follows:

     interface GigabitEthernet6/11

     description SSHDTEDBESX01 Mgmt SCC_10-4

     switchport trunk encapsulation dot1q

     switchport trunk allowed vlan 160,168

     switchport mode trunk

     spanning-tree bpdufilter enable

     

    2) a second virtual switch with a single virtual machine portgroup (the "trunked network") with a VLAN ID of 4095. This second virtual switch has one outbound NIC. The port on the physical switch to which this NIC is cabled is configured as follows:

     interface GigabitEthernet5/27

     description SSHDTEDBESX01 Prod SCC_10-4

     switchport trunk encapsulation dot1q

     switchport trunk allowed vlan 2000-2099

     switchport mode trunk

     spanning-tree bpdufilter enable
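
A quick way to sanity-check the layout above is to confirm that every portgroup VLAN ID is actually carried by its uplink trunk. The following is a minimal sketch, not a Surgient or VMware tool; the function names are made up for illustration, and the VLAN values are the ones quoted in the post.

```python
def parse_allowed_vlans(spec):
    """Expand a Cisco-style allowed-VLAN list such as "160,168" or "2000-2099"."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


def portgroup_vlan_allowed(vlan_id, allowed_spec):
    """Return True if a portgroup's VLAN ID is carried by the trunk.

    VLAN ID 4095 on an ESX portgroup passes tagged frames through to the
    guest, so there is no single VLAN ID to check against the trunk.
    """
    if vlan_id == 4095:
        return True
    return vlan_id in parse_allowed_vlans(allowed_spec)


# Values copied from the post: first virtual switch, uplink GigabitEthernet6/11.
first_trunk = "160,168"
for name, vlan in [("service console", 160), ("vmkernel", 160), ("default network", 168)]:
    print(name, vlan, portgroup_vlan_allowed(vlan, first_trunk))
```

All three portgroups on the first virtual switch pass this check, which suggests the allowed-VLAN lists themselves are not the problem here.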

     

    The VCS Server is in VLAN 160. (The Surgient agent is installed on the ESX Console - also in VLAN 160. The Agent is running and the ESX Server is showing up in the Surgient Management Console as a Host which can be added to a Pool. We have only one pool; namely the default pool.)  The NAIL Server is in VLAN 168 and is configured to use an IP Address (and the subnet mask and gateway) from the Pool. All ports are (temporarily) opened between the 2 VLANs in both directions.
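
Since the NAIL Server takes its address, mask, and gateway from the Pool, one thing worth ruling out is a pooled address that does not fit the default network's subnet (the first cause listed in the 00350084 error message). A minimal sketch using Python's standard `ipaddress` module follows; the helper name and all addresses are hypothetical, because the post does not give the pooled address, mask, or gateway.

```python
import ipaddress


def ip_fits_network(ip, network_cidr, gateway):
    """Return True if the address and its gateway are both usable hosts in the subnet."""
    net = ipaddress.ip_network(network_cidr, strict=False)
    addr = ipaddress.ip_address(ip)
    gw = ipaddress.ip_address(gateway)
    reserved = {net.network_address, net.broadcast_address}
    return addr in net and gw in net and addr not in reserved and addr != gw


# Hypothetical values for illustration only (not taken from the post).
print(ip_fits_network("172.16.168.50", "172.16.168.0/24", "172.16.168.1"))  # True
print(ip_fits_network("172.16.160.50", "172.16.168.0/24", "172.16.168.1"))  # False
```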

     

    When I started the process of adding the host to the pool, the NAIL Server powered up, and within a minute or so the login prompt appeared on the NAIL Server console. There was no error message on the console; everything looked OK from that screen.

     

    In the NAIL Server guestagent log, the following entries appear (these may be benign):

    20080124 12:14:17.204 [ERROR] NIC - can't open /etc/resolv.conf: No such file or directory

    20080124 12:14:17.213 [ERROR] NIC - can't open /etc/resolv.conf: No such file or directory

    20080124 12:14:17.221 [ERROR] NIC - can't open /etc/resolv.conf: No such file or directory

     

    The process was aborted about 10 minutes after starting. When I clicked on View Error in the Surgient Management Console, the following popup appeared:

    "00050011 Command 'Engine.Script.initialize-nail-vm' did not complete successfully. Address: 172.16.160.103. Result: Failed. Message: 00350084 The NAIL server agent on SSHDTEDBESX01 has not registered with the VCS after 600 seconds. This error is most often the result of one of the two following problems: 1. The NAIL server's assigned pooled IP address is incompatible with its host's default network (if the 'Use Pooled IP Address' option was chosen when the host was added to the pool). 2. The NAIL server was not able to obtain an IP address from a DHCP server (if the 'DHCP' option was chosen when the host was added to the pool)."

     

     

     

    A) The following is from the ServiceHost.log on the VCS Server:

    1/24/2008 11:50:08 AM|Warning|  23 |[EventDispatcher] Caught exception while processing handler for message type Surgient.Platform.AgentDocumentParser.AgentReadyMessage : Surgient.Platform.Exceptions.NailException: 00570014 Failed to initialize NAIL VM SSHDTEDBESX01-NailServer-1: 00220007 The specified server (#SSHDTEDBESX01) is not currently pooled.

       at Surgient.EventDispatcher.NailRecoveryEventModule.AgentReadyHandler(BaseMessage msg)

       at Surgient.EventDispatcher.EventMessageQueueProcessor.Process(MessageRequest request)

       at Surgient.Common.Utility.QueueProcessor`1.ProcessQueue()

     

     

    B) The following is from the Console.log on the VCS Server:

    1/24/2008 11:49:04 AM|Info|  16 |[Surgient.App.Console] Thread to add host 2 to pool 1 started.

    1/24/2008 11:50:05 AM|Info|   1P|[Surgient.Platform.Persistence] Session (#94) saved.

    1/24/2008 11:52:05 AM|Info|   8P|[Surgient.Platform.Persistence] Session (#94) saved.

    1/24/2008 11:53:07 AM|Info|   1P|[Surgient.Platform.Persistence] Server (#2) saved.

    1/24/2008 11:54:08 AM|Info|   6P|[Surgient.Platform.Persistence] Session (#94) saved.

    1/24/2008 11:56:09 AM|Info|   6P|[Surgient.Platform.Persistence] Session (#94) saved.

    1/24/2008 11:58:09 AM|Info|   6P|[Surgient.Platform.Persistence] Session (#94) saved.

    1/24/2008 11:59:32 AM|Severe|  16 |[Surgient.App.Console]  Surgient.Platform.Commands.CommandException: 00050011 Command 'Engine.Script.initialize-nail-vm' did not complete successfully.  Address: 172.16.160.103.  Result: Failed.  Message: 00350084 The NAIL server agent on SSHDTEDBESX01 has not registered with the VCS after 600 seconds.  This error is most often the result of one of the two following problems:  1.  The NAIL server's assigned pooled IP address is incompatible with its host's default network (if the 'Use Pooled IP Address' option was chosen when the host was added to the pool).  2.  The NAIL server was not able to obtain an IP address from a DHCP server (if the 'DHCP' option was chosen when the host was added to the pool).

       at Surgient.Deployment.PoolServiceImpl.InitializeNailServerVm(Host h, NailServer nailServer, PooledIpAddress nailServerIp, Int32 poolId)

       at Surgient.Deployment.PoolServiceImpl.AddHostToPool(RequestContext ctx, Int32 hostId, Int32 poolId, Int32 ramMb, Int32 vmCount, NailServerAddressType addressType)

       at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.PrivateProcessMessage(RuntimeMethodHandle md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.SyncProcessMessage(IMessage msg, Int32 methodPtr, Boolean fExecuteInContext)

       at Surgient.Platform.Services.Internal.ServiceProxy.Invoke(IMessage msg)

       at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)

       at Surgient.Platform.Services.PoolService.AddHostToPool(RequestContext ctx, Int32 hostId, Int32 poolId, Int32 ramMb, Int32 vmCount, NailServerAddressType addressType)

       at Surgient.Platform.Services.DynamicSingletons.PoolService_Proxy_4.AddHostToPool(RequestContext , Int32 , Int32 , Int32 , Int32 , NailServerAddressType )

       at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.PrivateProcessMessage(RuntimeMethodHandle md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.SyncProcessMessage(IMessage msg, Int32 methodPtr, Boolean fExecuteInContext)

     

    Server stack trace:

       at Surgient.Deployment.PoolServiceImpl.InitializeNailServerVm(Host h, NailServer nailServer, PooledIpAddress nailServerIp, Int32 poolId)

       at Surgient.Deployment.PoolServiceImpl.AddHostToPool(RequestContext ctx, Int32 hostId, Int32 poolId, Int32 ramMb, Int32 vmCount, NailServerAddressType addressType)

       at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.PrivateProcessMessage(RuntimeMethodHandle md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.SyncProcessMessage(IMessage msg, Int32 methodPtr, Boolean fExecuteInContext)

       at Surgient.Platform.Services.Internal.ServiceProxy.Invoke(IMessage msg)

       at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)

       at Surgient.Platform.Services.PoolService.AddHostToPool(RequestContext ctx, Int32 hostId, Int32 poolId, Int32 ramMb, Int32 vmCount, NailServerAddressType addressType)

       at Surgient.Platform.Services.DynamicSingletons.PoolService_Proxy_4.AddHostToPool(RequestContext , Int32 , Int32 , Int32 , Int32 , NailServerAddressType )

       at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.PrivateProcessMessage(RuntimeMethodHandle md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.SyncProcessMessage(IMessage msg, Int32 methodPtr, Boolean fExecuteInContext)

     

    Exception rethrown at [0]:

       at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)

       at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)

       at Surgient.Platform.Services.PoolService.AddHostToPool(RequestContext ctx, Int32 hostId, Int32 poolId, Int32 ramMb, Int32 vmCount, NailServerAddressType addressType)

       at Surgient.App.Console.Controls.AddHostToPool.DoAddHostToPool(Object args)

    1/24/2008 11:59:32 AM|Info|  16 |[Surgient.App.Console] Thread to add host 2 to pool 1 stopped.

     

     

    C) The following is from the Scripting.log on the VCS Server:

    1/24/2008 11:50:08 AM|Severe|initialize-nail-vm.80 |[ScriptEngine] Script 80 "initialize-nail-vm", SSHDTEDBESX01 (name="SSHDTEDBESX01-NailServer-1", type="NailServer") failed with exception: Surgient.Platform.Exceptions.InvalidArgumentException: 00220007 The specified server (#SSHDTEDBESX01) is not currently pooled.

       at Surgient.Automation.Nail.InitializeNailVmBase.GetTargetPool()

       at Surgient.Automation.Nail.InitializeNailVmBase.Run(String vmName)

       at Surgient.Automation.InitializeNailVm.Start()

    1/24/2008 11:50:08 AM|Info|initialize-nail-vm.80 |[Surgient.Platform.Persistence] ScriptInstance (#117) deleted.  Fundamental properties of the deleted object:  ScriptInstance #117 [CommandId=28; LastSavedCheckpoint=0; SerializedRequest=System.Byte[]; Request=; SerializedData=; Data=; Id=117; IsVolatile=False; LastUpdateOn=1/24/2008 4:50:08 PM; IsDcgNode=False]

    1/24/2008 11:50:08 AM|Info|initialize-nail-vm.80 |[ScriptEngine] Script instance 80 "initialize-nail-vm", SSHDTEDBESX01 (name="SSHDTEDBESX01-NailServer-1", type="NailServer")  has completed with status Failed.

     

    1/24/2008 11:50:13 AM|Warning|  38P|[initialize-nail-vm-78] STP config for SSHDTEDBESX01-NailServer-1 in advanced mode is incorrectly inactive; disabling bridge interface

     

    1/24/2008 11:50:13 AM|Severe|  38P|[Messaging] Asynchronous invocation of method NailAgentReadyHandler on object type InitializeNailServerBase threw an  uncaught exception. Surgient.Platform.Exceptions.NailException: 00570049 STP state of NAIL server SSHDTEDBESX01-NailServer-1 has incorrect root bridge; check that the switch is properly configured to pass BPDU for advanced mode

       at Surgient.Automation.Nail.InitializeNailServerBase.SetStpMode()

       at Surgient.Automation.Nail.InitializeNailServerBase.NailAgentReadyHandler(BaseMessage msg)

       at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.PrivateProcessMessage(RuntimeMethodHandle md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.AsyncProcessMessage(IMessage msg, IMessageSink replySink)

     

    Server stack trace:

       at Surgient.Automation.Nail.InitializeNailServerBase.SetStpMode()

       at Surgient.Automation.Nail.InitializeNailServerBase.NailAgentReadyHandler(BaseMessage msg)

       at System.Runtime.Remoting.Messaging.StackBuilderSink._PrivateProcessMessage(IntPtr md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.PrivateProcessMessage(RuntimeMethodHandle md, Object[] args, Object server, Int32 methodPtr, Boolean fExecuteInContext, Object[]& outArgs)

       at System.Runtime.Remoting.Messaging.StackBuilderSink.AsyncProcessMessage(IMessage msg, IMessageSink replySink)

     

    Exception rethrown at [0]:

       at System.Runtime.Remoting.Proxies.RealProxy.EndInvokeHelper(Message reqMsg, Boolean bProxyCase)

       at System.Runtime.Remoting.Proxies.RemotingProxy.Invoke(Object NotUsed, MessageData& msgData)

       at Surgient.Messaging.MessageHandler.EndInvoke(IAsyncResult result)

       at Surgient.Messaging.MessageBusProvider.AsyncCallbackMethod(IAsyncResult ar)

    1/24/2008 11:59:27 AM|Severe|initialize-nail-vm.78 |[ScriptEngine] Script 78 "initialize-nail-vm", SSHDTEDBESX01 (name="SSHDTEDBESX01-NailServer-1", type="NailServer", poolid="1") failed with exception: Surgient.Automation.AutomationException: 00350084 The NAIL server agent on SSHDTEDBESX01 has not registered with the VCS after 600 seconds.  This error is most often the result of one of the two following problems:  1.  The NAIL server's assigned pooled IP address is incompatible with its host's default network (if the 'Use Pooled IP Address' option was chosen when the host was added to the pool).  2.  The NAIL server was not able to obtain an IP address from a DHCP server (if the 'DHCP' option was chosen when the host was added to the pool).

       at Surgient.Automation.Nail.InitializeNailServerBase.ConfigureVm()

       at Surgient.Automation.Nail.InitializeNailVmBase.Run(String vmName, Int32 poolId)

       at Surgient.Automation.InitializeNailVm.Start()

    1/24/2008 11:59:27 AM|Info|initialize-nail-vm.78 |[Surgient.Platform.Persistence] ScriptInstance (#116) deleted.  Fundamental properties of the deleted object:  ScriptInstance #116 [CommandId=28; LastSavedCheckpoint=0; SerializedRequest=System.Byte[]; Request=; SerializedData=; Data=; Id=116; IsVolatile=False; LastUpdateOn=1/24/2008 4:49:09 PM; IsDcgNode=False]

    1/24/2008 11:59:27 AM|Info|initialize-nail-vm.78 |[ScriptEngine] Script instance 78 "initialize-nail-vm", SSHDTEDBESX01 (name="SSHDTEDBESX01-NailServer-1", type="NailServer", poolid="1")  has completed with status Failed.
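
The three VCS logs above share a `timestamp|level|thread|message` layout, and the Surgient errors carry eight-digit codes (00220007, 00570049, 00350084, and so on). A rough sketch for pulling the timeline of codes out of such a log follows. The field layout is inferred from the excerpts in this post rather than from any Surgient documentation, and the function name is invented for illustration.

```python
import re
from datetime import datetime

# Eight-digit codes such as 00350084, as seen in the excerpts above.
CODE_RE = re.compile(r"\b\d{8}\b")


def parse_entry(line):
    """Parse one 'timestamp|level|thread|message' line.

    Returns None for continuation lines such as stack frames.
    """
    parts = line.split("|", 3)
    if len(parts) != 4:
        return None
    stamp, level, thread, message = parts
    try:
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %I:%M:%S %p")
    except ValueError:
        return None
    return {
        "time": when,
        "level": level.strip(),
        "thread": thread.strip(),
        "codes": CODE_RE.findall(message),
        "message": message.strip(),
    }


# Lines abbreviated from the Scripting.log excerpt in this post.
sample = [
    "1/24/2008 11:50:08 AM|Severe|initialize-nail-vm.80 |[ScriptEngine] 00220007 The specified server (#SSHDTEDBESX01) is not currently pooled.",
    "   at Surgient.Automation.Nail.InitializeNailVmBase.GetTargetPool()",
    "1/24/2008 11:50:13 AM|Severe|  38P|[Messaging] 00570049 STP state of NAIL server SSHDTEDBESX01-NailServer-1 has incorrect root bridge",
]
for line in sample:
    entry = parse_entry(line)
    if entry and entry["level"] == "Severe":
        print(entry["time"], entry["thread"], entry["codes"])
```

Sorting the parsed entries by time makes it easier to see that the STP failure (00570049) precedes the 600-second registration timeout (00350084).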

     

     

    Questions:

    1. Should the ESX Server be patched? If yes, then with which ones?

    2. Later on we'll wish to lock down our network configuration. What are all the ports that need to be open to facilitate all the required traffic from the Nail Server to the VCS Server? Also, what are all the ports that need to be open to facilitate all the required traffic in the opposite direction?

    3. How should we configure our switch ports?

    (To highlight, the following pieces of text are in the logs: a) STP config for SSHDTEDBESX01-NailServer-1 in advanced mode is incorrectly inactive; disabling bridge interface and b) STP state of NAIL server SSHDTEDBESX01-NailServer-1 has incorrect root bridge; check that the switch is properly configured to pass BPDU for advanced mode)

     

    Many thanks once again

     

     

  • 01-24-2008 2:25 PM In reply to

    Re: Cannot pool ESX Host to Default Pool - badly in need of help!

    Since you're attempting a configuration in advanced mode, the NAIL server needs to see STP traffic from your switch. That requires the "default network" uplink to be an access mode port, not a trunked port (otherwise regular STP is sent on the trunk's VLAN 1, and PVST+ on the other VLANs). You could move the "trunked network" portgroup onto the first vmnic, the one the service console is on, since it's already on a trunked uplink; you'll obviously have to allow 160 and 2000-2099 on it. Then set up the other vmnic on an access mode port and recreate the default network on it with VLAN 168; turn OFF bpdufilter on the access port.
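
The two requirements for the default network uplink (access mode, bpdufilter off) can be checked mechanically against a pasted interface config. The following is only an illustrative sketch: the function name is made up, the first config is the one quoted in the original post, and the second config is an assumed example of what the reworked port might look like, not a tested or recommended configuration.

```python
def audit_default_network_port(config_text):
    """Return a list of problems with an uplink meant for the "default network"."""
    lines = [ln.strip().lower() for ln in config_text.splitlines() if ln.strip()]
    problems = []
    if "switchport mode access" not in lines:
        problems.append("uplink is not an access mode port")
    if "spanning-tree bpdufilter enable" in lines:
        problems.append("bpdufilter is enabled, so BPDUs are filtered on this port")
    return problems


# Config for the first uplink as quoted in the original post.
ORIGINAL_PORT = """
interface GigabitEthernet6/11
 switchport trunk encapsulation dot1q
 switchport trunk allowed vlan 160,168
 switchport mode trunk
 spanning-tree bpdufilter enable
"""

# Assumed example of an access port for VLAN 168 (hypothetical, for illustration).
PROPOSED_PORT = """
interface GigabitEthernet5/27
 switchport access vlan 168
 switchport mode access
"""

print(audit_default_network_port(ORIGINAL_PORT))
print(audit_default_network_port(PROPOSED_PORT))
```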

    Let me know how it goes...
     

  • 01-31-2008 11:39 AM In reply to

    Re: Cannot pool ESX Host to Default Pool - badly in need of help!

    Hello EzraPagel,

     

    Many thanks for your assistance, which is greatly appreciated. We implemented your recommendation. Initially, this did not fix the problem with the pooling process. Only after we stopped and then restarted the four Surgient services on the VCS Server did the pooling process succeed. (Note: it may not have been necessary to restart all four.)

     

    If I may ask a follow-up question: in addition to implementing your recommendation, why was it necessary to recycle the services in order to solve the pooling problem?

    Thanks

     

  • 01-31-2008 1:10 PM In reply to

    Re: Cannot pool ESX Host to Default Pool - badly in need of help!

    After the changes were implemented, connectivity was established, but the VCS was rejecting communication from the NAIL server saying that authentication had failed. I haven't looked at the last set of log files, but it appeared that there was another long-running script that was erroneously active (from one of the prior pool attempts?) that was preempting the new authentication information. Restarting the service engine would have solved that; you're right that restarting all four services was overkill.
